
CN118859113A - Convolutional neural network sound source localization method and system based on arrival time difference

Info

Publication number
CN118859113A
Authority
CN
China
Prior art keywords
sound source
neural network
time difference
data set
module
Legal status
Pending
Application number
CN202411038961.1A
Other languages
Chinese (zh)
Inventor
王琦
肖松
涂山川
李学龙
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Application filed by Northwestern Polytechnical University
Publication of CN118859113A


Classifications

    • G01S 5/18 Position-fixing by co-ordinating two or more direction or position-line determinations using ultrasonic, sonic or infrasonic waves
    • G01S 5/22 Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G06F 18/10 Pattern recognition: pre-processing; data cleansing
    • G06F 18/213 Pattern recognition: feature extraction, e.g. by transforming the feature space
    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/0464 Neural networks: convolutional networks [CNN, ConvNet]
    • G06N 3/08 Neural networks: learning methods


Abstract

The present invention relates to the field of intelligent localization technology, and specifically to a convolutional neural network sound source localization method and system based on time difference of arrival. The method comprises: acquiring a sound source data set through a sound source array receiver; extracting time-delay features of the sample data set with the phase-transform-weighted generalized cross-correlation (GCC-PHAT) algorithm; training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function; obtaining the trained deep residual convolutional network; and inputting the time-delay features of the sound source to be located into the trained network to obtain its coordinate values. By introducing a deep residual neural network that fully learns the time-difference characteristics of the audio data, the invention jointly exploits the time-difference relationships among multiple microphone arrays, their spatial layout, and the regularities in the data, yielding better localization performance and higher accuracy.

Description

Convolutional neural network sound source localization method and system based on arrival time difference

Technical Field

The present invention relates to the field of intelligent localization technology, and in particular to a convolutional neural network sound source localization method and system based on time difference of arrival.

Background Art

Sound source localization is one of the key indicators for evaluating the performance of acoustic systems and is widely used in fields including drone navigation, intelligent surveillance, and military applications. Research on sound source localization involves reconstructing the source position from limited sensor position data and predicting the dynamic movement of the source in space through high-precision signal processing and algorithms, which has important application value for target localization and tracking. Sound source localization is one of the advanced audio signal processing technologies and can be divided into two categories: acoustic-array (i.e., microphone-array) localization and sound-intensity-probe sound field measurement. At present, microphones are commonly used as the receiving sensors, so microphone-array localization is widely applied in voice conferencing, security, and robot audition, and is an important component of intelligent signal processing systems. Particularly in acoustic signal processing, information detection and positioning, and wireless communication, advances in sound source localization are significant for improving system performance and broadening application domains.

With the development of deep learning, the strong representational power of deep neural networks has been explored for sound source localization. In a typical pipeline, a convolutional neural network (CNN) first extracts the spatial features of the sound source in the environment; these features are combined with the acoustic information of the environment, and a fully connected network then predicts the source position. The prior art also discloses a hybrid network combining a CNN with a long short-term memory network (LSTM), which learns the temporal evolution and spectral characteristics of the sound signal to predict the source position accurately and improves effectiveness with complex sources and changing environments. However, these methods ignore the amplitude differences between signals received by the array elements and do not fully account for the time-difference relationships among multiple microphone arrays, their spatial layout, and the regularities in the data, so the localization results remain unsatisfactory.

Summary of the Invention

To solve the above technical problems, namely the unsatisfactory localization caused by existing methods failing to fully consider the time-difference relationships, spatial layout, and data regularities of multiple microphones, the present invention provides a convolutional neural network sound source localization method and system based on time difference of arrival, which can effectively predict higher-precision source coordinates from limited audio data.

The first object of the present invention is to provide a convolutional neural network sound source localization method based on time difference of arrival, comprising:

acquiring a sound source data set through a sound source array receiver, and normalizing the data set to obtain a sample data set;

extracting time-delay features of the sample data set with the phase-transform-weighted generalized cross-correlation (GCC-PHAT) algorithm;

training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained deep residual convolutional network; and

inputting the time-delay features corresponding to the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.

Preferably, during training, the time-delay features are input into the deep residual convolutional network: a feature extraction module first extracts time-domain features, a convolution module then mines the spatial information of the acoustic array, a residual module produces high-order semantic information of the source signal, and a fully connected module finally outputs the coordinate values of the sound source.

Preferably, the fully connected module consists of a max pooling layer and a fully connected layer.

Preferably, the feature extraction module consists of a framing layer using a Hamming window of length 512 and a short-time Fourier transform layer applied to the frames;

the convolution module consists of a convolution layer, a normalization layer, and a max pooling layer; and

the residual module consists of multiple basic residual blocks.

Preferably, in the convolution module, the convolution layer is a two-dimensional convolution of size (64, 7, 7) with a stride of 2; the normalization layer applies batch normalization to the input four-dimensional array without changing its dimensions; and the max pooling layer uses a 3×3 kernel with a stride of 2.

Preferably, the coordinate values of the sound source are three-dimensional Cartesian coordinates.

Preferably, abnormal data are cleaned when acquiring the sound source data set, where abnormal data include abnormal measurement point coordinates, abnormal measurement point values, and abnormal data dimensions.

The second object of the present invention is to provide a convolutional neural network sound source localization system based on time difference of arrival, comprising:

a data acquisition module for acquiring a sound source data set through a sound source array receiver and normalizing it to obtain a sample data set;

a model acquisition module for extracting time-delay features of the sample data set with the GCC-PHAT algorithm, and for

training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained deep residual convolutional network; and

a sound source localization module for inputting the time-delay features corresponding to the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.

The present invention has at least the following beneficial effects:

The invention introduces a deep residual convolutional neural network that learns and captures the time-domain features of the audio signals and the spatial information of the acoustic array, mapping the sample data into a more effective feature space, so the model generalizes and adapts better than traditional hand-crafted feature extractors.

The invention designs a simple and effective shallow feature extraction network and training strategy whose training is much faster than that of other deep feature extraction networks.

By introducing a deep residual neural network that fully learns the time-difference characteristics of the audio data, the invention jointly exploits the time-difference relationships among multiple microphone arrays, their spatial layout, and the regularities in the data, yielding better localization and higher accuracy.

The proposed feature selection removes feature noise, reducing its influence on the model and giving relatively better robustness in noisy environments.

Brief Description of the Drawings

FIG. 1 is a flow chart of the convolutional neural network sound source localization method based on time difference of arrival provided by the present invention;

FIG. 2 shows the training process of the deep residual convolutional network of the present invention;

FIG. 3 shows the sound source localization process of the present invention;

FIG. 4 is a two-dimensional layout of the microphones of the present invention, where FIG. 4(a) shows the layout of multiple microphone-array measurement points and FIG. 4(b) shows the microphone layout within a single measurement point;

FIG. 5 shows the estimation error of the microphone-array simulations of the present invention, where FIG. 5(a) is for a single microphone array and FIG. 5(b) is for eight microphone arrays;

FIG. 6 visualizes the localization predictions on the test data set, where FIG. 6(a) is a two-dimensional view and FIG. 6(b) a three-dimensional view;

FIG. 7 shows the cumulative error versus the number of microphone arrays.

Detailed Description

To explain the technical means and effects adopted by the present invention to achieve its objectives, a detailed description is given below in conjunction with embodiments.

To overcome the difficulty that existing sound source localization methods fail to fully consider the time-difference relationships, spatial layout, and data regularities of multiple microphones, the present invention provides a deep residual convolutional neural network sound source localization method and system based on time difference of arrival (TDOA). The method first extracts time-delay features from the array audio signals with the phase-transform-weighted generalized cross-correlation (GCC-PHAT) delay algorithm. These features, combined with the spectrogram magnitude and phase information of the raw array signals, are fed into a deep residual convolutional network (DRCN) that learns the spatial characteristics of the audio data and jointly exploits the time-difference spatial relationships and data regularities to predict the source position. Finally, a prediction loss is constructed and the localization model is trained on audio data from multiple microphones, enabling fast, higher-precision prediction of source coordinates in different scenarios, even from limited audio data.

As shown in FIG. 1, a convolutional neural network sound source localization method based on time difference of arrival includes:

S1: acquiring a sound source data set through a sound source array receiver, and normalizing it to obtain a sample data set.

The audio signals propagated from the sound source are received by the array receiver, and the coordinate position of the corresponding source is recorded, producing the sound source data set.

When acquiring the data set, abnormal data are cleaned, including abnormal measurement point coordinates, abnormal measurement point values, and abnormal data dimensions.

In this embodiment, preprocessing first cleans anomalies from the data set, then splits the cleaned data into training, test, and validation sets for subsequent model training, testing, and validation, and finally normalizes the data.

Data cleaning: abnormal data points in the raw data are removed, including but not limited to abnormal measurement point coordinates, abnormal measurement point values, and abnormal data dimensions.

Data splitting: the cleaned data are divided into training, test, and validation sets in a 7:2:1 ratio.

Data normalization: all read audio parameter values are normalized to the interval [0, 1], which helps speed up the convergence of the network model. The normalization can be written as t' = (t - t_s) / (t_e - t_s), where t_s and t_e denote the start and end times of the audio signal recording. Each normalized signal, together with its actual three-dimensional Cartesian coordinates (x, y, z), forms one sample of the sample data set D.
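To make step S1 concrete, here is a minimal Python sketch of the preprocessing (cleaning, normalization, 7:2:1 split). The array shapes, the finite-value anomaly check, and the helper name are illustrative assumptions, not details fixed by the patent; the normalization follows the min-max form reconstructed above.

```python
import numpy as np

def make_sample_set(signals, coords, t_start, t_end, seed=0):
    """Clean, normalize, and split the sound source data set (step S1).

    signals: (N, M, L) array of N recordings from M microphones, L points each.
    coords:  (N, 3) true Cartesian coordinates (x, y, z) of each source.
    t_start, t_end: start and end times of the recordings, used in the
    min-max normalization t' = (t - t_start) / (t_end - t_start).
    """
    # Data cleaning: drop samples with non-finite values
    # (coordinate-range checks would additionally need the array layout).
    ok = np.isfinite(signals).all(axis=(1, 2)) & np.isfinite(coords).all(axis=1)
    signals, coords = signals[ok], coords[ok]

    # Normalization to [0, 1], which speeds up network convergence.
    normed = (signals - t_start) / (t_end - t_start)

    # 7:2:1 split into training, test, and validation sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(normed))
    n_tr, n_te = int(0.7 * len(idx)), int(0.2 * len(idx))
    splits = (idx[:n_tr], idx[n_tr:n_tr + n_te], idx[n_tr + n_te:])
    return [(normed[s], coords[s]) for s in splits]
```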

S2: extracting time-delay features of the sample data set with the GCC-PHAT algorithm.

Time-delay feature extraction first applies the phase-transform-weighted generalized cross-correlation algorithm (GCC-PHAT) to the training samples to obtain time-delay features, which are then fed into the deep residual convolutional network to learn high-order semantic features such as time-domain characteristics and acoustic-array spatial information.

Time-delay feature extraction: for a microphone pair (x_i, x_j), the GCC-PHAT generalized cross-correlation can be written as

R_ij(τ) = ∫ ψ(f) X_i(f) X_j*(f) e^{j2πfτ} df,

where ψ(f) = 1 / |X_i(f) X_j*(f)| is the PHAT weighting function, X_i(f) X_j*(f) is the cross-power spectrum of x_i and x_j, f is the frequency, and τ is the time delay, i.e., the time difference between the two signals. One audio signal x_i serves as the reference; every other signal x_j is first conjugated and then weighted-convolved with the reference in turn to obtain the cross-power spectrum between the microphone signals, and the inverse short-time Fourier transform finally yields the required time-delay feature. After extraction over all samples, the set of time-delay features T is obtained.
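A minimal NumPy sketch of GCC-PHAT delay estimation for one microphone pair is given below. The zero-padded FFT length, the interpolation factor, and the small epsilon stabilizing the PHAT weight are implementation assumptions rather than details fixed by the patent.

```python
import numpy as np

def gcc_phat(x_i, x_j, fs, max_tau=None, interp=1):
    """Estimate the delay between signals x_i and x_j with GCC-PHAT.

    Returns the delay tau (seconds) and the generalized cross-correlation
    curve used as the time-delay feature.
    """
    n = len(x_i) + len(x_j)                 # zero-pad to avoid circular wrap
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)              # cross-power spectrum X_i(f) X_j*(f)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting psi(f) = 1/|X_i X_j*|
    cc = np.fft.irfft(cross, n=interp * n)  # back to the lag domain
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs), cc   # peak location gives the delay
```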

S3: training the deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained network.

During training, the time-delay features are input into the deep residual convolutional network: a feature extraction module first extracts time-domain features, a convolution module then mines the spatial information of the acoustic array, a residual module produces high-order semantic information of the source signal, and a fully connected module finally outputs the coordinate values of the sound source.

The coordinate values of the sound source are three-dimensional Cartesian coordinates.

The fully connected module consists of a max pooling layer and a fully connected layer.

The feature extraction module consists of a framing layer using a Hamming window of length 512 (with a frame shift of half the window length) and a short-time Fourier transform layer applied to the frames.

The convolution module consists of a convolution layer, a normalization layer, and a max pooling layer. The convolution layer is a two-dimensional convolution of size (64, 7, 7) with a stride of 2, i.e., 64 channels with 7×7 kernels, followed by a batch normalization layer and a max pooling layer; the normalization layer applies batch normalization to the input four-dimensional array without changing its dimensions; the max pooling layer uses a 3×3 kernel with a stride of 2.

The residual module consists of multiple basic residual blocks. Each block is an ordinary convolutional structure inside, but a cross-layer connection is introduced between its input and output, so information can flow directly from input to output. A sketch of the full architecture follows.
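The PyTorch sketch below assembles these modules using the stated hyperparameters (length-512 Hamming window with half-window hop, a (64, 7, 7) stride-2 convolution with batch normalization, 3×3 stride-2 max pooling, basic residual blocks, and a max-pooling plus fully connected head regressing (x, y, z)). The number of residual blocks, the use of the STFT magnitude as network input, and the padding values are assumptions the patent leaves open.

```python
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # cross-layer connection: input added to output

class DRCN(nn.Module):
    """Sketch of the deep residual convolutional localization network."""
    def __init__(self, n_blocks=4):  # block count is an assumption
        super().__init__()
        # Convolution module: 64 filters of size 7x7 with stride 2,
        # batch normalization, then 3x3 max pooling with stride 2.
        self.conv_module = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.res_module = nn.Sequential(*[BasicResBlock(64) for _ in range(n_blocks)])
        # Fully connected module: max pooling, flatten, regression to (x, y, z).
        self.fc_module = nn.Sequential(
            nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(64, 3))

    def forward(self, delay_feat):  # delay_feat: (batch, time) delay features
        # Feature extraction module: length-512 Hamming window, hop 256
        # (half the window), followed by the short-time Fourier transform.
        spec = torch.stft(delay_feat, n_fft=512, hop_length=256,
                          window=torch.hamming_window(512, device=delay_feat.device),
                          return_complex=True)
        x = spec.abs().unsqueeze(1)      # (batch, 1, freq, frames)
        x = self.res_module(self.conv_module(x))
        return self.fc_module(x)         # predicted Cartesian coordinates
```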

As shown in FIG. 2, time-domain features and acoustic-array spatial information are extracted as follows. The time-delay features of the training samples are input into the deep residual convolutional network; the feature extraction module extracts time-domain features from the input; these time-domain features are fused with the acoustic-array spatial information and passed in turn through the convolution module and the residual module to obtain high-order semantic features; finally, the high-order semantic features are fed to the fully connected layer to produce the sound source localization prediction, and the network is optimized by minimizing the prediction loss objective, making the extracted features more effective.

In this embodiment, the extraction of time-domain features and acoustic-array spatial information proceeds by feeding the obtained time-delay features into the deep residual convolutional network: the feature extraction module extracts time-domain features, the convolution module mines the acoustic-array spatial information while reducing parameters and computation, and the residual module yields the high-order semantic information of the source signal. This process can be written as h = f_res(f_conv(f_feat(τ))), where f_feat, f_conv, and f_res denote the feature extraction, convolution, and residual modules, respectively.

Deep residual convolutional network training: training effectively captures the spatio-temporal dependencies among the microphone arrays and improves the generalization ability of the model.

Network output: after the residual module, the features are passed to a fully connected module consisting of a max pooling layer and a fully connected layer; the pooling result is flattened into a long vector that summarizes the low-level information and features produced by the preceding convolution and pooling layers, and the output is the three-dimensional Cartesian coordinate values on the x, y, and z axes. This can be written as ŷ = f_fc(h), where f_fc denotes the fully connected module.

Loss function: the mean square error (MSE) is used as the loss for training the deep residual convolutional network, written as L_MSE = (1/N) Σ_{i=1}^{N} ||ŷ_i - y_i||², where ŷ_i and y_i denote the predicted and true Cartesian coordinates of source signal i, respectively.
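A minimal training loop under these definitions might look as follows; the optimizer, learning rate, and epoch count are assumptions, since the patent fixes only the MSE loss.

```python
import torch

def train_drcn(model, loader, epochs=100, lr=1e-3, device="cpu"):
    """Train the DRCN with the MSE loss L = (1/N) * sum ||y_hat - y||^2."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer is an assumption
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for delay_feat, coords in loader:   # delay features and true (x, y, z)
            delay_feat, coords = delay_feat.to(device), coords.to(device)
            loss = mse(model(delay_feat), coords)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```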

S4: inputting the time-delay features of the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.

As shown in FIG. 3, localization prediction first uses the microphone array to extract the arrival time differences of multiple microphone-pair signals and computes their generalized cross-correlation functions, yielding the time-difference relationships among the microphone data; the trained network then performs spatial inference on these features to infer the spatial position of the unknown source and capture the spatial dependencies among the arrays; finally, the trained network is used for localization prediction, as in the sketch below.
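This sketch chains the `gcc_phat` and `DRCN` helpers sketched earlier; the choice of the first channel as reference and the flattening of the pairwise correlation curves into a single input sequence are assumptions, since the patent does not fix the feature layout.

```python
import numpy as np
import torch

def predict_source(model, mic_signals, fs):
    """Predict source coordinates from raw synchronized microphone channels.

    mic_signals: (M, L) array of M microphone channels of L samples each.
    """
    ref = mic_signals[0]                       # first channel as reference
    feats = [gcc_phat(ref, ch, fs)[1] for ch in mic_signals[1:]]
    x = torch.tensor(np.stack(feats), dtype=torch.float32).reshape(1, -1)
    model.eval()
    with torch.no_grad():
        return model(x).squeeze(0).numpy()     # predicted (x, y, z)
```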

In this embodiment, the localization error metric is the absolute error of the coordinate measurement, which describes the localization accuracy. Let the actual source position be p = (x, y, z) and the estimated position be p̂ = (x̂, ŷ, ẑ); the absolute error e is defined as e = ||p̂ - p|| = √((x̂ - x)² + (ŷ - y)² + (ẑ - z)²).
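In code, this metric is simply the Euclidean distance between the two coordinate triples:

```python
import numpy as np

def localization_error(p_true, p_est):
    """Absolute localization error between actual and estimated (x, y, z)."""
    return float(np.linalg.norm(np.asarray(p_est) - np.asarray(p_true)))

# Example: localization_error((10.0, 20.0, 5.0), (10.4, 19.7, 5.2)) ~= 0.539
```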

The present invention is a deep residual convolutional neural network sound source localization method based on time difference of arrival that jointly exploits the time-difference relationships, spatial layout, and data regularities of multiple microphones to predict the source position reliably. For feature extraction, a deep residual convolutional network learns and captures time-domain features and acoustic-array spatial information; for the spatial dependencies of the array, the generalized cross-correlation function and the network perform spatial inference, so the method accurately predicts the source position from limited measurement-point data.

To illustrate the effects of the method provided by the present invention, the following simulation experiments are described with reference to the drawings.

1. Simulation conditions

The simulation was implemented in the Python programming language on an Intel i5-11500 2.70 GHz CPU with 16 GB of memory running the Windows 11 operating system.

The experimental data are based on real data: within a 200 m × 200 m × 100 m experimental area, simulated data generation produced the audio signals and source coordinates of 2000 training samples and 500 test samples for subsequent network training and testing.

As shown in FIG. 4, FIG. 4(a) is a two-dimensional layout of eight microphone-array measurement points selected within the area, and FIG. 4(b) shows the layout within each measurement point, where 9 microphones form an L-shaped array.

2. Simulation content

A position is randomly generated within the region enclosed by the measurement points, and the source audio data is placed there to simulate the true source position. To verify the stability of the simulation results, the source position was randomly generated 10 times and 10 localization simulations were run.

As shown in FIG. 5, FIG. 5(a) gives the simulation results using only a single microphone array and FIG. 5(b) those using eight microphone-array measurement points; using multiple arrays yields smaller errors and more stable position estimates than using a single array.

To further verify the effectiveness of using multiple microphone-array signals, the true positions of the randomly placed sources and the positions estimated by the algorithm over the 10 multi-array simulations are visualized.

As shown in FIG. 6, FIG. 6(a) is a two-dimensional and FIG. 6(b) a three-dimensional visualization of the predictions; the predicted source positions essentially coincide with the actual positions.

Sound source localization prediction: the absolute error e of each test sample is computed to compare the estimated source position with the actual position; the results are shown in Table 1. Rows 1-8 give the absolute error between the estimated and actual positions when 1 to 8 microphone arrays are used; 10 simulations were run for each array count, and, for a clear quantitative presentation, the maximum absolute error over the 10 runs is reported for each count.

As shown in Table 1 and FIG. 7, as the number of microphone arrays increases, the cumulative experimental error keeps decreasing and levels off. Based on the results in Table 1, the number of arrays can be set to 5 or 6, which reduces the experimental hardware cost while preserving localization accuracy and keeping the computational overhead under control.

Table 1 Actual and estimated sound source locations and error results

In summary, the invention meets the localization error accuracy targets and achieves better results. It combines traditional audio signal processing with deep learning: the phase-transform-weighted generalized cross-correlation algorithm performs preliminary time-delay feature extraction on the audio signals, and a deep residual convolutional network then learns and captures high-order semantic information such as time-domain features and acoustic-array spatial information. The invention fully extracts the key features of the array audio signals, makes localization prediction faster and more accurate, and its effectiveness is verified by the above simulations.

The present invention provides a convolutional neural network sound source localization system based on time difference of arrival, comprising:

a data acquisition module for acquiring a sound source data set through a sound source array receiver and normalizing it to obtain a sample data set;

a model acquisition module for extracting time-delay features of the sample data set with the GCC-PHAT algorithm, and for

training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained deep residual convolutional network; and

a sound source localization module for inputting the time-delay features corresponding to the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A convolutional neural network sound source localization method based on time difference of arrival, characterized by comprising the following steps:
acquiring a sound source data set through a sound source array receiver, and normalizing the data set to obtain a sample data set;
extracting time-delay features of the sample data set according to a phase-transform-weighted generalized cross-correlation algorithm;
training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained deep residual convolutional network; and
inputting the time-delay features corresponding to the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.
2. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 1, wherein during training of the deep residual convolutional network, the time-delay features are input into the network; time-domain features are first extracted by a feature extraction module, acoustic-array spatial information is then mined by a convolution module, high-order semantic information of the source signal is obtained by a residual module, and, after transmission to a fully connected module, the coordinate values of the sound source are output.
3. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 2, wherein the fully connected module consists of a max pooling layer and a fully connected layer.
4. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 2, wherein the feature extraction module consists of a framing layer using Hamming windows of length 512 and a short-time Fourier transform layer applied to the frames;
the convolution module consists of a convolution layer, a normalization layer, and a max pooling layer; and
the residual module consists of a plurality of basic residual blocks.
5. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 4, wherein in the convolution module, the convolution layer is a two-dimensional convolution of size (64, 7, 7) with a stride of 2; the normalization layer applies batch normalization to the input four-dimensional array without changing its dimensions; and the max pooling layer uses a 3×3 kernel with a stride of 2.
6. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 1, wherein the coordinate values of the sound source are three-dimensional Cartesian coordinates.
7. The time-difference-of-arrival-based convolutional neural network sound source localization method according to claim 1, wherein abnormal data are cleaned when the sound source data set is acquired, the abnormal data comprising abnormal measurement point coordinates, abnormal measurement point values, and abnormal data dimensions.
8. A convolutional neural network sound source localization system based on time difference of arrival, characterized by comprising:
a data acquisition module for acquiring a sound source data set through a sound source array receiver and normalizing the data set to obtain a sample data set;
a model acquisition module for extracting time-delay features of the sample data set according to a phase-transform-weighted generalized cross-correlation algorithm, and for training a deep residual convolutional network with the time-delay features as input, the coordinate values of the sound source as output, and the mean square error as the loss function, to obtain a trained deep residual convolutional network; and
a sound source localization module for inputting the time-delay features corresponding to the sound source to be located into the trained deep residual convolutional network to obtain the coordinate values of the sound source.
CN202411038961.1A 2024-02-04 2024-07-31 Convolutional neural network sound source localization method and system based on arrival time difference Pending CN118859113A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2024101549236 2024-02-04
CN202410154923 2024-02-04

Publications (1)

Publication Number Publication Date
CN118859113A 2024-10-29

Family

ID=93168996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411038961.1A Pending CN118859113A (en) 2024-02-04 2024-07-31 Convolutional neural network sound source localization method and system based on arrival time difference

Country Status (1)

Country Link
CN (1) CN118859113A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120103269A (en) * 2025-05-07 2025-06-06 深圳大学 A sound source spatial positioning method, system, terminal and storage medium based on the fusion of artificial intelligence and optical fiber sensor
CN120103269B (en) * 2025-05-07 2025-09-19 深圳大学 Sound source space positioning method, system, terminal and storage medium based on fusion of artificial intelligence and optical fiber sensor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination