Disclosure of Invention
The invention provides a speaker recognition method and system based on speech enhancement, which solve, or at least partially solve, the technical problem of poor voiceprint recognition performance in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a speaker recognition method based on speech enhancement, including:
S1: collecting a large amount of original voice data;
S2: removing the interference noise and irrelevant speakers' voices contained in the original voice data to obtain enhanced voice data;
S3: extracting Mel-frequency cepstral coefficient (MFCC) features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, taking the acoustic features extracted from the large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction using the methods of S2 and S3, inputting the results into the trained model to obtain the deep feature of each registered voice sample, taking the deep feature as the speaker feature of each speaker, and storing the speaker features; then obtaining the voice data of the speaker to be recognized, performing voice enhancement and feature extraction using the methods of S2 and S3, inputting the result into the trained model to obtain the features of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between these features and the stored speaker features.
In one embodiment, step S1 is performed by recording raw voice data.
In one embodiment, step S2 uses a generative adversarial network (GAN) to remove the interference noise and irrelevant speakers' voices contained in the original voice data, achieving end-to-end voice enhancement.
In one embodiment, step S3 includes:
S3.1: performing voice activity detection on the enhanced voice data to remove long silent segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy value of each frame in the frequency band covered by each triangular filter;
S3.5: taking the logarithm of the energy value of each frame in the frequency band covered by each triangular filter to obtain the log energy output by each filter;
S3.6: applying a discrete cosine transform to the log energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC features and the GFCC features of the voice signal to obtain the acoustic features of the voice signal.
In one embodiment, step S4 includes:
performing voice enhancement on the large amount of collected original voice data, extracting acoustic features from the enhanced voice data as training data, and inputting the training data into the speaker recognition model for training to obtain the trained model.
In one embodiment, the registration data in step S5 include h voice samples for each speaker, and recognizing the identity of the speaker to be recognized according to the similarity between the features of the speaker to be recognized and the features of the registered speakers comprises:
performing voice enhancement and feature extraction on each voice sample in the registration data, and extracting the deep feature of each voice sample from the resulting acoustic features through the convolutional neural network of the speaker recognition model;
averaging the h deep features of each speaker to obtain the speaker feature of that speaker, and storing the speaker features in a database;
performing voice enhancement and feature extraction on the voice data of the speaker to be recognized, and inputting the result into the trained model to obtain the features of the speaker to be recognized;
calculating the cosine similarity between the features of the speaker to be recognized and all the speaker features stored in the database; if the maximum cosine similarity is greater than a set threshold, the database speaker corresponding to that similarity is taken as the recognized identity; otherwise, the speaker is rejected.
Based on the same inventive concept, the second aspect of the present invention provides a speaker recognition system based on speech enhancement, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing interference noise and irrelevant speaker voice contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter-based cepstral coefficient GFCC features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
the model training module is used for constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting registered voice samples, performing voice enhancement and feature extraction using the methods of the voice enhancement module and the voice feature extraction module, inputting the results into the trained model to obtain the deep feature of each registered voice sample, taking the deep feature as the speaker feature of each speaker, and storing the speaker features; it then obtains the voice data of the speaker to be recognized, performs voice enhancement and feature extraction using the methods of the voice enhancement module and the voice feature extraction module, inputs the result into the trained model to obtain the features of the speaker to be recognized, and recognizes the identity of the speaker to be recognized according to the similarity between these features and the stored speaker features.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
The invention provides a speaker recognition method based on voice enhancement. It uses an end-to-end voice enhancement method to remove noise and irrelevant speakers' voices from the speech, employs the more noise-robust GFCC features in the voiceprint recognition process, and fuses the MFCC features and the GFCC features into the acoustic features of the voice, which improves noise robustness. It constructs a speaker recognition model based on a convolutional neural network and trains the model with the training data; it then collects registered voice samples, extracts and stores the speaker features of each registered speaker, and recognizes the identity of the speaker to be recognized according to the similarity between the features of the speaker to be recognized and the stored speaker features. This solves the prior-art problem of poor voiceprint recognition caused by noise contained in the speech and improves the recognition accuracy of voiceprint recognition.
Detailed Description
The invention aims to provide a speaker recognition method based on voice enhancement, which solves the prior-art problem of poor recognition performance caused by noise in the voice to be recognized preventing accurate feature extraction.
The main concept of the invention is as follows:
Firstly, a large amount of original voice data is collected, and the interference noise and irrelevant speakers' voices contained in the original voice data are removed to obtain enhanced voice data. MFCC features and Gammatone-filter-based cepstral coefficient GFCC features are extracted from the enhanced voice data, and the MFCC features and the GFCC features are fused to obtain the acoustic features of the voice. Then, a speaker recognition model based on a convolutional neural network is constructed, the acoustic features extracted from the large amount of original voice data are taken as training data, and the speaker recognition model is trained to obtain a trained model. Registered voice samples are collected, voice enhancement and feature extraction are performed using the methods of S2 and S3, and the results are input into the trained model to obtain the deep feature of each registered voice sample, which is taken as the speaker feature of each speaker and stored. Finally, the voice data of the speaker to be recognized is obtained, voice enhancement and feature extraction are performed using the methods of S2 and S3, the result is input into the trained model to obtain the features of the speaker to be recognized, and the identity of the speaker to be recognized is recognized according to the similarity between these features and the stored speaker features.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment of the invention provides a speaker recognition method based on voice enhancement, which comprises the following steps:
S1: collecting a large amount of original voice data;
S2: removing the interference noise and irrelevant speakers' voices contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient GFCC features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, taking the acoustic features extracted from the large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction using the methods of S2 and S3, inputting the results into the trained model to obtain the deep feature of each registered voice sample as the speaker feature of each speaker, and storing the speaker feature of each speaker; then obtaining the voice data of the speaker to be recognized, performing voice enhancement and feature extraction using the methods of S2 and S3, inputting the result into the trained model to obtain the features of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between these features and the stored speaker features.
Specifically, in the speaker recognition model training module, the network model uses a convolutional neural network, the classifier uses softmax, and the trained model is an offline model. The registered voice data includes a plurality of speakers, each speaker including h voice samples.
Please refer to fig. 1, which is a flowchart of a speaker recognition method based on speech enhancement according to an embodiment of the present invention.
In one embodiment, step S1 is performed by recording raw voice data.
In one embodiment, step S2 uses a generative adversarial network (GAN) to remove the interference noise and irrelevant speakers' voices contained in the original voice data, achieving end-to-end voice enhancement.
The generator of the generative adversarial network is a fully convolutional encoder-decoder structure used to remove noise from the speech and generate a clean speech waveform; the discriminator sets a threshold, based on the clean and noisy speech waveforms, for judging whether a generated speech waveform is clean, and when the generated waveform reaches this threshold it is considered sufficiently clean.
The invention implements an end-to-end voice enhancement method within a generative adversarial framework to remove interference noise and irrelevant speakers' voices from the speech.
In a specific implementation, clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to each clean utterance; the clean speech dataset and the corresponding noisy speech dataset are then used to train the generative adversarial network that realizes end-to-end voice enhancement, as sketched below.
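The following Python sketch illustrates the random-SNR mixing step. The [-10, 10] dB range follows the example further below; the function and variable names are illustrative assumptions, not part of the invention.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` so the mixture has the requested SNR (dB)."""
    if len(noise) < len(clean):  # tile the noise clip to cover the clean utterance
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose the scale so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in 1 s clean utterance at 16 kHz
noise = rng.standard_normal(4000)    # stand-in everyday-noise clip
noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(-10, 10))  # random SNR in [-10, 10] dB
```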
The speech enhancement model training process is described in detail below, taking the training of a model on a dataset containing 1000 clean utterances as an example.
The clean speech set and the everyday-noise data set are mixed at random signal-to-noise ratios (typically between -10 dB and 10 dB) to obtain the noisy speech set corresponding to the clean speech set. The noisy speech is passed through the generator network to obtain generated clean speech, and the discriminator network judges whether its input is real clean speech: given generated clean speech the discriminator should output 0, and given real clean speech it should output 1. Parameters are then updated by back-propagating the error gradient obtained from the loss function until the discriminator can no longer reliably distinguish generated clean speech from real clean speech; at that point the generator network is the trained voice enhancement network. Intuitively, the discriminator keeps telling the generator how to adjust so that the clean speech it generates becomes more realistic.
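A minimal adversarial training sketch under these conventions follows. The two toy networks, layer sizes, stand-in batches, and optimizer settings are illustrative assumptions and far shallower than a practical waveform enhancer; the sketch only shows the real-clean-to-1 / generated-to-0 training scheme described above.

```python
import torch
import torch.nn as nn

# Toy fully convolutional encoder-decoder generator (a real enhancer is much deeper).
generator = nn.Sequential(
    nn.Conv1d(1, 16, 31, stride=2, padding=15), nn.PReLU(),
    nn.ConvTranspose1d(16, 1, 32, stride=2, padding=15), nn.Tanh(),
)
# Discriminator: estimates the probability that its input is real clean speech.
discriminator = nn.Sequential(
    nn.Conv1d(1, 16, 31, stride=2, padding=15), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

# Stand-in (clean, noisy) waveform batches; real pairs come from random-SNR mixing.
loader = [(torch.randn(4, 1, 16384), torch.randn(4, 1, 16384)) for _ in range(2)]

for clean, noisy in loader:
    real = torch.ones(clean.size(0), 1)
    fake_lbl = torch.zeros(clean.size(0), 1)
    enhanced = generator(noisy)
    # Discriminator update: real clean speech -> 1, generated speech -> 0.
    d_loss = (bce(discriminator(clean), real)
              + bce(discriminator(enhanced.detach()), fake_lbl))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: adjust so the generated speech is judged real (-> 1).
    g_loss = bce(discriminator(enhanced), real)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```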
In one embodiment, step S3 includes:
S3.1: performing voice activity detection on the enhanced voice data to remove long silent segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy value of each frame in the frequency band covered by each triangular filter;
S3.5: taking the logarithm of the energy value of each frame in the frequency band covered by each triangular filter to obtain the log energy output by each filter;
S3.6: applying a discrete cosine transform to the log energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC features and the GFCC features of the voice signal to obtain the acoustic features of the voice signal.
In a specific implementation, the preprocessing includes pre-emphasis, framing, and windowing. The specific steps of feature extraction are as follows (a condensed code sketch is given after the steps and figure references below):
S301: performing voice activity detection (VAD) on the enhanced voice to remove long silent segments;
S302: the speech signal is pre-emphasized by passing it through a high-pass filter H(z) = 1 - μz^(-1), where H(z) is the transfer function of the high-pass filter, μ is the pre-emphasis coefficient (typically 0.97), and z is the z-transform variable.
S303: the sampling frequency of the voice signal is 16KHz, 512 sampling points are firstly grouped into a frame, and the corresponding time length is 512/16000 × 1000 ═ 32 ms. An overlap region is formed between two adjacent frames, and the overlap region includes 256 sampling points, 1/2 of sampling point 512.
S304: assuming that the signal after framing is s (N), N is 0,1, and N-1, where N is the total number of frames, each frame is multiplied by a hamming window:
x(n)=s(n)×W(n),
w (n) is a Hamming window; n is the total frame number; n-1, 0, 1.
S305: and performing fast Fourier transform on each frame signal x (n) after the framing and windowing to obtain the frequency spectrum of each frame, and performing modular squaring on the frequency spectrum of the voice signal to obtain the power spectrum of the voice signal. The discrete fourier transform of a speech signal (the speech signal is stored in discrete form) is:
x (n) is the input speech signal, and T represents the number of points of the Fourier transform.
S306: performing fast Fourier transform to obtain a power spectrum | X (k) | non-conducting
2Triangular filter H through a set of Mel scales
m(k) M is more than or equal to 0 and less than or equal to M, and M is the number of filters: respectively multiplying and accumulating the power spectrum with each filter to obtain the energy value of the frame data in the corresponding frequency band of the filter
S307: taking log of the energy values, the log energy output by each filter bank is calculated as:
t represents the number of points of Fourier transform; m is the number of the filters; | X (k) messaging2The power spectrum obtained for S4; hm(k) M is more than or equal to 0 and less than or equal to M is a group of triangular filters with the Mel scale.
S308: substituting the logarithmic energy of S307 into discrete cosine transform to obtain L-order Mel cepstrum coefficient MFCC:
l refers to the order of the MFCC coefficient, and is usually 12-16; m is the number of the triangular filters, and M is more than or equal to 0 and less than or equal to M.
S309: and (3) passing the power spectrum obtained by the fast Fourier transform through a Gamma atom filter, and then performing index compression and Discrete Cosine Transform (DCT) to obtain the GFCC characteristics of the voice signal.
S310: and cascading the MFCC characteristics and the GFCC characteristics of the voice signal to obtain the GMCC characteristics of the voice signal.
Fig. 2 and fig. 3 are a flow chart of voice feature MFCC extraction and a flow chart of voice feature GFCC extraction, respectively, in the implementation of the present invention.
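The sketch below condenses S301-S310 in Python. It is a sketch under stated assumptions, not the patent's exact implementation: VAD and pre-emphasis are omitted, librosa's default windowing stands in for the Hamming window, and the ERB-spaced 32-filter Gammatone bank with cube-root compression is a common GFCC choice assumed here; all names are illustrative.

```python
import numpy as np
import librosa
from scipy.fft import dct
from scipy.signal import gammatone, freqz

SR, N_FFT, HOP = 16000, 512, 256  # 32 ms frames with 16 ms shift at 16 kHz

def erb_space(low, high, n):
    # Glasberg & Moore ERB-rate scale for the gammatone centre frequencies.
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * low),
                    21.4 * np.log10(1 + 0.00437 * high), n)
    return (10 ** (e / 21.4) - 1) / 0.00437

def gfcc(y, sr=SR, n_filters=32, n_ceps=13):
    power = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)) ** 2  # |X(k)|^2
    bin_freqs = np.linspace(0, sr / 2, power.shape[0])
    fbank = np.empty((n_filters, power.shape[0]))
    for i, fc in enumerate(erb_space(50, sr / 2 - 100, n_filters)):
        b, a = gammatone(fc, 'fir', fs=sr)         # FIR gammatone prototype filter
        _, h = freqz(b, a, worN=bin_freqs, fs=sr)  # its response at the FFT bins
        fbank[i] = np.abs(h) ** 2
    energies = fbank @ power             # gammatone filterbank energies per frame
    return dct(np.cbrt(energies), axis=0, norm='ortho')[:n_ceps]  # compress + DCT

def gmcc(y, sr=SR, n_ceps=13):
    # Fused acoustic feature: MFCC and GFCC concatenated along the coefficient axis.
    mf = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_ceps, n_fft=N_FFT, hop_length=HOP)
    return np.concatenate([mf, gfcc(y, sr, n_ceps=n_ceps)], axis=0)

y = np.random.default_rng(1).standard_normal(16000).astype(np.float32)  # stand-in audio
print(gmcc(y).shape)  # (26, frames): 13 MFCC + 13 GFCC coefficients per frame
```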
In one embodiment, step S4 includes:
Voice enhancement is performed on the large amount of collected original voice data, acoustic features are extracted from the enhanced voice data as training data, and the training data are input into the speaker recognition model for training to obtain the trained model.
Specifically, model training is an offline process. The speaker recognition model is trained as follows:
Training samples are collected by recording; the collected voice samples pass through the voice preprocessing module (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice; the GMCC features are taken as the input of the model, and the speaker recognition model is trained using a convolutional neural network structure with softmax classification.
The speaker recognition model training process is described below, taking the training of a model covering 1000 speakers as an example.
Samples are collected for each speaker, 100 samples per speaker. All voice samples pass through the voice preprocessing module (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice, which serve as the training data of the convolutional neural network (the speaker recognition model); all training data are randomly split 5:1 into a training set and a validation set. The convolutional network is trained on the training set; when the recognition accuracy of the trained network on the validation set essentially stops improving, training is complete; otherwise, training continues. The trained convolutional network is the offline speaker recognition model.
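A minimal sketch of such a CNN-plus-softmax trainer follows. The layer sizes, embedding dimension, stand-in tensors, fixed epoch count, and class names are illustrative assumptions, not the patent's architecture; the embedding layer is what later supplies the deep features.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

N_SPEAKERS = 1000  # as in the example above

class SpeakerCNN(nn.Module):
    """Toy CNN; the embedding output later serves as the deep feature."""
    def __init__(self, emb_dim=256, n_speakers=N_SPEAKERS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, emb_dim)             # deep-feature layer
        self.classify = nn.Linear(emb_dim, n_speakers)  # softmax classifier head

    def forward(self, x, return_embedding=False):
        e = self.embed(self.conv(x).flatten(1))
        return e if return_embedding else self.classify(e)

# Stand-in data: (N, 1, coeffs, frames) GMCC feature maps with speaker labels.
feats = torch.randn(600, 1, 26, 200)
labels = torch.randint(0, N_SPEAKERS, (600,))
train_set, val_set = random_split(TensorDataset(feats, labels), [500, 100])  # ~5:1

model = SpeakerCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally

for epoch in range(3):  # in practice, stop when validation accuracy plateaus
    for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```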
In one embodiment, the registration data in step S5 include h voice samples for each speaker, and recognizing the identity of the speaker to be recognized according to the similarity between the features of the speaker to be recognized and the features of the registered speakers comprises:
performing voice enhancement and feature extraction on each voice sample in the registration data, and extracting the deep feature of each voice sample from the resulting acoustic features through the convolutional neural network of the speaker recognition model;
averaging the h deep features of each speaker to obtain the speaker feature of that speaker, and storing the speaker features in a database;
performing voice enhancement and feature extraction on the voice data of the speaker to be recognized, and inputting the result into the trained model to obtain the features of the speaker to be recognized;
calculating the cosine similarity between the features of the speaker to be recognized and all the speaker features stored in the database; if the maximum cosine similarity is greater than a set threshold, the database speaker corresponding to that similarity is taken as the recognized identity; otherwise, the speaker is rejected.
Registration mode:
Registration samples are collected by recording; the collected registration samples pass through the voice preprocessing module to obtain the GMCC features of the voice; the deep feature of each voice sample is extracted from the GMCC features through the offline speaker recognition model; and the registration data (i.e., the speaker feature of each speaker) are generated and stored in a database.
For example, samples of 10 speakers are collected (20 voice samples per person); the voice preprocessing module processes all voice samples to obtain the GMCC features of the voice; the GMCC features of the 200 voice samples are passed through the offline speaker recognition model to obtain their deep features; the 20 deep features of each speaker are then averaged to form that speaker's feature; and the 10 speaker features are saved in the database as speaker0, speaker1, ..., speaker9.
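A registration sketch continuing the SpeakerCNN example above; the stand-in feature maps and naming scheme are illustrative assumptions.

```python
import numpy as np
import torch

@torch.no_grad()
def deep_feature(model, gmcc_map):
    # gmcc_map: (1, coeffs, frames) tensor produced by the preprocessing pipeline.
    return model(gmcc_map.unsqueeze(0), return_embedding=True).squeeze(0).numpy()

def register(model, samples_by_speaker):
    """Average the h deep features of each speaker's registration samples."""
    return {name: np.mean([deep_feature(model, m) for m in maps], axis=0)
            for name, maps in samples_by_speaker.items()}

# Stand-in registration data: 10 speakers with 20 samples each, as in the example.
model = SpeakerCNN()  # the sketch from the training example above
database = register(model, {
    f"speaker{i}": [torch.randn(1, 26, 200) for _ in range(20)] for i in range(10)
})
```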
Recognition mode:
A sample to be recognized is collected by recording; the GMCC features of the sample are obtained through the voice preprocessing module; the deep feature of the sample is obtained from the GMCC features through the offline speaker recognition model and taken as the features of the speaker to be recognized; the cosine similarity between the features of the speaker to be recognized and every speaker feature in the database is calculated; if the maximum cosine similarity is greater than a set threshold, the database speaker corresponding to that similarity is the recognized speaker; otherwise, the speaker is rejected.
For example, a segment of the speaker's voice data is collected; its GMCC features are obtained through the voice preprocessing module; the deep feature of the voice data is obtained from the GMCC features through the offline speaker recognition model and taken as the speaker feature; the cosine similarities between this feature and the 10 speaker features stored in the database are calculated to obtain cos0, cos1, ..., cos9; the maximum value cos_max of the 10 cosine similarities and the corresponding speaker index speaker_x are found; if the maximum value is greater than the set threshold, the speaker is accepted as speaker_x; otherwise, the speaker is judged to be unregistered.
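A matching sketch of the scoring step, continuing the registration example above; the 0.6 threshold is an arbitrary placeholder, since the patent leaves the threshold value unspecified.

```python
import numpy as np
import torch

def identify(query, database, threshold=0.6):
    """Return the best-matching registered speaker, or None to reject."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {name: cosine(query, feat) for name, feat in database.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None  # reject unregistered speakers

query = deep_feature(model, torch.randn(1, 26, 200))  # features of the speaker to recognize
print(identify(query, database))  # a speaker name, or None for an unregistered speaker
```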
In summary, the invention realizes a speaker recognition method based on speech enhancement through speech acquisition, speech enhancement, speech feature extraction, speaker model training, speaker registration and speaker recognition.
Compared with the prior art, the invention has the beneficial effects that:
the end-to-end voice enhancement method is used to remove noise in voice and irrelevant speaker voice, GFCC features with noise robustness are used in the voiceprint recognition process, the noise robustness of the whole system is improved, the problem of poor voiceprint recognition effect caused by noise contained in voice can be solved, and the recognition accuracy of the voiceprint recognition system is improved.
Example two
Based on the same inventive concept, the present embodiment provides a speaker recognition system based on speech enhancement, please refer to fig. 4, the system includes:
a voice collecting module 201, configured to collect a large amount of original voice data;
the voice enhancement module 202 is configured to remove interference noise and irrelevant speaker voice included in the original voice data to obtain enhanced voice data;
a speech feature extraction module 203, configured to extract MFCC features and cepstrum coefficient GFCC features based on a Gammatone filter from the enhanced speech data, and fuse the MFCC features and the GFCC features to obtain acoustic features of the speech;
the model training module 204 is used for constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module 205 is used for speaker registration and recognition: it collects registered voice samples, performs voice enhancement and feature extraction using the methods of the voice enhancement module and the voice feature extraction module, inputs the results into the trained model to obtain the deep feature of each registered voice sample, and stores the deep feature as the speaker feature of each speaker; it then obtains the voice data of the speaker to be recognized, performs voice enhancement and feature extraction using the methods of the voice enhancement module and the voice feature extraction module, inputs the result into the trained model to obtain the features of the speaker to be recognized, and recognizes the identity of the speaker to be recognized according to the similarity between these features and the stored speaker features.
Since the system described in the second embodiment of the present invention is the system adopted to implement the speaker recognition method based on speech enhancement of the first embodiment, a person skilled in the art can understand the specific structure and variants of the system from the method described in the first embodiment, so the details are not repeated here. All systems adopted by the method of the first embodiment of the present invention fall within the intended protection scope of the present invention.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.