US20170169828A1 - System and method for improved audio consistency - Google Patents
- Publication number: US20170169828A1
- Application number: US15/071,258
- Authority: US (United States)
- Prior art keywords: input voice, voice sample, module configured, sample, input
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
- G10L17/20: Speaker identification or verification techniques; pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G10L25/93: Discriminating between voiced and unvoiced parts of speech signals
- G10L25/09: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being zero crossing rates
- G10L25/78: Detection of presence or absence of voice signals
- G10L25/90: Pitch determination of speech signals
(All codes fall under G10L: speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Description
- The present application hereby claims priority under 35 U.S.C. §119 to Indian patent application number 6580/CHE/2015 filed Dec. 9, 2015, the entire contents of which are hereby incorporated herein by reference.
- The invention relates generally to voice biometric applications, and more particularly to a system and a method for increasing the quality of audio signals.
- Typically, in a voice authentication system, enrolment of a user's voice sample is performed once. Thereafter, every time the user accesses the system, authentication of the user is performed. Since the enrolment process is typically performed only once, the initial enrolment audio signal is of particular importance. However, in certain situations, the initial parameters extracted from the user's enrolment voice sample may not be of the desired quality. In such cases, the user's voice sample for the enrolment process is not accepted and, as a result, a re-enrolment process is initiated, which decreases the quality of the initial user experience.
- Further, since the enrolment process is performed only at the initial stage even though the user is likely to use the system for a long period thereafter, it is likely that the user's voice might change due to several factors. For example, the user's voice sample can fluctuate under several conditions, such as biological ageing and a number of environmental conditions like background noise, surrounding ambience, use of different microphones and microphone quality. These fluctuations in the user's voice sample contribute to errors in the authentication system by increasing the false acceptance and false rejection rates.
- Existing systems typically address the above-described problem by asking users to enroll the input voice sample again, which is often a difficult and tedious process for the user.
- Therefore, a system and method are needed that provide a high-quality audio signal that can be used seamlessly in voice biometric applications.
- The following summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
- According to some examples of the present disclosure, a voice biometrics system adapted to authenticate a user based on speech diagnostics is provided. The system includes a pre-processing module configured to receive an input voice sample and to pre-process the input voice sample. The pre-processing module includes a clipping module configured to clip the input voice sample based on a clipping threshold. The pre-processing module also includes a voice activity detection module configured to apply a detection model on the input voice sample to determine an audible region and a non-audible region in the input voice sample. Additionally, the pre-processing module includes a noise reduction module configured to apply a noise reduction model to remove noise components from the input voice sample. The voice biometrics system includes a feature extraction module configured to extract features from the pre-processed input voice sample. In addition, the voice biometrics system also includes an authentication module configured to authenticate the user by comparing a plurality of features extracted from the pre-processed input voice sample to a plurality of enrolment features.
- According to additional examples of the present disclosure, a method for pre-processing an input voice sample utilized for enrolment and authentication processes in voice biometric applications is provided. The method includes clipping the input voice sample based on a clipping threshold. The method also includes amplifying the magnitude of the input voice sample and detecting an audible region and a non-audible region in the input voice sample. Additionally, the method includes suppressing a plurality of noise components from the input voice sample. Lastly, the method includes performing normalization steps to remove noise components from the input voice sample caused by the input channel and/or device.
- FIG. 1 is a block diagram of an example embodiment of a user authentication system facilitating improved audio consistency over input voice samples, implemented according to aspects of the present technique;
- FIG. 2 is a block diagram of an example embodiment of a pre-processing module of the authentication system, implemented according to aspects of the present technique;
- FIG. 3 is a block diagram of an example embodiment of a voice activity detection module of the pre-processing module, implemented according to aspects of the present technique; and
- FIG. 4 is a block diagram of an embodiment of a computing device executing modules of a voice biometrics system, in accordance with an embodiment of the present invention.
- In the following detailed description, reference is made to the accompanying drawings, which form a part thereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be used, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
- Voice biometrics applications are a class of user authentication solutions that utilize a user's voice to uniquely identify that person. To uniquely identify the user, a voice print model is built from the user's voice sample and is used during the authentication process. The system described herein employs several pre-processing techniques on the user's input voice sample that enable audio consistency and robust normalization, resulting in improved enrolment and authentication rates.
- FIG. 1 is a block diagram of an example embodiment of an authentication system facilitating improved audio consistency over input voice samples, implemented according to aspects of the present technique. The system 10 comprises a user's mobile device 12, a mobile application 14, a transmission channel 16, and a service provider system 24. The service provider system 24 includes a pre-processing module 18 and an adaptive voice authentication system 20 to authenticate a user for accessing the services 22.
- The system 10 depicts the use of an authentication system to analyze a user's unique information for verifying his/her identity. As used herein, the term “user” may refer to natural people using their voice/audio to uniquely identify themselves. Examples of users include consumers accessing bank accounts, participating merchants of several organizations, customers transacting with credit or debit cards, and the like. In particular, the system 10 is implemented for authorizing a user to obtain access to one or more services (as represented by reference numeral 22) provided by a remote service provider system 24.
- The system 10 includes an input means such as a mobile application 14 installed on a user's mobile device 12 for prompting the user to speak a plurality of words. The plurality of words spoken by the user are captured and stored by the mobile application 14 as an input voice sample. The mobile application 14 installed on the mobile device 12 operates under the control of a program stored therein and, in response to the receipt of the spoken words from the user, transmits the spoken words to the service provider system 24. The input voice sample is transmitted over a transmission channel, as represented by reference numeral 16.
- The service provider system 24 includes a pre-processing module 18 configured to receive and pre-process the input voice sample. The pre-processed input voice sample is obtained by filtering out a plurality of distortion elements. In particular, the pre-processing module 18 performs several processing operations on the input voice sample and delivers a consistent voice sample and/or audio to the adaptive voice authentication system 20 by normalizing and suppressing the channel and other environmental conditions. The processing operations performed by the pre-processing module 18 are described in further detail below in FIG. 2 and FIG. 3.
- The service provider system 24 includes an adaptive voice authentication system 20 to verify the user and correspondingly provide access to the services 22. For example, the services 22 may comprise several banking services and the service provider system 24 may be a bank. Briefly, the adaptive voice authentication system 20 described herein comprises user-centric adaptation and audio normalization mechanisms to improve the enrolment and authentication rates for users of the system 10. By using automated user profile adaptation and normalization techniques, the usability of the authentication system is improved.
- The functionality of the adaptive voice authentication system 20 is described in detail in Indian patent application number 6215/CHE/2015, titled “ADAPTIVE VOICE AUTHENTICATION SYSTEM AND METHOD,” filed on Nov. 18, 2015, which is incorporated herein by reference. The manner in which the enrolment and authentication rates of users are dynamically improved using several pre-processing techniques, by normalizing and suppressing the channel and other environmental conditions, is described in further detail below.
- FIG. 2 is a block diagram of an example embodiment of a pre-processing module of the authentication system, implemented according to aspects of the present technique. The pre-processing module 18 includes a clipping module 32, a pre-emphasis module 34, an amplification module 36, a voice activity detection module 38, a noise reduction module 40 and a feature normalization module 42. Each component is described in further detail below.
- Voice authentication systems analyze and extract salient features from a user's voice for the purpose of authentication. The user's voice samples are the input voice samples (as represented by reference numeral 30) received by the pre-processing module 18. The received input voice samples 30 may be the user's enrolment voice samples or the user's authentication voice samples. In one embodiment, the enrolment technique is implemented when the user uses the system for the first time and is typically performed only once. In the course of enrolment, the user's enrolment voice samples are received by the pre-processing module 18. On the other hand, the voice samples received at the time of authentication are the user's authentication voice samples. In one embodiment, the authentication process is activated every time the user subsequently uses the system to gain access. In the course of the authentication process, the user's authentication voice samples are received by the pre-processing module 18.
- Pre-processing module 18 is configured to improve the user's enrolment voice samples and the user's authentication voice samples by filtering out a plurality of distortion elements. The terms “user's enrolment voice sample,” “user's authentication voice sample,” “user's input voice sample,” “input voice signal” and “input voice sample” all refer to the input voice sample 30 and may be used interchangeably in the description below. In one embodiment, the pre-processing module 18 is configured to employ filtering operations comprising clipping, smoothening, amplifying, detecting speech frames, suppressing noise and feature normalization of the input voice sample 30. As a result of implementing the pre-processing module 18, the enrolment and authentication rates are improved for all speakers using a variety of microphones under different loudness and noise conditions.
- In one embodiment, the pre-processing module 18 is the core module of the authentication system: it ensures consistency of the audio, helps provide a better user experience during enrolment and reduces false rejection rates during authentication. The pre-processing technique is a generic stage that ensures that the input voice samples 30 are obtained in a consistent fashion and are agnostic to channel and other environmental factors. The following paragraphs describe the numerous stages implemented during the pre-processing of the input voice sample 30.
- Clipping module 32 is configured to clip the input voice sample 30 based on a clipping threshold. In one example embodiment, when a sequence of continuous input voice samples 30 crosses a particular threshold, it implies that the input voice samples 30 are being clipped. For example, for the input voice sample 30 utilized in the enrolment and authentication processes, the clipping threshold is set to about 0.95 dB. When the clipped samples in the input voice signal 30 exceed an acceptable count, the voice sample is rejected; otherwise, the clipping is ignored.
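- The clipping check described above can be sketched as follows. This is an illustrative Python sketch, not part of the patent: it treats the 0.95 figure as a normalized-amplitude threshold, and the 1 percent rejection fraction is an assumed tolerance.

```python
import numpy as np

def clipping_check(x, threshold=0.95, max_clipped_fraction=0.01):
    """Reject a voice sample when too many samples sit at or above the
    clipping threshold; otherwise the clipping is ignored.

    x: float array normalized to [-1.0, 1.0]. The threshold follows the
    text's 0.95 figure; the rejection fraction is an illustrative assumption.
    """
    clipped_fraction = np.mean(np.abs(x) >= threshold)
    return "reject" if clipped_fraction > max_clipped_fraction else "accept"
```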
- Pre-emphasis module 34 is configured to remove low frequency components from the input voice sample 30. In one embodiment, the pre-emphasis module 34 is a smoothened high pass filter. Low frequency signals sampled at a high sampling rate tend to yield adjacent samples of similar numerical value, because low frequency essentially means slow variation in time; the numerical values of a low frequency signal change slowly or smoothly from sample to sample. By implementing pre-emphasis, the portion of the signal that does not change in relation to its adjacent samples is removed. As a result, only the portion of the input voice sample 30 that varies rapidly is retained; the rapidly changing signals are the high frequency components of the input voice sample 30. The equation used for the smoothening mechanism is represented as y_t = α x_t + (1 − α) x_{t−1}, where x_t is the time domain sample at time t and alpha (α) is the pre-emphasis coefficient, which determines the weight given to the current voice sample. In one embodiment, the value of alpha is about 0.97 for voice authentication applications.
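- A minimal sketch of this stage, implementing the recurrence exactly as stated in the text (note that classical pre-emphasis is often written y_t = x_t − α x_{t−1}; the formula below is the document's own):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply y[t] = alpha * x[t] + (1 - alpha) * x[t-1], the smoothening
    recurrence given in the text, with alpha ~ 0.97 weighting the current sample."""
    y = np.empty_like(x)
    y[0] = alpha * x[0]  # no previous sample exists; this boundary choice is an assumption
    y[1:] = alpha * x[1:] + (1.0 - alpha) * x[:-1]
    return y
```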
- Amplification module 36 is configured to amplify the magnitude of the input voice sample 30. In one embodiment, amplifying the magnitude of the input voice sample 30 involves boosting the signal amplitude to a desired level. The scaling factor is obtained as the ratio of the desired level to the maximum amplitude of the input voice sample 30, and the signal is scaled by the determined scaling factor to amplify it.
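- A minimal sketch of the peak-based scaling described above; the target level of 0.9 is an illustrative assumption:

```python
import numpy as np

def amplify(x, desired_level=0.9):
    """Scale the signal so its maximum amplitude reaches the desired level.

    The scaling factor is the ratio of the desired level to the current
    maximum amplitude of the input voice sample."""
    peak = np.max(np.abs(x))
    if peak == 0.0:
        return x  # silent input; nothing to scale
    return x * (desired_level / peak)
```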
activity detection module 38 is configured to apply a detection model on theinput voice sample 30 to determine an audible region and a non-audible region in theinput voice sample 30. In one embodiment, voice activity detection, is a technique used in speech processing to detect the presence or absence of human speech in a voice sample. For conciseness, the voice activity detection is used mainly in speech compression and speech recognition. The voiceactivity detection module 38 is configured to identify audible and non-audible regions in theinput voice sample 30 based on features from short term energy, zero crossing rate, pitch to build a statistical model which can detect audible and non-audible regions from theinput voice sample 30. The components of voiceactivity detection module 38 are described further detail below inFIG. 3 -
- Noise reduction module 40 is configured to apply a noise reduction model to remove noise components from the input voice sample 30. In one embodiment, the noise reduction model implements techniques like Spectral Subtraction (SS) based on Minimum Mean Square Error (MMSE) estimation. These estimation-based methods are used to de-noise the input voice sample 30. In the MMSE method, the modulation magnitude spectrum of clean speech is estimated from noisy observations; the estimator minimizes the mean-square error between the modulation magnitude spectra of clean and estimated speech. Noise may be defined as any unwanted signal that interferes with the communication, measurement or processing of an information-bearing signal such as an enrolment voice sample or an authentication voice sample. Noise can cause transmission errors and may even disrupt a communication process; hence, noise processing is an important part of signal pre-processing.
- In one example embodiment, the input signal y(m) may be represented as a sum of the speech signal x(m) and the noise n(m). The equation is represented as y(m)=x(m)+n(m). In the frequency domain, this may be denoted as: Y(jω)=X(jω)+N(jω)=>X(jω)=Y(jω)−N(jω), where Y(jω), X(jω), N(jω) are Fourier transforms of y(m), x(m), n(m), respectively.
-
- Feature normalization module 42 is configured to apply a mean and variance normalization model to remove noise components from the input voice sample 30 caused by the input channel and/or device. In one embodiment, Cepstral Mean Normalization (CMN) and Cepstral Variance Normalization (CVN) are simple ways of performing feature normalization. In one example embodiment, for a given segment of acoustic feature vectors O(T) = {o_1, o_2, …, o_T}, the mean and variance of the vectors are computed over a specified time segment. Each vector is then recomputed by subtracting the mean from it and dividing by the variance. This approach normalizes the vectors and reduces the distortion caused by the channel. Longer segments yield better mean and variance estimates, but introduce a longer delay, since the system needs to wait until the end of the segment before normalization can be done. To balance delay and accuracy, a window of about 400 milliseconds is chosen to implement the CMN and CVN. Moreover, only the diagonal covariance is considered while implementing CVN, since the features are assumed to be uncorrelated. After performing the pre-processing steps of clipping, smoothening, amplifying, detecting speech frames, suppressing noise and feature normalization on the input voice sample 30, the output is an improved voice sample, represented as pre-processed voice sample 44.
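- A minimal sketch of CMN/CVN over a single feature segment; a full system would apply this over the roughly 400 ms windows described above, which is omitted here for brevity:

```python
import numpy as np

def cmn_cvn(features, eps=1e-8):
    """Normalize each cepstral coefficient track to zero mean and unit
    variance over the segment (the per-dimension, diagonal-covariance case).

    features: (num_frames, num_coeffs) array of acoustic feature vectors."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)  # eps guards against zero variance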
- Post-processing module (not shown) is configured to apply a Gaussian mixture model to detect the input channel and/or device through which the voice sample features are entered. In one embodiment, it is observed that variability in the handset or a user's device causes significant performance degradation in speaker recognition systems. Channel compensation in the front-end processing addresses linear channel effects, but there is evidence that handset transducer effects are nonlinear in nature and are thus difficult to remove from the features prior to training and recognition. Since the handset effects remain in the features, the speaker's model will represent the speaker's acoustic characteristics coupled with the distortions caused by the handset from which the training speech is collected. The effect is that log-likelihood ratio scores produced from different speaker models can have handset-dependent biases and scales. To offset this, score normalization is done in addition to pre-processing, as a post-processing step after pre-processing the input voice sample 30.
- In one example embodiment, to identify the handset type (mobile, landline, desktop), a set of training labels is created and a Gaussian Mixture Model (GMM) based classifier is built. A set of 50 speakers is asked to speak the same utterance through three sets of microphones, and a 256-mixture GMM is built for each set of microphones. After the voice biometric engine outputs a score, the input voice sample 30 is passed through the score normalizer module. This module detects the handset type using the GMM classifier and normalizes the score accordingly; each handset type is normalized differently to generate the final score.
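- A hedged sketch of the handset-dependent score normalization, using scikit-learn's GaussianMixture in place of whatever GMM implementation the patent's system uses. The per-handset bias and scale parameters (norm_params) are assumed to be estimated offline, and the function names are illustrative:

```python
from sklearn.mixture import GaussianMixture

def train_handset_models(features_by_type, n_components=256):
    """Fit one diagonal-covariance GMM per handset type on labelled features."""
    return {name: GaussianMixture(n_components=n_components,
                                  covariance_type="diag").fit(feats)
            for name, feats in features_by_type.items()}

def normalize_score(raw_score, features, models, norm_params):
    """Pick the handset whose GMM best explains the features, then apply
    that handset's offline-estimated bias and scale to the engine score."""
    handset = max(models, key=lambda name: models[name].score(features))
    mu, sigma = norm_params[handset]
    return (raw_score - mu) / sigma
```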
- Further to the pre-processing of the input voice sample 30, the pre-processed voice sample 44 is received by the feature extraction module (not shown). The feature extraction module is configured to extract features from the pre-processed voice sample 44. Thereafter, an authentication module is configured to authenticate the user by comparing a plurality of features extracted from the pre-processed voice sample 44 to a plurality of enrolment features. The enrolment features are the features enrolled and stored at the time of the enrollment process.
- The manner in which the voice activity detection module 38 of the pre-processing module 18 implements speech processing and/or speech detection in the input voice sample 30 is described in further detail below.
- FIG. 3 is a block diagram of an example embodiment of a voice activity detection module of the pre-processing module, implemented according to aspects of the present technique. The voice activity detection module 38 includes a zero crossing module 50, a short time energy module 52, a pitch detection module 54, and a voice activity detection sub-system 56. Each component is described in further detail below.
- Zero crossing module 50 is configured to detect the polarity of the input voice sample 30 across time. In one embodiment, zero crossing rates are used for voice activity detection (VAD), i.e., finding whether a segment of speech is voiced or unvoiced. The zero-crossing rate is the rate of sign changes along the input voice sample 30, i.e., the rate at which the signal changes from positive to negative or back. The zero crossing rate indicates the presence or absence of speech in the input signal: when the zero crossing rate is high, the frame is considered to be an unvoiced frame, and when the zero crossing rate is low, the frame is considered to be a voiced frame. Thus, the voiced frames constitute the audible region of the input voice sample 30 and the unvoiced frames constitute the non-audible region of the input voice sample 30.
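- A minimal per-frame zero-crossing rate sketch; a frame would then be labelled unvoiced when this rate exceeds a tuned threshold and voiced otherwise:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ, i.e. the rate
    at which the signal crosses zero within the frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive (a common convention)
    return np.mean(signs[:-1] != signs[1:])
```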
- Short time energy module 52 is configured to classify the audible region and the non-audible region of the input voice sample 30. In one embodiment, short-time energy is another parameter used in classifying the audible region and the non-audible region of the input voice sample 30. When the energy of an incoming frame of the input voice sample 30 is high, the frame is classified as a voiced frame, i.e. the audible region, and when the energy of an incoming frame of the input voice sample 30 is low, it is classified as an unvoiced frame, i.e. the non-audible region of the input voice sample 30. In one example embodiment, within the frame-by-frame block, the speech signal is divided into non-overlapping frames of about 160 samples at about 8 kHz sampling frequency, which is equivalent to about 20 ms. From these 160 samples, the sum of the squares of all the samples is computed, this sum is averaged, and the square root of the average is the root mean square (RMS) energy for that frame.
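- A minimal sketch of the per-frame RMS computation described above (160-sample, non-overlapping frames at 8 kHz):

```python
import numpy as np

def frame_rms(x, frame_len=160):
    """Per-frame root mean square energy: square the samples, average them
    over the frame, and take the square root of the average."""
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))
```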
- Pitch detection module 54 is configured to estimate a pitch level of the input voice sample 30. In one embodiment, a pitch detection algorithm (PDA) is an algorithm designed to estimate the pitch or fundamental frequency of a virtually periodic signal, usually a digital recording of speech, a musical note or tone, or the input voice sample 30. This can be done in the time domain, the frequency domain, or both.
- In one example embodiment, in the time domain, a pitch detection algorithm typically estimates the period of a quasiperiodic signal and then inverts that value to give the frequency. One simple approach is to measure the distance between zero crossing points of the signal (i.e. the zero-crossing rate). In another example embodiment, in the frequency domain, polyphonic detection is possible, usually utilizing the periodogram to convert the signal to an estimate of the frequency spectrum. This requires more processing power as the desired accuracy increases, although the well-known efficiency of the FFT makes it suitably efficient for many purposes.
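- A minimal time-domain sketch using the autocorrelation peak rather than zero crossings; the 50 to 400 Hz search range is an assumed plausible span for speech pitch, not a value from the patent:

```python
import numpy as np

def autocorrelation_pitch(frame, fs=8000, fmin=50.0, fmax=400.0):
    """Estimate pitch by locating the autocorrelation peak within the
    plausible pitch-period range, then inverting the period to a frequency."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(ac) - 1)
    lag = lo + np.argmax(ac[lo:hi])  # best period within the search range
    return fs / lag
```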
- Voice activity detection sub-system 56 is configured to detect a plurality of speech frames, comprising speech and non-speech frames, of the input voice sample 30. The features mentioned above are then used as inputs to build Gaussian Mixture Model (GMM) based classifiers. In one example embodiment, two GMMs are trained. The training data is obtained by manually tagging the silence and speech frames from several speech files; it is then used to build two GMMs, one for speech frames and one for non-speech frames (i.e. silence and noise). Since there is more speech data, a 256-mixture model is built for the speech GMM and a 64-mixture model is built for the non-speech GMM. At runtime, each input frame is scored against the two GMMs, each of which outputs a log-likelihood score. Then, based on some smoothing heuristics, the frame is classified as either speech or silence.
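- A hedged sketch of the two-GMM frame classifier, again using scikit-learn's GaussianMixture as a stand-in; the moving-average smoothing stands in for the unspecified heuristics, and per-frame feature extraction is assumed to happen elsewhere:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

speech_gmm = GaussianMixture(n_components=256, covariance_type="diag")
nonspeech_gmm = GaussianMixture(n_components=64, covariance_type="diag")
# Fitted offline on manually tagged frames, e.g.:
# speech_gmm.fit(speech_frames); nonspeech_gmm.fit(nonspeech_frames)

def classify_frames(frame_features, smooth=5):
    """Score each frame's features against both GMMs and smooth the
    per-frame speech/non-speech decision with a moving average."""
    ll_speech = speech_gmm.score_samples(frame_features)      # per-frame log-likelihood
    ll_nonspeech = nonspeech_gmm.score_samples(frame_features)
    is_speech = (ll_speech > ll_nonspeech).astype(float)
    kernel = np.ones(smooth) / smooth
    return np.convolve(is_speech, kernel, mode="same") > 0.5
```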
- The benefits of a
preprocessing module 18 was analyzed on the experimental results. The process described inFIG. 2 of the present invention will be described below in further detail with examples thereof, but it should be noted that the present invention is by no means intended to be limited to these examples. - In one example embodiment, a set of 100 users were asked to enroll the input voice sample in a variety of environments like noisy conditions, using low quality microphones, speaking loudly and softly. The test audio samples were collected from users using an android application and a web application. In one embodiment, this android application was designed for collecting the voice samples and details of various users and devices. For example, in one embodiment, the user needs to record the phrase “My voice is my password” and android application uploads the voice samples to the storage module. After uploading the voice sample, the user is asked to provide the next voice sample. After providing three voice samples, the user will get a user id and a confirmation that the voice samples have been successfully uploaded to the system. In another embodiment, the web application is designed for the collecting voice samples from different users and from various laptops. The user needs to provide his/her details like his name, email, device details, gender, age in the form almost similar to the android application
- The enrolment rates of the users were observed with and without implementation of the preprocessing module 18. It was observed that, by using noise suppression, clipping checks and amplitude correction, the enrolment performance improved by about 18 percent absolute. - In one embodiment, the implementation of the
pre-processing module 18 also improves the authentication rates. This audio consistency also helps during the verification stage when there is a mismatch between the enrolled and verification conditions, either due to noise or microphone changes. For this experiment, all the users who enrolled using the Android application were asked to verify using a web portal or a different phone. Similarly, those users who enrolled using the web portal were asked to verify using a cell phone. This way, for all 100 users, there was a mismatch in the enrollment and verification conditions. - It was observed that the implementation of
preprocessing module 18 gives an absolute increase of about 5 percent in authentication success rate when the conditions are mismatched. By performing channel normalization and other techniques in preprocessing, consistent audio is then provided to the adaptive authentication module, which improves the authentication rate. It may be noted that pre-processing module 18 is independent of the authentication module. Hence, the pre-processing module 18 can be used with other systems too if needed.
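The patent does not spell out which channel normalization technique is applied; cepstral mean and variance normalization is one standard way to realise it, sketched below purely as an assumed example.

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalization over a
    (n_frames x n_coefficients) feature matrix. A stationary channel
    adds a near-constant offset in the cepstral domain, so removing
    the per-coefficient mean removes much of the channel effect; the
    variance scaling additionally evens out level differences."""
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0)
    return (cepstra - mean) / np.where(std > 0.0, std, 1.0)
```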
- Thus the implementation of the preprocessing module 18 in the authentication system ensures that a consistent speech signal is provided to the core engine, which helps increase enrolment rates and also improves the verification success rate. Preprocessing for noise and channel conditions also ensures that the user does not have to enroll every time there is a change in the surrounding environment (clean to noisy conditions) or a change in microphone (which could be due to a change in the user's cell phone). Being agnostic to the core engine enables this to be a plug-and-play solution for other voice biometric engines too. -
FIG. 4 is a block diagram of an embodiment of a computing device executing modules of an authentication system, in accordance with an embodiment of the present invention. The modules of the authentication system described herein are implemented in computing devices. One example of a computing device 60 is described below in FIG. 4. The computing device comprises one or more processors 62, one or more computer-readable RAMs 64 and one or more computer-readable ROMs 66 on one or more buses 68. Further, computing device 60 includes a tangible storage device 70 that may be used to execute operating systems 80, a preprocessing module 18 and adaptive voice authentication system 20. - The various modules of the
authentication system 10 including a pre-processing module 18 and the adaptive voice authentication system 20 can be stored in tangible storage device 70. Both the operating system and the authentication system 10 are executed by processor 62 via one or more respective RAMs 64 (which typically include cache memory). - Examples of
storage devices 70 include semiconductor storage devices such as ROM 66, EPROM, flash memory or any other computer-readable tangible storage device that can store a computer program and digital information. - Computing device also includes a R/W drive or
interface 74 to read from and write to one or more portable computer-readable tangible storage devices 88 such as a CD-ROM, DVD, memory stick or semiconductor storage device. Further, network adapters or interfaces 72 such as TCP/IP adapter cards, wireless Wi-Fi interface cards, or 3G or 4G wireless interface cards or other wired or wireless communication links are also included in computing device. - In one embodiment, the
authentication system 10 can be downloaded from an external computer via a network (for example, the Internet, a local area network or other wide area network) and network adapter or interface 72. Computing device further includes device drivers 76 to interface with input and output devices. The input and output devices can include a computer display monitor 78, a keyboard 84, a keypad, a touch screen, a computer mouse 86, and/or some other suitable input device. - The present disclosure is not to be limited in terms of the particular embodiments described in this application, which are intended as illustrations of various aspects. Many modifications and variations can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatuses within the scope of the disclosure, in addition to those enumerated herein, will be apparent to those skilled in the art from the foregoing descriptions. Such modifications and variations are intended to fall within the scope of the appended claims.
- The present disclosure is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled. It is to be understood that this disclosure is not limited to particular methods, reagents, compounds, compositions or biological systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
- With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
- It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present.
- For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations.
- In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, means at least two recitations, or two or more recitations). Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
- It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
- As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible subranges and combinations of subranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc.
- As will also be understood by one skilled in the art, all language such as "up to," "at least," "greater than," "less than," and the like include the number recited and refer to ranges which can be subsequently broken down into subranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 cells refers to groups having 1, 2, or 3 cells. Similarly, a group having 1-5 cells refers to groups having 1, 2, 3, 4, or 5 cells, and so forth.
- While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Claims (14)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN6580/CHE/2015 | 2015-12-09 | ||
IN6580CH2015 | 2015-12-09 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170169828A1 true US20170169828A1 (en) | 2017-06-15 |
US9691392B1 US9691392B1 (en) | 2017-06-27 |
Family
ID=59020061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/071,258 Active - Reinstated US9691392B1 (en) | 2015-12-09 | 2016-03-16 | System and method for improved audio consistency |
Country Status (1)
Country | Link |
---|---|
US (1) | US9691392B1 (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7386448B1 (en) * | 2004-06-24 | 2008-06-10 | T-Netix, Inc. | Biometric voice authentication |
-
2016
- 2016-03-16 US US15/071,258 patent/US9691392B1/en active Active - Reinstated
Cited By (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11587579B2 (en) * | 2016-08-08 | 2023-02-21 | Plantronics, Inc. | Vowel sensing voice activity detector |
US20210366508A1 (en) * | 2016-08-08 | 2021-11-25 | Plantronics, Inc. | Vowel sensing voice activity detector |
US11120821B2 (en) * | 2016-08-08 | 2021-09-14 | Plantronics, Inc. | Vowel sensing voice activity detector |
US20180040338A1 (en) * | 2016-08-08 | 2018-02-08 | Plantronics, Inc. | Vowel Sensing Voice Activity Detector |
US10535364B1 (en) * | 2016-09-08 | 2020-01-14 | Amazon Technologies, Inc. | Voice activity detection using air conduction and bone conduction microphones |
US9978392B2 (en) * | 2016-09-09 | 2018-05-22 | Tata Consultancy Services Limited | Noisy signal identification from non-stationary audio signals |
US20180211671A1 (en) * | 2017-01-23 | 2018-07-26 | Qualcomm Incorporated | Keyword voice authentication |
US10720165B2 (en) * | 2017-01-23 | 2020-07-21 | Qualcomm Incorporated | Keyword voice authentication |
US11042616B2 (en) | 2017-06-27 | 2021-06-22 | Cirrus Logic, Inc. | Detection of replay attack |
US12026241B2 (en) | 2017-06-27 | 2024-07-02 | Cirrus Logic Inc. | Detection of replay attack |
US10853464B2 (en) | 2017-06-28 | 2020-12-01 | Cirrus Logic, Inc. | Detection of replay attack |
US10770076B2 (en) | 2017-06-28 | 2020-09-08 | Cirrus Logic, Inc. | Magnetic detection of replay attack |
US11164588B2 (en) | 2017-06-28 | 2021-11-02 | Cirrus Logic, Inc. | Magnetic detection of replay attack |
US11704397B2 (en) | 2017-06-28 | 2023-07-18 | Cirrus Logic, Inc. | Detection of replay attack |
US11714888B2 (en) | 2017-07-07 | 2023-08-01 | Cirrus Logic Inc. | Methods, apparatus and systems for biometric processes |
US11042618B2 (en) | 2017-07-07 | 2021-06-22 | Cirrus Logic, Inc. | Methods, apparatus and systems for biometric processes |
US12248551B2 (en) | 2017-07-07 | 2025-03-11 | Cirrus Logic Inc. | Methods, apparatus and systems for audio playback |
US12135774B2 (en) | 2017-07-07 | 2024-11-05 | Cirrus Logic Inc. | Methods, apparatus and systems for biometric processes |
US10984083B2 (en) | 2017-07-07 | 2021-04-20 | Cirrus Logic, Inc. | Authentication of user using ear biometric data |
US11829461B2 (en) | 2017-07-07 | 2023-11-28 | Cirrus Logic Inc. | Methods, apparatus and systems for audio playback |
US11755701B2 (en) | 2017-07-07 | 2023-09-12 | Cirrus Logic Inc. | Methods, apparatus and systems for authentication |
US11042617B2 (en) | 2017-07-07 | 2021-06-22 | Cirrus Logic, Inc. | Methods, apparatus and systems for biometric processes |
US11705135B2 (en) | 2017-10-13 | 2023-07-18 | Cirrus Logic, Inc. | Detection of liveness |
US10847165B2 (en) | 2017-10-13 | 2020-11-24 | Cirrus Logic, Inc. | Detection of liveness |
US10832702B2 (en) | 2017-10-13 | 2020-11-10 | Cirrus Logic, Inc. | Robustness of speech processing system against ultrasound and dolphin attacks |
US11023755B2 (en) | 2017-10-13 | 2021-06-01 | Cirrus Logic, Inc. | Detection of liveness |
US11017252B2 (en) | 2017-10-13 | 2021-05-25 | Cirrus Logic, Inc. | Detection of liveness |
US10839808B2 (en) | 2017-10-13 | 2020-11-17 | Cirrus Logic, Inc. | Detection of replay attack |
GB2567503A (en) * | 2017-10-13 | 2019-04-17 | Cirrus Logic Int Semiconductor Ltd | Analysing speech signals |
US11270707B2 (en) | 2017-10-13 | 2022-03-08 | Cirrus Logic, Inc. | Analysing speech signals |
US11074917B2 (en) * | 2017-10-30 | 2021-07-27 | Cirrus Logic, Inc. | Speaker identification |
US10616701B2 (en) | 2017-11-14 | 2020-04-07 | Cirrus Logic, Inc. | Detection of loudspeaker playback |
US11276409B2 (en) | 2017-11-14 | 2022-03-15 | Cirrus Logic, Inc. | Detection of replay attack |
US11051117B2 (en) | 2017-11-14 | 2021-06-29 | Cirrus Logic, Inc. | Detection of loudspeaker playback |
US10839810B2 (en) | 2017-11-21 | 2020-11-17 | Cirrus Logic, Inc. | Speaker enrollment |
US11694695B2 (en) | 2018-01-23 | 2023-07-04 | Cirrus Logic, Inc. | Speaker identification |
US11475899B2 (en) | 2018-01-23 | 2022-10-18 | Cirrus Logic, Inc. | Speaker identification |
US11264037B2 (en) | 2018-01-23 | 2022-03-01 | Cirrus Logic, Inc. | Speaker identification |
US11735189B2 (en) | 2018-01-23 | 2023-08-22 | Cirrus Logic, Inc. | Speaker identification |
CN108630208A (en) * | 2018-05-14 | 2018-10-09 | 平安科技(深圳)有限公司 | Server, auth method and storage medium based on vocal print |
US10529356B2 (en) | 2018-05-15 | 2020-01-07 | Cirrus Logic, Inc. | Detecting unwanted audio signal components by comparing signals processed with differing linearity |
US10692490B2 (en) | 2018-07-31 | 2020-06-23 | Cirrus Logic, Inc. | Detection of replay attack |
US11631402B2 (en) | 2018-07-31 | 2023-04-18 | Cirrus Logic, Inc. | Detection of replay attack |
US11748462B2 (en) | 2018-08-31 | 2023-09-05 | Cirrus Logic Inc. | Biometric authentication |
US10915614B2 (en) | 2018-08-31 | 2021-02-09 | Cirrus Logic, Inc. | Biometric authentication |
US11037574B2 (en) | 2018-09-05 | 2021-06-15 | Cirrus Logic, Inc. | Speaker recognition and speaker change detection |
US11289098B2 (en) * | 2019-03-08 | 2022-03-29 | Samsung Electronics Co., Ltd. | Method and apparatus with speaker recognition registration |
US11315573B2 (en) * | 2019-04-12 | 2022-04-26 | Panasonic Intellectual Property Corporation Of America | Speaker recognizing method, speaker recognizing apparatus, recording medium recording speaker recognizing program, database making method, database making apparatus, and recording medium recording database making program |
CN110134722A (en) * | 2019-05-22 | 2019-08-16 | 北京小度信息科技有限公司 | Target user determines method, apparatus, equipment and storage medium |
US11610591B2 (en) * | 2021-05-19 | 2023-03-21 | Capital One Services, Llc | Machine learning for improving quality of voice biometrics |
US20230047187A1 (en) * | 2021-08-10 | 2023-02-16 | Avaya Management L.P. | Extraneous voice removal from audio in a communication session |
US12170098B2 (en) * | 2021-09-03 | 2024-12-17 | Alibaba Damo (Hangzhou) Technology Co., Ltd. | Sound detection method |
US11967332B2 (en) * | 2021-09-17 | 2024-04-23 | International Business Machines Corporation | Method and system for automatic detection and correction of sound caused by facial coverings |
US20230086832A1 (en) * | 2021-09-17 | 2023-03-23 | International Business Machines Corporation | Method and system for automatic detection and correction of sound distortion |
CN116229986A (en) * | 2023-05-05 | 2023-06-06 | 北京远鉴信息技术有限公司 | Voice noise reduction method and device for voiceprint identification task |
Also Published As
Publication number | Publication date |
---|---|
US9691392B1 (en) | 2017-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9691392B1 (en) | System and method for improved audio consistency | |
US9940934B2 (en) | Adaptive voice authentication system and method | |
Mak et al. | A study of voice activity detection techniques for NIST speaker recognition evaluations | |
CN104835498B (en) | Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter | |
JP4802135B2 (en) | Speaker authentication registration and confirmation method and apparatus | |
CN108198547A (en) | Voice endpoint detection method, device, computer equipment and storage medium | |
Vestman et al. | Speaker recognition from whispered speech: A tutorial survey and an application of time-varying linear prediction | |
WO2014153800A1 (en) | Voice recognition system | |
Chauhan et al. | Speech to text converter using Gaussian Mixture Model (GMM) | |
Selva Nidhyananthan et al. | Noise robust speaker identification using RASTA–MFCC feature with quadrilateral filter bank structure | |
Ramgire et al. | A survey on speaker recognition with various feature extraction and classification techniques | |
CN112216285B (en) | Multi-user session detection method, system, mobile terminal and storage medium | |
Safavi et al. | Comparison of speaker verification performance for adult and child speech | |
Singh et al. | Performance evaluation of normalization techniques in adverse conditions | |
Rehr et al. | Cepstral noise subtraction for robust automatic speech recognition | |
Alam et al. | Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition | |
Wang et al. | Robust Text-independent Speaker Identification in a Time-varying Noisy Environment. | |
Tzudir et al. | Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients | |
Kumar et al. | Effective preprocessing of speech and acoustic features extraction for spoken language identification | |
Chougule et al. | Speaker recognition in mismatch conditions: a feature level approach | |
Nidhyananthan et al. | Text independent voice based students attendance system under noisy environment using RASTA-MFCC feature | |
Alam et al. | Smoothed nonlinear energy operator-based amplitude modulation features for robust speech recognition | |
Ouzounov | Telephone speech endpoint detection using mean-delta feature | |
Al-Ali et al. | Speaker verification with multi-run ICA based speech enhancement | |
Neelima et al. | Spoofing detection and countermeasure in automatic speaker verification system using dynamic features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: UNIPHORE SOFTWARE SYSTEMS, INDIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SACHDEV, UMESH;REEL/FRAME:039292/0467 Effective date: 20160315 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20210627 |
|
PRDP | Patent reinstated due to the acceptance of a late maintenance fee |
Effective date: 20211102 |
|
FEPP | Fee payment procedure |
Free format text: PETITION RELATED TO MAINTENANCE FEES FILED (ORIGINAL EVENT CODE: PMFP); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Free format text: PETITION RELATED TO MAINTENANCE FEES GRANTED (ORIGINAL EVENT CODE: PMFG); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Free format text: SURCHARGE, PETITION TO ACCEPT PYMT AFTER EXP, UNINTENTIONAL. (ORIGINAL EVENT CODE: M2558); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 4 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
AS | Assignment |
Owner name: UNIPHORE TECHNOLOGIES INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNIPHORE SOFTWARE SYSTEMS;REEL/FRAME:061841/0541 Effective date: 20220311 |
|
AS | Assignment |
Owner name: HSBC VENTURES USA INC., NEW YORK Free format text: SECURITY INTEREST;ASSIGNORS:UNIPHORE TECHNOLOGIES INC.;UNIPHORE TECHNOLOGIES NORTH AMERICA INC.;UNIPHORE SOFTWARE SYSTEMS INC.;AND OTHERS;REEL/FRAME:062440/0619 Effective date: 20230109 |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |