
WO2007041789A1 - Front-end processing of speech signals - Google Patents

Front-end processing of speech signals

Info

Publication number
WO2007041789A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
frames
speech
speech signal
noise
Prior art date
Application number
PCT/AU2006/001498
Other languages
French (fr)
Inventor
Eric Choi
Julien Epps
Original Assignee
National Ict Australia Limited
Priority date
Filing date
Publication date
Application filed by National Ict Australia Limited
Priority to AU2006301933A1
Publication of WO2007041789A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention concerns a method of processing speech signals, such as processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system. The method comprises dividing the speech signal into a sequence of frames (32) by calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing. The frames are then frequency filtered and a noise estimate is derived for each frequency band of the speech signal. The frequency filtered outputs are weighted (38) with a function derived from the filtered outputs and the noise estimates to emphasise outputs that are less affected by noise. The invention also concerns software and a computer system to perform the method of processing speech signals.

Description

Title
FRONT-END PROCESSING OF SPEECH SIGNALS
Technical Field
The invention concerns a method of processing speech signals. For example, but not limited to, processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system. The invention also concerns software and a computer system to perform the method of processing speech signals.
Background Art
Speech recognition systems comprise a few basic components, as shown in Figure 1. The 'front-end' 10 is the first stage of the system and uses signal processing techniques to derive a compact representation of a frame (or single short segment) of speech. When the input speech is corrupted with some kind of environmental noise, three broad classes of techniques can be used to improve the overall recognition performance:
• Signal-space techniques: These operate on the speech signal before it is input to the front-end. Typically these techniques are not very effective and can be very computationally expensive.
• Feature-space techniques: Most current literature on robust speech recognition focuses on methods to derive features that both compactly describe the speech signal and are relatively invariant when noise is present in the input speech.
• Model-space techniques: These are based in the back-end 12, and can produce some useful improvement, however they are still limited by the quality of the features being received from the front-end.
A common problem with the practical deployment of speech recognition solutions is that their performance rapidly degrades to unacceptably low recognition accuracies in the presence of virtually any kind of background noise. Existing solutions partially address the problem, however in general they are either too slow and/or do not offer the best available performance.
Summary of the Invention
In a first aspect the invention provides a method for front-end processing of speech signals comprising the following steps: dividing the speech signal into frames; filtering the frames of the speech signal into frequency bands to produce filtered outputs for each frame; and deriving a noise estimate for each frequency band of the speech signal and weighting the filtered outputs of each frame with a function derived from the filtered outputs and noise estimates to emphasise outputs that are less affected by noise.
This invention provides good performance in recognising speech spoken in noisy environments at a reduced processing load, making deployment in many practical situations (e.g. handheld devices) feasible.
The frequency filtering is based on the Mel-scale frequency.
The step of dividing the speech signal into frames may comprise calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing, such as for feature extraction of the speech signal.
The method may further comprise the steps of: applying a discrete cosine transformation to frames having weighted frequency filtered outputs; and mapping the discrete cosine transformed frames to a predetermined probability density function and disregarding the mapped frames from a tail region of the distribution in further processing of the speech signal.
Selecting the frame may be based on the position of the previous frame relative in time. The frames may be in a sequence of time order. The predetermined time window may have a predetermined minimum time and maximum time that the candidate frames start at. The given time window may be based on the position of the previous frame. The time differences between candidate frames may be predetermined. Two or more of the selected frames of the speech signal may overlap in time. The predetermined criteria may be that the candidate frame has the optimum energy function value; this may be a minimum or a maximum value. Further, the predetermined criteria may be that the candidate frame has the largest absolute difference in energy function value from the previous frame. The energy function may be the log energy of the frame. The energy of each candidate frame is dependent upon the energy value of a previous candidate frame, which makes the calculation of the energy values of all the candidate frames computationally inexpensive.
The noise estimate of the speech signal is determined from part of the speech signal that does not include any speech. The filtered outputs of a frame comprise a magnitude value for every frequency band in the frame and the filtered noise estimates comprise a magnitude value for every frequency band in the noise estimate. The noise estimate may be derived for each frame.
The step of weighting the filtered outputs may comprise subtracting from the magnitude value of a frequency band of a filtered frame, the value of the magnitude of the noise estimate in that frequency band.
The function used for weighting may include a first weighting factor that is dependent on the filtered outputs and noise estimates of multiple frequency bands that may be based on a ratio of the Signal-to-Noise Ratio of a particular filterbank output to the sum of the Signal-to-Noise Ratios of all the filterbank outputs. The step of deriving the noise estimate for each frequency band may be derived dynamically for each frame.
The function used for weighting includes a second weighting factor that is dependent on the filtered output and noise estimate at a particular filterbank, but independent of noise estimates at other multiple frequency bands.
The step of weighting frequency filtered outputs may comprise scaling the magnitude value of a frequency filtered frame and adding an offset to the scaled value.
The step of weighting frequency filtered frames after logarithmic compression may comprise weighting by a function that is dependent on the signal-to-noise ratios of frequency filtered outputs at multiple frequency bands. The weighting function is calculated dynamically for different frames of speech.
The step of removing from the mapped distribution frames from the tail region may be from the left tail region. The method may be performed at the front-end of a speech recognition system. This has the distinct advantage of being cepstral-based, meaning that it fits well into the paradigm of distributed speech recognition (DSR), where international standards are available for leveraging the application of speech recognition on mobile and handheld devices.
All methods described herein can apply equally to any other speech and multimodal processing applications, in particular speaker verification (for biometric applications), speech coding and compression, speech enhancement and speech recognition.
In a further aspect the invention provides a method of dividing a speech signal into frames for further front-end processing of the speech signals, the method comprising the following steps: calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
In yet a further aspect, the invention provides software to perform the method described above.
In yet a further aspect the invention provides a computer system programmed to perform the method described above. The computer system may be a distributed system.
In a further aspect, the invention provides software to perform the method described in any one of the preceding claims.
Brief Description of the Drawings
An example of the invention will now be explained with reference to the following drawings, in which:
Figure 1 is a block diagram of a speech recognition system (prior art);
Figure 2 is a flowchart of the method of processing speech signals;
Figure 3 schematically shows candidate frames within a given time window;
Figure 4 shows the calculated Mel-filterbank values for a frame;
Figure 5 is a schematic diagram of a computer system used for pre-processing of speech signals in accordance with the present invention; and
Figure 6 shows graphically some experimental results of the invention.
Best Modes of the Invention
In the following example, the processing of speech signals by the 'front-end' of a speech recognition system will now be described in reference to Fig. 2.
A speech signal y(t) 30 is provided as input to a 'front-end' of the speech recognition system. Pre-processing and dividing the signal into frames (i.e. framing) is then applied to the speech signal y(t) according to an energy search variable frame rate (VFR) analysis 32.
Energy Search and VFR analysis will now be explained with reference to Fig. 3. Energy Search and VFR analysis seeks to estimate the optimum relative position of the next frame of speech k in time by maximizing the difference in log energy between the current frame and possible next frame.
So given the position of the current frame 50, the optimum position of the next frame (relative in time) is determined based on an energy search. It is predetermined that the next frame will start somewhere in a time window defined by Kmin 52 and Kmax 54. The time increment between possible candidate frames is also predetermined. An energy search is then conducted on the time interval between Kmin 52 and Kmax 54 by calculating the energy E_{l+1}(k) of each candidate frame that starts between Kmin 52 and Kmax 54 in order to determine the next frame of speech k̂.
The energy search is performed according to:

k̂ = arg max_{Kmin ≤ k ≤ Kmax} | log E_{l+1}(k) − log E_l |        (1)

where k is the candidate frame advance relative to the current frame position in samples, E_l is the energy value of the current frame, E_{l+1}(k) is the energy value of the next frame, l is the frame index, and Kmin and Kmax are the minimum and maximum admissible values of frame advance in samples.
Here, energy is calculated according to the usual formula, except that the energy of the next frame is dependent upon the candidate frame advance k, i.e.

E_{l+1}(k) = Σ_{n=1}^{N} x²(n + k),        (2)

where N is the number of sample points in a frame, x(n) are the speech samples and k is defined relative to the beginning of the current (l-th) frame, as shown in Figure 3.
In order to achieve good computational efficiency in calculating the Kmax − Kmin + 1 sample-by-sample next frame candidate energies {E_{l+1}(k) | Kmin ≤ k ≤ Kmax}, these are calculated in a pipelined manner, taking advantage of the fact that

E_{l+1}(k + 1) = E_{l+1}(k) + x²(N + k + 1) − x²(k + 1).        (3)
The efficient computation of all possible candidate frame energies allows a very fine-grained search, which cannot easily be replicated by existing Euclidean MFCC distance and entropy-based MFCC VFR methods. Since in this scheme the frame advance k is selected by search, the main design consideration is the setting of the predefined search interval limits Kmin and Kmax.
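The search in equations (1)-(3) is straightforward to implement. The sketch below, in Python with NumPy, is a minimal illustration only: the function name, the parameter defaults, the 0-based indexing and the small epsilon guarding the logarithms are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np

def select_next_frame_advance(x, frame_start, frame_len, k_min, k_max, eps=1e-12):
    """Energy-search VFR: pick the advance k (in samples) of the next frame that
    maximises |log E_{l+1}(k) - log E_l|, cf. equations (1)-(3)."""
    x = np.asarray(x, dtype=float)
    seg = x[frame_start:]                        # samples relative to the current frame start
    k_max = min(k_max, len(seg) - frame_len)     # keep candidate frames inside the signal
    e_curr = np.sum(seg[:frame_len] ** 2)        # E_l, energy of the current frame

    # Energy of the first candidate frame, then pipelined updates as in equation (3).
    e_cand = np.sum(seg[k_min:k_min + frame_len] ** 2)
    best_k = k_min
    best_score = abs(np.log(e_cand + eps) - np.log(e_curr + eps))
    for k in range(k_min + 1, k_max + 1):
        e_cand += seg[k + frame_len - 1] ** 2 - seg[k - 1] ** 2   # recursive update (eq. 3)
        score = abs(np.log(e_cand + eps) - np.log(e_curr + eps))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```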
Equation (1) might alternatively comprise some other function of energy, i.e.
k̂ = arg max_{Kmin ≤ k ≤ Kmax} f(..., E_{l−1}, E_l, E_{l+1}, ...).        (4)
The estimated next frame k̂ is usually the candidate frame k with the highest average amplitude of the speech signal within that frame when compared to the other candidate frames.
Once the next candidate frame k̂ has been estimated, Energy Search and VFR analysis (1) is repeated with the next candidate frame k̂ now being the current frame position (l-th).
For each of the frames estimated using Energy Search and VFR analysis 32, a Fast Fourier Transform (FFT) and Mel-frequency filtering are applied; that is, each frame is filtered by different frequency filters to produce, for each frequency band j in the frame, a magnitude (amplitude) value Yj. This is shown in Fig. 4 where for a frame of speech 60, the corresponding calculated Mel-Filtered frame 62 is also shown.
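For illustration, a minimal NumPy sketch of this step is given below. The Hamming window, FFT size and number of filters are common defaults assumed here, not values specified in the patent, and the triangular Mel filterbank is built in the usual textbook manner.

```python
import numpy as np

def mel_filterbank_outputs(frame, fs, n_fft=256, n_mels=23):
    """Magnitude FFT of one frame followed by triangular Mel filtering; returns Y_j."""
    frame = np.asarray(frame, dtype=float)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))

    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    Y = np.zeros(n_mels)
    for j in range(n_mels):
        lo, ctr, hi = bin_edges[j], bin_edges[j + 1], bin_edges[j + 2]
        for b in range(lo, hi):
            w = (b - lo) / max(ctr - lo, 1) if b < ctr else (hi - b) / max(hi - ctr, 1)
            Y[j] += w * spec[b]                  # weighted sum = filterbank output magnitude
    return Y
```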
Next, Mel-filtering Output Weighting 38 is applied to Yj for each estimated frame. Mel-filtering output weighting aims to adjust the Yj values in order to compensate for noise and improve the quality of the speech signal. The result is that, for the output magnitude of each j-th Mel filterbank, an enhanced value mj is determined.
To do this Mel Filtering is also applied to the noise signal (i.e. part of the speech signal that does not include any speech) to determine a noise estimate Nj for each frequency band. This is then taken into consideration together with the corresponding magnitude value Yj and a weighting factor αj that is dependent upon the Signal-to-Noise Ratios of the other filterbank outputs (i.e. a weighting factor that is dependent on the Mel-filtering index j). The enhanced value mj better emphasises those filterbank outputs which are found to be more reliable and less affected by the actual noise signal spectral characteristics.
The noise robustness of the proposed front-end is enhanced by compensating the Mel-filterbank outputs based on the noise spectral characteristics. In this work, an enhanced log Mel-filterbank output (mj) is given by:

m_j = α_j · log( MAX[ Y_j − γ·N_j , β_j·Y_j ] ),        (5)
where O1-, βj, % all e (0,1) are parameters to adjust the noise compensation, Yj is the output magnitude of the j-th Mel-filterbank, Nj is the noise estimate of the y'-th MeI- filterbank output and MAX[.] is a function which returns the maximum value of its arguments.
Note that # is used to control the degree of noise spectral subtraction and βj is used to control the degree of spectral flooring. Here, both % and βj are assumed to be independent of the Mel-filterbank index j as we are more interested in the log Mel- filterbank output weighting and this assumption can simplify the formulation. Also these two parameters are applied globally in that they have the same values for all the speech utterances. While βj and γj are currently being determined empirically, it however automatic learning of the appropriate parameter values from training data is possible by using a gradient-descent algorithm.
One way to measure the reliability of a filterbank output is the signal-to-noise ratio (SNR). From the viewpoint of psychoacoustics, these weighting factors (αj) are related to the spectral compression process that converts sound intensity into loudness as perceived by humans. So far in the literature, each of the weighting factors has been assumed to be dependent on its individual output SNR only. However, in our case, a weighting factor is also dependent on the SNRs of other filterbank outputs and it is given by:
α_j = log(1 + Y_j / N_j) / Σ_{i=1}^{M} log(1 + Y_i / N_i),        (6)
where M is the number of Mel-filters in the frequency analysis. The constant "1" is added to the log function to prevent it from having negative values since there may be errors in the noise estimates. In essence, αj is basically calculated as the ratio of the SNR of a particular filterbank output to the sum of the SNRs of all the filterbank outputs. Moreover, in this case, all the weighting factors are calculated frame-by-frame dynamically based on the noise estimates, e.g. from the first 10 frames of each speech utterance. Further enhancement to the noise estimates can be applied that updates the estimates dynamically based on online speech/non-speech classification of each frame of data.
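A compact NumPy sketch of the weighting of equations (5) and (6) follows. The parameter values for gamma and beta are placeholders only (the patent states they lie in (0,1) and are determined empirically), and the reconstructed form of equation (5) used here is this sketch's reading of the text rather than a quoted formula.

```python
import numpy as np

def weight_filterbank_outputs(Y, N_est, gamma=0.9, beta=0.01):
    """Enhanced log Mel-filterbank outputs m_j, cf. equations (5)-(6).

    Y     -- filterbank output magnitudes of the current frame (Y_j)
    N_est -- noise estimates for the same channels (N_j)
    gamma -- degree of noise spectral subtraction (assumed value)
    beta  -- degree of spectral flooring (assumed value)
    """
    Y = np.asarray(Y, dtype=float)
    N_est = np.asarray(N_est, dtype=float)

    # Weighting factors alpha_j (eq. 6): per-channel log-SNR, normalised by the sum
    # over all M channels; the "1 +" keeps each term non-negative.
    snr_term = np.log(1.0 + Y / np.maximum(N_est, 1e-12))
    alpha = snr_term / np.sum(snr_term)

    # Eq. (5): spectral subtraction with flooring, then log compression and weighting.
    floored = np.maximum(Y - gamma * N_est, beta * Y)
    return alpha * np.log(np.maximum(floored, 1e-12))
```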
Next, a Discrete Cosine Transform (DCT) 40 is applied to mj in a linear form. Cumulative Distribution Mapping 42 is then applied to the enhanced output features according to the mapped distribution.
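The DCT step can be written in one line with SciPy; keeping the first 13 coefficients (including c0) is the usual MFCC convention and is an assumption here, not a number given in this passage.

```python
import numpy as np
from scipy.fftpack import dct

def cepstra_from_filterbank(m, n_ceps=13):
    """DCT de-correlation of the enhanced log filterbank outputs (step 40)."""
    return dct(np.asarray(m, dtype=float), type=2, norm='ortho')[:n_ceps]
```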
The main idea of this method 42 is to map the distribution of the noisy speech features into a target distribution with a pre-defined probability density function (PDF). In our case, it is assumed that for a given feature value v0, the mapping relationship is:

∫_{−∞}^{v0} f(v) dv = ∫_{−∞}^{z0} h(z) dz,        (7)
or
F_v(v0) = F_z(z0),        (8)
where F_v(v) is the corresponding cumulative distribution function (CDF) of a given set of speech features, F_z(z) is the target CDF, and f(v) and h(z) are the respective PDFs. From equation (8), we have
z0 = F_z⁻¹( F_v(v0) ).        (9)
Therefore, the required mapping from a given speech feature v0 into the corresponding target feature z0 is represented by equation (9). In our work, the target PDF h(z) is assumed to be a Gaussian with zero mean and unity variance. We also use the following formula to estimate F_v(v0):

F_v(v0) ≈ Ω / T,        (10)
Ω = Count{ v < v0 },        (11)

i.e. Ω is the number of frames whose corresponding feature values are less than a particular v0 in an utterance and T is the total number of frames in the utterance.
While the nonlinear mapping of CDM can be made equivalent to that obtained by using histogram equalization, there is no particularly strong reason, other than easier implementation, that one has to assume h(z) to be Gaussian as in the literature of histogram equalization.
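A sketch of equations (7)-(11) for one cepstral coefficient track is shown below; the half-sample offset that keeps the empirical CDF strictly inside (0, 1) is a numerical safeguard added for this sketch, and the left-tail truncation introduced in the following paragraphs is omitted here.

```python
import numpy as np
from scipy.stats import norm

def cdm_normalize(c):
    """Cumulative distribution mapping of one coefficient track onto a zero-mean,
    unit-variance Gaussian target, cf. equations (7)-(11).

    c -- values of one cepstral coefficient over the T frames of an utterance.
    """
    c = np.asarray(c, dtype=float)
    T = len(c)
    omega = np.array([np.sum(c < v0) for v0 in c])   # eq. (11): frames with value < v0
    F_v = (omega + 0.5) / T                          # eq. (10), kept strictly in (0, 1)
    return norm.ppf(F_v)                             # eq. (9): z0 = Fz^-1(Fv(v0))
```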
In this invention, a truncated Gaussian is used as the target distribution. In fact, we have observed that the left tail region of the distribution of a normalized feature may not be that useful, as it represents mainly the range of more noisy features.
Mathematically, the additional constraint is specified by:
z0 = F_z⁻¹( F_v(v0) )  if F_v(v0) > θ_th,  and  z0 = SKIP  otherwise,        (12)
where θ_th is a constant that determines the fraction of features to be truncated, and "SKIP" denotes a function that skips the current frame of speech data and does not output any feature value. In the current implementation, we perform the skipping of a whole feature vector based only on c0 (the zeroth-order cepstral coefficient) as it indicates the energy level of a frame of speech data.
A block diagram of one embodiment for a hardware implementation is shown in Figure 5. The front-end 80 includes an analog-to-digital (A/D) converter 84, a storage buffer 86, a central processing unit 88, a memory module 90, a data output interface 92 and a system bus 94. The memory module is a computer readable storage medium. In the preferred embodiment, memory module 90 stores the hardwired or programmable code for a pre-processing and framing module 96, an energy search variable frame rate (VFR) analysis module 98, a fast Fourier transform (FFT) and Mel-filtering module 100, a Mel-output weighting module 102, a discrete cosine transform (DCT) module 104 and a cumulative distribution mapping (CDM) module 106.
When front-end 80 is in operation, A/D converter 84 receives an input speech signal 82 and converts the signal into digital data. The digital speech data are then stored in buffer 86 which in turn provides the data to various component modules via system bus 94. CPU 88 accesses the speech data on system bus 94 and then processes the data according to the software instructions stored in memory module 90. Following the instructions stored in pre-processing and framing module 96, CPU 88 filters the digital speech data and then divides the chunk of speech data into frames, preferably 25 ms long and with an overlap of 10 ms.
For each frame of speech data, CPU 88 decides if a frame should be discarded according to the instructions and calculations stored in VFR analysis module 98 (step 32). If the speech frame is retained, CPU 88 estimates the frequency contents of this frame of speech data according to the instructions stored in FFT and Mel-filtering module 100 (step 36). CPU 88 then weights each Mel-filter output according to the instructions stored in Mel-output weighting module 102 (step 38) and correspondingly applies de-correlation processing according to the instructions and parameters stored in DCT module 104 (step 40) to generate MFCCs. Finally CPU 88 normalizes the MFCCs according to the instructions stored in CDM module 106 (step 42) and instructs output interface 92 to transmit the enhanced features, preferably wirelessly, to backend 12 where pattern matching occurs. CPU 88 repeats the above processing steps until all speech frames have been processed and instructs output interface 92 to send a control signal to backend 12 to indicate the completion of front-end processing. Backend 12 responds to this control signal and sends off recognition results 110 to an application.
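Purely to illustrate the control flow just described, the sketch below chains the helper functions introduced in the earlier sketches (select_next_frame_advance, mel_filterbank_outputs, weight_filterbank_outputs, cepstra_from_filterbank and cdm_normalize); those names, the ten-frame noise estimate and the omission of any online speech/non-speech refinement are assumptions of this sketch rather than details mandated by the patent.

```python
import numpy as np

def frontend_features(x, fs, frame_len, k_min, k_max, noise_frames=10, n_ceps=13):
    """End-to-end front-end sketch: VFR framing, Mel filtering, output weighting,
    DCT and utterance-level CDM, mirroring steps 32, 36, 38, 40 and 42."""
    x = np.asarray(x, dtype=float)

    # Initial noise estimate N_j from the first few frames, assumed speech-free.
    noise = np.mean([mel_filterbank_outputs(x[i * frame_len:(i + 1) * frame_len], fs)
                     for i in range(noise_frames)], axis=0)

    feats, start = [], 0
    while start + frame_len + k_max < len(x):
        frame = x[start:start + frame_len]
        Y = mel_filterbank_outputs(frame, fs)              # step 36: FFT + Mel filtering
        m = weight_filterbank_outputs(Y, noise)            # step 38: output weighting
        feats.append(cepstra_from_filterbank(m, n_ceps))   # step 40: DCT -> static cepstra
        start += select_next_frame_advance(x, start, frame_len, k_min, k_max)  # step 32

    feats = np.array(feats)
    # Step 42: CDM applied per coefficient over the whole utterance.
    return np.column_stack([cdm_normalize(feats[:, i]) for i in range(feats.shape[1])])
```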
This method is used at the 'front-end' of speech recognition, speech coding and speech identification systems. The likely end users are current users of speech recognition or speaker verification technology, particularly people whose occupation or physical abilities require or are greatly enhanced by this technology. Likely end users also include users of telephone spoken dialog systems, particularly those who regularly phone in from noisy environments, and users of other related handheld devices that use speech recognition.
This front-end method described above (i.e. the portion to the left of the dotted line in Fig. 1 marked 'front-end' 10) could be implemented with good computational efficiency on a standard mobile phone processor (e.g. an XScale processor), for example. The system could optionally be distributed for example by performing the back-end operations (i.e. the portion to the right of the dotted line in Fig. 1 marked 'back-end' 12) on a central server, so that the complex back-end computations were not performed by the mobile phone processor. In this example, the features extracted by the front-end on the phone would be sent via a wireless network protocol to a central telephone exchange, where the back-end server was located.
The cepstral based front-end is by far the most popular choice in the field of speech recognition, and that is the main reason why ETSI adopted this type of front-end into its standards. There are two relevant baseline systems for comparison: the ETSI standard MFCC front-end, known as the 'standard' front-end, and the ETSI 'Advanced Front-End'. The latter provides substantially improved performance over the ETSI standard front-end, with a very large (threefold) increase in computational complexity. This configuration is indicative of the state of the art in robust speech recognition at very high complexity.
When implemented individually in conjunction with the ETSI standard MFCC front-end, each of the above methods 32, 38, 42 produces improvements over ETSI. The combined addition of the above methods 32, 38 and 42 creates only a very slight increase in computational complexity over the ETSI standard configuration.
When the above methods 32, 38 and 42 are implemented together in conjunction with the ETSI standard front-end, the combined system produces substantial improvements over ETSI, closely approaching the performance of the advanced front-end. The combined system creates only a slight increase in computational complexity over the ETSI configuration. To summarize, the above method provides speech recognition accuracies exceeding the ETSI standard MFCC front-end at reasonable complexity, and very closely approaching the state of the art (see Table 1) for a reduction in computational complexity of at least threefold. Note that the Aurora noisy digit database is a standard, very large, difficult recognition task used widely in the research community.
Table 1: Average digit accuracies (%) for Aurora test sets, proposed front-end compared with ETSI MFCC front-ends, clean HMM set
* Measured in percentage of relative error rate reduction for the proposed front-end.
In addition, the processing blocks of the proposed front-end can be distributed across different physical locations depending on the requirements of an application. One example is to have the static cepstral coefficients generated on the client side and then apply CDM on the server side when receiving the static features. A variation on the new energy search VFR method was given in equation (4) and discussed above.
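The client/server split mentioned here could look like the following sketch; the JSON payload, the function names and the reuse of cdm_normalize from the earlier sketch are illustrative choices, and the patent states only that the static cepstra are produced on the client and CDM is applied on the server.

```python
import json
import numpy as np

def pack_static_cepstra(cepstra):
    """Client side: serialise the static cepstral coefficients (one row per retained
    frame) for transmission over a DSR channel; JSON is just an example format."""
    return json.dumps(np.asarray(cepstra).tolist())

def unpack_and_normalize(payload):
    """Server side: recover the static features and apply CDM to each coefficient
    track (cdm_normalize as sketched earlier) before recognition."""
    cepstra = np.array(json.loads(payload))
    return np.column_stack([cdm_normalize(cepstra[:, i])
                            for i in range(cepstra.shape[1])])
```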
Compared with signal-space and model-space methods for noisy speech recognition, the proposed front-end has the following advantages:
• Improvement in accuracy can be better than the other types of methods, but with less additional computational load, particularly for complex recognition systems.
• The same front-end can be plugged into different systems without major modification.
• The methods fit easily into the paradigm of distributed speech recognition (DSR), where international standards are available for leveraging the application of speech recognition on mobile and handheld devices.
Compared with the ETSI standard MFCC front-end, the proposed front-end is found to be much more robust and there is only a slight increase in computation. Some benchmarking results using the Aurora noisy digit database, grouped by signal-to-noise ratio (SNR), are shown in Figure 6. The ETSI advanced front-end represents the current state of the art in the field of noise robustness. However, the advanced front-end has a high computational load. The proposed front-end has a much lighter computational load and is very nearly as noise robust as the advanced front-end (see Table 1).
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described.
For example, any suitable frequency scale known in the art can be used for filtering, such as the Bark frequency scale or a linear frequency scale.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

Claims
1. A method for front-end processing of speech signals comprising the following steps: dividing the speech signal into frames; filtering the frames of the speech signal into frequency bands to produce filtered outputs for each frame; and deriving a noise estimate for each frequency band of the speech signal and weighting the filtered outputs of each frame with a function derived from the filtered outputs and noise estimates to emphasise outputs that are less affected by noise.
2. A method according to claim 1, wherein the noise estimate of the speech signal is determined from part of the speech signal that does not include any speech.
3. A method according to claim 1 or 2, wherein the filtering into frequency bands is based on the Mel-scale frequency.
4. A method according to claim 1, 2 or 3, wherein the filtered outputs of a frame comprise a magnitude value for every frequency band in the frame and the filtered noise estimates comprise a magnitude value for every frequency band in the noise estimate.
5. A method according to any one of the preceding claims, wherein the function used for weighting includes a first weighting factor that is dependent on filtered outputs and noise estimates of multiple frequency bands.
6. A method according to claim 5, wherein the first weighting factor is based on a ratio of the Signal-to-Noise Ratio of a particular filterbank output to the sum of the Signal-to-Noise Ratios of all the filterbank outputs.
7. A method according to any one of the preceding claims, wherein the step of deriving the noise estimate for each frequency band is derived dynamically for each frame.
8. A method according to any one of the preceding claims, wherein the function used for weighting includes a second weighting factor that is dependent on the filtered output and noise estimate at a particular filterbank, but independent of noise estimates at other multiple frequency bands.
9. A method according to any one of the preceding claims, wherein the step of weighting the filtered outputs comprises subtracting from the magnitude value of a frequency band of a filtered frame, the value of the magnitude of the noise estimate at that frequency band.
10. A method according to any one of the preceding claims, wherein the step of weighting the filtered outputs comprises scaling the magnitude value of a frequency band of a filtered frame and adding an offset to the scaled value.
11. A method according to any one of the preceding claims, wherein the weighting function is calculated dynamically for different frames of speech.
12. A method according to any one of the preceding claims, wherein the step of dividing the speech signal into frames comprises calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
13. A method according to claim 12, wherein selecting the frame is based on the position of the previous frame relative in time.
14. A method according to claim 12 or 13, wherein the predetermined time window has a predetermined minimum time and predetermined maximum time that the candidate frames start at or end between.
15. A method according to any one of claims 12 to 14, wherein two or more of the selected frames of the speech signal overlap in time.
16. A method according to any one of claims 12 to 15, wherein the predetermined criteria comprise the candidate frame that has the optimum energy function value.
17. A method according to any one of claims 12 to 16, wherein the predetermined criteria comprise the candidate frame that has the largest absolute difference in energy function value from the previous frame.
18. A method according to any one of the preceding claims, wherein the method further comprises the steps of: applying a discrete cosine transformation to frames having weighted frequency filtered outputs; and mapping the discrete cosine transformed frames to a predetermined probability density function and disregarding the mapped frames in a tail region of the distribution in further processing of the speech signal.
19. A method according to claim 18, wherein the step of removing from the mapped distribution frames from the tail region is from the left tail region.
20. A method according to any one of the preceding claims, wherein the processed speech signal is used for feature extraction of the speech signal.
21. A method of dividing a speech signal into a sequence of frames for further front- end processing of the speech signals, the method comprising the following steps: calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
22. A method according to claim 21, wherein selecting the frame is based on the position of the previous frame relative in time.
23. A method according to claim 21 or 22, wherein the predetermined time window has a predetermined minimum time and predetermined maximum time that the candidate frames start at.
24. A method according to any one of claims 21 to 23, wherein two or more of the selected frames of the speech signal overlap in time.
25. A method according to any one of claims 21 to 24, wherein the predetermined criteria comprise the candidate frame that has the optimum energy function value.
26. A method according to any one of claims 21 to 25, wherein the predetermined criteria comprise the candidate frame that has the largest absolute difference in energy function value from the previous frame.
27. A method according to any one of claims 21 to 26, wherein the sequence of frames of the speech signal is further processed for feature extraction of the speech signal.
28. A method for front-end processing of speech signals comprising the following steps: dividing the speech signal into a sequence of frames;
Mel-frequency filtering the frames of the speech signal to produce Mel-frequency filtered outputs for each frame; and deriving a noise estimate for each Mel-scale frequency band of the speech signal and weighting the Mel-frequency filtered outputs of each frame with a function derived from the filtered outputs and the Mel-scale noise estimates to emphasise outputs that are less affected by noise.
29. Software that, when installed on a computer readable storage medium of a computer, operates in accordance with the method described in any one of the preceding claims.
30. A computer system for front-end processing of speech signals comprising: a computer readable storage medium; a processor; and software installed on the storage medium operable to perform the method according to any one of the preceding claims using the processor.
31. A computer system according to claim 28, wherein the computer system is a distributed system.
32. A computer system according to claim 28, wherein the computer system is a speech recognition system.
PCT/AU2006/001498 2005-10-11 2006-10-11 Front-end processing of speech signals WO2007041789A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2006301933A AU2006301933A1 (en) 2005-10-11 2006-10-11 Front-end processing of speech signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU200505604 2005-10-11
AU00505604 2005-10-11

Publications (1)

Publication Number Publication Date
WO2007041789A1 true WO2007041789A1 (en) 2007-04-19

Family

ID=37942224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2006/001498 WO2007041789A1 (en) 2005-10-11 2006-10-11 Front-end processing of speech signals

Country Status (1)

Country Link
WO (1) WO2007041789A1 (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4630304A (en) * 1985-07-01 1986-12-16 Motorola, Inc. Automatic background noise estimator for a noise suppression system
US5406635A (en) * 1992-02-14 1995-04-11 Nokia Mobile Phones, Ltd. Noise attenuation system
US6230122B1 (en) * 1998-09-09 2001-05-08 Sony Corporation Speech detection with noise suppression based on principal components analysis
US6826528B1 (en) * 1998-09-09 2004-11-30 Sony Corporation Weighted frequency-channel background noise suppressor
US6411925B1 (en) * 1998-10-20 2002-06-25 Canon Kabushiki Kaisha Speech processing apparatus and method for noise masking

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2450886A (en) * 2007-07-10 2009-01-14 Motorola Inc Voice activity detector that eliminates from enhancement noise sub-frames based on data from neighbouring speech frames
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
US8909522B2 (en) 2007-07-10 2014-12-09 Motorola Solutions, Inc. Voice activity detector based upon a detected change in energy levels between sub-frames and a method of operation
WO2010075789A1 (en) * 2008-12-31 2010-07-08 华为技术有限公司 Signal processing method and apparatus
CN101770775B (en) * 2008-12-31 2011-06-22 华为技术有限公司 Signal processing method and device
RU2688259C2 (en) * 2014-04-29 2019-05-21 Хуавэй Текнолоджиз Ко., Лтд. Method and device for signal processing
US11081121B2 (en) 2014-04-29 2021-08-03 Huawei Technologies Co., Ltd. Signal processing method and device
US12249339B2 (en) 2014-04-29 2025-03-11 Huawei Technologies Co., Ltd. Signal processing method and device
RU2656812C2 (en) * 2014-04-29 2018-06-06 Хуавэй Текнолоджиз Ко., Лтд. Signal processing method and device
US10186271B2 (en) 2014-04-29 2019-01-22 Huawei Technologies Co., Ltd. Signal processing method and device
WO2015165264A1 (en) * 2014-04-29 2015-11-05 华为技术有限公司 Signal processing method and device
US10347264B2 (en) 2014-04-29 2019-07-09 Huawei Technologies Co., Ltd. Signal processing method and device
US10546591B2 (en) 2014-04-29 2020-01-28 Huawei Technologies Co., Ltd. Signal processing method and device
US11881226B2 (en) 2014-04-29 2024-01-23 Huawei Technologies Co., Ltd. Signal processing method and device
US9837088B2 (en) 2014-04-29 2017-12-05 Huawei Technologies Co., Ltd. Signal processing method and device
US11580996B2 (en) 2014-04-29 2023-02-14 Huawei Technologies Co., Ltd. Signal processing method and device
CN108053837A (en) * 2017-12-28 2018-05-18 深圳市保千里电子有限公司 A kind of method and system of turn signal voice signal identification
CN111462757A (en) * 2020-01-15 2020-07-28 北京远鉴信息技术有限公司 Data processing method and device based on voice signal, terminal and storage medium
CN111462757B (en) * 2020-01-15 2024-02-23 北京远鉴信息技术有限公司 Voice signal-based data processing method, device, terminal and storage medium
CN118016049A (en) * 2022-11-10 2024-05-10 唯思电子商务(深圳)有限公司 A closed-loop OTP verification system based on voice verification code
CN117388835A (en) * 2023-12-13 2024-01-12 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method
CN117388835B (en) * 2023-12-13 2024-03-08 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method


Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase (Ref document number: 2006301933; Country of ref document: AU)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2006301933; Country of ref document: AU; Date of ref document: 20061011; Kind code of ref document: A)
WWP WIPO information: published in national office (Ref document number: 2006301933; Country of ref document: AU)
122 EP: PCT application non-entry in European phase (Ref document number: 06790368; Country of ref document: EP; Kind code of ref document: A1)

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载