
WO2007041789A1 - Front-end processing of speech signals - Google Patents

Front-end processing of speech signals

Info

Publication number
WO2007041789A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
frames
speech
speech signal
noise
Prior art date
2005-10-11
Application number
PCT/AU2006/001498
Other languages
English (en)
Inventor
Eric Choi
Julien Epps
Original Assignee
National Ict Australia Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2006-10-11
Publication date
2007-04-19
Application filed by National Ict Australia Limited filed Critical National Ict Australia Limited
Priority to AU2006301933A priority Critical patent/AU2006301933A1/en
Publication of WO2007041789A1 publication Critical patent/WO2007041789A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Definitions

  • The invention concerns a method of processing speech signals, for example, but not limited to, processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system.
  • The invention also concerns software and a computer system to perform the method of processing speech signals.
  • Speech recognition systems comprise a few basic components, as shown in Figure 1.
  • The 'front-end' 10 is the first stage of the system and uses signal processing techniques to derive a compact representation of a frame (or single short segment) of speech.
  • Three broad classes of techniques can be used to improve the overall recognition performance:
  • Model-space techniques: these are based in the back-end 12 and can produce some useful improvement; however, they are still limited by the quality of the features received from the front-end.
  • The invention provides a method for front-end processing of speech signals comprising the following steps: dividing the speech signal into frames; filtering the frames of the speech signal into frequency bands to produce filtered outputs for each frame; and deriving a noise estimate for each frequency band of the speech signal and weighting the filtered outputs of each frame with a function derived from the filtered outputs and noise estimates to emphasise outputs that are less affected by noise.
  • This invention provides good performance in recognising speech in signals spoken in noisy environments at a reduced processing load, making deployment in many practical situations (e.g. handheld devices) feasible.
  • The frequency filtering is based on the Mel-scale frequency.
  • The step of dividing the speech signal into frames may comprise calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing, such as for feature extraction of the speech signal.
  • The method may further comprise the steps of: applying a discrete cosine transformation to frames having weighted frequency filtered outputs; and mapping the discrete cosine transformed frames to a predetermined probability density function and disregarding the mapped frames from a tail region of the distribution in further processing of the speech signal.
  • Selecting the frame may be based on the time position of the previous frame.
  • The frames may be in time-ordered sequence.
  • The predetermined time window may have a predetermined minimum time and maximum time at which the candidate frames start.
  • The given time window may be based on the position of the previous frame.
  • The time differences between candidate frames may be predetermined. Two or more of the selected frames of the speech signal may overlap in time.
  • The predetermined criterion may be that the candidate frame has the optimum energy function value; this may be a minimum or a maximum value.
  • The predetermined criterion may be that the candidate frame has the largest absolute difference in energy function value from the previous frame.
  • The energy function may be the log energy of the frame. The energy of each candidate frame is dependent upon the energy value of a previous candidate frame, which makes the calculation of the energy values of all the candidate frames computationally inexpensive.
  • The noise estimate of the speech signal is determined from a part of the speech signal that does not include any speech.
  • The filtered outputs of a frame comprise a magnitude value for every frequency band in the frame, and the filtered noise estimates comprise a magnitude value for every frequency band in the noise estimate.
  • The noise estimate may be derived for each frame.
  • The step of weighting the filtered outputs may comprise subtracting, from the magnitude value of a frequency band of a filtered frame, the magnitude value of the noise estimate in that frequency band.
  • The function used for weighting may include a first weighting factor that is dependent on the filtered outputs and noise estimates of multiple frequency bands, and that may be based on the ratio of the Signal-to-Noise Ratio of a particular filterbank output to the sum of the Signal-to-Noise Ratios of all the filterbank outputs.
  • The noise estimate for each frequency band may be derived dynamically for each frame.
  • The step of weighting frequency filtered outputs may comprise scaling the magnitude value of a frequency filtered frame and adding an offset to the scaled value.
  • The step of weighting frequency filtered frames after logarithmic compression may comprise weighting by a function that is dependent on the signal-to-noise ratios of frequency filtered outputs at multiple frequency bands.
  • The weighting function is calculated dynamically for different frames of speech.
  • The frames removed from the tail region of the mapped distribution may be taken from the left tail region.
  • The method may be performed at the front-end of a speech recognition system. This has the distinct advantage of being cepstral-based, meaning that it fits well into the paradigm of distributed speech recognition (DSR), where international standards are available for leveraging the application of speech recognition on mobile and handheld devices.
  • The invention provides a method of dividing a speech signal into frames for further front-end processing of the speech signals, the method comprising the following steps: calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
  • The invention provides software to perform the method described above.
  • The invention provides a computer system programmed to perform the method described above.
  • The computer system may be a distributed system.
  • The invention provides software to perform the method described in any one of the preceding claims.
  • Figure 1 is a block diagram of a speech recognition system (prior art);
  • Figure 2 is a flowchart of the method of processing speech signals;
  • Figure 3 schematically shows candidate frames within a given time window;
  • Figure 4 shows the calculated Mel-filterbank values for a frame;
  • Figure 5 is a schematic diagram of a computer system used for pre-processing of speech signals in accordance with the present invention; and
  • Figure 6 shows graphically some experimental results of the invention.
  • A speech signal y(t) 30 is provided as input to the 'front-end' of the speech recognition system. Pre-processing and dividing the signal into frames (i.e. framing) is then applied to the speech signal y(t) according to an energy search variable frame rate (VFR) analysis 32.
  • Energy search VFR analysis seeks to estimate the optimum relative position in time of the next frame of speech, k̂, by maximizing the difference in log energy between the current frame and the possible next frame.
  • The optimum position of the next frame is determined by an energy search. It is predetermined that the next frame will start somewhere in a time window defined by K_min 52 and K_max 54. The time increment between possible candidate frames is also predetermined. An energy search is then conducted over the interval between K_min 52 and K_max 54 by calculating the energy E_{l+1}(k) of each candidate frame that starts between K_min 52 and K_max 54, in order to determine the next frame of speech k̂.
  • The energy search is performed according to:

$$\hat{k} = \arg\max_{K_{\min} \le k \le K_{\max}} \left| E_{l+1}(k) - E_l \right| \qquad (1)$$

where:
  • k is the candidate frame advance relative to the current frame position, in samples;
  • E_l is the energy value of the current frame;
  • E_{l+1}(k) is the energy value of the next frame;
  • l is the frame index; and
  • K_min and K_max are the minimum and maximum admissible values of frame advance, in samples.
  • Energy is calculated according to the usual formula, except that the energy of the next frame is dependent upon the candidate frame advance k, i.e.

$$E_{l+1}(k) = \sum_{n=k+1}^{N+k} x_l^2(n) \qquad (2)$$

where N is the number of sample points in a frame and k is defined relative to the beginning of the current (l-th) frame, as shown in Figure 3. Each successive candidate can then be computed recursively from the previous one:

$$E_{l+1}(k+1) = E_{l+1}(k) + x_l^2(N+k+1) - x_l^2(k+1) \qquad (3)$$

  • Equation (1) might alternatively comprise some other function of energy, e.g. the log energy of each frame.
  • The estimated next frame k̂ is usually the candidate frame with the highest average amplitude of the speech signal within that frame when compared to the other candidate frames.
  • The energy search VFR analysis (1) is then repeated, with the estimated next frame k̂ now being the current (l-th) frame position.
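  • By way of illustration, the following is a minimal sketch of the energy search of equations (1) to (3) for a single frame advance decision. The function name next_frame_advance, the argument names and the use of NumPy are illustrative assumptions, not taken from the patent; only the argmax criterion and the recursion follow the description above.

```python
import numpy as np

def next_frame_advance(x, frame_start, frame_length, k_min, k_max):
    """Return the advance k_hat (in samples) of the next frame, chosen to
    maximise |E_{l+1}(k) - E_l| over k_min <= k <= k_max (equation (1))."""
    # Energy of the current (l-th) frame.
    e_l = np.sum(x[frame_start:frame_start + frame_length] ** 2)

    # Energy of the first candidate frame, at advance k = k_min (equation (2)).
    e_next = np.sum(x[frame_start + k_min:frame_start + k_min + frame_length] ** 2)

    best_k, best_diff = k_min, abs(e_next - e_l)
    for k in range(k_min, k_max):
        # Slide the candidate window by one sample using the recursion of
        # equation (3): add the sample entering the window, drop the one leaving.
        e_next += x[frame_start + k + frame_length] ** 2 - x[frame_start + k] ** 2
        if abs(e_next - e_l) > best_diff:
            best_k, best_diff = k + 1, abs(e_next - e_l)
    return best_k
```

  • Because each candidate energy is obtained from the previous one at constant cost, searching the whole window costs on the order of K_max minus K_min operations rather than N operations per candidate, which is what makes the search computationally inexpensive.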
  • A Fast Fourier Transform (FFT) and Mel-frequency filtering 36 are then applied; that is, each frame is filtered by different frequency filters to produce, for each frequency band j in the frame, a magnitude (amplitude) value Y_j.
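  • As an illustrative sketch of this step, the frame's FFT magnitude spectrum can be passed through a bank of triangular Mel-scale filters to give one output Y_j per band. The triangular filterbank construction below is the common textbook formulation, assumed here rather than quoted from the patent, and the parameter values (23 filters, 512-point FFT) are conventional defaults.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # Hz -> Mel

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)      # Mel -> Hz

def mel_filterbank_outputs(frame, sample_rate, n_filters=23, n_fft=512):
    spectrum = np.abs(np.fft.rfft(frame, n_fft))     # magnitude spectrum
    # Filter edge frequencies, equally spaced on the Mel scale.
    edges = inv_mel(np.linspace(0.0, mel(sample_rate / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    Y = np.zeros(n_filters)
    for j in range(n_filters):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        for b in range(lo, mid):                     # rising slope of filter j
            Y[j] += spectrum[b] * (b - lo) / max(mid - lo, 1)
        for b in range(mid, hi):                     # falling slope of filter j
            Y[j] += spectrum[b] * (hi - b) / max(hi - mid, 1)
    return Y                                         # Y[j] corresponds to Y_j
```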
  • Mel-filtering output weighting 38 is then applied to Y_j for each estimated frame.
  • Mel-filtering output weighting aims to adjust the Y_j values in order to compensate for noise and improve the quality of the speech signal. The result is that, for each output magnitude of the j-th Mel filterbank, an enhanced value m_j is determined, by a function of the form

$$m_j = \alpha_j \cdot \mathrm{MAX}\left[\, Y_j - \beta N_j,\; \gamma Y_j \,\right]$$

where:
  • α_j, β and γ, all ∈ (0,1), are parameters to adjust the noise compensation;
  • Y_j is the output magnitude of the j-th Mel-filterbank;
  • N_j is the noise estimate of the j-th Mel-filterbank output; and
  • MAX[.] is a function which returns the maximum value of its arguments.
  • α_j is basically calculated as the ratio of the SNR of a particular filterbank output to the sum of the SNRs of all the filterbank outputs.
  • All the weighting factors are calculated dynamically, frame by frame, based on the noise estimates, e.g. obtained from the first 10 frames of each speech utterance. A further enhancement updates the noise estimates dynamically based on online speech/non-speech classification of each frame of data.
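  • A sketch of the weighting for one frame follows. The SNR-based weight α_j mirrors the description above; the flooring form of the noise subtraction and the example values of β and γ are assumptions chosen for illustration, not values quoted from the patent.

```python
import numpy as np

def weight_mel_outputs(Y, N, beta=0.9, gamma=0.1, eps=1e-10):
    """Y: Mel-filterbank magnitudes Y_j for one frame; N: noise estimates N_j."""
    snr = np.maximum(Y - N, eps) / (N + eps)   # crude per-band SNR estimate
    alpha = snr / np.sum(snr)                  # alpha_j = SNR_j / sum of all SNRs
    # m_j = alpha_j * MAX[Y_j - beta*N_j, gamma*Y_j]: subtract the scaled noise
    # estimate, but never fall below a small fraction of the original output.
    return alpha * np.maximum(Y - beta * N, gamma * Y)
```

  • In use, N would be initialized from, say, the first 10 frames of the utterance and optionally updated online from frames classified as non-speech, as described above.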
  • A Discrete Cosine Transform (DCT) 40 is then applied to m_j in a linear form.
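  • As a sketch, the conventional MFCC de-correlation step applies a logarithm followed by a DCT-II; the orthonormal scaling and the choice of 13 coefficients are common conventions assumed here rather than specified by the text.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(m, n_ceps=13):
    log_m = np.log(np.maximum(m, 1e-10))                # log compression; avoid log(0)
    return dct(log_m, type=2, norm='ortho')[:n_ceps]    # de-correlate to MFCCs
```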
  • Cumulative Distribution Mapping (CDM) 42 is then applied to the enhanced output features, so that the features follow the mapped distribution.
  • The main idea of this method 42 is to map the distribution of the noisy speech features into a target distribution with a pre-defined probability density function (PDF); each feature value v is mapped to

$$z = F_z^{-1}\left( F_v(v) \right)$$

where:
  • F_v(v) is the corresponding cumulative distribution function (CDF) of a given set of speech features;
  • F_z(z) is the target CDF; and
  • h(z) is the PDF corresponding to the target CDF.
  • In practice the CDF is estimated empirically as F_v(v_0) = T_{v_0}/T, where T_{v_0} is the number of frames whose corresponding feature values are less than a particular v_0 in an utterance and T is the total number of frames in the utterance.
  • A truncated Gaussian is used as the target distribution.
  • The left tail region of the distribution of a normalized feature may not be that useful, as it mainly represents the range of noisier features.
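  • A sketch of CDM for a single feature dimension over one utterance follows. It replaces each value v with z = F_z^{-1}(F_v(v)), using the empirical CDF for F_v; for brevity a standard Gaussian stands in for the truncated Gaussian the text prefers, and the removal of left-tail frames is not implemented here.

```python
import numpy as np
from scipy.stats import norm

def cdm(values):
    """values: one cepstral coefficient across all T frames of an utterance."""
    T = len(values)
    ranks = np.argsort(np.argsort(values))     # rank of each frame's value, 0..T-1
    # Empirical CDF F_v(v_0) = (frames with value < v_0) / T, offset by 0.5
    # to keep the mapped probabilities strictly inside (0, 1).
    F_v = (ranks + 0.5) / T
    return norm.ppf(F_v)                       # z = F_z^{-1}(F_v(v))
```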
  • The front-end 80 includes an analog-to-digital (A/D) converter 84, a storage buffer 86, a central processing unit 88, a memory module 90, a data output interface 92 and a system bus 94.
  • The memory module is a computer readable storage medium.
  • Memory module 90 stores the hardwired or programmable code for a pre-processing and framing module 96, an energy search variable frame rate (VFR) analysis module 98, a fast Fourier transform (FFT) and Mel-filtering module 100, a Mel-output weighting module 102, a discrete cosine transform (DCT) module 104 and a cumulative distribution mapping (CDM) module 106.
  • A/D converter 84 receives an input speech signal 82 and converts the signal into digital data.
  • The digital speech data are then stored in buffer 86, which in turn provides the data to the various component modules via system bus 94.
  • CPU 88 accesses the speech data on system bus 94 and then processes the data according to the software instructions stored in memory module 90.
  • CPU 88 filters the digital speech data and then divides the speech data into frames, preferably 25 ms long and overlapping by 10 ms.
  • For each frame of speech data, CPU 88 decides whether the frame should be discarded according to the instructions and calculations stored in VFR analysis module 98 (step 32). If the speech frame is retained, CPU 88 estimates the frequency contents of this frame of speech data according to the instructions stored in FFT and Mel-filtering module 100 (step 36). CPU 88 then weights each Mel-filter output according to the instructions stored in Mel-output weighting module 102 (step 38) and correspondingly applies de-correlation processing according to the instructions and parameters stored in DCT module 104 (step 40) to generate MFCCs.
  • CPU 88 normalizes the MFCCs according to the instructions stored in CDM module 106 (step 42) and instructs output interface 92 to transmit the enhanced features, preferably wirelessly, to backend 12 where pattern matching occurs.
  • CPU 88 repeats the above processing steps until all speech frames have been processed, then instructs output interface 92 to send a control signal to backend 12 to indicate the completion of front-end processing. Backend 12 responds to this control signal and sends the recognition results 110 to an application.
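  • Read end to end, the modules of Figure 5 chain roughly as in the sketch below, which reuses the functions from the earlier sketches. The 25 ms frame length and nominal 10 ms advance come from the description above; the search bounds, the single-frame initial noise estimate and the overall code organisation are simplifying assumptions.

```python
import numpy as np

def front_end(signal, sample_rate):
    frame_len = int(0.025 * sample_rate)              # 25 ms frames
    step = int(0.010 * sample_rate)                   # nominal 10 ms advance
    features, noise, start = [], None, 0
    while start + frame_len + 2 * step < len(signal):
        k = next_frame_advance(signal, start, frame_len,
                               k_min=step // 2, k_max=2 * step)   # VFR, step 32
        start += k
        Y = mel_filterbank_outputs(signal[start:start + frame_len],
                                   sample_rate)       # FFT and Mel-filtering, step 36
        if noise is None:
            noise = Y                                 # crude one-frame noise estimate
        m = weight_mel_outputs(Y, noise)              # Mel-output weighting, step 38
        features.append(cepstral_coefficients(m))     # log + DCT, step 40
    c = np.vstack(features)
    # CDM, step 42: normalize each cepstral dimension across the utterance.
    return np.column_stack([cdm(c[:, i]) for i in range(c.shape[1])])
```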
  • This method is used at the 'front-end' of speech recognition, speech coding and speaker identification systems.
  • The likely end users are current users of speech recognition or speaker verification technology, particularly people whose occupation or physical abilities require, or are greatly enhanced by, this technology. Other likely users are users of telephone spoken dialog systems, particularly those who regularly phone in from noisy environments, and users of related handheld devices that use speech recognition.
  • The front-end method described above (i.e. the portion to the left of the dotted line in Fig. 1 marked 'front-end' 10) could be implemented with good computational efficiency on a standard mobile phone processor (e.g. an XScale processor).
  • The system could optionally be distributed, for example by performing the back-end operations (i.e. the portion to the right of the dotted line in Fig. 1 marked 'back-end' 12) on a central server, so that the complex back-end computations are not performed by the mobile phone processor.
  • The features extracted by the front-end on the phone would be sent via a wireless network protocol to a central telephone exchange, where the back-end server is located.
  • The cepstral-based front-end is by far the most popular choice in the field of speech recognition, which is the main reason the ETSI adopted this type of front-end into its standards.
  • The 'Advanced Front-End' provides substantially improved performance over the ETSI standard front-end, but at a very large (threefold) increase in computational complexity. This configuration is indicative of the state of the art in robust speech recognition at very high complexity.
  • Each of the above methods 32, 38, 42 produces improvements over ETSI.
  • The combined addition of the above methods 32, 38 and 42 creates only a very slight increase in computational complexity over the ETSI standard configuration.
  • The combined system produces substantial improvements over ETSI, closely approaching the performance of the advanced front-end.
  • The combined system creates only a slight increase in computational complexity over the ETSI configuration.
  • The above method provides speech recognition accuracies exceeding those of the ETSI standard MFCC front-end at reasonable complexity, very closely approaching the state of the art (see Table 1) at a computational complexity reduced by at least threefold.
  • The Aurora noisy digit database is a standard, very large and difficult recognition task used widely in the research community.
  • Table 1: Average digit accuracies (%) for the Aurora test sets, proposed front-end compared with ETSI MFCC front-ends (clean HMM set).
  • Processing blocks of the proposed front-end can be distributed across different physical locations, depending on the requirements of an application.
  • One example is to generate the static cepstral coefficients on the client side and then apply CDM on the server side upon receiving the static features.
  • Compared with signal-space and model-space methods for noisy speech recognition, the proposed front-end has the following advantages:
  • The ETSI advanced front-end represents the current state of the art in the field of noise robustness.
  • The proposed front-end has a much lighter computational load and is very nearly as noise robust as the advanced front-end (see Table 1).
  • Any suitable frequency scale known in the art can be used for filtering, such as the Bark frequency scale or a linear frequency scale.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

This invention concerns a method of processing speech signals, such as processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system. The method comprises dividing the speech signal into a sequence of frames (32) by calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value meeting predetermined criteria to be a frame for further processing. The frames are then frequency filtered, and a noise estimate is derived for each frequency band of the speech signal. The frequency filtered outputs are weighted (38) by a function derived from the filtered outputs and the noise estimates to emphasise outputs that are less affected by noise. The invention also concerns software and a computer system to perform the method of processing speech signals.
PCT/AU2006/001498 2005-10-11 2006-10-11 Front-end processing of speech signals WO2007041789A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2006301933A AU2006301933A1 (en) 2005-10-11 2006-10-11 Front-end processing of speech signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU200505604 2005-10-11
AU00505604 2005-10-11

Publications (1)

Publication Number Publication Date
WO2007041789A1 (fr) 2007-04-19

Family

ID=37942224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2006/001498 WO2007041789A1 (fr) 2005-10-11 2006-10-11 Front-end processing of speech signals

Country Status (1)

Country Link
WO (1) WO2007041789A1 (fr)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4630304A (en) * 1985-07-01 1986-12-16 Motorola, Inc. Automatic background noise estimator for a noise suppression system
US5406635A (en) * 1992-02-14 1995-04-11 Nokia Mobile Phones, Ltd. Noise attenuation system
US6230122B1 (en) * 1998-09-09 2001-05-08 Sony Corporation Speech detection with noise suppression based on principal components analysis
US6826528B1 (en) * 1998-09-09 2004-11-30 Sony Corporation Weighted frequency-channel background noise suppressor
US6411925B1 (en) * 1998-10-20 2002-06-25 Canon Kabushiki Kaisha Speech processing apparatus and method for noise masking

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2450886A (en) * 2007-07-10 2009-01-14 Motorola Inc Voice activity detector that eliminates from enhancement noise sub-frames based on data from neighbouring speech frames
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
US8909522B2 (en) 2007-07-10 2014-12-09 Motorola Solutions, Inc. Voice activity detector based upon a detected change in energy levels between sub-frames and a method of operation
WO2010075789A1 (fr) * 2008-12-31 2010-07-08 Huawei Technologies Co., Ltd. Signal processing method and apparatus
CN101770775B (zh) * 2008-12-31 2011-06-22 Huawei Technologies Co., Ltd. Signal processing method and apparatus
RU2688259C2 (ru) * 2014-04-29 2019-05-21 Хуавэй Текнолоджиз Ко., Лтд. Способ и устройство обработки сигналов
US11081121B2 (en) 2014-04-29 2021-08-03 Huawei Technologies Co., Ltd. Signal processing method and device
US12249339B2 (en) 2014-04-29 2025-03-11 Huawei Technologies Co., Ltd. Signal processing method and device
RU2656812C2 (ru) * 2014-04-29 2018-06-06 Хуавэй Текнолоджиз Ко., Лтд. Способ и устройство обработки сигналов
US10186271B2 (en) 2014-04-29 2019-01-22 Huawei Technologies Co., Ltd. Signal processing method and device
WO2015165264A1 (fr) * 2014-04-29 2015-11-05 Huawei Technologies Co., Ltd. Signal processing method and device
US10347264B2 (en) 2014-04-29 2019-07-09 Huawei Technologies Co., Ltd. Signal processing method and device
US10546591B2 (en) 2014-04-29 2020-01-28 Huawei Technologies Co., Ltd. Signal processing method and device
US11881226B2 (en) 2014-04-29 2024-01-23 Huawei Technologies Co., Ltd. Signal processing method and device
US9837088B2 (en) 2014-04-29 2017-12-05 Huawei Technologies Co., Ltd. Signal processing method and device
US11580996B2 (en) 2014-04-29 2023-02-14 Huawei Technologies Co., Ltd. Signal processing method and device
CN108053837A (zh) * 2017-12-28 2018-05-18 深圳市保千里电子有限公司 Method and system for recognizing vehicle turn-signal sound signals
CN111462757A (zh) * 2020-01-15 2020-07-28 北京远鉴信息技术有限公司 Speech-signal-based data processing method, apparatus, terminal and storage medium
CN111462757B (zh) * 2020-01-15 2024-02-23 北京远鉴信息技术有限公司 Speech-signal-based data processing method, apparatus, terminal and storage medium
CN118016049A (zh) * 2022-11-10 2024-05-10 唯思电子商务(深圳)有限公司 Closed-loop OTP verification system based on voice verification codes
CN117388835A (zh) * 2023-12-13 2024-01-12 湖南赛能环测科技有限公司 Multi-splice fusion acoustic radar signal enhancement method
CN117388835B (zh) * 2023-12-13 2024-03-08 湖南赛能环测科技有限公司 Multi-splice fusion acoustic radar signal enhancement method

Similar Documents

Publication Publication Date Title
CN108281146B (zh) Short-utterance speaker recognition method and apparatus
JP4218982B2 (ja) Speech processing
US7725314B2 (en) Method and apparatus for constructing a speech filter using estimates of clean speech and noise
EP2431972B1 (fr) Method and apparatus for multi-sensory speech enhancement
US6253175B1 (en) Wavelet-based energy binning cepstal features for automatic speech recognition
EP1500087B1 (fr) On-line parametric histogram normalization for noise-robust speech recognition
US7181390B2 (en) Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
JP3154487B2 (ja) Method of performing spectral estimation to improve noise robustness in speech recognition
EP1569422A2 (fr) Multi-sensory speech enhancement method and device for a mobile terminal
US7027979B2 (en) Method and apparatus for speech reconstruction within a distributed speech recognition system
US20060253285A1 (en) Method and apparatus using spectral addition for speaker recognition
WO2002029782A1 (fr) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
WO2007041789A1 (fr) Front-end processing of speech signals
CN108922543A (zh) Model library establishment method, speech recognition method, apparatus, device and medium
CN108053842B (zh) Shortwave speech endpoint detection method based on image recognition
US20110066426A1 (en) Real-time speaker-adaptive speech recognition apparatus and method
Siam et al. A novel speech enhancement method using Fourier series decomposition and spectral subtraction for robust speaker identification
US20020010578A1 (en) Determination and use of spectral peak information and incremental information in pattern recognition
EP1465153A2 (fr) Method and apparatus for locating formants using a residual model
KR20170088165A (ko) Deep neural network based speech recognition method and apparatus
El-Henawy et al. Recognition of phonetic Arabic figures via wavelet based Mel Frequency Cepstrum using HMMs
Faycal et al. Comparative performance study of several features for voiced/non-voiced classification
Ondusko et al. Blind signal-to-noise ratio estimation of speech based on vector quantizer classifiers and decision level fusion
AU2006301933A1 (en) Front-end processing of speech signals
Park et al. Noise robust feature for automatic speech recognition based on mel-spectrogram gradient histogram.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2006301933

Country of ref document: AU

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2006301933

Country of ref document: AU

Date of ref document: 20061011

Kind code of ref document: A

WWP Wipo information: published in national office

Ref document number: 2006301933

Country of ref document: AU

122 Ep: pct application non-entry in european phase

Ref document number: 06790368

Country of ref document: EP

Kind code of ref document: A1
