
WO2007041789A1 - Front-end processing of speech signals - Google Patents

Front-end processing of speech signals

Info

Publication number
WO2007041789A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
frames
speech
speech signal
noise
Prior art date
Application number
PCT/AU2006/001498
Other languages
French (fr)
Inventor
Eric Choi
Julien Epps
Original Assignee
National Ict Australia Limited
Priority date
Filing date
Publication date
Application filed by National Ict Australia Limited
Priority to AU2006301933A1
Publication of WO2007041789A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention concerns a method of processing speech signals, such as processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system. The method comprises dividing the speech signal into a sequence of frames (32) by calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing. The frames are then frequency filtered and a noise estimate is derived for each frequency band of the speech signal. The frequency filtered outputs are weighted (38) with a function derived from the filtered outputs and the noise estimates to emphasise outputs that are less affected by noise. The invention also concerns software and a computer system to perform the method of processing speech signals.

Description

Title
FRONT-END PROCESSING OF SPEECH SIGNALS
Technical Field
The invention concerns a method of processing speech signals. For example, but not limited to, processing speech signals as part of a speech recognition, speaker verification, speech enhancement or speech coding system. The invention also concerns software and a computer system to perform the method of processing speech signals.
Background Art
Speech recognition systems comprise a few basic components, as shown in Figure 1. The 'front-end' 10 is the first stage of the system and uses signal processing techniques to derive a compact representation of a frame (or single short segment) of speech. When the input speech is corrupted with some kind of environmental noise, three broad classes of techniques can be used to improve the overall recognition performance:
• Signal-space techniques: These operate on the speech signal before it is input to the front-end. Typically these techniques are not very effective and can be very computationally expensive.
• Feature-space techniques: Most current literature on robust speech recognition focuses on methods to derive features that both compactly describe the speech signal and are relatively invariant when noise is present in the input speech.
• Model-space techniques: These are based in the back-end 12, and can produce some useful improvement, however they are still limited by the quality of the features being received from the front-end.
A common problem with the practical deployment of speech recognition solutions is that their performance rapidly degrades to unacceptably low recognition accuracies in the presence of virtually any kind of background noise. Existing solutions partially address the problem, however in general they are either too slow and/or do not offer the best available performance.
Summary of the Invention
In a first aspect the invention provides a method for front-end processing of speech signals comprising the following steps: dividing the speech signal into frames; filtering the frames of the speech signal into frequency bands to produce filtered outputs for each frame; and deriving a noise estimate for each frequency band of the speech signal and weighting the filtered outputs of each frame with a function derived from the filtered outputs and noise estimates to emphasise outputs that are less affected by noise.
This invention provides good performance in recognising speech spoken in noisy environments at a reduced processing load, making deployment in many practical situations (e.g. handheld devices) feasible.
The frequency filtering is based on the Mel-scale frequency.
The step of dividing the speech signal into frames may comprise calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing, such as for feature extraction of the speech signal.
The method may further comprise the steps of: applying a discrete cosine transformation to frames having weighted frequency filtered outputs; and mapping the discrete cosine transformed frames to a predetermined probability density function and disregarding the mapped frames from a tail region of the distribution in further processing of the speech signal.
Selecting the frame may be based on the position of the previous frame relative in time. The frames may be in a sequence of time order. The predetermined time window may have a predetermined minimum time and maximum time that the candidate frames start at. The given time window may be based on the position of the previous frame. The time differences between candidate frames may be predetermined. Two or more of the selected frames of the speech signal may overlap in time. The predetermined criteria may be that the candidate frame has the optimum energy function value; this may be a minimum or a maximum value. Further, the predetermined criteria may be that the candidate frame has the largest absolute difference in energy function value from the previous frame. The energy function may be the log energy of the frame. The energy of each candidate frame is dependent upon the energy value of a previous candidate frame, which makes the calculation of the energy values of all the candidate frames computationally inexpensive.
The noise estimate of the speech signal is determined from part of the speech signal that does not include any speech. The filtered outputs of a frame comprise a magnitude value for every frequency band in the frame and the filtered noise estimates comprise a magnitude value for every frequency band in the noise estimate. The noise estimate may be derived for each frame.
The step of weighting the filtered outputs may comprise subtracting from the magnitude value of a frequency band of a filtered frame, the value of the magnitude of the noise estimate in that frequency band.
The function used for weighting may include a first weighting factor that is dependent on the filtered outputs and noise estimates of multiple frequency bands that may be based on a ratio of the Signal-to-Noise Ratio of a particular filterbank output to the sum of the Signal-to-Noise Ratios of all the filterbank outputs. The step of deriving the noise estimate for each frequency band may be derived dynamically for each frame.
The function used for weighting includes a second weighting factor that is dependent on the filtered output and noise estimate at a particular filterbank, but independent of noise estimates at other multiple frequency bands.
The step of weighting frequency filtered outputs may comprise scaling the magnitude value of a frequency filtered frame and adding an offset to the scaled value.
The step of weighting frequency filtered frames after logarithmic compression may comprise weighting by a function that is dependent on the signal-to-noise ratios of frequency filtered outputs at multiple frequency bands. The weighting function is calculated dynamically for different frames of speech.
The step of removing from the mapped distribution frames from the tail region may be from the left tail region. The method may be performed at the front-end of a speech recognition system. This has the distinct advantage of being cepstral-based, meaning that it fits well into the paradigm of distributed speech recognition (DSR), where international standards are available for leveraging the application of speech recognition on mobile and handheld devices.
All methods described herein can apply equally to any other speech and multimodal processing applications, in particular speaker verification (for biometric applications), speech coding and compression, speech enhancement and speech recognition.
In a further aspect the invention provides a method of dividing a speech signal into frames for further front-end processing of the speech signals, the method comprising the following steps: calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
In yet a further aspect, the invention provides software to perform the method described above.
In yet a further aspect the invention provides a computer system programmed to perform the method described above. The computer system may be a distributed system.
In a further aspect, the invention provides software to perform the method described in any one of the preceding claims.
Brief Description of the Drawings
An example of the invention will now be explained with reference to the following drawings, in which:
Figure 1 is a block diagram of a speech recognition system (prior art);
Figure 2 is a flowchart of the method of processing speech signals;
Figure 3 schematically shows candidate frames within a given time window;
Figure 4 shows the calculated Mel-filterbank values for a frame;
Figure 5 is a schematic diagram of a computer system used for pre-processing of speech signals in accordance with the present invention; and
Figure 6 shows graphically some experimental results of the invention.
Best Modes of the Invention
In the following example, the processing of speech signals by the 'front-end' of a speech recognition system will now be described in reference to Fig. 2.
A speech signal y(t) 30 is provided as input to a 'front-end' of the speech recognition system. Pre-processing and dividing the signal into frames (i.e. framing) is then applied to the speech signal y(t) according to an energy search variable frame rate (VFR) analysis 32.
Energy Search and VFR analysis will now be explained with reference to Fig. 3. Energy Search and VFR analysis seeks to estimate the optimum relative position of the next frame of speech k in time by maximizing the difference in log energy between the current frame and possible next frame.
So given the position of the current frame 50, the optimum position of the next frame (relative in time) is determined based on an energy search. It is predetermined that the next frame will start somewhere in a time window defined by Kmin 52 and Kmax 54. The time increment between possible candidate frames is also predetermined. An energy search is then conducted on the time interval between Kmin 52 and Kmax 54 by calculating the energy E_{l+1}(k) of each candidate frame that starts between Kmin 52 and Kmax 54 in order to determine the next frame of speech k̂.
The energy search is performed according to:

k̂ = arg max_{Kmin ≤ k ≤ Kmax} | log E_{l+1}(k) − log E_l |        (1)

where k is the candidate frame advance relative to the current frame position in samples, E_l is the energy value of the current frame, E_{l+1}(k) is the energy value of the next frame, l is the frame index, and Kmin and Kmax are the minimum and maximum admissible values of frame advance in samples.
Here, energy is calculated according to the usual formula, except that the energy of the next frame is dependent upon the candidate frame advance k, i.e.

E_{l+1}(k) = Σ_{n=1}^{N} x²(n + k),        (2)

where N is the number of sample points in a frame, x(n) are the speech samples and k is defined relative to the beginning of the current (l-th) frame, as shown in Figure 3.
In order to achieve good computational efficiency in calculating the Kmax − Kmin + 1 sample-by-sample next frame candidate energies {E_{l+1}(k) | Kmin ≤ k ≤ Kmax}, these are calculated in a pipelined manner, taking advantage of the fact that

E_{l+1}(k + 1) = E_{l+1}(k) + x²(N + k + 1) − x²(k + 1).        (3)
The efficient computation of all possible candidate frame energies allows a very fine-grained search, which cannot easily be replicated by existing Euclidean MFCC distance and entropy-based MFCC VFR methods. Since in this scheme the frame advance k is selected by search, the main design consideration is the setting of the predefined search interval limits Kmin and Kmax.
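The search in equations (1)-(3) is straightforward to implement. The sketch below, in Python with NumPy, is a minimal illustration only: the function name, the parameter defaults, the 0-based indexing and the small epsilon guarding the logarithms are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np

def select_next_frame_advance(x, frame_start, frame_len, k_min, k_max, eps=1e-12):
    """Energy-search VFR: pick the advance k (in samples) of the next frame that
    maximises |log E_{l+1}(k) - log E_l|, cf. equations (1)-(3)."""
    x = np.asarray(x, dtype=float)
    seg = x[frame_start:]                        # samples relative to the current frame start
    k_max = min(k_max, len(seg) - frame_len)     # keep candidate frames inside the signal
    e_curr = np.sum(seg[:frame_len] ** 2)        # E_l, energy of the current frame

    # Energy of the first candidate frame, then pipelined updates as in equation (3).
    e_cand = np.sum(seg[k_min:k_min + frame_len] ** 2)
    best_k = k_min
    best_score = abs(np.log(e_cand + eps) - np.log(e_curr + eps))
    for k in range(k_min + 1, k_max + 1):
        e_cand += seg[k + frame_len - 1] ** 2 - seg[k - 1] ** 2   # recursive update (eq. 3)
        score = abs(np.log(e_cand + eps) - np.log(e_curr + eps))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```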
Equation (1) might alternatively comprise some other function of energy, i.e.
k̂ = arg max_{Kmin ≤ k ≤ Kmax} f(..., E_{l−1}, E_l, E_{l+1}, ...).        (4)
The estimated next frame k̂ is usually the candidate frame k with the highest average amplitude of the speech signal within that frame when compared to the other candidate frames.
Once the next candidate frame k̂ has been estimated, Energy Search and VFR analysis (1) is repeated with the next candidate frame k̂ now being the current frame position (l-th).
For each of the frames estimated using Energy Search and VFR analysis 32, a Fast Fourier Transform (FFT) and Mel-frequency filtering are applied; that is, each frame is filtered by different frequency filters to produce, for each frequency band j in the frame, a magnitude (amplitude) value Yj. This is shown in Fig. 4 where for a frame of speech 60, the corresponding calculated Mel-Filtered frame 62 is also shown.
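For illustration, a minimal NumPy sketch of this step is given below. The Hamming window, FFT size and number of filters are common defaults assumed here, not values specified in the patent, and the triangular Mel filterbank is built in the usual textbook manner.

```python
import numpy as np

def mel_filterbank_outputs(frame, fs, n_fft=256, n_mels=23):
    """Magnitude FFT of one frame followed by triangular Mel filtering; returns Y_j."""
    frame = np.asarray(frame, dtype=float)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))

    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    Y = np.zeros(n_mels)
    for j in range(n_mels):
        lo, ctr, hi = bin_edges[j], bin_edges[j + 1], bin_edges[j + 2]
        for b in range(lo, hi):
            w = (b - lo) / max(ctr - lo, 1) if b < ctr else (hi - b) / max(hi - ctr, 1)
            Y[j] += w * spec[b]                  # weighted sum = filterbank output magnitude
    return Y
```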
Next, Mel-filtering Output Weighting 38 is applied to Yj for each estimated frame. Mel-filtering output weighting aims to adjust the Yj values in order to compensate for noise and improve the quality of the speech signal. The result is that, for the output magnitude of each j-th Mel filterbank, an enhanced value mj is determined.
To do this Mel Filtering is also applied to the noise signal (i.e. part of the speech signal that does not include any speech) to determine a noise estimate Nj for each frequency band. This is then taken into consideration together with the corresponding magnitude value Yj and a weighting factor αj that is dependent upon the Signal-to-Noise Ratios of the other filterbank outputs (i.e. a weighting factor that is dependent on the Mel-filtering index j). The enhanced value mj better emphasises those filterbank outputs which are found to be more reliable and less affected by the actual noise signal spectral characteristics.
The noise robustness of the proposed front-end is enhanced by compensating the Mel-filterbank outputs based on the noise spectral characteristics. In this work, an enhanced log Mel-filterbank output (mj) is given by:

m_j = α_j · log( MAX[ Y_j − γ·N_j , β_j·Y_j ] ),        (5)
where O1-, βj, % all e (0,1) are parameters to adjust the noise compensation, Yj is the output magnitude of the j-th Mel-filterbank, Nj is the noise estimate of the y'-th MeI- filterbank output and MAX[.] is a function which returns the maximum value of its arguments.
Note that # is used to control the degree of noise spectral subtraction and βj is used to control the degree of spectral flooring. Here, both % and βj are assumed to be independent of the Mel-filterbank index j as we are more interested in the log Mel- filterbank output weighting and this assumption can simplify the formulation. Also these two parameters are applied globally in that they have the same values for all the speech utterances. While βj and γj are currently being determined empirically, it however automatic learning of the appropriate parameter values from training data is possible by using a gradient-descent algorithm.
One way to measure the reliability of a filterbank output is the signal-to-noise ratio (SNR). From the viewpoint of psychoacoustics, these weighting factors (αj) are related to the spectral compression process that converts sound intensity into loudness as perceived by humans. So far in the literature, each of the weighting factors has been assumed to be dependent on its individual output SNR only. However, in our case, a weighting factor is also dependent on the SNRs of other filterbank outputs and it is given by:
α_j = log(1 + Y_j / N_j) / Σ_{i=1}^{M} log(1 + Y_i / N_i),        (6)
where M is the number of Mel-filters in the frequency analysis. The constant "1" is added to the log function to prevent it from having negative values since there may be errors in the noise estimates. In essence, αj is basically calculated as the ratio of the SNR of a particular filterbank output to the sum of the SNRs of all the filterbank outputs. Moreover, in this case, all the weighting factors are calculated frame-by-frame dynamically based on the noise estimates, e.g. from the first 10 frames of each speech utterance. Further enhancement to the noise estimates can be applied that updates the estimates dynamically based on online speech/non-speech classification of each frame of data.
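A compact NumPy sketch of the weighting of equations (5) and (6) follows. The parameter values for gamma and beta are placeholders only (the patent states they lie in (0,1) and are determined empirically), and the reconstructed form of equation (5) used here is this sketch's reading of the text rather than a quoted formula.

```python
import numpy as np

def weight_filterbank_outputs(Y, N_est, gamma=0.9, beta=0.01):
    """Enhanced log Mel-filterbank outputs m_j, cf. equations (5)-(6).

    Y     -- filterbank output magnitudes of the current frame (Y_j)
    N_est -- noise estimates for the same channels (N_j)
    gamma -- degree of noise spectral subtraction (assumed value)
    beta  -- degree of spectral flooring (assumed value)
    """
    Y = np.asarray(Y, dtype=float)
    N_est = np.asarray(N_est, dtype=float)

    # Weighting factors alpha_j (eq. 6): per-channel log-SNR, normalised by the sum
    # over all M channels; the "1 +" keeps each term non-negative.
    snr_term = np.log(1.0 + Y / np.maximum(N_est, 1e-12))
    alpha = snr_term / np.sum(snr_term)

    # Eq. (5): spectral subtraction with flooring, then log compression and weighting.
    floored = np.maximum(Y - gamma * N_est, beta * Y)
    return alpha * np.log(np.maximum(floored, 1e-12))
```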
Next, a Discrete Cosine Transform (DCT) 40 is applied to mj in a linear form. Cumulative Distribution Mapping 42 is then applied to the enhanced output features according to the mapped distribution.
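The DCT step can be written in one line with SciPy; keeping the first 13 coefficients (including c0) is the usual MFCC convention and is an assumption here, not a number given in this passage.

```python
import numpy as np
from scipy.fftpack import dct

def cepstra_from_filterbank(m, n_ceps=13):
    """DCT de-correlation of the enhanced log filterbank outputs (step 40)."""
    return dct(np.asarray(m, dtype=float), type=2, norm='ortho')[:n_ceps]
```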
The main idea of this method 42 is to map the distribution of the noisy speech features into a target distribution with a pre-defined probability density function (PDF). In our case, it is assumed that for a given feature value v0, the mapping relationship is:

∫_{−∞}^{v0} f(v) dv = ∫_{−∞}^{z0} h(z) dz,        (7)
or
F_v(v0) = F_z(z0),        (8)
where F_v(v) is the corresponding cumulative distribution function (CDF) of a given set of speech features, F_z(z) is the target CDF, and f(v) and h(z) are the respective PDFs. From equation (8), we have
z0 = F_z⁻¹( F_v(v0) ).        (9)
Therefore, the required mapping from a given speech feature v0 into the corresponding target feature z0 is represented by equation (9). In our work, the target PDF h(z) is assumed to be a Gaussian with zero mean and unity variance. We also use the following formula to estimate F_v(v0):

F_v(v0) ≈ Ω / T,        (10)
Ω = Count{ v < v0 },        (11)

i.e. Ω is the number of frames whose corresponding feature values are less than a particular v0 in an utterance and T is the total number of frames in the utterance.
While the nonlinear mapping of CDM can be made equivalent to that obtained by using histogram equalization, there is no particularly strong reason, other than easier implementation, that one has to assume h(z) to be Gaussian as in the literature of histogram equalization.
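A sketch of equations (7)-(11) for one cepstral coefficient track is shown below; the half-sample offset that keeps the empirical CDF strictly inside (0, 1) is a numerical safeguard added for this sketch, and the left-tail truncation introduced in the following paragraphs is omitted here.

```python
import numpy as np
from scipy.stats import norm

def cdm_normalize(c):
    """Cumulative distribution mapping of one coefficient track onto a zero-mean,
    unit-variance Gaussian target, cf. equations (7)-(11).

    c -- values of one cepstral coefficient over the T frames of an utterance.
    """
    c = np.asarray(c, dtype=float)
    T = len(c)
    omega = np.array([np.sum(c < v0) for v0 in c])   # eq. (11): frames with value < v0
    F_v = (omega + 0.5) / T                          # eq. (10), kept strictly in (0, 1)
    return norm.ppf(F_v)                             # eq. (9): z0 = Fz^-1(Fv(v0))
```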
In this invention, a truncated Gaussian is used as the target distribution. In fact, we have observed that the left tail region of the distribution of a normalized feature may not be that useful, as it represents mainly the range of more noisy features.
Mathematically, the additional constraint is specified by:
z0 = F_z⁻¹( F_v(v0) )  if F_v(v0) > θ_th,  and  z0 = SKIP  otherwise,        (12)
where θ_th is a constant that determines the fraction of features to be truncated, and "SKIP" denotes a function that skips the current frame of speech data and does not output any feature value. In the current implementation, we perform the skipping of a whole feature vector based only on c0 (the zeroth-order cepstral coefficient) as it indicates the energy level of a frame of speech data.
A block diagram of one embodiment for a hardware implementation is shown in Figure 5. The front-end 80 includes an analog-to-digital (A/D) converter 84, a storage buffer 86, a central processing unit 88, a memory module 90, a data output interface 92 and a system bus 94. The memory module is a computer readable storage medium. In the preferred embodiment, memory module 90 stores the hardwired or programmable code for a pre-processing and framing module 96, an energy search variable frame rate (VFR) analysis module 98, a fast Fourier transform (FFT) and Mel-filtering module 100, a Mel-output weighting module 102, a discrete cosine transform (DCT) module 104 and a cumulative distribution mapping (CDM) module 106.
When front-end 80 is in operation, A/D converter 84 receives an input speech signal 82 and converts the signal into digital data. The digital speech data are then stored in buffer 86 which in turn provides the data to various component modules via system bus 94. CPU 88 accesses the speech data on system bus 94 and then processes the data according to the software instructions stored in memory module 90. Following the instructions stored in pre-processing and framing module 96, CPU 88 filters the digital speech data and then divides the chunk of speech data into frames, preferably 25 ms long and with an overlap of 10 ms.
For each frame of speech data, CPU 88 decides if a frame should be discarded according to the instructions and calculations stored in VFR analysis module 98 (step 32). If the speech frame is retained, CPU 88 estimates the frequency contents of this frame of speech data according to the instructions stored in FFT and Mel-filtering module 100 (step 36). CPU 88 then weights each Mel-filter output according to the instructions stored in Mel-output weighting module 102 (step 38) and correspondingly applies de-correlation processing according to the instructions and parameters stored in DCT module 104 (step 40) to generate MFCCs. Finally CPU 88 normalizes the MFCCs according to the instructions stored in CDM module 106 (step 42) and instructs output interface 92 to transmit the enhanced features, preferably wirelessly, to backend 12 where pattern matching occurs. CPU 88 repeats the above processing steps until all speech frames have been processed and instructs output interface 92 to send a control signal to backend 12 to indicate the completion of front-end processing. Backend 12 responds to this control signal and sends off recognition results 110 to an application.
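Purely to illustrate the control flow just described, the sketch below chains the helper functions introduced in the earlier sketches (select_next_frame_advance, mel_filterbank_outputs, weight_filterbank_outputs, cepstra_from_filterbank and cdm_normalize); those names, the ten-frame noise estimate and the omission of any online speech/non-speech refinement are assumptions of this sketch rather than details mandated by the patent.

```python
import numpy as np

def frontend_features(x, fs, frame_len, k_min, k_max, noise_frames=10, n_ceps=13):
    """End-to-end front-end sketch: VFR framing, Mel filtering, output weighting,
    DCT and utterance-level CDM, mirroring steps 32, 36, 38, 40 and 42."""
    x = np.asarray(x, dtype=float)

    # Initial noise estimate N_j from the first few frames, assumed speech-free.
    noise = np.mean([mel_filterbank_outputs(x[i * frame_len:(i + 1) * frame_len], fs)
                     for i in range(noise_frames)], axis=0)

    feats, start = [], 0
    while start + frame_len + k_max < len(x):
        frame = x[start:start + frame_len]
        Y = mel_filterbank_outputs(frame, fs)              # step 36: FFT + Mel filtering
        m = weight_filterbank_outputs(Y, noise)            # step 38: output weighting
        feats.append(cepstra_from_filterbank(m, n_ceps))   # step 40: DCT -> static cepstra
        start += select_next_frame_advance(x, start, frame_len, k_min, k_max)  # step 32

    feats = np.array(feats)
    # Step 42: CDM applied per coefficient over the whole utterance.
    return np.column_stack([cdm_normalize(feats[:, i]) for i in range(feats.shape[1])])
```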
This method is used at the 'front-end' of speech recognition, speech coding and speech identification systems. The likely end users are current users of speech recognition or speaker verification technology, particularly people whose occupation or physical abilities require or are greatly enhanced by this technology. Likely end users also include users of telephone spoken dialog systems, particularly those who regularly phone in from noisy environments, and users of other related handheld devices that use speech recognition.
This front-end method described above (i.e. the portion to the left of the dotted line in Fig. 1 marked 'front-end' 10) could be implemented with good computational efficiency on a standard mobile phone processor (e.g. an XScale processor), for example. The system could optionally be distributed for example by performing the back-end operations (i.e. the portion to the right of the dotted line in Fig. 1 marked 'back-end' 12) on a central server, so that the complex back-end computations were not performed by the mobile phone processor. In this example, the features extracted by the front-end on the phone would be sent via a wireless network protocol to a central telephone exchange, where the back-end server was located.
The cepstral based front-end is by far the most popular choice in the field of speech recognition, and that is the main reason why ETSI adopted this type of front-end into its standards. There are two relevant baseline systems for comparison: the ETSI standard MFCC front-end, known as the 'standard' front-end, and the ETSI 'Advanced Front-End'. The latter provides substantially improved performance over the ETSI standard front-end, with a very large (threefold) increase in computational complexity. This configuration is indicative of the state of the art in robust speech recognition at very high complexity.
When implemented individually in conjunction with the ETSI standard MFCC front-end, each of the above methods 32, 38, 42 produces improvements over ETSI. The combined addition of the above methods 32, 38 and 42 creates only a very slight increase in computational complexity over the ETSI standard configuration.
When the above methods 32, 38 and 42 are implemented together in conjunction with the ETSI standard front-end, the combined system produces substantial improvements over ETSI, closely approaching the performance of the advanced front-end. The combined system creates only a slight increase in computational complexity over the ETSI configuration. To summarize, the above method provides speech recognition accuracies exceeding the ETSI standard MFCC front-end at reasonable complexity, and very closely approaching the state of the art (see Table 1) for a reduction in computational complexity of at least threefold. Note that the Aurora noisy digit database is a standard, very large, difficult recognition task used widely in the research community.
Table 1: Average digit accuracies (%) for Aurora test sets, proposed front-end compared with ETSI MFCC front-ends, clean HMM set
* Measured in percentage of relative error rate reduction for the proposed front-end.
In addition, the processing blocks of the proposed front-end can be distributed across different physical locations depending on the requirements of an application. One example is to have the static cepstral coefficients generated on the client side and then apply CDM on the server side when receiving the static features. A variation on the new energy search VFR method was given in equation (4) and discussed above.
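The client/server split mentioned here could look like the following sketch; the JSON payload, the function names and the reuse of cdm_normalize from the earlier sketch are illustrative choices, and the patent states only that the static cepstra are produced on the client and CDM is applied on the server.

```python
import json
import numpy as np

def pack_static_cepstra(cepstra):
    """Client side: serialise the static cepstral coefficients (one row per retained
    frame) for transmission over a DSR channel; JSON is just an example format."""
    return json.dumps(np.asarray(cepstra).tolist())

def unpack_and_normalize(payload):
    """Server side: recover the static features and apply CDM to each coefficient
    track (cdm_normalize as sketched earlier) before recognition."""
    cepstra = np.array(json.loads(payload))
    return np.column_stack([cdm_normalize(cepstra[:, i])
                            for i in range(cepstra.shape[1])])
```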
Compared with signal-space and model-space methods for noisy speech recognition, the proposed front-end has the following advantages:
• Improvement in accuracy can be better than the other types of methods, but with less additional computational load, particularly for complex recognition systems.
• The same front-end can be plugged into different systems without major modification.
• The methods fit easily into the paradigm of distributed speech recognition (DSR), where international standards are available for leveraging the application of speech recognition on mobile and handheld devices.
Compared with the ETSI standard MFCC front-end, the proposed front-end is found to be much more robust and there is only a slight increase in computation. Some benchmarking results using the Aurora noisy digit database, grouped by signal-to-noise ratio (SNR), are shown in Figure 6. The ETSI advanced front-end represents the current state of the art in the field of noise robustness. However, the advanced front-end has a high computational load. The proposed front-end has a much lighter computational load and is very nearly as noise robust as the advanced front-end (see Table 1).
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described.
For example, any suitable frequency scale known in the art can be used for filtering, such as the Bark frequency scale or a linear frequency scale.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

Claims
1. A method for front-end processing of speech signals comprising the following steps: dividing the speech signal into frames; filtering the frames of the speech signal into frequency bands to produce filtered outputs for each frame; and deriving a noise estimate for each frequency band of the speech signal and weighting the filtered outputs of each frame with a function derived from the filtered outputs and noise estimates to emphasise outputs that are less affected by noise.
2. A method according to claim 1, wherein the noise estimate of the speech signal is determined from part of the speech signal that does not include any speech.
3. A method according to claim 1 or 2, wherein the filtering into frequency bands is based on the Mel-scale frequency.
4. A method according to claim 1, 2 or 3, wherein the filtered outputs of a frame comprise a magnitude value for every frequency band in the frame and the filtered noise estimates comprise a magnitude value for every frequency band in the noise estimate.
5. A method according to any one of the preceding claims, wherein the function used for weighting includes a first weighting factor that is dependent on filtered outputs and noise estimates of multiple frequency bands.
6. A method according to claim 5, wherein the first weighting factor is based on a ratio of the Signal-to-Noise Ratio of a particular filterbank output to the sum of the Signal-to-Noise Ratios of all the filterbank outputs.
7. A method according to any one of the preceding claims, wherein the step of deriving the noise estimate for each frequency band is derived dynamically for each frame.
8. A method according to any one of the preceding claims, wherein the function used for weighting includes a second weighting factor that is dependent on the filtered output and noise estimate at a particular filterbank, but independent of noise estimates at other multiple frequency bands.
9. A method according to any one of the preceding claims, wherein the step of weighting the filtered outputs comprises subtracting from the magnitude value of a frequency band of a filtered frame, the value of the magnitude of the noise estimate at that frequency band.
10. A method according to any one of the preceding claims, wherein the step of weighting the filtered outputs comprises scaling the magnitude value of a frequency band of a filtered frame and adding an offset to the scaled value.
11. A method according to any one of the preceding claims, wherein the weighting function is calculated dynamically for different frames of speech.
12. A method according to any one of the preceding claims, wherein the step of dividing the speech signal into frames comprises calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
13. A method according to claim 12, wherein selecting the frame is based on the position of the previous frame relative in time.
14. A method according to claim 12 or 13, wherein the predetermined time window has a predetermined minimum time and predetermined maximum time that the candidate frames start at or end between.
15. A method according to any one of claims 12 to 14, wherein two or more of the selected frames of the speech signal overlap in time.
16. A method according to any one of claims 12 to 15, wherein the predetermined criteria comprise the candidate frame that has the optimum energy function value.
17. A method according to any one of claims 12 to 16, wherein the predetermined criteria comprise the candidate frame that has the largest absolute difference in energy function value from the previous frame.
18. A method according to any one of the preceding claims, wherein the method further comprises the steps of: applying a discrete cosine transformation to frames having weighted frequency filtered outputs; and mapping the discrete cosine transformed frames to a predetermined probability density function and disregarding the mapped frames in a tail region of the distribution in further processing of the speech signal.
19. A method according to claim 18, wherein the step of removing from the mapped distribution frames from the tail region is from the left tail region.
20. A method according to any one of the preceding claims, wherein the processed speech signal is used for feature extraction of the speech signal.
21. A method of dividing a speech signal into a sequence of frames for further front- end processing of the speech signals, the method comprising the following steps: calculating an energy function value of candidate frames within a predetermined time window and selecting a candidate frame that has an energy function value that meets predetermined criteria to be a frame for further processing.
22. A method according to claim 21, wherein selecting the frame is based on the position of the previous frame relative in time.
23. A method according to claim 21 or 22, wherein the predetermined time window has a predetermined minimum time and predetermined maximum time that the candidate frames start at.
24. A method according to any one of claims 21 to 23, wherein two or more of the selected frames of the speech signal overlap in time.
25. A method according to any one of claims 21 to 24, wherein the predetermined criteria comprise the candidate frame that has the optimum energy function value.
26. A method according to any one of claims 21 to 25, wherein the predetermined criteria comprise the candidate frame that has the largest absolute difference in energy function value from the previous frame.
27. A method according to any one of claims 21 to 26, wherein the sequence of frames of the speech signal is further processed for feature extraction of the speech signal.
28. A method for front-end processing of speech signals comprising the following steps: dividing the speech signal into a sequence of frames;
Mel-frequency filtering the frames of the speech signal to produce Mel-frequency filtered outputs for each frame; and deriving a noise estimate for each Mel-scale frequency band of the speech signal and weighting the Mel-frequency filtered outputs of each frame with a function derived from the filtered outputs and the Mel-scale noise estimates to emphasise outputs that are less affected by noise.
29. Software that, when installed on a computer readable storage medium of a computer, operates in accordance with the method described in any one of the preceding claims.
30. A computer system for front-end processing of speech signals comprising: a computer readable storage medium; a processor; and software installed on the storage medium operable to perform the method according to any one of the preceding claims using the processor.
31. A computer system according to claim 28, wherein the computer system is a distributed system.
32. A computer system according to claim 28, wherein the computer system is a speech recognition system.
PCT/AU2006/001498 2005-10-11 2006-10-11 Front-end processing of speech signals WO2007041789A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2006301933A AU2006301933A1 (en) 2005-10-11 2006-10-11 Front-end processing of speech signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU200505604 2005-10-11
AU00505604 2005-10-11

Publications (1)

Publication Number Publication Date
WO2007041789A1 true WO2007041789A1 (en) 2007-04-19

Family

ID=37942224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2006/001498 WO2007041789A1 (en) 2005-10-11 2006-10-11 Front-end processing of speech signals

Country Status (1)

Country Link
WO (1) WO2007041789A1 (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4630304A (en) * 1985-07-01 1986-12-16 Motorola, Inc. Automatic background noise estimator for a noise suppression system
US5406635A (en) * 1992-02-14 1995-04-11 Nokia Mobile Phones, Ltd. Noise attenuation system
US6230122B1 (en) * 1998-09-09 2001-05-08 Sony Corporation Speech detection with noise suppression based on principal components analysis
US6826528B1 (en) * 1998-09-09 2004-11-30 Sony Corporation Weighted frequency-channel background noise suppressor
US6411925B1 (en) * 1998-10-20 2002-06-25 Canon Kabushiki Kaisha Speech processing apparatus and method for noise masking

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2450886A (en) * 2007-07-10 2009-01-14 Motorola Inc Voice activity detector that eliminates from enhancement noise sub-frames based on data from neighbouring speech frames
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
US8909522B2 (en) 2007-07-10 2014-12-09 Motorola Solutions, Inc. Voice activity detector based upon a detected change in energy levels between sub-frames and a method of operation
WO2010075789A1 (en) * 2008-12-31 2010-07-08 华为技术有限公司 Signal processing method and apparatus
CN101770775B (en) * 2008-12-31 2011-06-22 华为技术有限公司 Signal processing method and device
RU2688259C2 (en) * 2014-04-29 2019-05-21 Хуавэй Текнолоджиз Ко., Лтд. Method and device for signal processing
US11081121B2 (en) 2014-04-29 2021-08-03 Huawei Technologies Co., Ltd. Signal processing method and device
US12249339B2 (en) 2014-04-29 2025-03-11 Huawei Technologies Co., Ltd. Signal processing method and device
RU2656812C2 (en) * 2014-04-29 2018-06-06 Хуавэй Текнолоджиз Ко., Лтд. Signal processing method and device
US10186271B2 (en) 2014-04-29 2019-01-22 Huawei Technologies Co., Ltd. Signal processing method and device
WO2015165264A1 (en) * 2014-04-29 2015-11-05 华为技术有限公司 Signal processing method and device
US10347264B2 (en) 2014-04-29 2019-07-09 Huawei Technologies Co., Ltd. Signal processing method and device
US10546591B2 (en) 2014-04-29 2020-01-28 Huawei Technologies Co., Ltd. Signal processing method and device
US11881226B2 (en) 2014-04-29 2024-01-23 Huawei Technologies Co., Ltd. Signal processing method and device
US9837088B2 (en) 2014-04-29 2017-12-05 Huawei Technologies Co., Ltd. Signal processing method and device
US11580996B2 (en) 2014-04-29 2023-02-14 Huawei Technologies Co., Ltd. Signal processing method and device
CN108053837A (en) * 2017-12-28 2018-05-18 深圳市保千里电子有限公司 A kind of method and system of turn signal voice signal identification
CN111462757A (en) * 2020-01-15 2020-07-28 北京远鉴信息技术有限公司 Data processing method and device based on voice signal, terminal and storage medium
CN111462757B (en) * 2020-01-15 2024-02-23 北京远鉴信息技术有限公司 Voice signal-based data processing method, device, terminal and storage medium
CN118016049A (en) * 2022-11-10 2024-05-10 唯思电子商务(深圳)有限公司 A closed-loop OTP verification system based on voice verification code
CN117388835A (en) * 2023-12-13 2024-01-12 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method
CN117388835B (en) * 2023-12-13 2024-03-08 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method


Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase (Ref document number: 2006301933; Country of ref document: AU)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2006301933; Country of ref document: AU; Date of ref document: 20061011; Kind code of ref document: A)
WWP WIPO information: published in national office (Ref document number: 2006301933; Country of ref document: AU)
122 EP: PCT application non-entry in European phase (Ref document number: 06790368; Country of ref document: EP; Kind code of ref document: A1)

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载