WO1999027525A1 - Method of continuous language recognition - Google Patents
Method of continuous language recognition (published in French as "Procede de reconnaissance vocale continue")
- Publication number
- WO1999027525A1 (PCT/SG1997/000061; SG9700061W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- syllable
- lattice
- character
- language
- signal
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/027—Syllables being the recognition units
Definitions
- This invention relates to a method of continuous language recognition.
- Recent proposals for automatic language recognition are mostly software-based and operate by generating a sequence of word hypotheses from an input language signal.
- Of the proposed techniques for speech recognition, the most popular are based on statistical methods, in which a continuous-time speech waveform is sampled and spectrally processed to produce a representation of the speech signal as a sequence of equally spaced discrete-time parameter vectors.
- The parameter vectors are then used to estimate the probability that the portions of the input waveform analysed correspond to particular phonetic events, based on comparison with a set of acoustic word models and a statistical language model, in order to determine the most likely sequence of words.
- A refinement of the general technique discussed above involves the use of a stack decoder.
- A speech signal to be processed is divided into a plurality of frames and represented by parameter vectors.
- The first frame vector is analysed against a number of theories which correspond to the speech which may be expected.
- The extended theories are ordered in time and, for each time, sorted according to the confidence or probability that the input up to that time is what is expected by the theory.
- The theories thus sorted form a stack corresponding to that frame, in order of probability.
- The probability that the input speech up to a frame matches a given theory is determined on the basis of two factors: acoustic confidence and contextual confidence.
- Acoustic confidence compares the input speech vectors up to that frame to what is expected by comparison with a Hidden Markov Model (HMM) for that theory, while contextual confidence compares the contents of the input frames to what is expected from a contextual point of view, that is to say by analysis of the linguistic context as a whole using a syntactic or statistical model.
- The confidence for that theory is then derived from a combination of these confidences by adding their log probabilities. It is a disadvantage of such a technique that the number of possible theories which must be examined for each frame is large, and even if the number of theories in the stack for any particular frame is pruned to the more likely alternatives, the technique requires high computational power.
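By way of illustration, the following minimal Python sketch orders a single frame's stack by the sum of acoustic and contextual log probabilities. The theory strings and probability values are invented for illustration and are not taken from the patent:

```python
import math

def combined_confidence(acoustic_prob: float, contextual_prob: float) -> float:
    """Combine acoustic and contextual confidence by adding log probabilities."""
    return math.log(acoustic_prob) + math.log(contextual_prob)

# A stack for one frame: theories sorted by combined log probability.
# (theory, acoustic probability, contextual probability) -- illustrative only.
theories = [
    ("ping2 guo3", 0.62, 0.40),
    ("pin2 gu3", 0.20, 0.05),
    ("bing1 guo3", 0.15, 0.10),
]
stack = sorted(theories, key=lambda t: combined_confidence(t[1], t[2]), reverse=True)
for theory, p_a, p_c in stack:
    print(f"{theory}: {combined_confidence(p_a, p_c):.3f}")
```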
- According to the invention, a method of continuous language recognition comprises the steps of: dividing a language signal into a plurality of components; analysing the components with reference to a plurality of syllable models to identify probable syllables in the language signal; storing the identified syllables in a syllable lattice; and analysing contiguous syllable combinations contextually.
- The invention is thus a two-stage technique: in the first stage, a lattice based upon syllable identification is developed, which is then, in a second stage, analysed contextually.
- The language signal is an acoustic speech signal, the speech signal being divided up into a plurality of frames, with the contextual analysis being performed using a stack decoder.
- The method uses a two-stage process.
- In the first stage, vectorial representations of the content of the speech frames forming an utterance are analysed syllabically, based only on determining acoustic confidence in respect of the utterance.
- The resulting identified probable syllable candidates are stored in a lattice with four attributes, namely the syllable, its start time, its end time and its acoustic confidence.
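One plausible in-memory representation of such a lattice is sketched below. The class and field names are assumptions rather than the patent's own notation; entries are indexed by end frame in anticipation of the backward traversal described later:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LatticeEntry:
    syllable: str          # e.g. "PING2" (tonal syllable identity)
    start_frame: int       # frame at which the syllable is hypothesised to begin
    end_frame: int         # frame at which the syllable is hypothesised to end
    log_confidence: float  # acoustic confidence as a log probability

# Index entries by end frame, which suits the backward lattice traversal
# described later in the specification.
lattice = defaultdict(list)

def add_to_lattice(entry: LatticeEntry) -> None:
    lattice[entry.end_frame].append(entry)

add_to_lattice(LatticeEntry("PING2", 0, 2, -3.1))
add_to_lattice(LatticeEntry("GUO3", 3, 5, -2.7))
```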
- The speech to be recognised is then analysed by looking at contiguous combinations of syllables in the lattice and comparing these to contextually based theories ordered by frame.
- The acoustic confidence is now obtained by examining the lattice, combining legal combinations of the syllables in the lattice according to a lexicon, and combining their likelihood, or confidence, scores.
- The theories are sorted using a stack decoder in order of confidence that the theory at that point in time is contextually correct. Once all combinations have been analysed, the theory at the top of the final stack will be that which provides the highest confidence, at the finishing point, that the theory corresponds to the utterance.
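Reusing the LatticeEntry sketch above, contiguous combinations might be enumerated as follows. Treating "contiguous" as meaning that one syllable starts on the frame immediately after the previous one ends is an assumption; a real system might tolerate small gaps or overlaps:

```python
def contiguous_pairs(entries):
    """Enumerate pairs of lattice syllables that are contiguous in time.

    Contiguity is taken here to mean that the second syllable starts on
    the frame immediately following the one on which the first syllable
    ends (an assumption, not the patent's definition).
    """
    by_start = {}
    for e in entries:
        by_start.setdefault(e.start_frame, []).append(e)
    for first in entries:
        for second in by_start.get(first.end_frame + 1, []):
            yield first, second

def pair_confidence(first, second) -> float:
    """Combined acoustic confidence of a legal pair: sum of log scores."""
    return first.log_confidence + second.log_confidence
```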
- The syllable lattice formation part of the method uses a Hidden Markov Model (HMM) for each syllable, each model having a plurality of states. The correspondence between the model states and each representation is analysed probabilistically using a Viterbi Algorithm, which finds the most probable temporal and spectral correspondence between the HMM and the input speech frames; this information is placed in the lattice, indexed principally by syllable end time.
- The contextual confidence of the syllabic combinations is then analysed backwards in time by traversing the lattice from the end of the speech to be recognised towards the start.
- An advantage of this embodiment of the invention is that the time taken to perform a recognition of a given accuracy is reduced. Using the method of this embodiment in a real-time recognition application, recognition can be performed with substantially less pruning than in prior art methods, thus increasing accuracy.
- The embodiments of the invention are particularly applicable to speech recognition of the Chinese language, in which words are defined by combinations of characters, the sound of each character being a single syllable, of which there are only 1,600, including tone.
- In another embodiment, the components of the language signal are visual rather than acoustic and may be phonetic visual representations, such as Hanyu Pin Yin keyboard entries of syllables.
- In a further embodiment, the components are graphical, for example Chinese characters input to a character recognition program.
- Figure 1 schematically illustrates a first embodiment of the method of the present invention.
- Figure 2 shows the use of a Viterbi Algorithm to find the best path through an HMM/speech-frame matrix.
- The first embodiment of the invention recognises an acoustic speech signal and has four main steps, namely: 1. dividing the speech signal into frames and representing the content of the frames vectorially as feature vectors; 2. analysing the feature vectors with reference to a plurality of syllable models to identify probable syllables; 3. storing the identified syllables in a syllable lattice; and 4. analysing contiguous syllable combinations in the lattice contextually.
- In step 1A, the speech, which in this example is the Mandarin two-syllable word PING2 GUO3, meaning "apple", is sampled and the samples divided into a number of frames F1-F6 of constant time interval.
- The speech signal in each frame is then represented vectorially in vectors V1-V6, choosing vector parameters to define spectral features of the frame signal.
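The patent does not specify the spectral parameters, so the following sketch simply takes the log power spectrum of windowed frames. The frame length and hop (25 ms frames every 10 ms at a 16 kHz sampling rate) are conventional values, not figures from the specification:

```python
import numpy as np

def speech_to_vectors(samples: np.ndarray, frame_len: int = 400,
                      hop: int = 160) -> np.ndarray:
    """Divide a sampled waveform into frames of constant time interval and
    represent each frame by a spectral feature vector."""
    window = np.hamming(frame_len)
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        vectors.append(np.log(power + 1e-10))  # small floor avoids log(0)
    return np.array(vectors)

# Example: one second of silence at 16 kHz yields 98 feature vectors.
print(speech_to_vectors(np.zeros(16000)).shape)
```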
- The resulting speech vectors V1-V6 are then compared to HMM syllabic models under operation of a recursive algorithm, such as the Viterbi Algorithm, in the manner well known in the art.
- The algorithm can be visualised as finding the best path through a trellis in which the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech F1-F6, as represented by the speech vectors V1-V6. Every dot in the picture represents the log probability of observing that frame at that time, and each line joining the dots corresponds to a log transition probability.
- The log probability of any path is computed by summing the log transition probabilities and the log output probabilities along that path.
- The Viterbi Algorithm computes, for each frame, the probabilities for extending all partial paths of the previous frame and, by the final frame, has identified the match between the HMM and the speech frames with the highest probability.
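A minimal log-domain sketch of this computation follows; the matrix layout and argument names are assumptions:

```python
import numpy as np

def viterbi_best_score(log_init: np.ndarray, log_trans: np.ndarray,
                       log_emit: np.ndarray) -> float:
    """Log probability of the best path through an HMM trellis.

    log_init[j]:     log probability of starting in state j.
    log_trans[i, j]: log probability of a transition from state i to j.
    log_emit[t, j]:  log probability of observing frame t in state j.
    """
    n_frames, _ = log_emit.shape
    # Best partial-path score ending in each state after the first frame.
    score = log_init + log_emit[0]
    for t in range(1, n_frames):
        # Extend all partial paths of the previous frame and keep,
        # for each state, only the best extension.
        score = np.max(score[:, None] + log_trans, axis=0) + log_emit[t]
    return float(np.max(score))
```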
- Each syllable so identified is placed in a lattice, as shown in Figure 1B, in which fourteen such syllables S1-S14 are shown.
- In practice, a lattice contains a large number of syllables of different start and end times.
- The speech contains a plurality of contiguous syllables.
- The lattice can therefore be analysed by considering only those combinations of syllables which are contiguous in time, as shown in Figure 1C, using a syllable pronunciation tree lexicon.
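A syllable pronunciation tree lexicon can be sketched as a prefix tree whose edges are syllables and whose nodes record completed words; the lexicon entries below are illustrative only:

```python
class SyllableTreeNode:
    """Node of a syllable pronunciation tree lexicon: each edge is labelled
    with a syllable, and nodes at which a word is complete record it."""
    def __init__(self):
        self.children = {}  # syllable -> SyllableTreeNode
        self.words = []     # words whose pronunciation ends at this node

def build_lexicon(entries):
    root = SyllableTreeNode()
    for word, syllables in entries:
        node = root
        for syllable in syllables:
            node = node.children.setdefault(syllable, SyllableTreeNode())
        node.words.append(word)
    return root

# Illustrative entries (not taken from the patent's lexicon).
lexicon = build_lexicon([
    ("apple", ["PING2", "GUO3"]),
    ("bottle", ["PING2"]),
])
print(lexicon.children["PING2"].words)                   # ['bottle']
print(lexicon.children["PING2"].children["GUO3"].words)  # ['apple']
```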
- A stack SK1-SK6 corresponds to each frame F1-F6.
- Each theory on the stack at a given frame represents a sequence of syllables that can be derived from the lattice from the end of the utterance to that frame. That theory will then have a number of syllabic extensions backwards in time, which are then matched to the syllables in the lattice using the following algorithm:
- A stack is initialised with a null theory. All possible single-syllable extensions ending at time t_f are then analysed by traversing the lattice backwards and collecting all syllables that reach a terminal node at time t_f. The contextual confidence of that syllable in comparison to each theory is then determined, and the extended theory is placed in the stack at the frame at which that syllable starts, in order of confidence.
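The extension step of this algorithm might be sketched as follows. The lattice indexing, the contextual_score stand-in for the contextual model, and the representation of a theory as a tuple of syllables are all assumptions:

```python
import heapq

def extend_backwards(stacks, lattice_by_end, t_f, contextual_score):
    """One backward extension step of the stack decoder (sketch).

    stacks:          dict mapping frame -> heap of (-score, theory) pairs
    lattice_by_end:  dict mapping end frame -> list of LatticeEntry
    t_f:             frame at which the collected syllables must end
    contextual_score(theory, syllable): stand-in for the contextual model
    """
    for neg_score, theory in list(stacks.get(t_f, [])):
        for entry in lattice_by_end.get(t_f, []):
            score = (-neg_score + entry.log_confidence
                     + contextual_score(theory, entry.syllable))
            # Prepend, since the theory grows backwards in time.
            extended = (entry.syllable,) + theory
            # Place the extended theory on the stack for the frame at
            # which the syllable starts, ordered by confidence.
            heapq.heappush(stacks.setdefault(entry.start_frame, []),
                           (-score, extended))

# The decoder would start from a stack holding only the null theory:
# stacks = {final_frame: [(0.0, ())]}
```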
- Both the lattice and the stacks are beam pruned in the manner known in the art, so that confidences below a selected level are ignored.
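Beam pruning in this sense can be sketched as discarding hypotheses whose log confidence falls more than a fixed beam width below the current best; the beam width is a tunable assumption:

```python
def beam_prune(scored, beam_width: float):
    """Discard (log_score, item) pairs whose score falls more than
    beam_width below the best score in the collection."""
    if not scored:
        return []
    best = max(score for score, _ in scored)
    return [(score, item) for score, item in scored
            if score >= best - beam_width]

print(beam_prune([(-1.0, "a"), (-3.5, "b"), (-9.0, "c")], beam_width=5.0))
# [(-1.0, 'a'), (-3.5, 'b')]
```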
- The embodiments of the method of continuous speech recognition described are not to be construed as limitative.
- The content of the speech frames need not be represented vectorially but may be represented functionally, by a curve-fitting algorithm, for example.
- While the identification of the syllables is preferably undertaken by comparing them to Hidden Markov Models analysed using a Viterbi Algorithm, other algorithms may be used, as may other model types.
- In the embodiment described, the stack decoder traverses the lattice backwards, but it could also traverse the lattice in a forward direction, extending the syllables at each frame from their hypothesised start time. While use of a stack decoder is the preferred technique, other techniques may be employed.
- The first embodiment of the invention described above uses an acoustic speech signal as the input signal from which to derive the syllable lattice.
- In a second embodiment, the input signal is visual, namely a Hanyu Pin Yin phonetic representation of each syllable typed from a keyboard, with each keystroke forming a signal component.
- The lattice 1B in Figure 1 is formed of all possible (and similar) syllables which correspond to the typed input syllable.
- The lattice between the start and finish times of the typed input syllable may contain syllables having different tones or similar pronunciations.
- In this embodiment the syllables have defined start and finish times, thus simplifying the lattice to some extent.
- The resulting lattice is analysed as before using a stack decoder.
- In a third embodiment, the language signal again relies on a visual representation, like the second embodiment, but is based on graphical input of a Chinese character corresponding to each syllable.
- The Chinese character may be analysed stroke by stroke or as an image, the image being divided into a plurality of segments, with each stroke or segment forming a component to be analysed by a character recognition program which provides a number of alternatives for the character, sorted in order of probability.
- The alternatives thus identified for each character are then stored in a character lattice having the same attributes as the syllable lattice, except that the syllable identities are replaced by character identities.
- The lattice is then analysed contextually, as in the first embodiment, using a character-to-word lexicon.
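Such a character lattice can reuse the entry structure sketched earlier for the syllable lattice, with the syllable identity replaced by a character identity; the field names are again assumptions:

```python
from dataclasses import dataclass

@dataclass
class CharacterLatticeEntry:
    character: str         # a candidate Chinese character
    start_time: int        # when input of the character began
    end_time: int          # when input of the character ended
    log_confidence: float  # recogniser confidence as a log probability
```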
- The strokes may be input by an entry device such as a mouse or an input pen/pad.
- The image may be scanned and stored by conventional means.
- The present invention is applicable to language recognition in any spoken or written language, but has particular application to the Chinese language.
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SG1997/000061 WO1999027525A1 (fr) | 1997-11-25 | 1997-11-25 | Method of continuous language recognition |
AU52373/98A AU5237398A (en) | 1997-11-25 | 1997-11-25 | A method of continuous language recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SG1997/000061 WO1999027525A1 (fr) | 1997-11-25 | 1997-11-25 | Method of continuous language recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1999027525A1 true WO1999027525A1 (fr) | 1999-06-03 |
Family
ID=20429576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG1997/000061 WO1999027525A1 (fr) | 1997-11-25 | 1997-11-25 | Procede de reconnaissance vocale continue |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU5237398A (fr) |
WO (1) | WO1999027525A1 (fr) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5129000A (en) * | 1986-04-05 | 1992-07-07 | Sharp Kabushiki Kaisha | Voice recognition method by analyzing syllables |
US5122951A (en) * | 1989-07-14 | 1992-06-16 | Sharp Kabushiki Kaisha | Subject and word associating devices |
WO1996023298A2 (fr) * | 1995-01-26 | 1996-08-01 | Apple Computer, Inc. | System and method for generating and using context-dependent sub-syllable models to recognize a tonal language |
Non-Patent Citations (3)
Title |
---|
FUJISAKI ET AL.: "A new approach to continuous speech recognition based on considerations on human processes of speech perception", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 86), 7 April 1986 (1986-04-07) - 11 April 1986 (1986-04-11), TOKYO, JP, pages 1959 - 1962 vol.3, XP002072540 * |
GUPTA ET AL.: "Fast search strategy in a large vocabulary word recognizer", JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, vol. 84, no. 6, December 1988 (1988-12-01), US, pages 2007 - 2017, XP000096705 * |
WANG ET AL.: "Complete recognition of continuous Mandarin speech for Chinese language with very large vocabulary but limited training data", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP 95), vol. 1, 9 May 1995 (1995-05-09) - 12 May 1995 (1995-05-12), DETROIT, MI, US, pages 61 - 64, XP000657931 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3792911A4 (fr) * | 2018-05-08 | 2021-06-16 | Tencent Technology (Shenzhen) Company Limited | Method for detecting a key term in a speech signal, device, terminal and storage medium |
US11341957B2 (en) | 2018-05-08 | 2022-05-24 | Tencent Technology (Shenzhen) Company Limited | Method for detecting keyword in speech signal, terminal, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
AU5237398A (en) | 1999-06-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP0533491B1 (fr) | Keyword recognition in continuous speech using two hidden Markov models | |
US7162423B2 (en) | Method and apparatus for generating and displaying N-Best alternatives in a speech recognition system | |
Rigoll | Speaker adaptation for large vocabulary speech recognition systems using speaker Markov models | |
US6542866B1 (en) | Speech recognition method and apparatus utilizing multiple feature streams | |
US5677990A (en) | System and method using N-best strategy for real time recognition of continuously spelled names | |
US8019602B2 (en) | Automatic speech recognition learning using user corrections | |
JP2002507010A (ja) | Apparatus and method for simultaneous multimode dictation | |
JPH07152394A (ja) | Minimum misrecognition rate training of combined string models | |
JP2006038895A (ja) | Speech processing apparatus, speech processing method, program, and recording medium | |
KR101014086B1 (ko) | Speech processing apparatus and method, and recording medium | |
JP2955297B2 (ja) | Speech recognition system | |
KR101122591B1 (ko) | Speech recognition apparatus and method based on keyword recognition | |
KR20210052563A (ko) | Method and apparatus for providing a context-based speech recognition service | |
Evermann | Minimum word error rate decoding | |
EP0177854B1 (fr) | Keyword recognition system using strings of language elements | |
WO1999027525A1 (fr) | Method of continuous language recognition | |
JP3216565B2 (ja) | Speaker adaptation method for speech models, speech recognition method using the method, and recording medium on which the method is recorded | |
KR100586045B1 (ko) | Recursive speaker-adaptive speech recognition system and method using eigenvoice speaker adaptation | |
JPH1097275A (ja) | Large-vocabulary speech recognition apparatus | |
JP2731133B2 (ja) | Continuous speech recognition apparatus | |
Wu et al. | Application of simultaneous decoding algorithms to automatic transcription of known and unknown words | |
KR100304665B1 (ko) | Speech recognition apparatus and method using pitch wave characteristics | |
JPH0997095A (ja) | Speech recognition apparatus | |
Fotinea et al. | Emotion in speech: Towards an integration of linguistic, paralinguistic, and psychological analysis | |
JPH10254480A (ja) | Speech recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 09117913 Country of ref document: US |
|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
NENP | Non-entry into the national phase |
Ref country code: KR |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: CA |