A SYSTEM AND METHOD FOR GENERATING SIMULATED SPEECH
TECHNICAL FIELD
The present invention relates to a system and method for generating simulated speech by registration of conditions and movements in the mouth and/or throat area of a user.
BACKGROUND OF THE INVENTION
In certain situations it is inappropriate, difficult or even impossible to use a normal voice to communicate with other people, e.g. by telephone, or to give speech commands to voice controlled equipment, such as computers, machines and telephones. Thus, it may be forbidden or inappropriate to speak out loud in certain environments, e.g. in libraries and churches, at seminars, meetings and concerts, etc. Conversations on, e.g., a mobile phone are often regarded as very disturbing in such situations. In other cases it may be difficult or impossible to catch and perceive speech in noisy environments, even when the voice is raised. Furthermore, speaking with a loud voice is tiring, and the same message must often be repeated several times in order to be understood correctly by a listening individual or a receiving apparatus. It is also impossible to convey speech in certain environments, such as in a vacuum or under water.
Moreover, some persons lack the ability of speech or find it difficult to speak for different reasons, e.g. due to certain throat diseases or vocal cords that do not work properly. Today, there are known sound producing speech devices intended to be applied inside or beside the mouth cavity or throat region of a user, such that speech sounds can be generated from the resonance created when the user performs movements in the mouth and/or throat area, similar to the movements performed during normal speech. Such speech devices are primarily used by persons who have undergone throat surgery or the like, and thereby been deprived of the normal ability of speech. In one known solution, a vibrating rod is held against the outside of the neck or the face, requiring that the user hold the rod in a correct position in order to create any acceptable speech sound. In another solution, a tone generator is implanted in the throat, which naturally causes discomfort to the user. Generally speaking, all solutions requiring the introduction of a device inside the mouth/throat are uncomfortable to the user. In the above-mentioned known solutions, the possibility of obtaining a good sound quality is quite limited with respect to clarity and naturalness.
The patent document US 2002/0120449 A1 describes a solution for speech recognition independent of whether the speech of the speaker can be heard or not. Reflection signals in the mouth cavity from transmitted noise are analysed with respect to frequency, in order to detect resonance characteristics, e.g. by spectral analysis. These resonance characteristics are correlated with sounds corresponding to speech, by comparing the information retrieved from the reflections with previously collected reference data stored in a database. In another embodiment described in this document, a three-dimensional representation of the cavity is produced, again by comparing detected information on the cavity with reference data stored in a database. However, this solution is exclusively directed to speech recognition for generating a corresponding text, and furthermore requires that a sensor is applied inside the mouth of the user, which is troublesome and may even disturb the speech movements made by the user.
Hence, there is a need to be able to convey speech with good quality in a comfortable and effective manner using relatively simple equipment, without requiring devices inside the mouth/throat of the user, and without requiring the use of a sounding voice, as described above. There is thus a need to be able to produce speech without relying on a sounding voice, whether because the speech should not be heard by other persons, or because of noisy or extreme environments.
In a completely different context, there are known solutions for controlling a cursor on a computer screen without requiring the use of a hand operating a mouse. This technique is intended for, e.g., disabled persons or situations when both hands are engaged in other activities, such as when driving a car or when operating a machine or instrument. Applicant's own patent application SE 0100897-8 discloses a solution where the user can control a computer cursor by moving the tongue and/or other parts of the mouth cavity or the face, in order to obtain computer control without the use of hands. According to this solution, ultrasonic signals are emitted in a direction towards the mouth cavity of a user, wherein reflected ultrasonic signals are registered and transformed into control signals used for controlling the cursor. The created control signals can thereby produce movement or other activation of the cursor on the screen, such that each condition in the mouth cavity is associated with a specific cursor command.
SUMMARY OF THE INVENTION
It is an object of the present invention to obtain a solution for producing simulated speech without requiring the use of naturally sounding speech. This object and others are obtained by means of a system and method where movements and conditions are detected in a user when he/she performs speech movements. According to the invented method, ultrasonic signals are emitted mainly in a direction towards the mouth and/or throat area of the user, wherein reflected ultrasonic signals are received and registered for detecting movements and conditions of the user when speech movements are performed. The reflected ultrasonic signals are transformed by a processing unit into speech signals corresponding to the speech, which are then supplied to a speech generating device in order to generate simulated speech.
The invented speech generating system comprises a detecting unit adapted to be carried by a user outside the face and/or neck, for detecting movements and conditions of the user when performing speech movements. The detecting unit is provided with at least one ultrasonic device adapted to emit ultrasonic signals from the outside mainly in a direction towards the mouth and/or throat area of the user, and to register reflected ultrasonic signals. The system further comprises a processing unit adapted to transform the reflected ultrasonic signals into speech signals corresponding to speech, and to supply the created speech signals to a speech generating device, for generating simulated speech. According to one embodiment, received reflection signals are transformed into data representing different conditions in the user's mouth and/or throat area, wherein
transformed data are analysed and translated into speech signals, such that each condition in this area is associated with a corresponding speech state.
According to another embodiment, the ultrasonic signals which the ultrasonic device emits comprise at least one carrier wave of a specific frequency. The processing unit is further adapted to transpose received reflection signals from the carrier wave into speech signals in a frequency range of natural speech. The speech generating device may be a speech generator provided with a loudspeaker for generating speech sounds, or a telephone for communicating speech over a telephony network, or a voice control device for controlling by voice an apparatus or machine. The invention solves the problem of producing simulated speech, without the user necessarily speaking out loud, or when normal speech cannot be perceived and picked up, such as in noisy or extreme environments.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is described in more detail below by means of a preferred exemplary embodiment, and with reference to the attached drawings:
Fig. 1 is a schematic perspective view of a speech generating system.
Fig. 2 is a schematic perspective view of an alternative embodiment of a speech generating system.
Fig. 3 is a schematic perspective view of a further alternative embodiment of a speech generating system.
Figs. 4-6 are schematic perspective views of some alternative detailed embodiments of a detecting unit for registering movements in the mouth and/or throat area of a user.
Fig. 7 is a flow chart of a working procedure of transforming reflected ultrasonic signals into speech signals.
Fig. 8 is a flow chart of an alternative working procedure of transforming reflected ultrasonic signals into speech signals.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
In Fig. 1, a speech generating system 100 is schematically shown for producing simulated speech without registering speech sounds. A user 102 about to speak carries a detecting unit 104 outside the mouth and/or neck, for detecting movements and conditions in the mouth and/or throat area of the user. During normal speech, certain specific movements and conditions are created inside the mouth and/or throat of the speaker, in particular in a region called the larynx-pharynx, but also on the outside, i.e. the face and neck. These different movements and conditions are closely associated with the sounds normally generated when the vocal cords sound at the same time, i.e. during normal speech, such that each movement or condition in this area corresponds to a specific speech sound. The present invention makes use of this relation to create simulated speech, without requiring the sounding of the vocal cords, or that any speech sound from the speaker be registered.
For this purpose, the detecting unit 104 is provided with an ultrasonic device 106 having a transmitter and a receiver for ultrasonic signals. Thus, the ultrasonic device 106 is adapted to emit ultrasonic signals mainly in a direction towards the mouth and/or throat area of the user, and to register reflection signals reflected by inner and outer surfaces in the mouth and/or throat area, such as the lips, tongue, mouth cavity, throat cartilage and face, which move as the user performs more or less normal speech movements. The received reflection signals are thus related to the shape, position and/or movements of the different parts present in the examined area. Within the scope of the present invention, ultrasonic signals may be emitted only towards the mouth cavity, only towards the throat, or towards both areas, depending on what is found to be appropriate.
The emitted ultrasonic signals include a non-audible ultrasonic carrier wave which may preferably be modulated, e.g. with respect to amplitude, frequency or phase, in order to increase the naturalness of the speech. The carrier wave may further be modulated as pulses, making it less sensitive to other disturbing sounds and also enabling transmission and reception to be separated in time. It is also possible to modulate the carrier wave by means of an encryption key or the like, in order to prevent eavesdropping by any other ultrasonic equipment. For example, an encryption key may be used to control a frequency hopping sequence. Reflected ultrasonic signals may thus be registered that correspond to specific speech sounds. In this way, the user may produce speech movements and conditions by speaking wholly "silently", by whispering, or with normal sound, depending on the ability of the user or on what is appropriate in the environment in which he/she is present. The reflection signals are not substantially affected by whether the vocal cords are used or not. It is even possible to perform detectable speech movements with the mouth closed.
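The pulse modulation and key-controlled frequency hopping described above can be illustrated by a minimal sketch. The specific carrier frequencies, pulse length, sample rate, and the use of a key-seeded pseudo-random generator as the hop-sequence source are illustrative assumptions, not details given in the description:

```python
import math
import random

def hop_sequence(key, n_hops, channels):
    """Derive a pseudo-random carrier-frequency hopping sequence from a
    shared key, so that only equipment knowing the key can follow the
    transmitted carrier (the seeding scheme here is an assumption)."""
    rng = random.Random(key)
    return [rng.choice(channels) for _ in range(n_hops)]

def pulsed_carrier(freqs, pulse_s=0.005, rate=192_000):
    """Emit one short ultrasonic burst per hop, followed by an equally
    long silent interval, so that transmission and echo reception can
    be distinguished in time."""
    samples = []
    n = int(pulse_s * rate)
    for f in freqs:
        samples.extend(math.sin(2 * math.pi * f * i / rate) for i in range(n))
        samples.extend([0.0] * n)  # silent listening interval for reflections
    return samples

# Hypothetical ultrasonic channels (Hz), all above the audible range.
channels = [40_000, 42_000, 44_000, 46_000]
freqs = hop_sequence("shared-secret", 8, channels)
signal = pulsed_carrier(freqs)
```

A receiver sharing the key regenerates the same sequence with `hop_sequence` and listens for each burst on the expected channel; equipment without the key cannot predict which channel carries the next pulse.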
In the embodiment according to Fig. 1, a transmitter 108 is also provided on the detecting unit 104, adapted to convey the reflection signals wirelessly to a receiver 110 connected to a processing unit 112. According to the present invention, the processing unit 112 is adapted to process and analyse the incoming reflection signals, and to translate them into speech signals intended to be input to a speech generating device 114. The transfer between the transmitter 108 and the receiver 110 may be done by means of any suitable technique for wireless transmission, such as IR or radio, e.g. according to the Bluetooth standard. It is also possible to use wire-based transmission, which is schematically illustrated in Fig. 2, where the reflection signals from the ultrasonic device 106 are conveyed by means of an electric or optical cable 200 to the processing unit 112. However, the present invention is not limited to any specific method of conveying the reflection signals to the processing unit 112.
In Fig. 3, a further alternative embodiment is shown, where the processing unit 112 is integrated with the detecting unit 104. The reflection signals received at the ultrasonic device 106 are processed, analysed and transformed into speech signals by means of the processing unit 112, whereupon the speech signals are wirelessly transmitted by the transmitter 108 to a receiver 110 and a speech generating device 114. It is also possible to convey the speech signals by wire, not shown, to the speech generating device 114.
The speech signals created by the processing unit 112 may be analogue signals directly corresponding to
acoustic sound, or digitally encoded speech which can be decoded at a later stage into such analogue signals. For this purpose, the processing unit 112 comprises a database with stored reference values, and a logic unit such as a processor, not shown, being programmed to translate received reflection signals into different speech states by comparison with stored reference values, which is described in more detail below. Alternatively, the logic unit may comprise a digital signal processor (DSP) for filtering and controlling pitch, wherein the database may comprise suitable stored filter parameters.
According to alternative applications of the present invention, as schematically illustrated in Fig. 1, the speech generating device 114 may be implemented as different types of devices. According to a first application, the speech generating device 114 is a speech generator 116 provided with a loudspeaker 116a and adapted to generate sounding speech by feeding the speech signal in analogue form to the loudspeaker 116a, after suitable filtering and amplification. This application is useful, e.g., for persons lacking a normal speaking ability, or in noisy environments, where the sound may also be amplified to a desired extent. The loudspeaker may also be placed in a different room from the one where the speaker is present, such as in a control room of, e.g., an industrial process where the speaker is standing next to a noisy machine or the like.
The speech generating device 114 may, in a second application, be a telephone 118, e.g. a mobile telephone. The telephone 118 may be provided with a digital input port for receiving the speech signals as digitally encoded speech for communication over a telephony network. Alternatively, analogue signals may be received in an analogue input port,
thereby directly replacing signals from a conventional microphone in the telephone 118. According to a preferred embodiment, the speech generating system 100 may be built integrally with a telephone in a suitable manner, not shown, such that the created speech signals can be transferred directly to a telephony network as digitally encoded speech according to the prevailing encoding standard. The processing unit 112 may then be adapted to create digitally encoded speech from the reflection signals, or alternatively a separate speech coder is provided to encode an analogue speech signal from the processing unit 112.
According to a third possible application, the speech generating device 114 may be a voice control device 120, for controlling by voice an apparatus or machine of any type, not shown. Today, there are many known examples of applications using voice control, such as computers and telephones. For example, there are computer programs which produce a continuous text based on supplied speech sound. Basically, this invention may be used for any application having voice control. The voice control device 120 may be adapted to receive the speech signals from the processing unit 112 either as analogue signals or as digitally encoded speech, for supplying the speech to the voice controlled apparatus or machine. Generally speaking, the processing unit 112 and the speech generating device 114 may be arranged at any distance from each other. They may thus be integrated in a single unit, or be placed at totally different locations, within the scope of the invention. The detecting unit 104 may, within the scope of the invention, be designed as a so-called headset, intended to be carried on the user's head. In Figs. 1-3, the detecting unit 104 is illustrated as an arm positioned at one side of the user's face, extending downwards in the area outside the mouth and throat. Figs. 4-6 illustrate some possible alternative embodiments of the detecting unit 104. In Fig. 4, the detecting unit 104 comprises two arms which are arranged at one side of the user's head. Each arm is provided with at least one ultrasonic device 106.1, 106.2.
Fig. 5 illustrates an embodiment with two arms intended to be positioned on opposite sides of the user's head. Each arm is provided with at least one ultrasonic device 106.3, 106.4.
Fig. 6 illustrates an embodiment with one arm provided with a plurality of ultrasonic devices 106.5, 106.6, 106.7... distributed along one side of the user's head and neck.
In order to achieve efficient propagation and reflection of the ultrasonic signals in the area of the mouth and throat, it may be advantageous if the detecting unit 104 is designed such that the ultrasonic device(s) 106 can, in a suitable manner, rest against the outside of the user's face and neck. Since ultrasonic sound is rather direction dependent, it may be advantageous to place the ultrasonic device 106 close to the mouth, such that the sound can more easily propagate through the mouth cavity. Since the detecting unit 104 is located entirely outside the mouth cavity, it is convenient to carry and does not affect or disturb the speech of the user.
Naturally, different combinations of the above-described embodiments are possible within the scope of the invention, which is not limited by the illustrated embodiments.
With reference to Fig. 7, the working procedure of the processing unit 112 will now be described more closely. The processing unit 112 shown in Figs. 1-3 has the task of translating reflection signals from the ultrasonic device 106 into speech signals, intended to be fed into the speech generating device 114. Fig. 7 illustrates a flow chart of an example of how the processing unit 112 basically may work to accomplish this transformation.
It is assumed that a detecting unit 104 is applied to a "speaking" user 102, e.g. as described above. In a first step 700, reflection signals are received from the ultrasonic device 106, which may comprise a predetermined sequence of a plurality of separate reflection registrations made during a certain period, or continuously at regular intervals, by one or more ultrasonic devices in a predetermined order. If reflections are registered in sequence by plural ultrasonic devices distributed along an arm, such as illustrated for example in Fig. 6, a sweeping or scanning examination of the area of the mouth and throat is accomplished.
In this example, the received reflection signals are analogue, and these are filtered and digitised in a next step 702 into a form suitable for further treatment. The digital signals then constitute data in the form of a number of samples or measured values, representing different conditions in the examined area with respect to the shapes and relative positions of the lips, tongue, mouth cavity and throat, and other parts of the area that move as the user speaks. Alternatively, the ultrasonic device 106 may be adapted to supply the reflection signals directly in digital form.
In step 704, the received data is analysed to associate the measured values with specific conditions in the mouth and/or throat area. In this way, an image representation or stereogram of the examined area can be created. It may be mentioned that, within the field of medical research, techniques are being developed today for registering and analysing such internal conditions by means of ultrasonic scanning.
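As a simple illustration of how registered reflections carry geometric information about the examined area: each echo delay t corresponds to a reflecting surface at the one-way distance d = c·t/2, where c is the speed of sound. The speed value and the example delays below are assumed figures, not data from the description:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed value)

def echo_distances(delays_s):
    """Each echo travels to the reflecting surface and back, so the
    one-way distance is half the round-trip delay times the sound speed."""
    return [SPEED_OF_SOUND * t / 2.0 for t in delays_s]

# Hypothetical round-trip delays (seconds) for three reflecting
# surfaces, e.g. the lips, the tongue and the rear mouth cavity.
delays = [0.00035, 0.00060, 0.00090]
profile = echo_distances(delays)  # distances in metres, a few cm each
```

A sequence of such distance profiles, registered in turn by several ultrasonic devices as in Fig. 6, is one conceivable basis for the image representation mentioned above.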
Next, in step 706, these conditions are translated into speech signals, intended to be supplied to the speech generating device 114 for creating speech, e.g. according to any of the above-described applications. The translation into speech signals is possible since different conditions in the mouth and/or throat area are associated with specific speech sounds, by comparison with reference data stored in the database of the processing unit 112, e.g. in the form of a table. More specifically, the logic unit performs a comparison and matching of each condition with the reference data stored in the database. These associations have previously been created by means of a calibration procedure, in which the processing unit 112 is trained to create correct speech signals by the user reading a predetermined text. Thereby, the produced speech signals are adapted to a specific user and his/her personal manner of "speaking", such that the simulated speech can be created as naturally as possible.
The produced speech signals are finally sent to the speech generating device 114 in step 708, for generating simulated speech. As mentioned above, the speech signals may be supplied in analogue or digital form.
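Steps 700-708 can be sketched as a nearest-neighbour lookup: each digitised registration is treated as a vector of measured values and matched against reference vectors recorded during the calibration procedure described above. The vector length, the Euclidean distance metric and the phoneme labels are illustrative assumptions, not details fixed by the description:

```python
import math

# Reference table from calibration: while the user reads a known text,
# reflection vectors are recorded and labelled with speech states.
# These vectors and labels are invented for illustration.
reference = {
    "a": [0.9, 0.2, 0.1, 0.7],
    "o": [0.8, 0.6, 0.2, 0.5],
    "s": [0.1, 0.3, 0.9, 0.2],
}

def match_condition(measured, table):
    """Steps 704-706: find the stored mouth/throat condition whose
    reference vector lies closest (Euclidean distance) to the measured
    reflection vector, and return the associated speech state."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(table, key=lambda label: dist(measured, table[label]))

def transform(registrations, table):
    """Steps 700-708: translate a sequence of digitised reflection
    registrations into a sequence of speech states for the speech
    generating device."""
    return [match_condition(m, table) for m in registrations]

states = transform([[0.88, 0.25, 0.12, 0.68], [0.12, 0.28, 0.91, 0.19]], reference)
# states == ["a", "s"]
```

A per-user calibration simply refills `reference` from that user's own recordings, which corresponds to the training step described above.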
In an alternative embodiment, the process of Fig. 7 may be modified and simplified, as described below with reference to Fig. 8, which illustrates a flow chart of another example of how the processing unit 112 basically may work to translate reflection signals into simulated speech. In this case, the ultrasonic device 106 emits at least one carrier wave of a specific frequency and, as in the previous example, reflection signals from the ultrasonic device 106 are received in a first step 800. Next, in a step 802, the reflection signals are processed by filtering and adjustments. For example, the signals may be cleaned from artefacts, and may be restored to eliminate distortions and defects in the signals. This signal processing may be performed either in analogue form or after digitisation. In a next step 804, the treated ultrasonic signals are transposed into a frequency range of natural speech, optionally after decryption if encryption has been used, in order to create speech signals. In this way, the speech signals correspond directly to the user's natural forming of words, phonemes and syllables. Finally, the finished speech signals are supplied to the speech generating device 114, in a step 806, for generating simulated speech. Also in this case, the speech signals may be supplied in analogue or digital form. A good result from the transposition can be achieved, since at least one carrier wave has been used.

By means of the present invention, simulated speech can be produced with good quality, without the user necessarily speaking out loud, and without being affected by a noisy environment. When compared with a conventional microphone and loudspeaker technique, the invention eliminates the influence of disturbances, by exclusively picking up the registered speech, independently of other disturbing sounds. The invention is useful in several different situations and
usage fields, chiefly when sounding speech cannot be used, e.g. if the speaker has been deprived of his/her natural speech ability, or in environments where it is not permitted or appropriate to speak out loud. Furthermore, the invention is useful when normal speech cannot be perceived, such as in extreme environments in a medium that does not allow speech to be conveyed, e.g. under water or in a vacuum, where a suitable sealing cover or the like containing air may be worn. Furthermore, the invention is simple and comfortable to use for the speaker, since no parts need to be applied inside the mouth cavity or throat, since only fairly normal speech movements are required, and since the voice need not be strained. The user's hands also remain free, since the detecting unit can be designed as a headset or the like.
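The frequency transposition of step 804 above can be sketched as coherent demodulation: multiplying the received echoes by the known carrier shifts the speech-movement-induced modulation from the ultrasonic band down into the frequency range of natural speech, after which the high-frequency product term is removed by low-pass filtering. The sample rate, carrier frequency, single-tone modulation and crude moving-average filter are all illustrative assumptions:

```python
import math

RATE = 192_000      # sample rate in Hz (assumed)
CARRIER = 40_000.0  # ultrasonic carrier in Hz (assumed)
SPEECH = 200.0      # speech-band modulation riding on the carrier (assumed)

# Model of a received reflection: the carrier amplitude-modulated by a
# slow speech-rate component caused by moving surfaces.
N = 2048
received = [(1.0 + 0.5 * math.cos(2 * math.pi * SPEECH * i / RATE))
            * math.cos(2 * math.pi * CARRIER * i / RATE) for i in range(N)]

def transpose(signal, f_carrier, rate, window=48):
    """Mix with the carrier (cos(wc t) * cos(wc t) = 1/2 + cos(2 wc t)/2),
    which lands the modulation envelope at baseband; a moving average
    then suppresses the 2 * f_carrier product term."""
    mixed = [2.0 * s * math.cos(2 * math.pi * f_carrier * i / rate)
             for i, s in enumerate(signal)]
    return [sum(mixed[i:i + window]) / window
            for i in range(len(mixed) - window)]

baseband = transpose(received, CARRIER, RATE)
# baseband now varies at ~200 Hz around 1.0 instead of at 40 kHz
```

A real implementation would use a proper low-pass filter and handle frequency-hopped or encrypted carriers by mixing with the same hop sequence, but the principle, shifting the carrier-borne modulation down into the speech band, is the same.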
The invention is not limited to the described embodiments, but is generally defined by the following claims.