CN108304389B - Interactive voice translation method and device - Google Patents
- Publication number: CN108304389B (application number CN201711287987.XA)
 - Authority
 - CN
 - China
 - Prior art keywords
 - text
 - recognition
 - translation
 - key
 - user
 - Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
 
Classifications
- G—PHYSICS
 - G06—COMPUTING OR CALCULATING; COUNTING
 - G06F—ELECTRIC DIGITAL DATA PROCESSING
 - G06F40/00—Handling natural language data
 - G06F40/40—Processing or translation of natural language
 - G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
 
- G—PHYSICS
 - G06—COMPUTING OR CALCULATING; COUNTING
 - G06F—ELECTRIC DIGITAL DATA PROCESSING
 - G06F40/00—Handling natural language data
 - G06F40/30—Semantic analysis
 
- G—PHYSICS
 - G10—MUSICAL INSTRUMENTS; ACOUSTICS
 - G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
 - G10L15/00—Speech recognition
 - G10L15/005—Language recognition
 
- G—PHYSICS
 - G10—MUSICAL INSTRUMENTS; ACOUSTICS
 - G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
 - G10L15/00—Speech recognition
 - G10L15/26—Speech to text systems
 
 
Landscapes
- Engineering & Computer Science (AREA)
 - Physics & Mathematics (AREA)
 - Audiology, Speech & Language Pathology (AREA)
 - Computational Linguistics (AREA)
 - Health & Medical Sciences (AREA)
 - Theoretical Computer Science (AREA)
 - Artificial Intelligence (AREA)
 - General Health & Medical Sciences (AREA)
 - General Engineering & Computer Science (AREA)
 - General Physics & Mathematics (AREA)
 - Human Computer Interaction (AREA)
 - Acoustics & Sound (AREA)
 - Multimedia (AREA)
 - Machine Translation (AREA)
 
Abstract
The embodiment of the invention provides an interactive voice translation method and device, belonging to the technical field of language processing. The method comprises the following steps: if the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text; if the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, prompting the user of the translation difficulty degree corresponding to the first recognition text; if it is detected that the user inputs a key text, performing semantic analysis on the key text to obtain key nouns and their types, translating the key nouns based on their types to obtain a first translation result, translating the other contents to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text. Because a new interactive speech translation mode is provided, the translation result is more accurate.
    Description
Technical Field
      The embodiment of the invention relates to the technical field of language processing, in particular to an interactive voice translation method and device.
    Background
The traditional language service industry relies on manual accompanying interpretation, consecutive interpretation, simultaneous interpretation and the like to overcome language communication barriers, but, limited by insufficient manpower and high cost, it cannot meet ordinary people's need to communicate across languages. The development of speech translation technology is a beneficial supplement to the traditional interpretation service industry, provides another way for ordinary people to communicate daily, and has advantages in cost, timeliness and the like.
The speech translation process generally consists of three parts: speech recognition, machine translation and speech synthesis. Speech translation usually transmits translation results one way, so when an error occurs in speech recognition or machine translation, erroneous information is passed on. This is especially true for personal names, place names and organization names: because most such entity nouns are rare vocabulary, they appear in the training corpora of speech recognition and machine translation in small proportion, or never, so errors easily occur during recognition and translation, which affects the effect of speech translation in practical applications.
    Disclosure of Invention
      To solve the above problems, embodiments of the present invention provide an interactive speech translation method and apparatus that overcome the above problems or at least partially solve the above problems.
      According to a first aspect of the embodiments of the present invention, there is provided an interactive speech translation method, including:
      if the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text;
      if the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, prompting the user of the translation difficulty degree corresponding to the first recognition text, and detecting the operation executed by the user based on the prompt;
if the operation executed by the user is detected to be inputting a key text, performing semantic analysis on the key text to obtain key nouns and the types of the key nouns, translating the key nouns based on the types of the key nouns to obtain a first translation result, translating the other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
According to the method provided by the embodiment of the invention, when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, the first target language text is translated to obtain a second recognition text. If the semantics of the first recognized text and the semantics of the second recognized text are not equivalent, the user is prompted with the translation difficulty degree corresponding to the first recognized text, and the operation executed by the user based on the prompt is detected. If the operation executed by the user is detected to be inputting a key text, semantic analysis is performed on the key text to obtain key nouns and their types, the key nouns are translated based on their types to obtain a first translation result, the other contents except the key nouns in the first recognition text are translated to obtain a second translation result, and the first translation result and the second translation result are combined to obtain a second target language text. Because the user can be prompted to input a key text through interaction when the translation result cannot be confirmed to be correct, each key noun in the key text can be segmented as a whole word and translated according to its type, so that the translation result is more accurate.
      With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, before prompting the user of the translation difficulty level corresponding to the first recognition text, the method further includes:
      vectorizing the first recognition text and the second recognition text respectively to obtain a first word vector sequence and a second word vector sequence, and calculating the distance between the first word vector sequence and the second word vector sequence;
      and if the distance between the first word vector sequence and the second word vector sequence is not less than a third preset threshold value, determining semantic inequivalence between the first recognized text and the second recognized text.
      With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, translating a key noun based on a type of the key noun to obtain a first translation result includes:
      and determining a corresponding placeholder according to the type of the key nouns, converting the placeholder into a translated target language noun, and taking the translated target language noun as a first translation result.
With reference to the first possible implementation manner of the first aspect, in a fourth possible implementation manner, after prompting the user of the translation difficulty level corresponding to the first recognition text, the method further includes:
if the operation executed by the user is detected to be re-inputting the voice signal, acquiring a third recognition text, and re-executing the text translation process based on the third recognition text; the third recognition text is the recognition text corresponding to the re-input voice signal, and the first recognition text and the third recognition text differ in text data but are semantically equivalent.
      With reference to the first possible implementation manner of the first aspect, in a fifth possible implementation manner, the method further includes:
      if the recognition confidence of the first recognition text is not greater than the first preset threshold, prompting the user to confirm the first recognition text again;
and if a confirmation-error instruction and a text modification instruction input by the user for the first recognition text are detected, modifying the first recognition text according to the text modification instruction, and re-executing the text translation process based on the modified first recognition text.
      With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, after prompting, in a preset manner, a user to confirm the first recognition text again, the method further includes:
if it is detected that the user inputs a confirmation-error instruction for the first recognition text and re-inputs the voice signal, acquiring a fourth recognition text, and re-executing the text translation process based on the fourth recognition text; the fourth recognition text is the recognition text corresponding to the re-input voice signal.
      With reference to the fifth possible implementation manner of the first aspect, in a seventh possible implementation manner, after prompting, in a preset manner, the user to confirm the first recognition text again, the method further includes:
and if a confirmation-correct instruction input by the user for the first recognition text is detected, resetting the recognition confidence of the first recognition text to the maximum value of the recognition confidence, and executing the text translation process again.
      According to a second aspect of the embodiments of the present invention, there is provided an interactive speech translation apparatus, including:
      the first translation module is used for translating the first target language text to obtain a second recognition text when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text;
the prompting module is used for prompting the user of the translation difficulty degree corresponding to the first recognition text when the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, and detecting the operation executed by the user based on the prompt;
the second translation module is used for performing semantic analysis on the key text to obtain key nouns and the types of the key nouns when detecting that the operation executed by the user is inputting a key text, translating the key nouns based on the types of the key nouns to obtain a first translation result, translating the other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
      According to a third aspect of embodiments of the present invention, there is provided an interactive speech translation apparatus including:
      at least one processor; and
      at least one memory communicatively coupled to the processor, wherein:
      the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the interactive speech translation method provided by any of the various possible implementations of the first aspect.
      According to a fourth aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the interactive speech translation method provided in any one of the various possible implementations of the first aspect.
      It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
    Drawings
      FIG. 1 is a flowchart illustrating an interactive speech translation method according to an embodiment of the present invention;
      FIG. 2 is a diagram illustrating a speech translation process according to an embodiment of the present invention;
      FIG. 3 is a flowchart illustrating an interactive speech translation method according to an embodiment of the present invention;
      FIG. 4 is a flowchart illustrating an interactive speech translation method according to an embodiment of the present invention;
      FIG. 5 is a flowchart illustrating an interactive speech translation method according to an embodiment of the present invention;
      FIG. 6 is a block diagram of an interactive speech translation apparatus according to an embodiment of the present invention;
      fig. 7 is a block diagram of an interactive speech translation apparatus according to an embodiment of the present invention.
    Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The following examples are intended to illustrate the present invention, but not to limit its scope.
The traditional language service industry relies on manual accompanying interpretation, consecutive interpretation, simultaneous interpretation and the like to overcome language communication barriers, but, limited by insufficient manpower and high cost, it cannot meet ordinary people's need to communicate across languages. The development of speech translation technology is a beneficial supplement to the traditional interpretation service industry, provides another way for ordinary people to communicate daily, and has advantages in cost, timeliness and the like.
Speech translation refers to the process of automatically translating a speech signal in a source language into a speech signal in a target language. Speech translation generally includes three main components: speech recognition, machine translation, and speech synthesis. Specifically, given a speech signal in the source language, first a recognition text in the source language is obtained through a speech recognition system; second, the recognition text is translated into text in the target language through a machine translation system; and finally, the target language text is synthesized into a speech signal in the target language through a speech synthesis system. Speech translation usually transmits translation results one way, so when an error occurs in speech recognition or machine translation, erroneous information is passed on. This is especially true for personal names, place names and organization names: because most such entity nouns are rare vocabulary, they appear in the training corpora of speech recognition and machine translation in small proportion, or never, so errors easily occur during recognition and translation, which affects the effect of speech translation in practical applications. In addition, in the current manual translation process, translators usually need multiple rounds of communication to translate entity nouns, so translation efficiency is not high.
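      As a minimal sketch of this cascade, the three component functions below are illustrative stubs assumed for this document, not APIs from the patent:

```python
def recognize(audio: bytes, lang: str) -> str:
    """Illustrative ASR stub: would return the recognized source-language text."""
    return "May I ask how to get to Brooklyn?"

def translate(text: str, src: str, tgt: str) -> str:
    """Illustrative MT stub: would return the target-language text."""
    return "How can I get to Brooklyn?"

def synthesize(text: str, lang: str) -> bytes:
    """Illustrative TTS stub: would return synthesized target-language audio."""
    return b"<target-language audio>"

def speech_translate(source_audio: bytes, src_lang: str = "zh", tgt_lang: str = "en") -> bytes:
    # One-way cascade: an error in recognition or translation propagates
    # downstream, which is the problem the interactive method addresses.
    source_text = recognize(source_audio, src_lang)
    target_text = translate(source_text, src_lang, tgt_lang)
    return synthesize(target_text, tgt_lang)
```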
In view of the above situation, an embodiment of the present invention provides an interactive speech translation method. The method can be used in a speech translation scenario: first, a recognition text is obtained through speech recognition, and then the recognition text is translated to obtain a target language text. Referring to fig. 1, the method includes: 101. if the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text; 102. if the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, prompting the user of the translation difficulty degree corresponding to the first recognition text, and detecting the operation executed by the user based on the prompt; 103. if the operation executed by the user is detected to be inputting a key text, performing semantic analysis on the key text to obtain key nouns and their types, translating the key nouns based on their types to obtain a first translation result, translating the other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
Before step 101 is executed, the speech signal of the source language may be received through an audio acquisition module, and speech recognition may then be performed on it to obtain the first recognition text. When the first recognition text is translated, it can be input into an encoder-decoder recurrent neural network for translation, which outputs the first target language text. The above process can be illustrated by the following example, where user A (Chinese-speaking) needs to deliver information to user B (English-speaking), and intermediate translation by machine is required because A and B do not share a language. As shown in fig. 2, in the normal case, user A directly speaks to the machine in Chinese: "May I ask how to get to Brooklyn?" After the machine performs speech recognition to obtain the first recognition text, it translates the first recognition text into the target language, such as "How can I get to Brooklyn?", and transmits the first target language text obtained by translation to user B, completing a single translation. When the first target language text is transmitted to user B, it may be transmitted by interface display, or by speech synthesis and broadcast, which is not specifically limited in this embodiment of the present invention.
In the above example, the place name "Brooklyn" is a rare entity noun, and when speech recognition is performed on "Brooklyn", the recognition may be wrong. In addition, translation errors may occur even if the recognition is correct. Based on the above situation, for the speech translation scenario, before performing step 101 it can be determined whether a recognition error and a translation error occur, and step 101, step 102, and step 103 are performed sequentially according to the determination result.
As can be seen from the above, for the speech translation scenario, before performing step 101 it can be determined separately whether a recognition error and a translation error have occurred. Specifically, the recognition confidence score_asr of the first recognition text and the translation confidence score_mt of the first target language text can first be obtained. The recognition confidence score_asr represents the credibility of the first recognized text as a speech recognition result, and the translation confidence score_mt represents the credibility of the first target language text as a translation result. When the recognition confidence score_asr of the first recognized text is greater than a first preset threshold T_asr (i.e. score_asr > T_asr), the first recognized text may be considered correctly recognized; otherwise, the first recognized text may be considered a recognition error. When the translation confidence score_mt of the first target language text is greater than a second preset threshold T_mt (i.e. score_mt > T_mt), the first target language text may be considered correctly translated; otherwise, the first target language text may be considered a translation error.
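      A minimal sketch of this branching logic follows; the threshold values and the function name are illustrative assumptions, not values from the patent:

```python
T_ASR = 0.6  # first preset threshold (illustrative value)
T_MT = 0.6   # second preset threshold (illustrative value)

def classify(score_asr: float, score_mt: float) -> str:
    """Decide which branch of the interactive flow applies."""
    if score_asr <= T_ASR:
        # Possible recognition error: ask the user to reconfirm the
        # first recognition text (steps 501-502 later in this document).
        return "possible recognition error"
    if score_mt <= T_MT:
        # Recognition trusted but translation not: back-translate the
        # first target language text and compare semantics (step 101).
        return "recognition correct, possible translation error"
    return "recognition and translation both trusted"
```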
In step 101, if the recognition confidence of the first recognized text is greater than the first preset threshold and the translation confidence of the first target language text is not greater than the second preset threshold, the recognition is correct but the translation may be incorrect. At this time, the first target language text may be used as the input of machine translation and translated in reverse to obtain the second recognized text. Because the translation is performed in reverse, the first recognition text and the second recognition text correspond to the same language. After the first recognized text and the second recognized text are obtained, it can be judged whether they are semantically equivalent. If they are not semantically equivalent, a translation error is indicated, and there can be two reasons for it: first, the expression form of the first recognized text is not conducive to correct translation; second, some key nouns that are difficult to translate exist in the first recognized text. The type of a key noun may be a person name, a place name, an organization name, or a noun phrase, which is not specifically limited in the embodiment of the present invention. In addition, the first recognition text may contain one or more key nouns. When the first recognition text includes multiple key nouns, they may be key nouns of multiple types, which is not particularly limited in the embodiment of the present invention.
After it is determined that the semantics of the first recognized text are not equivalent to the semantics of the second recognized text, the user is prompted with the translation difficulty degree corresponding to the first recognized text. In either of the two cases above, the first recognized text is difficult to translate, so the user can be prompted that the translation difficulty of the current first recognized text is high, and the user can then supplement a related explanation of the first recognized text or replace it with another recognized text that is easier to translate.
For the second case, namely that some key nouns which are difficult to translate exist in the first recognition text, when the first recognition text is translated and the key text input by the user is detected, the key nouns and their types are obtained from the key text through semantic parsing, so that each key noun can subsequently be segmented as a whole, and the key nouns are translated according to their types to obtain a first translation result. The other contents except the key nouns in the first recognition text are translated to obtain a second translation result, and the first translation result and the second translation result are combined to obtain the second target language text. It should be noted that, for Chinese, a noun may be composed of multiple characters, so the key noun needs to be segmented as a whole by means of word segmentation. For other languages, such as English, the key noun may be a single word or a noun phrase (e.g., Los Angeles). When the key noun is a noun phrase, the phrase can likewise be segmented as a whole by means of word segmentation.
According to the method provided by the embodiment of the invention, when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, the first target language text is translated to obtain a second recognition text. If the semantics of the first recognized text and the semantics of the second recognized text are not equivalent, the user is prompted with the translation difficulty degree corresponding to the first recognized text, and the operation executed by the user based on the prompt is detected. If the operation executed by the user is detected to be inputting a key text, semantic analysis is performed on the key text to obtain key nouns and their types, the key nouns are translated based on their types to obtain a first translation result, the other contents except the key nouns in the first recognition text are translated to obtain a second translation result, and the first translation result and the second translation result are combined to obtain a second target language text. Because the user can be prompted to input a key text through interaction when the translation result cannot be confirmed to be correct, each key noun in the key text can be segmented as a whole word and translated according to its type, so that the translation result is more accurate.
Based on the content of the above embodiment, before prompting the user of the translation difficulty degree corresponding to the first recognized text, it may also be determined whether the semantics of the first recognized text are equivalent to the semantics of the second recognized text. Correspondingly, as an optional embodiment, the embodiment of the invention also provides a method for judging whether text semantics are equivalent. Referring to fig. 3, the method includes: 301. vectorizing the first recognition text and the second recognition text respectively to obtain a first word vector sequence and a second word vector sequence, and calculating the distance between the first word vector sequence and the second word vector sequence; 302. if the distance between the first word vector sequence and the second word vector sequence is not less than a third preset threshold, determining that the first recognized text and the second recognized text are not semantically equivalent.
In step 301, when the first recognition text and the second recognition text are vectorized, they may each be encoded by the encoding module of a recurrent-neural-network translation system, and the hidden-state output at the last time step of the recurrent neural network is used as the vectorized representation, so as to obtain a first word vector sequence corresponding to the first recognition text and a second word vector sequence corresponding to the second recognition text. When the distance between the first word vector sequence and the second word vector sequence is calculated, a Dynamic Time Warping (DTW) algorithm may be used, the cosine distance between the two word vector sequences may be calculated, or the distance may be computed after the two word vector sequences are abstractly characterized with a CNN/RNN. In addition, in step 302, if the distance between the first word vector sequence and the second word vector sequence is smaller than the third preset threshold, it is determined that the first recognized text and the second recognized text are semantically equivalent. If they are determined to be semantically equivalent, the credibility of the first target language text as the translation result is high, and the first target language text may be used directly as the final translation result. If the distance between the first word vector sequence and the second word vector sequence is not less than the third preset threshold, it is determined that the first recognized text and the second recognized text are not semantically equivalent, indicating that the first target language text is less reliable as a translation result.
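      As a minimal sketch of this equivalence check, assuming the two texts have already been encoded into fixed-length vectors and using the cosine-distance option (the threshold value below is an illustrative assumption):

```python
import math

T_SEM = 0.4  # third preset threshold (illustrative value)

def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def semantically_equivalent(vec1: list[float], vec2: list[float]) -> bool:
    # Distance below the threshold: the texts are treated as equivalent and
    # the first target language text is kept as the final translation result.
    # Otherwise the user is prompted with the translation difficulty.
    return cosine_distance(vec1, vec2) < T_SEM
```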
      In the method provided by the embodiment of the invention, the first recognition text and the second recognition text are respectively vectorized to obtain the first word vector sequence and the second word vector sequence, and the distance between the first word vector sequence and the second word vector sequence is calculated. And if the distance between the first word vector sequence and the second word vector sequence is not less than a third preset threshold value, determining semantic inequivalence between the first recognized text and the second recognized text. After the correct recognition and the wrong translation are determined, secondary judgment can be carried out on whether the first target language text is used as the translation result or not according to the judgment result of whether the semantics between the first recognized text and the second recognized text are equivalent or not, so that the probability of wrong information transmission in the voice translation process is reduced, and the translation result is more accurate.
In the above embodiment, when the first recognized text is translated based on the key nouns and their types, each key noun may be individually translated as a whole segmented word, the other contents except the key nouns in the first recognized text are translated at the same time, and finally the translation results of the two parts are combined into a complete translation result. However, considering that the translation result of a key noun is usually fixed (as with person names, place names, organization names, and noun phrases), it is not affected by the other contents in the text. Based on this principle, the two translation processes can be separated, i.e., during the specific translation, the key nouns can be translated after the other contents in the first recognition text are translated. Accordingly, as an alternative embodiment, the embodiment of the present invention does not specifically limit the method for translating a key noun based on its type to obtain the first translation result, which includes but is not limited to: determining a corresponding placeholder according to the type of the key noun, converting the placeholder into a translated target language noun, and taking the translated target language noun as the first translation result.
      Based on the content of the foregoing embodiment, as an optional embodiment, the embodiment of the present invention does not specifically limit the method for translating the first recognized text based on the placeholder. Referring to fig. 4, the method includes: 1031. determining a corresponding placeholder according to the type of the key noun, and replacing the key noun with the placeholder corresponding to the key noun according to the position of the key noun in the first recognition text to obtain a replaced first recognition text; 1032. inputting the replaced first recognition text into a translation system, and outputting a third target language text, wherein the third target language text comprises placeholders corresponding to key nouns; 1033. and converting the placeholders contained in the third target language text into translated target language nouns to obtain a second target language text.
In step 1031, the placeholder may be a predefined character string, or the user may customize the placeholder as required, which is not specifically limited in the embodiment of the present invention. For example, the key noun "Brooklyn" is a place name, and its placeholder may be "$_LOC_". It should be noted that, as can be seen from the name of the placeholder, the key noun type corresponding to the placeholder is a place name. In addition, when two key nouns that are place names appear in the first recognition text, such as "Brooklyn" and "Boston", their placeholders can be "$_LOC_1" and "$_LOC_2" respectively, for distinction.
Taking the first recognized text "May I ask how to get to Brooklyn?" as an example, the position of the key noun "Brooklyn" in the first recognized text can be determined, and the key noun can be replaced with its placeholder according to that position, so that the replaced first recognition text "May I ask how to get to $_LOC_?" is obtained. After the replaced first recognition text is translated, a third target language text containing the placeholder can be obtained, for example "How can I get to $_LOC_?".
It should be noted that, for Chinese, a noun may be composed of multiple characters, so the key noun needs to be segmented as a whole by means of word segmentation and replaced by a placeholder. For other languages, such as English, the key noun is typically a single word, but may also be a noun phrase (e.g., Los Angeles). When the key noun is a noun phrase, the phrase can be segmented as a whole by means of word segmentation and replaced by a placeholder; for example, the whole segmented phrase "Los Angeles" can be replaced by one placeholder.
After the third target language text is obtained, the placeholder it contains may be converted into a translated target language noun. Specifically, the placeholder in the third target language text may be translated according to a pre-trained key noun translation model, and the placeholder in the third target language text replaced by the translated target language noun to obtain the second target language text. It should be noted that, when training the key noun translation model, a modeling unit smaller than a word, such as a single character or a phoneme, may be used for model construction, which is not specifically limited in the embodiment of the present invention.
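      A minimal sketch of this replace-translate-restore flow (steps 1031-1033) follows; the placeholder table, the translate callable and the noun_translations lookup are illustrative assumptions standing in for the translation system and the key noun translation model:

```python
PLACEHOLDER_BY_TYPE = {"place": "$_LOC_", "person": "$_PER_", "org": "$_ORG_"}

def translate_with_placeholders(text, key_nouns, translate, noun_translations):
    """key_nouns: (noun, type) pairs from semantic parsing of the key text;
    translate: the sentence-level MT system; noun_translations: the separate
    key-noun translation model, reduced here to a lookup table."""
    slots = {}
    for i, (noun, noun_type) in enumerate(key_nouns, start=1):
        placeholder = f"{PLACEHOLDER_BY_TYPE[noun_type]}{i}"   # e.g. "$_LOC_1"
        text = text.replace(noun, placeholder)                 # step 1031: masked first recognition text
        slots[placeholder] = noun_translations[noun]           # first translation result for this noun
    third_target_text = translate(text)                        # step 1032: third target language text
    for placeholder, target_noun in slots.items():             # step 1033: restore translated nouns
        third_target_text = third_target_text.replace(placeholder, target_noun)
    return third_target_text                                   # second target language text
```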
According to the method provided by the embodiment of the invention, the placeholder corresponding to each key noun is determined according to the type of the key noun, and each key noun is replaced with its corresponding placeholder according to its position in the first recognition text, so that the replaced first recognition text is obtained. The replaced first recognition text is input into the translation system, and the third target language text is output. The placeholders contained in the third target language text are converted into translated target language nouns to obtain the second target language text. When the first recognition text is translated, the key nouns that are prone to translation errors can be replaced by placeholders in a targeted manner, and the key nouns corresponding to the placeholders translated separately, so the translation effect for the key nouns is improved and the translation result is more accurate. Meanwhile, the placeholders corresponding to the key nouns can be customized by the user, so the personalized customization requirements of the user in the speech translation process can be met.
Based on the content of the foregoing embodiments, as an alternative embodiment, the embodiment of the present invention does not specifically limit the method for obtaining the key nouns and their types, which includes but is not limited to: acquiring a key text, and performing semantic analysis on the key text to obtain the key nouns in the key text and the type of each key noun.
The key text may be input by the user, for example by voice input or text input, which is not specifically limited in this embodiment of the present invention. For example, the user may input the key text "Brooklyn is a place name" by voice, and "Brooklyn" can be resolved as a place name by a semantic parsing tool, so that the key noun can be determined to be "Brooklyn" and its type to be "place name".
It should be noted that, when the user inputs the key text, it may be input in the expression form "XXX is XXX", for example, "Brooklyn is a place name", "Clinton is a person name", "the World Environmental Protection Organization is an organization name", and the like. Of course, other expression forms may also be used, such as "the place name in this sentence is Brooklyn", which is not particularly limited in the embodiment of the present invention.
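      As a minimal sketch, assuming key texts follow the "XXX is XXX" form in English (a real system would use a semantic parsing tool rather than the illustrative pattern below):

```python
import re

TYPE_LABELS = ("place name", "person name", "organization name")

def parse_key_text(key_text: str):
    """Return (key_noun, type) or None if the key text does not match."""
    pattern = r"^(?:the )?(.+?) is (?:a |an )?(" + "|".join(TYPE_LABELS) + r")$"
    match = re.match(pattern, key_text.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1), match.group(2)

# parse_key_text("Brooklyn is a place name") -> ("Brooklyn", "place name")
```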
      According to the method provided by the embodiment of the invention, the key text is obtained, and the semantic analysis is performed on the key text, so that the key nouns in the key text and the type of each key noun are obtained. Because the user can input the key text according to the self-defined expression form, the requirements of the user for personalized customization can be met while the key nouns and the types of the key nouns are obtained based on the key text.
Based on the content of the foregoing embodiments, after the first target language text is translated to obtain the second recognized text, if the semantics of the first recognized text and the second recognized text are not equivalent, some key nouns that are difficult to translate may exist in the first recognized text, so the credibility of the first target language text as a translation result is not high. This corresponds to the second case in the above embodiments. As for the first case in the above embodiments, if the semantics of the first recognized text are not equivalent to the semantics of the second recognized text, the expression form of the first recognized text may not be conducive to correct translation, so the credibility of the first target language text obtained by translating the first recognized text is likewise not high.
      In the second case, the processing can be performed as in the above embodiment. For the first situation, as an optional embodiment, after prompting the user of the difficulty level of translation corresponding to the first recognized text, an embodiment of the present invention further provides an interactive speech translation method, where the method includes: if the operation executed by the user is detected to be re-inputting of the voice signal, acquiring a third recognition text, and re-executing the text translation process based on the third recognition text; the third recognition text is the recognition text corresponding to the re-input voice signal, and text data between the first recognition text and the third recognition text are different and semantically equivalent.
Specifically, the user may re-input the voice signal; after the re-input voice signal is recognized to obtain a third recognized text, the third recognized text may be translated to obtain a corresponding target language text, and the text translation process re-executed from step 101 in the above embodiments based on the third recognized text and the corresponding target language text. Compared with the voice signal corresponding to the first recognized text, when re-inputting the voice signal the user can change the sentence expression form or shorten the sentence length, so that the third recognized text differs from the first recognized text in expression form (different text data) but is the same in substance (semantically equivalent), which facilitates subsequent translation.
      According to the method provided by the embodiment of the invention, when the semantics of the first recognition text and the second recognition text are not equivalent, the third recognition text is obtained, and the text translation process is executed again based on the third recognition text. Because the user can input the voice signal again by adjusting the expression form or the sentence length and execute the voice translation process again, a new voice translation interaction mode is provided when the semantics between the first recognized text and the second recognized text are not equivalent, so that the translation result is more accurate.
The text translation process in the above embodiments mainly addresses the situation where recognition is correct but translation is wrong. However, in an actual speech translation scenario, recognition errors may also arise. In order to avoid transmitting erroneous information, the first recognized text needs further processing. Correspondingly, as an optional embodiment, the embodiment of the invention also provides an interactive voice translation method. Referring to fig. 5, the method includes: 501. if the recognition confidence of the first recognition text is not greater than the first preset threshold, prompting the user to confirm the first recognition text again; 502. if a confirmation-error instruction and a text modification instruction input by the user for the first recognition text are detected, modifying the first recognition text according to the text modification instruction, and re-executing the text translation process based on the modified first recognition text.
Specifically, if the recognition confidence of the first recognized text is not greater than the first preset threshold, the recognition may be wrong. In order to determine whether the first recognized text is really an erroneous recognition result of the speech signal, the user can be prompted, by voice or an interface prompt, to confirm the first recognized text again so as to determine whether it is error-free. If a confirmation-error instruction for the first recognized text is detected, the user has determined that the first recognized text contains an error. If a text modification instruction for the first recognized text is then detected, the first recognized text can be modified according to the text modification instruction, and the text translation process re-executed based on the modified first recognized text. The text modification instruction may be a voice instruction input by the user. For example, if the first recognized text is "May I ask how to get to Bruceland?" (a misrecognition of "Brooklyn"), the user may input a voice command such as "change Bruceland to Brooklyn" to modify the first recognized text. Of course, besides modifying the first recognized text through a voice instruction, manual modification and the like may also be adopted, which is not particularly limited in the embodiment of the present invention.
In the above embodiment, when it is detected that the user has input a confirmation-error instruction for the first recognized text, the recognition error is corrected mainly by modifying the first recognized text with a text modification instruction. In an actual implementation scenario, the user may also choose to replace the recognized text. Correspondingly, as an optional embodiment after prompting the user to confirm the first recognized text again in a preset manner, the embodiment of the present invention further provides an interactive speech translation method, including: if it is detected that the user inputs a confirmation-error instruction for the first recognition text and re-inputs the voice signal, acquiring a fourth recognition text, and re-executing the text translation process based on the fourth recognition text; the fourth recognition text is the recognition text corresponding to the re-input voice signal.
Based on the content of the above embodiment, as an optional embodiment after prompting the user to confirm the first recognized text again in a preset manner, an embodiment of the present invention further provides an interactive voice translation method, including: if a confirmation-correct instruction input by the user for the first recognition text is detected, resetting the recognition confidence of the first recognition text to the maximum value of the recognition confidence, and executing the text translation process again.
If a confirmation-correct instruction for the first recognition text is detected, the user has determined that the first recognition text is recognized without error. At this time, the recognition confidence of the first recognized text may be reset to the maximum value of the recognition confidence, and the interactive speech translation process executed again from step 101 in the above embodiments.
For example, suppose the value range of the recognition confidence is [0, 1]. If the recognition confidence of the first recognized text is 0.3 and the first preset threshold is 0.6, the recognition confidence 0.3 is smaller than the first preset threshold 0.6. If a confirmation-correct instruction for the first recognition text is detected, the recognition confidence 0.3 may be reset to 1, and step 101 executed again. Because the reset recognition confidence of 1 is greater than the first preset threshold of 0.6, whether the translation confidence of the first target language text is greater than the second preset threshold can then be judged. If the recognition confidence of the first recognition text is greater than the first preset threshold and the translation confidence of the first target language text is not greater than the second preset threshold, the text translation process may continue according to the contents of the above embodiments.
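      A minimal sketch of this reset, under the assumption of a [0, 1] confidence scale:

```python
MAX_CONFIDENCE = 1.0  # maximum of the recognition confidence on a [0, 1] scale

def confidence_after_confirmation(score_asr: float, confirmed_correct: bool) -> float:
    # A user confirmation overrides a low ASR score: resetting to the
    # maximum guarantees score_asr > T_asr, so the flow proceeds to the
    # translation-confidence check instead of re-prompting the user.
    return MAX_CONFIDENCE if confirmed_correct else score_asr

# e.g. confidence_after_confirmation(0.3, True) -> 1.0, which exceeds T_asr = 0.6
```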
According to the method provided by the embodiment of the invention, when the recognition confidence of the first recognition text is not greater than the first preset threshold, the user is prompted to confirm the first recognition text again. If a confirmation-error instruction and a text modification instruction input by the user for the first recognition text are detected, the first recognition text is modified according to the text modification instruction, and the text translation process is re-executed based on the modified first recognition text. If it is detected that the user inputs a confirmation-error instruction for the first recognition text and re-inputs the voice signal, a fourth recognition text is acquired, and the text translation process is executed again based on the fourth recognition text. If a confirmation-correct instruction input by the user for the first recognition text is detected, the recognition confidence of the first recognition text is reset to the maximum value of the recognition confidence, and the text translation process is executed again. In the case of a recognition error, a new speech translation interaction mode is provided for text translation, so that the translation result is more accurate.
      Based on the content of the foregoing embodiment, as an optional embodiment, before translating the first target language text, an embodiment of the present invention further provides a method for calculating a recognition confidence and a translation confidence, where the method includes: calculating the recognition confidence coefficient of the first recognition text according to the posterior probability of each word in the first recognition text and the number of the words; and calculating the translation confidence of the first target language text according to the translation probability of each target word in the first target language text and the number of the target words.
      Wherein, the posterior probability of each word in the first recognition text is used for representing the possibility of each word as the recognition result. The translation probability of each target participle in the first target language text is used to represent the likelihood of each target participle as a translation result.
When calculating the recognition confidence of the first recognition text, the posterior probability of each word in the first recognition text may be averaged based on the number of words; the specific calculation process may refer to the following formula:

      score_asr = (1/N) · Σ_{n=1}^{N} P(x_n | O)

In the above formula, the first recognition text may be expressed as x = (x_1, x_2, x_3, ..., x_N). score_asr represents the recognition confidence of the first recognized text, and N represents the number of segmented words in the first recognized text. O denotes the speech signal corresponding to the first recognized text, and P(x_n | O) represents the posterior probability of the n-th segmented word x_n.

      When calculating the translation confidence of the first target language text, the translation probability of each target segmented word in the first target language text may be averaged based on the number of target segmented words; the specific calculation process may refer to the following formula:

      score_mt = (1/M) · Σ_{m=1}^{M} P(y_m | x)

In the above formula, the first target language text may be expressed as y = (y_1, y_2, y_3, ..., y_M). score_mt represents the translation confidence of the first target language text, and M represents the number of target segmented words in the first target language text. x denotes the first recognized text, and P(y_m | x) represents the translation probability of the m-th target segmented word y_m.
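      A minimal sketch of these two averages, assuming the per-token probabilities are supplied by the ASR and MT systems:

```python
def recognition_confidence(posteriors: list[float]) -> float:
    # score_asr = (1/N) * sum_{n=1..N} P(x_n | O)
    return sum(posteriors) / len(posteriors)

def translation_confidence(translation_probs: list[float]) -> float:
    # score_mt = (1/M) * sum_{m=1..M} P(y_m | x)
    return sum(translation_probs) / len(translation_probs)

# e.g. recognition_confidence([0.9, 0.8, 0.4]) -> 0.7
```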
In addition, after the second target language text is obtained through the above embodiments, the second target language text may be delivered to the target language user as the translation result, and feedback from the target language user may be detected to determine whether the target language user can understand the second target language text. When the target language user cannot understand the second target language text, either the recognition or the translation is wrong, so text translation can continue according to the branch logic of the above embodiments in which "the recognition confidence of the first recognized text is greater than the first preset threshold and the translation confidence of the first target language text is not greater than the second preset threshold". Of course, different branch logics or different processing manners in the above embodiments may also be selected as required to continue text translation, which is not specifically limited in the embodiment of the present invention. For example, suppose the target language is English. When the target language user feeds back "Pardon" or "I can't understand", it may be determined that the target language user cannot correctly understand the second target language text.
According to the method provided by the embodiment of the invention, when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, the first target language text is translated to obtain a second recognition text. If the semantics of the first recognized text and the semantics of the second recognized text are not equivalent, the user is prompted with the translation difficulty degree corresponding to the first recognized text, and the operation executed by the user based on the prompt is detected. If the operation executed by the user is detected to be inputting a key text, semantic analysis is performed on the key text to obtain key nouns and their types, the key nouns are translated based on their types to obtain a first translation result, the other contents except the key nouns in the first recognition text are translated to obtain a second translation result, and the first translation result and the second translation result are combined to obtain a second target language text. Because the user can be prompted to input a key text through interaction when the translation result cannot be confirmed to be correct, each key noun in the key text can be segmented as a whole word and translated according to its type, so that the translation result is more accurate.
      Secondly, vectorizing the first recognition text and the second recognition text respectively to obtain a first word vector sequence and a second word vector sequence, and calculating the distance between the first word vector sequence and the second word vector sequence. And if the distance between the first word vector sequence and the second word vector sequence is not less than a third preset threshold value, determining semantic inequivalence between the first recognized text and the second recognized text. After the correct recognition and the wrong translation are determined, secondary judgment can be carried out on whether the first target language text is used as the translation result or not according to the judgment result of whether the semantics between the first recognized text and the second recognized text are equivalent or not, so that the probability of wrong information transmission in the voice translation process is reduced, and the translation result is more accurate.
The corresponding placeholders are determined according to the types of the key nouns, and the key nouns are replaced with their corresponding placeholders according to their positions in the first recognition text, so that the replaced first recognition text is obtained. The replaced first recognition text is input into the translation system, and the third target language text is output. The placeholders contained in the third target language text are converted into translated target language nouns to obtain the second target language text. When the first recognition text is translated, the key nouns that are prone to translation errors can be replaced by placeholders in a targeted manner, and the key nouns corresponding to the placeholders translated separately, so the translation effect for the key nouns is improved and the translation result is more accurate. Meanwhile, the placeholders corresponding to the key nouns can be customized by the user, so the personalized customization requirements of the user in the speech translation process can be met.
Thirdly, the key text is obtained and semantically parsed to extract the key nouns it contains and the type of each key noun. Because the user may phrase the key text in a self-defined expression form, the key nouns and their types can be obtained from the key text while the user's personalized customization needs are met.
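Since the patent leaves the expression form of the key text to the user, any parser necessarily encodes a convention. The sketch below assumes one such convention, clauses of the form "<noun> is a <type>" separated by semicolons, purely for illustration; the example nouns are hypothetical.

```python
import re

# Assumed key-text convention: "Xiaomi is a company; Chaoyang is a place".
KEY_CLAUSE = re.compile(r"(?P<noun>.+?)\s+is\s+an?\s+(?P<type>\w+)\s*$")

def parse_key_text(key_text):
    """Return the (key_noun, noun_type) pairs found in the key text."""
    pairs = []
    for clause in key_text.split(";"):
        match = KEY_CLAUSE.match(clause.strip())
        if match:
            pairs.append((match.group("noun"), match.group("type")))
    return pairs

# parse_key_text("Xiaomi is a company; Chaoyang is a place")
# -> [("Xiaomi", "company"), ("Chaoyang", "place")]
```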
In addition, when the semantics of the first recognition text and the second recognition text are not equivalent, a third recognition text can be obtained and the text translation process re-executed based on it. Since the user can adjust the expression form or the sentence length, input the speech signal again, and rerun the speech translation process, a new speech translation interaction mode is available when the two texts are not semantically equivalent, which makes the translation result more accurate.
Finally, when the recognition confidence of the first recognition text is not greater than the first preset threshold, the user is prompted to reconfirm the first recognition text. If a confirmation error instruction and a text modification instruction input by the user for the first recognition text are detected, the first recognition text is modified according to the text modification instruction and the text translation process is re-executed based on the modified first recognition text. If a confirmation error instruction is detected and the user inputs the speech signal again, a fourth recognition text is acquired and the text translation process is re-executed based on the fourth recognition text. If a confirmation error-free instruction input by the user for the first recognition text is detected, the recognition confidence of the first recognition text is reset to the maximum value of the recognition confidence and the text translation process is re-executed. In the case of a possibly wrong recognition, a new speech translation interaction mode is thus provided for text translation, which makes the translation result more accurate.
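These three branches can be summarized in code. The sketch below is illustrative only: the ui object, its reconfirm method and the three action kinds are assumptions about the interaction layer, since the patent fixes the outcomes but not the interface; the threshold and maximum confidence values are likewise assumed.

```python
FIRST_THRESHOLD = 0.8  # first preset threshold (assumed value, as above)
MAX_CONFIDENCE = 1.0   # assumed maximum of the recognition-confidence scale

def handle_recognition(first_recognition_text, recognition_confidence,
                       ui, run_text_translation):
    if recognition_confidence > FIRST_THRESHOLD:
        return run_text_translation(first_recognition_text, recognition_confidence)
    # Prompt the user to reconfirm the doubtful recognition text.
    action = ui.reconfirm(first_recognition_text)
    if action.kind == "error_and_modify":
        # Confirmation error instruction plus a text modification instruction.
        return run_text_translation(action.modified_text, recognition_confidence)
    if action.kind == "error_and_respeak":
        # Error confirmed; the re-input speech yields a fourth recognition text.
        return run_text_translation(action.fourth_recognition_text,
                                    recognition_confidence)
    if action.kind == "confirmed_correct":
        # Reset the recognition confidence to its maximum and translate.
        return run_text_translation(first_recognition_text, MAX_CONFIDENCE)
```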
It should be noted that all of the above alternative embodiments may be combined arbitrarily to form further alternative embodiments of the present invention, which are not described in detail herein.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides an interactive speech translation apparatus, which is configured to execute the interactive speech translation method provided in the foregoing method embodiment. Referring to fig. 6, the apparatus includes:
      the first translation module  601 is configured to translate the first target language text to obtain a second recognition text when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, where the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text;
      a first prompting module  602, configured to prompt a user of a translation difficulty level corresponding to a first recognized text when semantic inequivalence exists between the first recognized text and a second recognized text, and detect an operation performed by the user based on the prompt;
the second translation module 603 is configured to, when it is detected that the operation performed by the user is to input a key text, perform semantic parsing on the key text to obtain the key nouns and the types of the key nouns, translate the key nouns based on their types to obtain a first translation result, translate the other contents in the first recognition text except the key nouns to obtain a second translation result, and combine the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
      As an alternative embodiment, the apparatus further comprises:
      the calculation module is used for respectively vectorizing the first recognition text and the second recognition text to obtain a first word vector sequence and a second word vector sequence, and calculating the distance between the first word vector sequence and the second word vector sequence;
      and the determining module is used for determining semantic inequivalence between the first recognition text and the second recognition text when the distance between the first word vector sequence and the second word vector sequence is not less than a third preset threshold value.
      As an alternative embodiment, the second translation module  603 is configured to determine a corresponding placeholder according to the type of the key noun, convert the placeholder into a translated target language noun, and use the translated target language noun as the first translation result.
      As an alternative embodiment, the apparatus further comprises:
the first text translation module is used for acquiring a third recognition text when detecting that the operation executed by the user is to input the voice signal again, and re-executing a text translation process based on the third recognition text; the third recognition text is the recognition text corresponding to the re-input voice signal, and its text data differs from that of the first recognition text while the two are semantically equivalent.
      As an alternative embodiment, the apparatus further comprises:
      the second prompting module is used for prompting the user to confirm the first recognition text again when the recognition confidence coefficient of the first recognition text is not greater than a first preset threshold value;
and the second text translation module is used for modifying the first recognition text according to the text modification instruction when a confirmation error instruction and a text modification instruction input by the user for the first recognition text are detected, and re-executing the text translation process based on the modified first recognition text.
      As an alternative embodiment, the apparatus further comprises:
the third text translation module is used for acquiring a fourth recognition text and re-executing a text translation process based on the fourth recognition text when a confirmation error instruction input by the user for the first recognition text is detected and the user inputs the voice signal again; the fourth recognition text is the recognition text corresponding to the re-input voice signal.
      As an alternative embodiment, the apparatus further comprises:
and the fourth text translation module is used for resetting the recognition confidence of the first recognition text to the maximum value of the recognition confidence and re-executing the text translation process when a confirmation error-free instruction input by the user for the first recognition text is detected.
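To show how the modules just listed might fit together, here is a compositional sketch. It reuses the hypothetical helpers from the earlier sketches and follows the module split of fig. 6; the operation object and its kind attribute are assumptions, and this is not a concrete implementation of the apparatus.

```python
class InteractiveSpeechTranslationApparatus:
    """Wires together the translation modules described above (fig. 6)."""

    def __init__(self, first_translation_module, first_prompting_module,
                 second_translation_module):
        self.first_translation = first_translation_module   # module 601
        self.first_prompting = first_prompting_module        # module 602
        self.second_translation = second_translation_module  # module 603

    def run(self, first_recognition_text, recognition_confidence):
        # Module 601: back-translate when recognition is confident but the
        # translation confidence is low (e.g. should_prompt_user above).
        second_text = self.first_translation(first_recognition_text,
                                             recognition_confidence)
        if second_text is None:
            return None  # translation trusted; no interaction required
        # Module 602: prompt the user and detect the operation performed.
        operation = self.first_prompting(first_recognition_text, second_text)
        if operation.kind == "key_text":
            # Module 603: key-noun-aware translation of the recognition text.
            return self.second_translation(first_recognition_text,
                                           operation.key_text)
        return None  # other operations (e.g. re-input) are handled elsewhere
```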
Since the device provided by the embodiment of the invention is configured to execute the interactive speech translation method described above, it achieves the same technical effects as that method: prompting the user for a key text when the correctness of the translation cannot be confirmed, segmenting and translating key nouns by type, checking semantic equivalence through word vector distances, shielding error-prone key nouns with customizable placeholders, and offering new interaction modes for both doubtful recognition and doubtful translation, all of which make the translation result more accurate. The details are the same as in the method embodiment and are not repeated here.
The embodiment of the invention further provides an interactive speech translation device. Referring to fig. 7, the device includes: a processor 701, a memory 702, and a bus 703;
the processor 701 and the memory 702 communicate with each other through the bus 703;
the processor 701 is configured to call the program instructions in the memory 702 to execute the interactive speech translation method provided by the above embodiments, which includes, for example: if the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text; if the semantics of the first recognition text and the second recognition text are not equivalent, prompting the user of the translation difficulty degree corresponding to the first recognition text, and detecting the operation executed by the user based on the prompt; if the operation executed by the user is detected to be inputting a key text, performing semantic parsing on the key text to obtain key nouns and the types of the key nouns, translating the key nouns based on their types to obtain a first translation result, translating the other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the interactive speech translation method provided in the foregoing embodiments, which includes, for example: if the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text; if the semantics of the first recognition text and the second recognition text are not equivalent, prompting the user of the translation difficulty degree corresponding to the first recognition text, and detecting the operation executed by the user based on the prompt; if the operation executed by the user is detected to be inputting a key text, performing semantic parsing on the key text to obtain key nouns and the types of the key nouns, translating the key nouns based on their types to obtain a first translation result, translating the other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
      The above-described embodiments of the interactive speech translation apparatus and the like are merely illustrative, where units illustrated as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
      Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the various embodiments or some parts of the methods of the embodiments.
Finally, the above embodiments are merely preferred embodiments of the present application and are not intended to limit the scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention shall fall within their protection scope.
    Claims (10)
1. An interactive speech translation method, comprising:
      if the recognition confidence of a first recognition text is greater than a first preset threshold and the translation confidence of a first target language text is not greater than a second preset threshold, translating the first target language text to obtain a second recognition text, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text;
      if the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, prompting a user of the translation difficulty degree corresponding to the first recognition text, and detecting the operation executed by the user based on the prompt;
if the operation executed by the user is detected to be inputting a key text, performing semantic parsing on the key text to obtain key nouns and the types of the key nouns; wherein each key noun is treated as a single word segmentation unit; translating the key nouns based on the types of the key nouns to obtain a first translation result, translating other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
    2. The method of claim 1, wherein before prompting the user for the ease of translation corresponding to the first recognized text, further comprising:
      vectorizing the first recognition text and the second recognition text respectively to obtain a first word vector sequence and a second word vector sequence, and calculating the distance between the first word vector sequence and the second word vector sequence;
and if the distance between the first word vector sequence and the second word vector sequence is not smaller than a third preset threshold, determining that the semantics of the first recognition text and the second recognition text are not equivalent.
    3. The method of claim 1, wherein translating the key noun based on the type of the key noun to obtain a first translation result comprises:
      and determining a corresponding placeholder according to the type of the key noun, converting the placeholder into a translated target language noun, and taking the translated target language noun as the first translation result.
    4. The method of claim 1, wherein after prompting the user for the ease of translation corresponding to the first recognized text, further comprising:
if the operation executed by the user is detected to be re-inputting a voice signal, acquiring a third recognition text, and re-executing a text translation process based on the third recognition text; wherein the third recognition text is the recognition text corresponding to the re-input voice signal, and the text data of the first recognition text and the third recognition text differ while the two are semantically equivalent.
    5. The method of claim 1, further comprising:
      if the recognition confidence of the first recognition text is not greater than a first preset threshold value, prompting the user to confirm the first recognition text again;
and if a confirmation error instruction and a text modification instruction input by the user for the first recognition text are detected, modifying the first recognition text according to the text modification instruction, and re-executing a text translation process based on the modified first recognition text.
    6. The method of claim 5, wherein after prompting the user to reconfirm the first recognized text, further comprising:
      if the user is detected to input the first recognition text and the user inputs the voice signal again, acquiring a fourth recognition text, and executing a text translation process again based on the fourth recognition text; and the fourth recognition text is the recognition text corresponding to the re-input voice signal.
    7. The method of claim 5, wherein after prompting the user to reconfirm the first recognized text, further comprising:
      if the error-free confirmation instruction input by the user to the first recognition text is detected, resetting the recognition confidence coefficient of the first recognition text to the maximum value of the recognition confidence coefficient, and re-executing a text translation process.
    8. An interactive speech translation apparatus, comprising:
      the first translation module is used for translating the first target language text to obtain a second recognition text when the recognition confidence of the first recognition text is greater than a first preset threshold and the translation confidence of the first target language text is not greater than a second preset threshold, wherein the first recognition text and the second recognition text correspond to the same language, and the first target language text is obtained by translating the first recognition text;
the prompting module is used for prompting the user of the translation difficulty degree corresponding to the first recognition text when the semantics of the first recognition text and the semantics of the second recognition text are not equivalent, and detecting the operation executed by the user based on the prompt;
the second translation module is used for performing semantic parsing on the key text to obtain key nouns and the types of the key nouns when it is detected that the operation executed by the user is inputting a key text; wherein each key noun is treated as a single word segmentation unit; translating the key nouns based on the types of the key nouns to obtain a first translation result, translating other contents except the key nouns in the first recognition text to obtain a second translation result, and combining the first translation result and the second translation result to obtain a second target language text; wherein the first recognition text contains the key nouns.
    9. An interactive speech translation device, comprising:
      at least one processor; and
      at least one memory communicatively coupled to the processor, wherein:
      the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
    10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201711287987.XA CN108304389B (en) | 2017-12-07 | 2017-12-07 | Interactive voice translation method and device | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN108304389A CN108304389A (en) | 2018-07-20 | 
| CN108304389B true CN108304389B (en) | 2021-06-08 | 
Family
ID=62870276
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN201711287987.XA Active CN108304389B (en) | 2017-12-07 | 2017-12-07 | Interactive voice translation method and device | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN108304389B (en) | 
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN109166594A (en) * | 2018-07-24 | 2019-01-08 | Beijing Sogou Technology Development Co., Ltd. | A kind of data processing method, device and the device for data processing | 
| CN109271526A (en) * | 2018-08-14 | 2019-01-25 | Alibaba Group Holding Limited | Method for text detection, device, electronic equipment and computer readable storage medium | 
| CN110111770A (en) * | 2019-05-10 | 2019-08-09 | Puyang Dingfeng Network Technology Co., Ltd. | A kind of multilingual social interpretation method of network, system, equipment and medium | 
| CN114254634B (en) * | 2021-12-14 | 2025-01-03 | iFlytek Co., Ltd. | A method, device, storage medium and equipment for mining multimedia data | 
| CN115906878A (en) * | 2022-10-31 | 2023-04-04 | Beijing Zhongke Zhijia Technology Co., Ltd. | Machine translation method based on prompt | 
| CN118675529B (en) * | 2024-05-31 | 2025-03-11 | Shenzhen New Picture Software Communication Co., Ltd. | Data processing method and system for call voice translation | 
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN101520777A (en) * | 2008-02-28 | 2009-09-02 | Toshiba Corporation | Apparatus and method for machine translation | 
| CN102043774A (en) * | 2011-01-13 | 2011-05-04 | Beijing Jiaotong University | Machine translation evaluation device and method | 
| CN102662934A (en) * | 2012-04-01 | 2012-09-12 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for proofing translated texts in inter-lingual communication | 
| CN103744843A (en) * | 2013-12-25 | 2014-04-23 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Online voice translation method and device | 
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US9569430B2 (en) * | 2014-10-24 | 2017-02-14 | International Business Machines Corporation | Language translation and work assignment optimization in a customer support environment | 
| US10133738B2 (en) * | 2015-12-14 | 2018-11-20 | Facebook, Inc. | Translation confidence scores | 
Also Published As
| Publication number | Publication date | 
|---|---|
| CN108304389A (en) | 2018-07-20 | 
Similar Documents
| Publication | Title | 
|---|---|
| CN108304389B (en) | Interactive voice translation method and device | 
| CN110148416B (en) | Speech recognition method, device, equipment and storage medium | 
| CN111177324B (en) | Method and device for carrying out intention classification based on voice recognition result | 
| CN108228574B (en) | Text translation processing method and device | 
| CN112417102B (en) | Voice query method, device, server and readable storage medium | 
| CN107622054B (en) | Text data error correction method and device | 
| KR101768509B1 (en) | On-line voice translation method and device | 
| US8275603B2 (en) | Apparatus performing translation process from inputted speech | 
| US9620122B2 (en) | Hybrid speech recognition | 
| CN107844481B (en) | Recognition text error detection method and device | 
| EP3739575B1 (en) | Acoustic model training using corrected terms | 
| CN108549637A (en) | Method for recognizing semantics, device based on phonetic and interactive system | 
| US10290299B2 (en) | Speech recognition using a foreign word grammar | 
| CN111540353B (en) | Semantic understanding method, device, equipment and storage medium | 
| WO2016125031A1 (en) | Modifying a tokenizer based on pseudo data for natural language processing | 
| WO2018024243A1 (en) | Method and device for verifying recognition result in character recognition | 
| US11907665B2 (en) | Method and system for processing user inputs using natural language processing | 
| CN111985234B (en) | Voice text error correction method | 
| CN111613214A (en) | Language model error correction method for improving voice recognition capability | 
| WO2021143206A1 (en) | Single-statement natural language processing method and apparatus, computer device, and readable storage medium | 
| CN110517668B (en) | Chinese and English mixed speech recognition system and method | 
| CN112101032A (en) | Named entity identification and error correction method based on self-distillation | 
| CN114492396A (en) | Text error correction method for automobile proper nouns and readable storage medium | 
| KR20240006688A (en) | Correct multilingual grammar errors | 
| KR101740671B1 (en) | Multilingual translation method | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |