
WO2018157789A1 - Speech recognition method, computer, storage medium, and electronic apparatus - Google Patents


Info

Publication number
WO2018157789A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
related information
word
target
text
Prior art date
Application number
PCT/CN2018/077413
Other languages
French (fr)
Chinese (zh)
Inventor
康战辉
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2018157789A1

Classifications

    • G – PHYSICS
    • G10 – MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L – SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 – Speech recognition
    • G10L 15/08 – Speech classification or search
    • G10L 15/18 – Speech classification or search using natural language modelling
    • G10L 15/1822 – Parsing for meaning understanding
    • G10L 15/183 – Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26 – Speech to text systems
    • G10L 2015/088 – Word spotting
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 – Handling natural language data
    • G06F 40/20 – Natural language analysis
    • G06F 40/279 – Recognition of textual entities
    • G06F 40/289 – Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the embodiments of the present invention relate to the field of computers, and in particular, to a voice recognition method, a computer, a storage medium, and an electronic device.
  • a general speech recognition system includes at least two parts: an acoustic model and a language model.
  • the acoustic model mainly converts the input speech signal into top-N candidate language sequences; the language model then determines the probability that each candidate sequence forms a normal sentence.
  • a common language model is often built by statistics over massive amounts (hundreds of billions of items or more) of natural text, counting the probability of occurrence of fragments of different lengths (n-grams).
  • a disadvantage of the related art is that a common language model often suffers from recognition bias on certain data.
  • for example, in a voice transcription scenario, specifically a professional academic lecture, a user needs the speech recognition system to automatically produce the conference record.
  • if the speech mentions some niche, professional vocabulary, such as the name of a certain protein,
  • a general speech recognition system may fail to recognize it correctly, because its language model may not cover a corpus in that domain.
  • An embodiment of the present invention provides a speech recognition method, a computer, a storage medium, and an electronic device, so that a keyword, or a word related to the keyword, is accurately recognized in the recognition text obtained from the next received voice signal, improving the accuracy of speech recognition.
  • a first aspect of the embodiments of the present invention provides a speech recognition method, which may include:
  • acquiring a keyword in the preliminary recognition text, where the keyword is a word carrying key information in the preliminary recognition text,
  • and the preliminary recognition text is text recognized from the voice signal;
  • acquiring target related information according to the keyword, where the target related information is context information corresponding to the keyword; and
  • establishing a target language library according to the target related information.
  • a second aspect of the embodiments of the present invention provides a computer, which may include:
  • a first acquiring module configured to acquire a keyword in the preliminary identification text, where the keyword is a word of the key information in the preliminary identification text, and the preliminary identification text is a text identified according to the voice signal;
  • a second obtaining module configured to acquire target related information according to the keyword, where the target related information is context information corresponding to the keyword;
  • an establishing module configured to establish a target language library according to the target related information.
  • a third aspect of the embodiments of the present invention provides a storage medium storing a computer program, wherein the computer program is configured to perform the above method when run.
  • a fourth aspect of the embodiments of the present invention provides an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the above method by the computer program.
  • the keyword in the preliminary recognition text is obtained, where the keyword is a word carrying key information in that text and the preliminary recognition text is text recognized from the voice signal; target related information is obtained according to the keyword, where the target related information is context information corresponding to the keyword; and the target language library is established according to the target related information.
  • thus the computer can receive the voice signal, obtain the corresponding preliminary recognition text from it, obtain the keyword from that text, obtain the target related information from the keyword, and establish the target language library from that information.
  • the target language library is then used to recognize the keyword, or topic words related to it, in the recognition text obtained from the next received voice signal, so that recognition becomes accurate and the accuracy of speech recognition improves.
  • FIG. 1 is a schematic diagram of a general speech recognition system according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a framework of a speech recognition system applied in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an embodiment of a method for voice recognition according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of voice recognition in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an embodiment of a computer according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of another embodiment of a computer according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of another embodiment of a computer in an embodiment of the present invention.
  • An embodiment of the present invention provides a speech recognition method and a computer, so that in the recognition text obtained from the next received voice signal, the keyword, or topic words related to it, can be accurately recognized, improving the accuracy of speech recognition.
  • Natural language is simply human language.
  • Natural language processing (NLP) is the processing of human language, mainly by computer; it is an interdisciplinary subject between computer science and linguistics. Common research tasks include: word segmentation / word breaking (WB); information extraction (IE); relation extraction (RE); named entity recognition (NER); part-of-speech tagging (POS); coreference resolution; parsing; word sense disambiguation (WSD); speech recognition; text-to-speech (TTS); machine translation (MT); automatic summarization; question answering; natural language understanding; optical character recognition (OCR); and information retrieval (IR).
  • the language model is the model used to compute the probability of a sentence, namely P(W1, W2, ..., Wk).
  • for example, the input pinyin string nixianzaiganshenme can correspond to outputs of various forms, such as "what are you doing now" or "what are you going to do in Xi'an"; to decide which is the correct conversion result, we use
  • the language model: it tells us that the probability of the former is greater than that of the latter, so converting to the former is more reasonable in most cases.
  • FIG. 1, a schematic diagram of a general speech recognition system, includes at least two parts: an acoustic model and a language model.
  • the acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like.
  • the language model is a knowledge representation of a sequence of words.
  • Common language models often exhibit recognition bias. For example, in a voice transcription scenario, specifically a professional academic lecture, a speech recognition system is required to automatically produce the conference record. If the speech mentions some niche, professional vocabulary (such as the name of a certain protein), the general speech recognition system often cannot recognize it correctly, because its language model may not cover a corpus in that domain.
  • moreover, such niche, professional, long-tail corpora cannot be exhaustively enumerated (or the cost of exhaustive enumeration is not worthwhile).
  • FIG. 2 is a schematic diagram of a framework of a speech recognition system according to an embodiment of the present invention, including: speech recognition input, the speech recognition system, preliminary recognition text, extracted topic words, summaries of top full-network search results, training corpus, and the domain language model.
  • the problem addressed by the embodiment of the present invention is to add domain-related long-tail corpus to a general language model system in real time. Professional vocabulary in a speech transcription session may go unrecognized the first few times by the general speech recognition system; but as transcription proceeds, the system automatically and accurately extracts language-model corpus for the corresponding domain in real time, so that when the speaker mentions that vocabulary again, it, and even words related to it, can be effectively recognized.
  • FIG. 3 is a schematic diagram of an embodiment of a speech recognition method according to an embodiment of the present invention, including:
  • the computer receives the voice signal.
  • the voice signal here may be the speech of a participant in a conference scene as received by the computer, or a piece of speech received by the computer in scenes such as an academic report, a topic research report, or a lecture on professional knowledge.
  • the acoustic model can be trained with LSTM+CTC to obtain the mapping from speech features to phonemes;
  • the language model (LM) can be trained with the SRILM toolkit to obtain 3-gram and 4-gram models, i.e., the mapping from words to sentences;
  • the dictionary is the set of phoneme indices corresponding to words, i.e., the mapping between words and phonemes.
  • the so-called acoustic model classifies (decodes) the acoustic features of speech into phoneme or word units; the language model then decodes the words into a complete sentence.
  • the language model represents the probability of a sequence of words.
  • the chain rule is used to decompose the probability of a sentence into the product of the conditional probabilities of each word in it.
  • let W be composed of w1, w2, ..., wn; then P(W) can be decomposed (by the conditional probability and multiplication formulas) as:
  • P(W) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1,w2,...,wn-1)
  • the ternary grammar (trigram) conditions each word only on the preceding two words:
  • P(W) ≈ P(w1)P(w2|w1)P(w3|w1,w2)P(w4|w2,w3)...P(wn|wn-2,wn-1)
  • using the Bayes formula, the probability of adjacent words co-occurring can be computed from counts over the whole corpus; the occurrence probabilities of single words are then counted and substituted in.
  • the n-gram here is over word sequences, so an n-gram is equivalent to a phrase; some phrases never appear in the corpus yet still have some probability of occurring, so the algorithm needs to assign (smooth) probabilities for these uncommon phrases.
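  • the counting and smoothing described above can be sketched as a toy bigram language model in Python (a minimal illustration only; the tiny corpus and the choice of add-one smoothing are assumptions for the example, not the patent's method):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])          # contexts (everything but </s>)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(vocab)

def bigram_prob(w_prev, w, unigrams, bigrams, vsize):
    # Add-one (Laplace) smoothing gives unseen bigrams a small nonzero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vsize)

def sentence_prob(sentence, model):
    unigrams, bigrams, vsize = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur, unigrams, bigrams, vsize)
    return p

corpus = ["what are you doing now", "what are you doing", "going to xian now"]
model = train_bigram_lm(corpus)
# A word order seen in training scores higher than an unseen shuffle of the same words.
print(sentence_prob("what are you doing now", model) >
      sentence_prob("now doing you are what", model))  # prints True
```

The smoothing step is what lets the model assign a probability to phrases that never occurred in the corpus, as the bullet above requires.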
  • the task of the acoustic model is to compute P(X|W), the probability that the speech is produced given the text (finally combined via the Bayes formula, which is where P(X|W) is used).
  • this requires another module, called the dictionary (lexicon). In the eesen source code, for example, the data-preparation stage first builds the lexicon of the corresponding phonemes. Its function is to convert a word string into a phoneme string; the language model is then obtained and the acoustic model trained (here using LSTM+CTC, long short-term memory with connectionist temporal classification). With the help of the dictionary, the acoustic model knows which sounds a given string of text should produce.
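  • the dictionary's role, converting a word string into a phoneme string, can be illustrated with a toy lexicon (the entries below are invented for illustration; a real system would use an actual pronunciation dictionary):

```python
# Toy lexicon: each word maps to its phoneme sequence (entries invented for illustration).
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "recognition": ["r", "eh", "k", "ah", "g", "n", "ih", "sh", "ah", "n"],
}

def words_to_phonemes(words, lexicon):
    """Expand a word string into the phoneme string the acoustic model scores."""
    return [ph for w in words for ph in lexicon[w]]

print(words_to_phonemes(["speech", "recognition"], LEXICON))
```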
  • the computer may further acquire a corresponding preliminary identification text according to the voice signal. That is, the speech signal can obtain the corresponding preliminary identification text through the acoustic model and the general language model in the speech recognition system.
  • a speech signal is input, and a sequence of words (composed of characters or words) is sought such that the sequence matches the speech signal best;
  • the degree of matching is generally expressed as a probability.
  • by the Bayes formula, P(W|X) = P(X|W)P(W)/P(X); since P(X) is constant for a given input, the denominator can be omitted.
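  • dropping the constant P(X) and maximizing P(X|W)·P(W) over the candidate sequences can be sketched as follows (the candidate sentences and all probabilities are made-up numbers for illustration):

```python
def best_hypothesis(candidates, acoustic_p, lm_p):
    # argmax over W of P(X|W) * P(W); the constant denominator P(X) is omitted.
    return max(candidates, key=lambda w: acoustic_p[w] * lm_p[w])

# Two readings of the pinyin nixianzaiganshenme (probabilities invented).
candidates = ["what are you doing now", "what to do in xian"]
acoustic_p = {"what are you doing now": 0.4, "what to do in xian": 0.45}
lm_p = {"what are you doing now": 0.01, "what to do in xian": 0.001}
print(best_hypothesis(candidates, acoustic_p, lm_p))
# the language model outweighs the slightly better acoustic score of the other reading
```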
  • input pinyin nixianzaiganshenme may correspond to many conversion results.
  • the possible conversion results are shown in Figure 4 (only some of the word nodes are drawn), forming a lattice of nodes.
  • any path from the beginning to the end is a possible conversion result, and the process of selecting the most appropriate result from many conversion results requires a decoding algorithm.
  • a commonly used decoding algorithm is the Viterbi algorithm, which uses dynamic programming to quickly determine the most appropriate path.
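  • a compact Viterbi sketch over a toy two-state lattice (the states, transition probabilities, and emission probabilities are invented for the example; a real decoder searches a far larger phoneme/word lattice):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic programming: keep, per state, the best-probability path so far."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # Extend the best predecessor path into state s.
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1] + [s])
                for prev in states
            )
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())

# Toy lattice: hidden "words", observed "syllables" (probabilities invented).
states = ("now", "xian")
start_p = {"now": 0.6, "xian": 0.4}
trans_p = {"now": {"now": 0.7, "xian": 0.3}, "xian": {"now": 0.4, "xian": 0.6}}
emit_p = {"now": {"ni": 0.5, "xianzai": 0.5}, "xian": {"ni": 0.1, "xianzai": 0.9}}
prob, path = viterbi(["ni", "xianzai"], states, start_p, trans_p, emit_p)
print(path)
```

The table V keeps only the best path into each state per step, which is what makes the search fast compared with enumerating every lattice path.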
  • the keyword is a word carrying key information in the preliminary recognition text,
  • and the preliminary recognition text is text recognized from the voice signal;
  • the keyword in the preliminary recognition text may then be obtained.
  • the subject words can be understood as the core theme of the meeting discussion, or the focus of the meeting report.
  • Obtaining the keyword in the preliminary recognition text may include: obtaining the keyword from the preliminary recognition text according to Formula 1, where Formula 1 is:
  • Score(i) = tf(i) * idf(i), where
  • i refers to the i-th word in the preliminary recognition text,
  • tf(i) refers to the number of times the i-th word appears in the preliminary recognition text, and
  • idf(i) refers to the inverse document frequency of the i-th word in the preliminary recognition text.
  • idf(i) is obtained by offline statistics over a large amount of text data; Formula 2 for computing idf(i) is:
  • idf(i) = log(total number of documents / (number of documents containing word i + 1))
  • the topic words can also be extracted with the TextRank algorithm; that is, keyword extraction is the task of automatically extracting a number of meaningful words or phrases from a given text.
  • the TextRank algorithm uses relations between local words (a co-occurrence window) to rank candidate keywords, extracting them directly from the text itself. The main steps are as follows:
  • construct the candidate keyword graph G = (V, E), where V is the node set consisting of the candidate keywords generated in step (2); edges between two nodes are constructed from co-occurrence: an edge exists between two nodes only when their corresponding words co-occur within a window of length K, where K is the window size, i.e., at most K words;
  • the top T words from step (5) are marked in the original text, and if they form adjacent phrases, they are combined into multi-word keywords. For example, for the sentence "Matlab code for plotting ambiguity function" in the text, if both "Matlab" and "code" are candidate keywords, the combination "Matlab code" is added to the keyword sequence.
  • a TextRank implementation can be parsed as follows: read the text and segment it into words; compute co-occurrence statistics over the segmentation result, with the window defaulting to 5; and save the resulting co-occurrence graph.
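  • the graph construction and ranking steps above can be sketched as follows (a simplified TextRank over toy tokens; stop-word and part-of-speech filtering and phrase merging are omitted, and the damping factor 0.85 follows the usual PageRank convention):

```python
from collections import defaultdict

def textrank_keywords(words, window=5, d=0.85, iters=50):
    """Rank words by iterating PageRank over a co-occurrence graph."""
    # An edge joins two words that co-occur within `window` words of each other.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    # PageRank update: WS(v) = (1 - d) + d * sum over neighbors u of WS(u)/deg(u).
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(neighbors[u]) for u in nbrs)
                 for w, nbrs in neighbors.items()}
    return sorted(score, key=score.get, reverse=True)

tokens = "speech recognition model speech signal model speech recognition".split()
print(textrank_keywords(tokens, window=2)[:2])  # the two best-connected words
```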
  • the extraction of the keyword includes, but is not limited to, the several implementations mentioned above, and the number of keywords obtained by the computer is not limited.
  • target related information is context information corresponding to the keyword
  • the target related information may be acquired according to the keyword, and the target related information is context information corresponding to the keyword.
  • Obtaining target related information based on the keyword can include:
  • obtaining the target related information through a full-network search according to the keyword may include: searching the whole network for the corresponding search results according to the keyword; and matching the search results to determine the target related information.
  • the target related information here can be simply understood as the titles of the articles on the page returned by searching the keyword, or their abstracts, or their full contents. However, it should be understood that if the target related information is the full content of each article, the resources consumed are relatively large.
  • the word "filter" recognized here may or may not be the correct phrase; the computer can automatically obtain page content related to "filter" through some search software.
  • the target language library is established according to the target related information.
  • the method may include: training according to the target related information, and establishing a target language library.
  • the target language library here is the domain language model established at the core of this conference or the core of this report. That is, a series of operations such as filtering and cleaning, domain matching, and the like can be performed on the target related information, and the domain language model can be obtained by training.
  • the training can be performed on the summary information in the hyperlinked content for the high-pass filter, low-pass filter, band-pass filter, and band-stop filter, yielding a language model for the filter domain; this filter-domain
  • language model is added to the general language model shown in Figure 2 above.
  • when relevant information about filters appears again, the speech recognition system, having previously added the filter-domain language model, can recognize it accurately, specifically whether it is a high-pass filter, a low-pass filter, a band-pass filter, or a band-stop filter.
  • the Ngram statistical language model can be used in the embodiment of the present invention.
  • the n-gram model is also called the (n-1)-order Markov model. It makes a finite-history assumption: the probability of the current word depends only on the previous n-1 words, so P(S) can be approximated as:
  • P(S) ≈ ∏i P(wi | wi-n+1, ..., wi-1)
  • when n takes 1, 2, and 3, the n-gram models are called unigram, bigram, and trigram language models, respectively.
  • the parameters of the n-gram model are the conditional probabilities P(wi | wi-n+1, ..., wi-1).
  • here n = 3 is taken as an example, i.e., the trigram language model.
  • the n-gram language model, i.e., the above P(S) model, generally uses maximum likelihood estimation for its parameters: P(wi | wi-2, wi-1) = count(wi-2, wi-1, wi) / count(wi-2, wi-1).
  • the keyword in the preliminary recognition text is obtained, where the keyword is a word carrying key information in that text and the preliminary recognition text is text recognized from the voice signal; target related information is obtained according to the keyword, where the target related information is context information corresponding to the keyword; and the target language library is established according to the target related information.
  • thus the computer can receive the voice signal, obtain the corresponding preliminary recognition text from it, obtain the keyword from that text, obtain the target related information from the keyword, and establish the target language library from that information.
  • the target language library is then used to recognize the keyword, or topic words related to it, in the recognition text obtained from the next received voice signal, so that recognition becomes accurate and the accuracy of speech recognition improves.
  • each word is assigned an "importance" weight.
  • the most common words ("the", "is", "at") are given the least weight; fairly common words ("China") are given a smaller weight; and less common words ("bees", "farming") are given a greater weight.
  • this weight is called the Inverse Document Frequency (IDF); its size is inversely proportional to how common the word is.
  • the first step is to calculate the term frequency:
  • Term Frequency (TF) = number of occurrences of a word in the document
  • considering that documents differ in length, the word frequency is standardized:
  • Term Frequency (TF) = (occurrences of the word in the document) / (total number of words in the document)
  • or: Term Frequency (TF) = (occurrences of the word in the document) / (occurrences of the most frequent word in the document)
  • the second step is to calculate the inverse document frequency:
  • Inverse Document Frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))
  • the third step is to calculate TF-IDF:
  • TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)
  • TF-IDF is proportional to the number of occurrences of a word in the document and inversely proportional to the number of occurrences of the word in the entire language. Therefore, the algorithm for automatically extracting keywords is very clear, that is, the TF-IDF value of each word of the document is calculated, and then arranged in descending order, taking the top words.
  • the TF-IDF algorithm can be used in many other places. For example, in information retrieval, for each document the TF-IDF of each search word ("China", "bee", "farming") can be computed separately and summed to obtain the TF-IDF of the whole document for the query; the document with the highest value is the one most relevant to the search terms.
  • here, "bees" and "farming" are searched as the topic words, context information about "bees" and "farming" is obtained, and the retrieved context is used for training to obtain a language model for the bee-farming domain.
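  • the three TF-IDF steps can be sketched end to end (a minimal illustration; the toy documents echo the "bees"/"farming" example above and are not from the patent):

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Score each word of docs[doc_index] by TF * IDF and return the top_k keywords."""
    words = docs[doc_index].split()
    tf = Counter(words)
    scores = {}
    for w, count in tf.items():
        tf_w = count / len(words)                      # standardized term frequency
        df = sum(1 for d in docs if w in d.split())    # documents containing the word
        scores[w] = tf_w * math.log(len(docs) / (df + 1))  # IDF with +1 in denominator
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = [
    "bees farming bees honey production",
    "china economy growth china trade",
    "the weather in china is warm",
]
print(tf_idf_keywords(docs, 0))  # "bees" ranks first: frequent here, absent elsewhere
```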
  • FIG. 5 is a schematic diagram of an embodiment of a computer in an embodiment of the present invention, including:
  • the first obtaining module 501 is configured to obtain a keyword in the preliminary recognition text, where the keyword is a word carrying key information in the preliminary recognition text, and the preliminary recognition text is text recognized from the voice signal;
  • the second obtaining module 502 is configured to acquire target related information according to the keyword, where the target related information is context information corresponding to the keyword;
  • the establishing module 503 is configured to establish a target language library according to the target related information.
  • the first obtaining module 501 is specifically configured to obtain the keyword from the preliminary recognition text according to Formula 1, where Formula 1 is:
  • Score(i) = tf(i) * idf(i), where i refers to the i-th word in the preliminary recognition text, tf(i) refers to the number of times the i-th word appears in the preliminary recognition text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary recognition text.
  • FIG. 6 is a schematic diagram of another embodiment of a computer in the embodiment of the present invention.
  • the computer further includes:
  • the receiving module 504 is configured to receive a voice signal
  • the third obtaining module 505 is configured to obtain a corresponding preliminary identification text according to the voice signal.
  • the second obtaining module 502 is specifically configured to obtain target related information by searching through the entire network according to the keyword.
  • the second obtaining module 502 is further configured to: obtain a corresponding search result by searching through the entire network according to the keyword, and match the search result to determine the target related information.
  • the second obtaining module 502 is further configured to extract target related information corresponding to the keyword in the preset related information set.
  • the establishing module 503 is specifically configured to perform training according to the target related information to establish a target language library.
  • the embodiment of the invention further provides a storage medium, wherein the storage medium stores a computer program, wherein the computer program is set to execute the above method when it is running.
  • Embodiments of the present invention also provide an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the above method by the computer program.
  • the electronic device may be the computer shown in FIG. 7, and the processor may be the central processor shown in FIG. 7.
  • FIG. 7 is a schematic diagram of another embodiment of a computer in the present invention.
  • Computer 700 may vary considerably by configuration or performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), memory 732, and one or more storage media 730 storing application programs 742 or data 744 (e.g., one or more mass storage devices).
  • the memory 732 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored on storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations on the computer.
  • central processor 722 can be configured to communicate with storage medium 730, executing a series of instruction operations in storage medium 730 on computer 700.
  • Computer 700 may also include one or more power sources 726, one or more wired or wireless network interfaces 750, one or more input and output interfaces 758, and/or one or more operating systems 741, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and more.
  • the central processing unit 722 is further configured to perform the following functions: acquiring a keyword in the preliminary recognition text, where the keyword is a word carrying key information in the preliminary recognition text and the preliminary recognition text is text recognized from the voice signal; acquiring target related information according to the keyword, where the target related information is context information corresponding to the keyword; and establishing a target language library according to the target related information.
  • the central processing unit 722 is specifically configured to obtain the keyword according to the formula 1 according to the preliminary identification text, where the formula 1 is:
  • Score(i) = tf(i) * idf(i), where i refers to the i-th word in the preliminary recognition text, tf(i) refers to the number of times the i-th word appears in the preliminary recognition text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary recognition text.
  • the central processing unit 722 is further configured to receive a voice signal, and obtain a corresponding preliminary identification text according to the voice signal.
  • the central processing unit 722 is specifically configured to obtain target related information through a full network search according to the keyword.
  • the central processing unit 722 is specifically configured to obtain a corresponding search result by searching through the entire network according to the keyword, and matching the search result to determine the target related information.
  • the central processing unit 722 is further configured to extract target related information corresponding to the keyword in the preset related information set.
  • the central processing unit 722 is specifically configured to perform training according to the target related information to establish a target language library.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the part of the technical solution of the present invention that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash drive (U disk), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method, the method comprising: receiving a speech signal (301); on the basis of the speech signal, acquiring a corresponding preliminary recognition text (302); acquiring a topic word in the preliminary recognition text, the topic word being a key-information word in the preliminary recognition text (303); on the basis of the topic word, acquiring target relevant information, the target relevant information being context information corresponding to the topic word (304); and, on the basis of the target relevant information, establishing a target language library (305). With the method, the topic word, or topic words related to it, can be accurately recognized in the recognition text acquired on the basis of the next received speech signal, thereby improving the accuracy of speech recognition.

Description

一种语音识别的方法、计算机、存储介质以及电子装置Method, computer, storage medium and electronic device for speech recognition
本申请要求于2017年03月02日提交中国专利局、申请号为201710121180.2、发明名称为“一种语音识别的方法以及计算机”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims priority to Chinese Patent Application No. 201710121180.2, entitled "Speech Recognition Method and Computer", filed with the Chinese Patent Office on March 2, 2017, which is incorporated herein by reference in its entirety.
技术领域Technical field
本发明实施例涉及计算机领域,尤其涉及一种语音识别的方法、计算机、存储介质以及电子装置。The embodiments of the present invention relate to the field of computers, and in particular, to a voice recognition method, a computer, a storage medium, and an electronic device.
背景技术Background technique
一个通用的语音识别系统至少包括声学模型和语言模型两大部分。其中声学模型主要是将输入的语音信号转化为topN候选的语言序列；而语言模型则是判别候选语言序列是否符合一个正常语句的概率。一个通用的语言模型往往是通过海量（几亿，乃至几十亿，上百亿）自然文本统计不同长度片段（Ngram）的出现概率而构建。A general speech recognition system includes at least two parts: an acoustic model and a language model. The acoustic model mainly converts the input speech signal into top-N candidate language sequences, while the language model determines the probability that a candidate language sequence conforms to a normal sentence. In practice, a general language model is often constructed by counting the occurrence probabilities of fragments of different lengths (N-grams) over a massive corpus (hundreds of millions, even billions or tens of billions) of natural text.
相关技术的缺点是，通用的语言模型往往存在数据识别有偏的问题。比如在语音转写场景下，具体来说比如某个专业的学术演讲场景下，用户需要通过语音识别系统自动做会议记录。此时如果在会议演讲中提到一些小众、专业的词汇（比如某种蛋白质的名字），通用的语音识别系统，由于其中的语言模型可能没有涉及到这方面的语料，进而往往不能正确识别。A disadvantage of the related art is that a general language model often suffers from biased data recognition. For example, in a speech transcription scenario, such as a professional academic lecture, a user needs the speech recognition system to take meeting minutes automatically. If some niche, professional vocabulary (such as the name of a certain protein) is mentioned in the lecture, a general speech recognition system often cannot recognize it correctly, because its language model may not cover the corresponding corpus.
发明内容Summary of the invention
本发明实施例提供了一种语音识别的方法、计算机、存储介质以及电子装置，用于在根据下次接收的语音信号获取的识别文本中，对于该主题词或者该主题词相关的主题词的识别就会显示的是准确的识别，提高了语音识别的准确率。Embodiments of the present invention provide a speech recognition method, a computer, a storage medium, and an electronic device, so that in the recognition text acquired from the next received speech signal, the keyword, or keywords related to it, can be recognized accurately, improving the accuracy of speech recognition.
本发明实施例第一方面提供一种语音识别的方法,可以包括:A first aspect of the embodiments of the present invention provides a method for voice recognition, which may include:
获取初步识别文本中的主题词，该主题词为该初步识别文本中关键信息的词，该初步识别文本为根据语音信号识别得到的文本；Obtaining a keyword in the preliminary recognition text, where the keyword is a word carrying key information in the preliminary recognition text, and the preliminary recognition text is text recognized from a voice signal;
根据该主题词获取目标相关信息,该目标相关信息为与该主题词对应的上下文信息;Obtaining target related information according to the keyword, the target related information is context information corresponding to the keyword;
根据该目标相关信息建立目标语言库。A target language library is established according to the target related information.
本发明实施例第二方面提供一种计算机,可以包括:A second aspect of the embodiments of the present invention provides a computer, which may include:
第一获取模块,用于获取初步识别文本中的主题词,该主题词为该初步识别文本中关键信息的词,该初步识别文本为根据语音信号识别得到的文本;a first acquiring module, configured to acquire a keyword in the preliminary identification text, where the keyword is a word of the key information in the preliminary identification text, and the preliminary identification text is a text identified according to the voice signal;
第二获取模块,用于根据该主题词获取目标相关信息,该目标相关信息为与该主题词对应的上下文信息;a second obtaining module, configured to acquire target related information according to the keyword, where the target related information is context information corresponding to the keyword;
建立模块，用于根据该目标相关信息建立目标语言库。an establishing module, configured to establish a target language library according to the target related information.
本发明实施例第三方面提供一种存储介质,所述存储介质中存储有计算机程序,其中,所述计算机程序被设置为运行时执行上述的方法。A third aspect of the embodiments of the present invention provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute the above method at runtime.
本发明实施例第四方面提供一种电子装置,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器被设置为通过所述计算机程序执行上述的方法。A fourth aspect of the embodiments of the present invention provides an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the above method by the computer program.
从以上技术方案可以看出,本发明实施例具有以下优点:It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages:
在本发明实施例中，获取初步识别文本中的主题词，该主题词为该初步识别文本中关键信息的词，该初步识别文本为根据语音信号识别得到的文本；根据该主题词获取目标相关信息，该目标相关信息为与该主题词对应的上下文信息；根据该目标相关信息建立目标语言库。用户在使用计算机的过程中，计算机可以接收语音信号，根据语音信号获取对应的初步识别文本，再根据初步识别文本获取主题词，然后根据该主题词获取目标相关信息，可以根据相关信息建立目标语言库。In the embodiment of the present invention, the keyword in the preliminary recognition text is acquired, where the keyword is a word carrying key information in the preliminary recognition text, and the preliminary recognition text is text recognized from a voice signal; target related information is acquired according to the keyword, where the target related information is context information corresponding to the keyword; and a target language library is established according to the target related information. When a user uses the computer, the computer can receive a voice signal, acquire the corresponding preliminary recognition text from the voice signal, acquire the keyword from the preliminary recognition text, acquire target related information according to the keyword, and establish a target language library according to the related information. The target language library is used so that, in the recognition text acquired from the next received voice signal, the keyword, or keywords related to it, are recognized accurately, improving the accuracy of speech recognition.
附图说明DRAWINGS
为了更清楚地说明本发明实施例技术方案，下面将对实施例和相关技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，还可以根据这些附图获得其它的附图。To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments and the related art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can further be derived from them.
图1为本发明实施例中通用的语音识别系统的一个示意图;1 is a schematic diagram of a general speech recognition system according to an embodiment of the present invention;
图2为本发明实施例中所应用的语音识别系统的框架示意图;2 is a schematic diagram of a frame of a voice recognition system applied in an embodiment of the present invention;
图3为本发明实施例中语音识别的方法的一个实施例示意图;3 is a schematic diagram of an embodiment of a method for voice recognition according to an embodiment of the present invention;
图4为本发明实施例中语音识别的一个示意图;4 is a schematic diagram of voice recognition in an embodiment of the present invention;
图5为本发明实施例中计算机的一个实施例示意图;FIG. 5 is a schematic diagram of an embodiment of a computer according to an embodiment of the present invention; FIG.
图6为本发明实施例中计算机的另一个实施例示意图;FIG. 6 is a schematic diagram of another embodiment of a computer according to an embodiment of the present invention; FIG.
图7为本发明实施例中计算机的另一个实施例示意图。FIG. 7 is a schematic diagram of another embodiment of a computer in an embodiment of the present invention.
具体实施方式detailed description
本发明实施例提供了一种语音识别的方法以及计算机，用于在根据下次接收的语音信号获取的识别文本中，对于该主题词或者该主题词相关的主题词的识别就会显示的是准确的识别，提高了语音识别的准确率。An embodiment of the present invention provides a speech recognition method and a computer, so that in the recognition text acquired from the next received speech signal, the keyword, or keywords related to it, can be recognized accurately, improving the accuracy of speech recognition.
为了使本技术领域的人员更好地理解本发明方案，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行描述，显然，所描述的实施例仅仅是本发明一部分的实施例，而不是全部的实施例。基于本发明中的实施例，都应当属于本发明保护的范围。To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All embodiments based on the embodiments of the present invention shall fall within the protection scope of the present invention.
自然语言（Natural Language）其实就是人类语言，自然语言处理（Natural Language Processing，NLP）就是对人类语言的处理，当然主要是利用计算机。自然语言处理是关于计算机科学和语言学的交叉学科，常见的研究任务包括：分词（Word Segmentation或Word Breaker，WB）；信息抽取（Information Extraction，IE）；关系抽取（Relation Extraction，RE）；命名实体识别（Named Entity Recognition，NER）；词性标注（Part Of Speech Tagging，POS）；指代消解（Coreference Resolution）；句法分析（Parsing）；词义消歧（Word Sense Disambiguation，WSD）；语音识别（Speech Recognition）；语音合成（Text To Speech，TTS）；机器翻译（Machine Translation，MT）；自动文摘（Automatic Summarization）；问答系统（Question Answering）；自然语言理解（Natural Language Understanding）；光学字符识别（Optical Character Recognition，OCR）；信息检索（Information Retrieval，IR）。Natural language is human language, and natural language processing (NLP) is the processing of human language, mainly by computer. Natural language processing is an interdisciplinary field of computer science and linguistics. Common research tasks include: word segmentation (Word Segmentation or Word Breaker, WB); information extraction (Information Extraction, IE); relation extraction (Relation Extraction, RE); named entity recognition (Named Entity Recognition, NER); part-of-speech tagging (Part Of Speech Tagging, POS); coreference resolution (Coreference Resolution); parsing (Parsing); word sense disambiguation (Word Sense Disambiguation, WSD); speech recognition (Speech Recognition); speech synthesis (Text To Speech, TTS); machine translation (Machine Translation, MT); automatic summarization (Automatic Summarization); question answering (Question Answering); natural language understanding (Natural Language Understanding); optical character recognition (Optical Character Recognition, OCR); and information retrieval (Information Retrieval, IR).
简单地说，语言模型就是用来计算一个句子的概率的模型，即P(W1,W2......Wk)。利用语言模型，可以确定哪个词序列的可能性更大，或者给定若干个词，可以预测下一个最可能出现的词语。举个音字转换的例子来说，输入拼音串为nixianzaiganshenme，对应的输出可以有多种形式，如你现在干什么、你西安再赶什么、等等，那么到底哪个才是正确的转换结果呢，利用语言模型，我们知道前者的概率大于后者，因此转换成前者在多数情况下比较合理。再举一个机器翻译的例子，给定一个汉语句子为李明正在家里看电视，可以翻译为Li Ming is watching TV at home、Li Ming at home is watching TV、等等，同样根据语言模型，我们知道前者的概率大于后者，所以翻译成前者比较合理。Simply put, a language model is a model used to compute the probability of a sentence, i.e., P(W1, W2, ..., Wk). With a language model, we can determine which word sequence is more likely, or, given several words, predict the next most likely word. Take pinyin-to-character conversion as an example: for the input pinyin string nixianzaiganshenme, the corresponding output can take various forms, such as 你现在干什么 (what are you doing now), 你西安再赶什么 (what are you rushing to in Xi'an), and so on. Which is the correct conversion result? Using the language model, we know that the probability of the former is greater than that of the latter, so converting to the former is more reasonable in most cases. As another example, in machine translation, the Chinese sentence 李明正在家里看电视 can be translated as Li Ming is watching TV at home, Li Ming at home is watching TV, and so on; again according to the language model, the probability of the former is greater, so translating to the former is more reasonable.
如图1所示，为通用的语音识别系统的示意图，至少包括声学模型和语言模型两大部分，声学模型是对声学、语音学、环境的变量、说话人性别、口音等的差异的知识表示，而语言模型是对一组字序列构成的知识表示。通用的语言模型往往存在数据识别有偏的问题。比如在语音转写场景下，具体来说比如某个专业的学术演讲场景下，需要通过语音识别系统自动做会议记录。此时如果演讲中提到一些小众、专业的词汇（比如某种蛋白质的名字），通用的语音识别系统，由于其中的语言模型可能没有涉及到这方面的语料，进而往往不能正确识别。而以上这种小众、专业的词汇，长尾语料是不能够穷举的（或者说穷举的成本很高也没必要）。As shown in FIG. 1, a schematic diagram of a general speech recognition system includes at least two parts: an acoustic model and a language model. The acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like, while the language model is a knowledge representation of word sequences. A general language model often suffers from biased data recognition. For example, in a speech transcription scenario, such as a professional academic lecture, the speech recognition system is required to take meeting minutes automatically. If the lecture mentions some niche, professional vocabulary (such as the name of a certain protein), a general speech recognition system often cannot recognize it correctly, because its language model may not cover the corresponding corpus. Moreover, such niche, professional, long-tail corpora cannot be enumerated exhaustively (or the cost of doing so would be too high to be worthwhile).
如图2所示，为本发明实施例所应用的语音识别系统的框架示意图，包括语音识别输入、语音识别系统、初步识别文本、提取主题词、全网搜索top结果摘要、训练上下文和领域语言模型。本发明实施例要解决的就是在通用的语言模型系统中实时加入领域相关的长尾语料，以解决在语音转写场景下，当通用语音识别系统前几次不能识别的领域专业词汇，但随着转写（演讲）的推进，系统可以准确实时的自动挖掘补充相应领域的语言模型语料，后续当演讲者再次提到该词汇，甚至与该词汇相关的词汇时可有效识别。FIG. 2 is a schematic framework diagram of the speech recognition system applied in the embodiment of the present invention, including speech input, the speech recognition system, preliminary recognition text, topic word extraction, a summary of top results from a full-network search, context training, and a domain language model. The embodiment of the present invention adds domain-related long-tail corpora to a general language model system in real time, so that in a speech transcription scenario, domain-specific terms that the general speech recognition system fails to recognize the first few times can be handled: as the transcription (lecture) proceeds, the system automatically and accurately mines supplementary language model corpora for the corresponding domain in real time, so that when the speaker later mentions the term again, or even terms related to it, they can be recognized effectively.
下面以实施例的方式对本发明实施例的技术方案做进一步的描述,如图3所示,为本发明实施例中语音识别的方法的一个实施例示意图,包括:The technical solution of the embodiment of the present invention is further described in the following embodiments. As shown in FIG. 3, it is a schematic diagram of an embodiment of a voice recognition method according to an embodiment of the present invention, including:
301、接收语音信号;301. Receive a voice signal;
在本发明实施例中，计算机接收语音信号，示例性的，这里的语音信号可以是在会议场景中相关工作人员的声音，被计算机接收的语音信号；也可以是在学术报告、主题研究报告、专业知识讲座等一系列场景中，计算机所接收的一段语音信号。其中，声学模型可以用lstm+ctc训练，得到语音特征到音素的映射；语言模型可以用SRILM工具做LM（language model，即语言模型）的训练得到3-gram和4-gram，是词与词、词与句子的映射，字典是字词对应的音素index集合，是字词和音素之间的映射。In the embodiment of the present invention, the computer receives a voice signal. Illustratively, the voice signal here may be the voice of relevant staff in a conference scene, received by the computer; it may also be a segment of speech received by the computer in scenarios such as academic reports, topic research reports, or professional lectures. The acoustic model can be trained with LSTM+CTC to obtain the mapping from speech features to phonemes; the language model can be trained with the SRILM tool as an LM (language model) to obtain 3-gram and 4-gram models, which map words to words and words to sentences; the dictionary is the set of phoneme indices corresponding to words, i.e., the mapping between words and phonemes.
所谓声学模型就是把语音的声学特征分类对应到(解码)音素或字词这样的单元;语言模型接着把字词解码成一个完整的句子。The so-called acoustic model is to classify the acoustic features of speech into units of (decoding) phonemes or words; the language model then decodes the words into a complete sentence.
先说语言模型，语言模型表示某一字序列发生的概率，一般采用链式法则，把一个句子的概率拆解成其中的每个词的概率之积。设W是由w1,w2,...,wn组成的，则P(W)可以拆成（由条件概率公式和乘法公式）：Let's first discuss the language model. The language model represents the probability that a certain word sequence occurs; generally the chain rule is used to decompose the probability of a sentence into the product of the probabilities of each word in it. Let W be composed of w1, w2, ..., wn; then P(W) can be decomposed (by the conditional probability formula and the multiplication formula) as:
P(W)=P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1,w2,...,wn-1)，每一项都是在之前所有词的概率条件下，当前词的概率。由马尔可夫模型的思想，最常见的做法就是用N-元文法，即假定某一个字的输出只与前面N-1个字出现的概率有关系，这种语言模型叫做n-gram模型（一般n取3，即trigram），这时候我们就可以这么表示：P(W)=P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1,w2,...,wn-1); each factor is the probability of the current word conditioned on all the preceding words. Following the idea of the Markov model, the most common practice is to use an N-gram grammar, i.e., to assume that the output of a word depends only on the previous N-1 words; such a language model is called an n-gram model (usually n is 3, i.e., a trigram), in which case we can write:
P(W)=P(w1)P(w2|w1)P(w3|w1,w2)P(w4|w1,w2,w3)...P(wn|wn-1,wn-2,...,w1)，条件太长的时候，概率就不好估计了，三元文法只取前两个词：P(W)=P(w1)P(w2|w1)P(w3|w1,w2)P(w4|w1,w2,w3)...P(wn|wn-1,wn-2,...,w1); when the conditioning context is too long, the probability is hard to estimate, so the trigram grammar keeps only the two preceding words:
P(W)=P(w1)P(w2|w1)P(w3|w1,w2)P(w4|w2,w3)...P(wn|wn-1,wn-2).
对于其中的每一项条件概率都可以用贝叶斯公式求出,在所有的语料中统计出相邻的字发生的概率,再统计出单个字出现的概率,代入即可。For each of these conditional probabilities, the Bayesian formula can be used to calculate the probability of occurrence of adjacent words in all corpora, and then the probability of occurrence of a single word can be counted and substituted.
需要说明的是,这里的n-gram是根据字符串序列建立的,所以一个n-gram就相当于词组,必然会有一些词组没有出现过,但是也存在发生的概率,所以需要算法生成这些生僻词组的概率。It should be noted that the n-gram here is based on the sequence of strings, so an n-gram is equivalent to a phrase, there must be some phrases that have not appeared, but there is also a probability of occurrence, so the algorithm needs to generate these uncommon The probability of a phrase.
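To make the n-gram estimation above concrete, here is a minimal, hypothetical sketch (not part of the patent) of a trigram language model: conditional probabilities are estimated from trigram and bigram counts, and a sentence probability is the product of per-word conditional probabilities, as in the chain-rule formulas above.

```python
from collections import defaultdict

def train_trigram_lm(corpus):
    """Count trigrams and their bigram contexts, so that
    P(w3 | w1, w2) can be estimated as count(w1,w2,w3) / count(w1,w2)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            bi[(words[i - 2], words[i - 1])] += 1
    return tri, bi

def sentence_prob(sentence, tri, bi):
    """P(W) = product of P(w_i | w_{i-2}, w_{i-1}) under the trigram assumption.
    Unseen contexts give probability 0 here; a real system would smooth."""
    words = ["<s>", "<s>"] + sentence + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        ctx = (words[i - 2], words[i - 1])
        if bi[ctx] == 0:
            return 0.0
        p *= tri[ctx + (words[i],)] / bi[ctx]
    return p
```

For instance, with a toy two-sentence corpus containing 你现在干什么 and 你现在在哪, the model assigns probability 0.5 to the former, matching the intuition in the text that more frequent continuations score higher.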
再说声学模型，声学模型的任务是计算P(X|W)，即给定文字之后发出这段语音的概率（最后利用贝叶斯，求P(W|X)时使用）。首先第一问题：怎么才能知道每个单词发什么音呢？这就需要另外一个模块，叫做词典，看eesen的源码在数据准备阶段就是先求出词对应音素的词典，它的作用就是把单词串转化成音素串，然后再求得语言模型和训练声学模型（用lstm+ctc训练声学模型，lstm即long short-term memory，长短期记忆网络；ctc即connectionist temporal classification）。有了词典的帮助，声学模型就知道给定的文字串该依次发哪些音了。Turning to the acoustic model: its task is to compute P(X|W), i.e., the probability of producing this speech given the text (used later, via the Bayes formula, to obtain P(W|X)). The first question is: how do we know how each word is pronounced? This requires another module, called the dictionary. In the eesen source code, for example, the data preparation stage first builds the dictionary mapping words to phonemes; its function is to convert a word string into a phoneme string, after which the language model is obtained and the acoustic model is trained (with LSTM+CTC, where LSTM is a long short-term memory network and CTC is connectionist temporal classification). With the help of the dictionary, the acoustic model knows which sounds a given text string should produce in sequence.
302、根据语音信号获取对应的初步识别文本;302. Acquire a corresponding preliminary identification text according to the voice signal.
在本发明实施例中,计算机在接收语音信号之后,还可以根据语音信号获取对应的初步识别文本。即语音信号可以通过语音识别系统中的声学模型和通用语言模型,获取对应的初步识别文本。In the embodiment of the present invention, after receiving the voice signal, the computer may further acquire a corresponding preliminary identification text according to the voice signal. That is, the speech signal can obtain the corresponding preliminary identification text through the acoustic model and the general language model in the speech recognition system.
具体来说就是输入一段语音信号，要找到一个文字序列（由字或者词组成），找到的这个文字序列与语音信号的匹配程度最高。这个匹配程度，一般都是用概率来表示的，用X表示语音信号，用W表示文字序列，则要解的是下面这个问题：Specifically, a speech signal is input, and a text sequence (composed of characters or words) is to be found such that it matches the speech signal best. This degree of matching is generally expressed by probability: with X denoting the speech signal and W the text sequence, the problem to solve is the following:
W* = argmax_W P(W|X)
但是一般语音是由文字产生的,已知文字才能发出语音,所以对于上面的条件概率公式我们想要已知结果求该条件下发生概率,这时候自然而然就想到贝叶斯公式:However, the general speech is generated by words, and the known words can emit speech. Therefore, for the above conditional probability formula, we want to know the probability of occurrence under this condition. At this time, we naturally think of the Bayesian formula:
P(W|X) = P(X|W)P(W) / P(X)
由于我们要优化W，P(X)可以看作常数，可以省略分母。
Since we want to optimize W, P(X) can be regarded as a constant, and the denominator can be omitted.
由上边的步骤来看,求文字串、计算语言模型概率、求音素串、求音素分界点、计算声学模型概率几个步骤似乎是依次进行的。其实不然,在实际编码过程中,因为文字串、音素分界点都有非常多种可能,枚举是不现实的。实际中,这几个步骤同时进行并互相制约,随时砍掉不够优的可能,最终在可接受的时间内求出最优解,如下所示:From the above steps, the steps of finding the character string, calculating the language model probability, finding the phoneme string, finding the phoneme boundary point, and calculating the acoustic model probability seem to be sequential. In fact, in the actual coding process, because the text string and the phoneme boundary point have many kinds of possibilities, the enumeration is unrealistic. In practice, these steps are carried out at the same time and restrict each other, and the possibility of not being good enough is cut off at any time, and finally the optimal solution is obtained within an acceptable time, as follows:
W* = argmax_W P(W|X).
举个例子来说，对于音字转换问题，输入拼音nixianzaiganshenme，可能对应着很多转换结果，对于这个例子，可能的转换结果如下图4所示（只画出部分的词语节点），各节点之间构成了复杂的网络结构，从开始到结束的任意一条路径都是可能的转换结果，从诸多转换结果中选择最合适的结果的过程就需要解码算法。For example, for the pinyin-to-character conversion problem, the input pinyin nixianzaiganshenme may correspond to many conversion results. For this example, the possible conversion results are shown in FIG. 4 (only some of the word nodes are drawn); the nodes form a complex network structure, and any path from start to end is a possible conversion result. Selecting the most appropriate result from the many conversion results requires a decoding algorithm.
常用的解码算法是viterbi算法，它采用动态规划的原理能够很快地确定最合适的路径。A commonly used decoding algorithm is the Viterbi algorithm, which uses dynamic programming to quickly determine the most appropriate path.
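As a rough illustration of the lattice decoding described above, here is a minimal, hypothetical sketch (not the patent's implementation) of Viterbi decoding over a word lattice: for each candidate word at each position, only the best-scoring path ending in that word is kept, which is the dynamic-programming step. The candidate words and bigram probabilities used in the example are toy assumptions.

```python
import math

def viterbi(steps, trans_prob, start_prob):
    """Decode the best path through a word lattice.

    steps: list of candidate-word lists, one list per position.
    start_prob: dict word -> probability at the first position.
    trans_prob: dict (prev, cur) -> bigram probability P(cur | prev).
    Unseen entries fall back to a small floor probability."""
    FLOOR = 1e-8
    # best[w] = (log-probability of the best path ending in w, that path)
    best = {w: (math.log(start_prob.get(w, FLOOR)), [w]) for w in steps[0]}
    for candidates in steps[1:]:
        new_best = {}
        for cur in candidates:
            # keep only the highest-scoring predecessor for each candidate
            new_best[cur] = max(
                (score + math.log(trans_prob.get((prev, cur), FLOOR)), path + [cur])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]  # path ending in the overall best-scoring word
```

With toy probabilities in which 你→现在 and 现在→干什么 dominate 你→西安 and 西安→再赶, the decoder returns the path 你, 现在, 干什么, matching the intended reading of the example lattice.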
303、获取初步识别文本中的主题词,主题词为初步识别文本中关键信息的词,初步识别文本为根据语音信号识别得到的文本;303. Obtain a keyword in the preliminary identification text, where the keyword is a word that initially identifies key information in the text, and the preliminary identification text is a text that is identified according to the voice signal;
在本发明实施例中,计算机根据语音信号获取对应的初步识别文本之后,可以获取初步识别文本中的主题词,主题词为初步识别文本中关键信息的词。其中,主题词可以理解为该会议讨论的核心主题,也可以是会议报告的重心等。In the embodiment of the present invention, after the computer obtains the corresponding preliminary identification text according to the voice signal, the keyword in the preliminary identification text may be obtained, and the keyword is a word that initially identifies the key information in the text. Among them, the subject words can be understood as the core theme of the meeting discussion, or the focus of the meeting report.
获取初步识别文本中的主题词可以包括:根据初步识别文本按照公式1获取主题词,其中,公式1为:Obtaining the keyword in the preliminary identification text may include: obtaining the keyword according to formula 1 according to the preliminary identification text, wherein formula 1 is:
Score(i)=tf(i)*idf(i)   (公式1)Score(i)=tf(i)*idf(i) (Equation 1)
其中，i指初步识别文本中第i个词，tf(i)指第i个词在初步识别文本中出现的次数，idf(i)指第i个词在初步识别文本中的逆文档频率。Here, i refers to the i-th word in the preliminary recognition text, tf(i) refers to the number of times the i-th word appears in the preliminary recognition text, and idf(i) refers to the inverse document frequency of the i-th word.
进一步的,idf(i)为通过大量文本数据离线统计而得,计算idf(i)的公式2为:Further, idf(i) is obtained by offline statistics of a large amount of text data, and formula 2 for calculating idf(i) is:
idf(i) = log( |D| / |{ j : t_i ∈ d_j }| )    (公式2)
其中，|D|为文档集里的文档个数，d_j为第j个文档，t_i为第j个文档中的第i个词。Here, |D| is the number of documents in the document set, d_j is the j-th document, and t_i is the i-th word in the j-th document.
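Formulas 1 and 2 above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the +1 in the idf denominator is an added smoothing assumption to avoid division by zero for words absent from the offline document set, and the document set itself is hypothetical.

```python
import math
from collections import Counter

def idf(word, documents):
    """公式2 / Formula 2: idf = log(|D| / number of documents containing the word).
    The +1 in the denominator is an added smoothing assumption, so a word
    appearing in no document does not cause division by zero."""
    containing = sum(1 for d in documents if word in d)
    return math.log(len(documents) / (1 + containing))

def topic_words(text_words, documents, top_n=2):
    """公式1 / Formula 1: Score(i) = tf(i) * idf(i); return the top-n words."""
    tf = Counter(text_words)
    scores = {w: tf[w] * idf(w, documents) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A word that is frequent in the preliminary recognition text but rare in the offline document collection, such as a domain term, thus scores highest and is selected as the topic word.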
对于主题词的提取还可以基于TextRank算法,即关键词抽取的任务就是从一段给定的文本中自动抽取出若干有意义的词语或词组。TextRank算法是利用局部词汇之间关系(共现窗口)对后续关键词进行排序,直接从文本本身抽取。其主要步骤如下:The extraction of the topic words can also be based on the TextRank algorithm, that is, the task of keyword extraction is to automatically extract a number of meaningful words or phrases from a given text. The TextRank algorithm uses the relationship between local vocabularies (co-occurrence window) to sort subsequent keywords and extract them directly from the text itself. The main steps are as follows:
(1)把给定的文本T按照完整句子进行分割，即T=[S_1,S_2,...,S_m]；(1) Divide the given text T into complete sentences, i.e., T = [S_1, S_2, ..., S_m];
(2)对于每个句子S_i∈T，进行分词和词性标注处理，并过滤掉停用词，只保留指定词性的单词，如名词、动词、形容词，即S_i=[t_{i,1},t_{i,2},...,t_{i,n}]，其中t_{i,j}∈S_i是保留后的候选关键词；(2) For each sentence S_i ∈ T, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words with the specified parts of speech, such as nouns, verbs, and adjectives, i.e., S_i = [t_{i,1}, t_{i,2}, ..., t_{i,n}], where t_{i,j} ∈ S_i is a retained candidate keyword;
(3)构建候选关键词图G=(V,E)，其中V为节点集，由步骤(2)生成的候选关键词组成，然后采用共现关系（co-occurrence）构造任两点之间的边：两个节点之间存在边仅当它们对应的词汇在长度为K的窗口中共现，K表示窗口大小，即最多共现K个单词；(3) Construct the candidate keyword graph G=(V,E), where the node set V consists of the candidate keywords generated in step (2); then use co-occurrence relations to construct the edges: an edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e., at most K words co-occur;
(4)根据上面公式,迭代传播各节点的权重,直至收敛;(4) Iteratively propagate the weights of each node according to the above formula until convergence;
(5)对节点权重进行倒序排序,从而得到最重要的T个单词,作为候选关键词;(5) Sorting the node weights in reverse order to obtain the most important T words as candidate keywords;
(6)由步骤(5)得到最重要的T个单词,在原始文本中进行标记,若形成相邻词组,则组合成多词关键词。例如,文本中有句子“Matlab code for plotting ambiguity function”,如果“Matlab”和“code”均属于候选关键词,则组合成“Matlab code”加入关键词序列。(6) The most important T words are obtained from step (5), marked in the original text, and if adjacent phrases are formed, combined into multi-word keywords. For example, there is a sentence in the text "Matlab code for plotting ambiguity function". If both "Matlab" and "code" belong to candidate keywords, then the combination of "Matlab code" is added to the keyword sequence.
其中，对于TextRank源码解析如下所示：读入文本并切词，对切词结果统计共现关系，窗口默认为5，保存在cm中。The TextRank source code works as follows: read in the text and segment it into words, count co-occurrence relations over the segmentation result with a default window of 5, and save them in cm.
Figure PCTCN2018077413-appb-000004（TextRank源码片段 / TextRank source code snippet）
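Since the referenced source snippet survives only as an image, here is a minimal, hypothetical sketch of steps (3)-(5) of the TextRank procedure above: build a co-occurrence graph over the candidate words with window size K, iterate the PageRank-style weight update with damping factor d = 0.85 until approximately converged, and return the highest-weighted words. The variable names are illustrative, not taken from the original snippet.

```python
from collections import defaultdict

def textrank_keywords(words, window=5, d=0.85, iterations=30, top_n=2):
    """Step (3): co-occurrence graph within a window of length K=window.
    Step (4): PageRank-style weight iteration with damping factor d.
    Step (5): return the top-n highest-weighted words."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:                 # no self-loops
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):               # iterate toward convergence
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v])
                                      for v in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)[:top_n]
```

Words that co-occur with many other candidate words accumulate the highest weight, which is why well-connected terms like "Matlab" and "code" in the example sentence would surface as keywords.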
需要说明的是,对于主题词的提取包括但不限于上述提及的几种实现方式,而且,计算机获取的主题词的数量不做限定。It should be noted that the extraction of the keyword includes, but is not limited to, the several implementations mentioned above, and the number of keywords obtained by the computer is not limited.
在语音识别系统中，经常会遇到这样的需求：将大量（比如几十万、甚至上百万）的对象进行排序，然后只需要取出最Top的前N名作为排行榜的数据，这即是一个TopN算法。常见的解决方案有三种：In a speech recognition system, the following requirement often arises: sort a large number (for example, hundreds of thousands or even millions) of objects and then take only the top N as the ranking data; this is a top-N algorithm. There are three common solutions:
(1)直接使用List的Sort方法进行处理。(1) Directly use the Sort method of the List for processing.
(2)使用排序二叉树进行排序,然后取出前N名。(2) Sort using the sort binary tree, and then take the top N names.
(3)使用最大堆排序,然后取出前N名。(3) Use the maximum heap sort, then remove the top N.
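Solution (3), the heap-based approach, can be sketched as follows; this is an illustrative sketch rather than the patent's implementation. It scans the objects once and keeps only the current best N in a min-heap, which is why it scales to hundreds of thousands of objects: the cost is O(M log N) rather than O(M log M) for a full sort.

```python
import heapq

def top_n(items, n, key=lambda x: x):
    """Keep a size-n min-heap while scanning: the heap root is always the
    smallest of the current top n, so any new item beating it replaces it."""
    heap = []
    for item in items:
        entry = (key(item), item)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)   # evict the current minimum
    return [item for _, item in sorted(heap, reverse=True)]
```

The same routine can rank candidate topic words by passing their Score(i) values through the key function.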
304、根据主题词获取目标相关信息,目标相关信息为与主题词对应的上下文信息;304. Obtain target related information according to the keyword, and the target related information is context information corresponding to the keyword;
在本发明实施例中,计算机获取初步识别文本中的主题词之后,可以根据主题词获取目标相关信息,目标相关信息为与主题词对应的上下文信息。In the embodiment of the present invention, after the computer obtains the keyword in the preliminary identification text, the target related information may be acquired according to the keyword, and the target related information is context information corresponding to the keyword.
根据主题词获取目标相关信息,可以包括:Obtaining target related information based on the keyword can include:
(1)根据主题词通过全网搜索获取目标相关信息。(1) Obtain target related information through the whole network search according to the keyword.
(2)在预置的相关信息集合中,提取与主题词对应的目标相关信息。(2) Extracting target related information corresponding to the keyword in the preset related information set.
进一步的,根据主题词通过全网搜索获取目标相关信息,可以包括:根据主题词通过全网搜索,获取对应的搜索结果;将搜索结果进行匹配,确定目标相关信息。Further, obtaining the target related information through the whole network search according to the keyword may include: searching for the corresponding search result according to the keyword through the whole network; matching the search result to determine the target related information.
需要说明的是,这里的目标相关信息简单的可以理解为搜索主题词显示的页面上的每篇文章的题目、或者每篇文章的摘要、或者每篇文章的所有内容。但是,应理解,若目标相关信息为每篇文章的所有内容的话,消耗的资源比较大。It should be noted that the target related information herein can be simply understood as the title of each article on the page displayed by the search keyword, or the abstract of each article, or all the contents of each article. However, it should be understood that if the target-related information is all the content of each article, the resources consumed are relatively large.
示例性的,若获取的主题词为滤波器,这里显示的滤波器可以是正确的词组,也可以不是正确的词组,计算机可以自动通过一些搜索软件获取与滤波器相关的页面内容,例如显示的是高通滤波器、低通滤波器、带通滤波器和带阻滤波器的超链接内容,可以把这些超链接的标题或者每个超链接内容中的摘要作为主题词"滤波器"的目标相关信息。Exemplarily, if the obtained keyword is "filter" (which, as displayed, may or may not be a correct phrase), the computer can automatically obtain filter-related page content through some search software, for example hyperlinked content about high-pass filters, low-pass filters, band-pass filters and band-stop filters; the titles of these hyperlinks, or the abstract of each hyperlinked page, can then be used as the target related information for the keyword "filter".
305、根据目标相关信息建立目标语言库。305. Establish a target language library according to the target related information.
在本发明实施例中,计算机根据主题词获取目标相关信息之后,再根据目标相关信息建立目标语言库。具体的,可以包括:根据目标相关信息进行训练,建立目标语言库。应理解,这里的目标语言库是建立的关于本次会议的主题或者本次报告核心的领域语言模型。即可以对目标相关信息进行过滤清洗、领域匹配等一系列操作,进行训练等得到领域语言模型。示例性的,可以根据高通滤波器、低通滤波器、带通滤波器和带阻滤波器超链接内容中的摘要信息,进行训练,得到关于滤波器领域的语言模型,并将关于滤波器领域的语言模型添加在上述图2所示的通用语言模型中。In the embodiment of the present invention, after the computer acquires the target related information according to the keyword, it establishes the target language library according to the target related information. Specifically, this may include: training on the target related information to build the target language library. It should be understood that the target language library here is a domain language model built around the topic of the current conference or the core subject of the current report. That is, a series of operations such as filtering, cleaning and domain matching can be performed on the target related information, which is then used for training to obtain the domain language model. Exemplarily, training can be performed on the summary information in the hyperlinked content about high-pass, low-pass, band-pass and band-stop filters to obtain a language model for the filter domain, and this filter-domain language model is added to the general language model shown in Figure 2 above.
那么,在后续的语音识别中,再出现关于滤波器的相关信息,都会先在语音识别系统中进行识别,因为语音识别系统中之前有添加关于滤波器领域的语言模型,所以,计算机可以准确的识别,具体可以识别出是否是高通滤波器、低通滤波器、带通滤波器或者带阻滤波器。Then, in subsequent speech recognition, whenever information about filters appears again, it is first recognized in the speech recognition system; because a language model for the filter domain has already been added to the speech recognition system, the computer can recognize it accurately, specifically identifying whether it is a high-pass filter, a low-pass filter, a band-pass filter or a band-stop filter.
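One common way to combine the filter-domain model with the general model of Figure 2 is linear interpolation of the two word probabilities; the embodiment does not specify how the models are combined, so the interpolation weight and probabilities below are illustrative assumptions:

```python
def interpolate(p_general, p_domain, lam=0.3):
    """Linearly interpolate a general LM probability with a domain LM probability.

    lam is the weight on the domain model; 0.3 is an arbitrary choice here,
    in practice it would be tuned on held-out data.
    """
    return lam * p_domain + (1 - lam) * p_general

# the domain model boosts a continuation the general model finds unlikely,
# e.g. P("滤波器" | "带阻") after the filter-domain model has been added
print(round(interpolate(p_general=0.001, p_domain=0.200), 4))  # → 0.0607
```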
本发明实施例中可以使用Ngram统计语言模型,n-gram模型也称为n-1阶马尔科夫模型,它有一个有限历史假设:当前词的出现概率仅仅与前面n-1个词相关。因此P(S)可以近似为:The n-gram statistical language model can be used in the embodiment of the present invention. The n-gram model is also called an (n-1)-order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Therefore P(S) can be approximated as:
P(S) ≈ ∏_{i=1}^{m} P(W_i | W_{i-n+1}, ..., W_{i-1})
当n取1、2、3时,n-gram模型分别称为unigram、bigram和trigram语言模型。n-gram模型的参数就是条件概率P(W_i|W_{i-n+1},...,W_{i-1})。假设词表的大小为100000,那么n-gram模型的参数数量为100000的n次方。n越大,模型越准确,也越复杂,需要的计算量越大。本发明实施例中以选用的n为3为例来进行说明,即trigram语言模型。更详细一点,ngram语言模型也就是上述P(S)模型一般通过最大似然估计进行参数估计,各类模型算法的不同之处往往在于使用何种数据平滑算法来解决当n增大后的数据稀疏问题(即要解决上述概率公式展开后由于某项在语料中统计频次趋近于0,而带来的整个P(S)趋0的问题)。本发明实施例可以使用Katz平滑算法,相应的业界还存在加法平滑、Good-Turing平滑,插值平滑等不同算法。When n is 1, 2 or 3, the n-gram model is called a unigram, bigram or trigram language model, respectively. The parameters of the n-gram model are the conditional probabilities P(W_i | W_{i-n+1}, ..., W_{i-1}). Assuming a vocabulary size of 100,000, the n-gram model has 100,000^n parameters. The larger n is, the more accurate but also the more complex the model, and the greater the amount of computation required. The embodiment of the present invention takes n=3 as an example, i.e. the trigram language model. In more detail, the n-gram language model, i.e. the P(S) model above, generally estimates its parameters by maximum likelihood estimation; different modeling algorithms usually differ in which data smoothing algorithm they use to handle the data sparsity problem as n grows (i.e. the problem that, after the probability formula above is expanded, the whole P(S) tends to 0 because the corpus count of some term approaches 0). The embodiment of the present invention may use the Katz smoothing algorithm; other algorithms such as additive smoothing, Good-Turing smoothing and interpolation smoothing also exist in the industry.
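A minimal trigram model in this spirit can be sketched as follows; for brevity it uses add-one (Laplace) smoothing as a simplified stand-in for the Katz smoothing mentioned above, and the toy corpus is hypothetical:

```python
from collections import defaultdict

def train_trigram(corpus_sentences):
    """Count trigrams and their bigram contexts for a trigram model.

    Each sentence is padded with <s> markers so the first real word has a
    full two-word history, matching P(W_i | W_{i-2}, W_{i-1}).
    """
    tri = defaultdict(int)
    bi = defaultdict(int)
    vocab = set()
    for words in corpus_sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi[(padded[i - 2], padded[i - 1])] += 1
    return tri, bi, vocab

def prob(tri, bi, vocab, w1, w2, w3):
    """Add-one smoothed P(w3 | w1, w2); Katz back-off is more involved,
    this simpler estimator stands in for it."""
    return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + len(vocab))

corpus = [["低通", "滤波器"], ["高通", "滤波器"], ["带通", "滤波器"]]
tri, bi, vocab = train_trigram(corpus)
# a continuation seen in training gets a higher probability than an unseen one
assert prob(tri, bi, vocab, "<s>", "低通", "滤波器") > prob(tri, bi, vocab, "<s>", "低通", "高通")
```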
在本发明实施例中,获取初步识别文本中的主题词,该主题词为该初步识别文本中关键信息的词,该初步识别文本为根据语音信号识别得到的文本;根据该主题词获取目标相关信息,该目标相关信息为与该主题词对应的上下文信息;根据该目标相关信息建立目标语言库。用户在使用计算机的过程中,计算机可以接收语音信号,根据语音信号获取对应的初步识别文本,再根据初步识别文本获取主题词,然后根据该主题词获取目标相关信息,可以根据相关信息建立目标语言库,目标语言库用于在根据下次接收的语音信号获取的识别文本中,对于该主题词或者该主题词相关的主题词的识别就会显示的是准确的识别,提高了语音识别的准确率。In the embodiment of the present invention, the keyword in the preliminary identification text is acquired, where the keyword is a word of key information in the preliminary identification text and the preliminary identification text is text recognized from a speech signal; target related information is acquired according to the keyword, where the target related information is context information corresponding to the keyword; and the target language library is established according to the target related information. While the user is using the computer, the computer can receive a speech signal, obtain the corresponding preliminary identification text from it, extract the keyword from that text, obtain target related information according to the keyword, and build the target language library from that information. The target language library is then used, in the recognized text obtained from the next received speech signal, to recognize the keyword, or keywords related to it, accurately, which improves the accuracy of speech recognition.
下面以实际应用场景对本发明实施例中语音识别的方法进行具体说明,如下所示:The method for voice recognition in the embodiment of the present invention is specifically described below in the actual application scenario, as follows:
假设秋香是一位播音主持,在节目中需要读取一篇文章,这篇长文为《中国的蜜蜂养殖》,我们准备用计算机提取它的关键词。一个容易想到的思路,就是找到出现次数最多的词。如果某个词很重要,它应该在这篇文章中多次出现。于是,我们进行"词频"(Term Frequency,缩写为TF)统计。Suppose Qiu Xiang is a broadcast host who needs to read an article on her program, a long piece titled "Bee Farming in China", and we want a computer to extract its keywords. An easy idea is to find the words that occur most often: if a word is important, it should appear many times in the article. So we compute "term frequency" (TF) statistics.
结果大家肯定猜到了,出现次数最多的词是:"的"、"是"、"在"等这一类最常用的词。它们叫做"停用词"(stop words),表示对找到结果毫无帮助、必须过滤掉的词。As you have surely guessed, the most frequent words turn out to be the most common function words such as "的", "是" and "在". These are called "stop words": words that are of no help in finding the result and must be filtered out.
假设我们把它们都过滤掉了,只考虑剩下的有实际意义的词。这样又会遇到了另一个问题,我们可能发现"中国"、"蜜蜂"、"养殖"这三个词的出现次数一样多。这是不是意味着,作为关键词,它们的重要性是一样的?Suppose we filter them all out and only consider the remaining meaningful words. We then run into another problem: we may find that the three words "China", "bee" and "farming" appear equally often. Does this mean that, as keywords, they are equally important?
显然不是这样。因为"中国"是很常见的词,相对而言,"蜜蜂"和"养殖"不那么常见。如果这三个词在一篇文章的出现次数一样多,有理由认为,"蜜蜂"和"养殖"的重要程度要大于"中国",也就是说,在关键词排序上面,"蜜蜂"和"养殖"应该排在"中国"的前面。Obviously not. "China" is a very common word, while "bee" and "farming" are comparatively rare. If these three words appear the same number of times in an article, there is reason to believe that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".
所以,我们需要一个重要性调整系数,衡量一个词是不是常见词。如果某个词比较少见,但是它在这篇文章中多次出现,那么它很可能就反映了这篇文章的特性,正是我们所需要的关键词。Therefore, we need an importance adjustment factor to measure whether a word is a common word. If a word is rare, but it appears multiple times in this article, then it is likely to reflect the characteristics of this article, which is exactly what we need.
用统计学语言表达,就是在词频的基础上,要对每个词分配一个"重要性"权重。最常见的词("的"、"是"、"在")给予最小的权重,较常见的词("中国")给予较小的权重,较少见的词("蜜蜂"、"养殖")给予较大的权重。这个权重叫做"逆文档频率"(Inverse Document Frequency,缩写为IDF),它的大小与一个词的常见程度成反比。Expressed in statistical terms: on top of term frequency, each word is assigned an "importance" weight. The most common words ("的", "是", "在") get the smallest weight, fairly common words ("China") get a smaller weight, and rarer words ("bee", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and its size is inversely proportional to how common a word is.
知道了“词频”(TF)和“逆文档频率”(IDF)以后,将这两个值相乘,就得到了一个词的TF-IDF值。某个词对文章的重要性越高,它的TF-IDF值就越大。所以,排在最前面的几个词,就是这篇文章的关键词。After knowing the word frequency (TF) and the "inverse document frequency" (IDF), multiplying these two values yields the TF-IDF value of a word. The higher the importance of a word to an article, the greater its TF-IDF value. Therefore, the first few words are the key words of this article.
第一步,计算词频;The first step is to calculate the word frequency;
词频(TF)=某个词在文章中的出现次数Word Frequency (TF) = number of occurrences of a word in an article
考虑到文章有长短之分,为了便于不同文章的比较,进行"词频"标准化。Considering that articles vary in length, the term frequency is normalized to make different articles comparable.
词频(TF)=某个词在文章中的出现次数/文章的总词数Word Frequency (TF) = number of occurrences of a word in an article / total number of words in an article
或者,or,
词频(TF)=某个词在文章中的出现次数/该文出现次数最多的词的出现次数Word Frequency (TF) = number of occurrences of a word in an article / number of occurrences of the word with the most occurrences of the article
第二步,计算逆文档频率;The second step is to calculate the inverse document frequency;
这时,需要一个语料库(corpus),用来模拟语言的使用环境。At this time, a corpus is needed to simulate the language usage environment.
逆文档频率(IDF)=log(语料库的文档总数/(包含该词的文档数+1))Inverse Document Frequency (IDF) = log (total number of documents in the corpus / (number of documents containing the word + 1))
如果一个词越常见,那么分母就越大,逆文档频率就越小越接近0。分母之所以要加1,是为了避免分母为0(即所有文档都不包含该词)。log表示对得到的值取对数。The more common a word is, the larger the denominator and the smaller the inverse document frequency, approaching 0. The denominator has 1 added to it to avoid a zero denominator (i.e. the case where no document contains the word). log denotes taking the logarithm of the resulting value.
第三步,计算TF-IDF。The third step is to calculate the TF-IDF.
TF-IDF=词频(TF)*逆文档频率(IDF)TF-IDF=Word Frequency (TF)* Inverse Document Frequency (IDF)
可以看到,TF-IDF与一个词在文档中的出现次数成正比,与该词在整个语言中的出现次数成反比。所以,自动提取关键词的算法就很清楚了,就是计算出文档的每个词的TF-IDF值,然后按降序排列,取排在最前面的几个词。It can be seen that TF-IDF is proportional to the number of occurrences of a word in the document and inversely proportional to the number of occurrences of the word in the entire language. Therefore, the algorithm for automatically extracting keywords is very clear, that is, the TF-IDF value of each word of the document is calculated, and then arranged in descending order, taking the top words.
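The three steps above can be sketched end to end as follows; the stop-word handling and the toy background corpus are illustrative assumptions, not data from the embodiment:

```python
import math
from collections import Counter

def extract_keywords(doc_words, corpus_docs, stop_words=frozenset(), top_n=3):
    """Rank the words of one document by TF-IDF against a background corpus.

    TF  = count(word in doc) / total words in doc     (first normalization above)
    IDF = log10(N_docs / (docs containing word + 1))  (formula from step two)
    """
    words = [w for w in doc_words if w not in stop_words]
    total = len(words)
    tf = {w: c / total for w, c in Counter(words).items()}
    n_docs = len(corpus_docs)
    scores = {}
    for w in tf:
        df = sum(1 for d in corpus_docs if w in d)  # document frequency
        idf = math.log10(n_docs / (df + 1))
        scores[w] = tf[w] * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# toy corpus: "中国" appears in most documents, "蜜蜂" and "养殖" in few
corpus = [{"中国"}, {"中国"}, {"中国"}, {"中国"},
          {"中国", "蜜蜂"}, {"中国", "养殖"}, {"中国", "养殖"}, {"新闻"}]
print(extract_keywords(["中国", "蜜蜂", "养殖"], corpus, top_n=2))  # → ['蜜蜂', '养殖']
```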
还是以《中国的蜜蜂养殖》为例,假定该文长度为1000个词,"中国"、"蜜蜂"、"养殖"各出现20次,则这三个词的"词频"(TF)都为0.02。然后,搜索Google发现,包含"的"字的网页共有250亿张,假定这就是中文网页总数。包含"中国"的网页共有62.3亿张,包含"蜜蜂"的网页为0.484亿张,包含"养殖"的网页为0.973亿张。则它们的逆文档频率(IDF)和TF-IDF如下表1所示:Take "Bee Farming in China" again as an example. Suppose the article is 1,000 words long and "China", "bee" and "farming" each appear 20 times; the term frequency (TF) of each of the three words is then 0.02. A Google search then finds 25 billion web pages containing the character "的"; assume this is the total number of Chinese web pages. 6.23 billion pages contain "China", 48.4 million pages contain "bee", and 97.3 million pages contain "farming". Their inverse document frequencies (IDF) and TF-IDF values are shown in Table 1 below:
中国 (China): TF 0.02, IDF 0.603, TF-IDF 0.0121
蜜蜂 (bee): TF 0.02, IDF 1.713, TF-IDF 0.0343
养殖 (farming): TF 0.02, IDF 1.410, TF-IDF 0.0282
(表中数值按以10为底的对数计算。The values assume a base-10 logarithm.)
表1Table 1
从上述表1可见,"蜜蜂"的TF-IDF值最高,"养殖"其次,"中国"最低。(如果还计算"的"字的TF-IDF,那将是一个极其接近0的值。)所以,如果只选择一个词,"蜜蜂"就是这篇文章的关键词。As Table 1 above shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If the TF-IDF of "的" were also calculated, it would be a value extremely close to 0.) So if only one word is to be chosen, "bee" is the keyword of this article.
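The figures in Table 1 follow directly from the formulas above when the logarithm is taken as base 10; a few lines of Python reproduce them:

```python
import math

total_words = 1000
occurrences = 20
tf = occurrences / total_words  # 0.02 for each of the three words

total_pages = 25e9  # assumed total number of Chinese web pages
pages_containing = {"中国": 6.23e9, "蜜蜂": 0.484e9, "养殖": 0.973e9}

for word, df in pages_containing.items():
    idf = math.log10(total_pages / (df + 1))  # +1 as in the IDF formula above
    print(word, round(idf, 3), round(tf * idf, 4))
# → 中国 0.603 0.0121
# → 蜜蜂 1.713 0.0343
# → 养殖 1.41 0.0282
```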
除了自动提取关键词,TF-IDF算法还可以用于许多别的地方。比如,信息检索时,对于每个文档,都可以分别计算一组搜索词("中国"、"蜜蜂"、"养殖")的TF-IDF,将它们相加,就可以得到整个文档的TF-IDF。这个值最高的文档就是与搜索词最相关的文档。Besides automatic keyword extraction, the TF-IDF algorithm can be used in many other places. For example, in information retrieval, the TF-IDF of each search term ("China", "bee", "farming") can be computed for every document and summed to give the TF-IDF of the whole document for that query. The document with the highest value is the one most relevant to the search terms.
所以,这里的"蜜蜂"和"养殖"作为主题词进行搜索,获取关于"蜜蜂"和"养殖"的上下文信息,将搜索到的这些上下文信息训练得到蜜蜂养殖领域的语言模型。Therefore, "bee" and "farming" here are used as keywords for searching, context information about "bee" and "farming" is obtained, and the retrieved context information is used for training to obtain a language model for the bee-farming domain.
等到后续的文章中再出现关于蜜蜂养殖的过程中,出现关于蜜蜂养殖的相关的语音识别时,就可以通过蜜蜂养殖领域的语言模型准确的进行识别了。Then, when speech about bee farming appears in subsequent material, it can be recognized accurately through the bee-farming domain language model.
上面对本发明实施例中语音识别的方法进行了描述,下面对本发明实施例中的计算机进行说明,如图5所示,为本发明实施例中计算机的一个实施例示意图,包括:The method for the voice recognition in the embodiment of the present invention is described above. The following describes the computer in the embodiment of the present invention. As shown in FIG. 5, it is a schematic diagram of an embodiment of a computer in the embodiment of the present invention, including:
第一获取模块501,用于获取初步识别文本中的主题词,主题词为初步识别文本中关键信息的词,初步识别文本为根据语音信号识别得到的文本;The first obtaining module 501 is configured to acquire a keyword in the preliminary identification text, where the keyword is a word of key information in the preliminary identification text and the preliminary identification text is text recognized from a speech signal;
第二获取模块502,用于根据主题词获取目标相关信息,目标相关信息为与主题词对应的上下文信息;The second obtaining module 502 is configured to acquire target related information according to the keyword, where the target related information is context information corresponding to the keyword;
建立模块503,用于根据目标相关信息建立目标语言库。The establishing module 503 is configured to establish a target language library according to the target related information.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
第一获取模块501,具体用于根据初步识别文本按照公式1获取主题词,其中,公式1为:The first obtaining module 501 is specifically configured to obtain the keyword according to the formula 1 according to the preliminary identification text, where the formula 1 is:
Score(i)=tf(i)*idf(i),其中,i指初步识别文本中第i个词,tf(i)指第i个词在初步识别文本中出现的次数,idf(i)指第i个词在初步识别文本中的逆文档频率。Score(i)=tf(i)*idf(i), where i refers to the i-th word in the preliminary identification text, tf(i) refers to the number of times the i-th word appears in the preliminary identification text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary identification text.
可选的,在本发明的一些实施例中,在上述图5所示的基础上,如图6所示,为本发明实施例中计算机的另一个实施例示意图,计算机还包括:Optionally, in some embodiments of the present invention, based on the foregoing FIG. 5, as shown in FIG. 6, which is a schematic diagram of another embodiment of a computer in the embodiment of the present invention, the computer further includes:
接收模块504,用于接收语音信号;The receiving module 504 is configured to receive a voice signal;
第三获取模块505,用于根据语音信号获取对应的初步识别文本。The third obtaining module 505 is configured to obtain a corresponding preliminary identification text according to the voice signal.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
第二获取模块502,具体用于根据主题词通过全网搜索获取目标相关信息。The second obtaining module 502 is specifically configured to obtain target related information by searching through the entire network according to the keyword.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
第二获取模块502,具体还用于根据主题词通过全网搜索,获取对应的搜索结果;将搜索结果进行匹配,确定目标相关信息。The second obtaining module 502 is further configured to: obtain a corresponding search result by searching through the entire network according to the keyword, and match the search result to determine the target related information.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
第二获取模块502,具体还用于在预置的相关信息集合中,提取与主题词对应的目标相关信息。The second obtaining module 502 is further configured to extract target related information corresponding to the keyword in the preset related information set.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
建立模块503,具体用于根据目标相关信息进行训练,建立目标语言库。The establishing module 503 is specifically configured to perform training according to the target related information to establish a target language library.
本发明实施例还提供一种存储介质,所述存储介质中存储有计算机程序,其中,所述计算机程序被设置为运行时执行上述的方法。The embodiment of the invention further provides a storage medium, wherein the storage medium stores a computer program, wherein the computer program is set to execute the above method when it is running.
本发明实施例还提供一种电子装置,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器被设置为通过所述计算机程序执行上述的方法。该电子装置可以是图7所示的计算机,处理器可以是图7所示的中央处理器。Embodiments of the present invention also provide an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the above method by the computer program. The electronic device may be the computer shown in FIG. 7, and the processor may be the central processor shown in FIG.
如图7所示,为本发明中计算机的另一个实施例示意图。FIG. 7 is a schematic diagram of another embodiment of a computer in the present invention.
计算机700可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)722(例如,一个或一个以上处理器)和存储器732,一个或一个以上存储应用程序742或数据744的存储介质730(例如一个或一个以上海量存储设备)。其中,存储器732和存储介质730可以是短暂存储或持久存储。存储在存储介质730的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对计算机中的一系列指令操作。更进一步地,中央处理器722可以设置为与存储介质730通信,在计算机700上执行存储介质730中的一系列指令操作。Computer 700 may vary considerably by configuration or performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), memory 732, and one or more storage media 730 storing application programs 742 or data 744 (e.g., one or more mass storage devices). The memory 732 and the storage medium 730 may be transitory or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), and each module may include a series of instruction operations on the computer. Still further, the central processor 722 may be configured to communicate with the storage medium 730 and to execute, on the computer 700, the series of instruction operations in the storage medium 730.
计算机700还可以包括一个或一个以上电源726,一个或一个以上有线或无线网络接口750,一个或一个以上输入输出接口758,和/或,一个或一个以上操作系统741,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。Computer 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, and/or one or more operating systems 741, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so on.
在本发明实施例中,中央处理器722还用于执行以下功能:用于获取初步识别文本中的主题词,主题词为初步识别文本中关键信息的词,初步识别文本为根据语音信号识别得到的文本;根据主题词获取目标相关信息,目标相关信息为与主题词对应的上下文信息;根据目标相关信息建立目标语言库。In the embodiment of the present invention, the central processing unit 722 is further configured to perform the following functions: acquiring a keyword in the preliminary identification text, where the keyword is a word of key information in the preliminary identification text and the preliminary identification text is text recognized from a speech signal; acquiring target related information according to the keyword, where the target related information is context information corresponding to the keyword; and establishing a target language library according to the target related information.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,具体用于根据初步识别文本按照公式1获取主题词,其中,公式1为:The central processing unit 722 is specifically configured to obtain the keyword according to the formula 1 according to the preliminary identification text, where the formula 1 is:
Score(i)=tf(i)*idf(i),其中,i指初步识别文本中第i个词,tf(i)指第i个词在初步识别文本中出现的次数,idf(i)指第i个词在初步识别文本中的逆文档频率。Score(i)=tf(i)*idf(i), where i refers to the i-th word in the preliminary identification text, tf(i) refers to the number of times the i-th word appears in the preliminary identification text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary identification text.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,还用于接收语音信号;根据语音信号获取对应的初步识别文本。The central processing unit 722 is further configured to receive a voice signal, and obtain a corresponding preliminary identification text according to the voice signal.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,具体用于根据主题词通过全网搜索获取目标相关信息。The central processing unit 722 is specifically configured to obtain target related information through a full network search according to the keyword.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,具体还用于根据主题词通过全网搜索,获取对应的搜索结果;将搜索结果进行匹配,确定目标相关信息。The central processing unit 722 is specifically configured to obtain a corresponding search result by searching through the entire network according to the keyword, and matching the search result to determine the target related information.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,具体还用于在预置的相关信息集合中,提取与主题词对应的目标相关信息。The central processing unit 722 is further configured to extract target related information corresponding to the keyword in the preset related information set.
可选的,在本发明的一些实施例中,Optionally, in some embodiments of the invention,
中央处理器722,具体用于根据目标相关信息进行训练,建立目标语言库。The central processing unit 722 is specifically configured to perform training according to the target related information to establish a target language library.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and in actual implementation there may be other manners of division, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention in essence, or the part of it that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
以上所述,以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (16)

  1. 一种语音识别的方法,包括:A method for speech recognition, comprising:
    获取初步识别文本中的主题词,所述主题词为所述初步识别文本中关键信息的词,所述初步识别文本为根据语音信号识别得到的文本;Obtaining a keyword in the preliminary identification text, the keyword is a word of the key information in the preliminary identification text, and the preliminary identification text is a text recognized according to the voice signal;
    根据所述主题词获取目标相关信息,所述目标相关信息为与所述主题词对应的上下文信息;Obtaining target related information according to the keyword, the target related information being context information corresponding to the keyword;
    根据所述目标相关信息建立目标语言库;Establishing a target language library according to the target related information;
    在下次接收的语音信号获取的识别文本中利用所述目标语言库对所述主题词或者所述主题词相关的主题词进行识别。The keyword or the keyword related to the keyword is identified by the target language library in the identification text acquired by the next received voice signal.
  2. 根据权利要求1所述的方法,其中,所述获取初步识别文本中的主题词,包括:The method of claim 1, wherein the obtaining the keyword in the preliminary identification text comprises:
    根据所述初步识别文本按照公式1获取所述主题词,其中,所述公式1为:Obtaining the keyword according to formula 1 according to the preliminary identification text, wherein the formula 1 is:
    Score(i)=tf(i)*idf(i),其中,i指所述初步识别文本中第i个词,tf(i)指第i个词在所述初步识别文本中出现的次数,idf(i)指第i个词在所述初步识别文本中的逆文档频率。Score(i)=tf(i)*idf(i), where i refers to the i-th word in the preliminary identification text, tf(i) refers to the number of times the i-th word appears in the preliminary identification text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary identification text.
  3. 根据权利要求1所述的方法,其中,所述获取初步识别文本中的主题词之前,所述方法还包括:The method of claim 1, wherein before the obtaining the keyword in the preliminary identification text, the method further comprises:
    接收语音信号;Receiving a voice signal;
    根据所述语音信号获取对应的初步识别文本。Corresponding preliminary identification text is obtained according to the voice signal.
  4. 根据权利要求1-3任一所述的方法,其中,所述根据所述主题词获取目标相关信息,包括:The method according to any one of claims 1-3, wherein the obtaining target related information according to the keyword includes:
    根据所述主题词通过全网搜索获取所述目标相关信息。Obtaining the target related information through a full network search according to the keyword.
  5. 根据权利要求4所述的方法,其中,所述根据所述主题词通过全网搜索获取所述目标相关信息,包括:The method according to claim 4, wherein the obtaining the target related information through a full network search according to the keyword includes:
    根据所述主题词通过全网搜索,获取对应的搜索结果;Searching through the entire network according to the keyword to obtain a corresponding search result;
    将所述搜索结果进行匹配,确定所述目标相关信息。The search results are matched to determine the target related information.
  6. 根据权利要求1-3任一所述的方法,其中,所述根据所述主题词获取目标相关信息,包括:The method according to any one of claims 1-3, wherein the obtaining target related information according to the keyword includes:
    在预置的相关信息集合中,提取与所述主题词对应的目标相关信息。In the preset related information set, target related information corresponding to the keyword is extracted.
  7. 根据权利要求1-3任一所述的方法,其中,所述根据所述目标相关信息建立目标语言库,包括:The method according to any one of claims 1-3, wherein the establishing a target language library according to the target related information comprises:
    根据所述目标相关信息进行训练,建立所述目标语言库。Training is performed according to the target related information to establish the target language library.
  8. 一种计算机,包括:A computer comprising:
    第一获取模块,用于获取初步识别文本中的主题词,所述主题词为所述初步识别文本中关键信息的词,所述初步识别文本为根据语音信号识别得到的文本;a first obtaining module, configured to acquire a keyword in the preliminary identification text, where the keyword is a word of the key information in the preliminary identification text, and the preliminary identification text is a text identified according to the voice signal;
    第二获取模块,用于根据所述主题词获取目标相关信息,所述目标相关信息为与所述主题词对应的上下文信息;a second acquiring module, configured to acquire target related information according to the keyword, where the target related information is context information corresponding to the keyword;
    建立模块,用于根据所述目标相关信息建立目标语言库,在下次接收的语音信号获取的识别文本中利用所述目标语言库对所述主题词或者所述主题词相关的主题词进行识别。And a establishing module, configured to establish a target language library according to the target related information, and use the target language library to identify the keyword or the keyword related to the keyword in the recognized text acquired by the next received voice signal.
  9. 根据权利要求8所述的计算机,其中,The computer according to claim 8, wherein
    所述第一获取模块,具体用于根据所述初步识别文本按照公式1获取所述主题词,其中,所述公式1为:The first obtaining module is configured to obtain the keyword according to the preliminary identification text according to the formula 1, wherein the formula 1 is:
    Score(i)=tf(i)*idf(i),其中,i指所述初步识别文本中第i个词,tf(i)指第i个词在所述初步识别文本中出现的次数,idf(i)指第i个词在所述初步识别文本中的逆文档频率。Score(i)=tf(i)*idf(i), where i refers to the i-th word in the preliminary identification text, tf(i) refers to the number of times the i-th word appears in the preliminary identification text, and idf(i) refers to the inverse document frequency of the i-th word in the preliminary identification text.
  10. 根据权利要求8所述的计算机,其中,所述计算机还包括:The computer of claim 8 wherein said computer further comprises:
    接收模块,用于接收语音信号;a receiving module, configured to receive a voice signal;
    第三获取模块,用于根据所述语音信号获取对应的初步识别文本。And a third acquiring module, configured to acquire a corresponding preliminary identification text according to the voice signal.
  11. The computer according to any one of claims 8 to 10, wherein
    the second obtaining module is specifically configured to obtain the target related information through a web-wide search according to the keyword.
  12. The computer according to claim 11, wherein
    the second obtaining module is further configured to obtain corresponding search results through a web-wide search according to the keyword, and to match the search results to determine the target related information.
  13. The computer according to any one of claims 8 to 10, wherein
    the second obtaining module is further configured to extract, from a preset related-information set, the target related information corresponding to the keyword.
  14. The computer according to any one of claims 8 to 10, wherein
    the establishing module is specifically configured to perform training according to the target related information to establish the target language library.
  15. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to perform, when run, the method according to any one of claims 1 to 7.
  16. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to perform, by means of the computer program, the method according to any one of claims 1 to 7.
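Claims 8 to 16 describe one pipeline: recognize speech into preliminary text, extract a keyword from it, fetch context for that keyword, and train a per-keyword language library that biases later recognitions. A minimal sketch of that module structure — every class, method, and data value below is an illustrative assumption (the recognizer, retrieval, and training steps are stubs, not the patent's implementation):

```python
class SpeechAdapter:
    """Sketch of claims 8-14: each method stands in for one claimed module."""

    def __init__(self, related_info_set):
        # Preset related-information set (claim 13): keyword -> context docs.
        self.related_info_set = related_info_set
        self.language_library = {}  # target language library (claim 8)

    def third_obtain(self, speech_signal):
        # Third obtaining module (claim 10): speech -> preliminary text.
        # Stub: assume the recognizer already yields text.
        return speech_signal

    def first_obtain(self, text):
        # First obtaining module (claim 8): pick the keyword. Here the
        # longest word stands in for the TF-IDF scoring of claim 9.
        return max(text.split(), key=len)

    def second_obtain(self, keyword):
        # Second obtaining module, claim 13 variant: look up the preset set.
        return self.related_info_set.get(keyword, [])

    def establish(self, related_info):
        # Establishing module (claim 14): "training" stub that records
        # every word seen in the keyword's context documents.
        for doc in related_info:
            for word in doc.split():
                self.language_library[word] = True

    def process(self, speech_signal):
        text = self.third_obtain(speech_signal)
        keyword = self.first_obtain(text)
        self.establish(self.second_obtain(keyword))
        return keyword

adapter = SpeechAdapter({"quarterly": ["quarterly revenue report",
                                       "quarterly forecast"]})
print(adapter.process("read the quarterly summary"))  # -> quarterly
print("forecast" in adapter.language_library)         # -> True
```

After one pass, words from the keyword's context ("forecast", "revenue") are in the library, so a later utterance containing them can be recognized against this specialized vocabulary rather than a generic one, which is the adaptation effect the claims are after.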
PCT/CN2018/077413 2017-03-02 2018-02-27 Speech recognition method, computer, storage medium, and electronic apparatus WO2018157789A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710121180.2 2017-03-02
CN201710121180.2A CN108538286A (en) 2017-03-02 2017-03-02 A kind of method and computer of speech recognition

Publications (1)

Publication Number Publication Date
WO2018157789A1 true WO2018157789A1 (en) 2018-09-07

Family

ID=63370555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077413 WO2018157789A1 (en) 2017-03-02 2018-02-27 Speech recognition method, computer, storage medium, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN108538286A (en)
WO (1) WO2018157789A1 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522392A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Voice-based search method, server and computer readable storage medium
CN111081226B (en) * 2018-10-18 2024-02-13 北京搜狗科技发展有限公司 Speech recognition decoding optimization method and device
CN109376658B (en) * 2018-10-26 2022-03-08 信雅达科技股份有限公司 OCR method based on deep learning
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109360554A (en) * 2018-12-10 2019-02-19 广东潮庭集团有限公司 A kind of language identification method based on language deep neural network
CN109299248A (en) * 2018-12-12 2019-02-01 成都航天科工大数据研究院有限公司 A kind of business intelligence collection method based on natural language processing
CN109559744B (en) * 2018-12-12 2022-07-08 泰康保险集团股份有限公司 Voice data processing method and device and readable storage medium
CN110136688B (en) * 2019-04-15 2023-09-29 平安科技(深圳)有限公司 Text-to-speech method based on speech synthesis and related equipment
CN110349568B (en) * 2019-06-06 2024-05-31 平安科技(深圳)有限公司 Voice retrieval method, device, computer equipment and storage medium
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN111444318A (en) * 2020-04-08 2020-07-24 厦门快商通科技股份有限公司 Text error correction method
CN112017645B (en) * 2020-08-31 2024-04-26 广州市百果园信息技术有限公司 Voice recognition method and device
CN112468665A (en) * 2020-11-05 2021-03-09 中国建设银行股份有限公司 Method, device, equipment and storage medium for generating conference summary
CN112632319B (en) * 2020-12-22 2023-04-11 天津大学 Method for improving overall classification accuracy of long-tail distributed speech based on transfer learning
CN113077792B (en) * 2021-03-24 2024-03-05 平安科技(深圳)有限公司 Buddhism subject term identification method, device, equipment and storage medium
CN113129866B (en) * 2021-04-13 2022-08-02 重庆度小满优扬科技有限公司 Voice processing method, device, storage medium and computer equipment
CN113658585B (en) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method of voice interaction model, voice interaction method and device
CN113961694B (en) * 2021-09-22 2024-08-06 福建亿榕信息技术有限公司 A method and system for auxiliary analysis of the operation status of each unit of a company based on meetings

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
CN103229137A (en) * 2010-09-29 2013-07-31 国际商业机器公司 Context-based disambiguation of acronyms and abbreviations
CN103544140A (en) * 2012-07-12 2014-01-29 国际商业机器公司 Data processing method, display method and corresponding devices
CN103680498A (en) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and speech recognition equipment
US20160379626A1 (en) * 2015-06-26 2016-12-29 Michael Deisher Language model modification for local speech recognition systems using remote sources
CN106328145A (en) * 2016-08-19 2017-01-11 北京云知声信息技术有限公司 Voice correction method and voice correction device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624B (en) * 2007-05-29 2015-11-25 阿里巴巴集团控股有限公司 A kind of method and apparatus of text subject recommending
CN203456091U (en) * 2013-04-03 2014-02-26 中金数据系统有限公司 Construction system of speech corpus
CN106297800B (en) * 2016-08-10 2021-07-23 中国科学院计算技术研究所 A method and device for adaptive speech recognition
CN106328147B (en) * 2016-08-31 2022-02-01 中国科学技术大学 Speech recognition method and device

Also Published As

Publication number Publication date
CN108538286A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
WO2018157789A1 (en) Speech recognition method, computer, storage medium, and electronic apparatus
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
US10176804B2 (en) Analyzing textual data
CN106537370B (en) Method and system for robust tagging of named entities in the presence of source and translation errors
WO2021051521A1 (en) Response information obtaining method and apparatus, computer device, and storage medium
KR101543992B1 (en) Intra-language statistical machine translation
US9330661B2 (en) Accuracy improvement of spoken queries transcription using co-occurrence information
CN111611807B (en) A neural network-based keyword extraction method, device and electronic equipment
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
US10290299B2 (en) Speech recognition using a foreign word grammar
JP2004005600A (en) Method and system for indexing and retrieving document stored in database
US10592542B2 (en) Document ranking by contextual vectors from natural language query
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
US20150178274A1 (en) Speech translation apparatus and speech translation method
CN105096942A (en) Semantic analysis method and semantic analysis device
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
CN113743090A (en) Keyword extraction method and device
WO2025044865A1 (en) Cross-domain problem processing methods and apparatuses, electronic device and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
WO2022227166A1 (en) Word replacement method and apparatus, electronic device, and storage medium
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN118747500A (en) Chinese language translation method and system based on neural network model
CN111161730B (en) Voice instruction matching method, device, equipment and storage medium
CN113486155B (en) Chinese naming method fusing fixed phrase information
JP2005202924A (en) Translation determination system, method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18761662

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18761662

Country of ref document: EP

Kind code of ref document: A1
