WO2019136993A1

WO2019136993A1 - Text similarity calculation method and device, computer apparatus, and storage medium

Info

Publication number: WO2019136993A1
Application number: PCT/CN2018/099994
Authority: WO
Inventors: 艾明
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2018-01-12
Filing date: 2018-08-10
Publication date: 2019-07-18
Also published as: CN108304378A; CN108304378B

Abstract

A text similarity calculation method comprises: obtaining a character sequence to be matched and a target character sequence; respectively preprocessing the character sequence to be matched and the target character sequence to obtain a corresponding word sequence to be matched and a corresponding target word sequence; performing calculation, by means of a first similarity algorithm, on a word to be matched contained in the word sequence to be matched and a target word contained in the target word sequence, so as to obtain a first similarity degree; extracting all of the words to be matched, so as to form a set of words to be matched, and extracting all of the target words to form a target word set; performing calculation on the set of words to be matched and the target word set by means of a second similarity algorithm, so as to obtain a second similarity degree; and calculating, according to the first similarity degree and the second similarity degree, a text similarity degree between the character sequence to be matched and the target character sequence.

Description

Text similarity calculation method, device, computer device and storage medium

This application claims priority to Chinese Patent Application No. 2018100317700, entitled "Text Similarity Calculation Method, Apparatus, Computer Device and Storage Medium" and filed with the Chinese Patent Office on January 12, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to a text similarity calculation method, apparatus, computer device and storage medium.

Background

With the development of chatbot technology, the concept of fuzzy string search has emerged, and the edit distance algorithm is commonly used to implement string matching. The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. The larger the edit distance, the lower the similarity between the strings.

However, because of the complexity of language, the same meaning can be expressed by different texts, while texts that look very similar on the surface may express very different meanings. The traditional edit distance algorithm usually operates in units of single characters: it computes the edit distance between character sequences, so the result reflects only the surface distance between the texts, and the accuracy of the calculated text similarity is low.

Summary of the invention

According to various embodiments disclosed herein, a text similarity calculation method, apparatus, computer device, and storage medium capable of improving the accuracy of text similarity calculation are provided.

A text similarity calculation method includes: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding word sequence to be matched and a corresponding target word sequence; performing calculation, by a first similarity algorithm, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain a first similarity; extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set; performing calculation on the set of words to be matched and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, a text similarity between the character sequence to be matched and the target character sequence.

A text similarity calculation device includes: a character sequence acquisition module, configured to acquire a character sequence to be matched and a target character sequence; a word sequence acquisition module, configured to preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding word sequence to be matched and a corresponding target word sequence; a first similarity calculation module, configured to perform calculation, by a first similarity algorithm, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain a first similarity; a word set forming module, configured to extract all the words to be matched to form a set of words to be matched, and to extract all the target words to form a target word set; a second similarity calculation module, configured to perform calculation on the set of words to be matched and the target word set by a second similarity algorithm to obtain a second similarity; and a text similarity calculation module, configured to calculate, according to the first similarity and the second similarity, a text similarity between the character sequence to be matched and the target character sequence.

A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding word sequence to be matched and a corresponding target word sequence; performing calculation, by a first similarity algorithm, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain a first similarity; extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set; performing calculation on the set of words to be matched and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, a text similarity between the character sequence to be matched and the target character sequence.

One or more non-transitory computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding word sequence to be matched and a corresponding target word sequence; performing calculation, by a first similarity algorithm, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain a first similarity; extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set; performing calculation on the set of words to be matched and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, a text similarity between the character sequence to be matched and the target character sequence.

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the present invention will be apparent from the description, drawings and claims.

Brief description of the drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is an application scenario diagram of a text similarity calculation method in accordance with one or more embodiments.

FIG. 2 is a flow diagram of a text similarity calculation method in accordance with one or more embodiments.

FIG. 3A is a schematic diagram of a word tree derived from a physical substance in accordance with one or more embodiments.

FIG. 3B is a schematic diagram of a word tree derived from a virtual event in accordance with one or more embodiments.

FIG. 4 is a flow diagram of a text similarity calculation method in accordance with one or more other embodiments.

FIG. 5 is a block diagram showing the structure of a text similarity calculation apparatus according to one or more embodiments.

FIG. 6 is a diagram showing the internal structure of a computer device in accordance with one or more embodiments.

Detailed description

In order to make the technical solutions and advantages of the present application more clear, the present application will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the application and are not intended to be limiting.

It will be understood that the terms "first", "second" and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, the first similarity may be referred to as a second similarity without departing from the scope of the present application, and similarly, the second similarity may be referred to as a first similarity. Both the first similarity and the second similarity are similarities, but they are not the same similarity.

The text similarity calculation method provided by the present application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. For example, the server 104 can receive a character sequence to be matched sent by the terminal 102. The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet, or portable wearable device, and the server 104 can be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.

In one embodiment, as shown in FIG. 2, a text similarity calculation method is provided. Taking its application to the server 104 in FIG. 1 as an example, the method includes the following steps:

Step 202: Acquire a sequence of characters to be matched and a sequence of target characters.

The character sequence to be matched is the character sequence that needs to be matched. The target character sequence is a character sequence preset in the database against which the character sequence to be matched is matched. A character sequence is a sequence formed of characters in order. It may include, but is not limited to, one or more of letters, Arabic numerals, Chinese characters, and punctuation marks.

Step 204: Preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding word sequence to be matched and a corresponding target word sequence.

Preprocessing refers to operations such as converting, reducing, and splitting applied to at least one of the character sequence to be matched and the target character sequence. A word sequence is a sequence formed of words in order. The word sequence to be matched is the word sequence obtained by preprocessing the character sequence to be matched, and consists of words to be matched in order. The target word sequence is the word sequence obtained by preprocessing the target character sequence, and consists of target words in order. A word to be matched or a target word may be a simple word composed of one or more characters, or a compound word composed of two or more simple words.

In one embodiment, step 204 includes: deleting the irrelevant characters contained in the character sequence to be matched and the irrelevant characters contained in the target character sequence; and performing word segmentation on the character sequence to be matched and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding word sequence to be matched and target word sequence.

Irrelevant characters are characters that do not affect the calculation of text similarity, including but not limited to punctuation marks and stop characters. Word segmentation is the process of converting a character sequence into a word sequence according to certain rules. Segmentation methods based on string matching, on understanding, or on statistics may be used to segment the character sequence to be matched and the target character sequence after the irrelevant characters are deleted.
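The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the stop-character library, punctuation set, and segmentation dictionary are hypothetical, the sample input is chosen merely to mirror the translated examples in this description, and forward maximum matching is used as one simple instance of a string-matching-based segmentation method.

```python
# Hypothetical resources; a real system would load them from a lexicon.
STOP_CHARS = {"是", "的", "啊", "吧"}        # assumed stop-character library
PUNCTUATION = set("?？!！,，.。;；:：")        # punctuation treated as irrelevant
DICTIONARY = {"电脑", "多少", "价格"}         # assumed segmentation dictionary
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def delete_irrelevant(chars: str) -> str:
    """Drop punctuation marks and stop characters, keeping character order."""
    return "".join(c for c in chars
                   if c not in STOP_CHARS and c not in PUNCTUATION)


def segment(chars: str) -> list[str]:
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(chars):
        for size in range(min(MAX_WORD_LEN, len(chars) - i), 0, -1):
            piece = chars[i:i + size]
            if size == 1 or piece in DICTIONARY:
                words.append(piece)
                i += size
                break
    return words


def preprocess(chars: str) -> list[str]:
    """Character sequence -> word sequence."""
    return segment(delete_irrelevant(chars))


print("|".join(preprocess("3台电脑是多少钱？")))
```

Deleting stop characters before segmentation follows the order given in the text; a production segmenter would more likely be statistical than dictionary-based.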

Step 206: Perform calculation, by a first similarity algorithm, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain a first similarity.

The first similarity algorithm is an algorithm that compares two word sequences word by word, in word order, and then calculates their similarity. The word sequence to be matched and the target word sequence are each treated as a one-dimensional word sequence, and the first similarity is calculated by the first similarity algorithm according to the order of the words to be matched and the order of the target words. Performing the similarity calculation on the one-dimensional forms of the word sequence to be matched and the target word sequence can save system storage space and reduce time complexity.

In one embodiment, step 206 includes: performing calculation, by an edit distance formula, on the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence to obtain the edit distance between the word sequence to be matched and the target word sequence; obtaining a first number, which is the number of words to be matched contained in the word sequence to be matched, and a second number, which is the number of target words contained in the target word sequence; and calculating the first similarity according to the edit distance, the first number, and the second number.

The edit distance is the minimum number of edit operations required to transform one word sequence into the other. Calculating the edit distance between two word sequences in units of words can reduce the influence of word semantics on the edit distance between the word sequences and improve the accuracy of the calculated word sequence similarity.

For example, the word sequence to be matched S of length |S| contains |S| words to be matched, and the target word sequence T of length |T| contains |T| target words. The edit distance lev_{S,T}(i, j) between the word sequence to be matched S and the target word sequence T can be calculated by the formula

$$
\mathrm{lev}_{S,T}(i,j)=
\begin{cases}
\max(i,j), & \min(i,j)=0,\\
\min\bigl\{\mathrm{lev}_{S,T}(i,j-1)+1,\ \mathrm{lev}_{S,T}(i-1,j)+1,\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\bigr\}, & \text{otherwise,}
\end{cases}
$$

where i denotes the i-th word to be matched in the word sequence to be matched S, and j denotes the j-th target word in the target word sequence T. When at least one of i and j is 0, the edit distance lev_{S,T}(i, j) takes the maximum of i and j; otherwise, it takes the minimum of lev_{S,T}(i, j-1) + 1, lev_{S,T}(i-1, j) + 1, and lev_{S,T}(i-1, j-1) + 1_{(S_i ≠ T_j)}, where the indicator 1_{(S_i ≠ T_j)} is 0 when the i-th word to be matched equals the j-th target word and 1 otherwise. Based on the calculated edit distance, the first similarity sim1_{S,T}(i, j) can be calculated by the following similarity formula, evaluated at i = |S| and j = |T|:

$$
\mathrm{sim1}_{S,T}(i,j)=1-\frac{\mathrm{lev}_{S,T}(i,j)}{\mathrm{Max}(|S|,|T|)}
$$

Max(|S|, |T|) denotes the larger of |S| and |T|. The first similarity sim1_{S,T}(i, j) takes a value from 0 to 1.
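A minimal sketch of the word-level edit distance and the first similarity is given below. It assumes the standard Levenshtein recurrence with unit costs and normalisation by the longer sequence length; the function names are illustrative, and the sample word lists are hypothetical ("PC" stands in for a second word meaning "computer").

```python
def word_edit_distance(s: list[str], t: list[str]) -> int:
    """Levenshtein distance between two word sequences (one word per edit)."""
    prev = list(range(len(t) + 1))              # lev(0, j) = j
    for i, s_word in enumerate(s, start=1):
        curr = [i]                              # lev(i, 0) = i
        for j, t_word in enumerate(t, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,            # delete s_word
                            curr[j - 1] + 1,        # insert t_word
                            prev[j - 1] + cost))    # substitute or keep
        prev = curr
    return prev[-1]


def first_similarity(s: list[str], t: list[str]) -> float:
    """1 - lev / max(|S|, |T|), a value between 0 and 1."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1.0 - word_edit_distance(s, t) / longest


s = ["3", "set", "computer", "how much", "money"]
t = ["3", "set", "PC", "price"]
print(word_edit_distance(s, t), first_similarity(s, t))
```

Only two rows of the distance table are kept at a time, which is in line with the storage saving that the description attributes to working on one-dimensional word sequences.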

Step 208: Extract all the to-be-matched words to form a set of to-be-matched words, and extract all the target words to form a target word set.

The set of to-be-matched words refers to a set consisting of all the to-be-matched words contained in the sequence of words to be matched. The target word set refers to a set consisting of all target words contained in the target word sequence. The words to be matched in the set of words to be matched do not have an order, and accordingly, the target words in the set of target words do not have an order.

In one of the embodiments, a number expressed in words can also be converted to Arabic numerals; for example, "thirty-three" can be converted to "33". Unifying numbers as Arabic numerals makes them easier to match and improves the accuracy of the text similarity.

Step 210: Perform calculation on the set of words to be matched and the target word set by a second similarity algorithm to obtain a second similarity.

The second similarity algorithm is a similarity algorithm that compares all the words to be matched and all the target words as a whole. It includes, but is not limited to, lexical similarity algorithms based on a semantic dictionary and lexical similarity algorithms based on corpus statistics.

In one embodiment, step 210 includes: matching the set of words to be matched against the target word set and counting the number of matches between words to be matched and target words; counting the number of words to be matched in the set of words to be matched and the number of target words in the target word set; and calculating the second similarity according to the number of matches, the number of words to be matched, and the number of target words.

Step 212: Perform calculation according to the first similarity and the second similarity to obtain a text similarity of the character sequence to be matched and the target character sequence.

The text similarity is the similarity between the character sequence to be matched and the target character sequence. After the first similarity and the second similarity are calculated, the first similarity may be multiplied by the second similarity and the product used as the text similarity. Alternatively, a first weight corresponding to the first similarity and a second weight corresponding to the second similarity may be preset, and the first similarity and the second similarity are weighted and summed to obtain the text similarity.
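The two ways of combining the similarities can be sketched as below. The weight values are hypothetical defaults chosen only for illustration; the text does not specify them.

```python
def combine_by_product(sim1: float, sim2: float) -> float:
    """Text similarity as the product of the first and second similarity."""
    return sim1 * sim2


def combine_by_weights(sim1: float, sim2: float,
                       w1: float = 0.5, w2: float = 0.5) -> float:
    """Text similarity as a weighted sum; w1 and w2 are assumed preset weights."""
    return w1 * sim1 + w2 * sim2


print(combine_by_product(0.4, 0.6))
print(combine_by_weights(0.4, 0.6, w1=0.7, w2=0.3))
```

The product is high only when both the order-aware and the order-free similarity are high, whereas the weighted sum lets one of them compensate for the other.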

In the above text similarity calculation method, after the character sequence to be matched and the target character sequence are obtained, both are preprocessed to obtain the word sequence to be matched and the target word sequence, each formed in order in units of words. The first similarity is calculated by a first similarity algorithm that takes word order into account. The set of words to be matched and the target word set are then formed from the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence, and the second similarity is calculated by a second similarity algorithm that does not consider word order. Finally, the first similarity and the second similarity are combined to calculate the text similarity between the character sequence to be matched and the target character sequence. Performing the similarity calculation in units of words and combining multiple similarity algorithms reduces the error that arises when a single similarity algorithm considers only a single feature, and improves the accuracy of the text similarity calculation.

In one embodiment, the irrelevant characters include stop characters and identical characters. Deleting the irrelevant characters contained in the character sequence to be matched and the irrelevant characters contained in the target character sequence includes: deleting the stop characters contained in the character sequence to be matched and the stop characters contained in the target character sequence; judging whether identical characters exist in the character sequence to be matched and the target character sequence after the stop characters are deleted, an identical character being a character that is the same and occupies the same position in both sequences after the stop characters are deleted; and if so, deleting the identical characters from both sequences, after which the corresponding word sequence to be matched and target word sequence are obtained.

Stop characters are characters or words that, in information retrieval, can be filtered out before a character sequence is processed in order to save storage space and improve search efficiency. A stop character library can be preset for filtering stop characters. For example, Chinese stop characters include, but are not limited to, modal particles, conjunctions, and adversative words, such as "ah", "bar", "哎", "of", "further", "but" and the like. When a stop character is detected, the stop character contained in the character sequence to be matched or in the target character sequence can be deleted.

Since the characters to be matched in the character sequence to be matched and the target characters in the target character sequence are ordered, they are compared in order, and a character that is the same and at the same position in both sequences is treated as an identical character. The identical characters are deleted from the character sequence to be matched and from the target character sequence respectively. For example, the character sequence to be matched is “how to optimize this algorithm”, and the target character sequence is “how to do the optimization algorithm”. Matching shows that the characters rendered as “calculation” and “method”, which together form the word “algorithm”, are at the same positions in both sequences, so they can be deleted. After the identical characters are deleted, the character sequence to be matched is “How to optimize this” and the target character sequence is “Optimize what to do”.
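The deletion of identical characters can be sketched as follows, assuming that "same position" means the same index counted from the start of each sequence. The function name and the sample strings are synthetic and serve only as an illustration.

```python
from itertools import zip_longest


def delete_same_position(s: str, t: str) -> tuple[str, str]:
    """Remove characters that are equal and sit at the same index in both
    sequences; everything else keeps its original order."""
    kept_s, kept_t = [], []
    for a, b in zip_longest(s, t):       # pads the shorter sequence with None
        if a is not None and a == b:
            continue                     # identical character: drop from both
        if a is not None:
            kept_s.append(a)
        if b is not None:
            kept_t.append(b)
    return "".join(kept_s), "".join(kept_t)


print(delete_same_position("abcde", "xbcdz"))
```

Characters beyond the end of the shorter sequence have no counterpart and are therefore always kept.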

In the above embodiment, deleting irrelevant characters such as stop characters and identical characters shortens the word sequences that take part in the text similarity calculation, which saves calculation time, reduces the memory occupied by the calculation, and improves the efficiency of the text similarity calculation.

In one embodiment, the irrelevant characters contained in the character sequence to be matched or the target character sequence may first be replaced with a preset character, and all occurrences of the preset character may then be cleared. For example, after the irrelevant characters contained in the character sequence to be matched S1 are replaced with space characters, a character sequence to be matched S2 containing space characters is obtained; clearing all the space characters contained in S2 yields a character sequence to be matched S3 that contains no space characters. Word segmentation can then be performed on S3 to obtain a word sequence to be matched S4.

In one of the embodiments, before the second similarity is calculated by the semantic-dictionary-based lexical similarity algorithm, word trees can be constructed from the hypernym and hyponym relationships of the words in the semantic dictionary, as shown in FIG. 3A and FIG. 3B. FIG. 3A shows a word tree derived from physical substance, and FIG. 3B shows a word tree derived from virtual events. The word corresponding to a parent node and the words corresponding to its child nodes stand in a hypernym and hyponym relationship. The semantic distance between words can be calculated from the word tree: the higher the level, the larger the path parameter, and the lower the level, the smaller the path parameter. The greater the distance, the smaller the similarity. After the path length between word A and word B in the word tree, that is, the semantic distance d, has been calculated, the similarity between word A and word B can be calculated from d according to a formula in which α is a parameter.
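The word-tree distance can be sketched as below. Several points are assumptions rather than statements of the text: the tree contents are hypothetical, every edge is given a path parameter of 1 (the text describes level-dependent path parameters), and the similarity is taken as α / (d + α), a commonly used form that decreases as d grows; the exact formula is not reproduced in this text.

```python
# Hypothetical word tree stored as child -> parent links.
PARENT = {
    "animal": "entity", "plant": "entity",
    "dog": "animal", "cat": "animal", "tree": "plant",
}


def ancestors(word: str) -> list[str]:
    """Chain from the word up to the root: [word, parent, ..., root]."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a: str, b: str) -> int:
    """Number of edges between a and b through their lowest common ancestor."""
    steps_from_b = {node: k for k, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(ancestors(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("words are not in the same word tree")


def tree_similarity(a: str, b: str, alpha: float = 1.0) -> float:
    """Assumed form alpha / (d + alpha); alpha = 1.0 is an arbitrary choice."""
    return alpha / (path_length(a, b) + alpha)


print(path_length("dog", "cat"), tree_similarity("dog", "cat"))
print(path_length("dog", "tree"), tree_similarity("dog", "tree"))
```

Level-dependent path parameters could be added by storing a weight per edge and summing the weights instead of counting edges.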

In one embodiment, the set of words to be matched and the target word set may be matched against each other, and the second similarity algorithm is used to calculate a second sub-similarity between each word to be matched in the set of words to be matched and each target word in the target word set. The second similarity can then be calculated from all the calculated second sub-similarities.

In one embodiment, after all the words to be matched are extracted to form the set of words to be matched and all the target words are extracted to form the target word set, the number of matches Q(S, T), that is, the number of words to be matched whose second sub-similarity satisfies a preset sub-similarity threshold, is counted, together with the number of words to be matched |S| contained in the set of words to be matched and the number of target words |T| contained in the target word set. The second similarity sim2 can be calculated by the formula

$$
\mathrm{sim2}=\frac{Q(S,T)}{\mathrm{Max}(|S|,|T|)}
$$

where Max(|S|, |T|) denotes the larger of the number of words to be matched |S| and the number of target words |T|.
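A minimal sketch of the second similarity follows. It assumes that a word to be matched counts as a match when its sub-similarity with at least one target word reaches the threshold, and that sim2 = Q(S, T) / Max(|S|, |T|); the threshold value and the exact-match sub-similarity are illustrative stand-ins for a semantic-dictionary or corpus-based measure.

```python
from typing import Callable, Iterable


def exact_match(a: str, b: str) -> float:
    """Stand-in sub-similarity: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def second_similarity(s_words: Iterable[str], t_words: Iterable[str],
                      sub_similarity: Callable[[str, str], float] = exact_match,
                      threshold: float = 0.8) -> float:
    """Order-free similarity between two word sets."""
    s_set, t_set = set(s_words), set(t_words)
    if not s_set or not t_set:
        return 0.0  # assumption: an empty set matches nothing
    matches = sum(
        1 for s in s_set
        if any(sub_similarity(s, t) >= threshold for t in t_set)
    )
    return matches / max(len(s_set), len(t_set))


s = ["3", "set", "computer", "how much", "money"]
t = ["price", "computer", "set", "3"]          # word order does not matter
print(second_similarity(s, t))
```

Because the inputs are converted to sets, repeated words are counted once and the result does not depend on word order.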

In an embodiment, after the character sequence to be matched and the target character sequence are obtained, the method further includes: acquiring a pinyin sequence to be matched corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence; and performing calculation, by the first similarity algorithm, on the pinyin to be matched contained in the pinyin sequence to be matched and the target pinyin contained in the target pinyin sequence to obtain a third similarity. Calculating the text similarity between the character sequence to be matched and the target character sequence according to the first similarity and the second similarity then includes: calculating the text similarity between the character sequence to be matched and the target character sequence according to the first similarity, the second similarity, and the third similarity.

The pinyin sequence to be matched is the sequence composed of the pinyin corresponding to the characters to be matched in the character sequence to be matched. The target pinyin sequence is the sequence composed of the pinyin corresponding to the target characters in the target character sequence. The pinyin sequence to be matched may be generated by acquiring the pinyin that the user enters for each character to be matched during an input operation. The target pinyin sequence may be a sequence preset in the database that corresponds to the target character sequence. The pinyin sequence to be matched and the target pinyin sequence may be processed by the first similarity algorithm, in units of the pinyin corresponding to each character, to obtain the third similarity.

For example, the pinyin sequence to be matched corresponding to the character sequence to be matched "Your name" is "ni ming zi ao kou", and the target pinyin sequence corresponding to the target character sequence "You are too stubborn" is "ni tai zhi niu". Although the character sequence to be matched and the target character sequence both contain the character "拗", the pinyin corresponding to "拗" is "ao" in the pinyin sequence to be matched and "niu" in the target pinyin sequence. Therefore, including the pinyin sequence to be matched and the target pinyin sequence in the text similarity calculation reduces the error caused by the polyphonic character "拗".

In the above embodiment, introducing the pinyin sequence to be matched and the target pinyin sequence makes it possible to detect cases in which a polyphonic character gives the same character different semantics, thereby reducing the text similarity error caused by polyphonic characters.
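The third similarity can be sketched by applying an edit distance to pinyin syllables instead of words. The sketch assumes the pinyin sequences are already available (here, the two sequences from the example above) and uses the same normalisation by the longer sequence length; the memoised recursion mirrors the edit distance recurrence and is suitable only for short sequences.

```python
from functools import lru_cache


def edit_distance(s: list[str], t: list[str]) -> int:
    """Edit distance between two sequences, written as a memoised recurrence."""
    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:
            return max(i, j)
        cost = 0 if s[i - 1] == t[j - 1] else 1
        return min(lev(i - 1, j) + 1,          # delete
                   lev(i, j - 1) + 1,          # insert
                   lev(i - 1, j - 1) + cost)   # substitute or keep
    return lev(len(s), len(t))


def third_similarity(pinyin_s: list[str], pinyin_t: list[str]) -> float:
    """Similarity of two pinyin sequences, one syllable per unit."""
    longest = max(len(pinyin_s), len(pinyin_t))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1.0 - edit_distance(pinyin_s, pinyin_t) / longest


to_match = "ni ming zi ao kou".split()
target = "ni tai zhi niu".split()
print(edit_distance(to_match, target), third_similarity(to_match, target))
```

In this example the shared character "拗" contributes nothing to the third similarity, because its two readings "ao" and "niu" differ.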

In one embodiment, acquiring the character sequence to be matched and the target character sequence includes: receiving the character sequence to be matched sent by the terminal; and acquiring a plurality of target character sequences from the database according to the character sequence to be matched. After the text similarity between the character sequence to be matched and the target character sequence is calculated according to the first similarity and the second similarity, the method further includes: querying the related resource corresponding to a target character sequence whose text similarity is greater than a preset similarity threshold; and sending the related resource to the terminal.

The text similarity calculation is performed between the character sequence to be matched and each of the plurality of target character sequences, so the target character sequence with the highest text similarity to the character sequence to be matched can also be determined. A target character sequence can be associated with related resources such as text, images, links, audio, and video. For example, the character sequence to be matched may be a character sequence that the user sends through the terminal to ask a question, and the target character sequence may be a character sequence associated with the corresponding answer text. After the target character sequence with the highest text similarity to the character sequence to be matched is determined, the answer text associated with that target character sequence may be sent to the terminal.
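The lookup of a related resource can be sketched as follows. The template list, the resource strings, the threshold value, and the `overlap_similarity` stand-in are all hypothetical; in practice the `similarity` argument would be the combined text similarity described above.

```python
from typing import Callable, Optional, Sequence


def overlap_similarity(s_words: Sequence[str], t_words: Sequence[str]) -> float:
    """Stand-in similarity (shared words / larger set), used only for the demo."""
    s_set, t_set = set(s_words), set(t_words)
    if not s_set or not t_set:
        return 0.0
    return len(s_set & t_set) / max(len(s_set), len(t_set))


def find_resource(
    query_words: Sequence[str],
    templates: Sequence[tuple[Sequence[str], str]],
    similarity: Callable[[Sequence[str], Sequence[str]], float] = overlap_similarity,
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the resource of the most similar target sequence whose
    similarity exceeds the threshold, or None if no target qualifies."""
    best_score, best_resource = threshold, None
    for target_words, resource in templates:
        score = similarity(query_words, target_words)
        if score > best_score:
            best_score, best_resource = score, resource
    return best_resource


TEMPLATES = [
    (["3", "set", "computer", "price"], "answer text about computer prices"),
    (["return", "policy"], "answer text about returns"),
]
print(find_resource(["3", "set", "computer", "how much", "money"], TEMPLATES))
```

Passing the similarity function as a parameter keeps the lookup independent of how the text similarity is computed.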

In one embodiment, as shown in FIG. 4, another text similarity calculation method is provided, the method comprising the following steps:

Step 402: Acquire a sequence of characters to be matched and a sequence of target characters.

The sequence of characters to be matched and the sequence of target characters may be a combination of one or more of letters, Arabic numerals, Chinese characters, and punctuation marks.

For example, the character sequence to be matched may be a character sequence that the user sends through the terminal to ask a question, such as "How much is the cost of 3 computers?". The target character sequence can be the character sequence of a question template pre-stored in the database, such as "3 computer prices?". After the character sequence to be matched sent by the terminal is received, the target character sequences preset in the database can be searched.

Step 404: Delete the irrelevant characters contained in the character sequence to be matched and the irrelevant characters contained in the target character sequence.

Irrelevant characters include, but are not limited to, punctuation marks and stop characters. A stop character library can be preset for filtering stop characters. Chinese stop characters include, but are not limited to, modal particles, conjunctions, and adversative words, such as "ah", "bar", "哎", "of", "further", "but", and the like. When a stop character is detected, the stop character contained in the character sequence to be matched or in the target character sequence can be deleted.

For example, the sequence of characters to be matched is "How much is the 3 computers?". It contains the punctuation mark "?" and the stop character "is". Deleting these irrelevant characters yields the character sequence that step 406 segments into the words "3", "set", "computer", "how much", and "money".
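Step 404 can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the stop character set `STOP_CHARACTERS`, the function name, and the Chinese sample string are assumptions chosen to mirror the example above.

```python
import unicodedata

# Hypothetical stop-character library; a real system would load a curated list.
STOP_CHARACTERS = {"啊", "吧", "哎", "的", "是"}


def remove_irrelevant_characters(text, stop_characters=STOP_CHARACTERS):
    """Delete punctuation marks and stop characters from a character sequence."""
    kept = []
    for ch in text:
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P"):
            continue
        if ch in stop_characters:
            continue
        kept.append(ch)
    return "".join(kept)


# Assumed Chinese rendering of the example question.
print(remove_irrelevant_characters("3台电脑是多少钱？"))
```

The same function would be applied to both the sequence of characters to be matched and the sequence of target characters.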

Step 406: Perform word segmentation on the to-be-matched character sequence and the target character sequence after deleting the irrelevant characters, to obtain a corresponding to-be-matched word sequence and a target word sequence.

Word segmentation refers to the process of converting a sequence of characters into a sequence of words according to a certain rule. The to-be-matched word sequence is the ordered sequence of to-be-matched words, and the target word sequence is the ordered sequence of target words.

For example, segmenting the character sequence to be matched after the irrelevant characters are deleted may yield the to-be-matched word sequence “3|set|computer|how much|money”, where “|” is a word separator that distinguishes the different words in the to-be-matched word sequence. The to-be-matched word sequence includes five to-be-matched words: “3”, “set”, “computer”, “how much”, and “money”.
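The text does not fix a particular segmentation rule. As one possible rule, the sketch below uses dictionary-based forward maximum matching, which is a technique substituted here for illustration only; the vocabulary, the `max_word_len` parameter, and the Chinese sample string are assumptions.

```python
def segment(text, vocabulary, max_word_len=4):
    """Split a character sequence into words by forward maximum matching."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical vocabulary containing "computer" and "how much".
vocabulary = {"电脑", "多少"}
print("|".join(segment("3台电脑多少钱", vocabulary)))
```

With this assumed vocabulary the sample splits into five words, matching the shape of the example above.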

Step 408: Apply the edit distance formula to the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain an edit distance between the to-be-matched word sequence and the target word sequence.

The edit distance formula calculates the minimum number of edit operations required to convert one word sequence into another. This minimum number of edit operations is the edit distance. The permitted edit operations are replacing one word with another, inserting a word, and deleting a word.

For example, the to-be-matched word sequence is “3|set|computer|how much|money”, and the target word sequence is “3|set|computer|price”, where the two words rendered as "computer" are different but synonymous words. Converting the to-be-matched word sequence into the target word sequence requires three operations: replacing the to-be-matched word "computer" with its synonym in the target word sequence, deleting "how much", and replacing "money" with "price". A synonym vocabulary may also be preset. Because the two words for "computer" are equivalent, the replacement of one by the other need not be counted in the edit distance.
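A word-level edit distance with an optional synonym vocabulary can be sketched as below. The function name and the representation of synonyms as a set of unordered pairs are assumptions, and "PC" is a hypothetical stand-in for the synonym of "computer" in the example.

```python
def word_edit_distance(source, target, synonyms=frozenset()):
    """Minimum number of word replacements, insertions, and deletions."""

    def same(a, b):
        # Synonym pairs are stored as unordered frozensets.
        return a == b or frozenset((a, b)) in synonyms

    previous = list(range(len(target) + 1))
    for i, s_word in enumerate(source, start=1):
        current = [i]
        for j, t_word in enumerate(target, start=1):
            cost = 0 if same(s_word, t_word) else 1
            current.append(min(
                previous[j] + 1,         # delete s_word
                current[j - 1] + 1,      # insert t_word
                previous[j - 1] + cost,  # replace, free for equal or synonymous words
            ))
        previous = current
    return previous[-1]


to_match = ["3", "set", "computer", "how much", "money"]
target = ["3", "set", "PC", "price"]
print(word_edit_distance(to_match, target))
print(word_edit_distance(to_match, target, {frozenset(("computer", "PC"))}))
```

Without the synonym vocabulary the distance is 3, as in the example; with it, the synonym replacement is not counted and the distance drops to 2.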

Step 410: Acquire a first quantity of the to-be-matched words included in the sequence of the to-be-matched words, and a second quantity of the target words included in the target word sequence.

Step 412: Perform calculation according to the edit distance, the first quantity, and the second quantity to obtain a first similarity.
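This passage does not restate how the edit distance and the two quantities are combined. The sketch below assumes one common normalization, dividing the edit distance by the larger of the two quantities and subtracting from 1; this formula is an assumption for illustration and may differ from the one intended.

```python
def first_similarity(edit_distance, first_quantity, second_quantity):
    """Assumed normalization: 1 - distance / max(first, second)."""
    longest = max(first_quantity, second_quantity)
    if longest == 0:
        return 1.0  # two empty word sequences are treated as identical
    return 1.0 - edit_distance / longest


# Edit distance 3 between a 5-word sequence and a 4-word sequence.
print(first_similarity(3, 5, 4))
```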

Step 414: Extract all the to-be-matched words to form a set of to-be-matched words, and extract all the target words to form a target word set.

For example, if the to-be-matched word sequence is “3|set|computer|how much|money”, all the to-be-matched words contained in it can be extracted to form the to-be-matched word set {"3", "set", "computer", "how much", "money"}. The five to-be-matched words “3”, “set”, “computer”, “how much”, and “money” are on an equal footing in the set and have no order.

Step 416: Match the to-be-matched word set and the target word set, and count the matching number of the to-be-matched word and the target word.

A word tree can be constructed from the hypernym-hyponym (upper-lower) hierarchy of words in a semantic dictionary, and a second sub-similarity between a to-be-matched word and a target word is calculated from the path distance between the two words in the word tree. A to-be-matched word whose second sub-similarity with a target word is greater than a preset sub-similarity threshold is determined to match that target word, and the number of matches between the to-be-matched words of the to-be-matched word set and the target words of the target word set is counted.

Step 418: Count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set.
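The word-tree matching can be sketched as follows. The tiny tree in `PARENT`, the mapping from path distance to similarity (1 / (1 + distance)), and the threshold values are all assumptions; the text only states that the sub-similarity is derived from the path distance.

```python
# Hypothetical hypernym tree, stored as child -> parent.
PARENT = {
    "PC": "computer",
    "computer": "device",
    "device": "entity",
    "price": "attribute",
    "money": "attribute",
    "attribute": "entity",
}


def path_to_root(word, parent=PARENT):
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def path_distance(a, b, parent=PARENT):
    """Edges between a and b through their lowest common ancestor, or None."""
    depth_from_a = {node: d for d, node in enumerate(path_to_root(a, parent))}
    for depth_from_b, node in enumerate(path_to_root(b, parent)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    return None  # the words share no ancestor


def sub_similarity(a, b, parent=PARENT):
    distance = path_distance(a, b, parent)
    # Assumed mapping: closer words score higher; unrelated words score 0.
    return 0.0 if distance is None else 1.0 / (1.0 + distance)


def count_matches(words_to_match, target_words, threshold):
    """Count to-be-matched words whose best sub-similarity exceeds the threshold."""
    return sum(
        1 for w in words_to_match
        if any(sub_similarity(w, t) > threshold for t in target_words)
    )


print(count_matches({"computer", "how much", "money"}, {"PC", "price"}, 0.4))
```

With the assumed tree, only "computer" clears a threshold of 0.4; lowering the threshold to 0.3 also lets "money" match "price".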

Step 420: Calculate according to the number of matches, the number of words to be matched, and the number of target words, to obtain a second similarity.
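How the three counts are combined is not spelled out in this passage. The sketch below assumes a Dice-style ratio, twice the number of matches divided by the total number of words in both sets; this choice is an assumption made only to show the shape of the calculation.

```python
def second_similarity_from_counts(match_count, words_to_match_count, target_words_count):
    """Assumed Dice-style ratio: 2 * matches / (|to-be-matched set| + |target set|)."""
    total = words_to_match_count + target_words_count
    if total == 0:
        return 1.0  # two empty sets are treated as identical
    return 2.0 * match_count / total


# One match between a 3-word set and a 2-word set.
print(second_similarity_from_counts(1, 3, 2))
```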

For example, the set of words to be matched is {"computer", "how much", "money"} and the target word set is {"computer", "price"}. The following second sub-similarities can be calculated: sim₁₁ between "computer" and "computer", sim₁₂ between "computer" and "price", sim₂₁ between "how much" and "computer", sim₂₂ between "how much" and "price", sim₃₁ between "money" and "computer", and sim₃₂ between "money" and "price". For each to-be-matched word, the maximum of its second sub-similarities with the target words of the target word set is taken, and these maxima are multiplied together to obtain the second similarity sim2. For example, if the maximum second sub-similarities corresponding to "computer", "how much", and "money" are sim₁₁, sim₂₂, and sim₃₂ respectively, the second similarity is sim2 = sim₁₁ × sim₂₂ × sim₃₂.
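The multiplication in this example can be sketched as below. The matrix layout (one row per to-be-matched word, one column per target word) and the numeric sub-similarity values are assumptions for illustration.

```python
from math import prod


def second_similarity(sub_similarities):
    """Multiply, over all to-be-matched words, the row maximum of sim_ij."""
    return prod(max(row) for row in sub_similarities)


# Rows: "computer", "how much", "money"; columns: "computer", "price".
# The values are hypothetical.
sims = [
    [0.9, 0.1],  # sim11, sim12
    [0.2, 0.5],  # sim21, sim22
    [0.1, 0.8],  # sim31, sim32
]
print(second_similarity(sims))  # sim11 * sim22 * sim32
```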

Step 422: Perform calculation according to the first similarity and the second similarity to obtain a text similarity of the character sequence to be matched and the target character sequence.

For example, after the first similarity sim1 and the second similarity sim2 are calculated, they may be multiplied by a corresponding first weight w1 and second weight w2 respectively and summed to obtain the text similarity sim(S,T)=sim1×w1+sim2×w2.

It should be understood that although the steps in the flowcharts of FIGS. 2 and 4 are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2 and 4 may comprise a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. These sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
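The weighted combination can be written directly from the formula. The default weights below are hypothetical; the text does not state their values or whether they must sum to 1.

```python
def text_similarity(sim1, sim2, w1=0.5, w2=0.5):
    """sim(S, T) = sim1 * w1 + sim2 * w2."""
    return sim1 * w1 + sim2 * w2


# Hypothetical similarities and weights.
print(text_similarity(0.4, 0.36, w1=0.6, w2=0.4))
```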

In one embodiment, as shown in FIG. 5, a text similarity calculation device 500 is provided. The device includes: a character sequence acquisition module 502, configured to acquire a character sequence to be matched and a target character sequence; a word sequence acquisition module 504, configured to preprocess the character sequence to be matched and the target character sequence respectively, to obtain a corresponding to-be-matched word sequence and target word sequence; a first similarity calculation module 506, configured to perform calculation, by means of the first similarity algorithm, on the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain a first similarity; a word set forming module 508, configured to extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set; a second similarity calculation module 510, configured to perform calculation on the to-be-matched word set and the target word set by means of the second similarity algorithm, to obtain a second similarity; and a text similarity calculation module 512, configured to perform calculation according to the first similarity and the second similarity, to obtain the text similarity of the character sequence to be matched and the target character sequence.

In one embodiment, the word sequence acquisition module 504 is further configured to delete the irrelevant characters included in the character sequence to be matched and the irrelevant characters included in the target character sequence, and to perform word segmentation respectively on the character sequence to be matched and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.

In one embodiment, the irrelevant characters include stop characters and identical characters. The word sequence acquisition module 504 is further configured to: delete the stop characters included in the character sequence to be matched and the stop characters included in the target character sequence; determine whether identical characters exist in the character sequence to be matched and the target character sequence after the stop characters are deleted, an identical character being a character that is the same and at the same position in both sequences after the stop characters are deleted; and if so, delete the identical characters from the character sequence to be matched and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
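Deleting characters that are the same at the same position in both sequences can be sketched as below. The function name and the Chinese sample strings are assumptions; positions beyond the end of the shorter sequence are never treated as identical.

```python
def remove_same_position_characters(to_match, target):
    """Delete characters that are equal at the same index in both sequences."""
    shared = {i for i, (a, b) in enumerate(zip(to_match, target)) if a == b}

    def strip(text):
        return "".join(ch for i, ch in enumerate(text) if i not in shared)

    return strip(to_match), strip(target)


# Assumed inputs, already free of punctuation and stop characters.
print(remove_same_position_characters("3台电脑多少钱", "3台电脑价格"))
```

In this sample the first four characters coincide, so only the differing tails remain for the later steps.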

In one embodiment, the first similarity calculation module 506 is further configured to: apply an edit distance formula to the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain an edit distance between the to-be-matched word sequence and the target word sequence; acquire a first quantity of the to-be-matched words included in the to-be-matched word sequence and a second quantity of the target words included in the target word sequence; and perform calculation according to the edit distance, the first quantity, and the second quantity to obtain the first similarity.

In one embodiment, the second similarity calculation module 510 is further configured to: match the to-be-matched word set against the target word set and count the number of matches between the to-be-matched words and the target words; count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculate the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.

In one embodiment, the device further includes a third similarity calculation module 514, configured to: acquire a to-be-matched pinyin sequence corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence; and perform calculation, by means of the first similarity algorithm, on the to-be-matched pinyin included in the to-be-matched pinyin sequence and the target pinyin included in the target pinyin sequence, to obtain a third similarity. In this case, the text similarity calculation module 512 is further configured to perform calculation according to the first similarity, the second similarity, and the third similarity, to obtain the text similarity of the character sequence to be matched and the target character sequence.

In one embodiment, the character sequence acquisition module 502 is further configured to receive a character sequence to be matched sent by the terminal, and acquire a plurality of target character sequences from the database according to the character sequence to be matched. The device further includes a related resource sending module, configured to query the related resources corresponding to a target character sequence whose text similarity is greater than a preset similarity threshold, and send the related resources to the terminal.
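The pinyin-based third similarity can be sketched as follows. The character-to-pinyin table `PINYIN`, the normalization of the edit distance, and the three-weight linear combination are assumptions; the text only states that the first similarity algorithm is applied to the pinyin sequences and that all three similarities enter the final calculation.

```python
# Hypothetical character-to-pinyin table; a real system would use a full lexicon.
PINYIN = {"多": "duo", "少": "shao", "钱": "qian", "前": "qian", "价": "jia"}


def to_pinyin_sequence(text, table=PINYIN):
    """Map each character to its pinyin, leaving unknown characters unchanged."""
    return [table.get(ch, ch) for ch in text]


def edit_distance(a, b):
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # replace, free when equal
            ))
        previous = current
    return previous[-1]


def third_similarity(to_match, target):
    p, q = to_pinyin_sequence(to_match), to_pinyin_sequence(target)
    longest = max(len(p), len(q))
    # Assumed normalization: 1 - distance / length of the longer sequence.
    return 1.0 if longest == 0 else 1.0 - edit_distance(p, q) / longest


def text_similarity(sim1, sim2, sim3, weights=(0.4, 0.3, 0.3)):
    """Assumed linear combination with hypothetical weights."""
    w1, w2, w3 = weights
    return sim1 * w1 + sim2 * w2 + sim3 * w3


# The last characters differ but share the pinyin "qian".
print(third_similarity("多少钱", "多少前"))
```

Comparing pinyin rather than characters lets homophone substitutions, which are common in typed Chinese, score as identical.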
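The threshold query can be sketched as below. The function name, the dictionary that maps each target character sequence to its related resource, the `similarity` callback, and the default threshold are assumptions for illustration.

```python
def select_related_resources(to_match, candidates, similarity, threshold=0.8):
    """Return resources whose target sequence scores above the threshold, best first.

    candidates maps each target character sequence to its related resource.
    similarity is any function (to_match, target) -> float.
    """
    scored = [
        (similarity(to_match, target), resource)
        for target, resource in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [resource for score, resource in scored if score > threshold]


# Hypothetical question templates and a trivial stand-in similarity function.
faq = {"3 computer prices": "answer A", "return policy": "answer B"}
exact = lambda a, b: 1.0 if a == b else 0.0
print(select_related_resources("3 computer prices", faq, exact))
```

In practice the `similarity` argument would be the text similarity calculated from the first and second similarities.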

For the specific definition of the text similarity calculation device, reference may be made to the definition of the text similarity calculation method above, and details are not described herein again. Each module in the above text similarity calculation device may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in or independent of the processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can invoke and perform the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-transitory computer readable storage medium and an internal memory. The non-transitory computer readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-transitory computer readable storage medium. The database of the computer device is used to store target character sequences. The network interface of the computer device is used to communicate with an external terminal via a network connection. The computer readable instructions are executed by the processor to implement a text similarity calculation method.

It will be understood by those skilled in the art that the structure shown in FIG. 6 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine some components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.

In one embodiment, one or more non-transitory computer readable storage media storing computer readable instructions are provided. The computer readable instructions, when executed by one or more processors, cause the one or more processors to implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.

One of ordinary skill in the art can understand that all or part of the processes of the above embodiments can be completed by computer readable instructions, which can be stored in a non-volatile computer readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to a memory, storage, database, or other medium used in the various embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope described in this specification.

The above-mentioned embodiments are merely illustrative of several embodiments of the present application, and the description thereof is more specific and detailed, but is not to be construed as limiting the scope of the invention. It should be noted that a number of variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the present application. Therefore, the scope of the invention should be determined by the appended claims.

Claims

A method for calculating text similarity, comprising:

Obtaining a sequence of characters to be matched and a sequence of target characters;

Performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence;

Calculating, by using a first similarity algorithm, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence to obtain a first similarity;

Extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set;

Calculating the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity; and

Calculating according to the first similarity and the second similarity to obtain a text similarity of the to-be-matched character sequence and the target character sequence.
The method according to claim 1, wherein the performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence comprises:

Deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and

Performing word segmentation respectively on the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
The method according to claim 2, wherein the irrelevant characters comprise stop characters and identical characters, and the deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence comprises:

Deleting the stop characters included in the to-be-matched character sequence and the stop characters included in the target character sequence;

Determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, an identical character being a character that is the same and at the same position in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and

If so, deleting the identical characters included in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
The method according to claim 1, wherein the calculating, by using a first similarity algorithm, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence to obtain a first similarity comprises:

Calculating, by an edit distance formula, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain an edit distance between the to-be-matched word sequence and the target word sequence;

Obtaining a first quantity of the to-be-matched words included in the to-be-matched word sequence, and a second quantity of the target words included in the target word sequence; and

Calculating according to the edit distance, the first quantity, and the second quantity, to obtain the first similarity.
The method according to claim 1, wherein the calculating the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity comprises:

Matching the to-be-matched word set and the target word set, and counting the number of matches between the to-be-matched words and the target words;

Counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and

Calculating the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
The method according to any one of claims 1 to 5, further comprising, after the obtaining a sequence of characters to be matched and a sequence of target characters:

Obtaining a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and

Calculating, by using the first similarity algorithm, the to-be-matched pinyin included in the to-be-matched pinyin sequence and the target pinyin included in the target pinyin sequence to obtain a third similarity;

Wherein the calculating according to the first similarity and the second similarity to obtain a text similarity of the to-be-matched character sequence and the target character sequence comprises:

Calculating according to the first similarity, the second similarity, and the third similarity, to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
The method according to any one of claims 1 to 5, wherein the obtaining the sequence of characters to be matched and the sequence of target characters comprises:

Receiving a sequence of characters to be matched sent by the terminal;

Obtaining a plurality of target character sequences from the database according to the sequence of characters to be matched;

After the calculating, according to the first similarity and the second similarity, the text similarity between the to-be-matched character sequence and the target character sequence, the method further includes:

Querying related resources corresponding to the target character sequence whose text similarity is greater than the preset similarity threshold; and

Transmitting the related resource to the terminal.
A text similarity calculation device, comprising:

a character sequence obtaining module, configured to obtain a sequence of characters to be matched and a sequence of target characters;

a word sequence obtaining module, configured to preprocess the to-be-matched character sequence and the target character sequence respectively, to obtain a corresponding to-be-matched word sequence and a target word sequence;

a first similarity calculation module, configured to calculate, by using a first similarity algorithm, a to-be-matched word included in the to-be-matched word sequence and a target word included in the target word sequence to obtain a first similarity;

a word set forming module, configured to extract all the to-be-matched words to form a set of to-be-matched words, and extract all target words to form a target word set;

a second similarity calculation module, configured to calculate the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity;

a text similarity calculation module, configured to calculate, according to the first similarity and the second similarity, a text similarity between the to-be-matched character sequence and the target character sequence.
A computer device, comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:

Obtaining a sequence of characters to be matched and a sequence of target characters;

Performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence;

Calculating, by using a first similarity algorithm, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence to obtain a first similarity;

Extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set;

Calculating the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity; and

Calculating according to the first similarity and the second similarity to obtain a text similarity of the to-be-matched character sequence and the target character sequence.
The computer device according to claim 9, wherein the step of performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence comprises performing the following steps:

Deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and

Performing word segmentation respectively on the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
The computer device according to claim 10, wherein the irrelevant characters comprise stop characters and identical characters, and the step of deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence comprises performing the following steps:

Deleting the stop characters included in the to-be-matched character sequence and the stop characters included in the target character sequence;

Determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, an identical character being a character that is the same and at the same position in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and

If so, deleting the identical characters included in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
The computer device according to claim 9, wherein the step of calculating, by using a first similarity algorithm, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence to obtain a first similarity comprises performing the following steps:

Calculating, by an edit distance formula, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain an edit distance between the to-be-matched word sequence and the target word sequence;

Obtaining a first quantity of the to-be-matched words included in the to-be-matched word sequence, and a second quantity of the target words included in the target word sequence; and

Calculating according to the edit distance, the first quantity, and the second quantity, to obtain the first similarity.
The computer device according to claim 9, wherein the step of calculating the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity comprises performing the following steps:

Matching the to-be-matched word set and the target word set, and counting the number of matches between the to-be-matched words and the target words;

Counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and

Calculating the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
The computer device according to any one of claims 9 to 13, wherein, after the step of obtaining a sequence of characters to be matched and a sequence of target characters, the one or more processors further perform the following steps:

Obtaining a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and

Calculating, by using the first similarity algorithm, the to-be-matched pinyin included in the to-be-matched pinyin sequence and the target pinyin included in the target pinyin sequence to obtain a third similarity;

Wherein the step of calculating according to the first similarity and the second similarity to obtain a text similarity of the to-be-matched character sequence and the target character sequence comprises:

Calculating according to the first similarity, the second similarity, and the third similarity, to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
One or more non-transitory computer readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

Obtaining a sequence of characters to be matched and a sequence of target characters;

Performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence;

Calculating, by using a first similarity algorithm, the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence to obtain a first similarity;

Extracting all the words to be matched to form a set of words to be matched, and extracting all the target words to form a target word set;

Calculating the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity; and

Calculating according to the first similarity and the second similarity to obtain a text similarity of the to-be-matched character sequence and the target character sequence.
The storage medium according to claim 15, wherein the step of performing pre-processing on the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a target word sequence comprises performing the following steps:

Deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and

Performing word segmentation respectively on the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
The storage medium according to claim 16, wherein said irrelevant character comprises a deactivated character and the same character; said unrelated character contained in said sequence of characters to be matched is irrelevant to said target character sequence The steps to delete characters include performing the following steps:

Deleting the deactivated characters contained in the to-be-matched character sequence and the deactivated characters contained in the target character sequence;

Determining whether a same character exists in the to-be-matched character sequence and the target character sequence after the deactivated characters are deleted, the same character being an identical character located at the same position in both the to-be-matched character sequence and the target character sequence after the deactivated characters are deleted; and

If so, deleting the same character from both the to-be-matched character sequence and the target character sequence from which the deactivated characters have been deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
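The two deletion steps above can be illustrated with a minimal Python sketch. The deactivated-character set `DEACTIVATED_CHARS` and the function names are hypothetical, since the claims do not enumerate them, and word segmentation is approximated here by treating each remaining character as one word; a real system would presumably use a proper word segmenter.

```python
from typing import List, Tuple

# Hypothetical deactivated (stop) characters; the claims do not list them.
DEACTIVATED_CHARS = set("的了吗呢啊，。？！ ")


def delete_deactivated(text: str) -> str:
    """Remove every deactivated character from the text."""
    return "".join(ch for ch in text if ch not in DEACTIVATED_CHARS)


def delete_same_position(a: str, b: str) -> Tuple[str, str]:
    """Remove characters that are identical at the same index in both strings."""
    shared = min(len(a), len(b))
    same = {i for i in range(shared) if a[i] == b[i]}
    kept_a = "".join(ch for i, ch in enumerate(a) if i not in same)
    kept_b = "".join(ch for i, ch in enumerate(b) if i not in same)
    return kept_a, kept_b


def preprocess(to_match: str, target: str) -> Tuple[List[str], List[str]]:
    """Delete deactivated characters, then same-position characters."""
    a = delete_deactivated(to_match)
    b = delete_deactivated(target)
    a, b = delete_same_position(a, b)
    # Assumption: one character per word stands in for real segmentation.
    return list(a), list(b)
```

For example, `preprocess("我的车险怎么买", "我的寿险怎么买")` first drops "的" from both inputs and then drops every character that coincides by position, leaving only the characters that differ.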
The storage medium according to claim 15, wherein the step of performing calculation, by means of a first similarity algorithm, on the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity comprises performing the following steps:

Performing calculation, by means of an edit distance formula, on the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain an edit distance between the to-be-matched word sequence and the target word sequence;

Obtaining a first quantity of to-be-matched words contained in the to-be-matched word sequence and a second quantity of target words contained in the target word sequence; and

Calculating a first similarity according to the edit distance, the first quantity, and the second quantity.
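As an illustration of these steps, the following Python sketch computes a word-level Levenshtein edit distance and derives a first similarity from it. The normalisation used, one minus the edit distance divided by the larger of the first quantity and the second quantity, is an assumption; the claim only states that the three values are combined. The function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def first_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalisation: 1 - distance / max(first, second quantity)."""
    first_quantity, second_quantity = len(a), len(b)
    longest = max(first_quantity, second_quantity)
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

With the word sequences `["how", "buy", "car", "insurance"]` and `["how", "buy", "life", "insurance"]`, one substitution is needed, so the sketch yields an edit distance of 1 and a first similarity of 0.75.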
The storage medium according to claim 15, wherein the step of performing calculation on the to-be-matched word set and the target word set by means of a second similarity algorithm to obtain a second similarity comprises performing the following steps:

Matching the to-be-matched word set against the target word set, and counting the number of matches between the to-be-matched words and the target words;

Counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and

Obtaining the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
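One way to realise these steps is a Jaccard-style ratio, sketched below in Python. The specific formula, matches divided by the sum of the two set sizes minus the matches, is an assumption, as the claim names only the three inputs; `second_similarity` is a hypothetical name.

```python
from typing import Iterable


def second_similarity(words_to_match: Iterable[str],
                      target_words: Iterable[str]) -> float:
    """Assumed Jaccard-style ratio over the two word sets."""
    set_a, set_b = set(words_to_match), set(target_words)
    matches = len(set_a & set_b)                # words present in both sets
    union = len(set_a) + len(set_b) - matches   # size of the combined set
    if union == 0:
        return 1.0  # two empty sets are treated as identical
    return matches / union
```

For the sets built from `["how", "buy", "car", "insurance"]` and `["how", "buy", "life", "insurance"]`, three words match and each set holds four words, giving 3 / (4 + 4 - 3) = 0.6. Unlike the edit-distance measure, this value ignores word order.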
The storage medium according to any one of claims 15 to 19, further comprising, after the step of acquiring the sequence of characters to be matched and the sequence of target characters, performing the following steps:

Obtaining a pinyin sequence to be matched corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence;

Performing calculation, by means of the first similarity algorithm, on the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence to obtain a third similarity;

Wherein the step of calculating, according to the first similarity and the second similarity, the text similarity between the to-be-matched character sequence and the target character sequence comprises:

Calculating, according to the first similarity, the second similarity, and the third similarity, the text similarity between the to-be-matched character sequence and the target character sequence.
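The pinyin-based third similarity and the final combination might be sketched as follows. Everything specific here is an assumption made for illustration: the tiny `PINYIN_TABLE` stands in for a real character-to-pinyin converter, the edit-distance normalisation is assumed as one minus distance over the longer length, and the weighted average with its default weights is only one plausible way to combine the three similarities, since the claim does not state the combination rule.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lookup table; a real system would use a full pinyin converter.
PINYIN_TABLE: Dict[str, str] = {
    "车": "che", "撤": "che", "险": "xian", "寿": "shou",
}


def to_pinyin(chars: str) -> List[str]:
    """Map each character to pinyin, keeping unknown characters unchanged."""
    return [PINYIN_TABLE.get(ch, ch) for ch in chars]


def sequence_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit-distance similarity, assumed as 1 - distance / longer length."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def text_similarity(first: float, second: float, third: float,
                    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)
                    ) -> float:
    """Assumed combination rule: weighted average of the three similarities."""
    w1, w2, w3 = weights
    return (w1 * first + w2 * second + w3 * third) / (w1 + w2 + w3)
```

In this sketch, "车险" and "撤险" differ as characters but map to the same pinyin sequence `["che", "xian"]`, so their third similarity is 1.0. This shows how a pinyin comparison can compensate for homophone errors, for instance in text produced by speech recognition.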