
WO2019136993A1 - Text similarity calculation method and device, computer apparatus, and storage medium - Google Patents



Publication number
WO2019136993A1
WO2019136993A1 PCT/CN2018/099994 CN2018099994W WO2019136993A1 WO 2019136993 A1 WO2019136993 A1 WO 2019136993A1 CN 2018099994 W CN2018099994 W CN 2018099994W WO 2019136993 A1 WO2019136993 A1 WO 2019136993A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
matched
target
word
similarity
Prior art date
Application number
PCT/CN2018/099994
Other languages
French (fr)
Chinese (zh)
Inventor
艾明
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2019136993A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application relates to a text similarity calculation method, apparatus, computer device and storage medium.
  • The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to convert one string into another. Permitted edit operations are replacing one character with another, inserting a character, and deleting a character. The larger the edit distance, the lower the similarity between the strings.
  • The traditional edit distance algorithm usually works in units of single characters. The edit distance it calculates between character sequences reflects only the surface form of the text, so the resulting text similarity has low accuracy.
  • A text similarity calculation method, apparatus, computer device, and storage medium capable of improving the accuracy of text similarity calculation are provided.
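  • A text similarity calculation method includes: acquiring a to-be-matched character sequence and a target character sequence; preprocessing the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by a first similarity algorithm to obtain a first similarity; extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence.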
  • A text similarity calculation device includes: a character sequence acquisition module, configured to acquire a to-be-matched character sequence and a target character sequence; a word sequence acquisition module, configured to preprocess the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; a first similarity calculation module, configured to calculate the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by a first similarity algorithm to obtain a first similarity; a word set forming module, configured to extract all the to-be-matched words to form a to-be-matched word set and to extract all the target words to form a target word set; a second similarity calculation module, configured to calculate the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and a text similarity calculation module, configured to calculate according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
  • A computer device comprises a memory and one or more processors, the memory storing computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring a to-be-matched character sequence and a target character sequence; preprocessing the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by a first similarity algorithm to obtain a first similarity; extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
  • One or more non-transitory computer readable storage media store computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a to-be-matched character sequence and a target character sequence; preprocessing the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by a first similarity algorithm to obtain a first similarity; extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
  • FIG. 1 is an application scenario diagram of a text similarity calculation method in accordance with one or more embodiments.
  • FIG. 2 is a flow diagram of a text similarity calculation method in accordance with one or more embodiments.
  • FIG. 3A is a schematic diagram of a word tree derived from a physical substance in accordance with one or more embodiments.
  • FIG. 3B is a schematic diagram of a word tree derived from a virtual event in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of a text similarity calculation method in accordance with one or more other embodiments.
  • FIG. 5 is a block diagram showing the structure of a text similarity calculation apparatus according to one or more embodiments.
  • FIG. 6 is a diagram showing the internal structure of a computer device in accordance with one or more embodiments.
  • The first similarity may be referred to as a second similarity without departing from the scope of the present application, and similarly, the second similarity may be referred to as a first similarity. Both the first similarity and the second similarity are similarities, but they are not the same similarity.
  • The terminal 102 communicates with the server 104 over a network.
  • the server 104 can receive a sequence of characters to be matched sent by the terminal 102.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablets, and portable wearable devices, and the server 104 can be implemented with a stand-alone server or a server cluster composed of a plurality of servers.
  • a text similarity calculation method is provided, which is applied to the server 104 in FIG. 1 as an example, and includes the following steps:
  • Step 202 Acquire a sequence of characters to be matched and a sequence of target characters.
  • the sequence of characters to be matched refers to a sequence of characters that need to be matched.
  • the target character sequence refers to a preset sequence of characters in the database for matching the sequence of characters to be matched.
  • a sequence of characters refers to a sequence formed in order of characters, and the characters may be at least one of letters, Arabic numerals, Chinese characters, and punctuation marks.
  • the sequence of characters includes, but is not limited to, a combination of one or more of letters, Arabic numerals, Chinese characters, and punctuation.
  • Step 204 Preprocess the to-be-matched character sequence and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence.
  • Preprocessing refers to operations such as converting, reducing, and splitting applied to at least one of the to-be-matched character sequence and the target character sequence.
  • the sequence of words to be matched refers to a sequence of words obtained by preprocessing the sequence of characters to be matched.
  • the target word sequence refers to a sequence of words obtained by preprocessing the target character sequence.
  • a word sequence refers to a sequence formed in order of words.
  • the sequence of to-be-matched words refers to a sequence formed in order of the words to be matched.
  • the target word sequence refers to a sequence formed in order in units of target words.
  • the to-be-matched word and the target word may be simple words composed of one or more characters, or may be composite words composed of two or more simple words.
  • Step 204 includes: deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and performing word segmentation on the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  • Irrelevant characters are characters that do not affect the calculation of text similarity, including but not limited to punctuation marks and stop characters.
  • Word segmentation refers to the process of converting a character sequence into a word sequence according to certain rules. A word segmentation method based on string matching, a method based on understanding, or a method based on statistics may be used to segment the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted.
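The preprocessing described above can be sketched as follows. This is only a minimal illustration, assuming a string-matching segmenter (forward maximum matching); the stop characters, the dictionary, and the names `delete_irrelevant`, `segment`, and `preprocess` are hypothetical and do not come from the application.

```python
import string

# Hypothetical stop characters and dictionary, chosen only for illustration.
STOP_CHARS = {"的", "啊", "吧"}
DICTIONARY = {"电脑", "多少", "钱", "台", "3"}
MAX_WORD_LEN = 4  # assumed upper bound on the length of a dictionary word


def delete_irrelevant(chars: str) -> str:
    """Drop punctuation marks, whitespace, and stop characters."""
    punctuation = set(string.punctuation) | set("，。？！、；：")
    return "".join(
        c for c in chars
        if c not in punctuation and c not in STOP_CHARS and not c.isspace()
    )


def segment(chars: str) -> list[str]:
    """Forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(chars):
        for size in range(min(MAX_WORD_LEN, len(chars) - i), 0, -1):
            piece = chars[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in DICTIONARY:
                words.append(piece)
                i += size
                break
    return words


def preprocess(chars: str) -> list[str]:
    """Character sequence -> word sequence."""
    return segment(delete_irrelevant(chars))
```

A statistical or understanding-based segmenter could be substituted for `segment` without changing the rest of the pipeline.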
  • Step 206 Calculate the to-be-matched word included in the to-be-matched word sequence and the target word included in the target word sequence by using a first similarity algorithm to obtain a first similarity.
  • the first similarity algorithm refers to an algorithm that calculates the similarity after word-by-word comparison in the order of the words in the two word sequences.
  • The to-be-matched word sequence and the target word sequence are each treated as a one-dimensional word sequence and are compared by the first similarity algorithm, following the order of the to-be-matched words and the order of the target words, to obtain the first similarity.
  • Performing the similarity calculation on the one-dimensional form of the two word sequences saves system storage space and reduces time complexity.
  • Step 206 includes: calculating the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by an edit distance formula to obtain the edit distance between the to-be-matched word sequence and the target word sequence; acquiring a first quantity of the to-be-matched words included in the to-be-matched word sequence and a second quantity of the target words included in the target word sequence; and calculating according to the edit distance, the first quantity, and the second quantity to obtain the first similarity.
  • The edit distance is the minimum number of edit operations required to convert one word sequence into the other. Calculating the edit distance between two word sequences in units of words can reduce the influence that the semantics of the word sequences have on the editing of the word sequences, and improve the accuracy of the calculated word sequence similarity.
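  • The to-be-matched word sequence S contains a first quantity of to-be-matched words, and the target word sequence T contains a second quantity of target words.
  • The edit distance lev_{S,T}(i, j) of the to-be-matched word sequence S and the target word sequence T can be calculated by the formula

    $$\mathrm{lev}_{S,T}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0\\[4pt] \min\begin{cases}\mathrm{lev}_{S,T}(i,j-1)+1\\ \mathrm{lev}_{S,T}(i-1,j)+1\\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\end{cases} & \text{otherwise}\end{cases}$$

    where i represents the i-th to-be-matched word in the to-be-matched word sequence S, and j represents the j-th target word in the target word sequence T.
  • When min(i, j) = 0, the edit distance lev_{S,T}(i, j) takes the maximum value of i and j; otherwise, it takes the minimum value of lev_{S,T}(i, j-1) + 1, lev_{S,T}(i-1, j) + 1, and lev_{S,T}(i-1, j-1) + 1, where the last term adds 1 only when the i-th to-be-matched word differs from the j-th target word.
  • The first similarity sim1_{S,T} can then be calculated by a similarity formula from the edit distance, the first quantity, and the second quantity, where Max(·) represents the maximum value of the first quantity and the second quantity.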
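A word-level edit distance and a first similarity derived from it can be sketched as below. The recurrence is the standard Levenshtein one applied to whole words. The normalisation `1 - distance / max(len(s), len(t))` is an assumption made for this sketch, since the exact similarity formula is not legible in the text; the function names and the example word sequences are likewise hypothetical.

```python
def word_edit_distance(s: list[str], t: list[str]) -> int:
    """Levenshtein distance where the unit of editing is a whole word."""
    # prev[j] holds lev(i-1, j); the first row is lev(0, j) = j.
    prev = list(range(len(t) + 1))
    for i, s_word in enumerate(s, start=1):
        curr = [i]  # lev(i, 0) = i
        for j, t_word in enumerate(t, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,           # delete s_word
                            curr[j - 1] + 1,       # insert t_word
                            prev[j - 1] + cost))   # replace (or keep) the word
        prev = curr
    return prev[-1]


def first_similarity(s: list[str], t: list[str]) -> float:
    """Assumed normalisation: 1 - distance / max(first quantity, second quantity)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(s, t) / longest


# Hypothetical word sequences for illustration.
S = ["3", "台", "computer", "how much", "money"]
T = ["3", "台", "computer", "price"]
print(word_edit_distance(S, T), first_similarity(S, T))
```

Only two rows of the dynamic-programming table are kept at a time, which is one way to keep the memory use proportional to the length of the target word sequence.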
  • Step 208 Extract all the to-be-matched words to form a set of to-be-matched words, and extract all the target words to form a target word set.
  • the set of to-be-matched words refers to a set consisting of all the to-be-matched words contained in the sequence of words to be matched.
  • the target word set refers to a set consisting of all target words contained in the target word sequence.
  • the words to be matched in the set of words to be matched do not have an order, and accordingly, the target words in the set of target words do not have an order.
  • A number written out in words can also be converted to Arabic numerals; for example, “thirty-three” can be converted to “33”. Unifying numbers as Arabic numerals makes them easier to match and improves the accuracy of the text similarity.
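The number normalisation can be illustrated with a small sketch. It assumes English number words below 100 only, and the helper name `words_to_number` is hypothetical; a real system would need a converter for the language of the input text.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text: str) -> str:
    """Convert a spelled-out number below 100 to Arabic numerals.

    Anything that is not recognised as such a number is returned unchanged.
    """
    parts = text.lower().replace("-", " ").split()
    if len(parts) == 1:
        if parts[0] in UNITS:
            return str(UNITS[parts[0]])
        if parts[0] in TENS:
            return str(TENS[parts[0]])
    if (len(parts) == 2 and parts[0] in TENS
            and parts[1] in UNITS and 1 <= UNITS[parts[1]] <= 9):
        return str(TENS[parts[0]] + UNITS[parts[1]])
    return text
```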
  • Step 210 Calculate the to-be-matched word set and the target word set by using a second similarity algorithm to obtain a second similarity.
  • The second similarity algorithm is a similarity algorithm that compares all the to-be-matched words and all the target words as a whole. It includes, but is not limited to, a lexical similarity algorithm based on a semantic dictionary and a lexical similarity algorithm based on corpus statistics.
  • Step 210 includes: matching the to-be-matched word set with the target word set, and counting the number of matches between the to-be-matched words and the target words; counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculating according to the number of matches, the number of to-be-matched words, and the number of target words to obtain the second similarity.
  • Step 212 Perform calculation according to the first similarity and the second similarity to obtain a text similarity of the character sequence to be matched and the target character sequence.
  • the text similarity refers to the similarity between the sequence of characters to be matched and the sequence of target characters.
  • the first similarity may be multiplied by the second similarity as the text similarity.
  • the first weight corresponding to the first similarity and the second weight corresponding to the second similarity may be preset, and the first similarity and the second similarity are weighted and summed to obtain a text similarity.
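The two ways of combining the similarities mentioned above can be written down directly. The default weights below are placeholders chosen for this sketch; the text only says that the weights are preset.

```python
def combine_product(sim1: float, sim2: float) -> float:
    """Text similarity as the product of the first and second similarity."""
    return sim1 * sim2


def combine_weighted(sim1: float, sim2: float,
                     w1: float = 0.5, w2: float = 0.5) -> float:
    """Text similarity as a weighted sum; w1 and w2 are assumed preset weights."""
    return sim1 * w1 + sim2 * w2
```

With the product, a low score on either measure pulls the result down sharply, while the weighted sum lets one measure partly compensate for the other.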
  • The to-be-matched word sequence and the target word sequence, each formed in order in units of words, are obtained first.
  • The first similarity is calculated by a first similarity algorithm that takes word order into account. The to-be-matched word set and the target word set are then formed from the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, respectively.
  • The second similarity is calculated by a second similarity algorithm that does not consider word order, and the first similarity and the second similarity are then combined to calculate the text similarity between the to-be-matched character sequence and the target character sequence.
  • Because the similarity calculation is performed in units of words and two similarity algorithms are combined, the error that arises when a single similarity algorithm captures only a single feature is reduced, and the accuracy of the text similarity calculation is improved.
  • The irrelevant characters include stop characters and identical characters. Deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence includes: deleting the stop characters contained in the to-be-matched character sequence and in the target character sequence; judging whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, an identical character being the same character located at the same position in both sequences; and if so, deleting the identical characters from both sequences, after which the corresponding to-be-matched word sequence and target word sequence are obtained.
  • Stop characters are characters or words that, in information retrieval, can be filtered out before a character sequence is processed in order to save storage space and improve search efficiency.
  • A stop character library can be preset for filtering stop characters.
  • Chinese stop characters include, but are not limited to, modal particles, conjunctions, and transition words, such as “ah”, “bar”, “of”, “further”, “but”, and the like.
  • The to-be-matched characters in the to-be-matched character sequence and the target characters in the target character sequence are ordered. The to-be-matched characters and the target characters are compared in order, and a character that is the same and occupies the same position in both the to-be-matched character sequence and the target character sequence is treated as an identical character.
  • The identical characters are deleted from the to-be-matched character sequence and from the target character sequence, respectively.
  • For example, the to-be-matched character sequence is “how to optimize this algorithm”, and the target character sequence is “how to do the optimization algorithm”.
  • The characters “calculation” and “method” are at the same positions in the to-be-matched character sequence and the target character sequence, so “calculation” and “method” can be deleted.
  • After the identical characters are deleted, the to-be-matched character sequence is “How to optimize this”, and the target character sequence is “Optimize what to do”.
  • Deleting identical characters shortens the word sequences that take part in the text similarity calculation, which saves calculation time, reduces the memory occupied by the calculation, and improves the efficiency of the text similarity calculation.
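A minimal sketch of deleting identical characters follows, assuming positions are counted from the start of each sequence; the function name `delete_identical_chars` and the sample strings are hypothetical.

```python
def delete_identical_chars(s: str, t: str) -> tuple[str, str]:
    """Remove characters that are equal and sit at the same index in both sequences."""
    # zip stops at the shorter sequence, so only comparable positions are checked.
    shared = {i for i, (a, b) in enumerate(zip(s, t)) if a == b}
    s_kept = "".join(c for i, c in enumerate(s) if i not in shared)
    t_kept = "".join(c for i, c in enumerate(t) if i not in shared)
    return s_kept, t_kept


print(delete_identical_chars("abXYcd", "efXYgh"))
```

Characters beyond the end of the shorter sequence have no counterpart at the same position and are therefore always kept.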
  • the unrelated characters included in the sequence of characters to be matched or the sequence of target characters may be replaced with preset characters, and the preset characters included in the sequence of characters to be matched or the sequence of target characters may be all cleared after replacement.
  • After the replacement, a to-be-matched character sequence S2 containing space characters is obtained, and all the space characters contained in S2 are cleared to obtain a to-be-matched character sequence S3 containing no space characters.
  • Word segmentation can then be performed on the to-be-matched character sequence S3 to obtain a to-be-matched word sequence S4.
  • A word tree can be constructed from the hypernym-hyponym hierarchy of the words in the semantic dictionary, as shown in FIG. 3A and FIG. 3B. FIG. 3A shows a word tree derived from physical substances, and FIG. 3B shows a word tree derived from virtual events.
  • The word corresponding to a parent node and the words corresponding to its child nodes are in a hypernym-hyponym relationship.
  • The semantic distance between words can be calculated from the word tree: the higher the level, the larger the path parameter, and the lower the level, the smaller the path parameter. The greater the distance, the smaller the similarity.
  • The path length between word A and word B in the word tree is calculated, that is, the semantic distance d. The similarity between word A and word B can then be calculated from d according to a formula that contains an adjustable parameter.
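The word-tree distance can be sketched as below. Everything concrete here is an assumption: the tiny tree, unit-length edges (the text describes level-dependent path parameters, which this sketch omits), and the similarity formula `alpha / (d + alpha)`, which is a form commonly used with semantic dictionaries and is not confirmed by the text. The value of `alpha` is a placeholder.

```python
# Hypothetical word tree: each word maps to its parent (hypernym); the root maps to None.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def ancestors(word: str) -> list[str]:
    """Path from a word up to the root, starting with the word itself."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def path_length(a: str, b: str) -> int:
    """Number of edges between a and b through their lowest common ancestor."""
    steps_from_a = {w: k for k, w in enumerate(ancestors(a))}
    for steps_from_b, w in enumerate(ancestors(b)):
        if w in steps_from_a:
            return steps_from_a[w] + steps_from_b
    raise ValueError("words are not in the same tree")


def word_similarity(a: str, b: str, alpha: float = 1.6) -> float:
    """Assumed formula: alpha / (d + alpha), where d is the path length."""
    return alpha / (path_length(a, b) + alpha)
```

Under this assumed formula, identical words have similarity 1 and the value decreases monotonically as the path gets longer, which matches the statement that a greater distance means a smaller similarity.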
  • The to-be-matched word set and the target word set may be matched by using the second similarity algorithm to calculate a second sub-similarity between each to-be-matched word in the to-be-matched word set and each target word in the target word set.
  • The second similarity can be calculated from all the calculated second sub-similarities.
  • The second sub-similarities greater than the preset sub-similarity threshold are counted.
  • The number of matches of the to-be-matched words is denoted Q(S, T), and the number of to-be-matched words and the number of target words are counted.
  • The second similarity sim2 can then be calculated by a formula from Q(S, T), the number of to-be-matched words, and the number of target words, where Max(·) represents the maximum value of its arguments.
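Counting matches against a threshold can be sketched as follows. The formula `Q(S, T) / max(number of to-be-matched words, number of target words)` is an assumption made for this sketch, since the original formula is not legible; the threshold value, the function names, and the `exact_match` stand-in for a word similarity measure are also hypothetical.

```python
from typing import Callable


def count_matches(s_words: set[str], t_words: set[str],
                  sub_sim: Callable[[str, str], float],
                  threshold: float) -> int:
    """Q(S, T): to-be-matched words with a second sub-similarity above the threshold."""
    return sum(
        1 for s in s_words
        if any(sub_sim(s, t) > threshold for t in t_words)
    )


def second_similarity(s_words: set[str], t_words: set[str],
                      sub_sim: Callable[[str, str], float],
                      threshold: float = 0.8) -> float:
    """Assumed formula: Q(S, T) / max(|S|, |T|)."""
    largest = max(len(s_words), len(t_words))
    if largest == 0:
        return 1.0
    return count_matches(s_words, t_words, sub_sim, threshold) / largest


def exact_match(a: str, b: str) -> float:
    """Stand-in sub-similarity: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if a == b else 0.0
```

Any word similarity measure with the same signature, for example one based on a word tree, can be passed in place of `exact_match`.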
  • The method further includes: acquiring a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and calculating the to-be-matched pinyin included in the to-be-matched pinyin sequence and the target pinyin included in the target pinyin sequence by the first similarity algorithm to obtain a third similarity. Calculating according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence then includes: calculating according to the first similarity, the second similarity, and the third similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence.
  • the pinyin sequence to be matched refers to a sequence composed of pinyin corresponding to the character to be matched in the sequence of characters to be matched.
  • the target pinyin sequence is a sequence of pinyin corresponding to the target character in the target character sequence.
  • the pinyin to be matched may be generated by acquiring the pinyin corresponding to the character to be matched input by the user when the user performs an input operation.
  • the target pinyin sequence may be a sequence corresponding to the target character sequence preset in the database.
  • The to-be-matched pinyin sequence and the target pinyin sequence may be calculated by the first similarity algorithm, in units of the pinyin corresponding to each character, to obtain the third similarity.
  • For example, the to-be-matched pinyin sequence corresponding to the to-be-matched character sequence “Your name” is “ni ming zi ao kou”, and the target pinyin sequence corresponding to the target character sequence “You are too stubborn” is “ni tai zhi niu”.
  • Both the to-be-matched character sequence and the target character sequence contain the character “拗”, but the pinyin corresponding to “拗” in the to-be-matched pinyin sequence and in the target pinyin sequence is “ao” and “niu”, respectively. Therefore, calculating the text similarity with the to-be-matched pinyin sequence and the target pinyin sequence reduces the error caused by the polyphonic character “拗”.
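The third similarity can be sketched by applying an edit distance to pinyin syllables instead of words. The two pinyin sequences are taken from the example above; the normalisation `1 - distance / max length` and the function names are assumptions of this sketch, and obtaining pinyin from characters is left out.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over list items (here: pinyin syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # replace if different
        prev = curr
    return prev[-1]


def third_similarity(pinyin_s: list[str], pinyin_t: list[str]) -> float:
    """Assumed normalisation: 1 - distance / max(len(pinyin_s), len(pinyin_t))."""
    longest = max(len(pinyin_s), len(pinyin_t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pinyin_s, pinyin_t) / longest


to_be_matched = "ni ming zi ao kou".split()
target = "ni tai zhi niu".split()
print(edit_distance(to_be_matched, target), third_similarity(to_be_matched, target))
```

Because “ao” and “niu” are different syllables, the shared polyphonic character does not count as a match at the pinyin level, which is the effect the example describes.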
  • Acquiring the to-be-matched character sequence and the target character sequence includes: receiving the to-be-matched character sequence sent by the terminal; and acquiring a plurality of target character sequences from the database according to the to-be-matched character sequence. After calculating according to the first similarity and the second similarity to obtain the text similarity of the to-be-matched character sequence and the target character sequence, the method further includes: querying a related resource corresponding to a target character sequence whose text similarity is greater than a preset similarity threshold; and sending the related resource to the terminal.
  • When the text similarity is calculated between the to-be-matched character sequence and a plurality of target character sequences, the target character sequence with the highest text similarity to the to-be-matched character sequence can also be determined.
  • the target character sequence can be associated with text, images, links, audio, video, and other related resources.
  • The to-be-matched character sequence may be a character sequence that the user sends through the terminal to ask a question.
  • the sequence of target characters can be a sequence of characters associated with the corresponding answer text. After determining the target character sequence having the highest similarity to the character sequence to be matched, the character sequence of the corresponding answer text associated with the target character sequence may be transmitted to the terminal.
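The lookup of related resources can be sketched as below. The word-overlap measure `jaccard` is only a stand-in so that the example runs on its own; it is not the text similarity described in this application. The database contents, the threshold, and the function names are hypothetical.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity based on word-set overlap (not the method described above)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def related_resources(query: str, database: dict[str, str],
                      threshold: float = 0.5,
                      similarity: Callable[[str, str], float] = jaccard) -> list[str]:
    """Return resources of target sequences scoring above the threshold, best first."""
    scored = [(similarity(query, target), resource)
              for target, resource in database.items()]
    return [resource for score, resource in sorted(scored, reverse=True)
            if score > threshold]


# Hypothetical question templates mapped to answer texts.
DATABASE = {
    "3 computer prices": "answer A",
    "refund policy": "answer B",
}
print(related_resources("3 computer prices today", DATABASE))
```

Taking only the first element of the returned list corresponds to sending the answer associated with the most similar target character sequence.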
  • As shown in FIG. 4, another text similarity calculation method is provided, the method comprising the following steps:
  • Step 402 Acquire a sequence of characters to be matched and a sequence of target characters.
  • the sequence of characters to be matched and the sequence of target characters may be a combination of one or more of letters, Arabic numerals, Chinese characters, and punctuation marks.
  • The to-be-matched character sequence may be a character sequence that the user sends through the terminal to ask a question.
  • the sequence of characters to be matched can be "How much is the cost of 3 computers?".
  • the target character sequence can be a sequence of characters of a question template pre-stored in the database.
  • the target character sequence can be "3 computer prices?".
  • Step 404 Delete the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence.
  • Irrelevant characters include, but are not limited to, punctuation marks and stop characters.
  • A stop character library can be preset for filtering stop characters.
  • Chinese stop characters include, but are not limited to, modal particles, conjunctions, and transition words, such as “ah”, “bar”, “of”, “further”, “but”, and the like.
  • For example, the to-be-matched character sequence is “How much is the 3 computers?”.
  • This sequence includes the punctuation mark “?” and the stop character “Yes”; deleting the irrelevant characters contained in the to-be-matched character sequence yields “how much is the 3 computers”.
  • Step 406 Perform word segmentation on the to-be-matched character sequence and the target character sequence after deleting the irrelevant characters, to obtain a corresponding to-be-matched word sequence and a target word sequence.
  • Word segmentation refers to the process of converting a sequence of characters into a sequence of words according to a certain rule.
  • the sequence of to-be-matched words refers to a sequence formed in order of the words to be matched.
  • the target word sequence refers to a sequence formed in order in units of target words.
  • the sequence of the word to be matched may be obtained as “3
  • The to-be-matched word sequence includes five to-be-matched words: “3”, “台” (a measure word), “computer”, “how much”, and “money”.
  • Step 408 Calculate the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence by the edit distance formula to obtain the edit distance between the to-be-matched word sequence and the target word sequence.
  • The edit distance formula calculates the minimum number of edit operations required to convert one word sequence into another.
  • This minimum number of edit operations is the edit distance.
  • Permitted edit operations are replacing one word with another, inserting a word, and deleting a word.
  • the sequence of words to be matched is “3
  • the target word sequence is “3
  • Step 410 Acquire a first quantity of the to-be-matched words included in the sequence of the to-be-matched words, and a second quantity of the target words included in the target word sequence.
  • Step 412 Perform calculation according to the edit distance, the first quantity, and the second quantity to obtain a first similarity.
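  • The to-be-matched word sequence S contains a first quantity of to-be-matched words, and the target word sequence T contains a second quantity of target words.
  • The edit distance lev_{S,T}(i, j) of the to-be-matched word sequence S and the target word sequence T can be calculated by the formula

    $$\mathrm{lev}_{S,T}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0\\[4pt] \min\begin{cases}\mathrm{lev}_{S,T}(i,j-1)+1\\ \mathrm{lev}_{S,T}(i-1,j)+1\\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\end{cases} & \text{otherwise}\end{cases}$$

    where i represents the i-th to-be-matched word in the to-be-matched word sequence S, and j represents the j-th target word in the target word sequence T.
  • When min(i, j) = 0, the edit distance lev_{S,T}(i, j) takes the maximum value of i and j; otherwise, it takes the minimum value of lev_{S,T}(i, j-1) + 1, lev_{S,T}(i-1, j) + 1, and lev_{S,T}(i-1, j-1) + 1, where the last term adds 1 only when the i-th to-be-matched word differs from the j-th target word.
  • The first similarity sim1_{S,T} can then be calculated by a similarity formula from the edit distance, the first quantity, and the second quantity, where Max(·) represents the maximum value of the first quantity and the second quantity.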
  • Step 414 Extract all the to-be-matched words to form a set of to-be-matched words, and extract all the target words to form a target word set.
  • the sequence of to-be-matched words is “3
  • In the to-be-matched word set, the five to-be-matched words “3”, “台” (a measure word), “computer”, “how much”, and “money” stand side by side and have no order.
  • Step 416 Match the to-be-matched word set and the target word set, and count the matching number of the to-be-matched word and the target word.
  • a word tree can be constructed for the upper and lower hierarchical relationship of words in the semantic dictionary, and the second sub-similarity between the to-be-matched word and the target word is calculated by the path distance between the to-be-matched word and the target word in the word tree.
  • the to-be-matched word corresponding to the second sub-similarity greater than the preset sub-similarity threshold is determined to be matched with the target word, and the number of matches between the to-be-matched word and the target word in the to-be-matched word set and the target word set is counted.
  • Step 418 Count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set.
  • Step 420 Calculate according to the number of matches, the number of words to be matched, and the number of target words, to obtain a second similarity.
  • For example, the to-be-matched word set is {“computer”, “how much”, “money”}.
  • The target word set is {“computer”, “price”}.
  • The following second sub-similarities can be calculated: sim11 between “computer” and “computer”, sim12 between “computer” and “price”, sim21 between “how much” and “computer”, sim22 between “how much” and “price”, sim31 between “money” and “computer”, and sim32 between “money” and “price”. For each to-be-matched word, the maximum second sub-similarity against the target words of the target word set is taken, and these maxima are multiplied to obtain the second similarity sim2.
  • The maximum second sub-similarities corresponding to “computer”, “how much”, and “money” are sim11, sim22, and sim32, respectively.
  • Step 422 Perform calculation according to the first similarity and the second similarity to obtain a text similarity of the character sequence to be matched and the target character sequence.
  • the first similarity sim1 and the second similarity sim2 may be multiplied by the corresponding first weight w1 and second weight w2 and then summed.
  • the text similarity is sim(S, T) = sim1 × w1 + sim2 × w2.
  • the to-be-matched word sequence and the target word sequence, each formed in order with the word as the unit, are obtained.
  • the first similarity is calculated by a first similarity algorithm that considers word order; the to-be-matched word set and the target word set are then formed from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence, respectively.
  • the second similarity is calculated by a second similarity algorithm that does not consider word order, and the first similarity and the second similarity are then combined to calculate the text similarity between the character sequence to be matched and the target character sequence.
  • the similarity is calculated with the word as the unit, and several similarity algorithms are combined to calculate the text similarity, which reduces the error caused by relying on a single feature through a single similarity algorithm and improves the accuracy of the text similarity calculation.
  • a text similarity calculation device 500 includes: a character sequence acquisition module 502, configured to acquire a character sequence to be matched and a target character sequence;
  • a word sequence acquisition module 504, configured to preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence;
  • a first similarity calculation module 506, configured to calculate, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity;
  • a word set forming module 508, configured to extract all to-be-matched words to form a to-be-matched word set, and extract all target words to form a target word set;
  • a second similarity calculation module 510, configured to calculate the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity;
  • a text similarity calculation module 512, configured to calculate, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
  • the word sequence obtaining module 504 is further configured to delete the extraneous characters included in the sequence of characters to be matched and the extraneous characters included in the target character sequence; the sequence of characters to be matched and the sequence of target characters after deleting the unrelated characters The word segmentation is performed separately, and the corresponding sequence of the word to be matched and the sequence of the target word are obtained.
  • the irrelevant characters include stop characters and identical characters.
  • the word sequence acquisition module 504 is further configured to delete the stop characters contained in the character sequence to be matched and the stop characters contained in the target character sequence; detect whether identical characters exist in the character sequence to be matched and the target character sequence after the stop characters are deleted, an identical character being the same character at the same position in both sequences after the stop characters are deleted; and, if so, delete the identical characters from both sequences and obtain the corresponding to-be-matched word sequence and target word sequence.
  • the first similarity calculation module 506 is further configured to calculate, by an edit distance formula, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain the edit distance between the to-be-matched word sequence and the target word sequence; acquire a first number of to-be-matched words contained in the to-be-matched word sequence and a second number of target words contained in the target word sequence; and calculate the first similarity according to the edit distance, the first number and the second number.
  • the second similarity calculation module 510 is further configured to match the to-be-matched word set and the target word set, and count the number of matches between the to-be-matched word and the target word; and count the number of to-be-matched words in the to-be-matched word set. And the number of target words in the target word set; the second similarity is obtained according to the number of matches, the number of words to be matched, and the number of target words.
  • the device further includes a third similarity calculation module 514, configured to acquire a to-be-matched pinyin sequence corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence, and to calculate, by the first similarity algorithm, the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence to obtain a third similarity.
  • calculating, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence then includes: calculating according to the first similarity, the second similarity and the third similarity to obtain the text similarity between the character sequence to be matched and the target character sequence.
  • the character sequence acquisition module 502 is further configured to receive a character sequence to be matched sent by the terminal, and acquire a plurality of target character sequences from the database according to the character sequence to be matched; the device further includes a related resource sending module, configured to query the related resources corresponding to the target character sequences whose text similarity is greater than a preset similarity threshold, and send the related resources to the terminal.
  • Each module in the above text similarity calculation device may be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
  • a computer device which may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer device includes a processor, memory, network interface, and database connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-transitory computer readable storage medium and an internal memory.
  • the non-transitory computer readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of an operating system and computer readable instructions in a non-transitory computer readable storage medium.
  • the database of the computer device is used to store a sequence of target characters.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by the processor to implement a text similarity calculation method.
  • FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than those shown in the figure, combine some components, or have a different arrangement of components.
  • a computer device comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the processor, implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.
  • one or more non-transitory computer readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text similarity calculation method comprises: obtaining a character sequence to be matched and a target character sequence; respectively preprocessing the character sequence to be matched and the target character sequence to obtain a corresponding word sequence to be matched and a corresponding target word sequence; performing calculation, by means of a first similarity algorithm, on a word to be matched contained in the word sequence to be matched and a target word contained in the target word sequence, so as to obtain a first similarity degree; extracting all of the words to be matched, so as to form a set of words to be matched, and extracting all of the target words to form a target word set; performing calculation on the set of words to be matched and the target word set by means of a second similarity algorithm, so as to obtain a second similarity degree; and calculating, according to the first similarity degree and the second similarity degree, a text similarity degree between the character sequence to be matched and the target character sequence.

Description

Text similarity calculation method, device, computer device and storage medium
This application claims priority to Chinese patent application No. 2018100317700, filed with the Chinese Patent Office on January 12, 2018 and entitled "Text similarity calculation method, device, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to a text similarity calculation method, device, computer device and storage medium.
Background
With the development of chatbot technology, the concept of fuzzy string search has emerged, and string matching is usually implemented with an edit distance algorithm. The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. The larger the edit distance, the lower the similarity between the strings.
However, because of the complexity of language, the same meaning can be expressed by different texts, while texts that look very similar on the surface can express very different meanings. Traditional edit distance algorithms usually compute the edit distance between character sequences with the single character as the unit, so the computed distance reflects only the surface of the text, and the resulting text similarity has low accuracy.
Summary of the invention
According to various embodiments disclosed in the present application, a text similarity calculation method, device, computer device and storage medium capable of improving the accuracy of text similarity calculation are provided.
A text similarity calculation method includes: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity; extracting all to-be-matched words to form a to-be-matched word set, and extracting all target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
A text similarity calculation device includes: a character sequence acquisition module, configured to acquire a character sequence to be matched and a target character sequence; a word sequence acquisition module, configured to preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; a first similarity calculation module, configured to calculate, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity; a word set forming module, configured to extract all to-be-matched words to form a to-be-matched word set, and extract all target words to form a target word set; a second similarity calculation module, configured to calculate the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and a text similarity calculation module, configured to calculate, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the processors, cause the one or more processors to perform the following steps: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity; extracting all to-be-matched words to form a to-be-matched word set, and extracting all target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
One or more non-volatile computer readable storage media store computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity; extracting all to-be-matched words to form a to-be-matched word set, and extracting all target words to form a target word set; calculating the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity; and calculating, according to the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings and the claims.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of a text similarity calculation method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a text similarity calculation method according to one or more embodiments.
FIG. 3A is a schematic diagram of a word tree derived from physical substances according to one or more embodiments.
FIG. 3B is a schematic diagram of a word tree derived from virtual events according to one or more embodiments.
FIG. 4 is a schematic flowchart of a text similarity calculation method according to one or more other embodiments.
FIG. 5 is a structural block diagram of a text similarity calculation device according to one or more embodiments.
FIG. 6 is an internal structure diagram of a computer device according to one or more embodiments.
Detailed description
To make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It will be understood that the terms "first", "second" and the like used in the present application may be used herein to describe various elements, but these elements are not limited by these terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of the present application, the first similarity may be referred to as the second similarity, and similarly the second similarity may be referred to as the first similarity. The first similarity and the second similarity are both similarities, but they are not the same similarity.
The text similarity calculation method provided by the present application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. For example, the server 104 can receive a character sequence to be matched sent by the terminal 102. The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet or portable wearable device, and the server 104 can be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a text similarity calculation method is provided. Taking its application to the server 104 in FIG. 1 as an example, the method includes the following steps:
Step 202: acquire a character sequence to be matched and a target character sequence.
The character sequence to be matched is the character sequence that needs to be matched. The target character sequence is a character sequence preset in the database, against which the character sequence to be matched is matched. A character sequence is a sequence formed in order with the character as the unit, where a character may be at least one of a letter, an Arabic numeral, a Chinese character and a punctuation mark. A character sequence includes, but is not limited to, one of or a combination of letters, Arabic numerals, Chinese characters and punctuation marks.
Step 204: preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence.
Preprocessing is the process of applying at least one of conversion, reduction, splitting and similar operations to the character sequence to be matched and the target character sequence. The to-be-matched word sequence is the word sequence obtained by preprocessing the character sequence to be matched, and the target word sequence is the word sequence obtained by preprocessing the target character sequence. A word sequence is a sequence formed in order with the word as the unit: the to-be-matched word sequence is formed in order from to-be-matched words, and the target word sequence is formed in order from target words. A to-be-matched word or target word may be a simple word composed of one or more characters, or a compound word composed of two or more simple words.
In one embodiment, step 204 includes: deleting the irrelevant characters contained in the character sequence to be matched and the irrelevant characters contained in the target character sequence; and performing word segmentation on the character sequence to be matched and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
Irrelevant characters are characters that do not affect the text similarity calculation, including but not limited to punctuation marks and stop characters. Word segmentation is the process of converting a character sequence into a word sequence according to certain rules. One or more segmentation methods, such as segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics, can be used to segment the character sequence to be matched and the target character sequence after the irrelevant characters are deleted.
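The preprocessing in step 204 can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the application: the stop-word list `STOP_WORDS` and the function name `preprocess` are hypothetical, stop items are removed as whole words after splitting rather than as characters before segmentation, and whitespace splitting stands in for a real word segmenter, which Chinese text would require.

```python
import string

# Hypothetical stop-word list; the source does not enumerate its stop characters.
STOP_WORDS = {"please", "the", "a", "is"}
# ASCII punctuation plus a few common full-width punctuation marks.
PUNCTUATION = set(string.punctuation) | set("，。？！、；：")


def preprocess(char_sequence):
    """Turn a raw character sequence into an ordered word sequence."""
    # Replace punctuation with spaces so adjacent words stay separated.
    cleaned = "".join(" " if ch in PUNCTUATION else ch for ch in char_sequence)
    # Stand-in segmentation (assumption): split on whitespace.
    words = cleaned.lower().split()
    # Drop stop words while preserving the original word order.
    return [word for word in words if word not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Please, how much is the computer?"))
```

The same function would be applied to both the character sequence to be matched and each target character sequence, so that both sides are normalised in the same way before any similarity is computed.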
Step 206: calculate, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity.
The first similarity algorithm is an algorithm that computes a similarity after comparing two word sequences word by word in word order. The to-be-matched word sequence and the target word sequence are each treated as a one-dimensional word sequence, and the first similarity is calculated by the first similarity algorithm following the order of the to-be-matched words and the order of the target words. Calculating the similarity on one-dimensional sequences saves system storage space and reduces time complexity.
In one embodiment, step 206 includes: calculating, by an edit distance formula, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain the edit distance between the to-be-matched word sequence and the target word sequence; acquiring a first number of to-be-matched words contained in the to-be-matched word sequence and a second number of target words contained in the target word sequence; and calculating the first similarity according to the edit distance, the first number and the second number.
The edit distance is the minimum number of edit operations required to transform one word sequence into the other. Computing the edit distance between two word sequences with the word as the unit reduces the influence of the semantics of the word sequence on the editing of the word sequence and improves the accuracy of the calculated word sequence similarity.
For example, a to-be-matched word sequence S of length |S| contains |S| to-be-matched words, and a target word sequence T of length |T| contains |T| target words. The edit distance $\mathrm{lev}_{S,T}(i,j)$ between S and T can be calculated by the formula

$$\mathrm{lev}_{S,T}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\ \min\{\mathrm{lev}_{S,T}(i,j-1)+1,\ \mathrm{lev}_{S,T}(i-1,j)+1,\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\}, & \text{otherwise}\end{cases}$$

where i denotes the i-th to-be-matched word in S and j denotes the j-th target word in T. When at least one of i and j is 0, the edit distance takes the maximum of i and j; otherwise it takes the minimum of $\mathrm{lev}_{S,T}(i,j-1)+1$, $\mathrm{lev}_{S,T}(i-1,j)+1$ and $\mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}$, where the indicator $1_{(S_i\neq T_j)}$ is 1 when the two words differ and 0 when they are the same. Based on the calculated edit distance, the first similarity can be calculated by the similarity formula

$$\mathrm{sim1}_{S,T}=1-\frac{\mathrm{lev}_{S,T}(|S|,|T|)}{\mathrm{Max}(|S|,|T|)}$$

where Max(|S|, |T|) is the maximum of |S| and |T|. The first similarity takes a value from 0 to 1.
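The word-level edit distance and the first similarity derived from it can be sketched as below. This assumes the standard Levenshtein recurrence, with whole words in place of characters, and normalisation by the longer sequence; the function names are illustrative, and treating two empty sequences as identical is an added assumption that the source does not address.

```python
def word_edit_distance(s_words, t_words):
    """Levenshtein distance between two word sequences, one word per edit."""
    # prev[j] holds the distance between the first i-1 words of S and the first j of T.
    prev = list(range(len(t_words) + 1))
    for i, s_word in enumerate(s_words, start=1):
        curr = [i]  # turning i words into an empty sequence costs i deletions
        for j, t_word in enumerate(t_words, start=1):
            substitution = 0 if s_word == t_word else 1
            curr.append(min(curr[j - 1] + 1,             # insertion
                            prev[j] + 1,                 # deletion
                            prev[j - 1] + substitution)) # match or replacement
        prev = curr
    return prev[len(t_words)]


def first_similarity(s_words, t_words):
    """Order-aware similarity in [0, 1] derived from the edit distance."""
    longest = max(len(s_words), len(t_words))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1.0 - word_edit_distance(s_words, t_words) / longest


if __name__ == "__main__":
    s = ["3", "computer", "how much", "money"]
    t = ["computer", "price"]
    print(word_edit_distance(s, t), first_similarity(s, t))
```

Only two rows of the distance table are kept at a time, which is one way to realise the storage saving that the one-dimensional treatment of the sequences aims at.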
Step 208: Extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set.
The to-be-matched word set is the set consisting of all the to-be-matched words contained in the to-be-matched word sequence. The target word set is the set consisting of all the target words contained in the target word sequence. The to-be-matched words in the to-be-matched word set are unordered, and likewise the target words in the target word set are unordered.
In one of the embodiments, numbers expressed in words can also be converted to Arabic numerals; for example, "三十三" (thirty-three) can be converted to "33". Unifying numbers as Arabic numerals allows numbers to be matched more quickly and improves the accuracy of the text similarity.
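The numeral conversion mentioned above could look like the following sketch. The source does not say how the conversion is done; this hypothetical helper only handles Chinese numerals below ten thousand built from the digits and the units 十, 百, and 千.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_numeral_to_int(text: str) -> int:
    """Convert a Chinese numeral below 10000 to an int (illustrative only)."""
    total = 0
    digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, such as the leading "十" in "十五", means 1 x unit.
            total += (digit or 1) * UNITS[ch]
            digit = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + digit
```

Calling `str(chinese_numeral_to_int("三十三"))` gives the Arabic form used in the example above.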
Step 210: Calculate the to-be-matched word set and the target word set by a second similarity algorithm to obtain a second similarity.
The second similarity algorithm is a similarity algorithm that compares all the to-be-matched words and all the target words, each taken as a whole. Such algorithms include, but are not limited to, lexical similarity algorithms based on a semantic dictionary and lexical similarity algorithms based on corpus statistics.
In one of the embodiments, step 210 includes: matching the to-be-matched word set against the target word set, and counting the number of matches between to-be-matched words and target words; counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculating the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
Step 212: Calculate the text similarity of the to-be-matched character sequence and the target character sequence according to the first similarity and the second similarity.
The text similarity is the similarity between the to-be-matched character sequence and the target character sequence. After the first similarity and the second similarity are calculated, the product of the first similarity and the second similarity can be used as the text similarity. Alternatively, a first weight corresponding to the first similarity and a second weight corresponding to the second similarity can be preset, and the text similarity is obtained as the weighted sum of the first similarity and the second similarity.
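The two ways of combining the similarities can be written out as follows. The function names and the default weights of 0.5 are assumptions made for illustration; the source leaves the weights as preset values.

```python
def combine_by_product(sim1: float, sim2: float) -> float:
    """Text similarity as the product of the two similarities."""
    return sim1 * sim2


def combine_by_weighted_sum(sim1: float, sim2: float,
                            w1: float = 0.5, w2: float = 0.5) -> float:
    """Text similarity as sim1 * w1 + sim2 * w2 (weights are illustrative)."""
    return sim1 * w1 + sim2 * w2
```

With the product, a low score on either measure pulls the result down sharply; with the weighted sum, one measure can partly compensate for the other. If the weights sum to 1, the weighted result stays in the range 0 to 1.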
In the above text similarity calculation method, after the to-be-matched character sequence and the target character sequence are acquired, they are preprocessed to obtain a to-be-matched word sequence and a target word sequence, each formed in order in units of words. A first similarity is calculated by a first similarity algorithm that takes word order into account. A to-be-matched word set and a target word set are then formed from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence, respectively, and a second similarity is calculated by a second similarity algorithm that does not take word order into account. The first similarity and the second similarity are then combined to obtain the text similarity between the to-be-matched character sequence and the target character sequence. Calculating similarity in units of words and combining two similarity algorithms reduces the error caused by applying a single similarity algorithm to single characters, and improves the accuracy of the text similarity calculation.
In one of the embodiments, the irrelevant characters include stop characters and identical characters. Deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence includes: deleting the stop characters contained in the to-be-matched character sequence and in the target character sequence; determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, where identical characters are the same characters located at the same position in the two sequences after the stop characters are deleted; and if so, deleting the identical characters from both sequences to obtain the corresponding to-be-matched word sequence and target word sequence.
Stop characters are characters or words that, in information retrieval, can be filtered out before a character sequence is processed in order to save storage space and improve search efficiency. A stop character library can be preset for filtering stop characters. Chinese stop characters include, but are not limited to, modal particles, conjunctions, and transition words, such as "啊", "吧", "哎", "的", "此外" (in addition), and "但是" (but). When a stop character is detected, it can be deleted from the to-be-matched character sequence or the target character sequence.
Because the to-be-matched characters in the to-be-matched character sequence and the target characters in the target character sequence are both ordered, the to-be-matched characters and the target characters can be matched in order, and the same characters located at the same position in the two sequences are taken as identical characters. The identical characters are then deleted from the to-be-matched character sequence and from the target character sequence. For example, the to-be-matched character sequence is "这个算法该如何优化" (how should this algorithm be optimized) and the target character sequence is "优化算法该怎么做" (how should the optimization algorithm be done). Matching shows that "算" and "法" are at the same position in both sequences, so "算" and "法" can be deleted. After the identical characters are deleted, the to-be-matched character sequence is "这个该如何优化" and the target character sequence is "优化该怎么做".
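The position-by-position comparison described above can be sketched as below. The function name is hypothetical, and the sketch applies the rule literally: every position at which the two sequences hold the same character is dropped from both.

```python
from typing import Tuple


def remove_identical_characters(a: str, b: str) -> Tuple[str, str]:
    """Drop characters that are equal and sit at the same index in a and b."""
    # zip stops at the shorter sequence, so the tail of the longer one is kept.
    shared = {i for i, (x, y) in enumerate(zip(a, b)) if x == y}
    kept_a = "".join(ch for i, ch in enumerate(a) if i not in shared)
    kept_b = "".join(ch for i, ch in enumerate(b) if i not in shared)
    return kept_a, kept_b
```

For instance, `remove_identical_characters("abcde", "xbcyz")` removes "b" and "c", which occupy indices 1 and 2 in both strings.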
In the above embodiment, deleting irrelevant characters such as stop characters and identical characters shortens the word sequences that take part in the text similarity calculation, which saves calculation time, reduces the memory occupied by the calculation, and improves the efficiency of the text similarity calculation.
In one of the embodiments, the irrelevant characters contained in the to-be-matched character sequence or the target character sequence can instead be replaced with a preset character, and after the replacement all the preset characters contained in the sequence can be removed. For example, after the irrelevant characters in the to-be-matched character sequence S1 are replaced with space characters, a to-be-matched character sequence S2 containing space characters is obtained; all the space characters in S2 are then removed to obtain a to-be-matched character sequence S3 that contains no space characters. Word segmentation can then be performed on S3 to obtain the to-be-matched word sequence S4.
In one of the embodiments, before the second similarity is calculated by the lexical similarity algorithm based on a semantic dictionary, a word tree can be constructed from the hypernym-hyponym hierarchical relationships of the words in the semantic dictionary, as shown in FIG. 3A and FIG. 3B. FIG. 3A shows a word tree derived from physical entities, and FIG. 3B shows a word tree derived from virtual events. The word corresponding to a parent node is a hypernym of the words corresponding to its child nodes. The semantic distance between words can be calculated from the word tree; the higher the level, the larger the path parameter, and the lower the level, the smaller the path parameter. The larger the distance, the smaller the similarity. After the path length between word A and word B in the word tree, that is, the semantic distance d, is calculated from the word tree, the similarity between word A and word B can be calculated according to the formula in Figure PCTCN2018099994-appb-000003, where α is a parameter.
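As an illustration of a path-based word similarity, the sketch below stores a small hypothetical word tree as child-to-parent links and measures the semantic distance d as the number of edges between two words. The similarity form α / (d + α) is an assumption, chosen because it is a common distance-to-similarity mapping with a single parameter α; the formula image itself is not reproduced here. Every edge counts as 1, so the level-dependent path parameters mentioned above are not modelled.

```python
from typing import Dict, List

# Hypothetical child -> parent (hypernym) links, for illustration only.
PARENT: Dict[str, str] = {
    "animal": "entity", "plant": "entity",
    "dog": "animal", "cat": "animal", "tree": "plant",
}


def path_to_root(word: str, parent: Dict[str, str]) -> List[str]:
    """Return the word followed by its ancestors up to the root."""
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def semantic_distance(a: str, b: str, parent: Dict[str, str]) -> int:
    """Number of edges on the path between a and b in the word tree."""
    steps_from_b = {w: k for k, w in enumerate(path_to_root(b, parent))}
    for steps_from_a, w in enumerate(path_to_root(a, parent)):
        if w in steps_from_b:  # first common ancestor on the way up from a
            return steps_from_a + steps_from_b[w]
    raise ValueError("the two words are not in the same tree")


def word_similarity(a: str, b: str, parent: Dict[str, str],
                    alpha: float = 1.0) -> float:
    """Assumed form alpha / (d + alpha): 1 at d = 0, smaller as d grows."""
    return alpha / (semantic_distance(a, b, parent) + alpha)
```

With these links, "dog" and "cat" are two edges apart through "animal", while "dog" and "tree" are four edges apart through "entity", so the first pair scores higher.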
In one of the embodiments, the to-be-matched word set can be matched against the target word set, and the second similarity algorithm is used to calculate a second sub-similarity between each to-be-matched word in the to-be-matched word set and each target word in the target word set. The second similarity can then be calculated from all the calculated second sub-similarities.
In one of the embodiments, after all the to-be-matched words are extracted to form the to-be-matched word set and all the target words are extracted to form the target word set, the number of matches Q(S,T) can be counted, that is, the number of to-be-matched words whose second sub-similarity is greater than a preset sub-similarity threshold, together with the number of to-be-matched words |S| contained in the to-be-matched word set and the number of target words |T| contained in the target word set. The second similarity sim2 can be calculated by the formula
$$sim2=\frac{Q(S,T)}{Max(|S|,|T|)}$$
where Max(|S|,|T|) denotes the larger of the number of to-be-matched words |S| and the number of target words |T|.
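A minimal sketch of this count-based second similarity follows. It assumes that sim2 is the number of matches Q(S,T) divided by Max(|S|,|T|), and that a to-be-matched word counts as matched when its sub-similarity with at least one target word exceeds the threshold. The helper `exact_or_synonym`, the synonym pair, and the threshold value are hypothetical stand-ins for a real sub-similarity measure.

```python
from typing import Callable, Set


def second_similarity(s_words: Set[str], t_words: Set[str],
                      sub_similarity: Callable[[str, str], float],
                      threshold: float = 0.8) -> float:
    """Assumed form: Q(S, T) / Max(|S|, |T|)."""
    if not s_words or not t_words:
        return 0.0
    matches = sum(
        1 for s in s_words
        if any(sub_similarity(s, t) > threshold for t in t_words)
    )
    return matches / max(len(s_words), len(t_words))


def exact_or_synonym(a: str, b: str) -> float:
    """Toy sub-similarity: 1.0 for equal words or a known synonym pair."""
    return 1.0 if a == b or {a, b} == {"电脑", "计算机"} else 0.0
```

For the sets {"3", "台", "电脑", "多少", "钱"} and {"3", "台", "计算机", "价格"}, three to-be-matched words find a partner under this toy measure, so the sketch returns 3 / 5.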
In one of the embodiments, after the to-be-matched character sequence and the target character sequence are acquired, the method further includes: acquiring a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and calculating the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence by the first similarity algorithm to obtain a third similarity. Calculating the text similarity of the to-be-matched character sequence and the target character sequence according to the first similarity and the second similarity then includes: calculating the text similarity of the to-be-matched character sequence and the target character sequence according to the first similarity, the second similarity, and the third similarity.
The to-be-matched pinyin sequence is the sequence formed by the pinyin of the to-be-matched characters in the to-be-matched character sequence. The target pinyin sequence is the sequence formed by the pinyin of the target characters in the target character sequence. The to-be-matched pinyin sequence can be generated by acquiring the pinyin that the user enters for the to-be-matched characters during an input operation. The target pinyin sequence can be a sequence preset in the database in correspondence with the target character sequence. The third similarity can be calculated from the to-be-matched pinyin sequence and the target pinyin sequence by the first similarity algorithm, in units of the pinyin of each character.
For example, the to-be-matched pinyin sequence corresponding to the to-be-matched character sequence "你名字拗口" (your name is hard to pronounce) is "ni ming zi ao kou", and the target pinyin sequence corresponding to the target character sequence "你太执拗" (you are too stubborn) is "ni tai zhi niu". Although both the to-be-matched character sequence and the target character sequence contain the character "拗", its pinyin in the to-be-matched pinyin sequence and in the target pinyin sequence is "ao" and "niu" respectively, which differ considerably. Calculating the text similarity with the to-be-matched pinyin sequence and the target pinyin sequence therefore reduces the error introduced by the polyphonic character "拗".
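The effect of the pinyin sequences in this example can be checked with a short sketch. It assumes the first similarity algorithm is the normalized edit distance, 1 minus the distance divided by the longer length, applied once per character and once per syllable; the pinyin strings are those given in the example and the function name is hypothetical.

```python
from typing import Sequence


def sequence_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - edit distance / max length, computed over sequence items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if x == y else 1)))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


# Character level: "拗" appears at the same position in both sequences.
char_level = sequence_similarity(list("你名字拗口"), list("你太执拗"))
# Pinyin level: "ao" and "niu" no longer match.
pinyin_level = sequence_similarity("ni ming zi ao kou".split(),
                                   "ni tai zhi niu".split())
```

Under these assumptions the character-level value is 0.4 and the pinyin-level value is 0.2, so the pinyin sequence lowers the score for two texts that share a character but not its reading.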
In the above embodiment, introducing the to-be-matched pinyin sequence and the target pinyin sequence makes it possible to detect cases in which a polyphonic character yields the same written character with different meanings, thereby reducing the text similarity error caused by polyphonic characters.
In one of the embodiments, acquiring the to-be-matched character sequence and the target character sequence includes: receiving the to-be-matched character sequence sent by a terminal; and acquiring a plurality of target character sequences from a database according to the to-be-matched character sequence. After the text similarity of the to-be-matched character sequence and the target character sequence is calculated according to the first similarity and the second similarity, the method further includes: querying the related resource corresponding to a target character sequence whose text similarity is greater than a preset similarity threshold; and sending the related resource to the terminal.
By calculating the text similarity between the to-be-matched character sequence and each of the plurality of target character sequences, the target character sequence with the highest text similarity to the to-be-matched character sequence can also be determined. A target character sequence can be associated with related resources such as text, pictures, links, audio, and video. For example, the to-be-matched character sequence can be a character sequence, sent by a user through the terminal, that asks a question, and the target character sequence can be a character sequence associated with a corresponding answer text. After the target character sequence with the highest text similarity to the to-be-matched character sequence is determined, the answer text associated with that target character sequence can be sent to the terminal.
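The lookup described above might be organised as in the following sketch. The mapping from target character sequences to resources, the threshold value, and the stand-in similarity function `word_overlap` are all hypothetical; in the described method the score would be the text similarity obtained from the first and second similarities.

```python
from typing import Callable, Dict, Optional


def best_resource(query: str, resources: Dict[str, str],
                  similarity: Callable[[str, str], float],
                  threshold: float = 0.4) -> Optional[str]:
    """Return the resource of the most similar target above the threshold."""
    if not resources:
        return None
    best_target = max(resources, key=lambda target: similarity(query, target))
    if similarity(query, best_target) <= threshold:
        return None  # no target is similar enough to answer the query
    return resources[best_target]


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0
```

Returning `None` when nothing clears the threshold lets the caller decide what to send to the terminal in that case, which the source does not specify.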
In one of the embodiments, as shown in FIG. 4, another text similarity calculation method is provided, which includes the following steps:
Step 402: Acquire a to-be-matched character sequence and a target character sequence.
The to-be-matched character sequence and the target character sequence can each be a combination of one or more of letters, Arabic numerals, Chinese characters, punctuation marks, and the like.
For example, the to-be-matched character sequence can be a character sequence, sent by a user through a terminal, that asks a question, such as "请问3台电脑是多少钱?" (May I ask how much 3 computers cost?). The target character sequence can be the character sequence of a question template pre-stored in a database, such as "3台计算机价格?" (Price of 3 computers?). After the to-be-matched character sequence sent by the terminal is received, the target character sequences preset in the database can be looked up.
Step 404: Delete the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence.
Irrelevant characters include, but are not limited to, punctuation marks and stop characters. A stop character library can be preset for filtering stop characters. Chinese stop characters include, but are not limited to, modal particles, conjunctions, and transition words, such as "啊", "吧", "哎", "的", "此外" (in addition), and "但是" (but). When a stop character is detected, it can be deleted from the to-be-matched character sequence or the target character sequence.
For example, the to-be-matched character sequence is "请问3台电脑是多少钱?". It contains the punctuation mark "?" and stop characters such as "是". Deleting the irrelevant characters contained in this to-be-matched character sequence yields "3台电脑多少钱".
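Step 404 can be approximated with the standard library alone, as sketched below. The stop list is a hypothetical stand-in for the preset stop character library; "请问" is included as an assumption because it is absent from the result in the example. Punctuation is detected through its Unicode category.

```python
import unicodedata
from typing import Iterable

# Hypothetical stop character library, for illustration only.
STOP_WORDS = ("请问", "此外", "但是", "啊", "吧", "哎", "的", "是")


def delete_irrelevant(text: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Remove stop words, then remove every punctuation character."""
    # Longer stop words first, so a short one cannot split a longer one.
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    # Unicode categories starting with "P" cover ASCII and full-width marks.
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
```

Plain substring replacement will also remove a stop character that happens to sit inside a longer word, so a production filter would more likely run after or together with word segmentation.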
Step 406: Perform word segmentation separately on the to-be-matched character sequence and the target character sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
Word segmentation is the process of converting a character sequence into a word sequence according to certain rules. The to-be-matched word sequence is a sequence formed in order in units of to-be-matched words. The target word sequence is a sequence formed in order in units of target words.
For example, performing word segmentation on the to-be-matched character sequence after the irrelevant characters are deleted yields the to-be-matched word sequence "3|台|电脑|多少|钱", where "|" is a word separator that distinguishes the different words in the to-be-matched word sequence. This to-be-matched word sequence contains five to-be-matched words: "3", "台" (a measure word), "电脑" (computer), "多少" (how much), and "钱" (money).
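The source does not name a segmentation algorithm. As one simple possibility, the sketch below uses forward maximum matching against a small hypothetical vocabulary: at each position it takes the longest vocabulary word that fits and falls back to a single character. A real system would more likely use a dedicated segmenter.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching; unknown characters become one-character words."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical vocabulary covering the example sentences.
VOCABULARY = {"电脑", "多少", "计算机", "价格"}
```

Joining the result with "|" reproduces the separator notation used in the example, for instance `"|".join(segment("3台电脑多少钱", VOCABULARY))`.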
Step 408: Calculate the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence by an edit distance formula, to obtain the edit distance between the to-be-matched word sequence and the target word sequence.
The edit distance formula is a formula for calculating the minimum number of edit operations, in units of words, required to convert one of two word sequences into the other. This minimum number of edit operations is the edit distance. The permitted edit operations are replacing one word with another word, inserting a word, and deleting a word.
For example, the to-be-matched word sequence is "3|台|电脑|多少|钱" and the target word sequence is "3|台|计算机|价格". Converting the to-be-matched word sequence into the target word sequence requires three operations: replacing "电脑" (computer) with "计算机" (computer), deleting "多少" (how much), and replacing "钱" (money) with "价格" (price). A synonym library can also be preset; because "电脑" is equivalent to "计算机", the replacement of "电脑" with "计算机" need not be counted in the edit distance.
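The worked example, including the optional synonym library, can be reproduced with the sketch below. The synonym library is modelled as a hypothetical mapping from a word to its canonical form, so that synonyms compare as equal and their replacement costs nothing.

```python
from typing import Dict, Optional, Sequence


def edit_distance_with_synonyms(
        s: Sequence[str], t: Sequence[str],
        synonyms: Optional[Dict[str, str]] = None) -> int:
    """Word-level edit distance in which synonyms count as the same word."""
    canon = synonyms or {}
    a = [canon.get(w, w) for w in s]  # map each word to its canonical form
    b = [canon.get(w, w) for w in t]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + (0 if x == y else 1)))  # replace
        prev = curr
    return prev[-1]


S = ["3", "台", "电脑", "多少", "钱"]
T = ["3", "台", "计算机", "价格"]
SYNONYMS = {"计算机": "电脑"}  # hypothetical synonym library entry
```

Without the synonym library the distance between S and T is 3, as in the example; with it, the "电脑" to "计算机" replacement is free and the distance drops to 2.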
Step 410: Acquire a first quantity, which is the number of to-be-matched words contained in the to-be-matched word sequence, and a second quantity, which is the number of target words contained in the target word sequence.
Step 412: Calculate the first similarity according to the edit distance, the first quantity, and the second quantity.
For example, the to-be-matched word sequence S of length |S| contains |S| to-be-matched words, and the target word sequence T of length |T| contains |T| target words. The edit distance $lev_{S,T}(i,j)$ between the to-be-matched word sequence S and the target word sequence T can be calculated by the formula
$$lev_{S,T}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\ \min\begin{cases}lev_{S,T}(i-1,j)+1\\ lev_{S,T}(i,j-1)+1\\ lev_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\end{cases}, & \text{otherwise}\end{cases}$$
Here i denotes the i-th to-be-matched word in the to-be-matched word sequence S, and j denotes the j-th target word in the target word sequence T. When at least one of i and j is 0, the edit distance $lev_{S,T}(i,j)$ takes the maximum of i and j; otherwise, it takes the minimum of $lev_{S,T}(i,j-1)+1$, $lev_{S,T}(i-1,j)+1$, and $lev_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}$, where the indicator $1_{(S_i\neq T_j)}$ equals 1 when the i-th to-be-matched word differs from the j-th target word and 0 when the two words are the same. Based on the calculated edit distance, the first similarity $sim1_{S,T}(i,j)$ can be calculated by the similarity formula
$$sim1_{S,T}(i,j)=1-\frac{lev_{S,T}(i,j)}{Max(|S|,|T|)}$$
Max(|S|,|T|) denotes the larger of |S| and |T|, and the first similarity $sim1_{S,T}(i,j)$ takes a value from 0 to 1.
Step 414: Extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set.
For example, if the to-be-matched word sequence is "3|台|电脑|多少|钱", all the to-be-matched words contained in it can be extracted to form the to-be-matched word set {"3", "台", "电脑", "多少", "钱"}. The five to-be-matched words "3", "台", "电脑", "多少", and "钱" are in a parallel relationship and are unordered.
Step 416: Match the to-be-matched word set against the target word set, and count the number of matches between to-be-matched words and target words.
A word tree can be constructed from the hypernym-hyponym hierarchical relationships of the words in a semantic dictionary, and the second sub-similarity between a to-be-matched word and a target word is calculated from the path distance between the two words in the word tree. A to-be-matched word and a target word whose second sub-similarity is greater than the preset sub-similarity threshold are determined to match, and the number of such matches between the to-be-matched word set and the target word set is counted.
Step 418: Count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set.
Step 420: Calculate the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
For example, if the to-be-matched word set is {"电脑", "多少", "钱"} and the target word set is {"计算机", "价格"}, the following can be calculated: the similarity $sim_{11}$ between "电脑" and "计算机", the similarity $sim_{12}$ between "电脑" and "价格", the similarity $sim_{21}$ between "多少" and "计算机", the similarity $sim_{22}$ between "多少" and "价格", the similarity $sim_{31}$ between "钱" and "计算机", and the similarity $sim_{32}$ between "钱" and "价格". For each to-be-matched word, the largest second sub-similarity obtained against the target words of the target word set is taken, and these maxima are multiplied together to obtain the second similarity sim2. For instance, if the largest second sub-similarities corresponding to "电脑", "多少", and "钱" are $sim_{11}$, $sim_{22}$, and $sim_{32}$ respectively, the second similarity is $sim2=sim_{11}\times sim_{22}\times sim_{32}$.
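The product of per-word maxima in this example can be sketched as follows. The sub-similarity values in the table are made-up numbers chosen only so that the maxima fall on the pairs named in the example; they are not taken from the source.

```python
import math
from typing import Callable, Iterable

# Made-up second sub-similarities, for illustration only.
SUB_SIMILARITY = {
    ("电脑", "计算机"): 0.9, ("电脑", "价格"): 0.1,
    ("多少", "计算机"): 0.1, ("多少", "价格"): 0.6,
    ("钱", "计算机"): 0.1, ("钱", "价格"): 0.8,
}


def second_similarity_by_product(
        s_words: Iterable[str], t_words: Iterable[str],
        sub_similarity: Callable[[str, str], float]) -> float:
    """Multiply, over the to-be-matched words, the best score against any target."""
    s_list, t_list = list(s_words), list(t_words)
    if not s_list or not t_list:
        return 0.0
    return math.prod(
        max(sub_similarity(s, t) for t in t_list) for s in s_list
    )


def lookup(a: str, b: str) -> float:
    """Read a sub-similarity from the illustrative table."""
    return SUB_SIMILARITY[(a, b)]
```

With these numbers the maxima are 0.9, 0.6, and 0.8, so the sketch returns 0.9 × 0.6 × 0.8 = 0.432. Because the maxima are multiplied, a single to-be-matched word with no close target pulls the whole second similarity toward 0.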
Step 422: Calculate the text similarity of the to-be-matched character sequence and the target character sequence according to the first similarity and the second similarity.
For example, after the first similarity sim1 and the second similarity sim2 are calculated, they can be multiplied by the corresponding first weight w1 and second weight w2 respectively and then summed, giving the text similarity sim(S,T) = sim1 × w1 + sim2 × w2.
In the above text similarity calculation method, after the to-be-matched character sequence and the target character sequence are acquired, they are preprocessed to obtain a to-be-matched word sequence and a target word sequence, each formed in order in units of words. A first similarity is calculated by a first similarity algorithm that takes word order into account. A to-be-matched word set and a target word set are then formed from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence, respectively, and a second similarity is calculated by a second similarity algorithm that does not take word order into account. The first similarity and the second similarity are then combined to obtain the text similarity between the to-be-matched character sequence and the target character sequence. Calculating similarity in units of words and combining two similarity algorithms reduces the error caused by applying a single similarity algorithm to single characters, and improves the accuracy of the text similarity calculation.
It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 4 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be performed at different moments, and they are not necessarily performed one after another but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
在其中一个实施例中,如图5所示,提供了一种文本相似度计算装置500,该装置包括:字符序列获取模块502,用于获取待匹配字符序列和目标字符序列;词序列获取模块504,用于对待匹配字符序列和目标字符序列分别进行预处理,得到相应的待匹配词序列和目标词序列;第一相似度计算模块506,用于将待匹配词序列中包含的待匹配词和目标词序列中包含的目标词通过第一相似度算法进行计算,得到第一相似度;词集合形成模块508,用于提取所有待匹配词形成待匹配词集合,并提取所有目标词形成目标词集合;第二相似度计算模块510,用于将待匹配词集合和目标词集合通过第二相似度算法进行计算,得到第二相似度;文本相似度计算模块512,用于根据第一相似度和第二相似度进行计算,得到待匹配字符序列和目标字符序列的文本相似度。In one embodiment, as shown in FIG. 5, a text similarity calculation device 500 is provided. The device includes: a character sequence acquisition module 502, configured to acquire a character sequence to be matched and a target character sequence; and a word sequence acquisition module. 504. Perform pre-processing on the sequence of the matched character and the target character, respectively, to obtain a sequence of the to-be-matched word and the target word. The first similarity calculation module 506 is configured to match the to-be-matched word included in the sequence of the word to be matched. And the target word included in the target word sequence is calculated by the first similarity algorithm to obtain a first similarity; the word set forming module 508 is configured to extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target. a second similarity calculation module 510, configured to calculate a to-be-matched word set and a target word set by using a second similarity algorithm to obtain a second similarity; a text similarity calculation module 512, configured to use the first similarity The degree and the second similarity are calculated to obtain the text similarity of the character sequence to be matched and the target character sequence.
In one embodiment, the word sequence acquisition module 504 is further configured to: delete the irrelevant characters contained in the to-be-matched character sequence and in the target character sequence; and perform word segmentation separately on the to-be-matched character sequence and the target character sequence from which the irrelevant characters have been deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
In one embodiment, the irrelevant characters include stop characters and identical characters. The word sequence acquisition module 504 is further configured to: delete the stop characters contained in the to-be-matched character sequence and in the target character sequence; determine whether identical characters exist in the two sequences after the stop characters are deleted, where identical characters are the same characters located at the same positions in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and, if so, delete those identical characters from both sequences to obtain the corresponding to-be-matched word sequence and target word sequence.
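The two deletion steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the stop-character list `STOP_CHARS` and all function names are hypothetical, and the word segmentation step that produces the final word sequences is not shown.

```python
from typing import Tuple

# Hypothetical stop-character list; the source does not enumerate one.
STOP_CHARS = set(",。!?、,.!? ")


def remove_stop_chars(text: str) -> str:
    """Delete every stop character from the text."""
    return "".join(ch for ch in text if ch not in STOP_CHARS)


def remove_same_position_chars(a: str, b: str) -> Tuple[str, str]:
    """Delete characters that are identical at the same index in both strings."""
    shared = {i for i, (x, y) in enumerate(zip(a, b)) if x == y}
    kept_a = "".join(ch for i, ch in enumerate(a) if i not in shared)
    kept_b = "".join(ch for i, ch in enumerate(b) if i not in shared)
    return kept_a, kept_b


def preprocess(a: str, b: str) -> Tuple[str, str]:
    """Remove stop characters first, then identical same-position characters."""
    return remove_same_position_chars(remove_stop_chars(a), remove_stop_chars(b))
```

In this sketch the position comparison happens only after the stop characters are removed, matching the order stated above; characters beyond the length of the shorter string are always kept.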
In one embodiment, the first similarity calculation module 506 is further configured to: calculate, by an edit distance formula, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain the edit distance between the to-be-matched word sequence and the target word sequence; acquire a first quantity, which is the number of to-be-matched words contained in the to-be-matched word sequence, and a second quantity, which is the number of target words contained in the target word sequence; and calculate the first similarity according to the edit distance, the first quantity, and the second quantity.
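A word-level edit distance and a similarity derived from it might look like the sketch below. This passage names the inputs (the edit distance, the first quantity, and the second quantity) but does not state how they are combined, so the normalization `1 - d / max(n1, n2)` is an assumption, as are the function names.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each element (here, a word) is one unit."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def first_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 1 - distance / max(first quantity, second quantity)."""
    n1, n2 = len(a), len(b)
    if max(n1, n2) == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(a, b) / max(n1, n2)
```

Because the distance can never exceed the length of the longer sequence, this particular normalization keeps the result in the range 0 to 1.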
In one embodiment, the second similarity calculation module 510 is further configured to: match the to-be-matched word set against the target word set and count the number of matches between the to-be-matched words and the target words; count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculate the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
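The set-based comparison can be illustrated with a Jaccard-style ratio. The passage above specifies only the three inputs (the number of matches and the two set sizes); the exact formula used here, matches divided by the size of the union, is an assumption chosen for illustration.

```python
from typing import Iterable


def second_similarity(words_a: Iterable[str], words_b: Iterable[str]) -> float:
    """Assumed Jaccard-style formula: matches / (|A| + |B| - matches)."""
    set_a, set_b = set(words_a), set(words_b)
    matches = len(set_a & set_b)                 # words present in both sets
    union = len(set_a) + len(set_b) - matches
    return matches / union if union else 1.0     # two empty sets count as identical
```

Unlike the edit distance, this measure ignores word order, so two sequences containing the same words in a different order still score 1.0.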
In one embodiment, the apparatus further includes a third similarity calculation module 514, configured to: acquire a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and calculate, by the first similarity algorithm, the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence to obtain a third similarity. In this case, calculating the text similarity between the to-be-matched character sequence and the target character sequence according to the first similarity and the second similarity includes: calculating the text similarity according to the first similarity, the second similarity, and the third similarity.
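The pinyin comparison and the final combination could be sketched as below. Everything here is illustrative: the `PINYIN` table is a toy, tone-free lookup standing in for a real pinyin converter, the edit-distance normalization is assumed, and the weights in `text_similarity` are hypothetical, since this passage does not say how the three similarities are combined.

```python
from typing import List, Sequence, Tuple

# Toy, tone-free lookup table (hypothetical); a real system would use a full converter.
PINYIN = {"保": "bao", "宝": "bao", "险": "xian", "现": "xian", "车": "che"}


def to_pinyin(text: str) -> List[str]:
    """Map each character to its pinyin syllable; unknown characters pass through."""
    return [PINYIN.get(ch, ch) for ch in text]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over sequence elements (here, pinyin syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def third_similarity(text_a: str, text_b: str) -> float:
    """Edit-distance similarity over pinyin sequences (assumed normalization)."""
    a, b = to_pinyin(text_a), to_pinyin(text_b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def text_similarity(s1: float, s2: float, s3: float,
                    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Hypothetical weighted average of the first, second and third similarities."""
    return sum(w * s for w, s in zip(weights, (s1, s2, s3)))
```

Comparing pinyin rather than characters lets homophone substitutions, such as 宝 typed in place of 保, count as matches in this sketch.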
In one embodiment, the character sequence acquisition module 502 is further configured to receive a to-be-matched character sequence sent by a terminal and to acquire a plurality of target character sequences from a database according to the to-be-matched character sequence. The apparatus further includes a related resource sending module, configured to query the related resources corresponding to the target character sequences whose text similarity is greater than a preset similarity threshold and to send the related resources to the terminal.
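The threshold-based lookup described above can be sketched as a simple filter. The in-memory dictionary standing in for the database, the resource identifiers, the default threshold of 0.6, and the stand-in similarity function `token_overlap` are all assumptions made for illustration; any text similarity function returning a value between 0 and 1 could be passed in.

```python
from typing import Callable, Dict, List


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (hypothetical): Jaccard overlap of whitespace tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def find_related_resources(query: str,
                           resources: Dict[str, str],
                           similarity: Callable[[str, str], float],
                           threshold: float = 0.6) -> List[str]:
    """Return resources whose target text scores strictly above the threshold."""
    return [resource for target, resource in resources.items()
            if similarity(query, target) > threshold]
```

A short usage example, with made-up data:

```python
faq = {"car insurance": "faq-1", "home loan": "faq-2"}
related = find_related_resources("car insurance claim", faq, token_overlap)
```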
For the specific limitations of the text similarity calculation apparatus, refer to the limitations of the text similarity calculation method above; they are not repeated here. Each module of the above text similarity calculation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile computer-readable storage medium and an internal memory. The non-volatile computer-readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile computer-readable storage medium. The database of the computer device is used to store target character sequences. The network interface of the computer device is used to communicate with an external terminal over a network connection. The computer-readable instructions, when executed by the processor, implement a text similarity calculation method.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processor, implement the steps of the text similarity calculation method provided in any embodiment of the present application.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. The computer-readable instructions, when executed by one or more processors, cause the one or more processors to implement the steps of the text similarity calculation method provided in any embodiment of the present application.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be completed by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to a memory, storage, database, or other medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent application shall be determined by the appended claims.

Claims (20)

  1. A text similarity calculation method, comprising:
    acquiring a to-be-matched character sequence and a target character sequence;
    preprocessing the to-be-matched character sequence and the target character sequence separately to obtain a corresponding to-be-matched word sequence and target word sequence;
    calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity;
    extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set;
    calculating, by a second similarity algorithm, the to-be-matched word set and the target word set to obtain a second similarity; and
    calculating, according to the first similarity and the second similarity, a text similarity between the to-be-matched character sequence and the target character sequence.
  2. The method according to claim 1, wherein the preprocessing of the to-be-matched character sequence and the target character sequence separately to obtain the corresponding to-be-matched word sequence and target word sequence comprises:
    deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and
    performing word segmentation separately on the to-be-matched character sequence and the target character sequence from which the irrelevant characters have been deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  3. The method according to claim 2, wherein the irrelevant characters comprise stop characters and identical characters, and the deleting of the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence comprises:
    deleting the stop characters contained in the to-be-matched character sequence and the stop characters contained in the target character sequence;
    determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, the identical characters being the same characters located at the same positions in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and
    if so, deleting the identical characters from the to-be-matched character sequence and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  4. The method according to claim 1, wherein the calculating, by the first similarity algorithm, of the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain the first similarity comprises:
    calculating, by an edit distance formula, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain an edit distance between the to-be-matched word sequence and the target word sequence;
    acquiring a first quantity of the to-be-matched words contained in the to-be-matched word sequence and a second quantity of the target words contained in the target word sequence; and
    calculating the first similarity according to the edit distance, the first quantity, and the second quantity.
  5. The method according to claim 1, wherein the calculating, by the second similarity algorithm, of the to-be-matched word set and the target word set to obtain the second similarity comprises:
    matching the to-be-matched word set against the target word set, and counting the number of matches between the to-be-matched words and the target words;
    counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and
    calculating the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
  6. The method according to any one of claims 1 to 5, further comprising, after the acquiring of the to-be-matched character sequence and the target character sequence:
    acquiring a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and
    calculating, by the first similarity algorithm, the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence to obtain a third similarity;
    wherein the calculating, according to the first similarity and the second similarity, of the text similarity between the to-be-matched character sequence and the target character sequence comprises:
    calculating, according to the first similarity, the second similarity, and the third similarity, the text similarity between the to-be-matched character sequence and the target character sequence.
  7. The method according to any one of claims 1 to 5, wherein the acquiring of the to-be-matched character sequence and the target character sequence comprises:
    receiving the to-be-matched character sequence sent by a terminal;
    acquiring a plurality of target character sequences from a database according to the to-be-matched character sequence;
    and wherein, after the calculating, according to the first similarity and the second similarity, of the text similarity between the to-be-matched character sequence and the target character sequence, the method further comprises:
    querying related resources corresponding to the target character sequences whose text similarity is greater than a preset similarity threshold; and
    sending the related resources to the terminal.
  8. A text similarity calculation apparatus, comprising:
    a character sequence acquisition module, configured to acquire a to-be-matched character sequence and a target character sequence;
    a word sequence acquisition module, configured to preprocess the to-be-matched character sequence and the target character sequence separately to obtain a corresponding to-be-matched word sequence and target word sequence;
    a first similarity calculation module, configured to calculate, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity;
    a word set forming module, configured to extract all the to-be-matched words to form a to-be-matched word set and to extract all the target words to form a target word set;
    a second similarity calculation module, configured to calculate, by a second similarity algorithm, the to-be-matched word set and the target word set to obtain a second similarity; and
    a text similarity calculation module, configured to calculate, according to the first similarity and the second similarity, a text similarity between the to-be-matched character sequence and the target character sequence.
  9. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    acquiring a to-be-matched character sequence and a target character sequence;
    preprocessing the to-be-matched character sequence and the target character sequence separately to obtain a corresponding to-be-matched word sequence and target word sequence;
    calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity;
    extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set;
    calculating, by a second similarity algorithm, the to-be-matched word set and the target word set to obtain a second similarity; and
    calculating, according to the first similarity and the second similarity, a text similarity between the to-be-matched character sequence and the target character sequence.
  10. The computer device according to claim 9, wherein the step of preprocessing the to-be-matched character sequence and the target character sequence separately to obtain the corresponding to-be-matched word sequence and target word sequence comprises performing the following steps:
    deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and
    performing word segmentation separately on the to-be-matched character sequence and the target character sequence from which the irrelevant characters have been deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  11. The computer device according to claim 10, wherein the irrelevant characters comprise stop characters and identical characters, and the step of deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence comprises performing the following steps:
    deleting the stop characters contained in the to-be-matched character sequence and the stop characters contained in the target character sequence;
    determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, the identical characters being the same characters located at the same positions in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and
    if so, deleting the identical characters from the to-be-matched character sequence and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  12. The computer device according to claim 9, wherein the step of calculating, by the first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain the first similarity comprises performing the following steps:
    calculating, by an edit distance formula, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain an edit distance between the to-be-matched word sequence and the target word sequence;
    acquiring a first quantity of the to-be-matched words contained in the to-be-matched word sequence and a second quantity of the target words contained in the target word sequence; and
    calculating the first similarity according to the edit distance, the first quantity, and the second quantity.
  13. The computer device according to claim 9, wherein the step of calculating, by the second similarity algorithm, the to-be-matched word set and the target word set to obtain the second similarity comprises performing the following steps:
    matching the to-be-matched word set against the target word set, and counting the number of matches between the to-be-matched words and the target words;
    counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and
    calculating the second similarity according to the number of matches, the number of to-be-matched words, and the number of target words.
  14. The computer device according to any one of claims 9 to 13, wherein, after the step of acquiring the to-be-matched character sequence and the target character sequence, the following steps are further performed:
    acquiring a to-be-matched pinyin sequence corresponding to the to-be-matched character sequence and a target pinyin sequence corresponding to the target character sequence; and
    calculating, by the first similarity algorithm, the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence to obtain a third similarity;
    wherein the calculating, according to the first similarity and the second similarity, of the text similarity between the to-be-matched character sequence and the target character sequence comprises:
    calculating, according to the first similarity, the second similarity, and the third similarity, the text similarity between the to-be-matched character sequence and the target character sequence.
  15. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a to-be-matched character sequence and a target character sequence;
    preprocessing the to-be-matched character sequence and the target character sequence separately to obtain a corresponding to-be-matched word sequence and target word sequence;
    calculating, by a first similarity algorithm, the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence to obtain a first similarity;
    extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set;
    calculating, by a second similarity algorithm, the to-be-matched word set and the target word set to obtain a second similarity; and
    calculating, according to the first similarity and the second similarity, a text similarity between the to-be-matched character sequence and the target character sequence.
  16. The storage medium according to claim 15, wherein the step of preprocessing the to-be-matched character sequence and the target character sequence separately to obtain the corresponding to-be-matched word sequence and target word sequence comprises performing the following steps:
    deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence; and
    performing word segmentation separately on the to-be-matched character sequence and the target character sequence from which the irrelevant characters have been deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  17. The storage medium according to claim 16, wherein the irrelevant characters comprise stop characters and identical characters, and the step of deleting the irrelevant characters contained in the to-be-matched character sequence and the irrelevant characters contained in the target character sequence comprises performing the following steps:
    deleting the stop characters contained in the to-be-matched character sequence and the stop characters contained in the target character sequence;
    determining whether identical characters exist in the to-be-matched character sequence and the target character sequence after the stop characters are deleted, the identical characters being the same characters located at the same positions in the to-be-matched character sequence and the target character sequence after the stop characters are deleted; and
    if so, deleting the identical characters from the to-be-matched character sequence and the target character sequence after the stop characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  18. The storage medium according to claim 15, wherein the step of calculating a first similarity from the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence by a first similarity algorithm comprises performing the following steps:
    calculating, by an edit distance formula, the edit distance between the word sequence to be matched and the target word sequence from the words to be matched contained in the word sequence to be matched and the target words contained in the target word sequence;
    obtaining a first number, being the number of words to be matched contained in the word sequence to be matched, and a second number, being the number of target words contained in the target word sequence; and
    calculating the first similarity from the edit distance, the first number, and the second number.
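A minimal sketch of claim 18 is given below. The edit distance is the standard Levenshtein distance computed over word tokens. The claim does not state how the edit distance and the two counts are combined, so the normalization `1 - distance / max(first, second)` is an assumption chosen for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # replace x with y, or keep
        prev = cur
    return prev[-1]


def first_similarity(source_words, target_words):
    first, second = len(source_words), len(target_words)
    longest = max(first, second)
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    # Assumed normalization: the claim only names the three inputs.
    return 1 - edit_distance(source_words, target_words) / longest
```

Dividing by the longer sequence keeps the result in the range 0 to 1, since the edit distance can never exceed the length of the longer sequence.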
  19. The storage medium according to claim 15, wherein the step of calculating a second similarity from the word set to be matched and the target word set by a second similarity algorithm comprises performing the following steps:
    matching the word set to be matched against the target word set, and counting the number of matches between the words to be matched and the target words;
    counting the number of words to be matched in the word set to be matched and the number of target words in the target word set; and
    calculating the second similarity from the number of matches, the number of words to be matched, and the number of target words.
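The set-based comparison of claim 19 might look like the sketch below. The claim names the three counts but not the formula that combines them, so the Dice-style ratio `2 * matches / (source_count + target_count)` is an assumption, and `second_similarity` is a hypothetical name.

```python
def second_similarity(source_words, target_words):
    """Order-insensitive similarity between two word sets."""
    source_set, target_set = set(source_words), set(target_words)
    if not source_set and not target_set:
        return 1.0  # two empty sets are treated as identical

    matches = len(source_set & target_set)  # words present in both sets
    # Assumed combination (Dice coefficient); the claim does not fix the formula.
    return 2 * matches / (len(source_set) + len(target_set))
```

Because it ignores word order, this measure complements the edit-distance similarity, which is sensitive to position.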
  20. The storage medium according to any one of claims 15 to 19, further comprising, after the step of obtaining the character sequence to be matched and the target character sequence, performing the following steps:
    obtaining a pinyin sequence to be matched corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence; and
    calculating a third similarity from the pinyin to be matched contained in the pinyin sequence to be matched and the target pinyin contained in the target pinyin sequence by the first similarity algorithm;
    wherein the step of calculating the text similarity of the character sequence to be matched and the target character sequence from the first similarity and the second similarity comprises:
    calculating the text similarity of the character sequence to be matched and the target character sequence from the first similarity, the second similarity, and the third similarity.
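Claim 20 can be illustrated with the sketch below, which compares pinyin tokens with an edit-distance similarity and then merges three scores. Several parts are assumptions: the `PINYIN` table is a tiny hypothetical stand-in for a real pronunciation dictionary, the normalization by the longer sequence is not given in the claim, and the equal weights in `text_similarity` are illustrative only, since the claim does not say how the three similarities are combined.

```python
# Hypothetical character-to-pinyin table; a real system would use a full dictionary.
PINYIN = {"张": "zhang", "章": "zhang", "三": "san", "山": "shan"}


def to_pinyin(chars):
    """Map each character to its pinyin; unknown characters are kept unchanged."""
    return [PINYIN.get(c, c) for c in chars]


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def third_similarity(source, target):
    """Edit-distance similarity over pinyin tokens (assumed normalization)."""
    a, b = to_pinyin(source), to_pinyin(target)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def text_similarity(first, second, third, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the three similarities (weights are an assumption)."""
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third
```

With this table, "张三" and "章三" differ as characters but share the pinyin tokens `zhang` and `san`, so the third similarity rates them as identical; this is the kind of homophone match the pinyin step is meant to capture.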
PCT/CN2018/099994 2018-01-12 2018-08-10 Text similarity calculation method and device, computer apparatus, and storage medium WO2019136993A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810031770.0 2018-01-12
CN201810031770.0A CN108304378B (en) 2018-01-12 2018-01-12 Text similarity computing method, apparatus, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019136993A1 true WO2019136993A1 (en) 2019-07-18

Family

ID=62868820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099994 WO2019136993A1 (en) 2018-01-12 2018-08-10 Text similarity calculation method and device, computer apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN108304378B (en)
WO (1) WO2019136993A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738202A (en) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Character recognition method, device and computer readable storage medium
CN110765767A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Extraction method, device, server and storage medium of local optimization keywords
CN111274366A (en) * 2020-03-25 2020-06-12 联想(北京)有限公司 Search and recommend methods and devices, equipment, and storage media
CN111767706A (en) * 2020-06-19 2020-10-13 北京工业大学 Method, device, electronic device and medium for calculating text similarity
CN111797082A (en) * 2020-05-29 2020-10-20 深圳壹账通智能科技有限公司 Data deduplication method, device and computer equipment based on field analysis
CN112149414A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium
CN112287657A (en) * 2020-11-19 2021-01-29 每日互动股份有限公司 Information matching system based on text similarity
CN113011178A (en) * 2021-03-29 2021-06-22 广州博冠信息科技有限公司 Text generation method, text generation device, electronic device and storage medium
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
CN113268972A (en) * 2021-05-14 2021-08-17 东莞理工学院城市学院 Intelligent calculation method, system, equipment and medium for appearance similarity of two English words
CN113779183A (en) * 2020-06-08 2021-12-10 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN113821587A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Text relevance determination method, model training method, device and storage medium
CN114564936A (en) * 2022-03-15 2022-05-31 北京梆梆安全科技有限公司 Method and device for detecting infringement of text, electronic equipment and storage medium
CN114637812A (en) * 2020-12-15 2022-06-17 顺丰恒通支付有限公司 Logistics information-based logistics subject matching method and device and computer equipment
CN116136839A (en) * 2023-04-17 2023-05-19 湖南正宇软件技术开发有限公司 Method, system and related equipment for generating legal document face manuscript
CN116881437A (en) * 2023-09-08 2023-10-13 北京睿企信息科技有限公司 Data processing system for acquiring text set

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304378B (en) * 2018-01-12 2019-09-24 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium
CN109189907A (en) * 2018-08-22 2019-01-11 山东浪潮通软信息科技有限公司 A kind of search method and device based on semantic matches
CN110083834B (en) * 2019-04-24 2023-05-09 北京百度网讯科技有限公司 Semantic matching model training method and device, electronic equipment and storage medium
CN110287286B (en) * 2019-06-13 2022-03-08 北京百度网讯科技有限公司 Method and device for determining similarity of short texts and storage medium
CN110633356B (en) * 2019-09-04 2022-05-20 广州市巴图鲁信息科技有限公司 Word similarity calculation method and device and storage medium
CN110717158B (en) * 2019-09-06 2024-03-01 冉维印 Information verification method, device, equipment and computer readable storage medium
CN112825090B (en) * 2019-11-21 2024-01-05 腾讯科技(深圳)有限公司 Method, device, equipment and medium for determining interest points
CN111159339A (en) * 2019-12-24 2020-05-15 北京亚信数据有限公司 Text matching processing method and device
CN111382563B (en) * 2020-03-20 2023-09-08 腾讯科技(深圳)有限公司 Text relevance determining method and device
CN111898376B (en) * 2020-07-01 2024-04-26 拉扎斯网络科技(上海)有限公司 Name data processing method and device, storage medium and computer equipment
CN112765962B (en) * 2021-01-15 2022-08-30 上海微盟企业发展有限公司 Text error correction method, device and medium
CN113032519A (en) * 2021-01-22 2021-06-25 中国平安人寿保险股份有限公司 Sentence similarity judgment method and device, computer equipment and storage medium
CN113627722B (en) * 2021-07-02 2024-04-02 湖北美和易思教育科技有限公司 Simple answer scoring method based on keyword segmentation, terminal and readable storage medium
CN113420234B (en) * 2021-07-02 2022-08-02 青海师范大学 Microblog data acquisition method and system
CN113569036A (en) * 2021-07-20 2021-10-29 上海明略人工智能(集团)有限公司 Recommendation method and device for media information and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090175545A1 (en) * 2008-01-04 2009-07-09 Xerox Corporation Method for computing similarity between text spans using factored word sequence kernels
CN103176962A (en) * 2013-03-08 2013-06-26 深圳先进技术研究院 Statistical method and statistical system of text similarity
CN107491425A (en) * 2017-07-26 2017-12-19 合肥美的智能科技有限公司 Determine method, determining device, computer installation and computer-readable recording medium
CN108304378A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123618B (en) * 2011-11-21 2016-09-14 北京新媒传信科技有限公司 Text similarity acquisition methods and device
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 A Calculation Method of Text Similarity
US9535899B2 (en) * 2013-02-20 2017-01-03 International Business Machines Corporation Automatic semantic rating and abstraction of literature
CN104216968A (en) * 2014-08-25 2014-12-17 华中科技大学 Rearrangement method and system based on document similarity

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738202A (en) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Character recognition method, device and computer readable storage medium
CN110765767A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Extraction method, device, server and storage medium of local optimization keywords
CN110765767B (en) * 2019-09-19 2024-01-19 平安科技(深圳)有限公司 Extraction method, device, server and storage medium of local optimization keywords
CN111274366A (en) * 2020-03-25 2020-06-12 联想(北京)有限公司 Search and recommend methods and devices, equipment, and storage media
CN111274366B (en) * 2020-03-25 2024-12-20 联想(北京)有限公司 Search recommendation method, device, equipment, and storage medium
CN111797082A (en) * 2020-05-29 2020-10-20 深圳壹账通智能科技有限公司 Data deduplication method, device and computer equipment based on field analysis
CN113779183A (en) * 2020-06-08 2021-12-10 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN113779183B (en) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN111767706A (en) * 2020-06-19 2020-10-13 北京工业大学 Method, device, electronic device and medium for calculating text similarity
CN112149414A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium
CN112149414B (en) * 2020-09-23 2023-06-23 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium
CN112287657B (en) * 2020-11-19 2024-01-30 每日互动股份有限公司 Information matching system based on text similarity
CN112287657A (en) * 2020-11-19 2021-01-29 每日互动股份有限公司 Information matching system based on text similarity
CN114637812A (en) * 2020-12-15 2022-06-17 顺丰恒通支付有限公司 Logistics information-based logistics subject matching method and device and computer equipment
CN113011178B (en) * 2021-03-29 2023-05-16 广州博冠信息科技有限公司 Text generation method, text generation device, electronic device and storage medium
CN113011178A (en) * 2021-03-29 2021-06-22 广州博冠信息科技有限公司 Text generation method, text generation device, electronic device and storage medium
CN113076748B (en) * 2021-04-16 2024-01-19 平安国际智慧城市科技股份有限公司 Bullet screen sensitive word processing method, device, equipment and storage medium
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
CN113268972A (en) * 2021-05-14 2021-08-17 东莞理工学院城市学院 Intelligent calculation method, system, equipment and medium for appearance similarity of two English words
CN113821587A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Text relevance determination method, model training method, device and storage medium
CN113821587B (en) * 2021-06-02 2024-05-17 腾讯科技(深圳)有限公司 Text relevance determining method, model training method, device and storage medium
CN114564936A (en) * 2022-03-15 2022-05-31 北京梆梆安全科技有限公司 Method and device for detecting infringement of text, electronic equipment and storage medium
CN116136839A (en) * 2023-04-17 2023-05-19 湖南正宇软件技术开发有限公司 Method, system and related equipment for generating legal document face manuscript
CN116881437B (en) * 2023-09-08 2023-12-01 北京睿企信息科技有限公司 Data processing system for acquiring text set
CN116881437A (en) * 2023-09-08 2023-10-13 北京睿企信息科技有限公司 Data processing system for acquiring text set

Also Published As

Publication number Publication date
CN108304378A (en) 2018-07-20
CN108304378B (en) 2019-09-24

Similar Documents

Publication Publication Date Title
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
WO2022142027A1 (en) Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2021114810A1 (en) Graph structure-based official document recommendation method, apparatus, computer device, and medium
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US11157816B2 (en) Systems and methods for selecting and generating log parsers using neural networks
WO2019153551A1 (en) Article classification method and apparatus, computer device and storage medium
WO2020057022A1 (en) Associative recommendation method and apparatus, computer device, and storage medium
US20200184307A1 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
JP2020123318A (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program for determining text relevance
CN108595695A (en) Data processing method, device, computer equipment and storage medium
WO2021175005A1 (en) Vector-based document retrieval method and apparatus, computer device, and storage medium
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
CN111737997A (en) A text similarity determination method, device and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
CN112651236B (en) Method and device for extracting text information, computer equipment and storage medium
CN110134965B (en) Method, apparatus, device and computer readable storage medium for information processing
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
WO2022141872A1 (en) Document abstract generation method and apparatus, computer device, and storage medium
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
EP3640861A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
WO2022213864A1 (en) Corpus annotation method and apparatus, and related device
CN111625579B (en) Information processing method, device and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899704

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899704

Country of ref document: EP

Kind code of ref document: A1
