
CN103207921A - Method for automatically extracting terms from Chinese electronic document - Google Patents

Info

Publication number
CN103207921A
Authority
CN
China
Prior art keywords
word
words
atom
atomic
automatically
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101564948A
Other languages
Chinese (zh)
Inventor
于娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN2013101564948A
Publication of CN103207921A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for automatically extracting terms from a Chinese electronic document, comprising the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words with specific parts of speech. Step S02: count the frequencies of the atomic word strings and their substrings, and take every atomic word string that occurs more than N times as a candidate term, where N is a configurable parameter. Step S03: delete from the candidate set the terms that occur only as substrings, which yields the set of terms appearing in the document. The method addresses the practical problems of low performance and limited automation in automatic term extraction. Efficient automatic term extraction is a foundation for automatic text processing and supports applications such as information retrieval, text summarization, and content management. A good term extraction method can raise the degree of automation and the performance of such work.

Description

A method for automatically extracting words from Chinese electronic documents

Technical field

The invention belongs to the field of natural language processing and relates to a method for automatically extracting a set of words from a Chinese electronic document.

Background

In recent years, with the rapid development of scientific research, the economy, and the Internet, the number of electronic documents has grown at an accelerating pace. Processing this mass of electronic documents quickly and effectively has become a key task in information retrieval, knowledge management, Web services, and related fields. As a result, automatic document processing technologies such as text retrieval, classification, and automatic summarization have become research hotspots. Among these technologies, automatically extracting all the words in an electronic document (提词, hereafter "word extraction") is a foundational task. The word extraction method of the invention targets the automatic processing of Chinese electronic documents. Unless otherwise stated, "document" below means "Chinese electronic document" and "word" means "Chinese word".

Words (terms) in a document fall into two types according to whether they follow the Principle of Compositionality (the meaning of a complex expression is determined by the meanings of its constituents and the structure in which they are combined): atomic words and compound words. An atomic word (aw) is a short word that the language uses as a building block for new words and that does not itself follow the Principle of Compositionality, for example 系统 ("system") and 知识 ("knowledge"). A compound word (cw) is a longer, content-oriented word composed of several atomic words, and its formation generally follows the Principle of Compositionality, for example 系统工程 ("systems engineering") and 知识管理 ("knowledge management").

Atomic words can be extracted automatically with an atomic word dictionary. Because atomic words are relatively stable and new ones rarely appear, they can be extracted with dictionaries such as the Chinese Thesaurus (汉语主题词表) or the Chinese Classified Thesaurus (中国分类主题词表), with satisfactory precision and recall.

Methods for automatically extracting compound words fall into two main categories. The first is statistical, such as extraction based on string frequency and string length. The second is based on part-of-speech analysis, such as extracting compound words with part-of-speech-based word formation rules. Each has its own advantages and disadvantages.

The basic idea behind statistical extraction of compound words is that the more often adjacent Chinese characters co-occur, the more likely they are to form an independent word. The general procedure is therefore: (1) segment the electronic document with some algorithm to obtain every substring; (2) compute indicators for each substring, such as its frequency of occurrence or the probability that its left and right substrings occur on their own; (3) decide whether the substring forms an independent word according to whether these indicators reach a threshold. The advantage of this approach is that it does not rely on a dictionary and so is not limited by one; recall is generally high, and newly coined words can be extracted. Its disadvantages are: (1) statistical methods are generally suitable only for extracting words from large corpora; (2) precision and recall cannot both be guaranteed, since a threshold set for high precision inevitably lowers recall; (3) grammar and morphology are ignored when the document is segmented into substrings, so some substrings that are not words end up in the result, for example 系统工 (a fragment of 系统工程, "systems engineering") and 识管理 (a fragment of 知识管理, "knowledge management").

Methods based on part-of-speech analysis generally segment the corpus into atomic words using an atomic word dictionary and then take rule-selected combinations of atomic words (for example, multi-word nouns) as words. In the article "Research on Automatic Acquisition of Ontology Concepts Based on Rules and Statistics" (《基于规则与统计的本体概念自动获取方法研究》), Zhang Xin et al. proposed a term extraction method that uses part of speech to judge whether a Chinese character string forms an independent word. The advantage of part-of-speech-based extraction is high precision; the disadvantage is very low recall, which is bounded by the accuracy and completeness of the rule set.

To overcome the shortcomings of these compound word extraction methods and improve the performance of automatic word extraction, Yu Juan et al., in the article "A Word Extraction Method Combining Part-of-Speech Analysis and String Frequency Statistics" (《结合词性分析与串频统计的词语提取方法》), proposed a method that combines part-of-speech analysis of atomic words with frequency statistics of atomic word strings. Its basic idea is that atomic words with certain parts of speech are more likely to take part in word formation, and that atomic word strings with a high co-occurrence frequency are more likely to form a word. Accordingly, the method first processes the electronic document into a set of word strings composed of atomic words with specific parts of speech, then counts the frequencies of these word strings and their substrings, and finally obtains the strings that form words, thereby extracting the words in the document automatically. This method still has a flaw, however. Although its recall is satisfactory, the result set contains a large number of "half words" (半截词), such as 管理信息 ("management information", which occurs inside 管理信息系统, "management information system") and 全球性 ("global", which occurs inside 全球性企业, "global enterprise"), and these lower its precision.

Summary of the invention

In view of this, the object of the invention is to provide a method for automatically extracting words from a Chinese electronic document that solves the loss of precision caused by "half words" in the extraction result, so that a computer can extract the words in a Chinese electronic document automatically and efficiently.

The invention is implemented by the following scheme: a method for automatically extracting words from a Chinese electronic document, characterized by comprising the following steps:

Step S01: process the electronic document into a set of word strings composed of atomic words with specific parts of speech;

Step S02: count the frequencies of these atomic word strings and their substrings, and take atomic word strings that occur more than N times as candidate words, where N is a configurable parameter;

Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that appear in the document and thereby automatically extracting the words in the Chinese electronic document.

In an embodiment of the invention, step S01 is implemented by the following steps:

S011: perform atomic word segmentation and part-of-speech tagging on the electronic document to obtain a segmented and tagged document;

S012: delete useless atomic words to obtain a set of atomic word strings, removing two kinds of useless atomic words in the following two steps:

S0121: delete useless atomic words by part of speech: replace each atomic word that does not take part in word formation with a first predetermined symbol. In the output, a second predetermined symbol separates atomic words, and the first predetermined symbol separates atomic word strings;

S0122: delete further atomic words according to a stop list of atomic words: replace each stop atomic word with the first predetermined symbol, thereby generating a new ordered set of atomic word strings.

In an embodiment of the invention, the first predetermined symbol is a newline character and the second predetermined symbol is a space.

In an embodiment of the invention, the atomic word segmentation and part-of-speech tagging in step S011 are performed with the Chinese Academy of Sciences word segmentation system ICTCLAS or the Harbin Institute of Technology word segmentation system IRLAS.

In an embodiment of the invention, step S02 is implemented by the following algorithm:

1) For each atomic word string AWS in the set of atomic word strings, perform step 2);

2) For each atomic word in the atomic word string, perform steps 3) and 4) in order;

3) Obtain all substrings of AWS that begin with that atomic word;

4) For each substring, perform step 5);

5) Determine whether the substring occurs more than N times in the corpus; if so, perform step 6); otherwise, perform step 7);

6) Remove the separators in the substring to form a Chinese character string and take it as a candidate word, recording its frequency of occurrence;

7) Return to step 2).

The invention designs and implements a method for automatically extracting the words that appear in a Chinese electronic document. Compared with existing word extraction methods: (1) it splits Chinese character strings in steps of one atomic word, which avoids erroneous extractions caused by cutting through an atomic word, such as 系统工 and 识管理; (2) it performs well on compound words, and can extract compound words that are rarely used on their own, such as 决策支持 ("decision support"); (3) it solves the problem of "half words" in the result set and improves the precision of automatic word extraction. The effect and benefit of the invention is that it addresses the practical problems of low performance and limited automation in automatic word extraction. Efficient automatic word extraction is the basis of automatic text processing and supports applications such as information retrieval, text summarization, and content management. A good word extraction method can raise the degree of automation and the performance of such work.

Description of the drawings

Fig. 1 is a schematic flowchart of the method according to an embodiment of the invention.

Fig. 2 is a detailed schematic flowchart of the method according to another embodiment of the invention.

Detailed description

The invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, this embodiment provides a method for automatically extracting words from a Chinese electronic document, comprising the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words with specific parts of speech. Step S02: count the frequencies of these atomic word strings and their substrings, and take atomic word strings that occur more than N times as candidate words, where N is a configurable parameter, preferably 2. Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that appear in the document and thereby automatically extracting the words in the Chinese electronic document.

Specifically, referring to Fig. 2, the automatic word extraction method of this embodiment extracts the set of words in a Chinese electronic document in the following steps:

1. Perform atomic word segmentation and part-of-speech tagging on the electronic document to obtain a segmented and tagged document.

This step performs atomic word segmentation and part-of-speech tagging on the input electronic document. The Chinese Academy of Sciences word segmentation system ICTCLAS, the Harbin Institute of Technology word segmentation system IRLAS, or a similar system can be used.

2. Delete useless atomic words to obtain a set of atomic word strings.

Useless atomic words are atomic words that generally do not take part in forming compound words. This step processes the segmented and tagged electronic document, deletes two kinds of useless atomic words in two sub-steps, and outputs an ordered set of word strings composed of the retained atomic words.

For convenience in what follows, we make the following definitions:

Definition 1: An atomic word string (Chinese atomic word string, AWS) is a finite sequence of one or more Chinese atomic words, written AWS="aw1_aw2_...awn_", where aw1_aw2_...awn_ is the value of the AWS and each awi (1≤i≤n) is an atomic word. The length of an atomic word string (written AWSLen) is the number of atomic words it contains.

For example, "信息_系统_" is an atomic word string of length 2, formed by segmenting 信息系统 ("information system") into atomic words.

Adjacent atomic words in an atomic word string can be separated by a space. For clarity, the underscore "_" is used here to represent the space.

Definition 2: A substring of an atomic word string is a subsequence of that atomic word string.

For example, "信息_", "系统_", and "信息_系统_" are substrings of the atomic word string "信息_系统_".
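Definitions 1 and 2 can be illustrated with a short sketch. It assumes that an atomic word string is held as a Python list of atomic words and that a "substring" is a contiguous run of atomic words, which is what the example above and the counting algorithm suggest. The function names `render` and `substrings` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def render(words: Tuple[str, ...]) -> str:
    """Write atomic words in the "aw1_aw2_...awn_" notation of Definition 1."""
    return "".join(word + "_" for word in words)


def substrings(aws: List[str]) -> List[Tuple[str, ...]]:
    """List every contiguous run of atomic words in an atomic word string.

    Assumption: a "substring" is contiguous, as in the example above.
    """
    result = []
    for start in range(len(aws)):
        for end in range(start + 1, len(aws) + 1):
            result.append(tuple(aws[start:end]))
    return result


aws = ["信息", "系统"]   # the atomic word string "信息_系统_"
aws_len = len(aws)       # AWSLen: the number of atomic words
rendered = [render(s) for s in substrings(aws)]
print(aws_len, rendered)
```

An atomic word string of length n has n(n+1)/2 contiguous substrings under this reading, so the enumeration stays small for the short strings that remain after useless atomic words are removed.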

1) Deletion by part of speech. This sub-step deletes useless atomic words according to their part of speech. Given the segmented and tagged electronic document as input, it retains the atomic words tagged with specific parts of speech and replaces atomic words that generally do not take part in word formation (for example, prepositions and particles) with a newline character (or another predetermined symbol; the choice is not limited to this). The output is thus an ordered set of atomic word strings, each made up of retained atomic words. In the output, a single space separates atomic words and a newline separates atomic word strings.

2) Deletion of stop atomic words. This sub-step deletes further atomic words according to a stop list of atomic words, replacing each stop atomic word with a newline character and thereby generating a new ordered set of atomic word strings. Stop atomic words are words whose part of speech suggests they could take part in forming compound words but which in practice generally do not, such as 是 ("to be", verb), 要 ("to want", verb), 提供 ("to provide", verb), and 不少 ("quite a few", adjective).
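The two deletion sub-steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the input is assumed to be a list of (word, tag) pairs already produced by a segmenter, the tag letters (`n`, `v`, `a`, `u`) and the contents of the retained tag set and the stop list are hypothetical, and the two passes are folded into one loop. Consecutive deleted words yield a single break here instead of several empty lines.

```python
from typing import Iterable, List, Set, Tuple


def build_atomic_word_strings(
    tagged: Iterable[Tuple[str, str]],
    keep_pos: Set[str],
    stop_words: Set[str],
) -> List[List[str]]:
    """Split a tagged word sequence into atomic word strings.

    A word is dropped if its tag is not in keep_pos (deletion by part of
    speech) or if it is in stop_words (stop atomic word deletion). Each
    dropped word ends the current atomic word string.
    """
    strings: List[List[str]] = []
    current: List[str] = []
    for word, pos in tagged:
        if pos in keep_pos and word not in stop_words:
            current.append(word)
        elif current:
            strings.append(current)
            current = []
    if current:
        strings.append(current)
    return strings


def serialize(strings: List[List[str]]) -> str:
    """Space between atomic words, newline between atomic word strings."""
    return "\n".join(" ".join(s) for s in strings)


# Hypothetical tagged input; the tag letters are assumptions.
tagged = [("管理", "n"), ("信息", "n"), ("系统", "n"), ("的", "u"),
          ("范围", "n"), ("是", "v"), ("全球", "n")]
strings = build_atomic_word_strings(tagged, {"n", "v", "a"}, {"是"})
print(serialize(strings))
```

Here 的 is dropped because its tag is not retained, while 是 has a retained tag but is removed by the stop list, which mirrors the distinction the text draws between the two kinds of useless atomic words.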

3. Count substring frequencies to obtain the set of candidate words.

The previous step turned the segmented and tagged electronic document into an ordered set of atomic word strings. This step splits these atomic word strings into substrings and outputs the substrings that occur multiple times in the document as candidate words. The candidate words include atomic words, compound words, and some Chinese character strings that cannot stand alone as words. The algorithm is as follows:

1) For each atomic word string AWS in the set of atomic word strings, perform 2).

2) For each atomic word in the atomic word string, perform 3) and 4) in order.

3) Obtain all substrings of AWS that begin with that atomic word.

4) For each substring, perform 5).

5) Determine whether the substring occurs more than N times in the corpus (N is a configurable parameter); if so, perform 6); otherwise, perform 7).

6) Remove the separators in the substring to form a Chinese character string and take it as a candidate word, recording its frequency of occurrence.

7) Return to 2).
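The counting algorithm above can be sketched in a few lines. The sketch assumes each atomic word string is a list of atomic words and counts every contiguous substring in one pass before applying the threshold, which gives the same candidates as testing each substring inside the loop. The function name `candidate_words` is hypothetical. The comparison follows the wording "more than N times"; note that the worked example (Table 4) lists candidates with frequency 2 although 2 is named as a preferred N, so whether the threshold is meant to be strict is not settled by the text.

```python
from collections import Counter
from typing import Dict, List


def candidate_words(aws_list: List[List[str]], n: int) -> Dict[str, int]:
    """Return substrings that occur more than n times, with their counts."""
    counts: Counter = Counter()
    for aws in aws_list:                                  # step 1)
        for start in range(len(aws)):                     # step 2)
            for end in range(start + 1, len(aws) + 1):    # steps 3), 4)
                # Joining without a separator removes the spacers, step 6).
                counts["".join(aws[start:end])] += 1
    # Step 5): keep only substrings whose count exceeds n.
    return {word: c for word, c in counts.items() if c > n}


# Hypothetical atomic word strings for illustration.
aws_list = [["信息", "系统"], ["信息", "系统", "工程"], ["系统"]]
candidates = candidate_words(aws_list, 1)
print(candidates)
```

Because substrings are built from whole atomic words, fragments such as 系统工 can never be produced, which is the first advantage the text claims over character-level statistical methods.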

4. Delete "half words" to obtain the word set.

This step processes the candidate word set and deletes the candidates that occur only as substrings, giving the final result of automatic word extraction: the set of words in the electronic document. A candidate that occurs only as a substring is a substring whose frequency in the document equals that of its parent string.
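The "half word" rule can be sketched as a filter over the candidate set. The sketch below is an illustration under assumptions: candidates are plain character strings with their frequencies, and a "parent string" is taken to be any other candidate that contains the word, tested at the character level. The function name `remove_half_words` is hypothetical. The input values are the candidate frequencies listed in Table 4 of the worked example.

```python
from typing import Dict


def remove_half_words(candidates: Dict[str, int]) -> Dict[str, int]:
    """Drop candidates whose frequency equals that of a containing candidate."""
    kept = {}
    for word, freq in candidates.items():
        is_half_word = any(
            other != word and word in other and other_freq == freq
            for other, other_freq in candidates.items()
        )
        if not is_half_word:
            kept[word] = freq
    return kept


# Candidate words and frequencies as listed in Table 4.
table_4 = {
    "企业": 3, "全球": 3, "全球性": 2, "全球性企业": 2, "通信": 2,
    "系统": 2, "信息": 2, "信息系统": 2, "范围": 2,
}
words = remove_half_words(table_4)
print(words)
```

企业 and 全球 survive because they occur three times while their parent 全球性企业 occurs only twice, so they must also appear on their own. 全球性, 系统, and 信息 are removed because each occurs exactly as often as a parent string.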

In practice, to improve the accuracy of the result, a manual correction step can be added after the word set has been extracted automatically. Manual correction is the process in which an expert edits the automatically extracted result by hand.

To help a person of ordinary skill understand the invention, take the document shown in Table 1 as an example.

[Table 1 image: Figure BDA00003126119600081]

Table 1

In step 1, the Chinese Academy of Sciences word segmentation system ICTCLAS is used to perform atomic word segmentation and part-of-speech tagging on the document. The segmented document is shown in Table 2.

Table 2

Step 2 deletes useless atomic words. The result is shown in Table 3:

[Table 3 image: Figure BDA00003126119600092]

Table 3

Step 3 counts substring frequencies to obtain candidate words. The result is shown in Table 4.

No.  Word                            Frequency
1    企业 (enterprise)               3
2    全球 (worldwide)                3
3    全球性 (global)                 2
4    全球性企业 (global enterprise)  2
5    通信 (communication)            2
6    系统 (system)                   2
7    信息 (information)              2
8    信息系统 (information system)   2
9    范围 (scope)                    2

Table 4

Step 4 deletes "half words", giving the set of words extracted automatically by the method of the invention. The result is shown in Table 5.

No.  Word                            Frequency
1    企业 (enterprise)               3
2    全球 (worldwide)                3
3    全球性企业 (global enterprise)  2
4    通信 (communication)            2
5    信息系统 (information system)   2
6    范围 (scope)                    2

Table 5

The above are only preferred embodiments of the invention. All equivalent changes and modifications made within the scope of the claims of the invention fall within the scope of the invention.

Claims (5)

1. A method for automatically extracting words from a Chinese electronic document, characterized by comprising the following steps:
Step S01: process the electronic document into a set of word strings composed of atomic words with specific parts of speech;
Step S02: count the frequencies of these atomic word strings and their substrings, and take atomic word strings that occur more than N times as candidate words, where N is a configurable parameter;
Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that appear in the document and thereby automatically extracting the words in the Chinese electronic document.
2. The method for automatically extracting words from a Chinese electronic document according to claim 1, characterized in that step S01 is implemented by the following steps:
S011: perform atomic word segmentation and part-of-speech tagging on the electronic document to obtain a segmented and tagged document;
S012: delete useless atomic words to obtain a set of atomic word strings, removing two kinds of useless atomic words in the following two steps:
S0121: delete useless atomic words by part of speech: replace each atomic word that does not take part in word formation with a first predetermined symbol; in the output, a second predetermined symbol separates atomic words, and the first predetermined symbol separates atomic word strings;
S0122: delete further atomic words according to a stop list of atomic words: replace each stop atomic word with the first predetermined symbol, thereby generating a new ordered set of atomic word strings.
3. The method for automatically extracting words from a Chinese electronic document according to claim 2, characterized in that the first predetermined symbol is a newline character and the second predetermined symbol is a space.
4. The method for automatically extracting words from a Chinese electronic document according to claim 2, characterized in that the atomic word segmentation and part-of-speech tagging in step S011 are performed with the word segmentation system ICTCLAS or the word segmentation system IRLAS.
5. The method for automatically extracting words from a Chinese electronic document according to claim 1, characterized in that step S02 is implemented by the following algorithm:
1) for each atomic word string AWS in the set of atomic word strings, perform step 2);
2) for each atomic word in the atomic word string, perform steps 3) and 4) in order;
3) obtain all substrings of AWS that begin with that atomic word;
4) for each substring, perform step 5);
5) determine whether the substring occurs more than N times in the corpus; if so, perform step 6); otherwise, perform step 7);
6) remove the separators in the substring to form a Chinese character string and take it as a candidate word, recording its frequency of occurrence;
7) return to step 2).
CN2013101564948A 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document Pending CN103207921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101564948A CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101564948A CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Publications (1)

Publication Number Publication Date
CN103207921A true CN103207921A (en) 2013-07-17

Family

ID=48755142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101564948A Pending CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Country Status (1)

Country Link
CN (1) CN103207921A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0802492A1 (en) * 1996-04-17 1997-10-22 International Business Machines Corporation Document search system
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于娟 等: "结合词性分析与串频统计的词语提取方法", 《系统工程理论与实践》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015013899A1 (en) * 2013-07-31 2015-02-05 Empire Technology Development Llc Information extraction from semantic data
CN104766504A (en) * 2015-03-31 2015-07-08 黄庆梅 Atomic word point contact learning machine
CN106970904A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device of new word discovery
CN106970904B (en) * 2016-01-14 2020-06-05 北京国双科技有限公司 Method and device for discovering new words
CN109213988A (en) * 2017-06-29 2019-01-15 武汉斗鱼网络科技有限公司 Barrage subject distillation method, medium, equipment and system based on N-gram model
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
CN110969009A (en) * 2019-12-03 2020-04-07 哈尔滨工程大学 Word segmentation method of Chinese natural language text
CN110969009B (en) * 2019-12-03 2023-10-13 哈尔滨工程大学 Word segmentation method for Chinese natural language text

Similar Documents

Publication Publication Date Title
CN109710947B (en) Method and device for generating electric power professional thesaurus
CN103778243B (en) Domain term extraction method
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN103235774B (en) A kind of science and technology item application form Feature Words extracting method
CN108132929A (en) A kind of similarity calculation method of magnanimity non-structured text
CN103761264B (en) Concept hierarchy establishing method based on product review document set
CN108845982B (en) A Chinese word segmentation method based on word association features
CN103631858B (en) A kind of science and technology item similarity calculating method
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN106407182A (en) A method for automatic abstracting for electronic official documents of enterprises
CN105573979B (en) A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN103207921A (en) Method for automatically extracting terms from Chinese electronic document
CN104778201A (en) Multi-query result combination-based prior art retrieval method
Ye et al. Unknown Chinese word extraction based on variety of overlapping strings
CN101794308A (en) Method for extracting repeated strings facing meaningful string mining and device
CN113408286B (en) A Chinese entity recognition method and system for the field of mechanical and chemical engineering
CN102375863A (en) Method and device for keyword extraction in geographic information field
Pande et al. Application of natural language processing tools in stemming
CN104077274B (en) Method and device for extracting hot word phrases from document set
CN101853284B (en) Internet-oriented meaningful string extraction method and device
CN101872363B (en) Method for extracting keywords
Biba et al. Boosting text classification through stemming of composite words
CN114330336A (en) New word discovery method and device based on left-right information entropy and mutual information
CN106502980A (en) A kind of search method and system based on text morpheme cutting
CN108197118A (en) A kind of method that automatic indexing and retrieval are carried out using computer system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130717
