JP2011180862A

JP2011180862A - Method and device of extracting term, and program

Info

Publication number: JP2011180862A
Application number: JP2010044964A
Authority: JP
Inventors: Yoshihiro Matsuo (義博松尾); Nozomi Kobayashi (のぞみ小林); Hisako Asano (久子浅野); Genichiro Kikui (玄一郎菊井)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-03-02
Filing date: 2010-03-02
Publication date: 2011-09-15
Anticipated expiration: 2030-03-02
Also published as: JP5113863B2

Abstract

[Problem] To provide a term extraction method that is fast and that takes notation fluctuations of terms (spelling variants and the like) into account.
[Solution] The term extraction method of the present invention includes a term WFST creation process, a fluctuation model conversion process, a transducer composition process, a WFST storage process, and a WFST decoding process. The term WFST creation process creates a term WFST, which is a WFST representation of each term in an externally supplied term list. The fluctuation model conversion process converts term fluctuation probabilities or transliteration fluctuation probabilities into a fluctuation WFST, which serves as the fluctuation model. The WFST decoding process then uses the composite WFST, obtained by composing the fluctuation WFST with the term WFST, to extract terms from an input character string, including terms written with notation fluctuations or in a different script.
[Selected Drawing] FIG. 2

Description

The present invention relates to a term extraction method for extracting terms from text, and to an apparatus and a program therefor.

The simplest conventional term extraction method prepares a term list in advance and extracts every term whose character string matches the given text. Because this method compares the text against each entry of the term list one by one, its extraction time grows in proportion to the number of terms in the list.

To reduce this extraction time, a method is known that manages the dictionary data in a tree structure called a trie (Non-Patent Document 1). With this method, a registered term of length k characters can be found in k steps, so the extraction time is shorter than when every entry of the word list is compared one by one. An improved method that skips unnecessary comparisons, so that the dictionary need not be matched against every substring of the text, is also known (Non-Patent Document 2).
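As a non-limiting illustration of the trie lookup mentioned above, the following Python sketch stores a term list in a nested-dictionary trie and finds the registered terms that begin at a given text position. The names `build_trie`, `prefix_matches`, and the marker `END` are hypothetical and are not taken from the cited documents; the sketch shows only the general idea of a k-step lookup.

```python
END = "$end"  # hypothetical marker key meaning "a registered term ends here"


def build_trie(terms):
    """Store each term character by character in a nested-dict trie."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term
    return root


def prefix_matches(trie, text, start=0):
    """Return every registered term that is a prefix of text[start:]."""
    found, node = [], trie
    for ch in text[start:]:
        if ch not in node:
            break  # no registered term continues with this character
        node = node[ch]
        if END in node:
            found.append(node[END])
    return found


trie = build_trie(["バイオリン", "イアリング", "ボーリング"])
exact_hit = prefix_matches(trie, "バイオリンを弾く")
variant_hit = prefix_matches(trie, "ヴァイオリン")
```

In this sketch the lookup of 「バイオリン」 takes five steps regardless of the size of the term list, while the variant spelling 「ヴァイオリン」 is not found because only exact character matches are followed.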

Non-Patent Document 1: Makoto Nagao and Satoshi Sato (eds.), "Iwanami Course on Software Science: Natural Language Processing", Iwanami Shoten, p. 250.
Non-Patent Document 2: Kenji Kita and Kazuhiko Tsuda, "Information Search Algorithms", Kyoritsu Shuppan, January 2002, p. 108.

The conventional methods described above can extract only terms that match the term list exactly. For example, even when 「バイオリン」 (baiorin, "violin") is registered in the term list, the spelling variant 「ヴァイオリン」 (vaiorin) cannot be extracted. Nor can a rendering in a different script, such as "violin", be extracted.

For the listed term 「バイオリン」 to match 「ヴァイオリン」, the two spellings could, for example, be aligned by dynamic programming (reference: "Japanese Information Processing", p. 40, edited by the Institute of Electronics and Communication Engineers of Japan). Like the conventional methods, however, this approach requires a matching time proportional to the number of terms in the term list and is therefore impractical. To match against "violin", one could obtain in advance per-phoneme transliteration probabilities such as 「バイ」-"vi", 「オ」-"o", and 「リン」-"lin", and extract the term when the transliteration probability of the whole pair 「バイオリン」-"violin" exceeds a threshold. This approach, too, requires a matching time proportional to the number of terms in the term list and is impractical.

The present invention has been made in view of these problems, and its object is to provide a term extraction method, an apparatus therefor, and a program that perform matching at high speed and can also extract terms written with notation fluctuations. The term extraction method of the present invention uses a weighted finite-state transducer (hereinafter, WFST), which is an extension of the finite-state automaton.

The term extraction method of the present invention includes a term WFST creation process, a fluctuation model conversion process, a transducer composition process, a WFST storage process, and a WFST decoding process. The term WFST creation process creates a term WFST from an externally supplied term list: each term is split into characters, and each character becomes an input character of the WFST that represents the term list. The fluctuation model conversion process converts term fluctuation probabilities or transliteration fluctuation probabilities into a fluctuation WFST, which is the fluctuation model. The transducer composition process composes the term WFST with the fluctuation WFST and outputs a composite WFST. The WFST storage process stores the composite WFST in a WFST storage unit. The WFST decoding process refers to the WFST storage unit and decodes the character string of the transition sequence whose cumulative weight for the input character string is maximal.

According to the term extraction method of the present invention, term fluctuation probabilities and transliteration fluctuation probabilities are expressed as a WFST, and terms written with notation fluctuations or in a different script are extracted by using the composite WFST obtained by composing this fluctuation WFST with the term WFST created from the terms of the term list. Because the composite WFST stores all candidate spellings of the terms in compressed form, terms can be extracted in a single matching pass, at high speed and with notation fluctuations taken into account.

A diagram showing a configuration example 100 that implements the term extraction method of the present invention.
A diagram showing the operation flow of the term extraction method of the present invention.
A diagram showing an example of the term list 1, in which (a) shows terms only and (b) shows terms with identification IDs assigned.
A diagram showing an example of the term WFST for the term list 1.
A diagram showing an example of the term fluctuation probabilities 2.
A diagram showing an example of the fluctuation WFST (term).
A diagram showing an example of the transliteration fluctuation probabilities 3.
A diagram showing an example of the fluctuation WFST (transliteration).
A diagram showing an example of the composite WFST.
A diagram showing an example of the processing of the WFST decoding unit 50.
A diagram showing a functional configuration example of the term extraction device 200 of the present invention.
A diagram showing the operation flow of the term extraction device 200.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows a functional configuration example 100 that implements the term extraction method of the present invention, and FIG. 2 shows its operation flow. The functional configuration example 100 includes a term WFST creation unit 10, a model conversion unit 20, a transducer composition unit 30, a WFST storage unit 40, and a WFST decoding unit 50. Some or all of the functions of these units are realized, for example, by loading a predetermined program into a computer composed of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The term WFST creation unit 10 creates a term WFST from the externally supplied term list 1: each term is split into characters, and each character becomes an input character of the WFST that represents the term list (step S10).

The model conversion unit 20 converts the term fluctuation probabilities 2 or the transliteration fluctuation probabilities 3 into a fluctuation WFST, which is the fluctuation model (step S20). The transducer composition unit 30 composes the term WFST created by the term WFST creation unit 10 with the fluctuation WFST produced by the model conversion unit 20 and outputs a composite WFST (step S30). The WFST storage unit 40 stores the composite WFST (step S40). The WFST decoding unit 50 refers to the WFST storage unit 40 and decodes the character string of the transition sequence whose cumulative weight for the input character string is maximal (step S50).
With the term extraction method described above, a specific term can be extracted from an input character string at high speed and with notation fluctuations taken into account. The operation of the invention is described in more detail below, using concrete examples of the term WFST and the fluctuation WFST.

[Term WFST]
FIG. 3 shows an example of the term list 1. FIG. 3(a) shows, as concrete examples, the terms 「バイオリン」 (baiorin, "violin"), 「イアリング」 (iaringu, "earring"), and 「ボーリング」 (bōringu). As shown in FIG. 3(b), an identification ID may be assigned to each term.

FIG. 4 shows an example of the term WFST that the term WFST creation unit 10 creates from the term list 1 of FIG. 3. Since countless equivalent WFSTs exist, FIG. 4 is only one of them. In FIG. 4, a circle represents a state and the number inside the circle is the state number. An arrow represents a state transition. The label attached to an arrow has the form "input character : output string / weight". An output of ε denotes the empty string.

The initial state is state (0), and the accepting state is state (5), drawn as a double circle. The term list 1 is converted into the term WFST by splitting each term into characters and setting up a state transition that takes each character as its input character.
When the input character 「バ」 is given in the initial state (0), a transition is made from state (0) to state (1). That transition has the input character 「バ」, the output string 「バイオリン」, and the weight 1.0. The transition from state (1) to state (2) has the input character 「イ」 and the output ε (empty string); the transition from state (2) to state (3) has the input character 「オ」 and the output ε; the transition from state (3) to state (4) has the input character 「リ」 and the output ε; and the transition from state (4) to state (5) has the input character 「ン」 and the output ε, reaching the accepting state (5). The term WFST shown in FIG. 4 is built in the same way for the other terms, 「ボーリング」 and 「イアリング」.
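The construction of the term WFST described above may be sketched in Python as follows. This is only an illustrative sketch: the function name `build_term_wfst`, the arc tuple layout `(input character, output string, weight, next state)`, and the use of the empty string for ε are assumptions made here, and the state numbers of the second and third terms need not agree with FIG. 4. Each term is given its own chain of states and its own accepting state.

```python
EPS = ""  # the empty string stands for ε in this sketch


def build_term_wfst(terms, neutral_weight=1.0):
    """Build a term WFST as (arcs, accepting_states).

    arcs maps a state number to a list of
    (input_char, output_string, weight, next_state) tuples.
    """
    arcs = {0: []}          # state 0 is the initial state
    finals = set()
    next_state = 1
    for term in terms:
        state = 0
        for i, ch in enumerate(term):
            # The whole term is emitted on the first transition; the
            # remaining transitions emit ε.
            out = term if i == 0 else EPS
            arcs[state].append((ch, out, neutral_weight, next_state))
            arcs[next_state] = []
            state = next_state
            next_state += 1
        finals.add(state)   # the last state of each chain is accepting
    return arcs, finals


term_arcs, term_finals = build_term_wfst(["バイオリン", "イアリング", "ボーリング"])
```

With this numbering the first term occupies states (0) to (5), so `term_arcs[0][0]` is the transition バ:バイオリン/1.0 into state (1), matching the walk-through above.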

In FIG. 4 the character string of each term is attached to the transition immediately after the initial state (0), but attaching it to any transition within the same term yields an equivalent term WFST. When identification IDs (FIG. 3(b)) are assigned to the terms, the identification ID is attached to the transition immediately after the initial state (0) in place of the term.
When probability values are used as weights, weights are accumulated by multiplication; in that case all weights of the term WFST are set to 1. When log probabilities are used as weights, weights are accumulated by addition; in that case all weights are set to 0.
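The two weight conventions can be illustrated with a short Python sketch. The helper `accumulate` and the example path are hypothetical; the sketch only shows that multiplying probabilities from an initial weight of 1 and adding log probabilities from an initial weight of 0 describe the same path score.

```python
import math


def accumulate(weights, log_domain=False):
    """Accumulate the transition weights along one path."""
    total = 0.0 if log_domain else 1.0   # neutral initial weight
    for w in weights:
        total = total + w if log_domain else total * w
    return total


# Hypothetical path: one fluctuation transition of probability 0.3
# followed by four term WFST transitions carrying the neutral weight 1.
probs = [0.3, 1.0, 1.0, 1.0, 1.0]
log_probs = [math.log(p) for p in probs]

p = accumulate(probs)                        # product of probabilities
lp = accumulate(log_probs, log_domain=True)  # sum of log probabilities
```

Here `math.exp(lp)` agrees with `p` up to floating-point rounding, and a transition with the neutral weight (1 or 0, respectively) leaves the path score unchanged.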

[Fluctuation WFST]
FIG. 5 shows an example of the term fluctuation probabilities 2. For the original notation 「バ」 there are, for example, two variant notations, 「バ」 and 「ヴァ」, with occurrence probabilities 0.7 and 0.3. These occurrence probabilities may be supplied by hand or obtained automatically (see, for example, Japanese Patent No. 4084515).

FIG. 6 shows an example in which the term fluctuation probabilities (FIG. 5) have been converted into a fluctuation WFST, which is the fluctuation model. The fluctuation WFST is a transducer that rewrites a variant notation into the original notation, and the weight accumulated along a path from the initial state (0) to the accepting state (4) equals the probability value shown in FIG. 5. The model conversion unit 20 converts the term fluctuation probabilities into the fluctuation WFST by splitting each variant notation into characters and setting up state transitions that take those characters, one at a time, as input characters.

For the variant notation 「バ」, the transition (バ:バ/0.7), which takes the input character 「バ」 and outputs 「バ」 with weight 0.7, leads from the initial state (0) to the accepting state (4). In parallel with it, a path is provided that takes the variant notation 「ヴァ」 as input, outputs 「バ」 with weight 0.3, and also leads from the initial state (0) to the accepting state (4). Likewise, the transition (ア:ア/0.9), which outputs 「ア」 with weight 0.9 for the variant notation 「ア」, the transition (ヤ:ア/0.1), and so on are provided in parallel. From the accepting state (4), the transducer returns automatically to the initial state (0) without consuming any input (ε:ε).
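One possible way to carry out this conversion is sketched below in Python. The table layout `{original: {variant: probability}}`, the function name `build_fluctuation_wfst`, and the state numbering (initial state 0, accepting state 1) are assumptions made for illustration and do not follow the numbering of FIG. 6. In this sketch the original notation and the probability are placed on the first character of each variant, and any remaining characters are consumed with output ε and the neutral weight.

```python
EPS = ""  # the empty string stands for ε in this sketch


def build_fluctuation_wfst(table, neutral_weight=1.0):
    """Convert {original: {variant: probability}} into (arcs, accepting_states)."""
    INITIAL, ACCEPT = 0, 1
    # The accepting state returns to the initial state on ε:ε so that the
    # next character of the input can be rewritten in turn.
    arcs = {INITIAL: [], ACCEPT: [(EPS, EPS, neutral_weight, INITIAL)]}
    next_state = 2
    for original, variants in table.items():
        for variant, prob in variants.items():
            state = INITIAL
            for i, ch in enumerate(variant):
                if i == len(variant) - 1:
                    target = ACCEPT          # last character of the variant
                else:
                    target = next_state      # intermediate state
                    arcs[target] = []
                    next_state += 1
                out = original if i == 0 else EPS
                weight = prob if i == 0 else neutral_weight
                arcs[state].append((ch, out, weight, target))
                state = target
    return arcs, {ACCEPT}


# Probabilities taken from the examples in the text above.
table = {"バ": {"バ": 0.7, "ヴァ": 0.3}, "ア": {"ア": 0.9, "ヤ": 0.1}}
fluct_arcs, fluct_finals = build_fluctuation_wfst(table)
```

The two-character variant 「ヴァ」 becomes the path ヴ:バ/0.3 followed by ァ:ε/1.0, so the weight accumulated from the initial state to the accepting state is 0.3, as required.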

FIG. 7 shows an example of the transliteration fluctuation probabilities 3 for alphabetic characters. For the original notation 「バ」 there are, for example, two transliterated variant notations, "ba" and "va", with occurrence probabilities 0.7 and 0.3. FIG. 8 shows an example in which the transliteration fluctuation probabilities (FIG. 7) have been converted into a fluctuation WFST. As with term fluctuation, the transliteration fluctuation WFST is built by splitting each variant notation into characters and setting up state transitions that take those characters, one at a time, as input characters. The transliteration fluctuation WFST is used for input character strings written in alphabetic characters.

[Composite WFST]
FIG. 9 shows an example of a composite WFST obtained by composing the term WFST of FIG. 4 with the fluctuation WFST (term) obtained from the probabilities of FIG. 5 (that is, the WFST of FIG. 6). Composition is one of the standard WFST operations and merges two WFSTs into a single WFST. The composition operation is described, for example, in Fernando C. N. Pereira and Michael Riley, "Speech Recognition by Composition of Weighted Finite Automata", in Emmanuel Roche and Yves Schabes (eds.), Finite-State Devices for Natural Language Processing, chapter 15, pp. 431-453, MIT Press, Cambridge, Massachusetts, 1997.
State (14) in FIG. 9 is a state added as a result of composing the transition from state (0) to state (1) of the term WFST (FIG. 4) with the path ((0)→(1)→(4)) of the fluctuation WFST (FIG. 6), which leaves state (0) on the input character 「ヴ」 with transition probability 0.3. Similarly, the transition (ウ:ε/0.2) added from state (6) to state (7) in FIG. 9 results from composing the path ((0)→(3)→(4)) of the fluctuation WFST (FIG. 6).
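For illustration only, a simplified composition can be sketched in Python as follows. It is not the full algorithm of the cited reference: it assumes probability weights (multiplied), assumes that the term WFST has no ε-input transitions, and uses no ε filter. The names `compose` and `transduce`, the hand-written transducers `F` and `T`, and the identity entries of weight 1.0 for 「イ」, 「オ」, 「リ」, 「ン」 are assumptions introduced here. States of the composite are pairs (fluctuation state, term state) rather than the numbers used in FIG. 9.

```python
EPS = ""  # the empty string stands for ε in this sketch


def compose(f_arcs, f_init, f_finals, t_arcs, t_init, t_finals):
    """Compose F (variant -> original) with T (original -> term)."""
    start = (f_init, t_init)
    arcs, finals, stack = {}, set(), [start]
    while stack:
        pair = stack.pop()
        if pair in arcs:
            continue                      # already expanded
        arcs[pair] = []
        qf, qt = pair
        if qf in f_finals and qt in t_finals:
            finals.add(pair)
        for f_in, f_out, f_w, f_next in f_arcs.get(qf, []):
            if f_out == EPS:
                # F emits nothing, so T stays where it is.
                arcs[pair].append((f_in, EPS, f_w, (f_next, qt)))
                stack.append((f_next, qt))
                continue
            for t_in, t_out, t_w, t_next in t_arcs.get(qt, []):
                if t_in == f_out:         # F's output feeds T's input
                    arcs[pair].append((f_in, t_out, f_w * t_w, (f_next, t_next)))
                    stack.append((f_next, t_next))
    return arcs, start, finals


def transduce(arcs, start, finals, text):
    """List (output, weight) for every accepting path that consumes all of text."""
    results, stack = [], [(0, start, "", 1.0)]
    while stack:
        pos, state, out, w = stack.pop()
        if pos == len(text) and state in finals:
            results.append((out, w))
        for a_in, a_out, a_w, nxt in arcs[state]:
            if a_in == EPS:
                stack.append((pos, nxt, out + a_out, w * a_w))
            elif pos < len(text) and text[pos] == a_in:
                stack.append((pos + 1, nxt, out + a_out, w * a_w))
    return results


# Fluctuation WFST: state 0 initial, state 1 accepting (hypothetical numbering).
F = {
    0: [("バ", "バ", 0.7, 1), ("ヴ", "バ", 0.3, 2)] + [(c, c, 1.0, 1) for c in "イオリン"],
    1: [(EPS, EPS, 1.0, 0)],
    2: [("ァ", EPS, 1.0, 1)],
}
# Term WFST for the single term 「バイオリン」.
T = {0: [("バ", "バイオリン", 1.0, 1)], 1: [("イ", EPS, 1.0, 2)], 2: [("オ", EPS, 1.0, 3)],
     3: [("リ", EPS, 1.0, 4)], 4: [("ン", EPS, 1.0, 5)], 5: []}

C, c_start, c_finals = compose(F, 0, {1}, T, 0, {5})
variant_result = transduce(C, c_start, c_finals, "ヴァイオリン")
plain_result = transduce(C, c_start, c_finals, "バイオリン")
```

In this sketch the composite maps the variant spelling 「ヴァイオリン」 directly to the registered term 「バイオリン」 with weight 0.3, and the plain spelling to the same term with weight 0.7, without consulting the term list entry by entry.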

The composite WFST obtained in this way by composing the term WFST with the fluctuation WFST is stored in the WFST storage unit 40. For the transliteration fluctuation WFST as well, a composite WFST composed in the same manner as for term fluctuation is stored in the WFST storage unit 40.

[WFST decoding unit]
The WFST decoding unit 50 may be of any design as long as it can output the character string of the transition sequence whose cumulative weight for the input character string is maximal. For example, the WFST decoding unit 50 performs the processing shown in FIG. 10.
The description here uses an example in which the input character string is supplied one word at a time.

First, the initial state is placed in a queue (step S51). A state here is a data item consisting of (character position, state number, output string, cumulative weight). The queue may be a simple stack or a FIFO; when only the single most likely solution is required, a priority queue is used. The initial state is (start of the character string, start state number, empty string, initial weight). The initial weight is 1 when weights are accumulated by multiplication and 0 when they are accumulated by addition.

Next, one state candidate is taken from the queue (step S52). The character at the character position of that state is then read from the input character string and becomes the next character (step S54).

All transitions whose input character matches the next character are obtained by referring to the WFST storage unit 40 (step S55). At the same time, the transitions whose input character is the empty string (ε), and the information on whether the current state number is an accepting state number, are also obtained from the WFST storage unit 40.

If the current state number is an accepting state number, that accepting state is registered in the result list (step S57). If, for example, a priority queue is used and only the single most likely solution is required, processing ends here (Yes in step S58).
The acceptance check (step S56) extracts terms that match a prefix of the input character string. If only terms that match the whole input are wanted, the check should additionally test whether the character position is at the end of the input character string.

Next, all of the resulting destination states are added to the queue (step S59). In each destination state, the next character position is the same as the current character position if the input character of the transition is the empty string (ε), and is advanced by one character otherwise. The next state number is the destination state number obtained from the WFST storage unit 40. The output string is the current output string concatenated with the output string obtained from the WFST storage unit 40. The cumulative weight is the current cumulative weight accumulated with the weight of the transition obtained from the WFST storage unit 40.

The processing of steps S52 to S59 is repeated until no state candidate can be obtained (No in step S53). The character strings held in the result list at the point where no state candidate can be obtained become the output of the WFST decoding unit 50 (Yes in step S53). When the input character string is supplied one word at a time, as in this example, the processing is better described as term matching than as term extraction. Such a term matching method can be used, for example, as a filter for keyword input to a search engine.
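The decoding loop of steps S51 to S59 may be sketched in Python as follows, here with a priority queue so that the first accepting state reached is the most likely one. This is an illustrative sketch under several assumptions: weights are probabilities no greater than 1 accumulated by multiplication, the WFST has no ε cycles, and the function name `decode`, the flag `exact`, and the small hand-written `COMPOSITE` table (whose state numbers do not follow FIG. 9) are hypothetical.

```python
import heapq

EPS = ""  # the empty string stands for ε in this sketch


def decode(arcs, finals, text, initial=0, exact=True):
    """Return (output, weight) of the best accepting path, or None."""
    # S51: the queue holds (negated weight, character position, state, output).
    queue = [(-1.0, 0, initial, "")]
    while queue:                                              # S53
        neg_w, pos, state, out = heapq.heappop(queue)         # S52
        next_char = text[pos] if pos < len(text) else None    # S54
        # S55: transitions on the next character, plus ε-input transitions.
        matching = [a for a in arcs.get(state, []) if a[0] == EPS or a[0] == next_char]
        # S56: prefix match, or whole-string match when exact is True.
        if state in finals and (not exact or pos == len(text)):
            return out, -neg_w                                # S57, S58
        for a_in, a_out, a_w, nxt in matching:                # S59
            new_pos = pos if a_in == EPS else pos + 1
            heapq.heappush(queue, (neg_w * a_w, new_pos, nxt, out + a_out))
    return None


# Hand-written composite WFST for the term 「バイオリン」 with the variant 「ヴァ」.
COMPOSITE = {
    0: [("バ", "バイオリン", 0.7, 1), ("ヴ", "バイオリン", 0.3, 6)],
    6: [("ァ", EPS, 1.0, 1)],
    1: [("イ", EPS, 1.0, 2)],
    2: [("オ", EPS, 1.0, 3)],
    3: [("リ", EPS, 1.0, 4)],
    4: [("ン", EPS, 1.0, 5)],
}
FINALS = {5}

best = decode(COMPOSITE, FINALS, "ヴァイオリン")
miss = decode(COMPOSITE, FINALS, "ピアノ")
prefix = decode(COMPOSITE, FINALS, "バイオリンを弾く", exact=False)
```

Because the weights are negated before being pushed, `heapq` pops the candidate with the largest cumulative weight first. With `exact=False` the acceptance check corresponds to the prefix match described above; with `exact=True` it additionally requires the character position to be at the end of the input.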

[Term extraction device]
FIG. 12 shows a functional configuration example of the term extraction device 200 of the present invention, and FIG. 13 shows its operation flow. The term extraction device 200 consists of the WFST storage unit 40 and the WFST decoding unit 50 shown in FIG. 1.
The WFST decoding unit 50 is configured to output prefix matches. Extracted terms are accumulated in the result list. To prevent the same term from being extracted more than once, extraction of that term may simply be stopped once it has been extracted for the first time.

The WFST decoding unit 50 reads in the character string that makes up the entire text (step S201). Starting from the first character of the input and advancing the input character each time, the WFST decoding unit 50 then executes the WFST decoding process described above until the end of the input character string is reached (steps S50, S202, S203). During the WFST decoding process, each time an accepting state number is detected, the extracted term is output to the result list (step S50).
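The extraction over a whole text can be sketched in Python as follows: prefix-matching decoding is started at every character position, and every accepting state reached contributes one entry to the result list. The function names `prefix_matches` and `extract_terms`, the hand-written `COMPOSITE` table, and the sample sentence are assumptions made for illustration; a plain stack is used instead of a priority queue because all matches, not only the best one, are collected.

```python
EPS = ""  # the empty string stands for ε in this sketch


def prefix_matches(arcs, finals, text, start, initial=0):
    """Return (term, weight, end) for every accepting path that begins at text[start]."""
    results, stack = [], [(start, initial, "", 1.0)]
    while stack:
        pos, state, out, w = stack.pop()
        if state in finals:                      # accepting state detected
            results.append((out, w, pos))
        ch = text[pos] if pos < len(text) else None
        for a_in, a_out, a_w, nxt in arcs.get(state, []):
            if a_in == EPS:
                stack.append((pos, nxt, out + a_out, w * a_w))
            elif a_in == ch:
                stack.append((pos + 1, nxt, out + a_out, w * a_w))
    return results


def extract_terms(arcs, finals, text):
    """Run the decoder from every start position of the text."""
    hits = []
    for start in range(len(text)):               # advance the input character
        for term, weight, end in prefix_matches(arcs, finals, text, start):
            hits.append((term, text[start:end], weight))
    return hits


# Hand-written composite WFST for the term 「バイオリン」 with the variant 「ヴァ」.
COMPOSITE = {
    0: [("バ", "バイオリン", 0.7, 1), ("ヴ", "バイオリン", 0.3, 6)],
    6: [("ァ", EPS, 1.0, 1)],
    1: [("イ", EPS, 1.0, 2)],
    2: [("オ", EPS, 1.0, 3)],
    3: [("リ", EPS, 1.0, 4)],
    4: [("ン", EPS, 1.0, 5)],
}
FINALS = {5}

hits = extract_terms(COMPOSITE, FINALS, "彼はヴァイオリンとバイオリンを弾く")
```

Each hit pairs the registered term with the surface string actually found in the text, so the variant 「ヴァイオリン」 and the plain spelling 「バイオリン」 are both reported as occurrences of the term 「バイオリン」.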

Because the term extraction device 200 extracts terms by using the composite WFST obtained by composing the fluctuation WFST with the term WFST created from the terms of the term list, it can extract terms at high speed, including terms written with notation fluctuations or in a different script.

When the processing means of the above device are implemented by a computer, the processing of the functional units that each device should have is described by a program. Executing this program on a computer realizes the processing means of each device on that computer.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network. The functional units of each device may be configured by executing a predetermined program on a computer, or at least part of their processing may be implemented in hardware.

Claims

A term extraction method comprising:
a term WFST creation process of creating, from each term of an externally supplied term list, a term WFST, which is the term list expressed as a WFST in which each term is split into characters and each character serves as an input character;
a fluctuation model conversion process of converting term fluctuation probabilities or transliteration fluctuation probabilities into a fluctuation WFST, which is a fluctuation model;
a transducer composition process of composing the term WFST with the fluctuation WFST and outputting a composite WFST;
a WFST storage process of storing the composite WFST in a WFST storage unit; and
a WFST decoding process of decoding, with reference to the WFST storage unit, the character string of the transition sequence whose cumulative weight for an input character string is maximal.

A term extraction device comprising:
a WFST storage unit that stores a composite WFST obtained by composing a term WFST, which is a term list expressed as a WFST whose input characters are the characters of each term in the term list, with a fluctuation WFST, which represents term fluctuation probabilities or transliteration fluctuation probabilities; and
a WFST decoding unit that decodes, with reference to the WFST storage unit, the character string of the transition sequence whose cumulative weight for an input character string is maximal.

A program for causing a computer to function as each unit of the term extraction device according to claim 2.