JP4013489B2

JP4013489B2 - Corresponding category search system and method

Info

Publication number: JP4013489B2
Application number: JP2001058303A
Authority: JP
Inventors: Hiroshi Masuichi (増市 博)
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2001-03-02
Filing date: 2001-03-02
Publication date: 2007-11-28
Anticipated expiration: 2021-03-02
Also published as: JP2002259445A

Description

[0001]
[Technical field of the invention]
The present invention concerns category structures in which a document set is classified into a plurality of categories, and relates to a technique for determining the correspondence between categories across a plurality of category structures constructed for different languages.
[0002]
[Prior art]
One way to make a large document collection easier to access is to classify it into a plurality of categories. Once the collection is classified, a desired document can be obtained efficiently by searching only the category to which that document is expected to belong. Document collections are sometimes classified into categories by hand, and many techniques for automating the classification have also been proposed, such as those described in "Prospects for Information Retrieval Cognitive Approach", by David Ellis, Maruzen Co., Ltd. (1994).
[0003]
When such categorized document sets (hereinafter also called category structures) have been constructed for a plurality of languages, determining the correspondence between categories across the category structures (that is, which categories contain document sets with similar semantic content) is important for searching documents across languages (multilingual document search). Specifically, search accuracy can be improved by restricting the document set in the target language (the language used for the secondary search) to categories close to the search request expressed in the source language (the language used directly for the search).
[0004]
The main approach to determining such category correspondences automatically is to reuse multilingual document search techniques. For example, Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "Query Translation Method for Cross Language Information Retrieval", The Proceedings of Machine Translation Summit VII '99 Workshop on Machine Translation for Cross Language Information Retrieval (1999), proposes a technique in which a set of translation pairs (a parallel corpus) is used as learning data, documents written in different languages are each represented as document vectors in the same vector space, and multilingual document search is performed by taking the cosine between vectors as the similarity between documents. With this technique, the correspondence between categories in category structures constructed for different languages can be determined by defining the sum of the document vectors of all documents belonging to a category as the category vector and defining the cosine between category vectors as the similarity.
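The category-vector comparison described above can be sketched as follows. This is a minimal illustration, assuming document vectors are already available as plain lists of floats; the function names `add_vectors`, `cosine`, and `category_similarity` are hypothetical and do not come from the cited paper.

```python
import math


def add_vectors(vectors):
    """Component-wise sum of equal-length vectors (the category vector)."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine between u and v; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def category_similarity(doc_vectors_a, doc_vectors_b):
    """Similarity of two categories: cosine between their category vectors."""
    return cosine(add_vectors(doc_vectors_a), add_vectors(doc_vectors_b))
```

Two categories whose document vectors point in the same direction score 1.0, and categories with orthogonal vectors score 0.0.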
[0005]
[Problems to be solved by the invention]
At present, however, the above technique can hardly be said to achieve practically sufficient accuracy in determining corresponding categories. In general, the biggest factor degrading the accuracy of multilingual information retrieval is the semantic ambiguity of words and phrases. When a word (or phrase) in a first language is translated into a second language, many translation candidates exist. For example, the English word "base" has different Japanese translations depending on the field: "基地" as a military term, "塁" as a baseball term, "支持母体" (support base) as a political term, "基数" (radix) as a mathematical term, "塩基" as a chemical term, "期体" as a grammatical term, and "（塗料の）主成分" (main component of a paint) as an architectural term. Because such translation candidates are in most cases field-dependent, multilingual information retrieval is said to achieve high accuracy when the search target is restricted to documents in a specific field. In other words, each category has translations appropriate to its field, and multilingual document search needs to use the appropriate translations for each field. The document-vector technique described above uses a single parallel corpus as learning data, so it cannot generate document vectors suited to each field and therefore cannot resolve semantic ambiguity. If a parallel corpus could be prepared for each category, suitable document vectors could be generated for each field, but parallel corpora are generally difficult to obtain and such an approach is not feasible in practice. Besides techniques that use a parallel corpus as learning data, many multilingual document search techniques using bilingual dictionaries have also been proposed. The situation is exactly the same for them: a general bilingual dictionary cannot resolve semantic ambiguity, and preparing a bilingual dictionary for each field (category) is practically impossible.
[0006]
The present invention has been made in view of these points, and its object is to provide a system that can determine the correspondence between categories with high accuracy without preparing a parallel corpus for each category.
[0007]
[Means for Solving the Problems]
Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "A Bootstrapping method for Extracting Bilingual Text Pairs", The Proceedings of The 18th International Conference on Computational Linguistics, pp. 1066-1070 (2000), proposes the following technique for determining similar bilingual document pairs.
[0008]
"Multilingual document search is performed using a parallel corpus as initial learning data, similar bilingual document pairs are determined from a document set in which documents in two languages are mixed, the obtained document pairs are added to the initial learning data, and multilingual document search is performed again based on the resulting learning data. By alternately repeating this multilingual document search and the addition of the obtained document pairs to the learning data, the learning pairs are grown and a highly accurate parallel corpus (bilingual document pairs) is finally obtained."
[0009]
As stated in that paper, this technique works effectively only when the documents in the document set targeted by the multilingual document search have similar semantic content (belong to the same field). The present invention turns this property to its advantage. That is, the above technique is applied with the union of a first category containing documents written in a first language and a second category containing documents written in a second language as the target of multilingual document search, and if the learning pairs grow, the fields of the first category and the second category are judged to be similar.
[0010]
As shown in FIG. 1, one configuration of the present invention comprises: category structure holding means (1) for holding a first category structure generated for a first language and a second category structure generated for a second language; learning data holding means (2) for holding learning data used in multilingual document search; multilingual document search means (3) for performing, using the learning data held in the learning data holding means, a multilingual document search over a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs in the first and second languages; search result holding means (4) for holding the document pairs obtained by the multilingual document search and adding them to the learning data holding means; and category correspondence determining means (5) for determining the correspondence between categories by referring to the document pairs held in the search result holding means. In this configuration, the multilingual search processing by the multilingual document search means and the addition of document pairs to the learning data by the search result holding means are repeated alternately.
[0011]
It should be noted that the present invention can be realized not only as an apparatus or a system but also as a method, and at least a part thereof can be configured as a computer program.
[0012]
One of the above-described configurations of the present invention and other configurations of the present invention are clearly set forth in the appended claims, and are described in detail below using examples.
[0013]
[Embodiments of the invention]
Examples of the present invention will be described below.
[0014]
FIG. 2 shows the configuration of a corresponding category search system according to an embodiment of the present invention. This embodiment is described for Japanese and English, but the same effect can be obtained for any language to which morphological analysis (processing that divides a sentence into words) is applicable.
[0015]
In FIG. 2, the category structure holding means 11 holds, inside a computer, category structures (a first category structure and a second category structure) in which a plurality of Japanese documents and a plurality of English documents are respectively classified into categories and stored.
[0016]
The learning data holding means 12 is means for holding a set of Japanese-English translation document pairs (Japanese-English parallel corpus) as initial learning data. The parallel corpus is not limited to a particular field, and is a parallel corpus having general contents that are easily available. When a Japanese-English document pair is received from the search result holding means 16, it is added to the parallel corpus as initial learning data and held as a new Japanese-English parallel corpus.
[0017]
The word vector generation means 13 uses the Japanese-English parallel corpus held in the learning data holding means 12 as learning data and calculates a corresponding multidimensional vector (word vector) for every Japanese word and English word it contains. The algorithm for calculating the word vectors is described below.
[0018]
[Step 1]: All Japanese and English documents included in the learning data are subjected to a morphological analysis process.
[Step 2]: Of all the words obtained in Step 1, n words are selected in descending order of appearance frequency in the learning data. The n words obtained here are called feature expression words. The value of n is on the order of thousands.
[Step 3]: A matrix is created whose rows correspond to all the Japanese and English words obtained in Step 1 and whose columns correspond to the feature expression words. If the total number of distinct Japanese and English words obtained in Step 1 is 100,000 and n is 3,000, the result is a matrix of 100,000 rows × 3,000 columns. Each element of this matrix records how many times the word for its row and the feature expression word for its column co-occur (appear together) in the Japanese-English translation pairs included in the learning data. That is, each Japanese-English translation pair is regarded as a single document, and co-occurrences within that document are counted. The matrix thus obtained is called a co-occurrence matrix. In this way, a co-occurrence matrix is created that expresses every Japanese word and every English word as an n-dimensional vector. Each such vector can be said to indicate the contexts in which the word tends to appear.
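Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions: morphological analysis (Step 1) is taken as already done, so each translation pair arrives as two token lists, and a co-occurrence is counted once per translation pair that contains both words, which is one possible reading of the counting rule. The function name `build_cooccurrence` and the romanized toy tokens are hypothetical.

```python
from collections import Counter


def build_cooccurrence(pairs, n):
    """Build a co-occurrence matrix from tokenized translation pairs.

    pairs: list of (ja_tokens, en_tokens), already split into words (Step 1).
    n:     number of feature expression words to select (Step 2).
    Returns (vocab, features, matrix) with matrix[i][j] = number of
    translation pairs containing both vocab[i] and features[j] (Step 3).
    """
    # Each translation pair is treated as one merged document.
    merged = [ja + en for ja, en in pairs]
    freq = Counter(token for doc in merged for token in doc)
    # Step 2: the n most frequent words become the feature expression words.
    features = [word for word, _ in freq.most_common(n)]
    vocab = sorted(freq)
    row = {word: i for i, word in enumerate(vocab)}
    col = {word: j for j, word in enumerate(features)}
    matrix = [[0] * len(features) for _ in vocab]
    # Step 3: count, per merged document, every (word, feature word) pair.
    # Under this counting a feature word also co-occurs with itself.
    for doc in merged:
        present = set(doc)
        for word in present:
            for feature in present:
                if feature in col:
                    matrix[row[word]][col[feature]] += 1
    return vocab, features, matrix


# Toy usage with romanized stand-ins for Japanese tokens.
pairs = [
    (["inu", "hashiru"], ["dog", "runs"]),
    (["inu", "neru"], ["dog", "sleeps"]),
]
vocab, features, matrix = build_cooccurrence(pairs, n=2)
```

Each row of `matrix` is the n-dimensional context vector for one word; at the scale given in the text (100,000 × 3,000) a sparse representation would be a more practical choice than nested lists.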
[0019]
[Step 4]: The n-dimensional vectors obtained in Step 3 have so many dimensions that later processing would require an enormous amount of computation time. To keep the computation within a practical time, the original n-dimensional vectors are therefore compressed into n′-dimensional vectors (several hundred dimensions) by a matrix dimension reduction technique. Various dimension reduction techniques exist; a representative example is Singular Value Decomposition, described in detail in Berry, M., Do, T., O'Brien, G., Krishna, V. and Varadhan, S., "SVDPACKC USER'S GUIDE", Tech. Rep. CS-93-194, University of Tennessee, Knoxville, TN (1993). The n′-dimensional vectors obtained in this way for all Japanese and English words are called word vectors.
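The compression in Step 4 can be written compactly as a truncated singular value decomposition. The notation below is an assumption for illustration: C is the co-occurrence matrix with m rows (distinct words) and n columns (feature expression words), and taking the rows of the truncated product as word vectors is one common convention; the text itself only states that the n-dimensional vectors are compressed to n′ dimensions.

```latex
\begin{align}
C &= U \Sigma V^{\top}, & C &\in \mathbb{R}^{m \times n} \\
C &\approx U_{n'} \Sigma_{n'} V_{n'}^{\top}, & n' &\ll n \\
w_i &= \left( U_{n'} \Sigma_{n'} \right)_{i,:} = \left( C\, V_{n'} \right)_{i,:}, & w_i &\in \mathbb{R}^{n'}
\end{align}
```

Here the subscript n′ keeps only the n′ largest singular values and their singular vectors, and w_i is the word vector for the word in row i.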
[0020]
The document vector generation means 14 uses the word vectors obtained by the word vector generation means 13 to calculate document vectors for all Japanese documents in category A and all English documents in category B held in the category structure holding means 11. First, morphological analysis is applied to all Japanese documents in category A and all English documents in category B to divide them into words. Next, for each document, the sum of the word vectors of all words in the document is normalized (scaled to length 1), and the resulting vector is taken as the document vector. Words for which the word vector generation means 13 has not generated a word vector are ignored.
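The document vector calculation can be sketched as follows. This is a minimal illustration, assuming the document is already tokenized and the word vectors are available as a dictionary; the name `document_vector` and the choice to return `None` when no usable word is found are assumptions, not part of the described system.

```python
import math


def document_vector(tokens, word_vectors):
    """Normalized sum of the word vectors of the tokens in one document.

    tokens:       words of the document after morphological analysis.
    word_vectors: dict mapping a word to its n'-dimensional vector.
    Words without a word vector are ignored. Returns None when no token
    has a word vector or when the sum has zero length.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    total = [sum(components) for components in zip(*known)]
    length = math.sqrt(sum(x * x for x in total))
    if length == 0.0:
        return None
    # Scale the summed vector to length 1.
    return [x / length for x in total]
```

Because every document vector has length 1, the inner product of two document vectors equals their cosine.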
[0021]
The multilingual search means 15 searches for similar Japanese-English document pairs within a category pair consisting of an arbitrary Japanese category (category A) in the first category structure and an arbitrary English category (category B) in the second category structure held in the category structure holding means 11. The following processing is therefore performed for every pair of a Japanese category and an English category (every combination of category A and category B).
[0022]
First, by referring to the document vector obtained from the document vector generation means 14, pairs of Japanese documents and English documents satisfying the following conditions are extracted from all document sets belonging to category A and category B.
[0023]
"The English document vector most relevant to (having the largest inner product with) the document vector of the Japanese document in the pair is that of the English document in the pair, and conversely the Japanese document vector most relevant to the English document vector in the pair is that of the Japanese document in the pair."
[0024]
Next, out of Japanese-English document pairs satisfying the above conditions, a pair is extracted in which the value of the inner product between the Japanese-English document vectors corresponding to the Japanese-English documents in the pair is larger than a preset threshold value. The Japanese-English document pairs obtained in this way have very close meanings and can be used as learning data.
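The two-stage extraction above (mutual best match, then an inner-product threshold) can be sketched as follows. This is a minimal illustration, assuming document vectors are unit-length lists keyed by a document identifier; the names `dot` and `extract_pairs` and the identifiers in the usage note are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def extract_pairs(ja_vecs, en_vecs, threshold):
    """Return (ja_id, en_id) pairs that are mutual best matches and whose
    inner product is larger than the threshold.

    ja_vecs, en_vecs: dicts mapping a document id to its document vector.
    """
    if not ja_vecs or not en_vecs:
        return []
    # Best English document for each Japanese document, and vice versa.
    best_en = {
        j: max(en_vecs, key=lambda e: dot(ja_vecs[j], en_vecs[e]))
        for j in ja_vecs
    }
    best_ja = {
        e: max(ja_vecs, key=lambda j: dot(ja_vecs[j], en_vecs[e]))
        for e in en_vecs
    }
    pairs = []
    for j, e in best_en.items():
        # Keep the pair only if each document is the other's best match
        # and the pair is similar enough.
        if best_ja[e] == j and dot(ja_vecs[j], en_vecs[e]) > threshold:
            pairs.append((j, e))
    return pairs
```

For example, with Japanese vectors `{"j1": [1.0, 0.0], "j2": [0.0, 1.0]}` and English vectors `{"e1": [0.8, 0.6], "e2": [0.0, 1.0]}`, both pairs are mutual best matches, but only `("j2", "e2")` survives a threshold of 0.9.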
[0025]
The search result holding means 16 is a means for holding the Japanese-English document pair set obtained from the multilingual search means 15 for the categories A and B in the computer. The obtained document pair set is transferred to the learning data holding means 12 as a part of new learning data.
[0026]
In this way, the following processing is repeated:
(1) based on the learning data held in the learning data holding means 12, the word vector generation means 13 generates a set of word vectors;
(2) the document vector generation means 14 generates a set of document vectors;
(3) the multilingual search means 15 extracts Japanese-English document pairs with similar semantic content; and
(4) the search result holding means 16 adds the obtained document pairs to the learning data holding means 12 as part of the learning data (replacing the earlier ones if pairs have already been added).
By repeating this processing, the number of document pairs held in the search result holding means 16 gradually increases only when the semantic content of the Japanese document set in category A and that of the English document set in category B are close (that is, when category A and category B belong to the same field).
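The repetition of steps (1) to (4) can be sketched as a loop around a search function. This is a minimal illustration: the callable `search` stands in for steps (1) to (3) and is hypothetical, as are the name `bootstrap`, the round limit, and the choice to stop early once the set of pairs no longer changes.

```python
def bootstrap(initial_corpus, search, max_rounds):
    """Alternate multilingual search and learning-data update.

    initial_corpus: initial learning data (general-purpose parallel corpus).
    search:         hypothetical callable; takes the current learning data
                    and returns the (ja_id, en_id) pairs found, covering
                    steps (1)-(3).
    max_rounds:     upper limit on the number of repetitions.
    Returns (pairs, history), where history lists the number of pairs
    found in each round.
    """
    found = set()
    history = []
    for _ in range(max_rounds):
        # Step (4): the learning data is the initial corpus plus the pairs
        # from the latest search, which replace any added earlier.
        training = list(initial_corpus) + sorted(found)
        new_found = set(search(training))
        history.append(len(new_found))
        if new_found == found:
            break  # assumed stopping rule: no change since the last round
        found = new_found
    return found, history
```

If category A and category B belong to the same field, `history` is expected to grow from round to round; if they do not, it stays flat or near zero.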
[0027]
The category correspondence determining means 17 refers to the ratio of the "total number of document pairs held in the search result holding means 16" to the "total number of documents included in category A and category B". If the ratio is larger than a predetermined threshold T, it determines that category A and category B are a similar (same-field) category pair. If the ratio does not exceed the threshold T even though the above iterative processing has been performed a certain number of times or more for category A and category B, it determines that category A and category B are not a similar (same-field) category pair.
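The ratio test can be sketched as follows. This is a minimal illustration; the function names and the example numbers are hypothetical, and the value of T is left to the caller because the text does not specify it.

```python
def pair_ratio(num_pairs, num_docs_a, num_docs_b):
    """Ratio of extracted document pairs to all documents in both categories."""
    total = num_docs_a + num_docs_b
    return num_pairs / total if total else 0.0


def categories_correspond(num_pairs, num_docs_a, num_docs_b, threshold_t):
    """True when the pair ratio is larger than the threshold T."""
    return pair_ratio(num_pairs, num_docs_a, num_docs_b) > threshold_t
```

For example, 30 pairs found across two categories of 50 documents each gives a ratio of 0.3. Note that when pairs are one-to-one, as with the mutual best match condition, the ratio cannot exceed 0.5, so T would be chosen below that value.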
[0028]
The category correspondence may also be determined after only a single document search. The threshold may also be set in multiple stages. For example, after a predetermined number of document searches, if the ratio is below a threshold a (a < b) the categories are judged not to correspond; if it is at least a but below b, the document search is repeated and the same judgment is made again; and if it is at least b, the categories are immediately judged to correspond. In short, any configuration may be adopted as long as the categories are judged to correspond when the document search results show signs affirming the correspondence.
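The two-threshold variant can be sketched as a three-way decision. This is a minimal illustration; the function name `staged_decision` and the string labels are hypothetical.

```python
def staged_decision(ratio, a, b):
    """Three-way decision with thresholds a < b.

    Returns "no_match" when ratio < a, "match" when ratio >= b, and
    "continue" (run another search round and judge again) otherwise.
    """
    if not a < b:
        raise ValueError("thresholds must satisfy a < b")
    if ratio < a:
        return "no_match"
    if ratio >= b:
        return "match"
    return "continue"
```

A caller would run the search loop and call this after each predetermined number of rounds, stopping as soon as the result is no longer "continue".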
[0029]
With this configuration, the category correspondence determining means 17 determines whether a correspondence exists for every pair of a Japanese category and an English category (every combination of category A and category B), so that the correspondences between the categories of the first category structure and those of the second category structure can be determined exhaustively.
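The exhaustive determination over all category pairs can be sketched as follows. This is a minimal illustration; `corresponds` is a hypothetical callable standing in for the whole per-pair procedure (iterative search followed by the threshold decision), and the category names in the usage note are invented.

```python
from itertools import product


def find_corresponding_categories(ja_categories, en_categories, corresponds):
    """Return every (category A, category B) pair judged to correspond.

    corresponds: hypothetical callable taking one Japanese category and one
                 English category and returning True or False.
    """
    return [
        (a, b)
        for a, b in product(ja_categories, en_categories)
        if corresponds(a, b)
    ]
```

With `ja_categories = ["ja_sports", "ja_politics"]` and `en_categories = ["en_sports", "en_politics"]`, the function evaluates all four combinations. The number of combinations grows as the product of the two category counts, and each one requires its own iterative search.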
[0030]
Note that this embodiment uses the multilingual document search technique based on the vector space method with a parallel corpus as learning data, described in the aforementioned paper by Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "Query Translation Method for Cross Language Information Retrieval", The Proceedings of Machine Translation Summit VII '99 Workshop on Machine Translation for Cross Language Information Retrieval (1999). However, the same effect can be obtained with any multilingual document search technique that allows search results to be added repeatedly to the learning data (see FIG. 1).
[0031]
For example, the same effect can be obtained by translating the documents written in the first language in category A into the second language with a machine translation system and then performing the multilingual document search using an ordinary monolingual document search technique.
[0032]
An example of a machine translation system realized using a parallel corpus as learning data is given in Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, "The mathematics of statistical Machine Translation: Parameter estimation", Computational Linguistics, 32: 263-311, 1993.
[0033]
[Effect of the invention]
As described above, according to the present invention, the problem of word sense disambiguation can be avoided without preparing learning data for each field, and the correspondence between categories across a plurality of category structures constructed for different languages can be determined with high accuracy.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a typical corresponding category search system according to the present invention.
FIG. 2 is a diagram showing a configuration of a corresponding category search system according to an embodiment of the present invention.
[Explanation of symbols]
11 Category structure holding means
12 Learning data holding means
13 Word vector generation means
14 Document vector generation means
15 Multilingual search means
16 Search result holding means
17 Category correspondence determining means

Claims

A corresponding category search system comprising:
category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language;
learning data holding means for holding learning data used in multilingual document search;
multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search over a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs in the first language and the second language;
search result holding means for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and
category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

A corresponding category search system comprising:
category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language;
learning data holding means for holding translation document pairs as learning data used in multilingual document search;
multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search based on the vector space method over a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs in the first language and the second language;
search result holding means for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and
category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

A corresponding category search system comprising:
category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language;
learning data holding means for holding translation document pairs as learning data used in multilingual document search;
multilingual document search means for translating into the second language, using the learning data held in the learning data holding means, the documents written in the first language that belong to a category in the first category structure held in the category structure holding means, performing a document search over the resulting set of translated documents and the set of documents belonging to a category in the second category structure, and determining similar bilingual document pairs in the first language and the second language;
search result holding means for holding the document pairs obtained by the multilingual document search and adding the document pairs to the learning data holding means; and
category correspondence determining means for determining the correspondence between categories by referring to the document pairs held in the search result holding means.

A corresponding category search method comprising:
a category structure holding step of storing a first category structure generated for a first language and a second category structure generated for a second language;
a learning data holding step of storing learning data used in multilingual document search;
a multilingual document search step of performing, using the learning data stored in the learning data holding step, a multilingual document search over a category in the first category structure and a category in the second category structure stored in the category structure holding step, and determining similar bilingual document pairs in the first language and the second language;
a search result holding step of holding the document pairs obtained in the multilingual document search step and adding the document pairs as learning data; and
a category correspondence determination step of determining the correspondence between categories by referring to the document pairs stored in the search result holding step.

A computer program for corresponding category search, used to cause a computer to execute:
a category structure holding step of storing a first category structure generated for a first language and a second category structure generated for a second language;
a learning data holding step of storing learning data used in multilingual document search;
a multilingual document search step of performing, using the learning data stored in the learning data holding step, a multilingual document search over a category in the first category structure and a category in the second category structure stored in the category structure holding step, and determining similar bilingual document pairs in the first language and the second language;
a search result holding step of holding the document pairs obtained in the multilingual document search step and adding the document pairs as learning data; and
a category correspondence determination step of determining the correspondence between categories by referring to the document pairs stored in the search result holding step.

A corresponding category search system comprising:
category structure holding means for holding a first category structure generated for a first language and a second category structure generated for a second language;
learning data holding means for holding learning data used in multilingual document search;
multilingual document search means for performing, using the learning data held in the learning data holding means, a multilingual document search over a category in the first category structure and a category in the second category structure held in the category structure holding means, and determining similar bilingual document pairs in the first language and the second language; and
category correspondence determining means for determining the correspondence between categories based on the document pairs obtained by the multilingual document search.