JP3945282B2

JP3945282B2 - Information search apparatus, information search method, program, and recording medium

Info

Publication number: JP3945282B2
Application number: JP2002076923A
Authority: JP
Inventors: 敬重田中
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2002-03-19
Filing date: 2002-03-19
Publication date: 2007-07-18
Anticipated expiration: 2022-03-19
Also published as: JP2003281182A

Description

【０００１】
【発明の属する技術分野】
本発明は、データベースの情報を検索する情報検索装置、情報検索方法、プログラムおよび記録媒体に関する。
【０００２】
【従来の技術】
企業などでは、例えばＬＡＮ（Local Area Network）などのコンピュータネットワーク（以下、単に「ネットワーク」と称する）が構成され、このネットワーク内における各種データの共有により、作業効率の向上化が図られている。具体的には、ネットワークを形成するいずれかのコンピュータにグループウェアやコラボレートウェアなどと呼ばれるソフトウェア（以下、「グループウェア」と称する）が導入されることで、このコンピュータ（以下、「グループウェアサーバ」と称する）が保持する各種データ（例えば、共有文書や各ユーザのスケジュールなど）に対してネットワークに接続された各コンピュータ（以下、「クライアント端末」と称する）からアクセス可能になる。
【０００３】
また、グループウェアには、クライアント端末からの要求に応じて、蓄積された文書データから該当する文書データを検索する機能が備えられている。これにより、ユーザは、クライアント端末を用いてグループウェアサーバが管理する大量の文書データから所望の文書データを見つけることが容易となる。
【０００４】
【発明が解決しようとする課題】
しかしながら、グループウェアサーバが文書データを検索する時には、全ての文書データを対象に検索処理を実行するのが一般的であり、文書データの数や各文書データの容量に比例して検索時間も長くなるといった問題がある。特に、顧客からの問い合わせに対応するコールセンターでは、グループウェアサーバが顧客からの問い合わせに応じた文書データを素早く検索して取り出す必要があるため、この問題は、より深刻化する。
【０００５】
本発明は、上述した事情を鑑みてなされたものであり、データベースに蓄積されている情報のうち、検索条件に該当する情報を特定するに要する時間を短縮することが可能な情報検索装置、情報検索方法、プログラムおよび記録媒体を提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記目的を達成するために、本発明は、少なくともテキスト文を含むテキストデータと、当該テキストデータの識別情報とを対応付けるとともに、当該テキスト文に関連した複数の関連情報と、当該複数の関連情報を分類する項目と、当該テキスト文に対応するテキストデータの識別情報とを対応付けるデータベースを検索する情報検索装置であって、前記項目のうち、検索の対象となり得る重み付け単語を含む項目を指定する項目指定情報を記憶する第１の記憶手段と、前記項目指定情報によって指定された項目に分類される関連情報を前記データベースから取得する関連情報取得手段と、前記重み付け単語によって指定された単語を前記テキストデータから抽出して前記テキストデータの付加する重み付け単語付加手段と、前記テキストデータからテキスト文を抽出する本文抽出手段と、前記抽出されたテキスト文を複数の単語に分割して解析する形態素解析手段と、前記複数の単語の各々が前記テキスト文に出現する回数を計数する出現頻度計数手段と、前記関連情報取得手段によって取得された関連情報と、前記単語と当該単語の出現回数と、当該関連情報に対応する前記識別情報とを対応付けて記憶する第２の記憶手段と、前記項目指定情報によって指定された項目に則した検索条件を取得する検索条件取得手段と、前記第２の記憶手段に記憶された関連情報の中から、前記検索条件に該当する関連情報を特定し、当該関連情報に対応する前記識別情報を特定する検索手段と、を備える情報検索装置を提供する。
【０００７】
また、上記目的を達成するために、本発明は、ＣＰＵと記憶装置とを有し、少なくともテキスト文を含むテキストデータと、当該テキストデータの識別情報とを対応付けるとともに、当該テキスト文に関連した複数の関連情報と、当該複数の関連情報を分類する項目と、当該テキスト文に対応するテキストデータの識別情報とを対応付けるデータベースを検索する情報検索装置における情報検索方法であって、前記ＣＰＵが、前記項目のうち、検索の対象となり得る重み付け単語を含む項目を指定する項目指定情報を前記記憶装置に記憶する第１の過程と、前記ＣＰＵが、前記項目指定情報によって指定された項目に分類される関連情報を前記データベースから取得する第２の過程と、前記ＣＰＵが、前記重み付け単語によって指定された単語を前記テキストデータから抽出して前記テキストデータに付加する第３の過程と、前記ＣＰＵが、前記テキストデータからテキスト文を抽出する第４の過程と、前記ＣＰＵが、前記抽出されたテキスト文を複数の単語に分割して解析する第５の過程と、前記ＣＰＵが、前記複数の単語の各々が前記テキスト文に出現する回数を計数する第６の過程と、前記ＣＰＵが、前記関連情報取得手段によって取得された関連情報と、前記単語と当該単語の出現回数と、当該関連情報に対応する前記識別情報とを対応付けて前記記憶装置に記憶する第７の過程と、前記ＣＰＵが、前記項目指定情報によって指定された項目に則した検索条件を取得する第８の過程と、前記ＣＰＵが、前記記憶装置に記憶された関連情報の中から、前記検索条件に該当する関連情報を特定し、当該関連情報に対応する前記識別情報を特定する第９の過程と、を備える情報検索装置における情報検索方法を提供する。
【０００８】
上述した情報検索装置および情報検索方法によれば、データベースに記憶されている複数の項目から検索の対象となり得る項目だけが予め抽出され、そして、その抽出された項目に対して検索が行われる。従って、本発明によれば、該当する文書データを特定するに要する時間が、データベースの全ての項目に対して検索が実行されるときに比べて早くなる。また、利用者は、項目指定情報が指定する項目を変更するだけで、検索の対象とする項目を変更することができる。
【０００９】
ここで、上記情報検索装置において、前記テキストデータからテキスト文を抽出する本文抽出手段と、前記抽出されたテキスト文を複数の単語に分割して解析する形態素解析手段と、前記複数の単語の各々が前記テキスト文に出現する回数を計数する出現頻度計数手段とを備え、前記第２の記憶手段は、前記単語と当該単語の出現回数とを、前記テキスト文に対応するテキストデータの識別情報と対応付けて記憶する構成が望ましい。この構成によれば、検索条件として単語が取得された場合に、当該単語を多く含む順にテキストデータの識別情報を特定するといったことが行える。
【００１０】
また、上記目的を達成するために、本発明は、少なくともテキスト文を含むテキストデータと、当該テキストデータの識別情報とを対応付けるとともに、当該テキスト文に関連した複数の関連情報と、当該複数の関連情報を分類する項目と、当該テキスト文に対応するテキストデータの識別情報とを対応付けるデータベースを検索するコンピュータを、前記項目のうち、検索の対象となり得る重み付け単語を含む項目を指定する項目指定情報を記憶する第１の記憶手段、前記項目指定情報によって指定された項目に分類される関連情報を前記データベースから取得する関連情報取得手段、前記重み付け単語によって指定された単語を前記テキストデータから抽出して前記テキストデータの付加する重み付け単語付加手段、前記テキストデータからテキスト文を抽出する本文抽出手段、前記抽出されたテキスト文を複数の単語に分割して解析する形態素解析手段、前記複数の単語の各々が前記テキスト文に出現する回数を計数する出現頻度計数手段、前記関連情報取得手段によって取得された関連情報と、前記単語と当該単語の出現回数と、当該関連情報に対応する前記識別情報とを対応付けて記憶する第２の記憶手段と、前記項目指定情報によって指定された項目に則した検索条件を取得する検索条件取得手段、および前記第２の記憶手段に記憶された関連情報の中から、前記検索条件に該当する関連情報を特定し、当該関連情報に対応する前記識別情報を特定する検索手段として機能させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体に記憶されていても良いことは勿論である。
【００１１】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について説明する。
【００１２】
図１は、本発明の実施形態に係る情報検索システムの構成を示す図である。この図において、グループウェアサーバ２０は、例えば磁気ディスクなどの記憶装置に格納されたグループウェアデータベース２０ａを備えている。このグループウェアデータベース２０ａには、ネットワーク２を介して接続された多数のクライアント端末３０の間で共有される文書データが蓄積されている。ここで、文書データとは、テキスト文が含まれるデータのことである。また、グループウェアサーバ２０は、共有される文書データが蓄積されたデータベース（すなわち、上述したグループウェアデータベース２０ａ）の他にも、実際には、例えば利用者毎の電子メールデータが蓄積されたデータベースや、利用者毎のスケジュールデータが蓄積されたデータベースといった多種のデータベースを備えている。
【００１３】
さて、図１において、情報検索装置１０は、パーソナルコンピュータなどから構成されており、ネットワーク２を介してクライアント端末３０からの文書データの検索要求を取得し、この検索要求に該当する文書データの候補を当該クライアント端末３０に送信するものである。さらに説明すると、情報検索装置１０は、例えば磁気ディスクなどの記憶装置を備え、この記憶装置には、検索用データベース１０ａが格納されている。情報検索装置１０は、グループウェアデータベース２０ａに蓄積されている各文書データに関連する情報を検索用データベース１０ａに蓄積し、クライアント端末３０から検索要求を取得したときに、この検索用データベース１０ａに蓄積された情報を検索するようになっている。
【００１４】
図２は、本実施形態に係る情報検索装置１０の構成を示す機能ブロック図である。同図において、設定ファイル解析部１００は、設定ファイル２００に示される指示に従って、文書データに関連する情報のうち、検索用データベース１０ａに蓄積すべき情報（以下、「検索用情報」という）を特定し、データ収集部１０２に出力する。ここで、設定ファイル２００は、例えばグループウェアサーバ２０の管理者などによって作成されるデータファイルであり、その構成を図３に示す。同図に示すように、設定ファイル２００には、取得項目、重み付け単語、格納先アドレスおよび格納元アドレスの各々が指定されている。
【００１５】
取得項目は、グループウェアサーバ２０が管理するデータ項目のうち、どの項目を取得するかを指定するものである。詳述すると、グループウェアサーバ２０は、文書データに関連する関連情報をデータ項目ごとに分けて記録されたグループウェアファイル２２を、文書データごとに備えている。図４は、このグループウェアファイルの一例を示す図である。この図において、文字列「ITEM_NAME」は、データ項目を示すものであり、この文字列「ITEM_NAME」と等号（＝）にて結ばれた文字列がデータ項目名を示す。例えば、「ITEM_NAME=Classification」である場合、データ項目名は、「分類（Classification）」となる。また、データ項目名（すなわち、文字列「ITEM_NAME」）の次行がデータ項目名に対応する文書データの関連情報である。具体的には、例えば、文字列「ITEM_NAME=Classification」の次行に記載された文字列「TYPE_TEXT=テクニカルノート」は、データ項目名「分類」に対応する文書データの関連情報が「テクニカルノート」であることを示している。そこで、取得項目は、グループウェアファイル２２に含まれるデータ項目名（文字列「ITEM_NAME」によって示されるデータ項目名）のうち、取得すべきデータ項目名を指定する。なお、図示を省略するが、このグループウェアファイル２２には、当該グループウェアファイル２２が、どの文書データに対応しているかも示されている。
【００１６】
また、設定ファイル２００における重み付け単語は、検索語として頻繁に用いられる単語を指定するためのものである。格納元アドレスは、検索対象となるデータベースが格納されているアドレスを示すものである。詳述すると、グループウェアサーバ２０は、上述したように、多数のデータベースを備えるのが一般的であり、このため、どのデータベースを検索対象とするかが特定される必要がある。そこで、アドレスを指定することにより、検索対象となるデータベースを特定するのである。また、格納先アドレスは、上述した格納元アドレスによって特定されるデータベース内の各データから検索用情報に従って抽出した情報を検索用データベース１０ａに格納するときのアドレスを示すものである。このように、検索対象となるデータベースごとに、異なる格納先アドレスが指定されることで、検索対象となるデータベースごとに抽出した多数の情報を検索用データベース１０ａに格納することができるようになっている。
【００１７】
さて、図２において、データ収集部１０２は、設定ファイル解析部１００からの検索用情報によって示される取得項目をグループウェアサーバ２０からネットワーク２を介して受け取り、次の処理を行うものである。すなわち、データ収集部１０２は、文書データおよびグループウェアファイル２２から取得した各項目のうち、文書データにおける本文部分に対応するものから本文データファイル２０２を生成するとともに、本文部分以外のものから情報データファイル２０４を生成し、各々をインデキシング部１０４に出力する。図５に示すように、本文データファイル２０２には、重み付け単語によって指定された単語（図示例では、「インターフェースデバイスＹＹＹ」など）が本文データの末尾に付加される（詳細については、後述）。また、図６に示すように、情報データファイル２０４に含まれる情報は、例えば、文書データに付されたタイトル（TITLE）や、グループウェアデータベース２０ａにおける文書データの格納元アドレス（URL：Uniform Resource Locator）などである。なお、データ収集部１０２がグループウェアサーバ２０から文書データを取得する機能は、グループウェアの製造元が提供するＡＰＩ（Application Program Interface）によって実現されている。
【００１８】
インデキシング部１０４は、データ収集部１０２から受け取った本文データファイル２０２に対して形態素解析を行った後に、インデキシング（目次化）を実行し、この実行結果を、インデックスファイル２０６に登録するものであり、コンピュータにおけるＣＰＵに相当する。インデックスファイル２０６は、検索用データベース１０ａに格納されているものであり、インデックスファイル２０６には、ページテーブル２０６ａ、キーワードテーブル２０６ｃおよび単語テーブル２０６ｂが含まれている（図７参照）。なお、各データテーブルについては、後述する。
【００１９】
ここで、インデキシング部１０４が実行する形態素解析とは、漢字仮名交じりで記載された日本語の文を単語（形態素）に分解し、各単語の読み仮名や品詞などを特定することである。形態素解析用辞書１０６は、インデキシング部１０４における形態素解析に用いられる辞書であり、様々な単語を収録している。さらに説明すると、インデキシング部１０４は、解析対象となる文の続きの部分と最も長く一致する単語を形態素解析用辞書１０６から抽出するといったことを繰り返して文を単語（形態素）に分解する。なお、単語同士が空白で区切られる言語（例えば英語）にて本文データファイルの本文が記載されている場合には、形態素解析が必要ないことは勿論である。
【００２０】
図８は、上述したページテーブルの一例を示す図である。このページテーブル２０６ａは、各文書データの概要を示す情報を管理するためのものである。このページテーブル２０６ａの１つのレコードには、文書識別情報と、サーバ識別情報と、格納元アドレスと、最終更新日時情報と、題名情報と、本文情報と、分類情報と、総単語数情報と、ソフト別文書識別情報と、参照レベル情報との各々が含まれている。
【００２１】
ここで、文書識別情報は、グループウェアデータベース２０ａから取得した文書データごとに、情報検索装置１０が固有に割り当てる識別情報である。サーバ識別情報は、その文書データの取得元であるグループウェアサーバ２０を特定する情報であり、本実施形態にあっては、図８に示すように、情報検索装置１０がサーバごとに固有に割り当てた番号によって示される。格納元アドレスは、グループウェアデータベース２０ａにおける文書データの格納アドレスを示すものであり、図８に示すように、ＵＲＬによって指定されている。最終更新日時情報は、情報検索装置１０が文書データの情報を更新した最終日時を示す情報である。題名情報は、その文書データの題名（TITLE）を示す情報であり、例えば２５６バイトといった所定バイト数の文字列によって示される。本文情報は、その文書データの本文の先頭から所定文字数（例えば２５６バイト）分の文を示すものである。
【００２２】
また、分類情報は、文書データの文書の分類を示す情報である。より具体的には、例えば、文書データがコールセンター内のネットワークで共有されるものである場合、分類情報には、その文書データが製品のテクニカルサポート用文書なのか、製品のマニュアルなのかといったことを示す情報が記録される。総単語情報は、文書データの本文における総単語数を示すものである。ソフト別文書識別情報は、グループウェアサーバ２０が文書データに割り当てた固有の識別情報を示すものである。参照レベル情報は、その文書データの閲覧がネットワークに接続された各クライアント端末に限定されているか、または、ネットワーク外の端末にも許可されているかといった情報を示すものである。ここで、サーバ識別情報と、ソフト別文書識別情報とがページテーブル２０６ａに含まれているのは、多数のサーバに同一のグループウェアが導入されている場合に、各々のサーバが同一の識別情報を文書データに割り当てたときでも、どのサーバのどの文書データなのかを一意に特定できるようにするためである。
【００２３】
次いで、図９は、上述した単語テーブルの一例を示す図である。この単語テーブル２０６ｂは、各文書データの本文に含まれる単語を管理するためのものである。より具体的には、図９に示すように、単語テーブル２０６ｂの１つのレコードには、単語と、情報検索装置１０が単語ごとに固有に割り当てられる単語識別情報と、グループウェアデータベース２０ａに蓄積されている全文書データのうち、この単語を本文に含む文書データの数を示す単語使用文書数とが含まれている。ここで、単語使用文書数は、インデキシング部１０４が文書データの本文データファイル２０２に対して形態素解析を行った結果に従って算出されるものである。具体的には、インデキシング部１０４は、１つの本文データファイル２０２に形態素解析を行って本文を単語（形態素）に分解した後に、各々の単語ごとに固有の識別情報を割り当てて、単語テーブル２０６ｂに登録する。そして、インデキシング部１０４は、登録した単語識別情報に対応する単語使用文書数の値を「１」だけインクリメントする。係る処理がグループウェアデータベース２０ａに蓄積されている全ての文書データについて行われた結果、単語ごとの単語使用文書数が得られる。
【００２４】
また、図１０は、上述したキーワードテーブルの一例を示す図である。このキーワードテーブル２０６ｃは、各文書データの本文に含まれる単語ごとに、１つの単語が何回出現しているかなどを管理するためのものである。具体的には、図１０に示すように、キーワードテーブル２０６ｃの１つのレコードには、上述した単語テーブル２０６ｂに含まれる単語識別情報と、上述したページテーブル２０６ａに含まれる文書識別情報と、出現回数と、重要度とが含まれている。出現回数は、単語が、文書識別情報によって特定される文書データの本文内に何回出現するかを示すものであり、インデキシング部１０４が行う形態素解析により得られる。さらに説明すると、インデキシング部１０４は、文書データの本文データファイル２０２の本文を単語（形態素）に分解した後に、その本文内に、単語識別情報によって示される単語が幾つ含まれるかを計数することにより、出現頻度を算出する。重要度は、全文書データの本文における単語の頻出度を示すものであり、次の式を用いてインデキシング部１０４により算出される。
（重要度）＝Ｓ×ｌｏｇ(Ｎ／ｎ)
ここで、Ｓは、出現回数、Ｎは、グループウェアデータベース２０ａに蓄積されている文書データの数、ｎは、上述した単語使用文書数である。この式によって示されるように、本文に同じ単語が含まれる文書データが多くなる程、その単語の重要度が小さくなり、また、１つの文書データの本文に同じ単語が頻繁に出現する程、その単語の重要度が高くなる。ここで、上述したように、文書データの本文データファイル２０２の末尾には、データ収集部１０２により重み付け単語が付与されているため、この重み付け単語の重要度は、相対的に高くなるのである。特に、文書データの題目（TITLE）には、その文書データの本文の内容を顕著に反映した単語が含まれることが多いため、この題目を本文データファイル２０２に重み付けするようにしても良い。
【００２５】
図２において、検索要求取得応答部１０８は、ネットワーク２を介してクライアント端末３０から検索要求を受け取り、検索部１１０に出力する。この検索要求取得応答部１０８は、コンピュータにおけるネットワークインターフェースデバイスに相当する。また、検索部１１０は、検索要求取得応答部１０８からの検索要求に応じて検索用データベース１０ａに格納されているインデックスファイル２０６を検索し、検索結果を、検索要求取得応答部１０８に出力する。検索要求取得応答部１０８は、検索部１１０から検索結果を受け取ると、この検索結果をネットワーク２を介してクライアント端末３０に送信する。
【００２６】
次いで、本実施形態に係る情報検索装置１０の動作について説明する。
ここで、以下に説明する各処理手順を規定するプログラムは、情報検索装置１０が備えるＲＯＭや磁気ディスクなどの記録媒体に格納されている。なお、このプログラムは、例えば、光ディスクや光磁気ディスク、磁気ディスクなどの可搬型の記録媒体に記録されたものが情報検索装置１０にインストールされたものでも良く、また、ネットワーク２を介して当該情報検索装置１０にインストールされたものであっても良い。
【００２７】
さて、情報検索装置１０は、グループウェアデータベース２０ａに蓄積されている各文書データの情報を示すインデックスファイル２０６に登録するための登録処理を実行する。具体的には、図１１に示すように、先ず、設定ファイル解析部１００が設定ファイル２００を読み出して、設定ファイル２００によって指示される取得項目、重み付け単語、格納元アドレスおよび格納先アドレスを特定し、これらの特定した情報を検索用情報としてデータ収集部１０２に出力する（ステップＳａ１）。
【００２８】
次に、データ収集部１０２は、設定ファイル解析部１００からの検索用情報によって示される取得項目をグループウェアサーバ２０からネットワーク２を介して受け取り、本文データファイル２０２（図５参照）および情報データファイル２０４（図６参照）を生成し、各々をインデキシング部１０４に出力する（ステップＳａ２）。
【００２９】
そして、インデキシング部１０４は、データ収集部１０２から受け取った本文データファイル２０２に対して形態素解析を行った後に、インデキシングを実行し、この実行結果を、３つのデータテーブルを含むインデックスファイル２０６に登録する。（ステップＳａ３）。これにより、１つの文書データに関する情報がインデックスファイル２０６に登録されることとなる。次いで、データ収集部１０２は、グループウェアデータベース２０ａ内に処理されてない文書データがあるかを判別し（ステップＳａ４）、この判別結果がＹＥＳであれば、残りの文書データの情報をインデックスファイル２０６に登録すべく、処理手順をステップＳａ２に戻す。一方、ステップＳａ４における判別結果がＮＯであれば、データ収集部１０２は、処理を終了する。これにより、グループウェアデータベース２０ａに蓄積されている全ての文書データの情報がインデックスファイル２０６に登録されることとなる。
【００３０】
ところで、グループウェアデータベース２０ａに蓄積されている文書データに対して、追加または削除が行われたり、また、１つの文書データに対して編集が行われたりといった編集処理が頻繁に行われる。そこで、情報検索装置１０は、インデックスファイル２０６に登録されている情報とグループウェアデータベース２０ａ内の各文書データの整合性が崩れないように、次のインデックスファイル修正処理を一定時間ごとに行っている。
【００３１】
すなわち、図１２に示すように、先ず、データ収集部１０２は、設定ファイル解析部１００からの検索用情報によって示される取得項目をグループウェアサーバ２０からネットワーク２を介して受け取り、本文データファイル２０２および情報データファイル２０４を生成し、各々をインデキシング部１０４に出力する（ステップＳｂ１）。インデキシング部１０４は、本文データファイル２０２、情報データファイル２０４およびインデックスファイル２０６に登録されている情報から、文書データが、▲１▼追加されたものであるか、▲２▼修正されたものであるか、▲３▼編集が加えられていないものか、を判別する（ステップＳｂ２）。
【００３２】
より具体的には、インデキシング部１０４は、情報データファイル２０４に含まれているサーバ識別情報およびソフト別文書識別情報に該当するものがインデックスファイル２０６のページテーブル２０６ａに登録されていなければ、この文書データが追加されたものであると判別する。一方、情報データファイル２０４に含まれているサーバ識別情報およびソフト別文書識別情報に該当するものが、インデックスファイル２０６のページテーブル２０６ａに既に登録されているものの、最終更新日時情報が情報データファイル２０４とインデックスファイル２０６との間で異なる場合には、インデキシング部１０４は、この文書データが修正されたと判別する。さらにまた、サーバ識別情報、ソフト別文書識別情報および最終更新日時情報の各々がいずれも情報データファイル２０４とインデックスファイル２０６との間で同じであれば、インデキシング部１０４は、この文書データに対して何ら編集処理が成されていないと判別する。
【００３３】
さて、ステップＳｂ２における判別結果が、▲１▼追加されたものである、と判別された場合には、インデキシング部１０４は、上述した登録処理におけるステップＳａ３と同様の処理を実行し、この文書データの情報をインデックスファイル２０６に登録する（ステップＳｂ３）。次いで、データ収集部１０２は、グループウェアデータベース２０ａ内に処理されていない文書データがあるかを判別し（ステップＳｂ４）、この判別結果がＹＥＳであれば、残りの文書データを処理すべく、処理手順をステップＳｂ１に戻す。これにより、グループウェアデータベース２０ａに追加された文書データの情報がインデックスファイル２０６に新たに登録されることとなる。
【００３４】
一方、ステップＳｂ２の判別において、▲２▼修正されたものである、と判別された場合には、インデキシング部１０４は、この文書データに対応するインデックスファイル２０６の情報を一旦削除した後に、この文書データに対応する情報を新たに生成し、インデックスファイル２０６に登録する。より具体的には、インデキシング部１０４は、先ず、この文書データに対応する文書識別情報（図８参照）を特定し（ステップＳｂ５）、インデックスファイル２０６に含まれるページテーブル２０６ａ、単語テーブル２０６ｂ、キーワードテーブル２０６ｃの各々のテーブルから、特定した文書識別情報に関する情報を一括して削除する（ステップＳｂ６）。次いで、インデキシング部１０４は、この文書データに対応する情報を上述したインデキシング処理により生成し、インデックスファイル２０６に登録する（ステップＳｂ７）。次いで、データ収集部１０２は、グループウェアデータベース２０ａ内に処理されていない文書データがあるかを判別し（ステップＳｂ４）、この判別結果がＹＥＳであれば、残りの文書データを処理すべく、処理手順をステップＳｂ１に戻す。これにより、文書データに対して行われた修正がインデックスファイル２０６に反映されることとなる。また、ステップＳｂ２における判別結果が、▲３▼編集が加えられていないものであると判別された場合にも、インデキシング部１０４は、処理ステップをステップＳｂ４に進める。
【００３５】
次いで、ステップＳｂ４における判別結果がＮＯであれば、グループウェアデータベース２０ａ内の全ての文書データに対して処理が実行されたこととなる。従って、上述した一連の処理の間、インデックスファイル２０６（ページテーブル２０６ａ）において、一度も参照されなかった文書識別情報に対応する文書データは、グループウェアデータベース２０ａ内に存在しないこととなる。従って、インデキシング部１０４は、インデックスファイル２０６のページテーブル２０６ａから、参照されなかった文書識別情報を全て抽出し（ステップＳｂ８）、抽出した文書識別情報に対応する各情報を、インデックスファイル２０６に含まれる全てのテーブルから削除して（ステップＳｂ９）、処理を終了する。これにより、グループウェアデータベース２０ａから削除された文書データに対応する情報がインデックスファイル２０６から削除されることとなる。また、文書データが削除された場合、その文書識別情報に対応する情報をインデックスファイル２０６から削除するだけでよいため、インデックスファイル２０６の修正に要する時間が短縮される。
【００３６】
このように、インデックスファイル２０６には、グループウェアデータベース２０ａに蓄積されている各文書データの情報が登録され、文書データに対して、追加や削除、修正といった編集処理が行われたとしても、上述したインデックスファイル修正処理が一定時間ごとに繰り返し行われることで、その編集処理に応じて変更された情報がインデックスファイル２０６に即座に反映される。
【００３７】
さて、情報検索装置１０の検索要求取得応答部１０８は、クライアント端末３０からネットワーク２を介して検索要求を受け取ると、この検索要求を検索部１１０に出力する。検索部１１０は、受け取った検索要求に従ってインデックスファイル２０６を検索し、該当する文書データの情報を抽出する。より具体的には、検索要求には、検索語として、検索用の単語、または、設定ファイル２００によって指定されたデータ項目が含まれている。例えば、検索要求に単語が検索語として含まれている場合、検索部１１０は、キーワードテーブル２０６ｃを参照し、その単語（詳細には、単語識別情報）の重要度が最も大きい順に文書識別情報を抽出する。そして、検索部１１０は、重要度の上位から所定の数（例えば２０など）だけの文書識別情報に対応する題名情報、本文情報および格納元アドレス（ＵＲＬ）などをページテーブル２０６ａから抽出し、検索要求取得応答部１０８を介してクライアント端末３０に送信する。これにより、クライアント端末３０に検索語に対応した文書データの候補が送信されることとなる。また、検索語として、例えば最終編集日時が検索要求に含まれていた場合には、検索部１１０は、ページテーブル２０６ａの各レコードを検索し、該当する文書識別情報に対応する題名情報、本文情報および格納元アドレス（ＵＲＬ）を検索要求取得応答部１０８を介してクライアント端末３０に送信する。なお、検索要求には、検索語として、単語およびデータ項目の各々が含まれていても良いことは勿論である。
【００３８】
このように、本実施形態によれば、グループウェアデータベース２０ａに蓄積されている文書データごとに、検索条件となり得る情報だけがインデックスファイル２０６に予め登録されている。情報検索装置１０は、検索要求を受けた場合には、このインデックスファイル２０６を検索すれば良く、インデックスファイル２０６のデータ量は、グループウェアデータベース２０ａに蓄積されている文書データのデータ量よりも小さいため、グループウェアデータベース２０ａの各文書データを対象として検索するよりも、速く検索が行える。さらに、利用者などが設定ファイル２００によって指定する取得項目を変更すれば、インデックスファイル２０６に登録されるデータ項目を変更することができるため、検索の用途に合わせてインデックスファイル２０６を構成しておくことができる。
また、本実施形態にて説明した情報検索装置１０は、複数のグループウェア間で汎用的に用いられ得るものである。さらに詳述すると、グループウェア毎に設定ファイル２００に記述する取得項目を変更するだけで、グループウェア毎にインデックスファイル２０６が構築されることになる。また、このような構成により、グループウェア毎にインデックスファイル２０６を構築すべく設定ファイル２００を変更したとしても、変更された設定ファイル２００に対応させて情報検索装置１０を動作させるべく、本実施形態に係る情報検索のためのプログラムを再度コンパイルする必要がない。
【００３９】
＜変形例＞
上述した実施形態は、あくまでも例示であって、本発明の一態様を示すものであり、本発明の範囲内で任意に変形可能である。そこで、以下に、各種の変形例について説明する。
【００４０】
例えば、上述した実施形態では、ネットワーク２にグループウェアサーバ２０が１つだけ接続される構成について例示したが、これに限らず、グループウェアサーバ２０が複数接続される構成であっても良い。さらに、夫々のグループウェアサーバ２０には、互いに異なるグループウェアが導入されていても良い。さらに詳述すると、互いに異なる複数のグループウェアサーバの各々のデータベースを統括的に検索することは、グループウェア毎にデータの管理形式（例えばデータ項目の数や名前など）が異なるため、一般的に困難である。これに対して、本変形例は、検索対象となり得るデータ項目の情報だけをインデックスファイル２０６のページテーブル２０６ａに登録する構成となっている。従って、情報検索装置１０がページテーブル２０６ａを検索することは、複数のグループウェアサーバの各々のデータベースを検索することと同等なことであり、これにより、複数のグループウェアサーバの各々のデータベースの検索が実現される。
【００４１】
また、例えば、インデキシング部１０４は、本文データファイル２０２に対して形態素解析を行う際に、例えば「ＰＣ」、「パーソナルコンピュータ」、「パソコン」といった、互いに同一のものを指す単語を一つの単語として扱っても良い。これにより、例えば、検索語として「パソコン」が検索要求に含まれていた場合でも、「ＰＣ」や「パーソナルコンピュータ」といった単語を含む文書データも該当する文書データとして抽出され、検索の精度が向上する。
【００４２】
【発明の効果】
本発明によれば、データベースに蓄積されている情報のうち、検索条件に該当する情報を特定するに要する時間を短縮することが可能な情報検索装置、情報検索方法、プログラムおよび記録媒体が提供される。
【図面の簡単な説明】
【図１】本発明の実施形態に係る情報検索システムの構成を示すブロック図である。
【図２】情報検索装置の機能的構成を示すブロック図である。
【図３】同設定ファイルの一例を示す図である。
【図４】同グループウェアファイルの一例を示す図である。
【図５】同本文データファイルの一例を示す図である。
【図６】同情報データファイルの一例を示す図である。
【図７】同インデックスファイルのデータ構成を示す概念図である。
【図８】同ページテーブルの一例を示す図である。
【図９】同単語テーブルの一例を示す図である。
【図１０】同キーワードテーブルの一例を示す図である。
【図１１】情報検索装置によって実行される登録処理の手順を示すフローチャートである。
【図１２】情報検索装置によって実行されるインデックスファイル修正処理の手順を示すフローチャートである。
【符号の説明】
１０・・・情報検索装置、１０ａ・・・検索用データベース、２０・・・・グループウェアサーバ、２０ａ・・・グループウェアデータベース、３０・・・クライアント端末、１００・・・設定ファイル解析部、１０２・・・データ収集部、１０４・・・インデキシング部、１０６・・・形態素解析用辞書、１０８・・・検索要求取得応答部、１１０・・・検索部、２００・・・設定ファイル、２０６・・・インデックスファイル。
[0001]
[Technical Field of the Invention]
The present invention relates to an information search apparatus, an information search method, a program, and a recording medium that search for information in a database.
[0002]
[Prior art]
In companies and similar organizations, a computer network such as a LAN (Local Area Network) (hereinafter simply referred to as the “network”) is set up, and work efficiency is improved by sharing various data within the network. Specifically, software called groupware or collaboration software (hereinafter referred to as “groupware”) is installed on one of the computers forming the network, so that the various data held by this computer (hereinafter referred to as the “groupware server”), such as shared documents and each user's schedule, can be accessed from each computer connected to the network (hereinafter referred to as a “client terminal”).
[0003]
In addition, the groupware has a function of searching for corresponding document data from the stored document data in response to a request from the client terminal. Accordingly, the user can easily find desired document data from a large amount of document data managed by the groupware server using the client terminal.
[0004]
[Problems to be solved by the invention]
However, when the groupware server searches for document data, it generally executes the search processing on all of the document data, so the search time grows in proportion to the number of pieces of document data and the size of each piece. This problem is particularly serious in a call center that responds to customer inquiries, because the groupware server must quickly search for and retrieve the document data relevant to each inquiry.
[0005]
The present invention has been made in view of the above circumstances, and its object is to provide an information search apparatus, an information search method, a program, and a recording medium capable of reducing the time required to identify, among the information stored in a database, the information that matches a search condition.
[0006]
[Means for Solving the Problems]
In order to achieve the above object, the present invention provides an information search apparatus that searches a database which associates text data including at least a text sentence with identification information of the text data, and which also associates a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, the apparatus comprising: first storage means for storing item designation information that designates, among the items, items including weighted words that can be search targets; related information acquisition means for acquiring, from the database, related information classified into the items designated by the item designation information; weighted word adding means for extracting words designated by the weighted words from the text data and adding them to the text data; text extraction means for extracting the text sentence from the text data; morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing them; appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence; second storage means for storing, in association with one another, the related information acquired by the related information acquisition means, the words and their numbers of appearances, and the identification information corresponding to the related information; search condition acquisition means for acquiring a search condition conforming to the items designated by the item designation information; and search means for identifying, among the related information stored in the second storage means, related information matching the search condition and identifying the identification information corresponding to that related information.
[0007]
In order to achieve the above object, the present invention also provides an information search method for an information search apparatus that has a CPU and a storage device and that searches a database which associates text data including at least a text sentence with identification information of the text data, and which also associates a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, the method comprising: a first step in which the CPU stores, in the storage device, item designation information that designates, among the items, items including weighted words that can be search targets; a second step in which the CPU acquires, from the database, related information classified into the items designated by the item designation information; a third step in which the CPU extracts words designated by the weighted words from the text data and adds them to the text data; a fourth step in which the CPU extracts the text sentence from the text data; a fifth step in which the CPU divides the extracted text sentence into a plurality of words and analyzes them; a sixth step in which the CPU counts the number of times each of the plurality of words appears in the text sentence; a seventh step in which the CPU stores in the storage device, in association with one another, the related information acquired by the related information acquisition means, the words and their numbers of appearances, and the identification information corresponding to the related information; an eighth step in which the CPU acquires a search condition conforming to the items designated by the item designation information; and a ninth step in which the CPU identifies, among the related information stored in the storage device, related information matching the search condition and identifies the identification information corresponding to that related information.
[0008]
According to the information search apparatus and the information search method described above, only the items that can be search targets are extracted in advance from the plurality of items stored in the database, and the search is performed on the extracted items. Therefore, according to the present invention, the time required to identify the relevant document data is shorter than when the search is executed on all items in the database. Further, the user can change the items to be searched simply by changing the items designated by the item designation information.
[0009]
Here, the information search apparatus described above preferably comprises text extraction means for extracting the text sentence from the text data, morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing them, and appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence, with the second storage means storing the words and their numbers of appearances in association with the identification information of the text data corresponding to the text sentence. With this configuration, when a word is acquired as a search condition, the identification information of the text data can be identified in descending order of how many times the text data contains that word.
[0010]
In order to achieve the above object, the present invention may of course also be embodied as a program stored on a computer-readable recording medium, the program causing a computer that searches a database which associates text data including at least a text sentence with identification information of the text data, and which also associates a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, to function as: first storage means for storing item designation information that designates, among the items, items including weighted words that can be search targets; related information acquisition means for acquiring, from the database, related information classified into the items designated by the item designation information; weighted word adding means for extracting words designated by the weighted words from the text data and adding them to the text data; text extraction means for extracting the text sentence from the text data; morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing them; appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence; second storage means for storing, in association with one another, the related information acquired by the related information acquisition means, the words and their numbers of appearances, and the identification information corresponding to the related information; search condition acquisition means for acquiring a search condition conforming to the items designated by the item designation information; and search means for identifying, among the related information stored in the second storage means, related information matching the search condition and identifying the identification information corresponding to that related information.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0012]
FIG. 1 is a diagram showing the configuration of an information search system according to an embodiment of the present invention. In this figure, the groupware server 20 includes a groupware database 20a stored in a storage device such as a magnetic disk. The groupware database 20a stores document data shared among a large number of client terminals 30 connected via the network 2. Here, document data means data that includes a text sentence. In addition to the database storing the shared document data (that is, the groupware database 20a described above), the groupware server 20 in practice includes various other databases, such as a database storing e-mail data for each user and a database storing schedule data for each user.
[0013]
In FIG. 1, the information search apparatus 10 is configured from a personal computer or the like. It acquires a document data search request from a client terminal 30 via the network 2 and transmits candidates for the document data matching the search request to that client terminal 30. More specifically, the information search apparatus 10 includes a storage device such as a magnetic disk, and a search database 10a is stored in this storage device. The information search apparatus 10 accumulates, in the search database 10a, information related to each piece of document data stored in the groupware database 20a, and when it acquires a search request from a client terminal 30, it searches the information accumulated in the search database 10a.
[0014]
FIG. 2 is a functional block diagram showing the configuration of the information search apparatus 10 according to this embodiment. In the figure, the setting file analysis unit 100 identifies, in accordance with the instructions given in the setting file 200, the information to be stored in the search database 10a (hereinafter referred to as “search information”) from among the information related to the document data, and outputs it to the data collection unit 102. Here, the setting file 200 is a data file created by, for example, an administrator of the groupware server 20, and its structure is shown in FIG. 3. As shown in that figure, the setting file 200 specifies acquisition items, weighted words, a storage destination address, and a storage source address.
[0015]
The acquisition items specify which of the data items managed by the groupware server 20 are to be acquired. More specifically, the groupware server 20 holds, for each piece of document data, a groupware file 22 in which the related information about that document data is recorded separately for each data item. FIG. 4 is a diagram showing an example of this groupware file. In this figure, the character string “ITEM_NAME” indicates a data item, and the character string joined to “ITEM_NAME” by an equal sign (=) is the data item name. For example, in the case of “ITEM_NAME=Classification”, the data item name is “Classification”. The line following the data item name (that is, following the character string “ITEM_NAME”) holds the related information of the document data for that data item name. Specifically, for example, the character string “TYPE_TEXT=Technical Note” on the line following the character string “ITEM_NAME=Classification” indicates that the related information of the document data for the data item name “Classification” is “Technical Note”. The acquisition items therefore specify which of the data item names contained in the groupware file 22 (the data item names indicated by the character string “ITEM_NAME”) are to be acquired. Although not shown, the groupware file 22 also indicates which document data it corresponds to.
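As a rough illustration of how the acquisition items might select entries from a groupware file, the following Python sketch parses the two-line “ITEM_NAME=…” / “TYPE_TEXT=…” pairs described above. The complete layout of FIG. 4 is not reproduced in the text, so the assumption that every item consists of exactly one name line followed by one value line, the function name `parse_groupware_file`, and the “Author” item in the sample are all hypothetical.

```python
def parse_groupware_file(text, wanted_items):
    """Return {data item name: related information} for the requested items.

    Assumes each data item is an "ITEM_NAME=<name>" line followed by one
    "<TYPE>=<value>" line, as in the Classification example in the text.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    result = {}
    for i, line in enumerate(lines):
        if not line.startswith("ITEM_NAME="):
            continue
        name = line[len("ITEM_NAME="):]
        if name not in wanted_items or i + 1 >= len(lines):
            continue
        value_line = lines[i + 1]
        if value_line.startswith("ITEM_NAME="):
            continue  # no value line follows this item name
        # Everything after the first "=" is the related information.
        _, _, value = value_line.partition("=")
        result[name] = value
    return result


# Sample file; the "Author" item is made up for illustration.
sample = """ITEM_NAME=Classification
TYPE_TEXT=Technical Note
ITEM_NAME=Author
TYPE_TEXT=Tanaka
"""

selected = parse_groupware_file(sample, {"Classification"})
print(selected)
```

Only the items named in the acquisition items are returned, so data items that are never searched do not need to be copied into the search database.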
[0016]
The weighted words in the setting file 200 designate words that are frequently used as search terms. The storage source address indicates the address at which the database to be searched is stored. More specifically, as described above, the groupware server 20 generally includes a large number of databases, so it is necessary to identify which database is to be searched. The database to be searched is therefore identified by specifying its address. The storage destination address indicates the address used when the information extracted, in accordance with the search information, from each piece of data in the database identified by the storage source address is stored in the search database 10a. In this way, by specifying a different storage destination address for each database to be searched, the many pieces of information extracted for each such database can be stored in the search database 10a.
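The four kinds of settings can be pictured as one record per database to be searched. The sketch below is only an assumed in-memory representation: the actual file format of FIG. 3 is not given in the text, and the class name, field names, and addresses are hypothetical. It also includes a small check for the point made above, namely that each searched database should have its own storage destination address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    """Settings for one database to be searched (field names are assumptions)."""
    acquisition_items: tuple   # data item names to fetch, e.g. "Classification"
    weighted_words: tuple      # words frequently used as search terms
    source_address: str        # where the database to be searched is stored
    destination_address: str   # where extracted data goes in the search database


def check_distinct_destinations(settings_list):
    """Raise ValueError if two searched databases share a storage destination."""
    seen = {}
    for s in settings_list:
        if s.destination_address in seen:
            raise ValueError(
                f"{s.source_address} and {seen[s.destination_address]} "
                f"both write to {s.destination_address}"
            )
        seen[s.destination_address] = s.source_address


# Hypothetical addresses, for illustration only.
shared_docs = SearchSettings(
    acquisition_items=("Classification", "TITLE"),
    weighted_words=("interface device YYY",),
    source_address="groupware://server1/shared_docs",
    destination_address="search_db/shared_docs",
)
manuals = SearchSettings(
    acquisition_items=("TITLE",),
    weighted_words=(),
    source_address="groupware://server1/manuals",
    destination_address="search_db/manuals",
)
check_distinct_destinations([shared_docs, manuals])  # passes silently
```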
[0017]
In FIG. 2, the data collection unit 102 receives, from the groupware server 20 via the network 2, the acquisition items indicated by the search information from the setting file analysis unit 100, and performs the following processing. That is, from among the items acquired from the document data and the groupware file 22, the data collection unit 102 generates the body data file 202 from those corresponding to the body of the document data and the information data file 204 from those other than the body, and outputs both files to the indexing unit 104. As shown in FIG. 5, in the body data file 202, the words designated by the weighted words (in the illustrated example, “interface device YYY” and the like) are appended to the end of the body data (details are described later). As shown in FIG. 6, the information contained in the information data file 204 includes, for example, the title (TITLE) attached to the document data and the storage source address (URL: Uniform Resource Locator) of the document data in the groupware database 20a. The function by which the data collection unit 102 acquires document data from the groupware server 20 is implemented using an API (Application Program Interface) provided by the groupware vendor.
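The split into a body data file and an information data file, together with the appending of weighted words, could look roughly like the following Python sketch. It assumes that the acquired items arrive as a dictionary, that the body is stored under a key named "BODY", and that each weighted word found in the body is appended once; none of these details are stated in the text, and the function name is hypothetical.

```python
def build_body_and_info(items, weighted_words, body_key="BODY"):
    """Split acquired items into (body text, info dict).

    Weighted words that occur in the body are appended to its end, one per
    line, so that they are counted again when the body is indexed.
    """
    body = items.get(body_key, "")
    info = {key: value for key, value in items.items() if key != body_key}
    found = [word for word in weighted_words if word in body]
    if found:
        body = body + "\n" + "\n".join(found)
    return body, info


# Hypothetical acquired items for one document.
items = {
    "BODY": "Install the driver for interface device YYY before use.",
    "TITLE": "Driver setup",
    "URL": "http://groupware.example/docs/1",
}
body, info = build_body_and_info(items, ["interface device YYY", "printer"])
print(body)
print(info)
```

Because the appended words are counted along with the original body, a weighted word ends up with a higher occurrence count than it would have from the body alone.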
[0018]
The indexing unit 104 performs morphological analysis on the body data file 202 received from the data collection unit 102, then executes indexing (building a table of contents) and registers the result in the index file 206; it corresponds to the CPU of a computer. The index file 206 is stored in the search database 10a and contains a page table 206a, a keyword table 206c, and a word table 206b (see FIG. 7). Each data table is described later.
[0019]
Here, the morphological analysis performed by the indexing unit 104 means decomposing a Japanese sentence written in a mixture of kanji and kana into words (morphemes) and identifying the reading, part of speech, and so on of each word. The morphological analysis dictionary 106 is the dictionary used for morphological analysis in the indexing unit 104 and contains a wide variety of words. More specifically, the indexing unit 104 decomposes a sentence into words (morphemes) by repeatedly extracting, from the morphological analysis dictionary 106, the word that has the longest match with the remaining part of the sentence being analyzed. Of course, when the body of the body data file is written in a language in which words are separated by spaces (for example, English), morphological analysis is not necessary.
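The longest-match procedure described above can be sketched as a greedy forward scan over the sentence. The Python example below is a minimal illustration with a tiny made-up dictionary; it only segments the text and does not determine readings or parts of speech, and falling back to a single character when no dictionary word matches is an assumption of this sketch.

```python
def longest_match_segment(sentence, dictionary):
    """Split `sentence` into words by repeatedly taking the longest
    dictionary word that matches at the current position."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(sentence):
        match = sentence[pos]  # fallback: a single character
        # Try the longest candidate first and shorten until one matches.
        for length in range(min(max_len, len(sentence) - pos), 1, -1):
            candidate = sentence[pos:pos + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        pos += len(match)
    return words


# Tiny illustrative dictionary (not the dictionary 106 itself).
dictionary = {"情報", "検索", "情報検索", "装置"}
print(longest_match_segment("情報検索装置", dictionary))
```

With this dictionary the scan prefers “情報検索” over the shorter “情報”, so the sentence is split into “情報検索” and “装置”.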
[0020]
FIG. 8 is a diagram illustrating an example of the page table described above. The page table 206a is for managing information indicating an outline of each document data. One record of the page table 206a includes document identification information, server identification information, a storage source address, last update date/time information, title information, text information, classification information, total word count information, software-specific document identification information, and reference level information.
[0021]
Here, the document identification information is identification information uniquely assigned by the information search apparatus 10 to each document data acquired from the groupware database 20a. The server identification information is information for identifying the groupware server 20 from which the document data was acquired; in this embodiment, as shown in FIG. 8, it is indicated by a number that the information search apparatus 10 uniquely assigns to each server. The storage source address indicates the storage address of the document data in the groupware database 20a, and is specified by a URL as shown in the figure. The last update date/time information indicates the date and time at which the information search apparatus 10 last updated the information on the document data. The title information indicates the title (TITLE) of the document data, and is a character string of a predetermined number of bytes, for example, 256 bytes. The text information indicates a predetermined length (for example, 256 bytes) of text from the beginning of the body of the document data.
[0022]
The classification information indicates the classification of the document data. More specifically, for example, when document data is shared over a network in a call center, the classification information records whether the document data is a product technical support document or a product manual. The total word count information indicates the total number of words in the body of the document data. The software-specific document identification information indicates the unique identification information assigned to the document data by the groupware server 20. The reference level information indicates whether browsing of the document data is limited to the client terminals connected to the network or is also permitted to terminals outside the network. Here, the server identification information and the software-specific document identification information are both included in the page table 206a so that, even when the same groupware is installed on many servers and the servers assign the same identification information to different document data, it remains possible to uniquely identify which document data on which server is meant.
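To make the record layout concrete, one page table record might be modeled as below. The field names, types, and sample values are assumptions chosen for this sketch; the description only lists the kinds of information a record holds. The source_key method illustrates why the server identification information and the software-specific document identification information are stored together.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    document_id: int        # assigned by the information search apparatus
    server_id: int          # number assigned to each groupware server
    source_url: str         # storage source address
    last_updated: str       # last update date/time
    title: str              # title (TITLE)
    text_head: str          # leading part of the body
    classification: str     # e.g. support document or manual
    total_words: int        # total number of words in the body
    software_doc_id: str    # identifier assigned by the groupware server
    reference_level: str    # who may browse the document

    def source_key(self):
        """Pair that identifies the document even across servers."""
        return (self.server_id, self.software_doc_id)


a = PageRecord(1, 1, "http://example.invalid/a", "2002-03-19 10:00",
               "Manual", "This manual ...", "manual", 120, "DOC-001", "internal")
b = PageRecord(2, 2, "http://example.invalid/b", "2002-03-19 11:00",
               "Support note", "When the ...", "support", 80, "DOC-001", "internal")
```

Here both servers assigned the identifier DOC-001, yet the two records remain distinguishable through the pair of server number and software-specific identifier.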
[0023]
Next, FIG. 9 is a diagram illustrating an example of the above-described word table. The word table 206b is for managing the words included in the body of each document data. More specifically, as shown in FIG. 9, one record of the word table 206b includes a word, word identification information uniquely assigned to the word by the information search apparatus 10, and the number of word-using documents, which indicates how many of the document data stored in the groupware database 20a include the word in their body. Here, the number of word-using documents is calculated from the result of the morphological analysis that the indexing unit 104 performs on the body data file 202 of each document data. Specifically, the indexing unit 104 performs morphological analysis on one body data file 202 to decompose the body into words (morphemes), assigns unique identification information to each word, and registers it in the word table 206b. Then, the indexing unit 104 increments by 1 the number of word-using documents corresponding to the registered word identification information. By performing this processing on all the document data stored in the groupware database 20a, the number of word-using documents is obtained for each word.
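The bookkeeping for the number of word-using documents can be sketched as follows. The in-memory dictionary standing in for the word table 206b, the sequential word identifiers, and the function name are assumptions for illustration only.

```python
def register_document_words(word_table, words):
    """Update a word table for the words of one document.

    word_table: dict mapping word -> {"word_id": int, "doc_count": int}
    words: list of words obtained from one body data file.
    """
    # dict.fromkeys keeps first-seen order and drops repeats, so each
    # word is counted once per document regardless of how often it occurs.
    for word in dict.fromkeys(words):
        entry = word_table.setdefault(
            word, {"word_id": len(word_table) + 1, "doc_count": 0}
        )
        entry["doc_count"] += 1
    return word_table


table = {}
register_document_words(table, ["printer", "error", "printer"])
register_document_words(table, ["printer", "cable"])
```

After both documents are processed, “printer” has a count of 2 even though it occurs three times in total, because the count is per document rather than per occurrence.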
[0024]
FIG. 10 is a diagram illustrating an example of the keyword table described above. The keyword table 206c is for managing, for each word included in the body of each document data, how many times the word appears. Specifically, as shown in FIG. 10, one record of the keyword table 206c includes word identification information included in the word table 206b, document identification information included in the page table 206a, the number of appearances, and the importance. The number of appearances indicates how many times the word appears in the body of the document data specified by the document identification information, and is obtained through the morphological analysis performed by the indexing unit 104. More specifically, the indexing unit 104 decomposes the body of the body data file 202 of the document data into words (morphemes), and then calculates the number of appearances by counting how many times the word indicated by the word identification information occurs in the body. The importance is a value that takes into account how frequently the word occurs in the body of all the document data, and is calculated by the indexing unit 104 using the following formula.
(Importance) = S × log (N / n)
Here, S is the number of appearances, N is the number of document data stored in the groupware database 20a, and n is the number of word-using documents described above. As this formula shows, the more document data contain the same word in their body, the lower the importance of the word becomes, and the more frequently the word appears in the body of one document data, the higher its importance becomes. Here, as described above, since the data collection unit 102 appends the weighted word to the end of the body data file 202 of the document data, the importance of the weighted word becomes relatively high. In particular, since the title (TITLE) of document data often includes words that strongly reflect the content of the body, the title may be appended to the body data file 202 as a weighted word.
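The formula can be tried out with a few numbers. In the sketch below the natural logarithm is assumed, since the description does not state the base of the logarithm, and the function name and sample values are hypothetical.

```python
import math


def importance(appearances, total_docs, docs_with_word):
    """Importance = S * log(N / n); the natural logarithm is an assumption."""
    return appearances * math.log(total_docs / docs_with_word)


# A word found in 10 of 1000 documents, appearing 3 times in this one.
rare = importance(3, 1000, 10)
# The same count for a word that every document contains.
common = importance(3, 1000, 1000)
# One extra appearance, as when a weighted word is appended to the body.
boosted = importance(4, 1000, 10)
```

A word contained in every document scores zero, and the extra appearance contributed by an appended weighted word raises the score, which matches the behavior described above.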
[0025]
In FIG. 2, the search request acquisition response unit 108 receives a search request from the client terminal 30 via the network 2 and outputs it to the search unit 110. This search request acquisition response unit 108 corresponds to a network interface device in a computer. In addition, the search unit 110 searches the index file 206 stored in the search database 10a in response to the search request from the search request acquisition response unit 108, and outputs the search result to the search request acquisition response unit 108. When receiving the search result from the search unit 110, the search request acquisition response unit 108 transmits the search result to the client terminal 30 via the network 2.
[0026]
Next, the operation of the information search apparatus 10 according to this embodiment will be described.
Here, a program defining each processing procedure described below is stored in a recording medium such as a ROM or a magnetic disk provided in the information search apparatus 10. The program may be recorded on a portable recording medium such as an optical disc, a magneto-optical disc, or a magnetic disk and installed in the information search apparatus 10 from that medium, or it may be transmitted via the network 2 and installed in the information search apparatus 10.
[0027]
Now, the information search apparatus 10 executes a registration process that registers, in the index file 206, information on each document data stored in the groupware database 20a. Specifically, as illustrated in FIG. 11, first, the setting file analysis unit 100 reads the setting file 200, identifies the acquisition items, weighted words, storage source address, and storage destination address specified in the setting file 200, and outputs the identified information to the data collection unit 102 as search information (step Sa1).
[0028]
Next, the data collection unit 102 acquires, from the groupware server 20 via the network 2, the acquisition items indicated by the search information from the setting file analysis unit 100, generates the body data file 202 (see FIG. 5) and the information data file 204 (see FIG. 6), and outputs them to the indexing unit 104 (step Sa2).
[0029]
Then, the indexing unit 104 performs morphological analysis on the body data file 202 received from the data collection unit 102, executes indexing, and registers the result in the index file 206, which consists of three data tables (step Sa3). As a result, information on one piece of document data is registered in the index file 206. Next, the data collection unit 102 determines whether unprocessed document data remains in the groupware database 20a (step Sa4). If the determination result is YES, the processing returns to step Sa2 so that information on the remaining document data is registered in the index file 206. On the other hand, if the determination result in step Sa4 is NO, the data collection unit 102 ends the process. As a result, information on all the document data stored in the groupware database 20a is registered in the index file 206.
[0030]
Incidentally, the document data stored in the groupware database 20a is frequently edited: document data is added or deleted, or individual document data is modified. Therefore, the information search apparatus 10 performs the following index file correction processing at regular intervals so that the information registered in the index file 206 stays consistent with the document data in the groupware database 20a.
[0031]
That is, as shown in FIG. 12, first, the data collection unit 102 acquires, from the groupware server 20 via the network 2, the acquisition items indicated by the search information from the setting file analysis unit 100, generates the body data file 202 and the information data file 204, and outputs each of them to the indexing unit 104 (step Sb1). The indexing unit 104 then determines, from the body data file 202, the information data file 204, and the information registered in the index file 206, whether the document data (1) has been added, (2) has been corrected, or (3) has not been edited (step Sb2).
[0032]
More specifically, if information corresponding to the server identification information and the software-specific document identification information included in the information data file 204 is not registered in the page table 206a of the index file 206, the indexing unit 104 determines that the document data has been added. On the other hand, if information corresponding to the server identification information and the software-specific document identification information included in the information data file 204 is already registered in the page table 206a of the index file 206 but the last update date/time information differs between the information data file 204 and the index file 206, the indexing unit 104 determines that the document data has been corrected. Furthermore, if the server identification information, the software-specific document identification information, and the last update date/time information are all the same in the information data file 204 and the index file 206, the indexing unit 104 determines that the document data has not been edited.
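The three-way determination of step Sb2 can be sketched as a small function. The dictionary standing in for the page table 206a, the string timestamps, and the return labels are assumptions made for this example.

```python
def classify_change(page_table, server_id, software_doc_id, last_updated):
    """Return 'added', 'corrected' or 'unchanged' for one document.

    page_table: dict mapping (server_id, software_doc_id) to the
    last update date/time registered in the index (hypothetical layout).
    """
    key = (server_id, software_doc_id)
    if key not in page_table:
        return "added"        # (1) no matching record is registered
    if page_table[key] != last_updated:
        return "corrected"    # (2) registered, but the timestamp differs
    return "unchanged"        # (3) all three values match


registered = {(1, "DOC-001"): "2002-03-19 10:00"}
status_new = classify_change(registered, 1, "DOC-002", "2002-03-20 09:00")
status_mod = classify_change(registered, 1, "DOC-001", "2002-03-21 15:30")
status_same = classify_change(registered, 1, "DOC-001", "2002-03-19 10:00")
```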
[0033]
If it is determined in step Sb2 that (1) the document data has been added, the indexing unit 104 executes the same processing as step Sa3 of the registration process described above and registers the information on this document data in the index file 206 (step Sb3). Next, the data collection unit 102 determines whether unprocessed document data remains in the groupware database 20a (step Sb4). If the determination result is YES, the processing returns to step Sb1 in order to process the remaining document data. As a result, information on the document data added to the groupware database 20a is newly registered in the index file 206.
[0034]
On the other hand, if it is determined in step Sb2 that (2) the document data has been corrected, the indexing unit 104 first deletes the information in the index file 206 corresponding to the document data, and then newly generates information corresponding to the document data and registers it in the index file 206. More specifically, the indexing unit 104 first identifies the document identification information (see FIG. 8) corresponding to the document data (step Sb5), and deletes the information relating to the identified document identification information from each of the page table 206a, the word table 206b, and the keyword table 206c included in the index file 206 (step Sb6). Next, the indexing unit 104 generates information corresponding to the document data by the indexing process described above and registers it in the index file 206 (step Sb7). Next, the data collection unit 102 determines whether unprocessed document data remains in the groupware database 20a (step Sb4). If the determination result is YES, the processing returns to step Sb1 in order to process the remaining document data. As a result, the corrections made to the document data are reflected in the index file 206. If it is determined in step Sb2 that (3) the document data has not been edited, the indexing unit 104 advances the processing to step Sb4.
[0035]
Next, if the determination result in step Sb4 is NO, the processing has been executed for all the document data in the groupware database 20a. Therefore, document data corresponding to document identification information in the index file 206 (page table 206a) that was never referred to during the series of processes described above no longer exists in the groupware database 20a. Accordingly, the indexing unit 104 extracts all such unreferenced document identification information from the page table 206a of the index file 206 (step Sb8), deletes the information corresponding to the extracted document identification information from all the tables included in the index file 206 (step Sb9), and ends the process. As a result, information corresponding to document data deleted from the groupware database 20a is deleted from the index file 206. In addition, when document data is deleted, it is only necessary to delete the information corresponding to its document identification information from the index file 206, so the time required to correct the index file 206 is shortened.
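Steps Sb8 and Sb9 amount to removing every document identifier that was not touched during the pass. The sketch below assumes an in-memory index with a page table and a keyword table only; the layout and names are hypothetical, and the corresponding adjustment of the word table is left out for brevity.

```python
def sweep_deleted(index, seen_ids):
    """Remove every document id that was not referenced during this pass.

    index: dict with 'page' (document id -> record) and 'keyword'
    (list of (word_id, document_id, appearances) rows).
    Returns the set of removed document ids.
    """
    stale = set(index["page"]) - set(seen_ids)
    for doc_id in stale:
        del index["page"][doc_id]
    # Keep only keyword rows whose document still exists.
    index["keyword"] = [row for row in index["keyword"] if row[1] not in stale]
    return stale


index = {
    "page": {1: "doc A", 2: "doc B", 3: "doc C"},
    "keyword": [(10, 1, 2), (10, 2, 1), (11, 3, 4)],
}
removed = sweep_deleted(index, seen_ids=[1, 3])
```

Only the index entries are touched; the deleted document itself never has to be fetched or analyzed, which is consistent with the short correction time noted above.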
[0036]
As described above, information on the document data stored in the groupware database 20a is registered in the index file 206, and even if editing such as addition, deletion, or correction is performed on the document data, repeating the index file correction processing described above at regular intervals ensures that the changes made by the editing are promptly reflected in the index file 206.
[0037]
Upon receiving a search request from the client terminal 30 via the network 2, the search request acquisition response unit 108 of the information search apparatus 10 outputs this search request to the search unit 110. The search unit 110 searches the index file 206 in accordance with the received search request, and extracts the information on the corresponding document data. More specifically, the search request includes, as a search term, a word or a data item specified by the setting file 200. For example, when the search request includes a word as a search term, the search unit 110 refers to the keyword table 206c and extracts document identification information in descending order of the importance of the word (specifically, of its word identification information). Then, the search unit 110 extracts from the page table 206a the title information, text information, storage source address (URL), and the like corresponding to a predetermined number (for example, 20) of pieces of document identification information with the highest importance, and transmits them to the client terminal 30 via the search request acquisition response unit 108. As a result, candidates for the document data corresponding to the search term are transmitted to the client terminal 30. If the search request includes a data item, for example, the last edit date and time, as a search term, the search unit 110 searches each record in the page table 206a and transmits the title information, text information, and storage source address (URL) corresponding to the matching document identification information to the client terminal 30 via the search request acquisition response unit 108. Of course, the search request may include both a word and a data item as search terms.
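A word search over the three tables might look like the following sketch. The table layouts, the field names, and the sample records are assumptions for illustration; only the ranking by importance and the cutoff of 20 results come from the description.

```python
def search_by_word(word, word_table, keyword_table, page_table, limit=20):
    """Return page records for a word, highest importance first."""
    word_id = word_table.get(word)
    if word_id is None:
        return []  # the word does not occur in any indexed document
    hits = [row for row in keyword_table if row["word_id"] == word_id]
    hits.sort(key=lambda row: row["importance"], reverse=True)
    return [page_table[row["document_id"]] for row in hits[:limit]]


word_table = {"printer": 1}
keyword_table = [
    {"word_id": 1, "document_id": 100, "importance": 2.5},
    {"word_id": 1, "document_id": 101, "importance": 7.0},
    {"word_id": 2, "document_id": 102, "importance": 9.0},
]
page_table = {
    100: {"title": "Printer manual"},
    101: {"title": "Printer troubleshooting"},
    102: {"title": "Scanner manual"},
}
results = search_by_word("printer", word_table, keyword_table, page_table)
```

Only the index tables are consulted; the document bodies held by the groupware server are not read at search time.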
[0038]
As described above, according to the present embodiment, only the information that can serve as a search condition is registered in advance in the index file 206 for each document data stored in the groupware database 20a. When the information search apparatus 10 receives a search request, it only needs to search the index file 206. Since the data amount of the index file 206 is smaller than that of the document data stored in the groupware database 20a, the search can be performed faster than searching each document data in the groupware database 20a. Furthermore, by changing the acquisition items specified in the setting file 200, a user or the like can change the data items registered in the index file 206, so the index file 206 can be configured according to the purpose of the search.
Further, the information search apparatus 10 described in the present embodiment can be used in common across a plurality of groupware products. More specifically, an index file 206 can be constructed for each groupware product merely by changing the acquisition items described in the setting file 200. Moreover, with this configuration, even when the setting file 200 is changed to construct an index file 206 for a different groupware product, the information search apparatus 10 simply operates in accordance with the changed setting file 200, and there is no need to recompile the information search program.
[0039]
<Modification>
The above-described embodiment is merely an example, shows one aspect of the present invention, and can be arbitrarily modified within the scope of the present invention. Accordingly, various modifications will be described below.
[0040]
For example, in the above-described embodiment, the configuration in which only one groupware server 20 is connected to the network 2 is illustrated. However, the configuration is not limited thereto, and a plurality of groupware servers 20 may be connected. Furthermore, different groupware may be installed on each groupware server 20. In general, searching the databases of a plurality of different groupware servers together is difficult because the data management format (for example, the number and names of data items) differs from one groupware product to another. In contrast, in this modification, only the information on data items that can be searched is registered in the page table 206a of the index file 206. Therefore, searching the page table 206a with the information search apparatus 10 is equivalent to searching each database of the plurality of groupware servers, and a search across the databases of the plurality of groupware servers is thereby realized.
[0041]
Further, for example, when performing morphological analysis on the body data file 202, the indexing unit 104 may treat words that denote the same thing, such as “PC” and “personal computer”, as one word. As a result, for example, even if only “PC” is included in the search request as a search term, document data containing “personal computer” is also extracted as corresponding document data, and the search accuracy is improved.
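One simple way to treat such variants as a single word is to map each of them to a canonical form before indexing and before searching. The mapping table, the case folding, and the function name below are assumptions for this sketch; the description does not specify how the equivalence is implemented.

```python
# Hypothetical synonym table: variant (lower case) -> canonical word.
SYNONYMS = {
    "pc": "PC",
    "personal computer": "PC",
}


def normalize(word):
    """Map a word to its canonical form; unknown words pass through."""
    return SYNONYMS.get(word.lower(), word)


indexed = [normalize(w) for w in ["personal computer", "printer", "PC"]]
query = normalize("pc")
```

Applying the same normalization to both the indexed words and the search term lets a query for one variant match documents that use another.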
[0042]
[Effect of the Invention]
According to the present invention, there are provided an information search device, an information search method, a program, and a recording medium that can reduce the time required to identify, among the information stored in a database, the information that satisfies a search condition.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an information search system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a functional configuration of the information search apparatus.
FIG. 3 is a diagram showing an example of the setting file.
FIG. 4 is a diagram showing an example of the groupware file.
FIG. 5 is a diagram showing an example of the body data file.
FIG. 6 is a diagram showing an example of the information data file.
FIG. 7 is a conceptual diagram showing a data structure of the index file.
FIG. 8 is a diagram showing an example of the page table.
FIG. 9 is a diagram showing an example of the word table.
FIG. 10 is a diagram showing an example of the keyword table.
FIG. 11 is a flowchart illustrating a procedure of registration processing executed by the information search apparatus.
FIG. 12 is a flowchart showing a procedure of index file correction processing executed by the information search device.
[Explanation of symbols]
10: information search apparatus, 10a: search database, 20: groupware server, 20a: groupware database, 30: client terminal, 100: setting file analysis unit, 102: data collection unit, 104: indexing unit, 106: morphological analysis dictionary, 108: search request acquisition response unit, 110: search unit, 200: setting file, 206: index file.

Claims

An information search device for searching a database that associates text data including at least a text sentence with identification information of the text data, and that associates a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, the information search device comprising:
first storage means for storing item designation information for designating, among the items, an item including a weighted word that can be a search target;
related information acquisition means for acquiring, from the database, related information classified into the item designated by the item designation information;
weighted word adding means for extracting a word specified by the weighted word from the text data and adding it to the text data;
text extraction means for extracting a text sentence from the text data;
morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing it;
appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence;
second storage means for storing, in association with one another, the related information acquired by the related information acquisition means, the words and the number of appearances of each word, and the identification information corresponding to the related information;
search condition acquisition means for acquiring a search condition conforming to the item designated by the item designation information; and
search means for identifying, from the related information stored in the second storage means, related information that satisfies the search condition, and identifying the identification information corresponding to that related information.

The information search device according to claim 1, wherein the weighted word is a search word that serves as a search condition acquired by the search condition acquisition unit.

An information search method in an information search device that has a CPU and a storage device and that searches a database associating text data including at least a text sentence with identification information of the text data, and associating a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, the method comprising:
a first step of storing, in the storage device, item designation information for designating, among the items, an item including a weighted word that can be a search target;
a second step in which the CPU acquires, from the database, related information classified into the item designated by the item designation information;
a third step in which the CPU extracts a word specified by the weighted word from the text data and adds the extracted word to the text data;
a fourth step in which the CPU extracts a text sentence from the text data;
a fifth step in which the CPU divides the extracted text sentence into a plurality of words and analyzes it;
a sixth step in which the CPU counts the number of times each of the plurality of words appears in the text sentence;
a seventh step in which the CPU stores in the storage device, in association with one another, the related information acquired by the related information acquisition means, the words and the number of appearances of each word, and the identification information corresponding to the related information;
an eighth step in which the CPU acquires a search condition conforming to the item designated by the item designation information; and
a ninth step in which the CPU identifies, from the related information stored in the storage device, related information that satisfies the search condition, and identifies the identification information corresponding to that related information.

A program for causing a computer that searches a database associating text data including at least a text sentence with identification information of the text data, and associating a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, to function as:
first storage means for storing item designation information for designating, among the items, an item including a weighted word that can be a search target;
related information acquisition means for acquiring, from the database, related information classified into the item designated by the item designation information;
weighted word adding means for extracting a word specified by the weighted word from the text data and adding it to the text data;
text extraction means for extracting a text sentence from the text data;
morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing it;
appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence;
second storage means for storing, in association with one another, the related information acquired by the related information acquisition means, the words and the number of appearances of each word, and the identification information corresponding to the related information;
search condition acquisition means for acquiring a search condition conforming to the item designated by the item designation information; and
search means for identifying, from the related information stored in the second storage means, related information that satisfies the search condition, and identifying the identification information corresponding to that related information.

A computer-readable recording medium having recorded thereon a program for causing a computer that searches a database associating text data including at least a text sentence with identification information of the text data, and associating a plurality of pieces of related information related to the text sentence, items for classifying the plurality of pieces of related information, and the identification information of the text data corresponding to the text sentence, to function as:
first storage means for storing item designation information for designating, among the items, an item including a weighted word that can be a search target;
related information acquisition means for acquiring, from the database, related information classified into the item designated by the item designation information;
weighted word adding means for extracting a word specified by the weighted word from the text data and adding it to the text data;
text extraction means for extracting a text sentence from the text data;
morphological analysis means for dividing the extracted text sentence into a plurality of words and analyzing it;
appearance frequency counting means for counting the number of times each of the plurality of words appears in the text sentence;
second storage means for storing, in association with one another, the related information acquired by the related information acquisition means, the words and the number of appearances of each word, and the identification information corresponding to the related information;
search condition acquisition means for acquiring a search condition conforming to the item designated by the item designation information; and search means for identifying, from the related information stored in the second storage means, related information that satisfies the search condition, and identifying the identification information corresponding to that related information.