JP2004246824A

JP2004246824A - Voice document search method and apparatus, and voice document search program

Info

Publication number: JP2004246824A
Application number: JP2003038781A
Authority: JP
Inventors: Yoshihiko Hayashi; 林　　良彦; Shoichi Matsunaga; 昭一松永; Yoshihiro Matsuo; 義博松尾; Katsutoshi Ofu; 克年大附
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-17
Filing date: 2003-02-17
Publication date: 2004-09-02

Abstract

【課題】音声認識の誤りの検索精度に対する影響を低減し、かつ、当該音声ドキュメントにおける話題に関連したキーによる検索をも可能にする音声ドキュメントに対する記述を生成する。
【解決手段】本発明は、入力された音声ドキュメントに対して音声認識処理を施し、文字化された音声認識結果を認識信頼度と共に取得し、拡張キー単語抽出条件と音声認識結果を照合して拡張キー単語を抽出し、抽出された拡張キー単語と拡張単語抽出条件を照合し、その結果得られた条件に基づいて外部データベースを検索し、拡張単語を抽出し、拡張単語と音声認識結果に拡張単語を埋め込むことにより、音声ドキュメントを生成する。
【選択図】図１
[Problem] To generate a description of a voice document that reduces the effect of speech recognition errors on search accuracy and also enables search using keys related to the topics of the voice document.
[Solution] The present invention applies speech recognition processing to an input voice document and obtains a transcribed speech recognition result together with its recognition reliability; matches the speech recognition result against extended key word extraction conditions to extract extended key words; matches the extracted extended key words against extended word extraction conditions, searches an external database based on the resulting conditions, and extracts extended words; and generates the voice document description by embedding the extended words in the speech recognition result.
[Selection diagram] Fig. 1

Description

【０００１】
【発明の属する技術分野】
本発明は、音声ドキュメント検索方法及び装置及び音声ドキュメント検索プログラムに係り、特に、録画・録音された音声コンテンツに対して音声認識を適用することにより文字化を行い、その内容を言語によるキーにより検索する音声ドキュメント検索方法及び装置及び音声ドキュメント検索プログラムに関する。
【０００２】
【従来の技術】
音声コンテンツに対して音声認識を適用することにより文字化を行い、その内容を言語によるキーによる検索を可能とするための検索システムの研究開発が行われている（例えば、非特許文献１参照）。音声認識の適用においては、認識誤りが発生することが避けられないため、認識誤りの影響を抑えるための工夫が必要となる。自動的にこれを行う方法として、認識対象の音声コンテンツとは異なる外部データベースを参照することにより、音声コンテンツを拡張する方法が提案されている（例えば、非特許文献２参照）。
【０００３】
一方、音声認識の研究開発においては、音声認識システム自身がその出力である音声認識結果に信頼度を付与する試みがなされている（例えば、非特許文献３参照）。
【０００４】
【非特許文献１】
「マルチメディア情報の解析と統合」有木康雄、人工知能学会情報統合研究会，ＳＩＧ−ＣＩＩ−２０００−Ｎｏｖ，２０００．
【０００５】
【非特許文献２】
”ＤｏｃｕｍｅｎｔＥｘｐａｎｓｉｏｎｆｏｒＳｐｅｅｃｈＲｅｔｒｉｅｖａｌ”，ＡｍｉｔＳｉｎｇｈａｌ，ＦｅｒｎａｎｄｏＰｅｒｅｉｒａ，ＰｒｏｃｅｅｄｉｎｇｓｏｆＡＣＭＳＩＧＩＲ，ｐａｇｅｓ３４−４１，Ｂｅｒｋｅｌｅｙ，ＣＡ，ＵＳＡ，Ａｕｇｕｓｔ１９９９．
【０００６】
【非特許文献３】
「音声認識精度向上のための信頼度尺度の比較」緒方淳、有木康雄、電子情報通信学会技術研究報告（音声研究会ＳＰ２０００−９４）、ｐｐ．１１３−１１８（２０００年１２月）
【０００７】
【発明が解決しようとする課題】
しかしながら、上記従来の外部データベースを参照して、音声コンテンツを拡張する方法においては、音声認識結果の全体そのものを質問要求と見做して、類似文書検索の手法による関連する外部データベースを検索し、上位にランクされた音声ドキュメントから拡張単語を抽出している。即ち、この手法においては、音声認識結果に含まれる単語を同一のものとして扱っており、認識の精度についての考慮は行われていない。
【０００８】
従って、場合によっては、誤った認識である可能性の高い部分をキーとした外部データベースの検索が行われる可能性があり、結果として抽出される拡張単語には、検索対象とした音声コンテンツと関連性の少ないものが含まれる可能性が高くなる。
【０００９】
本発明は、上記の点に鑑みなされたもので、音声認識の誤りの検索精度に対する影響を低減し、かつ、当該音声コンテンツにおける話題に関連したキーによる検索をも可能にする音声コンテンツに対するコンテンツの内容記述ドキュメントを生成する音声ドキュメント検索方法及び装置及び音声ドキュメント検索プログラムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００１１】
本発明は、録画・録音された音声トラックを含む音声コンテンツを言語によるキーにより検索する音声ドキュメント検索方法において、
検索対象となる前記音声コンテンツを記憶媒体にから読み出して音声ドキュメント検索装置に入力し（ステップ１）、
音声ドキュメント検索装置において、
入力された音声コンテンツに対して音声認識処理を施し、文字化された音声認識結果を認識信頼度と共に取得し（ステップ２）、
予め記憶媒体に記憶されている拡張キー単語抽出条件と音声認識結果とを照合することにより、音声認識結果を拡張するための検索におけるキーとなる拡張キー単語を抽出し（ステップ３）、
抽出された拡張キー単語と予め記憶手段に記憶されている拡張単語抽出条件とを照合し、その結果得られた条件に基づいて、関連文書を蓄積する外部データベースを検索することにより得られた関連文書集合から音声認識結果を拡張するための拡張単語を取得し（ステップ４）、
拡張単語と音声認識結果に拡張単語を埋め込むことにより、検索対象の音声コンテンツの音声内容記述ドキュメントファイルを生成し（ステップ５）、
音声内容記述ドキュメントファイルを出力する（ステップ６）。
【００１２】
図２は、本発明の原理構成図である。
【００１３】
本発明は、録画・録音された音声トラックを含む音声コンテンツを言語によるキーにより検索する音声ドキュメント検索装置１００であって、
音声ドキュメント検索装置１００の外部に設けられ、関連文書を格納する外部データベース１０００を検索する関連情報検索手段６００と、
検索対象となる音声コンテンツを入力する入力手段２００と、
入力手段２００で入力された音声コンテンツに対して音声認識処理を適用することにより文字化された音声認識結果を認識信頼度と共に得る音声認識手段３００と、
予め定められた拡張キー単語抽出条件に従って、音声認識結果を拡張するための検索におけるキーとなる拡張キー単語を抽出する拡張キー単語抽出手段４００と、
抽出された拡張キー単語を用いて予め定められた拡張単語抽出条件に従って、関連情報検索手段６００により外部データベース１０００を検索して得られた関連文書集合から、音声認識結果を拡張する単語である拡張単語を取得する拡張単語抽出手段５００と、
音声認識結果に拡張単語を埋め込み、検索対象となる音声コンテンツの音声内容を記述することにより、音声内容記述ドキュメントファイルを生成する音声ドキュメント記述生成手段７００と、
生成された音声内容記述ドキュメントファイルを出力する出力手段８００と、を有する。
【００１４】
本発明は、録画・録音された音声トラックを含む音声コンテンツを言語によるキーにより検索する音声ドキュメント検索プログラムであって、
検索対象となる音声コンテンツを入力する入力ステップと、
入力ステップで入力された音声コンテンツに対して音声認識処理を適用することにより文字化された音声認識結果を認識信頼度と共に得る音声認識ステップと、
予め定められた拡張キー単語抽出条件に従って、音声認識結果を拡張するための検索におけるキーとなる拡張キー単語を抽出する拡張キー単語抽出ステップと、
抽出された拡張キー単語を用いて予め定められた拡張単語抽出条件を、音声ドキュメント検索装置の外部に設けられ、関連文書を蓄積する外部データベースを検索する関連情報検索手段に渡すことにより得られた関連文書集合から、音声認識結果を拡張する単語である拡張単語を取得する拡張単語抽出ステップと、
音声認識結果に拡張単語を埋め込み、検索対象の音声コンテンツの内容を記述することにより音声内容記述ドキュメントファイルを生成する音声ドキュメント記述生成ステップと、
生成された音声内容記述ドキュメントファイルを出力する出力ステップと、をコンピュータの制御手段に実行させる。
【００１５】
上記のように、本発明では、音声認識装置が自らの認識結果に対して出力する信頼度を利用して、精度よく認識されたと判定される部分から外部データベース検索のキーとなる語を抽出することを可能とし、質の良い拡張単語を外部データから得ることが可能となる。
【００１６】
【発明の実施の形態】
以下、図面と共に本発明の実施の形態を説明する。
【００１７】
図３は、本発明の一実施の形態における音声ドキュメント検索装置の構成を示す。
【００１８】
同図に示す音声ドキュメント検索システム１００は、入力部２００、音声認識部３００、拡張キー単語抽出部４００、拡張キー単語抽出条件テーブル４１０、拡張単語抽出部５００、拡張単語抽出条件テーブル５１０、関連情報検索部６００、音声ドキュメント記述生成部７００、出力部８００から構成される。
【００１９】
なお、当該音声ドキュメント検索装置１００の外部に外部データベース１０００が設けられているものとする。
【００２０】
また、上記の拡張キー単語抽出条件テーブル４１０、拡張単語抽出条件テーブル５１０は、ハードディスク等の記憶手段に格納されているものとする。
【００２１】
入力部２００は、検索対象となる録画・録音された音声トラックを含む音声コンテンツを入力する。入力される音声コンテンツは、ディジタル信号で表現され、ハードディスク等の記憶媒体に格納されているものとし、入力部２００では、ハードディスク等から当該音声コンテンツを読み出して入力するものとする。
【００２２】
音声認識部３００は、入力部２００でから入力された音声コンテンツに対して、既存の音声認識装置による音声認識処理を適用して文字化（例えば、ＸＭＬ言語）された音声認識結果に音声信頼度を付与して出力する。なお、音声信頼度に関しては、既存の方法を用いるものとする。例えば、「音声認識精度向上のための信頼度尺度の比較」緒方淳、有木康雄、電子情報通信学会技術研究報告（音声研究会ＳＰ２０００−９４）、ｐｐ．１１３−１１８（２０００年１２月）を参照されたい。
【００２３】
拡張キー単語抽出条件テーブル４１０は、拡張キー単語抽出部４００によって参照されるテーブルであり、品詞と認識信頼度に関する条件が設定される。
【００２４】
拡張キー単語抽出部４００は、拡張キー単語抽出条件テーブル４１０に格納されている予め定められた拡張キー単語抽出条件に従って、音声認識部３００から出力された音声認識結果を拡張するための検索におけるキーとなる単語を抽出する。
【００２５】
拡張単語抽出条件テーブル５１０には、拡張単語抽出部５００によって参照されるテーブルであり、拡張単語抽出条件として外部データベース識別子、日付制約条件、最大抽出対象文書数、最大抽出単語数が設定される。
【００２６】
拡張単語抽出部５００は、拡張キー単語抽出部４００において抽出された拡張キー単語を用いて、拡張単語抽出条件テーブル５１０に格納されている予め定められた拡張単語抽出条件に従って関連情報検索部６００に対して、外部データベース１０００を検索することを指示し、関連情報検索部６００から取得した音声認識結果を拡張する単語である拡張単語を取得する。このとき、拡張単語抽出部５００は、関連情報検索部６００に対して、単語集合と外部データベース１０００の識別子、文書日付の制約、最大文書数等を含む指示を与えるものとする。
【００２７】
関連情報検索部６００は、拡張単語抽出部５００からの指示により外部データベース１０００を検索する検索エンジンである。
【００２８】
音声ドキュメント記述生成部７００は、音声認識部３００から取得した音声認識結果と、拡張単語抽出部５００から取得した拡張単語から、検索対象の音声コンテンツの内容を記述する音声ドキュメント記述ファイル９００を生成する。
【００２９】
出力部８００は、音声ドキュメント記述生成部７００において生成された音声ドキュメント記述ファイル９００を読み込んで、当該音声ドキュメント検索装置１００に後続するシステムに出力する。
【００３０】
なお、上記の音声ドキュメント記述生成部７００において、音声ドキュメント記述ファイル９００を生成する代わりに、生成された音声ドキュメントを出力部８００に出力し、出力部８００は、この音声ドキュメントを表示手段に表示するようにしてもよい。
【００３１】
次に、上記の構成における動作を説明する。なお、本発明では、録画・録音されたディジタル形式の音声コンテンツを検索し、検索されたコンテンツ記述などをＸＭＬ、ＨＴＭＬなどのスクリプト言語形式で取得して、コンテンツに含まれるテキスト情報を表示可能にするものとする。
【００３２】
図４は、本発明の一実施の形態における音声ドキュメント検索処理のフローチャートである。
【００３３】
ステップ１０１）入力部２００において、検索対象の音声コンテンツデータを入力する。
【００３４】
ステップ１０２）音声認識部３００において、入力された音声コンテンツデータを音声認識し、音声認識結果を信頼度とからなる音声ドキュメントを拡張キー単語抽出部４００に出力する。
【００３５】
ステップ１０３）拡張キー単語抽出部４００が、取得した音声コンテンツの音声認識結果と拡張キー単語抽出条件テーブル４１０の拡張キー単語抽出条件とを照合し、当該条件と合致する単語集合を拡張キー単語として抽出し、拡張単語抽出部５００に出力する。
【００３６】
ステップ１０４）拡張単語抽出部５００において、拡張キー単語抽出部４００から取得した単語集合に基づいて、拡張単語抽出条件テーブル５１０を検索し、当該拡張単語抽出条件テーブル５１０に指定されている外部データベース１００の識別子、文書の日付に対する制約、最大文書数を取得し、これらを関連情報検索部６００に渡すことで、外部データベース１０００の検索を指示する。
【００３７】
ステップ１０５）関連情報検索部６００は拡張単語抽出部５００から渡された指示の外部データベース１００の識別子、文書の日付に対する制約、最大文書数に基づいて、外部データベース１０００を検索し、条件に適合する関連文書集合を取得し、拡張単語抽出部５００に返却する。
【００３８】
ステップ１０６）拡張単語抽出部５００は取得した関連文書集合から指定された最大抽出単語数の拡張単語を抽出し、音声ドキュメント記述生成部７００に出力する。
【００３９】
ステップ１０７）音声ドキュメント記述生成部７００は、音声認識部３００から得られた音声認識結果に、拡張単語抽出部５００で抽出された拡張単語を埋め込んで、音声ドキュンメント内容記述データを生成し、音声ドキュメント記述ファイル９００に書き込む。または、ファイルを生成せずに生成した音声ドキュメント記述データを出力部８００に出力するようにしてもよい。
【００４０】
ステップ１０８）出力部８００では、音声ドキュメント記述ファイル９００を読み込んで、当該システムに後続するシステムに対して出力する。または、音声ドキュメント記述生成部７００から取得した音声ドキュメントをディスプレイ装置等の表示手段に出力してもよい。
【００４１】
【実施例】
以下、図面と共に本発明の実施例を説明する。
【００４２】
以下、具体例を用いて、本発明の音声ドキュメント検索システムの動作を説明する。
【００４３】
なお、以下の例では、外部データベース１０００は、検索サイト等に設けられているデータベースであるものとする。
【００４４】
図５は、本発明の一実施例における音声認識部により文字化された音声ドキュメントの一部を示す。同図に示す音声ドキュメントは、入力部２００から入力され、音声認識部３００により音声認識され、文字化されたものである。
【００４５】
ここで実際の発声は、
『昨夜からの寒波の訪れで、北海道は大雪となり、新千歳空港発の便など交通機関が大幅に乱れました』
であったとするが、音声認識の誤りのために、
『咲く世からの寒波の訪れで北海道は大雪となら新地都政空港発の便など交通機関が大ハブに乱れまして』
のように文字化されたものである。
【００４６】
図５の音声認識部３００の出力は、ＸＭＬ言語によって構造化されている。即ち、音声ドキュメントｄｏｃは、発声単位であるｐｈｒａｓｅの集合として表現される。各発話単位は、そこに含まれる単語ｗｏｒｄの集合として表現される。各発話単位、及び、そこに含まれる各単語に対しては、その開始時刻と終了時刻がそれぞれｂｅｇｉｎ，ｅｎｄという属性を用いて記録される。さらに、各単語に対しては、音声認識により文字化された単語表記が、ＸＭＬ要素の内容部分に記録されるだけでなく、概単語の品詞情報と音声認識の信頼度がそれぞれｐｏｓ，ｃｏｎｆという属性を用いて記録される。なお、図５に例示した音声認識結果は、本発明の説明に必要な概念を例示するためのものであり、ＸＭＬのタグ構造を含めて、このデータ形式に限る必要はない。また、音声認識部３００としては、このような情報を出力可能な任意の音声認識装置を適用することが可能である。
【００４７】
図６は、本発明の一実施例の拡張キー単語抽出条件テーブルのエントリ例を示す。同図に示す拡張キー単語抽出条件テーブル４１０は、予め設定する拡張キー単語抽出条件を格納する。
【００４８】
同図に示す例には、品詞と認識信頼度に関する三通りの条件が設定されている。同図の例のように、音声認識の信頼度を考慮することにより、正しく認識されている可能性の高い単語を抽出する。また、名詞や動詞などの品詞を有する単語を抽出することにより、拡張単語抽出部５００、関連情報検索部６００によって、関連する文書を外部データベース７００から検索する際に、キーワードとなり得る単語を抽出する。なお、これらの条件は、音声認識部３００に適用する音声認識装置に応じて経験的に設定する。
【００４９】
拡張キー単語抽出部４００は、図５に示すような音声認識結果を図６に示すような拡張キー単語抽出条件と照合し、拡張キー単語を抽出する。図５、図６の例に対しては、以下に示す５つの拡張キー単語（カッコ内は品詞と認識信頼度）が抽出される。
【００５０】
『寒波（名詞，２５０），訪れ（名詞，２５０），北海道（固有名詞，１５０），交通機関（名詞，２５０），乱れ（動詞，３００）』
上記のように抽出された拡張キー単語は、拡張単語抽出部５００へと転送される。
【００５１】
図７は、本発明の一実施例の拡張単語抽出テーブルのエントリ例を示す。同図に示す拡張単語抽出テーブル５１０は、予め設定する拡張単語抽出条件を格納する。同図に示すように、拡張単語抽出条件は、４つのエントリからなる。第１のエントリは、関連情報検索部６００が検索対象とすべき外部データベース７００の識別子である。図７の例では、インターネット上に存在するニュース検索サイト（ｆｏｏ−ｎｅｗｓ．ｃｏｍ）が指定されている。第２のエントリは、検索対象とする文書の日付に対する制約である。同図の例では、「２００３年１月２日から２００３年１月４日」の日付を有する文書のみを拡張単語抽出の対象とすることが指定されている。第３のエントリは、拡張単語を抽出する対象となる文書の最大数を指定する。通常のインターネットのサイト検索やデータベース検索においては、検索要求に対する適合度順に複数の文書が返却されるため、この上位から指定された数の文書を拡張単語抽出の対象とする。同図の例では、上位２件の文書のみを拡張単語の対象とすることが指定されている。第４のエントリは、実際に抽出する拡張単語の最大数を指定する。同図の例では、最大５つの拡張単語を抽出することが指定されている。
【００５２】
拡張単語抽出部５００から関連情報検索部６００へは、拡張キー単語抽出部４００から転送されてきた単語集合と、拡張単語抽出条件テーブル５１０に指定されている外部データベース７００の識別子、文書の日付に対する制約、及び最大文書数が転送される。
【００５３】
関連情報検索部６００は、転送されてきた情報に基づいて外部データベース７００から関連文書の検索を行う。ここで、転送されてきた単語集合が検索要求のキーワードとして用いられる。ここでは、
「寒波、訪れ、北海道、交通機関、乱れ」
という単語集合による検索要求によってニュース検索サイト（ｆｏｏ−ｎｅｗｓ．ｃｏｍ）を検索したところ、指定された条件に適合する文書の中で上位の適合度を持つ２件の関連文書として以下のような内容を持つ文書が検索されたものとする。
【００５４】
・適合度第１位の文書：
『日本海側の強い寒波に伴う悪天候により、昨夜から交通機関に大きな影響が出た。特に北海道では荒れ模様の天気となり、空の便に多くの欠航便が出た。このため、新千歳空港では、乗客があふれる騒ぎとなった。』
・適合度第２位の文書：
『第１級の寒波の訪れにより、日本列島の天候は大荒れとなった。北日本を中心に交通機関が混乱した。』
関連情報検索部６００は、上記のような検索された関連文書集合の内容を拡張単語抽出部５００へと返却する。
【００５５】
拡張単語抽出部５００は、上記のように転送されてきた関連文書集合から指定された最大抽出単語数の拡張単語を抽出する。関連文書集合からの拡張単語抽出の処理としては、情報検索の分野における重要語抽出の手法を適用することができる。例えば、関連文書集合に含まれる各単語に対しては、ｔｆ＊ｉｄｆ法（「情報と言語処理」，徳永建伸、東京大学出版会，１９９９．）を適用した以下の式によりスコアを計算し、この値が上位のものから指定された最大抽出単語数の単語を拡張単語として抽出すればよい。
【００５６】
【数１】

ここで、ｗｉは、関連文書集合に含まれるｉ番目の単語、ｔｆｉｆは、ｗｉの関連文書ｊにおける頻度、ｉｄｆｉは検索対象の文書集合におけるｗｉの文書逆頻度を表す。
【００５７】
なお、検索対象の外部データベース７００がインターネット上の検索サイトなどである場合は、関連文書集合に含まれる単語集合を求めることが必要となる。この場合、既存技術として広く用いられている技術である形態素解析を適用すればよい。また、検索対象のデータベースが物理的・論理的に離れた場所に存在する場合には、ｉｄｆｉの値を求めることが困難であることが多い。この場合は、検索対象と同様の性質を持つ既存のデータコーパスなどの文書集合における値で代用してもよい。
【００５８】
本発明では、適当はｉｄｆｉを仮定することにより、以下の５つの拡張単語が抽出されたものとする（カッコ内は式１により計算されたスコアとする）。
【００５９】
新千歳空港（１４０），欠航便（１３０），北海道（１１０），寒波（１００），交通機関（８０）
上記のように抽出された拡張単語集合は、音声認識部３００が出力した音声認識結果とともに音声ドキュメント記述生成部７００へと転送される。
【００６０】
音声ドキュメント記述生成部７００は、音声認識結果であるＸＭＬ形式のデータに、転送されてきた拡張単語集合を埋め込むことにより、処理対象である音声ドキュメントの内容を記述するＸＭＬ形式のデータを生成する。
【００６１】
図８は、本発明の一実施例における生成されたＸＭＬ形式の音声ドキュメント記述データの例である。
【００６２】
同図に示す音声ドキュメント記述データは、図５に示した音声認識結果に対して、上記の拡張単語を埋め込むことにより、音声ドキュメント記述生成部７００が生成したものである。
【００６３】
図８においては、“ａｄｄｉｔｉｏｎａｌ−ｗｏｒｄｓ”というタグにより拡張単語集合が表現され、“ａｗａｒｄ ”というタグにより各拡張単語が表現されている。また、拡張単語を抽出する際に計算されたスコアの値が“ａｗａｒｄ ”タグにおける“ｓｃｏｒｅ ”という属性により記録されている。このようなデータ形式は、生成される音声ドキュメント記述データを説明するためのものであり、図示されている形式に限らない。
【００６４】
図８に示す音声ドキュメント記述データにおいては、図５の音声認識結果データにおいて正しく認識されていた「北海道」、「寒波」、「交通機関」が拡張単語としても抽出されていることから、当該音声ドキュメントにおける重要な単語であることが認識されている。また、当該音声ドキュメントにおいては発声されていない「欠航便」が拡張単語として抽出されている。これは、この単語が実際には発声されていないにも関わらず関連する単語として重要であることを示しており、検索時にも有用に用いられる可能性がある。さらに、当該音声ドキュメントに対する音声認識において「新地都政空港」として誤認識されていた「新千歳空港」が拡張単語として抽出されており、これは結果として、音声認識の誤りを補正する効果を持つ。
【００６５】
上記のように生成された音声ドキュメント記述データを音声ドキュメント自身の代わりとして用いることにより、音声ドキュメントに対して言語によるキーによって検索することが可能となる。また、この検索処理において、ＸＭＬの文書構造を扱うことのできるＸＭＬ検索エンジンを用いれば、拡張単語を検索対象に含めるかどうかを検索条件に指定するなどの高度な検索が可能となる。
【００６６】
さらに、図８に示したように、音声ドキュメント記述データに発声時間が記録されていれば、その音声ドキュメントの該当部分のみを再生するような音声ドキュメントアクセスも可能となる。
【００６７】
なお、上記の処理は、計算機上のＣＰＵ等の制御手段で行われ、検索結果をディスプレイなどの表示手段で表示するものとする。
【００６８】
また、上記の動作をプログラムとして構築し、当該プログラムをネットワークの通信回線や、フレキシブルディスクやＣＤ−ＲＯＭ等の記憶媒体から計算機上にインストールして、ＣＰＵ等の制御手段に実行させることも可能である。
【００６９】
なお、本発明は、上記の実施の形態及び実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【００７０】
【発明の効果】
上述のように、本発明によれば、音声認識を適用することにより、録画・録音され記憶手段に記憶されている音声ドキュメントの内容を言語によるキーにより検索する音声ドキュメント検索システムを実現することができ、特に、正しく認識され単語から重要語を抽出したり、検索対象に含まれていないが検索に有用な単語を抽出したり、誤って音声認識された単語を補正したりすることにより、検索精度を高めることができる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の原理構成図である。
【図３】本発明の一実施の形態における音声ドキュメント検索装置の構成図である。
【図４】本発明の一実施の形態における音声ドキュメント検索処理のフローチャートである。
【図５】本発明の一実施例における音声認識部により文字化された音声ドキュメントの一部である。
【図６】本発明の一実施例の拡張キー単語抽出条件テーブルのエントリの例である。
【図７】本発明の一実施例の拡張単語抽出条件テーブルのエントリの例である。
【図８】本発明の一実施例における生成されたＸＭＬ形式の音声ドキュメント記述データの例である。
【符号の説明】
１００音声ドキュメント検索装置
２００入力手段、入力部
３００音声認識手段、音声認識部
４００拡張キー抽出手段、拡張キー抽出部
５００拡張単語抽出手段、拡張単語抽出部
６００関連情報検索手段、関連情報検索部
７００音声ドキュメント記述生成手段、音声ドキュメント生成部
８００出力手段、出力部
９００音声ドキュメント記述ファイル
１０００外部データベース
[0001]
[Technical field of the invention]
The present invention relates to a voice document search method and apparatus and a voice document search program, and more particularly to a voice document search method and apparatus and a voice document search program that transcribe recorded audio or video content by applying speech recognition and search that content using linguistic keys.
[0002]
[Prior art]
Search systems that transcribe audio content by applying speech recognition and make the content searchable with linguistic keys are under research and development (for example, see Non-Patent Document 1). Because recognition errors are unavoidable when speech recognition is applied, measures to suppress their influence are needed. As a way to do this automatically, a method has been proposed that expands the audio content by referring to an external database distinct from the audio content being recognized (for example, see Non-Patent Document 2).
[0003]
Meanwhile, in speech recognition research and development, attempts have been made to have the speech recognition system itself attach a reliability to the speech recognition result it outputs (for example, see Non-Patent Document 3).
[0004]
[Non-patent document 1]
"Analysis and Integration of Multimedia Information" Yasuo Ariki, Artificial Intelligence Society Information Integration Study Group, SIG-CII-2000-Nov, 2000.
[0005]
[Non-Patent Document 2]
"Document Expansion for Speech Retrieval", Amit Singhal, Fernando Pereira, Proceedings of ACM SIGIR, pages 34-41, Berkeley, CA, USA, August 1999.
[0006]
[Non-Patent Document 3]
"Comparison of Reliability Measures for Improving Speech Recognition Accuracy" Jun Ogata, Yasuo Ariki, IEICE Technical Report (Speech Research Group SP2000-94), pp. 113-118 (December 2000)
[0007]
[Problems to be solved by the invention]
However, in the above conventional method of expanding audio content by referring to an external database, the entire speech recognition result is itself treated as the query, a related external database is searched by a similar-document retrieval technique, and extended words are extracted from the top-ranked voice documents. In other words, this method treats all words in the speech recognition result alike and gives no consideration to recognition accuracy.
[0008]
Therefore, in some cases the external database may be searched using as keys portions that are likely to have been misrecognized, and as a result the extracted extended words are more likely to include words that have little relevance to the audio content being searched.
[0009]
The present invention has been made in view of the above points. Its object is to provide a voice document search method and apparatus and a voice document search program that generate, for audio content, a content description document that reduces the influence of speech recognition errors on search accuracy and also enables search using keys related to the topics in the audio content.
[0010]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0011]
The present invention is a voice document search method for searching audio content including a recorded audio track using linguistic keys, in which:
the audio content to be searched is read from a storage medium and input to a voice document search apparatus (step 1); and
in the voice document search apparatus,
speech recognition processing is applied to the input audio content, and a transcribed speech recognition result is obtained together with its recognition reliability (step 2);
extended key words, which serve as keys in a search for expanding the speech recognition result, are extracted by matching the speech recognition result against extended key word extraction conditions stored in advance in a storage medium (step 3);
the extracted extended key words are matched against extended word extraction conditions stored in advance in storage means, and extended words for expanding the speech recognition result are obtained from a related document set obtained by searching, based on the resulting conditions, an external database that stores related documents (step 4);
an audio content description document file for the audio content to be searched is generated by embedding the extended words in the speech recognition result (step 5); and
the audio content description document file is output (step 6).
[0012]
FIG. 2 is a diagram showing the principle configuration of the present invention.
[0013]
The present invention is also a voice document search apparatus 100 for searching audio content including a recorded audio track using linguistic keys, comprising:
related information search means 600 for searching an external database 1000 that is provided outside the voice document search apparatus 100 and stores related documents;
input means 200 for inputting the audio content to be searched;
speech recognition means 300 for obtaining a transcribed speech recognition result together with its recognition reliability by applying speech recognition processing to the audio content input by the input means 200;
extended key word extraction means 400 for extracting, according to predetermined extended key word extraction conditions, extended key words that serve as keys in a search for expanding the speech recognition result;
extended word extraction means 500 for obtaining extended words, which are words that expand the speech recognition result, from a related document set obtained by having the related information search means 600 search the external database 1000 using the extracted extended key words according to predetermined extended word extraction conditions;
voice document description generation means 700 for generating an audio content description document file by embedding the extended words in the speech recognition result and thereby describing the content of the audio content to be searched; and
output means 800 for outputting the generated audio content description document file.
[0014]
The present invention is also a voice document search program for searching audio content including a recorded audio track using linguistic keys, the program causing control means of a computer to execute:
an input step of inputting the audio content to be searched;
a speech recognition step of obtaining a transcribed speech recognition result together with its recognition reliability by applying speech recognition processing to the audio content input in the input step;
an extended key word extraction step of extracting, according to predetermined extended key word extraction conditions, extended key words that serve as keys in a search for expanding the speech recognition result;
an extended word extraction step of obtaining extended words, which are words that expand the speech recognition result, from a related document set obtained by passing predetermined extended word extraction conditions, together with the extracted extended key words, to related information search means that searches an external database provided outside the voice document search apparatus and storing related documents;
a voice document description generation step of generating an audio content description document file by embedding the extended words in the speech recognition result and thereby describing the content of the audio content to be searched; and
an output step of outputting the generated audio content description document file.
[0015]
As described above, the present invention uses the reliability that the speech recognition device outputs for its own recognition result to extract the key words for the external database search from the portions judged to have been recognized accurately. This makes it possible to obtain high-quality extended words from external data.
[0016]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0017]
FIG. 3 shows a configuration of a voice document search device according to an embodiment of the present invention.
[0018]
The voice document search system 100 shown in FIG. 3 comprises an input unit 200, a speech recognition unit 300, an extended key word extraction unit 400, an extended key word extraction condition table 410, an extended word extraction unit 500, an extended word extraction condition table 510, a related information search unit 600, a voice document description generation unit 700, and an output unit 800.
[0019]
It is assumed that an external database 1000 is provided outside the voice document search device 100.
[0020]
The above-mentioned extended key word extraction condition table 410 and extended word extraction condition table 510 are stored in a storage unit such as a hard disk.
[0021]
The input unit 200 inputs audio content including a recorded audio track to be searched. The input audio content is represented by a digital signal and stored in a storage medium such as a hard disk, and the input unit 200 reads and inputs the audio content from the hard disk or the like.
[0022]
The speech recognition unit 300 applies speech recognition processing by an existing speech recognition device to the audio content input from the input unit 200, and outputs the transcribed speech recognition result (for example, in XML) with a reliability attached. An existing method is used to compute the reliability; see, for example, “Comparison of Reliability Measures for Improving Speech Recognition Accuracy”, Jun Ogata, Yasuo Ariki, IEICE Technical Report (Speech Research Group SP2000-94), pp. 113-118 (December 2000).
[0023]
The extended key word extraction condition table 410 is a table referred to by the extended key word extraction unit 400, in which conditions on part of speech and recognition reliability are set.
[0024]
The extended key word extraction unit 400 extracts the words that serve as keys in a search for expanding the speech recognition result output from the speech recognition unit 300, according to the predetermined extended key word extraction conditions stored in the extended key word extraction condition table 410.
[0025]
The extended word extraction condition table 510 is a table referred to by the extended word extraction unit 500, in which an external database identifier, a date constraint condition, a maximum number of documents to be extracted, and a maximum number of extracted words are set as extended word extraction conditions.
[0026]
Using the extended key words extracted by the extended key word extraction unit 400, and according to the predetermined extended word extraction conditions stored in the extended word extraction condition table 510, the extended word extraction unit 500 instructs the related information search unit 600 to search the external database 1000, and obtains extended words, which are words that expand the speech recognition result, from what the related information search unit 600 returns. In doing so, the extended word extraction unit 500 gives the related information search unit 600 an instruction that includes the word set, the identifier of the external database 1000, the constraint on document dates, the maximum number of documents, and the like.
[0027]
The related information search unit 600 is a search engine that searches the external database 1000 according to an instruction from the extended word extraction unit 500.
[0028]
The voice document description generation unit 700 generates a voice document description file 900, which describes the content of the audio content to be searched, from the speech recognition result obtained from the speech recognition unit 300 and the extended words obtained from the extended word extraction unit 500.
[0029]
The output unit 800 reads the voice document description file 900 generated by the voice document description generation unit 700 and outputs it to a system downstream of the voice document search apparatus 100.
[0030]
Alternatively, instead of generating the voice document description file 900, the voice document description generation unit 700 may output the generated voice document to the output unit 800, and the output unit 800 may display that voice document on display means.
[0031]
Next, the operation of the above configuration will be described. In the present invention, it is assumed that recorded digital audio content is searched, that the description of the retrieved content is obtained in a script language format such as XML or HTML, and that the text information contained in the content can thereby be displayed.
[0032]
FIG. 4 is a flowchart of a voice document search process according to one embodiment of the present invention.
[0033]
Step 101) The input unit 200 inputs audio content data to be searched.
[0034]
Step 102) The speech recognition unit 300 performs speech recognition on the input audio content data and outputs a voice document consisting of the speech recognition result and its reliability to the extended key word extraction unit 400.
[0035]
Step 103) The extended key word extraction unit 400 matches the acquired speech recognition result of the audio content against the extended key word extraction conditions in the extended key word extraction condition table 410, extracts the set of words that satisfy the conditions as extended key words, and outputs them to the extended word extraction unit 500.
[0036]
Step 104) Based on the word set acquired from the extended key word extraction unit 400, the extended word extraction unit 500 consults the extended word extraction condition table 510, obtains the identifier of the external database 1000, the constraint on document dates, and the maximum number of documents specified there, and passes these to the related information search unit 600, thereby instructing it to search the external database 1000.
[0037]
Step 105) The related information search unit 600 searches the external database 1000 based on the instruction passed from the extended word extraction unit 500 (the identifier of the external database 1000, the constraint on document dates, and the maximum number of documents), obtains the set of related documents that satisfy the conditions, and returns it to the extended word extraction unit 500.
[0038]
Step 106) The extended word extraction unit 500 extracts extended words from the acquired related document set, up to the specified maximum number of extracted words, and outputs them to the voice document description generation unit 700.
[0039]
Step 107) The voice document description generation unit 700 embeds the extended words extracted by the extended word extraction unit 500 in the speech recognition result obtained from the speech recognition unit 300 to generate voice document content description data, and writes the data to the voice document description file 900. Alternatively, the generated voice document description data may be output to the output unit 800 without generating a file.
[0040]
Step 108) The output unit 800 reads the voice document description file 900 and outputs it to a system downstream of this system. Alternatively, the voice document acquired from the voice document description generation unit 700 may be output to display means such as a display device.
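The flow of steps 102 to 107 can be sketched as a small pipeline. The following is only an illustrative outline in Python, not the patent's implementation: the `Word` class, `build_description`, and the four stub functions are hypothetical names, and the stubs return canned values in place of a real recognizer and search engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Word:
    text: str
    pos: str     # part of speech
    conf: float  # recognition reliability


def build_description(
    recognize: Callable[[bytes], List[Word]],
    extract_keys: Callable[[List[Word]], List[str]],
    search: Callable[[List[str]], List[str]],
    extract_extended: Callable[[List[str]], List[Tuple[str, float]]],
    audio: bytes,
) -> dict:
    """Chain the units in the order of the flowchart (steps 102 to 107)."""
    words = recognize(audio)              # step 102: recognition result + reliability
    keys = extract_keys(words)            # step 103: extended key words
    related = search(keys)                # steps 104-105: related document set
    extended = extract_extended(related)  # step 106: extended words with scores
    # step 107: the recognition result with the extended words embedded
    return {"words": words, "additional_words": extended}


# Hypothetical stand-ins for the real units, returning canned values.
def fake_recognize(audio: bytes) -> List[Word]:
    return [Word("寒波", "名詞", 250), Word("大ハブ", "名詞", 40)]


def fake_extract_keys(words: List[Word]) -> List[str]:
    return [w.text for w in words if w.conf >= 200]


def fake_search(keys: List[str]) -> List[str]:
    return ["強い寒波により欠航便が出た"] if keys else []


def fake_extract_extended(docs: List[str]) -> List[Tuple[str, float]]:
    return [("欠航便", 130.0)] if docs else []


description = build_description(
    fake_recognize, fake_extract_keys, fake_search, fake_extract_extended, b""
)
```

Passing the units in as functions mirrors the block structure of the apparatus, so any one unit (for example the recognizer) could be swapped without touching the others.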
[0041]
[Example]
Hereinafter, an example of the present invention will be described with reference to the drawings.
[0042]
Hereinafter, the operation of the voice document search system of the present invention will be described using a specific example.
[0043]
In the following example, it is assumed that the external database 1000 is a database provided in a search site or the like.
[0044]
FIG. 5 shows a part of a voice document transcribed by the speech recognition unit in one example of the present invention. The voice document shown in FIG. 5 was input from the input unit 200, recognized by the speech recognition unit 300, and transcribed.
[0045]
Suppose that the actual utterance was
『昨夜からの寒波の訪れで、北海道は大雪となり、新千歳空港発の便など交通機関が大幅に乱れました』 ("With the arrival of a cold wave since last night, Hokkaido has had heavy snow, and transportation, including flights departing from New Chitose Airport, was severely disrupted"),
but that, because of speech recognition errors, it was transcribed as
『咲く世からの寒波の訪れで北海道は大雪となら新地都政空港発の便など交通機関が大ハブに乱れまして』
The errors are homophone-like substitutions in Japanese: 昨夜 ("last night") became 咲く世 ("blooming world"), 新千歳 ("New Chitose") became 新地都政, and 大幅 ("severely") became 大ハブ ("big hub").
[0046]
The output of the speech recognition unit 300 in FIG. 5 is structured in XML. That is, the voice document doc is expressed as a set of phrase elements, each an utterance unit, and each utterance unit is expressed as a set of the word elements it contains. For each utterance unit and each word in it, the start time and end time are recorded in the attributes begin and end, respectively. For each word, the word form transcribed by speech recognition is recorded as the content of the XML element, and the word's part of speech and speech recognition reliability are recorded in the attributes pos and conf, respectively. The speech recognition result illustrated in FIG. 5 only exemplifies the concepts needed to explain the present invention; the data format, including the XML tag structure, need not be limited to this one. Any speech recognition device that can output such information can be used as the speech recognition unit 300.
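As an illustration of the structure just described, the sketch below parses a small XML fragment with Python's standard `xml.etree.ElementTree`. The element names (doc, phrase, word) and attribute names (begin, end, pos, conf) come from the description; the time values, the reliability of 40 for the misrecognized word, and the function name `read_words` are assumptions, since FIG. 5 itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape described for FIG. 5.
# Times and the low reliability value are made up for illustration.
SAMPLE = """
<doc>
  <phrase begin="0.0" end="3.2">
    <word begin="0.0" end="0.6" pos="名詞" conf="40">咲く世</word>
    <word begin="0.9" end="1.4" pos="名詞" conf="250">寒波</word>
    <word begin="1.6" end="2.0" pos="名詞" conf="250">訪れ</word>
  </phrase>
</doc>
"""


def read_words(xml_text):
    """Return one dict per word element, keeping timing, part of speech and reliability."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for phrase in root.iter("phrase"):
        for word in phrase.iter("word"):
            words.append({
                "text": word.text,
                "pos": word.get("pos"),
                "conf": float(word.get("conf")),
                "begin": float(word.get("begin")),
                "end": float(word.get("end")),
            })
    return words


words = read_words(SAMPLE)
```

Keeping begin and end alongside each word is what later allows playback of only the matching part of the audio.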
[0047]
FIG. 6 shows an example of entries in the extended key word extraction condition table in one example of the present invention. The extended key word extraction condition table 410 shown in FIG. 6 stores extended key word extraction conditions that are set in advance.
[0048]
In the example shown in FIG. 6, three conditions on part of speech and recognition reliability are set. As in this example, taking the speech recognition reliability into account extracts words that are likely to have been recognized correctly. In addition, extracting words with parts of speech such as nouns and verbs yields words that can serve as keywords when the extended word extraction unit 500 and the related information search unit 600 search the external database 1000 for related documents. These conditions are set empirically according to the speech recognition device used as the speech recognition unit 300.
[0049]
The extended key word extraction unit 400 compares the speech recognition result as shown in FIG. 5 with an extended key word extraction condition as shown in FIG. 6, and extracts an extended key word. For the examples of FIGS. 5 and 6, the following five expanded key words (the part of speech and the recognition reliability in parentheses) are extracted.
[0050]
"Cold wave (noun, 250), visit (noun, 250), Hokkaido (proper noun, 150), transportation (noun, 250), turbulence (verb, 300)"
The extended key words extracted as described above are transferred to the extended word extraction unit 500.
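The comparison performed by the extended key word extraction unit 400 can be sketched as a filter over the recognized words. This is only an illustration: the thresholds in CONDITIONS and the sample words are assumed values, not the entries of FIG. 6, and the function name extract_key_words is hypothetical.

```python
# Assumed conditions: part of speech -> minimum recognition reliability.
# These thresholds are illustrative and are not the values of FIG. 6.
CONDITIONS = {"noun": 200, "proper noun": 100, "verb": 300}

def extract_key_words(words, conditions):
    """Keep words whose part of speech is listed and whose conf meets its threshold."""
    selected = []
    for w in words:
        threshold = conditions.get(w["pos"])
        if threshold is not None and w["conf"] >= threshold:
            selected.append(w["text"])
    return selected

# Hypothetical recognized words (text, part of speech, reliability).
recognized = [
    {"text": "cold wave", "pos": "noun", "conf": 250},
    {"text": "of", "pos": "particle", "conf": 400},          # part of speech not listed
    {"text": "Hokkaido", "pos": "proper noun", "conf": 150},
    {"text": "airport", "pos": "noun", "conf": 120},         # reliability too low
    {"text": "turbulence", "pos": "verb", "conf": 300},
]

key_words = extract_key_words(recognized, CONDITIONS)
```

Because the conditions are plain data, they can be tuned empirically for a given recognizer without changing the filter itself.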
[0051]
FIG. 7 shows an example of entries in the extended word extraction condition table according to one embodiment of the present invention. The extended word extraction condition table 510 shown in the figure stores preset extended word extraction conditions. As shown in the figure, the extended word extraction conditions consist of four entries. The first entry is an identifier of the external database 700 to be searched by the related information search unit 600. In the example of FIG. 7, a news search site on the Internet (foo-news.com) is specified. The second entry is a constraint on the date of the documents to be searched. In the example shown in the figure, only documents dated from January 2, 2003 to January 4, 2003 are subject to extended word extraction. The third entry specifies the maximum number of documents from which extended words are extracted. An ordinary Internet site search or database search returns multiple documents in order of relevance to the search request, so the specified number of documents from the top are subjected to extended word extraction. In the example of FIG. 7, only the top two documents are subject to extended word extraction. The fourth entry specifies the maximum number of extended words actually extracted. In the example of FIG. 7, up to five extended words are extracted.
[0052]
The extended word extraction unit 500 transfers to the related information search unit 600 the word set received from the extended key word extraction unit 400, together with the identifier of the external database 700, the document date constraint, and the maximum number of documents specified in the extended word extraction condition table 510.
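The four condition entries and the hand-off to the related information search unit 600 can be sketched as follows. The class and function names are hypothetical, and the request is modeled as a plain dictionary because the description does not specify an interface. The condition values are those given for the example of FIG. 7.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ExtendedWordConditions:
    """The four entries of the extended word extraction conditions."""
    database_id: str
    date_from: date
    date_to: date
    max_documents: int
    max_extended_words: int

def build_search_request(key_words, cond):
    """Bundle what is passed to the search unit; max_extended_words stays behind."""
    return {
        "database": cond.database_id,
        "keywords": list(key_words),
        "date_from": cond.date_from.isoformat(),
        "date_to": cond.date_to.isoformat(),
        "max_documents": cond.max_documents,
    }

def select_documents(ranked_docs, cond):
    """ranked_docs: (date, text) pairs already ordered by relevance."""
    in_range = [d for d in ranked_docs if cond.date_from <= d[0] <= cond.date_to]
    return in_range[:cond.max_documents]

cond = ExtendedWordConditions("foo-news.com", date(2003, 1, 2), date(2003, 1, 4), 2, 5)
request = build_search_request(["cold wave", "Hokkaido"], cond)

# Hypothetical ranked results; the second one violates the date constraint.
ranked = [
    (date(2003, 1, 3), "doc A"),
    (date(2003, 1, 5), "doc B"),
    (date(2003, 1, 2), "doc C"),
    (date(2003, 1, 4), "doc D"),
]
selected = select_documents(ranked, cond)
```

In this sketch the maximum number of extended words is deliberately not part of the request, since only the extraction unit needs it.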
[0053]
The related information search unit 600 searches the external database 700 for related documents based on the transferred information, using the transferred word set as the keywords of the search request. Here, it is assumed that a search request based on the word set
"Cold wave, visit, Hokkaido, transportation, turbulence"
is issued to the news search site (foo-news.com), and that documents with the following contents are retrieved as the two most relevant documents among those meeting the specified conditions.
[0054]
・ The document with the highest relevance:
"The bad weather brought by the strong cold wave on the Sea of Japan side has had a major impact on transportation since last night. In Hokkaido in particular, the weather was rough and many air flights were canceled. As a result, New Chitose Airport was noisy with passengers."
・ The document with the second highest relevance:
"The arrival of a first-class cold wave has severely affected the Japanese archipelago. Transportation was disrupted, especially in northern Japan."
The related information search unit 600 returns the contents of the set of related documents searched as described above to the extended word extraction unit 500.
[0055]
The extended word extraction unit 500 extracts extended words, up to the specified maximum number, from the related document set transferred as described above. To extract extended words from a set of related documents, techniques for extracting important words from the field of information retrieval can be applied. For example, for each word included in the set of related documents, a score may be calculated by the following formula using the tf * idf method ("Information and Language Processing", Takenobu Tokunaga, University of Tokyo Press, 1999), and the words with the highest scores, up to the specified maximum number, may be extracted as extended words.
[0056]
(Equation 1)

Here, wi is the i-th word included in the related document set, tfij is the frequency of wi in related document j, and idfi is the inverse document frequency of wi in the search target document set.
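Reading these definitions literally, one plausible form of Expression 1 sums the product of the two quantities over the related documents. This form is stated here as an assumption consistent with the definitions, not as the formula of the original; the symbol score(w_i) is introduced only for illustration.

```latex
\mathrm{score}(w_i) = \sum_{j} \mathit{tf}_{ij} \cdot \mathit{idf}_i
```

Under this reading, a word scores highly when it occurs often across the related documents and rarely in the search target document set as a whole.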
[0057]
When the external database 700 to be searched is a search site on the Internet or the like, the set of words contained in the related document set must first be obtained. For this purpose, morphological analysis, a widely used existing technique, may be applied. Further, when the database to be searched is physically and logically remote, it is often difficult to obtain the value of idfi. In this case, a value computed on another document set with the same properties as the search target, such as an existing corpus, may be used instead.
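The scoring and the fallback to a substitute corpus can be sketched as follows. Several things here are assumptions: the documents are taken as already split into words (standing in for morphological analysis), the score is taken to be the sum over related documents of tf times idf, and idf uses a smoothed definition, log((N + 1) / (df + 1)) + 1, chosen so that words absent from the reference corpus still receive a finite value. All function names and sample data are hypothetical.

```python
import math
from collections import Counter

def make_idf(reference_docs):
    """Build an idf function from a substitute corpus of tokenized documents."""
    n = len(reference_docs)
    df = Counter()
    for doc in reference_docs:
        df.update(set(doc))  # count each word once per document
    # Smoothed idf (an assumption): finite even for unseen words (df == 0).
    return lambda w: math.log((n + 1) / (df[w] + 1)) + 1

def top_extended_words(related_docs, idf, max_words):
    """Score each word as the sum over documents of tf * idf, keep the best."""
    scores = Counter()
    for doc in related_docs:
        for w, tf in Counter(doc).items():
            scores[w] += tf * idf(w)
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_words]

# Hypothetical tokenized related documents and substitute corpus.
related = [
    ["Hokkaido", "flight", "canceled", "the"],
    ["cold wave", "Hokkaido", "the"],
]
reference = [
    ["the", "weather"],
    ["the", "market"],
    ["the", "Hokkaido"],
    ["the", "flight"],
]

top = top_extended_words(related, make_idf(reference), 3)
```

In this toy data the frequent word "the" is pushed down by its low idf, while "Hokkaido", which appears in both related documents, ranks first.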
[0058]
Here, it is assumed that the following five extended words are extracted with idfi assumed appropriately (the values in parentheses are the scores calculated by Expression 1).
[0059]
New Chitose Airport (140), Canceled flight (130), Hokkaido (110), Cold wave (100), Transportation (80)
The expanded word set extracted as described above is transferred to the voice document description generation unit 700 together with the voice recognition result output by the voice recognition unit 300.
[0060]
The voice document description generation unit 700 generates XML format data describing the content of the voice document to be processed by embedding the transferred extended word set in the XML format data as the voice recognition result.
[0061]
FIG. 8 is an example of generated audio document description data in XML format according to an embodiment of the present invention.
[0062]
The voice document description data shown in FIG. 8 is generated by the voice document description generation unit 700 by embedding the above-described extended words in the voice recognition result shown in FIG. 5.
[0063]
In FIG. 8, an extended word set is represented by a tag "additional-words", and each extended word is represented by a tag "award". Also, the score value calculated when extracting the extended word is recorded by the attribute “score” in the “award” tag. Such a data format is for explaining the generated audio document description data, and is not limited to the illustrated format.
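A minimal sketch of the embedding step follows. The tag names "additional-words" and "award" and the attribute "score" are used exactly as they appear in the text above; the root element name doc, the placement of the container as the last child of the root, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

def embed_extended_words(recognition_xml, extended_words):
    """Append an additional-words block holding (word, score) pairs."""
    root = ET.fromstring(recognition_xml.strip())
    container = ET.SubElement(root, "additional-words")
    for word, score in extended_words:
        # Tag and attribute names follow the text; placement is assumed.
        element = ET.SubElement(container, "award", {"score": str(score)})
        element.text = word
    return ET.tostring(root, encoding="unicode")

# Hypothetical recognition result with one utterance unit.
recognition = """
<doc>
  <phrase begin="0.0" end="3.2">
    <word begin="1.1" end="1.8" pos="proper noun" conf="150">Hokkaido</word>
  </phrase>
</doc>
"""

description = embed_extended_words(
    recognition,
    [("New Chitose Airport", 140), ("Canceled flight", 130), ("Hokkaido", 110)],
)
embedded = ET.fromstring(description)
```

Keeping the extended words in their own container leaves the original recognition result untouched, so a search can later include or exclude them.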
[0064]
In the voice document description data shown in FIG. 8, "Hokkaido", "Cold wave", and "Transportation", which were correctly recognized in the voice recognition result data of FIG. 5, are also extracted as extended words. This indicates that they are recognized as important words in the document. In addition, "Canceled flight", which was not uttered in the voice document, is extracted as an extended word. This indicates that the word is important as a related word even though it was not actually uttered, and it may be useful during a search. Furthermore, "New Chitose Airport", which was erroneously recognized as "Shinchi Metropolitan Airport" in the voice recognition of the voice document, is extracted as an extended word, which has the effect of correcting the voice recognition error.
[0065]
By using the audio document description data generated as described above in place of the audio document itself, the audio document can be searched using a key expressed in language. In this search processing, if an XML search engine that can handle the XML document structure is used, advanced searches become possible, such as specifying as a search condition whether extended words are included in the search target.
[0066]
Further, as shown in FIG. 8, if the utterance times are recorded in the voice document description data, the voice document can be accessed so that only the corresponding portion is played back.
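The search over the description data, with the option of including extended words and with utterance times returned for playback, can be sketched as follows. This is an illustration under assumptions: exact string matching stands in for a real XML search engine, the element names doc, phrase, and word are assumed, and the sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical description data: one utterance containing a misrecognized word,
# plus extended words embedded under additional-words.
DESCRIPTION = """
<doc>
  <phrase begin="0.0" end="3.2">
    <word begin="0.0" end="0.9" pos="proper noun" conf="150">Hokkaido</word>
    <word begin="0.9" end="2.4" pos="noun" conf="80">Shinchi Metropolitan Airport</word>
  </phrase>
  <additional-words>
    <award score="140">New Chitose Airport</award>
    <award score="110">Hokkaido</award>
  </additional-words>
</doc>
"""

def search(description_xml, key, include_extended=True):
    """Return utterance spans containing the key and whether an extended word matched."""
    root = ET.fromstring(description_xml.strip())
    spans = []
    for phrase in root.iter("phrase"):
        if any((w.text or "").strip() == key for w in phrase.iter("word")):
            # The span tells a player which part of the audio to reproduce.
            spans.append((float(phrase.get("begin")), float(phrase.get("end"))))
    matched_extended = include_extended and any(
        (a.text or "").strip() == key for a in root.iter("award")
    )
    return {"spans": spans, "matched_extended": matched_extended}

spoken = search(DESCRIPTION, "Hokkaido")
corrected = search(DESCRIPTION, "New Chitose Airport")
strict = search(DESCRIPTION, "New Chitose Airport", include_extended=False)
```

In this sample the key "New Chitose Airport" matches the document only through the extended words, which illustrates how embedding them compensates for the recognition error.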
[0067]
The above processing is performed by a control unit such as a CPU on a computer, and the search result is displayed on a display unit such as a display.
[0068]
Further, it is also possible to implement the above operation as a program, install the program on a computer via a network communication line or from a storage medium such as a flexible disk or a CD-ROM, and have a control means such as a CPU execute it.
[0069]
Note that the present invention is not limited to the above-described embodiments and examples, and various modifications and applications are possible within the scope of the claims.
[0070]
[Effect of the Invention]
As described above, according to the present invention, applying voice recognition makes it possible to realize a voice document search system that uses a language key to search the content of video-recorded or audio-recorded voice documents stored in a storage unit. In particular, search accuracy can be increased by extracting key words from correctly recognized words, by extracting words that are not included in the search target but are useful for searching, and by correcting words that were recognized incorrectly.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of a voice document search device according to an embodiment of the present invention.
FIG. 4 is a flowchart of a voice document search process according to one embodiment of the present invention.
FIG. 5 is a part of a voice document transcribed by a voice recognition unit in one embodiment of the present invention.
FIG. 6 is an example of an entry of an extended key word extraction condition table according to an embodiment of the present invention.
FIG. 7 is an example of an entry of an extended word extraction condition table according to an embodiment of the present invention.
FIG. 8 is an example of generated audio document description data in XML format according to an embodiment of the present invention.
[Explanation of symbols]
100 Voice document search device
200 Input means, input unit
300 Voice recognition means, voice recognition unit
400 Extended key extracting means, extended key extracting unit
500 Extended word extraction means, extended word extraction unit
600 Related information search means, related information search unit
700 Voice document description generation means, voice document description generation unit
800 Output means, output unit
900 Voice document description file
1000 External database

Claims

An audio document search method for searching audio content including a recorded audio track by a key in a language,
The audio content to be searched is read from a storage medium and input to an audio document search device,
In the voice document search device,
Performing a voice recognition process on the input voice content to obtain a transcribed voice recognition result together with the recognition reliability,
By comparing an extended key word extraction condition stored in a storage medium in advance with the voice recognition result, an extended key word serving as a key in a search for extending the voice recognition result is extracted,
The extracted extended key word is compared with an extended word extraction condition stored in a storage medium in advance, an external database that accumulates related document sets is searched based on the conditions obtained as a result, and an extended word for extending the voice recognition result is acquired from the related document set thus obtained,
By embedding the extended word in the voice recognition result, a voice content description document file of the voice content to be searched is generated,
A voice document search method, comprising outputting the voice content description document file.

An audio document search device for searching audio content including a recorded audio track by a key in a language,
A related information search unit provided outside the voice document search device and searching an external database that stores related documents;
Input means for inputting the audio content to be searched;
Voice recognition means for obtaining a transcribed voice recognition result together with recognition reliability by applying voice recognition processing to the voice content input by the input means,
Extended key word extraction means for extracting an extended key word that is a key in a search for extending a speech recognition result according to a predetermined extended key word extraction condition;
Extended word extracting means for acquiring an extended word, which is a word for extending the voice recognition result, from a related document set obtained by having the related information search means search the external database according to a predetermined extended word extraction condition using the extracted extended key word;
Voice document description generating means for embedding the extended word in the voice recognition result and generating a voice content description document file describing the content of the voice content to be searched;
Output means for outputting the generated voice content description document file.

An audio document search program for searching audio content including a recorded audio track by a key in a language,
An input step of inputting the audio content to be searched;
A voice recognition step of obtaining a transcribed voice recognition result together with recognition reliability by applying voice recognition processing to the voice content input in the input step,
An extended key word extraction step of extracting an extended key word that is a key in a search for extending a speech recognition result according to a predetermined extended key word extraction condition;
An extended word extraction step of passing a predetermined extended word extraction condition, together with the extracted extended key word, to a related information search unit that is provided outside the voice document search device and searches an external database storing related documents, and acquiring, from the related document set obtained by the search, an extended word that is a word for extending the voice recognition result,
A voice document description generating step of generating a voice content description document file of the voice content to be searched by embedding the extended word in the voice recognition result;
An output step of outputting the generated audio content description document file, the audio document search program causing a control unit of a computer to execute the above steps.