JP6174802B2

JP6174802B2 - Information processing apparatus and information processing system

Info

Publication number: JP6174802B2
Application number: JP2016530777A
Authority: JP
Inventors: 文也工藤; 知明秋富
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2017-08-02
Anticipated expiration: 2034-07-04
Also published as: JPWO2016002062A1; WO2016002062A1

Description

The present invention relates to an information processing apparatus and an information processing system. More specifically, it relates to an information processing apparatus and an information processing system that support the analysis of table-format data.

In recent years, systems that use the big data on business performance accumulated by companies to analyze the factors leading to improved performance have been actively developed. An analyst narrows down a large volume of data containing various kinds of information by applying conditions, thereby limiting the candidate factors, and examines how they relate to the objective. Here, the granularity at which the conditions are narrowed is important. Take the analysis of a store's sales growth as an example. When customer traffic is examined by time, the results differ greatly depending on the time granularity, such as every minute, every hour, or every six hours. Analysts thus process the data under various conditions and granularities, or analyze it using relationships within the data. However, as the data to be analyzed grows in size, it has become difficult for analysts to process the data and discover such relationships by hand. A system that supports this kind of analysis is therefore needed.

Patent Document 1: JP 2003-22277 A

Prior to the present invention, the inventors studied the data granularity described in the background art, and in particular the extraction of granularity relationships between columns in collected data.

FIG. 1 shows a specific example of a table handled as an analysis target and specific examples of granularity relationships between columns. In table 001, each row, which represents one sample, is called a record, and each vertical field of the table, such as "customer ID" 002, "age" 003, and "store entry time" 004, is called a column.

The columns of a table may store records at different granularities. For example, "product classification" 005 is a dictionary-level superordinate concept of "product name" 006 and therefore includes "product name" 006. Likewise, because every customer has exactly one age value and several customers may share the same age, "age" 003 includes "customer ID" 002 in terms of information amount.

Finer-grained columns such as "product name" 006 and "customer ID" 002 are called child columns, and coarser columns such as "product classification" 005 and "age" 003 are called parent columns. That is, the column pairs in table 001 that have an inclusion relationship are "product name" 006 and "product classification" 005, and "customer ID" 002 and "age" 003. An example of a pair with no inclusion relationship is "product classification" 005 and "temperature" 008. If inclusion relationships can be found in advance, that information can be used during analysis.

As an example of a technique for extracting inclusion relationships, Patent Document 1 describes registering concept-hierarchy information, such as synonyms and superordinate words of given words, in a thesaurus dictionary in advance, and providing a search method that takes the concept hierarchy between words into account at search time.

With the technique of Patent Document 1, the conceptual relationship between words can be obtained for words whose synonyms and superordinate words are stored in the thesaurus dictionary. However, an ordinary thesaurus dictionary does not register "age" as a superordinate concept of "customer ID", for example. Moreover, the column names in the tables considered here can be assigned freely by the table's creator. The column holding product names may therefore be called "Label", "Product Name", "pro_name", or "商品名" (Japanese for "product name"), depending on the creator, and the name may well be a word that is not registered in the thesaurus dictionary. In these cases, the method of searching for superordinate words with the technique of Patent Document 1 cannot be applied.

In view of the above, an object of the present invention is to provide a technique that makes it easier to extract inclusion relationships between columns of table-format data.

A representative means for solving the problem is an information processing method comprising: a first step of calculating, for an input table including a first column and a second column, a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column; a second step of comparing the magnitudes of the first information amount and the second information amount; a third step of calculating a first conditional information amount, which is the information amount of the first column conditioned on the second column; and a fourth step of determining an inclusion relationship between the first column and the second column based on the first conditional information amount.

Another representative means is an information processing system comprising: a storage unit that stores an input table including a first column and a second column; an information amount calculation processing unit that calculates a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column, and compares the magnitudes of the first information amount and the second information amount; and an inclusion relationship calculation processing unit that calculates a first conditional information amount, which is the information amount of the first column conditioned on the second column, and determines an inclusion relationship between the first column and the second column based on the first conditional information amount.

According to the present invention, inclusion relationships between columns of table-format data can be extracted more easily.

FIG. 1 is a schematic diagram explaining the inclusion relationship between columns.
FIG. 2 is a block diagram showing the configuration of the entire system.
FIG. 3 is a processing flow diagram of the inclusion relationship calculation and extraction unit.
FIG. 4 is a processing flow diagram of the information amount calculation.
FIG. 5 is a processing flow diagram of the conditional information amount calculation.
FIG. 6 is a processing flow diagram in which all columns are treated as child columns.
FIG. 7 is a diagram showing the input table.
FIG. 8 is a diagram showing the inclusion relationship information table.
FIG. 9 is a diagram showing the parent flag table.
FIG. 10 is a schematic diagram showing a specific example of the inclusion relationship calculation processing.
FIG. 11 is a processing flow diagram for automatically generating columns.
FIG. 12 is a diagram showing a specific example of the information of all columns.
FIG. 13 is a diagram showing the classification of types.
FIG. 14 is a processing flow diagram of the time table generation processing.
FIG. 15 is a processing flow diagram of the automatic table generation processing.
FIG. 16 is a diagram showing specific examples of the time table and the automatically generated table.
FIG. 17 is a diagram showing the inclusion relationship information table.
FIG. 18 is a diagram showing the output table.

An embodiment of the present invention is described below with reference to the drawings. FIG. 2 is a block diagram showing the hardware configuration of the entire system. The information processing system 100 according to this embodiment has an information processing apparatus 100, which includes a central processing unit 101 and a storage device 102, as well as an input device 103 and an output device 104.

The central processing unit 101 is a processor that executes programs stored in the storage device 102, and has an information amount calculation processing unit 105, an inclusion relationship calculation processing unit 106, and so on. The storage device 102 is a large-capacity, non-volatile storage device such as a magnetic storage device or flash memory, and stores an input table 107, an output table 108, an inclusion relationship information table 109, and so on. The input device 103 is a user interface such as a keyboard or mouse, and the output device 104 is a user interface such as a display device or printer. Here the information processing apparatus 100 is connected to the input device 103 and the output device 104 via a network, but the configuration is not limited to this. The system may be built on a single physical computer, or on logical partitions configured on one or more physical computers. The information amount calculation processing unit 105 calculates the information amount of each column, as described later with reference to FIG. 4, for example. The inclusion relationship calculation processing unit 106 calculates the conditional information amount of the parent column given the child column, as described later with reference to FIG. 5, for example.

Next, the method by which the information processing system 100 extracts inclusion relationships between columns of table-format data is described with reference to FIG. 3. FIG. 3 is a flow diagram of the processing in the central processing unit 101. The input data of the flow in FIG. 3 are the input table 110 and the child column information 203. The input table 110 is a table such as the one in the upper part of FIG. 7, storing, for example, information on a store's sales and customers, with data in each column (customer ID, age, store entry time, ...). Each vertical field of this table is called a "column" below. Each row of the table is called a record and represents one sample. The "child column" indicated by the child column information 203 is a single column arbitrarily selected from the input table 110, such as "product classification" in the lower part of FIG. 7. FIG. 3 describes the flow when the child column is fixed to "product classification"; the flow in which every column is treated as a child column is described with FIG. 6.

First, in step 301, one column is selected from the columns of the input table 110 as a parent column candidate, and parent column information 302 is generated. The lower part of FIG. 7 shows an example of the parent column information 302. Parent column candidates are selected in order from the input table 110, excluding the column of the child column information 203 and any column flagged in the parent flag table 309. The parent flag table 309 holds flags marking column pairs whose inclusion relationship is already known. Using this table makes it possible to skip combinations that need not be examined as parent columns. FIG. 9 shows a specific example of the parent flag table 309, in which, for example, a flag is set for "product name" and "product classification" as having an inclusion relationship.

Next, in step 303, the information amount H(X) of the column of the parent column information 302 and the information amount H(Y) of the column of the child column information 203 are calculated. A larger information amount indicates that the column has more unique records. The number of unique records is the number of records in the column when duplicates are counted only once.

Next, in step 304, the information amounts are compared, and the result is Yes when 0 < H(parent column) < H(child column). A column whose records all have the same value has an information amount of 0 and therefore always satisfies H(parent column) < H(child column), but such a relationship is meaningless, so the condition 0 < H(parent column) is added to exclude these columns.

If the result of step 304 is Yes, then in step 305 the conditional information amount H(X|Y) of the parent column given the child column is calculated from the parent column information 302 and the child column information 203.

Next, in step 306, it is determined whether the child column includes the parent column. This step determines the inclusion relationship in terms of the information amount between columns, and the result is Yes when H(X|Y) is 0. To allow for noise in the column data, the result is also Yes when H(X|Y) is sufficiently close to 0, that is, less than a predetermined threshold α (for example, 0.1).

In step 307, the result of the above determination is registered in the inclusion relationship information table 109. If the result of step 306 is Yes, the column name and the corresponding parent column name are registered in the inclusion relationship information table 109.

Next, the parent flag table 309 is updated in step 308, and the inclusion relationship information table 109 is updated in step 310.

Step 303 of FIG. 3 is now described in detail with reference to FIG. 4. First, in step 401, the parent column information 302 and the child column information 203 are taken as input, and parent column information and child column information from which records containing Null have been removed are output. Because a record of the parent column information 302 and the corresponding record of the child column information 203 are handled as a pair, both are removed if either contains Null. FIG. 10 shows a specific example. Suppose the "product classification" column is selected from the input table 110 as the parent column information 302 and the "product name" column as the child column information 203. When step 401 is executed, the fifth record of the "product name" column contains "slippers" but the fifth record of the "product classification" column is null, so parent column information 701 and child column information 702, from which the fifth record has been deleted, are output. After that, the variable Z is initialized (step 402), and the unique records in each column are extracted and stored in X (step 403).
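The paired removal of step 401 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `drop_null_pairs`, the use of `None` to represent Null, and the three-row sample are all hypothetical; only the "slippers" row with a missing classification mirrors the example in the text.

```python
def drop_null_pairs(parent, child):
    """Remove every row in which the parent or the child value is None.

    The two lists are treated as paired columns of the same table, so a
    Null on either side removes the row from both outputs (step 401).
    """
    kept = [(p, c) for p, c in zip(parent, child)
            if p is not None and c is not None]
    parent_out = [p for p, _ in kept]
    child_out = [c for _, c in kept]
    return parent_out, child_out


# Hypothetical excerpt: the second row has a product name but no classification.
parent = ["stationery", None, "food"]
child = ["pen", "slippers", "tea"]

parent_clean, child_clean = drop_null_pairs(parent, child)
print(parent_clean)  # ['stationery', 'food']
print(child_clean)   # ['pen', 'tea']
```

Keeping the two columns aligned matters because the later joint probability p(x, y) is counted row by row.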

Next, in step 404, the information amount is calculated for each record x in X by equation (1) below.

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) \tag{1}$$

where
x: each unique record of the column
X: the set of records x
p(x): the probability that a record of the column is x
H(X): the information amount of the column.
In step 405, this calculation is performed for all x of the parent column. The same calculation is then performed for the child column in steps 406 and 408 to 410, which yields the information amounts of the parent column and the child column (step 407).
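The information amount calculation of steps 403 to 405 can be sketched as follows, assuming that the information amount is the Shannon entropy in bits (base-2 logarithm), which is the reading that reproduces the numerical values given in this description. The function name `column_entropy` and the sample values are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Return the entropy in bits of the empirical distribution of a column."""
    counts = Counter(values)          # unique records and their frequencies
    total = len(values)
    # p * log2(1/p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((n / total) * log2(total / n) for n in counts.values())


# Four equally frequent values carry 2 bits; a constant column carries 0 bits.
print(column_entropy(["a", "b", "c", "d"]))  # 2.0
print(column_entropy(["a"] * 5))             # 0.0
```

A constant column evaluating to 0 is the case that the condition 0 < H(parent column) in step 304 is meant to exclude.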

The specific calculation is as follows. First, the information amount is calculated for the parent column information 701. In the parent column information 701, the "product classification" column has 9 records. Of these, 3 are "stationery", so the probability for x = "stationery" is p(x) = 3/9. Similarly, p(x) = 4/9 for x = "food" and p(x) = 2/9 for x = "kitchen", so the information amount of X = the "product classification" column is calculated by equation (2) below as H(X) = 1.53.

$$H(X) = -\left(\frac{3}{9}\log_2\frac{3}{9} + \frac{4}{9}\log_2\frac{4}{9} + \frac{2}{9}\log_2\frac{2}{9}\right) = 1.53 \tag{2}$$

where
x: each unique record of the "product classification" column
X: the set of records x
p(x): the probability that a record of the "product classification" column is x
H(X): the information amount of the "product classification" column.
Similarly, calculating the information amount for the child column information 702 gives H(Y) = 2.95 by equation (3).

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) = 2.95 \tag{3}$$

where
y: each unique record of the "product name" column
Y: the set of records y
p(y): the probability that a record of the "product name" column is y
H(Y): the information amount of the "product name" column.
Therefore 0 < H(X) < H(Y), and the result of the subsequent step 304 is Yes. That is, in this example the "product name" column has a larger information amount than the "product classification" column. Put differently, the "product name" column has more unique records than the "product classification" column. The number of unique records is the number of records in the column when duplicates are counted only once.
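The two values H(X) = 1.53 and H(Y) = 2.95 and the comparison of step 304 can be checked with a short sketch, assuming base-2 entropy. The category counts 3, 4, and 2 come from the text. The product-name counts (one name occurring twice and seven names occurring once) are an assumption: the text states only that there are 9 records and that "tea" occurs twice, and this distribution is chosen because it reproduces H(Y) = 2.95. The function name `entropy_from_counts` is hypothetical.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as a list of record counts."""
    total = sum(counts)
    return sum((n / total) * log2(total / n) for n in counts if n > 0)


class_counts = [3, 4, 2]       # stationery, food, kitchen (from the text)
name_counts = [2] + [1] * 7    # assumed: "tea" twice, seven other names once

h_x = entropy_from_counts(class_counts)
h_y = entropy_from_counts(name_counts)

print(round(h_x, 2))   # 1.53
print(round(h_y, 2))   # 2.95
print(0 < h_x < h_y)   # True: step 304 yields Yes
```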

Next, step 305 of FIG. 3 is described in detail with reference to FIG. 5. As in FIG. 4, parent column information and child column information from which records containing Null have been removed are output (step 401), Z is initialized (step 501), and the unique records in each column are extracted and stored in X and Y (step 502).

Next, in steps 503 to 505, the conditional information amount 506 of the parent column given the child column is calculated over the records x and y in X and Y. The conditional information amount in step 503 is calculated by equation (4) below.

$$H(X \mid Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(y)} \tag{4}$$

where
x: each unique record of the parent column
y: each unique record of the child column
X: the set of records x
Y: the set of records y
p(y): the probability that a record of the child column is y
p(x, y): the probability that a record of the parent column is x and the record of the child column is y
H(X|Y): the information amount remaining in the parent column's record once the child column's record is determined.

The above calculation is explained by continuing with the parent column information 701 and the child column information 702 of FIG. 10. The symbols have the following meanings.
x: each unique record of the "product classification" column
y: each unique record of the "product name" column
X: the set of records x
Y: the set of records y
p(y): the probability that a record of the "product name" column is y
p(x, y): the probability that a record of the "product classification" column is x and the record of the "product name" column is y
H(X|Y): the information amount remaining in the "product classification" column's record once the "product name" column's record is determined.

First, let x = "stationery" and y = "pen". The "product name" column has 9 records, 1 record has "pen" in the "product name" column, and 1 record has "pen" in the "product name" column and "stationery" in the "product classification" column, so p(y) = 1/9 and p(x, y) = 1/9. Similarly, for x = "food" and y = "tea", p(y) = 2/9 and p(x, y) = 2/9.

In this way, using p(y) and p(x, y) for all records, the conditional information amount of X = the "product classification" column given Y = the "product name" column is calculated by equation (5) as H(X|Y) = 0. Every pair (x, y) that occurs satisfies p(x, y) = p(y), so every term of the sum vanishes.

$$H(X \mid Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(y)} = 0 \tag{5}$$

In step 306, the result is Yes when H(X|Y) < α, where α is a threshold in the range of about 0 ≤ α < 0.1. The condition is satisfied in this example, so the result is Yes; in other words, the "product classification" column is determined to be included in the "product name" column. For comparison, equation (6) shows the result of swapping X and Y and calculating the conditional information amount H(Y|X) of the "product name" column given the "product classification" column.

$$H(Y \mid X) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)} = 1.42 \tag{6}$$

From equation (6), H(Y|X) = 1.42. Since H(Y|X) > α, the inclusion determination of step 306 yields No for this value. In other words, the "product name" column is determined not to be included in the "product classification" column. That is, the "product classification" column is a parent column of the "product name" column, but the converse does not hold. Note that the above calculation is given for explanation only, and the calculation of equation (6) is not essential to carrying out the present invention.
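The asymmetry between H(X|Y) = 0 and H(Y|X) = 1.42 can be reproduced with the sketch below, assuming base-2 logarithms. The nine-row table is partly hypothetical: "pen", "pencil", "eraser", and two "tea" rows follow the text, while "bread", "rice ball", "mug", and "ladle" are invented placeholders chosen only so that the category counts are 3, 4, and 2. The function name `conditional_entropy` is likewise an assumption.

```python
from collections import Counter
from math import log2


def conditional_entropy(target, given):
    """Return H(target | given) in bits for two paired columns."""
    total = len(target)
    joint = Counter(zip(target, given))   # counts of each (x, y) pair
    given_counts = Counter(given)         # counts of each y
    # -p(x,y) * log2(p(x,y)/p(y)) rewritten as p(x,y) * log2(n_y / n_xy)
    return sum((n_xy / total) * log2(given_counts[y] / n_xy)
               for (_, y), n_xy in joint.items())


rows = [
    ("stationery", "pen"), ("stationery", "pencil"), ("stationery", "eraser"),
    ("food", "tea"), ("food", "tea"),
    ("food", "bread"), ("food", "rice ball"),   # hypothetical names
    ("kitchen", "mug"), ("kitchen", "ladle"),   # hypothetical names
]
product_class = [c for c, _ in rows]
product_name = [n for _, n in rows]

h_x_given_y = conditional_entropy(product_class, product_name)
h_y_given_x = conditional_entropy(product_name, product_class)

print(h_x_given_y)            # 0.0
print(round(h_y_given_x, 2))  # 1.42
```

Knowing the product name leaves no uncertainty about the classification, whereas knowing the classification still leaves several possible names.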

Thus, step 306 determines whether, once the record of one column is fixed, the record of the other column tends to be uniquely determined. Using the input table 110 as an example, when "product name" is "pen", "product classification" is always "stationery", and when "product name" is "tea", "product classification" is always "food"; the "product classification" column is therefore a parent column of the "product name" column. On the other hand, when "product classification" is "stationery", "product name" may be "pen", "pencil", or "eraser" and is not uniquely determined, so "product name" is not a parent of "product classification". As described above, a column of the parent column information 302 and a column of the child column information 203 for which both the information amount determination of step 304 and the inclusion determination of step 306 yield Yes are columns in an inclusion relationship.
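The combined decision of steps 304 and 306 can be sketched as a single predicate over precomputed information amounts. The function name `is_parent` and its parameter names are hypothetical; the default threshold 0.1 is the example value of α given in this description, and the numbers passed in are those of the worked example.

```python
def is_parent(h_parent, h_child, h_parent_given_child, alpha=0.1):
    """Return True when the candidate parent column passes steps 304 and 306."""
    # Step 304: the parent must carry some information, but less than the child.
    if not (0 < h_parent < h_child):
        return False
    # Step 306: almost no uncertainty may remain once the child value is known.
    return h_parent_given_child < alpha


# "product classification" as parent of "product name": H(X)=1.53, H(Y)=2.95, H(X|Y)=0
print(is_parent(1.53, 2.95, 0.0))    # True
# Roles swapped: H(Y)=2.95 is not smaller than H(X)=1.53, and H(Y|X)=1.42
print(is_parent(2.95, 1.53, 1.42))   # False
```

In the swapped case the candidate is already rejected at step 304, so the conditional information amount need not be computed at all, which is consistent with the note that equation (6) is illustrative only.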

This completes the extraction of inclusion relationships when the child column is fixed to "product classification". Next, the flow in which every column is treated as a child column is described with reference to FIG. 6.

First, in step 202, an arbitrary column of the input table 110 is selected as the child column, and child column information 203 is output. Next, in step 204, the series of flows described with FIGS. 3 to 5 is executed to extract the inclusion relationship information between columns and update the inclusion relationship information table 109. In step 205, the processing of step 204 is repeated for all columns.

Once step 205 confirms that the processing of step 204 has been executed for all columns, the proximity calculation of step 206 is executed as necessary and the inclusion relationship information table 109 is updated. Inclusion relationships are thus calculated automatically for all columns of the input table 110. Here, the proximity 603 is a value indicating how close the inclusion relationship is between the columns stored in column name 601 and parent column name 602: among records with the same column name 601, the columns stored in parent column name 602 are numbered in descending order of their number of unique records. A specific example of the proximity 603 is described with FIG. 17.
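The proximity numbering of step 206 can be sketched as follows. This is an illustration under assumptions: the column names ("entry minute", "entry hour", "six-hour block"), their unique-record counts, the function name `assign_proximity`, and the choice to start numbering at 1 are all hypothetical and are not taken from FIG. 17.

```python
def assign_proximity(parents_by_child, unique_counts):
    """Number each child's parents in descending order of unique-record count."""
    rows = []
    for child, parents in parents_by_child.items():
        ranked = sorted(parents, key=lambda p: unique_counts[p], reverse=True)
        for rank, parent in enumerate(ranked, start=1):
            rows.append((child, parent, rank))   # (column name, parent, proximity)
    return rows


# Hypothetical result of the inclusion extraction for one child column.
parents_by_child = {"entry minute": ["six-hour block", "entry hour"]}
unique_counts = {"entry hour": 24, "six-hour block": 4}

table_rows = assign_proximity(parents_by_child, unique_counts)
print(table_rows)
# [('entry minute', 'entry hour', 1), ('entry minute', 'six-hour block', 2)]
```

The parent with the most unique records is the one whose granularity is closest to the child, so it receives the smallest number.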

The tables in this embodiment may be regarded as the same concept as tables in a general database. However, the present invention is not limited to tables in a database: the data may be held in a memory area of a program, or may be in any other form, such as a text file or a CSV file.
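As one possible illustration of a non-database source, the sketch below reads CSV text into per-column lists using Python's standard `csv` module. The function name `read_columns`, the convention of mapping empty fields to `None`, and the sample data are assumptions made for the example.

```python
import csv
import io


def read_columns(text):
    """Parse CSV text into a dict mapping each column name to its list of values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = row[name]
            # Treat an empty field as Null so it can be dropped pairwise later.
            columns[name].append(value if value != "" else None)
    return columns


sample = (
    "product name,product classification\n"
    "pen,stationery\n"
    "slippers,\n"
    "tea,food\n"
)
table = read_columns(sample)
print(table["product name"])            # ['pen', 'slippers', 'tea']
print(table["product classification"])  # ['stationery', None, 'food']
```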

FIG. 8 is a specific example of the inclusion relationship information table 109. Column name 601 stores the name of the child column of the input table 110 of FIG. 7 for which the inclusion relationship was calculated. Parent column name 602 stores the name of the parent column when a parent column exists for the column of column name 601. For example, in step 307 of FIG. 3, "product classification", the column of the parent column information 302, and "product name", the column of the child column information 203, are registered in record 604.

FIG. 9 is a specific example of the parent flag table 309. The parent flag table 309 is a square matrix with the same elements (column names) along its rows and columns. The hatched cells of the matrix are not used. The initial state of the parent flag table 309 is an empty table. When, in step 307, the "product name" column and the "product classification" column of the input table 110 are registered as having an inclusion relationship, step 308 stores flag 1 in the cell of the parent flag table 309 corresponding to "product name" and "product classification". Flags are stored in the same way in the cells corresponding to every other combination of columns determined to have an inclusion relationship.
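The parent flag table and its use for skipping known pairs in step 301 can be sketched as follows. The nested-dictionary layout, the (child, parent) orientation of the cells, and the function names `set_flag` and `parent_candidates` are assumptions made for illustration; FIG. 9 may lay out the matrix differently.

```python
columns = ["customer ID", "age", "product name", "product classification"]

# Square matrix indexed as flags[child][parent]; 0 means "not yet known".
flags = {child: {parent: 0 for parent in columns} for child in columns}


def set_flag(flags, child, parent):
    """Record that `parent` is already known to include `child` (step 308)."""
    flags[child][parent] = 1


def parent_candidates(flags, columns, child):
    """List the columns still worth testing as parents of `child` (step 301)."""
    return [c for c in columns if c != child and flags[child][c] == 0]


set_flag(flags, "product name", "product classification")
candidates = parent_candidates(flags, columns, "product name")
print(candidates)  # ['customer ID', 'age']
```

Skipping flagged pairs avoids recomputing the information amounts for relationships that have already been established.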

As described above, the information processing method according to this embodiment takes as input the table 110, which includes a first column (parent column) and a second column (child column), and comprises: a step 303 of obtaining a first information amount H(X), which is the information amount of the first column, and a second information amount H(Y), which is the information amount of the second column; a step 304 of comparing the first information amount with the second information amount; a step 305 of obtaining a first conditional information amount H(X|Y), which is the information amount of the first column with respect to the second column; and a step 306 of determining the inclusion relation between the first column and the second column based on the first conditional information amount. The information processing system according to this embodiment comprises: a storage unit 102 that stores an input table including a first column and a second column; an information amount calculation processing unit 105 that calculates a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column, and compares the magnitudes of the two; and an inclusion relation calculation processing unit 106 that calculates a first conditional information amount, which is the information amount of the first column with respect to the second column, and determines the inclusion relation between the first column and the second column based on the first conditional information amount.
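The sequence of steps 303 to 306 can be illustrated with a minimal Python sketch. It assumes that the "information amount" is Shannon entropy in bits and that H(X|Y) is computed as H(X,Y) - H(Y); the function names, the default threshold, and the sample data are hypothetical and are not taken from the embodiment. The guard 0 < H(X) < H(Y) and the test 0 <= H(X|Y) < threshold follow the conditions stated in the claims.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a column given as a list of records."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y) = H(X,Y) - H(Y), clamped at 0 to absorb rounding error."""
    return max(0.0, entropy(list(zip(x, y))) - entropy(y))


def is_parent(parent, child, threshold=1e-9):
    """Return True if `parent` is judged to include `child`.

    threshold is an assumed value; the source only says "predetermined".
    """
    h_x, h_y = entropy(parent), entropy(child)   # step 303
    if not (0 < h_x < h_y):                      # step 304
        return False
    h_x_given_y = conditional_entropy(parent, child)  # step 305
    return h_x_given_y < threshold               # step 306


# Hypothetical sample columns
product_name = ["apple", "banana", "carrot", "apple"]
product_class = ["fruit", "fruit", "vegetable", "fruit"]
print(is_parent(product_class, product_name))
```

Because every product name maps to exactly one classification, the conditional information amount of the classification given the name is zero, which is why no thesaurus or column-name convention is needed for the determination.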

With these features, the information processing method and the information processing system according to this embodiment make it easier to extract inclusion relations between columns from data in table format. In particular, because the method extracts inclusion relations based on the information amount of the data, it requires no thesaurus dictionary or similar resource, and it can be carried out without difficulty even when the creator of the table has given the columns arbitrary names.

FIG. 11 is a flowchart of the information processing system 100 in the second embodiment and corresponds to FIG. 6 of the first embodiment. The difference from the first embodiment is that the operation changes according to the type of the child column given by the child column information 203, and that a function for automatically generating a parent column is added. As a result, a parent column can be generated even when the input table 110 contains no parent column for the child column.

In FIG. 11, the user supplies the all-column information table 801 as input in addition to the input table 110. The all-column information table 801 stores information on every column of the input table 110; a specific example is shown in FIG. 12. Step 202 selects one column from the input table 110 and outputs the child column information 203. Taking the child column information 203 and the all-column information table 801 as input, step 802 classifies the type of the column of the child column information 203 as one of three types: "time type", "numeric type", or "character string type". Depending on the result, step 805, step 803, or step 204 is executed, respectively. In step 802, the type is determined from the type name of the column of the child column information 203 as stored in the all-column information table 801. The type classification used for this determination is shown in FIG. 13.

When step 802 determines the time type, step 805 is executed and the time table 807 is output. The detailed flow of the time table generation processing unit 805 is shown in FIG. 14, and a specific example of the time table 807 is shown in FIG. 16.

When step 802 determines the numeric type, step 803 is executed and the frequency distribution of the records in the column of the child column information 203 is calculated. Using that result, step 804 determines whether the column is to be handled as the numeric type or as the character string type. Specifically, the step decides whether the elements of the column are numbers used as IDs or numbers used as values; for example, the result is No when the frequency distribution follows a uniform distribution and Yes when it follows a normal distribution or some other distribution. When step 804 yields Yes, step 806 is executed and the automatically generated table 808 is output. Details of the automatically generated table generation processing unit 806 are shown in FIG. 15, and a specific example of the automatically generated table is shown in FIG. 16.
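The decision in steps 803 and 804 might be sketched as follows. The source does not say how uniformity is tested, so the check used here (every distinct value occurs about equally often, within a relative tolerance) is only one possible assumption; the function name `treat_as_numeric` and the `tolerance` parameter are hypothetical.

```python
from collections import Counter


def treat_as_numeric(values, tolerance=0.1):
    """Return True (Yes in step 804) if the column looks like measured values,
    False (No) if its frequency distribution looks uniform, as IDs would.

    Assumed uniformity test: each value's count lies within
    tolerance * mean of the mean count.
    """
    counts = list(Counter(values).values())      # step 803: frequency distribution
    mean = sum(counts) / len(counts)
    uniform = all(abs(c - mean) <= tolerance * mean for c in counts)
    return not uniform                            # step 804


# Hypothetical sample columns
customer_ids = [1001, 1002, 1003, 1004]
temperatures = [20, 21, 21, 22, 22, 22, 23, 23, 24]
print(treat_as_numeric(customer_ids), treat_as_numeric(temperatures))
```

A statistical goodness-of-fit test could be substituted for the tolerance check; the sketch only shows where the Yes/No branch of step 804 comes from.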

When step 802 determines the character string type, step 204 is executed. The details of step 204 are the same as those described with reference to FIGS. 3 to 5.

After that, step 809 joins the tables, taking as input the input table 110 and either the time table 807 or the automatically generated table 808. When the input table 110 is a table in a general relational database, the join is performed with an inner join query; for other formats, such as text files, an equivalent operation is performed. The key column for the join is the column of the child column information table. In step 810, the inclusion relation information is stored in the inclusion relation information table 812. A specific example of the inclusion relation information table 812 is shown in FIG. 17.
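For data that is not held in a relational database, the "equivalent operation" of step 809 could look like the following minimal sketch of an inner join over rows represented as dictionaries. The row representation, the function name `inner_join`, and the English column names are assumptions made for illustration.

```python
def inner_join(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on the column `key`."""
    # Index the right-hand table by key value.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    # Keep only left rows whose key value appears on the right.
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical input table and automatically generated table
input_table = [
    {"temperature": 18, "sales": 3},
    {"temperature": 25, "sales": 7},
]
generated_table = [
    {"temperature": 18, "ReCalc_temperature": "low"},
    {"temperature": 25, "ReCalc_temperature": "high"},
]
output_table = inner_join(input_table, generated_table, key="temperature")
print(output_table)
```

Here the child column ("temperature") serves as the join key, so each input row gains the generated parent column.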

When step 205 yields Yes, the output table 108 and the inclusion relation information table 812 are output, and step 206 updates the inclusion relation information table 812 as necessary. A specific example of the output table 108 is shown in FIG. 18. In this example, the type of the selected child column is determined first. Alternatively, step 204 may be executed first, with the type determination and step 805 or step 806 performed only when no parent column is found.

FIG. 12 shows a specific example of the all-column information table 801. The all-column information table 801 stores a column name, a type name, and a threshold for each column. The column name is the name of a column of the input table 110. The type name is a type name as used in a general relational database, such as int, float, double, decimal, bit, boolean, char, string, time, date, or datetime. The threshold is a numerical value supplied by the user or by the system.

FIG. 13 shows the type classification table 902. Step 802 makes its determination according to the type classification table 902. Types other than those listed here are handled in the same way, by defining a type classification for them.
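The lookup performed in step 802 can be sketched as a simple mapping. The actual contents of the type classification table 902 appear only in FIG. 13 and are not reproduced in the text, so the mapping below is a guess limited to type names whose category seems unambiguous; bit and boolean are deliberately left out. The names `TYPE_CLASS`, `NEXT_STEP`, and `next_step_for` are hypothetical.

```python
# Assumed classification of relational type names (not taken from FIG. 13).
TYPE_CLASS = {
    "time": "time", "date": "time", "datetime": "time",
    "int": "numeric", "float": "numeric", "double": "numeric", "decimal": "numeric",
    "char": "string", "string": "string",
}

# Step executed for each category, as described for FIG. 11.
NEXT_STEP = {"time": 805, "numeric": 803, "string": 204}


def next_step_for(type_name):
    """Return the step number that follows step 802 for a given type name.

    Raises KeyError for type names missing from the assumed mapping.
    """
    category = TYPE_CLASS[type_name.lower()]
    return NEXT_STEP[category]


print(next_step_for("datetime"), next_step_for("int"), next_step_for("string"))
```

Adding support for a further type amounts to adding one entry to the mapping, which matches the statement that other types are handled by defining a classification for them.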

FIG. 14 shows a detailed flowchart of step 805. This process generates, for a child column of the time type, columns at a variety of granularities. Taking the child column information 203 as input, step 1001 obtains the start and end date-times of the records in the column by finding their minimum and maximum values. Step 1002 then generates the time table 807 over the range from the start date-time to the end date-time. A specific example of the time table 807 is shown in FIG. 16.
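Steps 1001 and 1002 might be sketched as follows. The source does not state the base resolution at which the range is enumerated or how times are rounded, so the one-minute base step and the rounding down from midnight are assumptions, as are the function names and the generated column names.

```python
from datetime import datetime, timedelta


def floor_time(t, step_minutes):
    """Round t down to a multiple of step_minutes counted from midnight."""
    minutes = t.hour * 60 + t.minute
    floored = minutes - minutes % step_minutes
    return t.replace(hour=floored // 60, minute=floored % 60,
                     second=0, microsecond=0)


def build_time_table(times, steps=(10, 60, 360), base=timedelta(minutes=1)):
    """Build one row per base interval between the earliest and latest time."""
    start, end = min(times), max(times)           # step 1001
    table = []
    t = start.replace(second=0, microsecond=0)
    while t <= end:                               # step 1002
        row = {"time": t}
        for step in steps:
            row[f"every_{step}_min"] = floor_time(t, step)
        table.append(row)
        t += base
    return table


# Hypothetical store-entry times
entries = [datetime(2014, 7, 4, 9, 7), datetime(2014, 7, 4, 9, 12)]
for row in build_time_table(entries):
    print(row)
```

Each generated column is a coarser version of the "time" column, so it is a candidate parent column for the original time-type child column.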

FIG. 15 shows a detailed flowchart of step 806. This process generates, for the child column, columns at a variety of granularities. Taking the child column information 203 as input, step 1101 obtains the maximum and minimum values of the records in the column, together with other information such as the frequency distribution of the records. Step 1102 then generates the automatically generated table 808 from the child column information obtained in step 1101. Step 1102 can generate a table whose values divide the range between the minimum and maximum by the threshold obtained from the all-column information table 801, or a table based on the frequency distribution of the child column. A specific example of the automatically generated table 808 is shown in FIG. 16.

FIG. 16 shows specific examples of the time table 807 and the automatically generated table 808. The time table 807 is generated from the start and end date-times obtained in step 1001. The generated time table consists of time columns with various step sizes, such as a column whose records are times in 10-minute steps and a column whose records are times in 1-hour steps. Similarly, the automatically generated table 808 is generated from the maximum and minimum values obtained in step 1101. Here, the "ReCalc_temperature" column 1506 stores values obtained by dividing the range between the minimum and maximum of the "temperature" column of the input table 110 into two, according to the value "2" in the "threshold" column that corresponds to the "temperature" entry under column name in the all-column information table 801. Besides dividing the range between the minimum and maximum by the threshold, it is also possible to generate a column divided by the threshold so that the frequencies are equal, based on the frequency distribution information extracted by the frequency distribution information extraction 1102.
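The equal-width division described for the "ReCalc_temperature" column can be illustrated with a short sketch. The source does not say what value is stored for each division, so returning a zero-based division index is an assumption, as are the function name `equal_width_bins` and the sample temperatures.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of the equal-width division it falls in.

    The range [min, max] is split into n_bins divisions of equal width;
    the maximum value is placed in the last division.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All records are identical, so there is only one division.
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical "temperature" column divided by a threshold of 2
temperature = [10, 14, 20, 26, 30]
recalc_temperature = equal_width_bins(temperature, n_bins=2)
print(recalc_temperature)
```

An equal-frequency variant would instead choose the division boundaries from the sorted records so that each division holds roughly the same number of records.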

FIG. 17 is a specific example of the inclusion relation information table 812. The column name 601 stores the names of all columns of the input table 110. The parent column name 602 stores the name of the parent column when a parent column exists for the column in column name 601. The proximity 603 expresses how close the inclusion relation is between the columns stored in column name 601 and parent column name 602. Among the parent columns stored in parent column name 602 for the same column name 601, numbers are assigned in descending order of the number of unique records in each column. In record 1201, "ReCalc_temperature", generated by the automatically generated table generation processing unit 806, is registered as the parent column of "temperature". In records 1208 and 1209, the inter-column inclusion relation information extraction processing unit 204 has extracted and registered "age" as the parent column of "customer ID" and "product classification" as the parent column of "product name". Because each of these child columns has only one parent column, 1 is stored as the proximity.

In records 1202, 1203, and 1204, by contrast, the time table 807 has been generated, so "time every 10 minutes", "time every hour", and "time every 6 hours" are each registered as parent columns. Because "store entry time" has three parent columns, the proximity 603 is determined in descending order of the number of unique records in each parent column. Here the number of unique records decreases in the order "time every 10 minutes", "time every hour", "time every 6 hours", so the ranks are stored as the proximity in that order. In addition, "time every hour" and "time every 6 hours" are parent columns of "time every 10 minutes", and "time every 6 hours" is a parent column of "time every hour"; these relations are registered in records 1205, 1206, and 1207, respectively. When no parent column exists, as in record 1210, nothing is stored in the parent column name or the proximity. The inclusion relation information table 812 is obtained in this way.
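The ranking stored in the proximity (直近度) 603 can be expressed in a few lines. This sketch assumes the table is held as a dictionary from column name to a list of records; the function name `proximity_ranks`, the English column names, and the sample records are hypothetical.

```python
def proximity_ranks(table, parent_names):
    """Rank the parent columns of one child column.

    Rank 1 goes to the parent with the most unique records, that is,
    the parent whose granularity is closest to the child's.
    """
    ordered = sorted(parent_names,
                     key=lambda name: len(set(table[name])),
                     reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical parent columns generated for a store-entry-time column
table = {
    "time_every_10_min": ["9:00", "9:10", "10:00", "15:30"],
    "time_every_hour": ["9:00", "9:00", "10:00", "15:00"],
    "time_every_6_hours": ["6:00", "6:00", "6:00", "12:00"],
}
print(proximity_ranks(table, list(table)))
```

With these sample records the unique counts are 4, 3, and 2, so the 10-minute column receives rank 1, matching the ordering described for records 1202 to 1204.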

FIG. 18 is a specific example of the output table 108. In step 809, the time table 807 generated in step 805 and the automatically generated table 808 generated in step 806 are joined to the input table 110, and the output table 108 is output. In the output table 108, step 805 has generated and joined the "time every 10 minutes" column 1302, the "time every hour" column 1303, and the "time every 6 hours" column 1304 as parent columns of the "store entry time" column 1301. Likewise, the automatically generated table generation processing unit 806 has generated and joined the "ReCalc_temperature" column 1306 as the parent column of the "temperature" column 1305. The inclusion relation information between columns, including the artificially generated columns, is stored in the inclusion relation information table 812. By using the output table 108 and the column inclusion relation information table 812 output by the invention according to this embodiment, inclusion relations can thus be extracted for columns of every type in the input table.

100: information processing system, 101: central processing unit, 102: storage device, 103: input device, 104: output device, 105: information amount calculation processing unit, 106: inclusion relation calculation processing unit, 107: input table, 108: output table, 109: inclusion relation information table, 110: input table, 202, 204, 205, 206, 301, 303, 304, 305, 306, 307, 308, 401, 402, 403, 404, 405, 406, 501, 502, 503, 504, 505, 802, 803, 804, 805, 806, 809, 810, 1001, 1002, 1101, 1102: steps, 203, 702: child column information, 302, 701: parent column information, 309: parent flag table, 407: information amount, 506: conditional information amount, 601: column name, 602: parent column name, 603: proximity, 604, 1201 to 1210: records, 801: all-column information table, 807: time table, 808: automatically generated table, 812: inclusion relation information table, 902: type classification table, 1301 to 1306: columns.

Claims

An information processing method comprising, for an input table that includes a first column and a second column:
a first step of calculating a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column;
a second step of comparing the magnitudes of the first information amount and the second information amount;
a third step of calculating a first conditional information amount, which is the information amount of the first column with respect to the second column; and
a fourth step of determining an inclusion relation between the first column and the second column based on the first conditional information amount.

The information processing method according to claim 1,
wherein, in the second step, the third step is executed when the first information amount is larger than 0 and smaller than the second information amount.

The information processing method according to claim 1,
wherein, in the fourth step, the first column is determined to include the second column when the first conditional information amount is 0 or more and less than a predetermined threshold value.

The information processing method according to claim 1,
further comprising a fifth step of determining whether the type of the records included in the second column is a time type, a numeric type, or a character string type.

The information processing method according to claim 4,
further comprising a sixth step of generating, when the fifth step determines that the type of the second column is the time type, a plurality of columns in which the step size of the records included in the second column is varied.

The information processing method according to claim 4,
further comprising a seventh step of determining, when the fifth step determines that the type of the second column is the numeric type, whether the records included in the second column are to be handled as the numeric type or as the character string type.

The information processing method according to claim 6,
further comprising an eighth step of generating, when the seventh step determines that the second column is to be handled as the numeric type, a plurality of columns obtained by dividing the range of the records of the second column by a predetermined threshold.

The information processing method according to claim 3,
wherein the input table further includes a third column determined to include the second column,
the method further comprising a ninth step of storing, in an inclusion relation table that stores the inclusion relations of the first, second, and third columns, a proximity that indicates the descending order of the number of unique records included in the first and third columns.

The information processing method according to claim 1,
further comprising a tenth step of deleting, when the first column or the second column includes a null record, that record.

An information processing system comprising:
a storage unit that stores an input table including a first column and a second column;
an information amount calculation processing unit that calculates a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column, and compares the magnitudes of the first information amount and the second information amount; and
an inclusion relation calculation processing unit that calculates a first conditional information amount, which is the information amount of the first column with respect to the second column, and determines an inclusion relation between the first column and the second column based on the first conditional information amount.