WO2018181988A1

WO2018181988A1 - Method for identifying interdependence

Info

Publication number: WO2018181988A1
Application number: PCT/JP2018/013877
Authority: WO
Inventors: 努森; 河村　隆
Original assignee: 公立大学法人福島県立医科大学
Priority date: 2017-03-31
Filing date: 2018-03-30
Publication date: 2018-10-04
Also published as: JP6820621B2; JPWO2018181988A1

Abstract

[Problem] To specify efficiently, accurately and with statistical significance an interdependence between a plurality of phenomena indicated in data, even if various types of data acquired independently on the basis of different criteria are used comprehensively and uniformly over a large range. [Solution] This method for identifying interdependence between a first phenomenon and a second phenomenon is characterized by including a step of: acquiring a data set including binary data relating to the first phenomenon and binary data relating to the second phenomenon, from data including information of the first phenomenon and information of the second phenomenon for N samples; aggregating the number of samples in a 2x2 contingency table, from the acquired data set; calculating a Fisher exact probability P on the basis of the aggregated 2x2 contingency table; and calculating -log10 P/N on the basis of the calculated Fisher exact probability P and N.

Description

How to identify interdependencies

The present invention relates to an information processing method for large-scale data, a computer program for executing the method, and a recording medium storing the program. Specifically, the present invention relates to a method for specifying interdependency between two events, a computer program for executing the method, and a recording medium storing the program.

With the recent development of computer technology, data is collected by various means, and large amounts of data of different types are being accumulated. Such large-scale data is expected to contain useful information; if it is analyzed effectively, the relationships between multiple events contained in the data can be identified with statistical significance, and through that identification the characteristics of unknown events are expected to be determined accurately. However, such large-scale data is often obtained independently under a variety of different conditions, and the accuracy of the analysis results may be reduced by noise accompanying the data. It is not easy to analyze such data efficiently while using it comprehensively and uniformly over a wide range.

Mutual information between a plurality of events is used as a quantity that measures the interdependency between the events. By calculating the mutual information between a plurality of events, it is expected that the interdependency between them can be specified, and thereby that the characteristics of the events can be specified. The mutual information of X and Y is defined by the equation described later. As that definition shows, mutual information has conventionally not taken into account the number N of samples from which it is calculated, and thus has not taken statistical significance into account. Further, as the definition also shows, it has not been considered that mutual information could be calculated from a combination of data obtained under different conditions. Techniques for analyzing large amounts of data using mutual information are used to process various kinds of information, such as documents, sounds, images, positions, life, astronomy, finance, and sales. As an algorithm for analyzing life information data, ARACNE, for example, is known (Non-Patent Document 1).

Fisher's exact test, meanwhile, is a statistical test used to analyze data classified into two categories, mainly when the number of samples is small, and has been used in various kinds of statistical processing (Non-Patent Documents 2 to 3). No relationship between Fisher's exact probability and mutual information has been known so far.

An object of the present invention is to specify, efficiently and with statistical significance, the interdependencies of a plurality of events shown in data, even when various data acquired independently under different conditions are used comprehensively and uniformly over a wide range.

As a result of diligent study, the present inventors have found that mutual information can be approximately calculated as −log₁₀ P / (N log₁₀ 2), using Fisher's exact probability P calculated from a 2 × 2 contingency table and the number of samples N used to create the contingency table. That is, the present inventors found that by acquiring a data set including binary data from data covering N samples, creating a 2 × 2 contingency table from the data set, calculating Fisher's exact probability P from the table, and calculating −log₁₀ P / (N log₁₀ 2) using P and N, the mutual information between events can be calculated and the interdependency between the events can be specified. Fisher's exact probability P is a concept that has been studied in probability theory, whereas mutual information is a concept that has been studied mainly in information theory; the discovery of a relationship between them is groundbreaking. Here, since log₁₀ 2 is a constant, the interdependency between the events can also be specified by calculating −log₁₀ P / N. In this specification, the calculation of −log₁₀ P / N is, in a broad sense, taken to include the calculation of −log₁₀ P / (N log₁₀ 2).

That is, in a first aspect, the present invention provides a method for specifying interdependency between a first event and a second event, the method including a step of calculating −log₁₀ P / N based on Fisher's exact probability P and N, where P is calculated from a 2 × 2 contingency table in which the number of samples is aggregated from a data set including binary data for the first event and binary data for the second event, the data set being obtained from data including information on the first event and information on the second event for N samples.

Because Fisher's exact probability P is a statistic, it can be subjected to meta-analysis by a conventionally known method. Through meta-analysis, a plurality of Fisher's exact probabilities P calculated from data obtained under different conditions, such as data on different types of samples, can be integrated to calculate Fisher's exact probability P for the data as a whole. Therefore, in the above method, by calculating Fisher's exact probability P for each set of data acquired under different conditions, integrating the calculated probabilities, and using the integrated Fisher's exact probability, the interdependency between events can be specified based on the whole of the data acquired under different conditions. This makes it possible, as was not possible in the past, to calculate the mutual information between events from the whole of data acquired under different conditions and to specify the interdependency between the events. A plurality of Fisher's exact probabilities P can be integrated by meta-analysis even if the criteria used to aggregate the respective 2 × 2 contingency tables differ; the criteria for aggregating the 2 × 2 contingency tables used to calculate the exact probabilities may therefore be different.

Accordingly, in a second aspect, the present invention provides the method according to the first aspect, wherein Fisher's exact probability P is calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including: (1) Fisher's exact probability P₁, calculated from a 2 × 2 contingency table in which the number of samples is aggregated from a data set including binary data for the first event and binary data for the second event, the data set being obtained, based on a first criterion for the first event and a first criterion for the second event, from data including information on the first event and information on the second event for N₁ samples; and (2) Fisher's exact probability P₂, calculated from a 2 × 2 contingency table in which the number of samples is aggregated from a data set including binary data for the first event and binary data for the second event, the data set being obtained, based on a second criterion for the first event and a second criterion for the second event, from data including information on the first event and information on the second event for N₂ samples.

In the third aspect, the present invention provides a computer program for executing the method described in the first aspect or the second aspect.

In the fourth aspect, the present invention provides a recording medium storing the computer program according to the third aspect.

According to the present invention, unlike the conventional method, which does not consider the number of samples N and lacks consideration of statistical significance, the mutual information between events can be calculated as a value that reflects Fisher's exact probability and the number of samples, and thus takes statistical significance into account. Further, according to the present invention, since the mutual information between events is calculated using meta-analysis, the interdependency between events can be specified by calculating the mutual information between them even from data on different types of samples acquired under different conditions. It is therefore possible to specify interdependencies between events more accurately and with statistical significance based on a large amount of data, while reducing bias due to the characteristics of the various samples included in the data as a whole. Furthermore, in the present invention, if a significance level is applied to each calculated Fisher's exact probability and, according to the result, values that are not significant are discarded and not used in subsequent calculations, the noise associated with the various data can be reduced and the amount of non-significant data can be greatly reduced. This lightens the computational burden and allows interdependencies between events to be specified more accurately and with statistical significance.

A graph showing the relationship between Fisher's exact probability p and mutual information MI.

A graph in which the mutual information with EGFR calculated for each gene is arranged from left to right in order of value. The vertical axis indicates the mutual information with EGFR calculated for each gene, multiplied by N log₁₀ 2.

A graph in which the mutual information with RB1 calculated for each gene is arranged from left to right in descending order of value. The vertical axis indicates the mutual information with RB1 calculated for each gene, multiplied by N log₁₀ 2.

A graph in which the mutual information with IFNG calculated for each gene is arranged from left to right in order of value. The vertical axis indicates the mutual information with IFNG calculated for each gene, multiplied by N log₁₀ 2.

A graph in which the mutual information with GRM1 calculated for each gene is arranged from left to right in order of value. The vertical axis indicates the mutual information with GRM1 calculated for each gene, multiplied by N log₁₀ 2.

The present invention provides a method for specifying the interdependency between a first event and a second event. Here, an example of an event is a state grasped as an observation result about an object. Examples of objects include genes and words. Other examples of objects include documents, sound, images, location, life, astronomy, finance, and sales. An example of a state is one that differs from the average property of the object. Examples of events include genetic changes, epigenetic changes, and rises or falls in stock prices. Other examples of events include multiple words being used in the same sentence, and sales including sales of a specific product.

Examples of gene changes include gene sequence mutations, changes in gene expression products, and changes in gene modification. Examples of gene sequence mutations include mutations in the base sequence of a gene, changes in gene copy number on the chromosome, and gene modification changes. Examples of mutations in the base sequence of a gene include point mutations, addition of a base sequence to a gene, and deletion of a base sequence from a gene. Examples of gene expression products include proteins, mRNA, and miRNA (micro-RNA). Examples of changes in gene expression products include changes in the expression level of a gene expression product, changes in its expression location, formation of a complex of gene expression products, and degradation of such a complex. Examples of gene modification include DNA methylation and histone modification. Examples of histone modifications include acetylation, methylation, ubiquitination, phosphorylation, and SUMOylation. Examples of gene modification also include post-translational modification. Examples of post-translational modifications include functional group addition, protein or peptide addition, conversion of amino acid chemistry, and structural conversion. Examples of functional group addition include acylation, acetylation, alkylation, amidation, biotinylation, formylation, gamma-carboxylation, glutamylation, glycosylation, glycylation, heme attachment, hydroxylation, iodination, isoprenylation, lipidation (prenylation, GPI anchor formation, myristoylation, farnesylation, geranylgeranylation, etc.), covalent attachment of nucleotides or their derivatives (ADP-ribosylation, FAD linkage, etc.), redox reactions, polyethylene glycolation, phosphatidylinositol attachment, phosphopantetheinylation, phosphorylation, pyroglutamate formation, racemization, tyrosine sulfation, and selenoylation. Examples of protein or peptide addition include ISGylation, SUMOylation, and ubiquitination. Examples of conversion of amino acid chemistry include citrullination (deimination) and deamidation. Examples of structural conversion include disulfide bond formation and proteolytic cleavage.

Examples of genes include genes of mammals such as humans, monkeys, mice and rats. An example of an epigenetic change is a change that is inherited through cell division and is independent of a change in DNA base sequence.

In the terms "first event" and "second event", "first" and "second" are labels for distinguishing the two events and do not limit their order. Here, the first event and the second event may be the same state for different objects, or may be different states for the same object. For example, the first event may be a mutation in the base sequence of gene A, and the second event may be a mutation in the base sequence of gene B. Further, for example, the first event may be a mutation in the sequence of gene A, and the second event may be a change in the expression level of the expression product of gene A. Here, gene A and gene B indicate different genes.

Examples of events include those represented by presence or absence and those represented by numerical values. Examples of events represented by a numerical value include those represented by a discrete quantity taking more than two values and those represented by a continuous quantity. The first event and the second event may be expressed differently. For example, the first event may be expressed by its presence or absence, and the second event may be expressed by a discrete quantity taking more than two values.

In the present invention, data including information on the first event and information on the second event for N samples is used. Here, the N samples are, for example, N subjects that share a common property and each yield an observation result about the events. Examples of N include 10 or more, 100 or more, 1,000 or more, 10,000 or more, and 100,000 or more. The larger N is, the more accurately the interdependency between the first event and the second event can be specified. Examples of the common property include being derived from a living organism, from a human, from a human with a disease, from a human with cancer, or from a human with a specific type of cancer. Examples of the subject include cells, organs, and other biological samples of living organisms such as humans.

Examples of specific types of cancer include leukemia, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma, multiple myeloma, brain tumor, breast cancer, endometrial cancer, cervical cancer, ovarian cancer, esophageal cancer, stomach cancer, appendiceal cancer, colon cancer, liver cancer, hepatocellular carcinoma, gallbladder cancer, bile duct cancer, pancreatic cancer, adrenal cancer, gastrointestinal stromal tumor, mesothelioma, head and neck cancer, laryngeal cancer, oral cancer, oral floor cancer, gingival cancer, tongue cancer, buccal mucosa cancer, salivary gland cancer, sinus cancer, maxillary sinus cancer, frontal sinus cancer, ethmoid sinus cancer, sphenoid sinus cancer, thyroid cancer, kidney cancer, lung cancer, osteosarcoma, prostate cancer, testicular tumor (testicular cancer), renal cell cancer, bladder cancer, rhabdomyosarcoma, skin cancer, and anal cancer.

Since interactions between genes are amplified in organisms affected by disease, particularly cancer, cells, organs, and other biological samples derived from such organisms are suitable as samples for specifying the interdependency of different genes.

The data used in the present invention includes information on the first event and information on the second event for N samples. In such data, for example, each of the N samples has associated information on the first event and information on the second event. Examples of event information include (1) when the event is represented by its presence or absence, information on whether the event occurred in the sample, and (2) when the event is represented by a numerical value, the numerical value for the sample.

In the present invention, a data set including binary data for the first event and binary data for the second event is acquired from the data including the information on the first event and the information on the second event for N samples. Here, examples of binary data for an event include presence/absence data when the event is represented by presence or absence, and data indicating whether a value is above or below a reference value when the event is represented by a numerical value. For example, (1) when an event is represented by presence or absence, binary data for the event can be obtained by using the event information included in the data as it is; and (2) when an event is represented numerically, binary data can be obtained by setting a reference value, determining whether the event information for a sample is greater than or less than the reference value, taking the result of the determination as the binary data, and repeating this for the N samples. The data set including binary data for the first event and binary data for the second event can be acquired by, for example, (1) performing the above for the information on the first event to acquire binary data for the first event, (2) performing the above for the information on the second event to acquire binary data for the second event, and (3) combining the acquired binary data. The data set acquired in this way may, for example, be in a form that uses a linear index.
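The acquisition step described above can be sketched as follows. This is only an illustrative sketch: the function names, the sample values, the reference value of 5.0, and the choice to treat "greater than the reference value" as the positive state are all assumptions made for the example, not details given in this specification.

```python
from typing import List, Sequence, Tuple


def binarize(values: Sequence[float], reference: float) -> List[bool]:
    """Convert numerical event information to binary data.

    Assumption: a value strictly greater than the reference value is
    treated as True; a value equal to or below it is treated as False.
    """
    return [v > reference for v in values]


def build_dataset(first: Sequence[bool], second: Sequence[bool]) -> List[Tuple[bool, bool]]:
    """Combine binary data for the two events, one pair per sample."""
    if len(first) != len(second):
        raise ValueError("both events must cover the same N samples")
    return list(zip(first, second))


# Hypothetical data for N = 6 samples.
mutation_present = [True, False, True, True, False, False]  # presence/absence, used as is
expression_level = [8.1, 2.3, 7.4, 1.9, 3.0, 9.2]           # numerical, needs a reference value

dataset = build_dataset(mutation_present, binarize(expression_level, reference=5.0))
print(dataset)
```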

The method of the present invention uses the data set obtained above, which includes binary data for the first event and binary data for the second event, and can therefore be used regardless of the type of event, whether the event is represented by presence or absence, by a numerical value, by a discrete quantity taking more than two values, or by a continuous quantity. The method is therefore suitable for being performed repeatedly on a plurality of events. Since the same algorithm can be used even when the method is performed repeatedly for a plurality of events, a unified analysis can easily be performed.

The functions of genes in the living body are diverse, the parameters that specify the state of each gene are diverse, and each parameter can take continuous or discrete values. It has therefore not been easy to specify the interdependencies of various genes by using data that includes such information in a unified manner. The method of the present invention can be used regardless of the type of information about the genes, and even when performed repeatedly on various genes it can use a common technique, so a unified analysis can easily be performed. The method of the present invention is therefore suitable for specifying the interdependencies of a plurality of genes by using data including information on those genes in a unified manner.

In the present invention, the number of samples is aggregated into a 2 × 2 contingency table from the data set including binary data for the first event and binary data for the second event. When the binary data for the first event and the binary data for the second event are both expressed as presence or absence, for example, this aggregation may be performed by counting a, b, c, and d, the numbers of samples that satisfy the condition of each cell of Table 1 below. The sum of a to d is N, the number of samples included in the data set.

In aggregating into the 2 × 2 contingency table, an actual table need not be used, as long as a, b, c, and d, the numbers of samples corresponding to the above conditions, are counted. For example, the following conditions may be set: (1) the first event is present and the second event is present; (2) the first event is present and the second event is absent; (3) the first event is absent and the second event is present; and (4) the first event is absent and the second event is absent. Each of the N samples is classified by determining which of conditions (1) to (4) it satisfies; this is repeated for all N samples, and the numbers of samples classified under each condition are totaled to obtain a, b, c, and d as the numbers of samples corresponding to conditions (1) to (4), respectively. In this case, among all N samples, (1) a is the number of samples with the first event and with the second event, (2) b is the number of samples with the first event and without the second event, (3) c is the number of samples without the first event and with the second event, and (4) d is the number of samples without the first event and without the second event.
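The classification into conditions (1) to (4) can be sketched as a single pass over the samples. The function name and the six-sample data set below are hypothetical and serve only to illustrate the counting.

```python
from typing import Iterable, Tuple


def count_2x2(dataset: Iterable[Tuple[bool, bool]]) -> Tuple[int, int, int, int]:
    """Count samples under conditions (1) to (4) and return (a, b, c, d)."""
    a = b = c = d = 0
    for first, second in dataset:
        if first and second:
            a += 1          # (1) first event present, second event present
        elif first:
            b += 1          # (2) first event present, second event absent
        elif second:
            c += 1          # (3) first event absent, second event present
        else:
            d += 1          # (4) both events absent
    return a, b, c, d


# Hypothetical binary data set for N = 6 samples: (first event, second event).
dataset = [(True, True), (False, False), (True, True),
           (True, False), (False, False), (False, True)]
a, b, c, d = count_2x2(dataset)
print(a, b, c, d, "N =", a + b + c + d)
```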

In the present invention, Fisher's exact probability P is calculated from the 2 × 2 contingency table in which the number of samples has been aggregated. In calculating Fisher's exact probability P, p is first calculated from a, b, c, d, and N by the following equation.
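In the standard form of Fisher's exact test, which is assumed here to be the form intended, the probability p of a table with cell counts a, b, c, d and N = a + b + c + d is the hypergeometric probability:

```latex
\[
p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}
\]
```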

Next, all data sets that are less likely to occur than the above data set, whose sample counts were aggregated into the 2 × 2 contingency table shown in Table 1, are assumed; for each such data set, the number of samples is likewise aggregated into a 2 × 2 contingency table, and p is likewise calculated using the above equation. Fisher's exact probability P can be calculated by summing all the calculated values of p.
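A minimal sketch of this summation is given below. It rests on assumptions that are conventional for a two-sided Fisher's exact test but are not spelled out in this passage: the candidate tables are those sharing the row and column totals of the observed table, and tables exactly as likely as the observed one are included in the sum. The function name is hypothetical. Exact integer arithmetic is used so that very small values of P are not lost to floating-point underflow.

```python
from fractions import Fraction
from math import comb


def fisher_exact_p(a: int, b: int, c: int, d: int) -> Fraction:
    """Two-sided Fisher exact probability P for the table [[a, b], [c, d]].

    Each candidate table is identified by its top-left cell x; its
    probability is comb(a+b, x) * comb(c+d, a+c-x) / comb(N, a+c).
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x: int) -> int:
        # Numerator of the hypergeometric probability (shared denominator omitted).
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more likely than the observed one (assumption: ties included).
    total = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return Fraction(total, comb(row1 + row2, col1))


print(fisher_exact_p(3, 0, 0, 3))   # small hypothetical table
```

Returning a `Fraction` keeps the result exact; it can be converted with `float()` when P is not extremely small.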

In the present invention, −log₁₀ P / (N log₁₀ 2) is calculated based on the calculated Fisher's exact probability P and N. The calculation of −log₁₀ P / (N log₁₀ 2) may be performed from P and N using, for example, a computer.
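As one way such a calculation might be carried out, the sketch below computes both −log₁₀ P / N and −log₁₀ P / (N log₁₀ 2) from given P and N. The function names are hypothetical. P is converted to an exact fraction and the logarithm is taken of its numerator and denominator separately, a design choice that lets an exactly computed, extremely small P be handled without underflow.

```python
from fractions import Fraction
from math import log10


def neg_log10(p) -> float:
    """Return -log10(P) for P given as a float or a Fraction."""
    p = Fraction(p)
    if not 0 < p <= 1:
        raise ValueError("P must satisfy 0 < P <= 1")
    # math.log10 accepts arbitrarily large integers, so this works for tiny P.
    return log10(p.denominator) - log10(p.numerator)


def interdependence_score(p, n: int) -> float:
    """-log10(P) / N."""
    return neg_log10(p) / n


def mutual_information_estimate(p, n: int) -> float:
    """-log10(P) / (N * log10(2)), the approximate mutual information in bits."""
    return neg_log10(p) / (n * log10(2))


# Hypothetical values: P = 1/10 obtained from N = 6 samples.
print(interdependence_score(Fraction(1, 10), 6))
print(mutual_information_estimate(Fraction(1, 10), 6))
```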

According to the findings of the present inventors, −log₁₀ P / (N log₁₀ 2) approximates the mutual information between the first event and the second event. Here, mutual information is a quantity used in information theory that measures the interdependence between two random variables; it is a measure of the amount of information shared by X and Y. The mutual information MI between two discrete random variables X and Y is defined, for example, by the following equation.
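The standard information-theoretic definition, assumed here to be the one intended, can be written as follows. The base of the logarithm sets the unit; base 2 gives bits.

```latex
\[
MI = I(X;Y) = \sum_{i}\sum_{j} p(x_i, y_j)\,\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}
\]
```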

In the above equation, p(x_i, y_j) is the joint probability distribution function of X and Y, and p(x_i) and p(y_j) are the marginal probability distribution functions of X and Y, respectively.

Also, the mutual information I (X; Y) between the two continuous random variables X and Y is defined by the following equation, for example.
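For continuous variables, the standard definition, again assumed to be the one intended, replaces the double sum with a double integral:

```latex
\[
I(X;Y) = \int\!\!\int p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy
\]
```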

In the above equation, p(x, y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively.

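These formulas mean that the mutual information is obtained by weighting the logarithm of the ratio between the joint probability of the two variables and the product of their marginal probabilities by the joint probability, and summing (or integrating) over the entire range of possible values.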

The relationship between the mutual information MI of the first event and the second event and −log₁₀ P / (N log₁₀ 2), which is the central finding of the present inventors, is shown as follows. First, consider the contingency tables in Table 2 and Table 3 below for two random variables A and B, and assume that they take two values, A and A′, and B and B′, respectively.

Table 2 shows the relative frequencies of the random variable combinations. Therefore, X₀, X₁, X₂, and X₃ are the ratios of AB, A′B, AB′, and A′B′, respectively. Table 3 shows the frequencies themselves, obtained by multiplying the relative frequencies by N.

In this case, the mutual information MI is defined as follows. Here, the logarithm is the natural logarithm.
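Written in the notation above, where X₀, X₁, X₂, and X₃ are the relative frequencies of AB, A′B, AB′, and A′B′, the standard definition of mutual information takes the following form. This is a reconstruction from that notation; the marginal frequencies are X₀ + X₂ for A, X₁ + X₃ for A′, X₀ + X₁ for B, and X₂ + X₃ for B′.

```latex
\[
\begin{aligned}
MI ={}& X_0 \log\frac{X_0}{(X_0+X_2)(X_0+X_1)}
      + X_1 \log\frac{X_1}{(X_1+X_3)(X_0+X_1)} \\
     &+ X_2 \log\frac{X_2}{(X_0+X_2)(X_2+X_3)}
      + X_3 \log\frac{X_3}{(X_1+X_3)(X_2+X_3)}
\end{aligned}
\]
```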

On the other hand, the main term of the p-value of Fisher's exact test is as follows.
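Taking the main term to be the probability of the observed table itself (an assumption consistent with the standard form of Fisher's exact test), and using the frequencies NX₀, NX₁, NX₂, and NX₃ for AB, A′B, AB′, and A′B′, it can be written as:

```latex
\[
p = \frac{(NX_0+NX_2)!\,(NX_1+NX_3)!\,(NX_0+NX_1)!\,(NX_2+NX_3)!}
         {N!\,(NX_0)!\,(NX_1)!\,(NX_2)!\,(NX_3)!}
\]
```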

Taking the logarithm of both sides,

Using Stirling's formula to approximate log N! by (N log N − N), and using X₀ + X₁ + X₂ + X₃ = 1,

Therefore,
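The steps from the main term to the result can be written out as follows. This is a reconstruction under the notation above (X₀, X₁, X₂, X₃ as the relative frequencies of AB, A′B, AB′, A′B′, and natural logarithms); it uses the `align*` environment of amsmath. The terms linear in N and the terms in N log N cancel because the four cell frequencies, and each pair of marginal frequencies, sum to 1.

```latex
\begin{align*}
\log p &= \log (NX_0+NX_2)! + \log (NX_1+NX_3)! + \log (NX_0+NX_1)! + \log (NX_2+NX_3)! \\
       &\quad - \log N! - \sum_{k=0}^{3} \log (NX_k)! \\
       &\approx N\Big[(X_0+X_2)\log(X_0+X_2) + (X_1+X_3)\log(X_1+X_3) \\
       &\qquad + (X_0+X_1)\log(X_0+X_1) + (X_2+X_3)\log(X_2+X_3)
                - \sum_{k=0}^{3} X_k \log X_k\Big] \\
       &= -N \cdot MI
\end{align*}
\[
MI \approx -\frac{\log p}{N}
\qquad\Longrightarrow\qquad
MI_{\mathrm{bits}} \approx -\frac{\log_{10} P}{N \log_{10} 2}
\]
```

The last step divides by log 2 to express the mutual information in bits and replaces the main term p by Fisher's exact probability P, on the assumption that the main term dominates the sum.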

As described above, the present inventors have found that the mutual information MI between events is approximately equal to a constant multiple of −log₁₀ P, the value obtained by logarithmically converting Fisher's exact probability P. Here, N indicates the number of samples. As N → ∞, both sides approach the same value.
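The approximation can be checked numerically with a short sketch such as the one below. The tables used (relative frequencies 0.4, 0.1, 0.1, 0.4 scaled to N = 100, 1,000, and 5,000) are hypothetical, as are the function names, and the two-sided form of Fisher's exact probability with ties included is an assumption. Under these assumptions, the gap between the directly computed mutual information (in bits) and −log₁₀ P / (N log₁₀ 2) would be expected to shrink as N grows.

```python
from fractions import Fraction
from math import comb, log2, log10


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact probability as an exact fraction (ties included)."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return Fraction(total, comb(row1 + row2, col1))


def mutual_information_bits(a, b, c, d):
    """Mutual information of the 2 x 2 table, from relative frequencies, in bits."""
    n = a + b + c + d
    cells = [(a, a + b, a + c), (b, a + b, b + d), (c, c + d, a + c), (d, c + d, b + d)]
    return sum((x / n) * log2(x * n / (row * col)) for x, row, col in cells if x > 0)


def estimate_bits(a, b, c, d):
    """-log10(P) / (N * log10(2)) computed from the exact P."""
    n = a + b + c + d
    p = fisher_exact_p(a, b, c, d)
    return (log10(p.denominator) - log10(p.numerator)) / (n * log10(2))


gaps = []
for k in (10, 100, 500):                      # N = 100, 1000, 5000
    table = (4 * k, k, k, 4 * k)              # same proportions at every N
    mi, est = mutual_information_bits(*table), estimate_bits(*table)
    gaps.append(abs(est - mi))
    print(f"N={10 * k:5d}  MI={mi:.5f}  estimate={est:.5f}  gap={gaps[-1]:.5f}")
```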

Furthermore, as shown in the Examples below, the present inventors found that when the number of samples is 1019, mutual information can be sufficiently approximated by −log₁₀ P / (N log₁₀ 2), and that when N is of this order, the interdependency between the first event and the second event can be accurately specified by using −log₁₀ P / (N log₁₀ 2). Therefore, in the present invention, N, the number of samples, is preferably 100 or more, more preferably 500 or more, and still more preferably 1,000 or more. Conventionally, Fisher's exact probability P has mostly been used when the number of samples N is small. The present invention obtains an excellent effect by using Fisher's exact probability P for the analysis of data with a large number of samples, and is epoch-making in this respect. Further, conventional calculation of mutual information is performed without considering the number of samples N and lacks consideration of statistical significance. For example, mutual information calculated from data on 10 cases has a statistical significance of only 10⁻¹⁰⁰ compared with mutual information based on data on 1,000 cases, but the conventional method of calculating mutual information did not distinguish between them. In the present invention, the calculation using −log₁₀ P / (N log₁₀ 2) obtains mutual information approximately while using the number of samples N, so that mutual information can be calculated in a way that takes the weight of the data into account; this too is epoch-making.

Thus, −log₁₀ P / (N log₁₀ 2) calculated as described above approximates the mutual information of the first event and the second event, and by using it, the interdependency between the first event and the second event can be specified. Here, the interdependency between the first event and the second event may be specified by evaluating the value of −log₁₀ P / (N log₁₀ 2) itself. Alternatively, the same calculation as was performed for the first event and the second event may be performed for the first event and a third event different from the second event, and the value of −log₁₀ P / (N log₁₀ 2) obtained for the first event and the third event may be compared with the value calculated for the first event and the second event. Here, the third event may have a known interdependency with the first event. An example of a known interdependency is one for which experimental results already exist that support its degree or meaning. Also, in specifying the interdependency, −log₁₀ P / (N log₁₀ 2), which is the mutual information itself, may be calculated, or −(log₁₀ P) / N may be calculated instead. −(log₁₀ P) / N is a numerical value indicating the degree of interdependency, and degrees of interdependency can be compared using it; the higher the value, the stronger the interdependency can be judged to be. By these methods, the interdependency between the first event and the second event can be specified more accurately.

Similarly, the value of −log₁₀ P / (N log₁₀ 2) with the first event may be calculated for each of a plurality of mutually different events, and the values calculated for the plurality of events may be compared with the value of −log₁₀ P / (N log₁₀ 2) calculated for the first event and the second event. By these methods, the interdependency between the first event and the second event can be specified more accurately.

Further, for each of a plurality of mutually different events, a list ranking the events according to the magnitude of the value of −log₁₀ P / (N log₁₀ 2) calculated with the first event may be created, and the nature of the first event may be specified based on the list. In specifying the nature of the first event based on the list, the nature of the events included in the list may be taken into account. The list can also be created by ranking according to the magnitude of −log₁₀ P / N without calculating −log₁₀ P / (N log₁₀ 2).
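A ranked list of this kind might be produced as in the sketch below. The event names, the P values, and N = 1000 are invented for illustration, and the function name is hypothetical. Ranking by −log₁₀ P / N gives the same order as ranking by −log₁₀ P / (N log₁₀ 2) when N is common to all events, since log₁₀ 2 is a constant.

```python
from math import log10
from typing import Dict, List, Tuple


def rank_events(p_values: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Rank events by -log10(P) / N, largest (strongest interdependency) first."""
    scores = {name: -log10(p) / n for name, p in p_values.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Fisher exact probabilities of three events against a first event.
p_values = {"event_B": 1e-12, "event_C": 0.03, "event_D": 1e-40}
ranking = rank_events(p_values, n=1000)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```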

The number of events for which the value of −log₁₀ P / (N log₁₀ 2) is calculated is, for example, the total number of events having the same nature as the first event and the second event. For example, if the first event and the second event each relate to a human gene, an example of the number of events for which the value of −log₁₀ P / (N log₁₀ 2) is calculated is the total number of human genes, about 20,000. When the nature of the first event is specified based on the list, the number of events included in the list is, for example, 50% or less, 20% or less, or 10% or less of the total number of events having the same nature as the first event and the second event.

If the event concerns a gene, examples of interdependencies that can be identified include those related to the molecular and cellular functions, physiological functions, disease relevance, and biological pathways of the gene, as well as interactions between cell surface molecules, metabolic pathways, molecular functional pathways, and drug targeting. Examples of disease relevance include the onset and progression of cancer, immune and allergic diseases, neuropsychiatric disorders, and congenital abnormalities.

In the present invention, even if the samples used are derived from patients suffering from cancer, it is possible to specify the interdependence of genes not related to cancer. Examples of genes not related to cancer include genes related to the nervous system, the immune system, metabolism, and the endocrine system. Conversely, in the present invention, even when the samples used are derived from patients who do not suffer from cancer, it is possible to specify the interdependence of genes related to cancer. By using the interdependency specified in the present invention, it is possible to specify a target molecule or a drug for a disease. In addition, by using the interdependency specified in the present invention, it is possible to search for ligands of orphan receptors.

In the case where the event concerns a word, for example, where the event is that a specific word is used in a specific sentence, examples of the interdependence to be specified include the meaning of the word.

The Fisher's exact probability P used in the calculation of −log₁₀ P / (N log₁₀ 2) in the method of the present invention may be calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including a Fisher's exact probability P₁ and a Fisher's exact probability P₂. Here, Fisher's exact probability P₁ is calculated based on a 2 × 2 contingency table in which the number of samples is tabulated from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a first criterion for the first event and a first criterion for the second event, from data including information on the first event and information on the second event for N₁ samples. Likewise, Fisher's exact probability P₂ is calculated based on a 2 × 2 contingency table in which the number of samples is tabulated from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a second criterion for the first event and a second criterion for the second event, from data including information on the first event and information on the second event for N₂ samples.

In the calculation of Fisher's exact probability P₁, data including information on the first event and information on the second event for N₁ samples can be obtained in the same manner as described above, except that N is replaced by N₁. In the calculation of Fisher's exact probability P₂, data including information on the first event and information on the second event for N₂ samples can be obtained in the same manner as described above, except that N is replaced by N₂. Here, the total of N₁ and N₂ does not exceed N; it may be equal to N or smaller than N. The N samples described above include the N₁ samples and the N₂ samples. The N₁ samples are preferably N₁ subjects that have a common property and give observations about the events, and the N₂ samples are preferably N₂ subjects that have a common property and give observations about the events. The property common to the N₁ subjects and the property common to the N₂ subjects need not completely match. For example, the property common to the N₁ subjects may be that they are derived from human breast cancer, and the property common to the N₂ subjects may be that they are derived from human lung cancer. Even in this case, the N samples including the N₁ samples and the N₂ samples have the common property of being derived from human cancer.

In the calculation of Fisher's exact probability P₁, the data set including binary data for the first event and binary data for the second event is acquired based on the first criterion for the first event and the first criterion for the second event. In the calculation of Fisher's exact probability P₂, the data set including binary data for the first event and binary data for the second event is acquired based on the second criterion for the first event and the second criterion for the second event.

Here, the data set can be acquired as described above, except that the acquisition is based on the first criterion for the first event and the first criterion for the second event, or on the second criterion for the first event and the second criterion for the second event. The first criterion for the first event and the first criterion for the second event are criteria for acquiring, for the N₁ samples, the binary data for the first event and the binary data for the second event, respectively. The second criterion for the first event and the second criterion for the second event are criteria for acquiring, for the N₂ samples, the binary data for the first event and the binary data for the second event, respectively. Examples of the criterion include presence or absence, when the event is expressed by presence or absence, and a reference value for classifying the event, when the event is expressed by a numerical value. When a reference value is used, the data can be converted into binary data depending on, for example, whether the numerical value is equal to or higher than the reference value or less than the reference value. The first criterion for the first event and the second criterion for the first event may be the same or different. For example, when the first event is expressed by a numerical value, the reference value serving as the first criterion and the reference value serving as the second criterion may be the same numerical value or different numerical values. Also, the first criterion for the first event and the first criterion for the second event may be the same or different, and the second criterion for the first event and the second criterion for the second event may be the same or different. For example, when both the first event and the second event are expressed by numerical values, the reference value serving as the first criterion for the first event and the reference value serving as the first criterion for the second event may be the same numerical value or different numerical values.
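The conversion into binary data using a reference value can be sketched as follows. The function name `binarize` and the numerical values are hypothetical; the sketch assumes the convention stated above, in which a value equal to or higher than the reference value is coded as 1 and a value less than the reference value is coded as 0.

```python
def binarize(values, reference):
    """Convert numerical observations into binary data.

    A value equal to or higher than the reference value becomes 1,
    and a value less than the reference value becomes 0.
    """
    return [1 if value >= reference else 0 for value in values]

# Hypothetical expression ratios for one event in two independently acquired data sets.
# The first criterion and the second criterion may use different reference values.
first_set = binarize([0.8, 2.0, 3.5, 1.9], reference=2.0)
second_set = binarize([0.8, 2.0, 3.5, 1.9], reference=3.0)
print(first_set, second_set)
```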

As described above, in the present invention, the data on which the data set is based may be data expressed as a discrete quantity taking more than two values, or data expressed as a continuous quantity, that has been converted into binary data. Therefore, regardless of whether the data is discrete, continuous, or binary, and regardless of whether the samples are heterogeneous or homogeneous, various data can be used for statistical analysis, and analysis results based on a wide range of data can be obtained.

By performing the same method as described above using the data set including the binary data for the first event and the binary data for the second event obtained as described above, a 2 × 2 contingency table can be obtained for the N₁ samples in which the number of samples is tabulated according to the first criterion for the first event and the first criterion for the second event. Similarly, a 2 × 2 contingency table can be obtained for the N₂ samples in which the number of samples is tabulated according to the second criterion for the first event and the second criterion for the second event. Calculation of Fisher's exact probability P₁ from the 2 × 2 contingency table obtained for the N₁ samples can be performed in the same manner as the calculation of Fisher's exact probability P described above. Similarly, calculation of Fisher's exact probability P₂ from the 2 × 2 contingency table obtained for the N₂ samples can be performed in the same manner as the calculation of Fisher's exact probability P described above.

The Fisher's exact probability P used in the present invention may be one calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including Fisher's exact probability P₁ and Fisher's exact probability P₂. Here, the plurality of Fisher's exact probabilities includes Fisher's exact probability P₁ and Fisher's exact probability P₂, and their number is, for example, two, but may exceed two. In addition to Fisher's exact probability P₁ and Fisher's exact probability P₂, a Fisher's exact probability Pₙ calculated by a similar method may be included in the plurality of Fisher's exact probabilities. The number of Fisher's exact probabilities to be integrated using meta-analysis is not particularly limited, but is, for example, 2 to 100.

Various methods for integration using meta-analysis are known. For example, Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage describes a method for calculating p_overall by integrating p values obtained under different study conditions. Integration using meta-analysis can be performed as follows, for example, for a one-sided test in Fisher's exact test. First, let p_i be each Fisher's exact probability to be integrated, and convert it into a Z value (z_i).

Z_overall, which is the sum of the Z values divided by the square root of the number (k) of values to be integrated, follows a normal distribution.

By calculating p_overall, which is the integrated P value, from this Z_overall, the individual Fisher's exact probabilities can be integrated.
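The integration steps described above can be sketched as follows. This is a minimal sketch that assumes the usual one-sided conversion, in which z_i is the standard normal deviate whose upper-tail probability equals p_i, and assumes every p_i lies strictly between 0 and 1; the function name `integrate_p_values` is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

def integrate_p_values(p_values):
    """Integrate one-sided p values by summing Z values and dividing by sqrt(k).

    Returns (z_overall, p_overall). Each p value must satisfy 0 < p < 1.
    """
    standard_normal = NormalDist()
    # By symmetry, the deviate with upper-tail probability p is -inv_cdf(p).
    z_values = [-standard_normal.inv_cdf(p) for p in p_values]
    k = len(z_values)
    z_overall = sum(z_values) / sqrt(k)
    # Upper-tail probability of the standard normal distribution at z_overall.
    p_overall = 0.5 * erfc(z_overall / sqrt(2))
    return z_overall, p_overall

# Hypothetical p values obtained from two independently acquired data sets.
z_overall, p_overall = integrate_p_values([0.05, 0.05])
print(z_overall, p_overall)
```

For very strongly associated events, p_overall computed this way can underflow to zero in double precision; in that case a routine that works with the logarithm of the tail probability would be needed before taking −log₁₀.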

Conventionally, mutual information has not been calculated by integrating data obtained under various conditions. In the present invention, as described above, by using Fisher's exact probability P integrated using meta-analysis, mutual information can be calculated on the basis of a wide range of data, for example a combination of data obtained under various conditions. This makes it possible to integrate the findings obtained under various different conditions (e.g., different cell lineages, various internal and external conditions, etc.) and to specify the interdependency between events more accurately without being affected by the biases under those conditions.

In the present invention, since a large amount of data can be analyzed using a common method, the method of the present invention is suitable for implementation by a computer. In the present invention, the above method may be performed by a computer program for executing this method. Examples of the computer program include a program for causing a computer to function as means for performing each step of the above-described method.

Examples of the computer program include a program for causing a computer to function as:
(1) means for performing a step of obtaining data including information on the first event and information on the second event for N samples;
(2) means for performing a step of obtaining, from the data including the information on the first event and the information on the second event for the N samples, a data set including binary data for the first event and binary data for the second event;
(3) means for performing a step of determining, based on the criterion for the first event and the criterion for the second event, which type of the 2 × 2 contingency table each of the N samples corresponds to, and classifying each of the samples into the respective type;
(4) means for performing a step of repeating this classification for all of the N samples and totaling the number of samples classified into each type, thereby tabulating the number of samples from the data set in the 2 × 2 contingency table;
(5) means for performing a step of calculating Fisher's exact probability P based on the 2 × 2 contingency table in which the number of samples is tabulated; and
(6) means for performing a step of calculating −log₁₀ P / (N log₁₀ 2) based on the calculated Fisher's exact probability P and the N.
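The classification and tabulation of steps (3) and (4) can be sketched as follows, assuming the binary data of step (2) is already available as two equal-length lists of 0 and 1. The function name `tabulate_2x2` and the cell layout (rows for the first event, columns for the second event, with the value 1 first) are assumptions of this sketch.

```python
def tabulate_2x2(first_event, second_event):
    """Count N samples into a 2 x 2 contingency table.

    Layout (an assumption of this sketch):
        table[0][0]: first = 1, second = 1    table[0][1]: first = 1, second = 0
        table[1][0]: first = 0, second = 1    table[1][1]: first = 0, second = 0
    """
    if len(first_event) != len(second_event):
        raise ValueError("both events must be observed for the same N samples")
    table = [[0, 0], [0, 0]]
    for x, y in zip(first_event, second_event):
        # Step (3): decide which of the four types the sample belongs to.
        # Step (4): add it to the running count of that type.
        table[1 - x][1 - y] += 1
    return table

# Hypothetical binary data for five samples.
print(tabulate_2x2([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

The resulting four counts are the input from which Fisher's exact probability P of step (5) would then be calculated.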

The program can be executed by causing the computer to read it and causing the hardware resources of the computer and the loaded software to function in a coordinated manner. Examples of hardware resources include arithmetic means such as a CPU and storage means such as a memory.

The computer program may be stored in a recording medium. Examples of the recording medium include optical reading means such as CD-ROM and DVD, and information storage means such as semiconductor memory, flexible disk, and hard disk.

Example 1:
Data on invasive breast carcinoma (BRCA) patients, comprising 1019 samples, was downloaded from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/). This data contained information about 20,000 genes. Each patient was classified into one of two types based on whether the mRNA expression of the target gene CLSTN3 (calsyntenin 3) was at least twice that of the wild type or less than twice. Similarly, for the mRNA expression of each of the remaining genes, each patient was classified into one of two types based on whether it was at least twice that of the wild type or less than twice. Based on the classified data, the number of patients was counted in a 2 × 2 contingency table for CLSTN3 and each of the remaining genes according to the above criteria. Based on the counts, the mutual information with CLSTN3 was calculated for each gene using the formula defining mutual information described above. In addition, based on the counts, Fisher's exact probability p was calculated for each gene. For each gene, the calculated mutual information with CLSTN3 and the value of −log(p) obtained from Fisher's exact probability p were plotted on a graph.

The results are shown in FIG. 1. As shown in FIG. 1, with 1019 samples, there was a linear relationship between the mutual information and −log(p). Thus, when N is large, −log(p) for Fisher's exact probability p is proportional to the mutual information.
Similar results were obtained when each breast invasive cancer patient was classified based on the presence or absence of point mutations.

Example 2:
Data on samples of the following 19 types were downloaded from TCGA (http://cancergenome.nih.gov/): acute myeloid leukemia, bladder urothelial cancer, breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, renal cell carcinoma, renal papillary cell carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, prostate cancer, rectal adenocarcinoma, cutaneous melanoma, gastric adenocarcinoma, thyroid cancer, endometrial cancer, and cancer cell lines (CCLE). The above-mentioned CCLE is not case data but data obtained using 1021 types of established cancer cell lines. The data for each sample type included 66 to 1021 cases as samples and contained information about 20,000 genes.

Using EGFR (epidermal growth factor receptor) as the target gene, for each of the 19 types of samples, the number of samples was counted in a 2 × 2 contingency table for EGFR and each of the remaining genes in the same manner as in Example 1, and Fisher's exact probability P was calculated on that basis.

The Fisher's exact probabilities P between EGFR and each gene calculated for each sample type were integrated using a meta-analysis method (Rosenthal, 1984). That is, each P value was converted into a Z value, the Z values were integrated to calculate a Z_overall value, and the calculated Z_overall value was further converted to obtain a P_overall value for each gene. Based on the obtained P_overall value and the total number N_all of all samples used in the integration, −log₁₀ P_overall / (N_all log₁₀ 2) was calculated for each gene.

FIG. 2 shows the calculated values arranged in descending order. Further, 2001 genes, consisting of EGFR and the 2000 genes with the largest calculated values, were analyzed with Ingenuity Pathway Analysis (IPA) (registered trademark) analysis software manufactured by Qiagen. Table 5 below shows the top five results for canonical pathways (Canonical Pathways) in IPA.

The third predicted pathway was EGF signaling. Thus, for 19 types of samples, when Fisher's exact probability was integrated by meta-analysis, the interdependence between EGFR and each gene could be specified accurately.

Example 3:
The same method as in Example 2 was carried out except that RB1 (RB transcriptional corepressor 1), IFNG (interferon gamma), and GRM1 (glutamate metabotropic receptor 1) were each used as the target gene. For each target gene, the results of arranging the genes in descending order of the calculated value are shown in FIGS.

In addition, for IFNG, a gene list of the 2000 genes with the largest calculated values was analyzed with IPA (registered trademark) analysis software. As a result, the top prediction for Upstream Regulator in IPA (registered trademark) was IFNG. Thus, IFNG could be predicted from a gene list that did not include IFNG itself. Table 5 below shows the top five results for canonical pathways (Canonical Pathways) in IPA (registered trademark).

The predicted pathways are in good agreement with the known pathways of IFNG. These results strongly suggest that the analysis result according to the present invention, which was the subject of the IPA (registered trademark) analysis, is highly accurate, and also indicate that the present invention is useful for disease areas other than cancer.

Similarly, a gene list of 2000 genes having a large amount of mutual information with GRM1 was analyzed with IPA (registered trademark) analysis software. Table 6 below shows the top 15 results of those with an activity z-score of 3 or more in disease or function annotations (Disease & Functions Annotation).

It can be seen that the predicted GRM1 function is very consistent with the known GRM1 function. Thus, when the Fisher's exact probability was integrated by meta-analysis for a large number of samples, the interdependence between GRM1 and each gene could be identified very accurately.

Example 4:
For one week of sales at store A of a supermarket chain, a purchase history of about 5000 samples is downloaded from the POS system. This data includes information about the contents of individual purchases. The 5000 samples are classified into two types based on whether or not a product belonging to the “rice ball” category was purchased. Similarly, for each of the other product categories (the number of product categories is about 300), the samples are classified into two types based on whether or not a product in that category was purchased. In the same manner as in Example 1, samples are tabulated in a 2 × 2 contingency table for “rice ball” and each product category, and Fisher's exact probability P is calculated based on the tabulation result. This is done for all about 200 product categories.

For the other stores B to Z of the supermarket chain, Fisher's exact probability P between “rice ball” and each product category is calculated in the same manner, and the results are integrated using the meta-analysis method in the same manner as in Example 2. Based on N_all, the total number of samples used in the integration, −log₁₀ P_overall / (N_all log₁₀ 2) is calculated for each product category.

It can be analyzed that the product category with a high value obtained by this calculation is often purchased at the same time as “rice ball”. For example, if it is analyzed that a supermarket customer who purchases “rice ball” often purchases “cup miso soup” at the same time, sales can be increased by displaying both of them adjacently.

Example 5:
Stock price data for 2017 is downloaded for the stocks traded on the first section of the Tokyo Stock Exchange (about 2000 stocks). There were about 240 trading days in 2017, and each day is treated as a sample. Next, data on the dollar-yen exchange rate (the price of 1 dollar converted into yen) in 2017 is downloaded. Using the exchange rate data, each sample day is classified into one of two types based on whether the dollar-yen exchange rate on that day is higher than the rate on the previous day. Next, using the stock price data, each sample day is classified into one of two types for each company based on whether the stock price at the end of trading is higher than at the start of trading. In the same manner as in Example 1, samples are tabulated in a 2 × 2 contingency table for the fluctuation of the dollar-yen exchange rate and the stock price of each company, and Fisher's exact probability P is calculated based on the tabulation result. This is done for all of the about 2000 stocks.

The P calculated for each stock is integrated for each industry, according to the TSE industry classification, using the meta-analysis method in the same manner as in Example 2. Based on N_all, the total number of samples used in the integration, −log₁₀ P_overall / (N_all log₁₀ 2) is calculated for each industry.

It can be predicted that an industry with a high value obtained by this calculation has a high tendency for the stock price to fluctuate in conjunction with the dollar-yen exchange rate.

Claims

1. A method for identifying interdependency between a first event and a second event, comprising a step of calculating −log₁₀ P / N based on N and a Fisher's exact probability P, wherein the Fisher's exact probability P is calculated based on a 2 × 2 contingency table in which the number of samples is tabulated from a data set including binary data for the first event and binary data for the second event, the data set being obtained from data including information on the first event and information on the second event for N samples.
2. The method of claim 1, wherein the Fisher's exact probability P is calculated by a method comprising a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including: (1) a Fisher's exact probability P₁ calculated based on a 2 × 2 contingency table in which the number of samples is tabulated from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a first criterion for the first event and a first criterion for the second event, from data including information on the first event and information on the second event for N₁ samples; and (2) a Fisher's exact probability P₂ calculated based on a 2 × 2 contingency table in which the number of samples is tabulated from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a second criterion for the first event and a second criterion for the second event, from data including information on the first event and information on the second event for N₂ samples.
3. A computer program for executing the method according to claim 1.
4. A recording medium storing the computer program according to claim 3.