
WO2018181988A1 - Method for identifying interdependence - Google Patents

Publication number
WO2018181988A1
WO2018181988A1 PCT/JP2018/013877 JP2018013877W WO2018181988A1 WO 2018181988 A1 WO2018181988 A1 WO 2018181988A1 JP 2018013877 W JP2018013877 W JP 2018013877W WO 2018181988 A1 WO2018181988 A1 WO 2018181988A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
data
samples
calculated
information
Prior art date
Application number
PCT/JP2018/013877
Other languages
French (fr)
Japanese (ja)
Inventor
努 森
河村 隆
Original Assignee
公立大学法人福島県立医科大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 公立大学法人福島県立医科大学 filed Critical 公立大学法人福島県立医科大学
Priority to JP2019509406A priority Critical patent/JP6820621B2/en
Publication of WO2018181988A1 publication Critical patent/WO2018181988A1/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The present invention relates to an information processing method for large-scale data, a computer program for executing the method, and a recording medium storing the program. Specifically, it relates to a method for specifying interdependency between two events, a computer program for executing the method, and a recording medium storing the program.
  • The mutual information amount between a plurality of events is used as a quantity representing a measure of interdependency between the events.
  • The mutual information amount of X and Y is defined by the equation described later. As that definition shows, the conventional mutual information amount does not consider the number N of samples from which it is calculated, and statistical significance has not been taken into account. Further, as the definition also shows, it has not been considered that the mutual information amount can be calculated using a combination of data obtained under different conditions.
  • Techniques for analyzing large amounts of data using the mutual information amount are used for processing various kinds of information, such as documents, sounds, images, positions, life, astronomy, finance, and sales.
  • As an algorithm for analyzing life information data, ARACNE, for example, is known (Non-Patent Document 1).
  • Fisher's exact test is a statistical test method used to analyze data classified into two categories, mainly when the number of samples is small, and has been used for various kinds of statistical processing (Non-Patent Documents 2 to 3). The relationship between Fisher's exact probability and mutual information has not been known so far.
  • The present invention aims to specify, statistically significantly and efficiently, the interdependencies of a plurality of events shown in such data.
  • As a result of diligent study, the present inventors have found that the mutual information amount can be approximately calculated as -log10 P / (N log10 2), using Fisher's exact probability P calculated based on a 2 × 2 contingency table and the number of samples N used to create the contingency table. That is, the present inventors found that the interdependency between events can be specified by obtaining a data set including binary data from data including N samples, creating a 2 × 2 contingency table based on the data set, calculating Fisher's exact probability P from the table, and calculating -log10 P / (N log10 2) from N and P to obtain the mutual information between the events.
  • Fisher's exact probability P is a concept that has been studied in probability theory, whereas mutual information is a concept that has been studied mainly in information theory.
  • The discovery of a relationship between the two is therefore of considerable technical significance.
  • Because log10 2 is a constant, the interdependency between the events can also be specified by calculating -log10 P / N.
  • In this specification, the calculation of -log10 P / N is understood, in a broad sense, to imply the calculation of -log10 P / (N log10 2).
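The two quantities named above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation; the function names `approx_mutual_information` and `dependence_score` are hypothetical, and P is assumed to be a Fisher's exact probability that has already been computed.

```python
import math


def approx_mutual_information(p: float, n: int) -> float:
    """Return -log10(P) / (N * log10(2)), the approximation named in the text."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if n <= 0:
        raise ValueError("n must be a positive sample count")
    return -math.log10(p) / (n * math.log10(2))


def dependence_score(p: float, n: int) -> float:
    """Return -log10(P) / N, which differs from the value above only by the constant log10(2)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if n <= 0:
        raise ValueError("n must be a positive sample count")
    return -math.log10(p) / n
```

Because the two values differ only by a constant factor, ranking events by either one gives the same order.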
  • The present invention provides a method for specifying interdependency between a first event and a second event, the method including a step of calculating -log10 P / N based on N and on Fisher's exact probability P, where P is calculated based on a 2 × 2 contingency table that aggregates the number of samples from a data set including binary data for the first event and binary data for the second event, and the data set is obtained from data that includes information on the first event and information on the second event for N samples.
  • Because Fisher's exact probability P is a statistical quantity, meta-analysis can be performed on it by a conventionally known method. With meta-analysis, a plurality of Fisher's exact probabilities P calculated from data obtained under different conditions, such as data on different types of samples, can be integrated to calculate a Fisher's exact probability P for the data as a whole. Therefore, in the above method, a Fisher's exact probability P can be calculated for each set of data acquired under different conditions, and by using the Fisher's exact probability obtained by integrating them, the interdependency between events can be specified based on the whole of the data acquired under different conditions.
  • The present invention also provides the method according to the first aspect, wherein the Fisher's exact probability P is calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including: (1) Fisher's exact probability P1, calculated based on a 2 × 2 contingency table in which the number of samples is aggregated from a data set including binary data for the first event and binary data for the second event, the data set being obtained, based on a first criterion for the first event and a first criterion for the second event, from data including information on the first event and information on the second event for N1 samples; and (2) Fisher's exact probability P2, calculated based on a 2 × 2 contingency table in which the number of samples is aggregated from a data set including binary data for the first event and binary data for the second event, the data set being obtained, based on a second criterion for the first event and a second criterion for the second event, from data including information on the first event and information on the second event for N2 samples.
  • the present invention provides a computer program for executing the method described in the first aspect or the second aspect.
  • the present invention provides a recording medium storing the computer program according to the third aspect.
  • According to the present invention, the interdependency between events can be specified by calculating the mutual information amount between the events.
  • When the mutual information amount between events is calculated using meta-analysis, the interdependency between the events can be specified even from data on different types of samples acquired under different conditions. It is therefore possible to identify interdependencies between events more accurately and with statistical significance based on a large amount of data, while reducing bias due to the characteristics of the various samples included in the data as a whole.
  • The vertical axis indicates the value obtained by multiplying the mutual information with IFNG, calculated for each gene, by N log10 2. A further graph arranges the mutual information with GRM1, calculated for each gene, from left to right in order of value; its vertical axis indicates the value obtained by multiplying the mutual information with GRM1, calculated for each gene, by N log10 2.
  • The present invention provides a method for specifying the interdependency between the first event and the second event.
  • An example of an event is a state grasped as an observation result about an object.
  • Examples of objects include genes and words.
  • Other examples of objects include documents, sound, images, location, life, astronomy, finance, sales, etc.
  • An example of a state is that it differs from the average property of the object.
  • Examples of events include genetic changes, epigenetic changes, and rising or falling stock prices.
  • Another example of an event is that multiple words are used in the same sentence, and that sales include sales of a specific product.
  • Examples of gene changes include gene sequence mutations, gene expression product changes, and gene modification changes.
  • Examples of gene sequence mutations include gene base sequence mutations, gene copy number changes on the chromosome, and gene modification changes.
  • Examples of gene base sequence mutations include gene point mutations, addition of base sequences to genes, and deletion of base sequences in genes.
  • Examples of gene expression products include proteins, mRNA, and miRNA (micro-RNA).
  • Examples of changes in gene expression products include changes in the expression level of gene expression products, changes in the expression location of gene expression products, formation of gene expression product complexes, and degradation of gene expression product complexes.
  • Examples of gene modification include DNA methylation and histone modification.
  • Examples of histone modifications include acetylation, methylation, ubiquitination, phosphorylation, and SUMOylation.
  • Other examples of gene modification include post-translational modification.
  • Examples of post-translational modifications include functional group addition, protein or peptide addition, amino acid chemistry conversion, and structural conversion.
  • Examples of functional group addition include acylation, acetylation, alkylation, amidation, biotinylation, formylation, gamma carboxylation, glutamylation, glycosylation, glycylation, heme, hydroxylation, iodination, isoprenylation, lipoylation (prenylation, GPI anchor formation, myristoylation, farnesylation, geranylgeranylation, etc.), covalent bond addition to nucleotides or derivatives (ADP ribosylation, FAD linkage, etc.), redox reaction, polyethylene glycolation, phosphatidylinositol, phosphopantetheinylation, phosphorylation, pyroglutamate formation, racemization, tyrosine sulfation, and selenoylation.
  • Examples of protein or peptide addition include ISG, SUMO, and ubiquitination. Examples of conversion of amino acid chemistry include citrullination or dea
  • Examples of genes include genes of mammals such as humans, monkeys, mice, and rats.
  • An example of an epigenetic change is a change that is inherited through cell division and is independent of a change in DNA base sequence.
  • The terms "first" and "second" are labels for distinguishing the first event from the second event, and do not limit the order of these events.
  • The first event and the second event may be the same state for different objects, or may be different states for the same object.
  • For example, the first event may be a mutation in the base sequence of gene A, and the second event may be a mutation in the base sequence of gene B.
  • Alternatively, the first event may be a mutation in the sequence of gene A, and the second event may be a change in the expression level of the expression product of gene A.
  • Here, gene A and gene B indicate different genes.
  • Examples of events include those represented by presence / absence and those represented by numerical values. Examples of what is represented by a numerical value include those represented by a discrete quantity exceeding 2 and those represented by a continuous quantity.
  • The first event and the second event may be expressed differently. For example, the first event may be expressed by its presence or absence, and the second event may be expressed by a discrete quantity exceeding 2.
  • N samples are, for example, N subjects having a common property that give an observation result about the event.
  • N include numerical values such as 10 or more, 100 or more, 1,000 or more, 10,000 or more, 100,000 or more. The larger N is, the more accurately the interdependency between the first event and the second event can be specified.
  • Examples of the common property include being derived from living organisms, from humans, from humans with diseases, from humans with cancer, and from humans with a specific type of cancer.
  • Examples of the subject include cells of living organisms such as humans, organs, and other biological samples.
  • Examples of cancers include leukemia, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma, multiple myeloma, brain tumor, breast cancer, endometrial cancer, cervical cancer, ovarian cancer, esophageal cancer, stomach cancer, appendiceal cancer, colon cancer, liver cancer, hepatocellular carcinoma, gallbladder cancer, bile duct cancer, pancreatic cancer, adrenal cancer, gastrointestinal stromal tumor, mesothelioma, head and neck cancer, laryngeal cancer, oral cancer, oral floor cancer, gingival cancer, tongue cancer, buccal mucosa cancer, salivary gland cancer, sinus cancer, maxillary sinus cancer, frontal sinus cancer, ethmoid sinus cancer, sphenoid sinus cancer, thyroid cancer, kidney cancer, lung cancer, osteosarcoma, prostate cancer, testicular tumor (testicular cancer), renal cell cancer, bladder cancer, rhabdomyosarcoma, skin cancer, and anal cancer.
  • The data used in the present invention includes information on the first event and information on the second event for N samples; that is, each of the N samples has both information on the first event and information on the second event.
  • Examples of event information are as follows: (1) if the event is represented by its presence or absence, information on whether or not the event occurred for the sample; (2) if the event is represented by a numerical value, the numerical value for the sample.
  • A data set containing binary data for the first event and binary data for the second event is obtained from the data including the information on the first event and the information on the second event for the N samples.
  • Examples of binary data for events include presence/absence data when events are represented by presence or absence, and data indicating whether a value is above or below a reference value when events are represented by numerical values.
  • A data set including binary data for an event can be obtained as follows: (1) if the event is represented by presence or absence, the event information included in the data is used as it is; (2) if the event is expressed numerically, a reference value is set, the event information for each sample included in the data is judged to be greater than or less than the reference value, the judgment result is taken as binary data, and this is repeated for the N samples.
  • Acquisition of a data set including binary data for the first event and binary data for the second event can be done by, for example, (1) performing the above-described method on the information of the first event to acquire binary data for the first event, (2) performing the above-described method on the information of the second event to acquire binary data for the second event, and (3) combining the acquired binary data.
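The acquisition steps just described might look like the following in Python. This is a sketch under stated assumptions: the helper names `binarize` and `build_dataset` are hypothetical, and treating "equal to the reference value" as 1 is one possible convention that the text leaves open at this point.

```python
from typing import List, Sequence, Tuple


def binarize(values: Sequence[float], reference: float) -> List[int]:
    """Convert numerical event information to binary data (1 if value >= reference, else 0)."""
    return [1 if value >= reference else 0 for value in values]


def build_dataset(first: Sequence[int], second: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair the binary data of the first and second events sample by sample."""
    if len(first) != len(second):
        raise ValueError("both events need one value per sample")
    return list(zip(first, second))


# Hypothetical example: the first event is already presence/absence data,
# the second event is a numerical measurement binarized at a reference value of 5.0.
mutation_present = [1, 0, 1, 0]
expression_level = [7.2, 1.3, 5.0, 4.9]
dataset = build_dataset(mutation_present, binarize(expression_level, 5.0))
```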
  • the data set including the binary data for the first event and the binary data for the second event acquired in the above may be in a form using a linear index, for example.
  • Because the method of the present invention uses the data set including the binary data for the first event and the binary data for the second event obtained as above, it can be used regardless of the type of event: events represented by presence or absence, events represented by numerical values, events represented by a discrete quantity exceeding 2, and events represented by a continuous quantity. The method is therefore suitable for repeated application to a plurality of events. Since the same algorithm can be used even when the method is repeated for a plurality of events, a unified analysis can be performed easily.
  • The states of the genes in a living body are diverse, the parameters that specify the state of each gene are diverse, and each parameter can take continuous or discrete values. It has therefore not been easy to identify the interdependencies of various genes in a unified manner using data that include such information.
  • the method of the present invention can be used regardless of the type of information about various genes, and even when repeatedly performed on various genes, it can be performed using a common technique, so it is easily unified. Analysis can be performed. Therefore, the method of the present invention is suitable for using data including information on a plurality of genes in a unified manner to specify the interdependence of these genes.
  • The number of samples is aggregated in a 2 × 2 contingency table from a data set including binary data for the first event and binary data for the second event.
  • When the binary data for the first event and the binary data for the second event are both expressed by presence or absence, the aggregation of the number of samples from the data set into the 2 × 2 contingency table may be performed, for example, by counting a, b, c, and d, the numbers of samples corresponding to the conditions of each cell in Table 1 below. Note that the sum of a to d is N, the number of samples included in the data set.
  • A table need not actually be used as long as a, b, c, and d, the numbers of samples corresponding to the above conditions, are counted. For example, the following conditions may be set: (1) the first event is present and the second event is present; (2) the first event is present and the second event is absent; (3) the first event is absent and the second event is present; and (4) the first event is absent and the second event is absent. Each of the N samples is classified by determining which of conditions (1) to (4) it satisfies, this is repeated for all N samples, and the numbers of samples classified into each condition are totaled to obtain a, b, c, and d as the numbers of samples corresponding to conditions (1) to (4), respectively.
  • (1) a is the number of samples, among all N samples, in which the first event is present and the second event is present.
  • (2) b is the number of samples, among all N samples, in which the first event is present and the second event is absent.
  • (3) c is the number of samples, among all N samples, in which the first event is absent and the second event is present.
  • (4) d is the number of samples, among all N samples, in which the first event is absent and the second event is absent.
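The counting of a, b, c, and d can be sketched as a single pass over the data set. The function name `contingency_counts` is hypothetical, and the data set is assumed to be a sequence of (first event, second event) pairs in which 1 means present and 0 means absent.

```python
from typing import Iterable, Tuple


def contingency_counts(pairs: Iterable[Tuple[int, int]]) -> Tuple[int, int, int, int]:
    """Classify each sample into one of the four conditions and return (a, b, c, d)."""
    a = b = c = d = 0
    for first, second in pairs:
        if first and second:
            a += 1  # first event present, second event present
        elif first:
            b += 1  # first event present, second event absent
        elif second:
            c += 1  # first event absent, second event present
        else:
            d += 1  # both events absent
    return a, b, c, d
```

The four counts always sum to the number of samples N, which gives a cheap sanity check on the input.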
  • Fisher's exact probability P is calculated based on the 2 × 2 contingency table in which the number of samples is counted.
  • P is calculated from a, b, c, d, and N using the following equations.
  • Based on the calculated Fisher's exact probability P and on N, -log10 P / (N log10 2) is calculated.
  • The calculation of -log10 P / (N log10 2) from P and N may be performed, for example, using a computer.
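As an illustration of these steps, the sketch below computes -log10 P from a, b, c, and d using the standard hypergeometric form of Fisher's exact test, then divides by N log10 2. Several points are assumptions rather than statements from the text: the function names are hypothetical, the two-sided P (the sum over all tables no more probable than the observed one) is used by default because the text does not say which tail is intended, and exact integer arithmetic is used so that very small P values at large N do not underflow.

```python
import math


def neg_log10_fisher_p(a: int, b: int, c: int, d: int, two_sided: bool = True) -> float:
    """Return -log10 of Fisher's exact probability for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def weight(x: int) -> int:
        # Number of ways to obtain a table whose top-left cell equals x
        # when the row and column totals are held fixed.
        return math.comb(row1, x) * math.comb(n - row1, col1 - x)

    low, high = max(0, col1 - (n - row1)), min(row1, col1)
    observed = weight(a)
    if two_sided:
        numerator = 0
        for x in range(low, high + 1):
            w = weight(x)
            if w <= observed:  # tables at least as extreme as the observed one
                numerator += w
    else:
        # One-sided alternative: tables with a top-left cell >= the observed a.
        numerator = sum(weight(x) for x in range(a, high + 1))
    denominator = math.comb(n, col1)
    # math.log10 accepts arbitrarily large integers, so nothing underflows here.
    return math.log10(denominator) - math.log10(numerator)


def approx_mi_from_table(a: int, b: int, c: int, d: int) -> float:
    """Return -log10(P) / (N * log10(2)) for the given 2 x 2 table."""
    n = a + b + c + d
    return neg_log10_fisher_p(a, b, c, d) / (n * math.log10(2))
```

For the table a = 3, b = 0, c = 0, d = 3, two of the 20 equally weighted arrangements are as extreme as the observed one, so P = 0.1 and -log10 P = 1.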
  • the mutual information amount is an amount representing a measure of interdependence between two random variables used in information theory.
  • the mutual information amount is a measure of the information amount shared by X and Y.
  • The mutual information MI between two discrete random variables X and Y is defined by the following equation, for example.
  • p(x_i, y_j) is the joint probability distribution function of X and Y.
  • p(x_i) and p(y_j) are the marginal probability distribution functions of X and Y, respectively.
  • The mutual information I(X; Y) between two continuous random variables X and Y is defined by the following equation, for example.
  • p(x, y) is the joint probability density function of X and Y.
  • p(x) and p(y) are the marginal probability density functions of X and Y, respectively.
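For reference, the textbook definitions of mutual information, written with the symbols named above, are given below. They are the standard information-theoretic forms and are assumed, not confirmed, to match the equations intended here; the base of the logarithm only changes the unit (bits for base 2, nats for the natural logarithm).

```latex
% Discrete random variables X and Y
MI(X;Y) = \sum_{i} \sum_{j} p(x_i, y_j) \, \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}

% Continuous random variables X and Y
I(X;Y) = \int \!\! \int p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy
```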
  • Table 2 shows the relative frequencies of the random variable combinations. Therefore, X 0 , X 1 , X 2 , and X 3 are ratios of AB, A′B, AB ′, and A′B ′, respectively. Table 3 shows the frequency itself obtained by multiplying the relative frequency by N.
  • the mutual information MI is defined as follows.
  • the logarithm is a natural logarithm.
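For a 2 × 2 table, mutual information can be computed directly from the relative frequencies of the four combinations. The sketch below assumes the standard plug-in definition (each cell's relative frequency compared with the product of its marginals) and a hypothetical function name; it uses the natural logarithm by default and can also report bits via `base=2`.

```python
import math


def mutual_information(a: int, b: int, c: int, d: int, base: float = math.e) -> float:
    """Plug-in mutual information of a 2 x 2 table with cell counts a, b, c, d."""
    n = a + b + c + d
    # Each entry: (cell count, its row total, its column total).
    cells = [
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    ]
    mi = 0.0
    for count, row_total, col_total in cells:
        if count == 0:
            continue  # a zero cell contributes nothing (the limit of x * log x is 0)
        # p_ij / (p_i * p_j) simplifies to count * n / (row_total * col_total).
        mi += (count / n) * math.log(count * n / (row_total * col_total), base)
    return mi
```

A table with independent events, such as 25/25/25/25, gives 0, and a perfectly dependent balanced table, such as 50/0/0/50, gives ln 2 nats, that is, 1 bit.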
  • The present inventors have found that the mutual information MI between events is approximately equal to a constant multiple of the -log10 P value obtained by logarithmically converting Fisher's exact probability P.
  • N indicates the number of samples.
  • N which is the number of samples, is preferably 100 or more, more preferably 500 or more, and still more preferably 1,000 or more.
  • Fisher's exact probability P is conventionally used mostly when the number of samples is small, that is, when N is small.
  • The present invention obtains an excellent effect by using Fisher's exact probability P for the analysis of data having a large number of samples, and is epoch-making in this respect. Further, the conventional calculation of mutual information is performed without considering the number of samples N and lacks consideration of statistical significance. For example, a mutual information amount calculated from data on 10 cases has only 10^-100 of the statistical significance of a mutual information amount based on data on 1,000 cases, yet the conventional method of calculating mutual information did not distinguish between the two.
  • The mutual information calculation method using -log10 P / (N log10 2) obtains the mutual information approximately using the number of samples N, so the mutual information can be calculated with the weight of the data taken into account, which is epoch-making.
  • The value of -log10 P / (N log10 2) calculated as described above approximates the mutual information amount of the first event and the second event, and by using it, the interdependency of the first event and the second event can be identified.
  • The interdependency between the first event and the second event may be specified by evaluating the value of -log10 P / (N log10 2) calculated as above itself.
  • Alternatively, the same calculation of -log10 P / (N log10 2) performed for the first event and the second event may be carried out for the first event and a third event different from the second event, and the value obtained for the first and third events may be compared with the value calculated for the first and second events.
  • the third event may have a known interdependency with the first event.
  • An example of the known interdependence is that experimental results already exist that support the degree or meaning of interdependence.
  • While -log10 P / (N log10 2) corresponds to the mutual information amount itself, -(log10 P) / N may be calculated instead.
  • -(log10 P) / N is a numerical value indicating the degree of interdependence, and degrees of interdependence can be compared using this value. The higher the value, the stronger the interdependence can be judged to be.
  • a first event the -log 10 P / may be compared with the value of (Nlog 10 2) calculated for 2 events.
  • For a plurality of events that differ from one another, a list ranking the events according to the magnitude of the value of -log10 P / (N log10 2) calculated with respect to the first event may be created, and the nature of the first event may be specified based on the list. In specifying the nature of the first event based on the list, the nature of the events included in the list may be considered.
  • The list can also be created by ranking according to the magnitude of -log10 P / N without calculating -log10 P / (N log10 2).
  • The number of events for which the value of -log10 P / (N log10 2) is calculated is, for example, the total number of events having the same properties as the first event and the second event.
  • For example, when the first event and the second event each relate to a human gene, the number of events for which the value of -log10 P / (N log10 2) is calculated is the total number of human genes, about 20,000.
  • The number of events included in the list is, for example, 50% or less, 20% or less, or 10% or less of the total number of events having the same nature as the first event and the second event.
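The ranked list described above can be sketched as a sort on -log10 P / N. The function name `rank_events`, the `top_fraction` parameter, and the gene labels in the usage example are hypothetical, and the input is assumed to map each candidate event to the (P, N) pair obtained against the first event.

```python
import math
from typing import Dict, List, Tuple


def rank_events(results: Dict[str, Tuple[float, int]],
                top_fraction: float = 1.0) -> List[Tuple[str, float]]:
    """Rank events by -log10(P) / N, highest first, keeping the top fraction of the list."""
    scored = [(name, -math.log10(p) / n) for name, (p, n) in results.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(scored) * top_fraction) if scored else 0
    return scored[:keep]


# Hypothetical values: each entry is (Fisher's exact probability P, sample count N).
ranked = rank_events({
    "gene_B": (1e-12, 200),
    "gene_C": (1e-3, 200),
    "gene_D": (1e-30, 200),
}, top_fraction=0.5)
```

Dividing by log10 2 as well would not change the order, so the cheaper -log10 P / N is sufficient for ranking.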
  • Examples of interdependencies that can be identified include those related to the molecular cellular function, physiological function, disease relevance, and biological pathways of the gene, as well as interactions between cell surface molecules, metabolic pathways, molecular functional pathways, and drug targeting. Examples of disease relevance include the onset and progression of cancer, immune allergic diseases, neuropsychiatric disorders, and congenital abnormalities.
  • the sample to be used is derived from a patient suffering from cancer
  • genes not related to cancer include genes related to the nervous system, immune system, metabolism, and endocrine.
  • the sample used is derived from a patient who does not suffer from cancer
  • Using the interdependency specified in the present invention, it is possible to specify a target molecule or a drug for a disease.
  • When the events relate to words, examples of the interdependence to be specified include the meaning of the word.
  • The Fisher's exact probability P used in the calculation of -log10 P / (N log10 2) in the method of the present invention may be calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including Fisher's exact probability P1 and Fisher's exact probability P2.
  • Fisher's exact probability P1 is calculated based on a 2 × 2 contingency table that aggregates the number of samples from a data set containing binary data for the first event and binary data for the second event, the data set being acquired, based on the first criterion for the first event and the first criterion for the second event, from data including the information of the first event and the information of the second event for N1 samples.
  • Fisher's exact probability P2 is calculated based on a 2 × 2 contingency table that aggregates the number of samples from a data set containing binary data for the first event and binary data for the second event, the data set being acquired, based on the second criterion for the first event and the second criterion for the second event, from data including the information of the first event and the information of the second event for N2 samples.
  • The N1 samples are preferably N1 subjects that have a common property and give observation results about the events, and the N2 samples are preferably N2 subjects that have a common property and give observation results about the events.
  • The property common to the N1 subjects and the property common to the N2 subjects need not match completely.
  • For example, the property common to the N1 subjects may be that they are derived from human breast cancer, and the property common to the N2 subjects may be that they are derived from human lung cancer.
  • In this case, the N samples including the N1 samples and the N2 samples share the common property of being derived from human cancer.
  • For the N1 samples, a data set including binary data for the first event and binary data for the second event is obtained based on the first criterion for the first event and the first criterion for the second event.
  • For the N2 samples, a data set including binary data for the first event and binary data for the second event is obtained based on the second criterion for the first event and the second criterion for the second event.
  • The acquisition of these data sets can be done as described above, except that it is based on the first criterion for the first event and the first criterion for the second event, or on the second criterion for the first event and the second criterion for the second event.
  • The first criterion for the first event and the first criterion for the second event are the criteria for acquiring, for the N1 samples, the binary data for the first event and the binary data for the second event, respectively.
  • The second criterion for the first event and the second criterion for the second event are the criteria for acquiring, for the N2 samples, the binary data for the first event and the binary data for the second event, respectively.
  • Examples of the criterion include presence or absence when the event is represented by presence or absence, and a reference value for classifying the numerical value when the event is represented by a numerical value.
  • With a reference value, a numerical value can be converted into binary data depending on whether it is equal to or higher than the reference value or less than the reference value.
  • the first criterion for the first event and the second criterion for the first event may be the same or different.
  • The reference value serving as the first criterion and the reference value serving as the second criterion may be the same numerical value or different numerical values.
  • The first criterion for the first event and the first criterion for the second event may be the same or different, and the second criterion for the first event and the second criterion for the second event may be the same or different.
  • The reference value serving as the first criterion for the first event and the reference value serving as the first criterion for the second event may be the same numerical value or different numerical values.
  • Because data represented by a discrete quantity exceeding 2 and data represented by a continuous quantity are converted into binary data, the method can be used to statistically process a wide variety of data, regardless of whether the data underlying the data set are discrete, continuous, or binary, and regardless of whether the samples are heterogeneous or homogeneous, so that analysis results based on a wide range of data can be obtained.
  • For the N1 samples, a 2 × 2 contingency table can be obtained in which the number of samples is tabulated according to the first criterion for the first event and the first criterion for the second event.
  • For the N2 samples, a 2 × 2 contingency table can be obtained in which the number of samples is tabulated according to the second criterion for the first event and the second criterion for the second event.
  • Fisher's exact probability P1 can be calculated from the 2 × 2 contingency table obtained for the N1 samples in the same manner as the calculation of Fisher's exact probability P described above.
  • Fisher's exact probability P2 can be calculated from the 2 × 2 contingency table obtained for the N2 samples in the same manner as the calculation of Fisher's exact probability P described above.
  • The Fisher's exact probability P used in the present invention may be one calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including Fisher's exact probability P1 and Fisher's exact probability P2.
  • The plurality of Fisher's exact probabilities includes Fisher's exact probability P1 and Fisher's exact probability P2; their number is, for example, 2, but may exceed this number.
  • In addition to Fisher's exact probabilities P1 and P2, a Fisher's exact probability Pn calculated by a similar method may be included.
  • The number of Fisher's exact probabilities to be integrated using meta-analysis is not particularly limited, and is, for example, 2 to 100.
  • Z_overall, which is the sum of the Z values divided by the square root of the number (k) of probabilities to be integrated, follows a normal distribution.
  • The method of the present invention is suitable for implementation by a computer.
  • The above method may be performed by a computer program for executing this method.
  • Examples of the computer program include a program for causing a computer to function as means for performing each step of the above-described method.
  • Examples of the computer program include a program for causing a computer to function as: (1) means for performing a step of obtaining data including information on the first event and information on the second event for N samples; (2) means for performing a step of obtaining, from the data including the information on the first event and the information on the second event for the N samples, a data set including binary data for the first event and binary data for the second event; (3) means for performing a step of determining, based on the criterion for the first event and the criterion for the second event, which type of the 2 × 2 contingency table each of the N samples corresponds to, and classifying each sample into the respective type; (4) each of the N samples is classified into each type, this is repeated for all N samples, and the number of samples classified into each type is totaled to obtain 2 × 2 from the data set.
  • The program can be executed by having the computer read it so that the hardware resources of the computer and the loaded software operate in coordination.
  • Examples of the hardware resources include arithmetic means such as a CPU and storage means such as a memory.
  • The computer program may be stored in a recording medium.
  • Examples of the recording medium include optically readable media such as CD-ROM and DVD, and information storage media such as semiconductor memory, flexible disks, and hard disks.
  • Example 1 Data on 1019 breast invasive carcinoma (BRCA) patient samples were downloaded from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/). The data contained information about 20,000 genes. Each breast invasive carcinoma patient was classified into one of two types according to whether the mRNA expression of the target gene CLSTN3 (calsyntenin 3) was at least twice, or less than twice, that of the wild type. Each patient was likewise classified into two types for the mRNA expression of each of the remaining genes, according to whether it was at least twice, or less than twice, that of the wild type.
  • The number of breast invasive carcinoma patients was tabulated in a 2 × 2 contingency table for CLSTN3 (calsyntenin 3) and each of the remaining genes.
  • The mutual information with CLSTN3 (calsyntenin 3) was calculated for each gene using the defining formula for mutual information described above.
  • Fisher's exact probability p was calculated for each gene.
  • The calculated mutual information with CLSTN3 (calsyntenin 3) and the value of -log(p) obtained from Fisher's exact probability p were plotted on a graph.
  • Example 2 Data on samples of acute myeloid leukemia, bladder urothelial cancer, breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, renal cell carcinoma, renal papillary cell carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, prostate cancer, rectal adenocarcinoma, cutaneous melanoma, gastric adenocarcinoma, thyroid cancer, endometrial cancer, and cancer cell lines (CCLE) were downloaded from TCGA (http://cancergenome.nih.gov/). The CCLE data are not case data but data from 1021 established cancer cell lines. The data for each sample type comprised 66 to 1021 cases as samples and contained information about 20,000 genes.
  • For each of the remaining genes, the number of samples was tabulated in a 2 × 2 contingency table against EGFR in the same manner as in Example 1, and Fisher's exact probability P was calculated on that basis.
  • FIG. 2 shows the calculated values arranged in descending order. In addition, 2001 genes, consisting of the 2000 genes with the largest calculated values plus EGFR, were analyzed with Ingenuity Pathway Analysis (IPA) (registered trademark) analysis software from Qiagen. Table 5 below shows the top five canonical pathways in IPA.
  • The third-ranked predicted pathway was EGF signaling.
  • When Fisher's exact probabilities were integrated by meta-analysis, the interdependence between EGFR and each gene could be identified accurately.
  • Example 3 The same method as in Example 2 was carried out, except that RB1 (RB transcriptional corepressor 1), IFNG (interferon gamma), and GRM1 (glutamate metabotropic receptor 1) were each used as the target gene. The results for each target gene, arranged in descending order of the calculated value, are shown in FIGS.
  • Table 5 shows the top five canonical pathways in IPA (registered trademark).
  • Example 4 For one week of sales at store A of a supermarket chain, a purchase history of about 5000 samples is downloaded from the POS system. The data include the contents of each individual purchase. The 5000 samples are classified into two types according to whether or not a product in the “rice ball” category was purchased. The samples are likewise classified into two types for each of the other product categories (the number of product categories is about 300), according to whether or not a product in that category was purchased.
  • The samples are tabulated in a 2 × 2 contingency table for “rice ball” and each product category, and Fisher's exact probability P is calculated from the tabulation result. This is done for all of the about 200 product categories.
  • A product category with a high value obtained by this calculation is often purchased together with “rice ball”. For example, if the analysis shows that supermarket customers who purchase “rice ball” often purchase “cup miso soup” at the same time, sales can be increased by displaying the two adjacently.
  • Example 5 Stock price data for 2017 are downloaded for the stocks traded on the first section of the Tokyo Stock Exchange (about 2000 stocks). There were about 240 trading days in 2017, and each day is treated as one sample. Next, data on the dollar-yen exchange rate (the price of 1 dollar converted into yen) for 2017 are downloaded. Using the exchange rate data, each sample day is classified into one of two types according to whether the dollar-yen rate on that day is higher than the rate on the previous day. Next, using the stock price data, each day is classified into two types for each company's stock according to whether or not the price at the close of trading is higher than the price at the start of trading.
  • The samples are tabulated in a 2 × 2 contingency table for the movement of the dollar-yen exchange rate and the stock price of each company, and Fisher's exact probability P is calculated from the tabulation result. This is done for the stock prices of about 2000 stocks.
  • The calculated P values of the individual stocks are integrated for each industry, following the TSE industry classification, using the meta-analysis method in the same manner as in Example 2. Based on N_all, the total number of samples used in the integration, -log10 P_overall/(N_all log10 2) is calculated for each industry.
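The integration step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the examples: it assumes the Z-value integration mentioned above (Z_overall is the sum of the Z values divided by the square root of k), a one-sided conversion between P and Z, and hypothetical function names (`integrate_p_values`, `integrated_score`). It could be called once per industry in Example 5 or once per target gene in Example 2.

```python
from math import erfc, log10, sqrt
from statistics import NormalDist


def integrate_p_values(p_values):
    """Integrate k probabilities via Z_overall = sum(Z_i) / sqrt(k)."""
    std_normal = NormalDist()
    z_values = []
    for p in p_values:
        # Keep p strictly inside (0, 1) so the inverse CDF is defined.
        p = min(max(p, 1e-300), 1.0 - 1e-16)
        # One-sided conversion (an assumption): a small p gives a large positive Z.
        z_values.append(-std_normal.inv_cdf(p))
    z_overall = sum(z_values) / sqrt(len(z_values))
    # Upper-tail probability of Z_overall under the standard normal distribution.
    return 0.5 * erfc(z_overall / sqrt(2))


def integrated_score(p_values, sample_sizes):
    """Return -log10(P_overall) / (N_all * log10(2))."""
    p_overall = integrate_p_values(p_values)
    n_all = sum(sample_sizes)
    return -log10(p_overall) / (n_all * log10(2))


# Hypothetical inputs: three probabilities from data sets of different sizes.
score = integrated_score([0.01, 0.03, 0.20], [120, 80, 66])
```

The complementary error function is used for the tail probability because it stays accurate for large Z_overall. If P_overall underflows to zero, the logarithm is undefined, so a production version would need to work with log-probabilities instead.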

Abstract

[Problem] To specify efficiently, accurately and with statistical significance an interdependence between a plurality of phenomena indicated in data, even if various types of data acquired independently on the basis of different criteria are used comprehensively and uniformly over a large range. [Solution] This method for identifying interdependence between a first phenomenon and a second phenomenon is characterized by including a step of: acquiring a data set including binary data relating to the first phenomenon and binary data relating to the second phenomenon, from data including information of the first phenomenon and information of the second phenomenon for N samples; aggregating the number of samples in a 2x2 contingency table, from the acquired data set; calculating a Fisher exact probability P on the basis of the aggregated 2x2 contingency table; and calculating -log10 P/N on the basis of the calculated Fisher exact probability P and N.

Description

Method for identifying interdependence
The present invention relates to an information processing method for large-scale data, a computer program for executing the method, and a recording medium storing the program. Specifically, the present invention relates to a method for identifying interdependence between two events, a computer program for executing the method, and a recording medium storing the program.
With recent advances in computer technology, data are being collected by various means, and large amounts of data of different types have accumulated. Such large-scale data are expected to contain useful information, and if they are analyzed effectively, it should be possible to accurately characterize unknown events by identifying, with statistical significance, the relationships among the multiple events contained in the data. However, such large-scale data have often been acquired independently under various different conditions, and noise accompanying the data can reduce the accuracy of the analysis results. It has therefore not been easy to analyze such data efficiently by using them comprehensively and uniformly over a large range.
The mutual information between a plurality of events is used as a quantity that measures the interdependence between those events. By calculating the mutual information between a plurality of events, the interdependence between them can be identified, and this is expected to make it possible to characterize the events. The mutual information of X and Y is defined by the formula given later. As that defining formula shows, mutual information has conventionally not taken into account the number N of samples from which it is calculated, and has not been regarded as taking statistical significance into account. Moreover, as the defining formula shows, it has not been thought that mutual information could be calculated by combining data obtained under different conditions. Techniques for analyzing large amounts of data using mutual information are used to process a wide variety of information, including documents, audio, images, locations, life sciences, astronomy, finance, and sales. As an algorithm for analyzing biological data, ARACNE, for example, is known (Non-Patent Document 1).
Fisher's exact test is a statistical test used to analyze data classified into two categories, mainly when the sample size is small, and it has been used in various kinds of statistical processing (Non-Patent Documents 2 to 3). No relationship between Fisher's exact probability and mutual information has been known to date.
An object of the present invention is to identify, efficiently, accurately, and with statistical significance, the interdependence of a plurality of events indicated in data, even when various data acquired independently under different conditions are used comprehensively and uniformly over a large range.
As a result of intensive study, the present inventors found that mutual information can be calculated approximately by calculating -log10 P/(N log10 2) from Fisher's exact probability P, calculated from a 2 × 2 contingency table, and the number of samples N used to create that table. That is, the present inventors found that the mutual information between events can be calculated, and the interdependence between those events identified, by acquiring a data set including binary data from data covering N samples, creating a 2 × 2 contingency table from the data set, calculating Fisher's exact probability P from the table, and calculating -log10 P/(N log10 2) from N and P. Fisher's exact probability P is a concept that has been studied in probability theory, whereas mutual information is a concept that has been studied mainly in information theory, and the discovery that the two are related is groundbreaking. Because log10 2 is a constant, the interdependence between the events can also be identified by calculating -log10 P/N. In this specification, the calculation of -log10 P/N is, in a broad sense, taken to encompass the calculation of -log10 P/(N log10 2).
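The relationship described above can be checked numerically with a short Python sketch. It is an illustration under stated assumptions rather than the inventors' implementation: the function names are hypothetical, Fisher's exact probability is taken to be the two-sided p-value of Fisher's exact test (the text here does not say which variant is meant), and mutual information is computed in bits from the empirical cell frequencies of the same table.

```python
from math import comb, log, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact probability for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))


def interdependence_score(a, b, c, d):
    """-log10(P) / (N * log10(2)) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return -log10(fisher_exact_two_sided(a, b, c, d)) / (n * log10(2))


def mutual_information_bits(a, b, c, d):
    """Empirical mutual information (base 2) of the same 2x2 table."""
    n = a + b + c + d
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    # Each term is p(x, y) * log2(p(x, y) / (p(x) * p(y))); empty cells add 0.
    return sum((x / n) * log(x * n / (r * k), 2) for x, r, k in cells if x > 0)


# Hypothetical table: 100 samples with a strong association between the events.
score = interdependence_score(40, 10, 10, 40)
mi = mutual_information_bits(40, 10, 10, 40)
```

For a strongly associated table such as this one, `score` and `mi` come out close to each other, and for a table with no association both are close to zero, which is the behavior the paragraph above describes.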
That is, in a first aspect, the present invention provides a method for identifying interdependence between a first event and a second event, the method comprising a step of calculating -log10 P/N from N and Fisher's exact probability P, where P is calculated from a 2 × 2 contingency table that tabulates the number of samples from a data set including binary data for the first event and binary data for the second event, the data set being acquired from data including information on the first event and information on the second event for N samples.
Because Fisher's exact probability P is a statistical quantity, meta-analysis can be performed on it by conventionally known methods. With meta-analysis, a plurality of Fisher's exact probabilities P, each calculated from data acquired under different conditions, such as data on different types of samples, can be integrated to calculate a Fisher's exact probability P for the data as a whole. Therefore, in the above method for identifying interdependence, by calculating a Fisher's exact probability P from each set of data acquired under different conditions, integrating the calculated probabilities, and using the integrated Fisher's exact probability, the interdependence between events can be identified on the basis of all the data acquired under the different conditions. This makes possible what could not be done before: calculating the mutual information between events on the basis of all the data acquired under different conditions, and identifying the interdependence between those events. Because a plurality of Fisher's exact probabilities P can be integrated by meta-analysis even if the criteria used to tabulate their 2 × 2 contingency tables differ, the criteria for tabulating the 2 × 2 contingency tables from which the probabilities are calculated may differ.
Accordingly, in a second aspect, the present invention provides the method of the first aspect, wherein the Fisher's exact probability P is calculated by a method including a step of integrating, using meta-analysis, a plurality of Fisher's exact probabilities including: (1) Fisher's exact probability P1, calculated from a 2 × 2 contingency table that tabulates the number of samples from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a first criterion for the first event and a first criterion for the second event, from data including information on the first event and information on the second event for N1 samples; and (2) Fisher's exact probability P2, calculated from a 2 × 2 contingency table that tabulates the number of samples from a data set including binary data for the first event and binary data for the second event, the data set being acquired, based on a second criterion for the first event and a second criterion for the second event, from data including information on the first event and information on the second event for N2 samples.
In a third aspect, the present invention provides a computer program for executing the method according to the first aspect or the second aspect.
In a fourth aspect, the present invention provides a recording medium storing the computer program according to the third aspect.
According to the present invention, unlike conventional methods, which did not take the number of samples N into account and gave no consideration to statistical significance, the mutual information between events can be calculated as a value that reflects both Fisher's exact probability and the number of samples, so that the interdependence between the events can be identified with statistical significance taken into account. In addition, because the present invention calculates the mutual information between events using meta-analysis, even data on different types of samples acquired under different conditions can be combined, and the mutual information between events can be calculated for the combined data as a whole to identify the interdependence between the events. This makes it possible to identify the interdependence between events more accurately and with statistical significance on the basis of a large amount of data, while reducing bias due to the characteristics of the various samples contained in the overall data. Furthermore, in the present invention, if a significance level is applied to each Fisher's exact probability after it is calculated and, depending on the result, the probability is discarded and excluded from subsequent calculations, noise accompanying the various data is reduced and data of low significance are largely removed, so that the interdependence between events can be identified more accurately and with statistical significance while reducing the computational load.
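The optional filtering step can be sketched as follows. This is only an illustration: the function name `keep_significant`, the dictionary layout, the item names, and the default significance level of 0.05 are all assumptions, since the text does not fix a particular level.

```python
def keep_significant(p_by_item, alpha=0.05):
    """Keep only the items whose Fisher's exact probability is below alpha."""
    # Items at or above the significance level are discarded and are not
    # passed on to subsequent calculations.
    return {item: p for item, p in p_by_item.items() if p < alpha}


# Hypothetical probabilities for three items paired with one target event.
retained = keep_significant({"item_A": 0.0004, "item_B": 0.31, "item_C": 0.02})
```

Only the retained items would then be used to calculate -log10 P/N, which is where the reduction in computational load comes from.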
A graph showing the relationship between Fisher's exact probability p and mutual information MI.
A graph in which the mutual information with EGFR calculated for each gene is arranged from left to right in descending order. The vertical axis shows the mutual information with EGFR calculated for each gene, multiplied by N log10 2.
A graph in which the mutual information with RB1 calculated for each gene is arranged from left to right in descending order. The vertical axis shows the mutual information with RB1 calculated for each gene, multiplied by N log10 2.
A graph in which the mutual information with IFNG calculated for each gene is arranged from left to right in descending order. The vertical axis shows the mutual information with IFNG calculated for each gene, multiplied by N log10 2.
A graph in which the mutual information with GRM1 calculated for each gene is arranged from left to right in descending order. The vertical axis shows the mutual information with GRM1 calculated for each gene, multiplied by N log10 2.
The present invention provides a method for identifying interdependence between a first event and a second event. Examples of an event include a state of an object that is grasped as an observation result. Examples of the object include genes and words. Other examples of the object include those relating to documents, audio, images, locations, life sciences, astronomy, finance, and sales. Examples of the state include differing from the average properties of the object. Examples of the event include genetic changes, epigenetic changes, and rises and falls in stock prices. Other examples of the event include a plurality of words being used in the same sentence, and a sale including the sale of a specific product.
Examples of genetic changes include mutations in the gene sequence, changes in gene expression products, and changes in gene modification. Examples of mutations in the gene sequence include mutations in the base sequence of the gene, changes in the copy number of the gene on the chromosome, and changes in gene modification. Examples of mutations in the base sequence of a gene include point mutations, additions of a base sequence to the gene, and deletions of a base sequence in the gene. Examples of gene expression products include proteins, mRNA, and miRNA (micro-RNA). Examples of changes in gene expression products include changes in the expression level of an expression product, changes in its site of expression, formation of complexes of expression products, and degradation of such complexes. Examples of gene modification include DNA methylation and histone modification. Examples of histone modification include acetylation, methylation, ubiquitination, phosphorylation, and SUMOylation. Examples of gene modification also include post-translational modification. Examples of post-translational modification include addition of functional groups, addition of proteins or peptides, conversion of the chemical nature of amino acids, and structural conversion. Examples of functional group addition include acylation, acetylation, alkylation, amidation, biotinylation, formylation, gamma-carboxylation, glutamylation, glycosylation, glycylation, heme, hydroxylation, iodination, isoprenylation, lipoylation (prenylation, GPI anchor formation, myristoylation, farnesylation, geranylgeranylation, etc.), covalent attachment of nucleotides or their derivatives (ADP-ribosylation, FAD attachment, etc.), redox reactions, polyethylene glycolation, phosphatidylinositol, phosphopantetheinylation, phosphorylation, pyroglutamate formation, racemization, tyrosine sulfation, and selenoylation. Examples of protein or peptide addition include ISGylation, SUMOylation, and ubiquitination. Examples of conversion of the chemical nature of amino acids include citrullination or deamination, and deamidation. Examples of structural conversion include conversion by disulfide bonds and by proteases.
Examples of genes include genes of mammals such as humans, monkeys, mice, and rats. Examples of epigenetic changes include changes that are inherited through cell division and are independent of changes in the DNA base sequence.
In "first event" and "second event", the terms "first" and "second" are labels for distinguishing the two events and do not limit their order. The first event and the second event may be the same state of different objects, or different states of the same object. For example, the first event may be a mutation in the base sequence of gene A, and the second event may be a mutation in the base sequence of gene B. Alternatively, for example, the first event may be a mutation in the sequence of gene A, and the second event may be a change in the expression level of an expression product of gene A. Here, gene A and gene B denote different genes.
Examples of events include those expressed as presence or absence and those expressed as numerical values. Examples of those expressed as numerical values include those expressed as discrete quantities exceeding 2 and those expressed as continuous quantities. The first event and the second event may be expressed in different ways; for example, the first event may be expressed as presence or absence, and the second event may be expressed as a discrete quantity exceeding 2.
In the present invention, data including information on the first event and information on the second event for N samples are used. The N samples are, for example, N subjects that share a common property and each give an observation result for the events. Examples of N include values of 10 or more, 100 or more, 1,000 or more, 10,000 or more, and 100,000 or more. The larger N is, the more accurately the interdependence between the first event and the second event can be identified. Examples of the common property include being derived from a living organism, from a human, from a human with a disease, from a human with cancer, or from a human with a specific type of cancer. Examples of the subjects include cells, organs, and other biological samples of organisms such as humans.
Examples of specific types of cancer include leukemia, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma, multiple myeloma, brain tumor, breast cancer, endometrial cancer, cervical cancer, ovarian cancer, esophageal cancer, gastric cancer, appendiceal cancer, colorectal cancer, liver cancer, hepatocellular carcinoma, gallbladder cancer, bile duct cancer, pancreatic cancer, adrenal cancer, gastrointestinal stromal tumor, mesothelioma, head and neck cancer, laryngeal cancer, oral cancer, floor-of-mouth cancer, gingival cancer, tongue cancer, buccal mucosa cancer, salivary gland cancer, paranasal sinus cancer, maxillary sinus cancer, frontal sinus cancer, ethmoid sinus cancer, sphenoid sinus cancer, thyroid cancer, kidney cancer, lung cancer, osteosarcoma, prostate cancer, testicular tumor (testicular cancer), renal cell carcinoma, bladder cancer, rhabdomyosarcoma, skin cancer, and anal cancer.
In organisms affected by a disease, particularly cancer, interactions between genes are amplified. Cells, organs, and other biological samples derived from such organisms are therefore well suited as samples for identifying the interdependence of different genes.
The data used in the present invention include information on the first event and information on the second event for N samples. In such data, for example, each of the N samples carries both information on the first event and information on the second event. Examples of event information are as follows: (1) when the event is expressed as presence or absence, whether or not the event occurred in that sample; (2) when the event is expressed numerically, the numerical value for that sample.
In the present invention, a data set containing binary data for the first event and binary data for the second event is obtained from the data that include information on the first event and information on the second event for the N samples. Examples of binary data for an event include presence/absence data when the event is expressed as presence or absence, and data indicating whether a value is at or above a reference value or below it when the event is expressed numerically. A data set containing binary data for an event can be obtained, for example, as follows: (1) when the event is expressed as presence or absence, the event information contained in the data is used as is; (2) when the event is expressed numerically, a reference value is set, the event information for each sample is judged to be at or above the reference value or below it, the judgment is recorded as binary data, and this is repeated for the N samples. A data set containing binary data for both events can be obtained, for example, by (1) applying the above method to the information on the first event to obtain binary data for the first event, (2) applying the above method to the information on the second event to obtain binary data for the second event, and (3) combining the two sets of binary data. The resulting data set may, for example, be in a format that uses a linear index.
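The binarization step described above can be sketched as follows. This is a minimal illustration only: the function name `binarize`, the sample values, and the reference values are hypothetical and do not come from the specification.

```python
from typing import Sequence


def binarize(values: Sequence[float], threshold: float) -> list[int]:
    """Return 1 where a value is at or above the reference value, else 0."""
    return [1 if v >= threshold else 0 for v in values]


# Hypothetical numerical event information for five samples.
expr_a = [0.2, 3.1, 5.4, 0.9, 4.8]   # first event, e.g. an expression level
copy_b = [2, 2, 4, 1, 3]             # second event, e.g. a discrete quantity

bin_a = binarize(expr_a, threshold=3.0)  # reference value chosen for illustration
bin_b = binarize(copy_b, threshold=3)

# Combine the two binary vectors into one data set, one pair per sample.
dataset = list(zip(bin_a, bin_b))
print(dataset)
```

An event that is already expressed as presence or absence would skip `binarize` and be used directly as a 0/1 vector.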
Because the method of the present invention uses the data set obtained above, containing binary data for the first event and binary data for the second event, it can be applied regardless of the type of event: events expressed as presence or absence, numerically expressed events, events expressed as a discrete quantity with more than two values, and events expressed as a continuous quantity. The method is therefore suited to repeated application across multiple events. Even when applied repeatedly to multiple events, the method uses the same algorithm each time, so a unified analysis can be carried out simply.
The functions of genes in a living body are diverse, the parameters that specify the state of each gene are diverse, and each parameter may take continuous or discrete values. It has therefore not been easy to use data containing information on various genes in a unified way to identify the interdependence of those genes. The method of the present invention can be used regardless of the type of information available for each gene, and even when applied repeatedly to various genes it relies on a common procedure, so a unified analysis can be carried out simply. The method is therefore suited to using data containing information on multiple genes in a unified way to identify the interdependence of those genes.
In the present invention, the numbers of samples are tallied into a 2×2 contingency table from the data set containing binary data for the first event and binary data for the second event. When the binary data for both events are expressed as presence or absence, for example, the tallying can be done by counting a, b, c, and d, the numbers of samples that satisfy the condition of each cell in Table 1 below. The sum of a to d is N, the number of samples contained in the data set.

Table 1
                        Second event present    Second event absent
First event present     a                       b
First event absent      c                       d
When tallying into a 2×2 contingency table, an actual table need not be used as long as the sample counts a, b, c, and d for the above conditions are obtained. For example, the following conditions may be set: (1) the first event is present and the second event is present; (2) the first event is present and the second event is absent; (3) the first event is absent and the second event is present; and (4) the first event is absent and the second event is absent. Each of the N samples is judged against conditions (1) to (4) and assigned to the condition it satisfies, this is repeated for all N samples, and the number of samples assigned to each condition is counted, giving the sample counts a, b, c, and d for conditions (1) to (4), respectively. In this case, among all N samples, (1) a is the number of samples in which the first event is present and the second event is present, (2) b is the number in which the first event is present and the second event is absent, (3) c is the number in which the first event is absent and the second event is present, and (4) d is the number in which both events are absent.
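The tallying of a, b, c, and d without an explicit table can be sketched as below. The function name `count_2x2` and the example vectors are hypothetical; the four branches correspond to conditions (1) to (4) in the text.

```python
def count_2x2(first, second):
    """Tally samples into the four cells (a, b, c, d) of a 2x2 table.

    `first` and `second` are equal-length sequences of 0/1 flags,
    one entry per sample.
    """
    a = b = c = d = 0
    for x, y in zip(first, second):
        if x and y:
            a += 1      # (1) first present, second present
        elif x:
            b += 1      # (2) first present, second absent
        elif y:
            c += 1      # (3) first absent, second present
        else:
            d += 1      # (4) both absent
    return a, b, c, d


# Hypothetical binary data for five samples.
a, b, c, d = count_2x2([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(a, b, c, d)
```

The four counts always sum to N, which gives a cheap consistency check on the input.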
In the present invention, Fisher's exact probability P is calculated from the 2×2 contingency table in which the numbers of samples have been tallied. To calculate P, p is first calculated from a, b, c, d, and N above using the following formula.

$$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$$
Next, all data sets with the same row and column totals that are less likely to occur than the observed data set tallied in the 2×2 contingency table of Table 1 are considered. For each such data set, the numbers of samples are likewise tallied in a 2×2 contingency table and p is calculated using the formula above. Fisher's exact probability P is obtained by summing all the calculated values of p, including that of the observed table.
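A minimal sketch of this calculation is given below, assuming the two-sided reading in which every table with the same marginal totals that is no more likely than the observed one contributes to P. The function names are hypothetical, and the small relative tolerance used when comparing probabilities is an implementation choice, not part of the specification.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability p of one 2x2 table with fixed margins.

    Equals (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact_p(a, b, c, d):
    """Sum p over all tables with the same margins that are no more
    likely than the observed table (two-sided Fisher exact probability)."""
    row1, row2, col1 = a + b, c + d, a + c
    p_obs = table_probability(a, b, c, d)
    total = 0.0
    # x plays the role of cell a; the other cells follow from the margins.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - col1 + x)
        if p <= p_obs * (1 + 1e-9):
            total += p
    return total


P = fisher_exact_p(3, 1, 1, 3)
print(P)
```

For very large N the value of P can underflow a floating-point number, in which case working with logarithms of factorials is the usual alternative.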
In the present invention, -log10 P/(N log10 2) is calculated from the Fisher's exact probability P obtained above and from N. The calculation of -log10 P/(N log10 2) from P and N may be performed, for example, using a computer.
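As a small illustration, the quantity can be computed from P and N as follows. The function name `mi_score` and the example inputs are hypothetical.

```python
from math import log10


def mi_score(p_value: float, n: int) -> float:
    """Return -log10(P) / (N * log10(2)) for a Fisher exact probability P
    obtained from N samples."""
    return -log10(p_value) / (n * log10(2))


# Hypothetical inputs: P = 1e-30 obtained from N = 1000 samples.
print(mi_score(1e-30, 1000))
```

Dividing by log10 2 expresses the result in bits, so a value of 1 would correspond to P = 2^-N.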
According to the inventors' findings, -log10 P/(N log10 2) approximates the mutual information between the first event and the second event. Mutual information is a quantity used in information theory as a measure of the mutual dependence of two random variables; it measures the amount of information shared by X and Y. The mutual information MI of two discrete random variables X and Y is defined, for example, by the following formula.

$$\mathrm{MI}(X;Y) = \sum_{i}\sum_{j} p(x_i, y_j)\,\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$$
In the formula above, p(x_i, y_j) is the joint probability distribution function of X and Y, and p(x_i) and p(y_j) are the marginal probability distribution functions of X and Y, respectively.
The mutual information I(X;Y) of two continuous random variables X and Y is defined, for example, by the following formula.

$$I(X;Y) = \int_{Y}\int_{X} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$
In the formula above, p(x, y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively.
These formulas mean that mutual information is obtained as an expectation over the joint distribution of the two variables: the logarithm of the ratio of the joint probability to the product of the marginal probabilities is weighted by the joint probability and summed (or integrated) over the entire range of possible values.
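The discrete definition can be illustrated with a short sketch. The function name `mutual_information` and the example distributions are hypothetical; the logarithm is taken to base 2 here so that the result is in bits, which is one common convention.

```python
from math import log2


def mutual_information(joint):
    """Mutual information in bits of a discrete joint distribution.

    `joint[i][j]` is p(x_i, y_j); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                         # 0 * log(0) is taken as 0
                mi += pxy * log2(pxy / (px[i] * py[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(dependent))
```

Independent variables share no information, while two perfectly coupled binary variables share one full bit.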
The inventors' findings on the relationship between the mutual information MI between the first event and the second event and -log10 P/(N log10 2) are set out below. First, consider the contingency tables in Table 2 and Table 3 for two random variables A and B, which take the two values A and A', and B and B', respectively.

Table 2 (relative frequencies)
        A       A'
B       X0      X1
B'      X2      X3

Table 3 (frequencies)
        A       A'
B       N·X0    N·X1
B'      N·X2    N·X3
Table 2 shows the relative frequencies of the combinations of the random variables. Thus X0, X1, X2, and X3 are the proportions of AB, A'B, AB', and A'B', respectively. Table 3 shows the frequencies themselves, obtained by multiplying the relative frequencies by N.
The mutual information MI is then defined as follows, where the logarithm is the natural logarithm.

$$\mathrm{MI} = X_0\log\frac{X_0}{(X_0+X_1)(X_0+X_2)} + X_1\log\frac{X_1}{(X_0+X_1)(X_1+X_3)} + X_2\log\frac{X_2}{(X_2+X_3)(X_0+X_2)} + X_3\log\frac{X_3}{(X_2+X_3)(X_1+X_3)}$$
On the other hand, the leading term of the p-value of Fisher's exact test is as follows.

$$P \approx \frac{(NX_0+NX_1)!\,(NX_2+NX_3)!\,(NX_0+NX_2)!\,(NX_1+NX_3)!}{N!\,(NX_0)!\,(NX_1)!\,(NX_2)!\,(NX_3)!}$$
Taking the logarithm of both sides gives

$$\log P \approx \log(NX_0+NX_1)! + \log(NX_2+NX_3)! + \log(NX_0+NX_2)! + \log(NX_1+NX_3)! - \log N! - \sum_{i=0}^{3}\log(NX_i)!$$
Using Stirling's formula to approximate log N! by (N log N - N), together with X0 + X1 + X2 + X3 = 1, gives

$$\log P \approx N\Big[(X_0+X_1)\log(X_0+X_1) + (X_2+X_3)\log(X_2+X_3) + (X_0+X_2)\log(X_0+X_2) + (X_1+X_3)\log(X_1+X_3) - \sum_{i=0}^{3} X_i\log X_i\Big] = -N\,\mathrm{MI}$$
Therefore,

$$\mathrm{MI} \approx -\frac{\log P}{N}$$
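The derivation above uses natural logarithms, whereas the quantity used in the method is written with base-10 logarithms and a factor of log10 2. The following worked step, added here as an explanatory sketch, shows that the two agree once the mutual information is expressed in bits. The subscripts "nats" and "bits" are notation introduced for this illustration only.

```latex
\mathrm{MI}_{\text{bits}}
  = \frac{\mathrm{MI}_{\text{nats}}}{\ln 2}
  \approx -\frac{\ln P}{N \ln 2}
  = -\frac{\log_{2} P}{N}
  = -\frac{\log_{10} P}{N \log_{10} 2}
```

The last equality is the change-of-base identity log2 P = log10 P / log10 2.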
As shown above, the inventors found that the mutual information MI between events is approximately equal to a constant multiple of -log10 P, the logarithmic transform of Fisher's exact probability P. Here, N is the number of samples, and the two sides approach the same value as N → ∞.
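The convergence can be checked numerically with a sketch such as the one below. It assumes a two-sided Fisher exact probability evaluated in log space (via `math.lgamma`) so that very small P does not underflow; the function names and the example table proportions (3 : 2 : 2 : 3) are hypothetical.

```python
from math import exp, lgamma, log, log2


def log_table_prob(a, b, c, d):
    """Natural log of the probability of one 2x2 table with fixed margins."""
    def lf(k):
        return lgamma(k + 1)  # log(k!)
    n = a + b + c + d
    return (lf(a + b) + lf(c + d) + lf(a + c) + lf(b + d)
            - lf(n) - lf(a) - lf(b) - lf(c) - lf(d))


def log_fisher_p(a, b, c, d):
    """Natural log of the two-sided Fisher exact probability P."""
    row1, row2, col1 = a + b, c + d, a + c
    log_obs = log_table_prob(a, b, c, d)
    logs = []
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_table_prob(x, row1 - x, col1 - x, row2 - col1 + x)
        if lp <= log_obs + 1e-9:  # tables no more likely than the observed one
            logs.append(lp)
    m = max(logs)
    return m + log(sum(exp(v - m) for v in logs))  # log-sum-exp


def score_bits(a, b, c, d):
    """-log10 P / (N log10 2), computed as -ln P / (N ln 2)."""
    n = a + b + c + d
    return -log_fisher_p(a, b, c, d) / (n * log(2))


def mi_bits(a, b, c, d):
    """Mutual information in bits of the empirical 2x2 distribution."""
    n = a + b + c + d
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    return sum((k / n) * log2(k * n / (r * s)) for k, r, s in cells if k > 0)


base = (3, 2, 2, 3)  # fixed cell proportions, scaled to N = 100, 1000, 10000
gaps = []
for scale in (10, 100, 1000):
    table = tuple(scale * k for k in base)
    gaps.append(score_bits(*table) - mi_bits(*table))
    print(sum(table), mi_bits(*table), score_bits(*table))
```

With the cell proportions held fixed, the mutual information stays constant while the gap between it and the score shrinks as N grows.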
Furthermore, as shown in the Examples below, the inventors found that when the number of samples is 1019, mutual information is approximated sufficiently well by -log10 P/(N log10 2), and that when N is of this order, the interdependence between the first event and the second event can be identified accurately by using -log10 P/(N log10 2). Accordingly, in the present invention, the number of samples N is preferably 100 or more, more preferably 500 or more, and still more preferably 1,000 or more. Conventionally, Fisher's exact probability P has mostly been used when the number of samples is small, that is, when N is small. The present invention applies Fisher's exact probability P to the analysis of data with a large number of samples and obtains an excellent effect, which is groundbreaking. In addition, conventional calculations of mutual information did not take the number of samples N into account and therefore lacked any consideration of statistical significance. For example, mutual information calculated from only 10 cases has a statistical significance of only 10^-100 relative to mutual information based on 1,000 cases, yet conventional methods of calculating mutual information did not distinguish between the two. The method of the present invention, which uses -log10 P/(N log10 2), obtains the mutual information approximately using the number of samples N and can therefore calculate mutual information in a way that reflects the weight of the data, which is groundbreaking.
Thus, -log10 P/(N log10 2) calculated as described above approximates the mutual information between the first event and the second event, and can be used to identify the interdependence between them. The interdependence may be identified by evaluating the value of -log10 P/(N log10 2) itself. Alternatively, the same calculation performed for the first event and the second event may be performed for a third event, different from the second event, in place of the second event, and the resulting value of -log10 P/(N log10 2) for the first and third events may be compared with the value for the first and second events. The third event may be one whose interdependence with the first event is already known, for example because experimental results supporting the degree or meaning of the interdependence already exist. In identifying the interdependence, -log10 P/(N log10 2), which approximates the mutual information itself, may be calculated, but -(log10 P)/N may be calculated instead. Because -(log10 P)/N differs from -log10 P/(N log10 2) only by the constant factor log10 2, it also indicates the degree of interdependence: it can be used to compare degrees of interdependence and to judge their strength, with a higher value indicating stronger interdependence. These methods allow the interdependence between the first event and the second event to be identified more accurately.
Similarly, for each of a plurality of mutually different events, the value of -log10 P/(N log10 2) with respect to the first event may be calculated, and the values calculated for the plurality of events may be compared with the value calculated for the first event and the second event. This allows the interdependence between the first event and the second event to be identified still more accurately.
Furthermore, a list may be created in which a plurality of mutually different events are ranked according to the magnitude of the value of -log10 P/(N log10 2) calculated for each of them with respect to the first event, and the nature of the first event may be identified on the basis of that list. In doing so, the nature of the events contained in the list may be taken into account. The list can also be created by ranking according to the magnitude of -(log10 P)/N, without calculating -log10 P/(N log10 2).
The number of events for which the value of -log10 P/(N log10 2) is calculated is, for example, the total number of events that share a common property with the first event and the second event. For example, when the first event and the second event both concern human genes, an example of the number of events is the total number of human genes, approximately 20,000. When the nature of the first event is identified on the basis of the above list, the number of events contained in the list may be, for example, 50% or less, 20% or less, or 10% or less of the total number of events that share a common property with the first event and the second event.
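The ranked list described above can be sketched as follows. The function name `rank_events`, the gene labels, and the score values are all hypothetical placeholders; in practice each score would be the value of -log10 P/(N log10 2) calculated between the first event and the listed event.

```python
def rank_events(scores, top_fraction=0.1):
    """Rank events by score (highest first) and keep the top fraction.

    `scores` maps an event label to its score with respect to the
    first event. At least one event is always kept.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


# Hypothetical scores of four events with respect to the first event.
scores = {"GENE_B": 0.12, "GENE_C": 0.45, "GENE_D": 0.03, "GENE_E": 0.30}
top = rank_events(scores, top_fraction=0.5)
print(top)
```

Because -(log10 P)/N differs from the full score only by a constant factor, ranking by either quantity yields the same order.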
When the events concern genes, examples of the interdependence to be identified include those relating to the molecular and cellular function, physiological function, disease association, and biological pathways of the genes, as well as those relating to interactions between cell-surface molecules, metabolic pathways, molecular function pathways, and drug targetability. Examples of disease association include association with the onset or progression of cancer, immune and allergic diseases, neuropsychiatric diseases, and congenital anomalies.
In the present invention, even when the samples used are derived from patients with cancer, the interdependence between genes that are not related to cancer can be identified. Examples of genes not related to cancer include genes related to the nervous system, the immune system, metabolism, and the endocrine system. Conversely, even when the samples used are derived from patients without cancer, the interdependence between genes related to cancer can be identified. The interdependence identified by the present invention can be used to identify target molecules and drugs for diseases, and to search for ligands of orphan receptors.
When the events concern words, for example when an event is the use of a specific word in a specific text, an example of the interdependence to be identified is the meaning of the word.
The Fisher's exact probability P used to calculate -log10 P/(N log10 2) in the above method of the present invention may be one calculated by a method that includes a step of integrating, by meta-analysis, a plurality of Fisher's exact probabilities including Fisher's exact probability P1 and Fisher's exact probability P2. Here, P1 is calculated from a 2×2 contingency table tallying the numbers of samples from a data set containing binary data for the first event and binary data for the second event, where the data set is obtained, on the basis of a first criterion for the first event and a first criterion for the second event, from data that include information on the first event and information on the second event for N1 samples. Likewise, P2 is calculated from a 2×2 contingency table tallying the numbers of samples from a data set containing binary data for the first event and binary data for the second event, where the data set is obtained, on the basis of a second criterion for the first event and a second criterion for the second event, from data that include information on the first event and information on the second event for N2 samples.
In calculating Fisher's exact probability P1, the data that include information on the first event and information on the second event for N1 samples can be obtained in the same manner as described above, apart from the difference between N and N1. In calculating Fisher's exact probability P2, the data for N2 samples can likewise be obtained in the same manner as described above, apart from the difference between N and N2. The sum of N1 and N2 does not exceed N; it may be equal to N or smaller than N. The N samples described above include the N1 samples and the N2 samples. The N1 samples are preferably N1 subjects that share a common property and each yield an observation of the events, and the N2 samples are preferably N2 subjects that share a common property and each yield an observation of the events. The property common to the N1 subjects and the property common to the N2 subjects need not coincide completely. For example, the property common to the N1 subjects may be derivation from human breast cancer, and the property common to the N2 subjects may be derivation from human lung cancer. Even in this case, the N samples, which include the N1 samples and the N2 samples, share the property of being derived from human cancer.
In calculating Fisher's exact probability P1, the data set containing binary data for the first event and binary data for the second event is obtained on the basis of the first criterion for the first event and the first criterion for the second event. In calculating Fisher's exact probability P2, the data set containing binary data for the first event and binary data for the second event is obtained on the basis of the second criterion for the first event and the second criterion for the second event.
These data sets can be obtained in the same manner as described above, except that they are based on the first criterion for the first event and the first criterion for the second event, or on the second criterion for the first event and the second criterion for the second event. The first criterion for the first event and the first criterion for the second event are the criteria used to obtain, for the N1 samples, the binary data for the first event and the binary data for the second event, respectively. The second criterion for the first event and the second criterion for the second event are the criteria used to obtain, for the N2 samples, the binary data for the first event and the binary data for the second event, respectively. When an event is expressed as presence or absence, an example of the criterion is presence or absence itself; when an event is expressed numerically, an example is a reference value that divides the values into those above and those below it. When a reference value is used, a value can be converted to binary data according to, for example, whether it is at or above the reference value or below it. The first criterion for the first event and the second criterion for the first event may be the same or different. For example, when the first event is expressed numerically, the reference value serving as the first criterion and the reference value serving as the second criterion may be the same number or different numbers. Likewise, the first criterion for the first event and the first criterion for the second event may be the same or different, and the second criterion for the first event and the second criterion for the second event may be the same or different. For example, when both the first event and the second event are expressed numerically, the reference value serving as the first criterion for the first event and the reference value serving as the first criterion for the second event may be the same number or different numbers.
Thus, in the present invention, data expressed as a discrete quantity with more than two values and data expressed as a continuous quantity are converted to binary data before use. As a result, various kinds of data can be processed statistically in a unified way, regardless of whether the underlying data are discrete, continuous, or binary, and regardless of whether the samples are heterogeneous or homogeneous, so that analysis results based on a wide range of data can be obtained.
By applying the same method as described above to the data set obtained in this way, containing binary data for the first event and binary data for the second event, a 2×2 contingency table can be obtained for the N1 samples in which the numbers of samples are tallied according to the first criterion for the first event and the first criterion for the second event. Similarly, a 2×2 contingency table can be obtained for the N2 samples in which the numbers of samples are tallied according to the second criterion for the first event and the second criterion for the second event. Fisher's exact probability P1 can be calculated from the 2×2 contingency table for the N1 samples in the same way as Fisher's exact probability P described above. Likewise, Fisher's exact probability P2 can be calculated from the 2×2 contingency table for the N2 samples in the same way as Fisher's exact probability P described above.
The Fisher's exact probability P used in the present invention may be one calculated by a method including a step of integrating, by meta-analysis, a plurality of Fisher's exact probabilities including Fisher's exact probability P_1 and Fisher's exact probability P_2. Here, the plurality of Fisher's exact probabilities includes P_1 and P_2; their number is, for example, 2, but may be greater. Besides P_1 and P_2, the plurality may include a Fisher's exact probability P_n calculated in the same manner. The number of Fisher's exact probabilities integrated by meta-analysis is not particularly limited and is, for example, 2 to 100.
Various methods are known for integration using meta-analysis. For example, Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage describes a method of calculating P_overall by integrating p values obtained under a plurality of different study conditions. Integration using meta-analysis can be performed as follows, for example, for the one-sided test in Fisher's exact test. First, each Fisher's exact probability to be integrated is denoted p_i and is converted into a Z value (z_i).
Figure JPOXMLDOC01-appb-M000012
Z_overall, which is the sum of the Z values divided by the square root of the number (k) of values integrated, follows a normal distribution.
\[ Z_{overall} = \frac{\sum_{i=1}^{k} z_i}{\sqrt{k}} \]
By obtaining p_overall, the integrated P value, from this Z_overall, the individual Fisher's exact probabilities can be integrated.
Figure JPOXMLDOC01-appb-M000014
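The three steps above can be sketched in code. The equation images for the first and third steps are not reproduced in this text, so the sketch assumes the conventional one-sided conversions: z_i is taken as the upper-tail standard normal quantile of p_i, and p_overall as the upper-tail probability of Z_overall. The function name `combine_one_sided_p` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def combine_one_sided_p(p_values):
    """Combine one-sided p values through their Z values.

    Each p value must lie strictly between 0 and 1, because the normal
    quantile is undefined at the end points.
    """
    k = len(p_values)
    # Assumed conversion: upper-tail quantile, written as -inv_cdf(p)
    # instead of inv_cdf(1 - p) to keep precision for very small p.
    z_values = [-_STD_NORMAL.inv_cdf(p) for p in p_values]
    # Sum of the Z values divided by the square root of k, as in the text.
    z_overall = sum(z_values) / sqrt(k)
    # Assumed conversion back: upper-tail probability of Z_overall.
    p_overall = _STD_NORMAL.cdf(-z_overall)
    return z_overall, p_overall


z_overall, p_overall = combine_one_sided_p([0.05, 0.05])
print(z_overall, p_overall)
```

Under these assumptions, two results that each reach p = 0.05 combine into a smaller overall p value, while a single p value is returned unchanged.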
Conventionally, mutual information has not been calculated by integrating data obtained under various conditions. In the present invention, as described above, using a Fisher's exact probability P integrated by meta-analysis makes it possible, for example, to combine data obtained under various conditions and to calculate mutual information from a wide range of data. This allows findings obtained under various different conditions (for example, different cell lineages or diverse intracellular and extracellular conditions) to be combined, and the interdependence between events to be identified more accurately without being affected by the bias of any particular condition.
Because the present invention allows large-scale data to be analyzed by a common method, the method of the present invention is well suited to implementation by computer. In the present invention, the above method may be carried out by a computer program for executing the method. Examples of such a computer program include a program for causing a computer to function as means for performing each step of the method described above.
Examples of the computer program include a program for causing a computer to function as:
(1) means for performing a step of acquiring data containing information on a first event and information on a second event for N samples;
(2) means for performing a step of obtaining, from the data containing the information on the first event and the information on the second event for the N samples, a data set containing binary data for the first event and binary data for the second event;
(3) means for performing a step of determining, for each of the N samples, which type of the 2 × 2 contingency table the sample falls under on the basis of a criterion for the first event and a criterion for the second event, and classifying the sample into that type;
(4) means for performing a step of repeating this classification for all N samples, counting the number of samples classified into each type, and thereby tallying the number of samples from the data set into the 2 × 2 contingency table;
(5) means for performing a step of calculating Fisher's exact probability P on the basis of the 2 × 2 contingency table in which the number of samples has been tallied; and
(6) means for performing a step of calculating -log_10 P / (N log_10 2) on the basis of the calculated Fisher's exact probability P and N.
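Steps (3) to (6) above can be sketched as follows, starting from data that have already been converted to 0/1 values. This is an illustrative outline only: the function names are hypothetical, and the use of the one-sided (upper-tail) form of Fisher's exact probability is an assumption, since the calculation itself is described elsewhere in the specification.

```python
from math import comb, log10


def tally_2x2(first, second):
    """Steps (3)-(4): classify every sample and count the four table cells."""
    a = sum(1 for x, y in zip(first, second) if x and y)          # both met
    b = sum(1 for x, y in zip(first, second) if x and not y)      # first only
    c = sum(1 for x, y in zip(first, second) if not x and y)      # second only
    d = sum(1 for x, y in zip(first, second) if not x and not y)  # neither
    return a, b, c, d


def fisher_exact_upper_tail(a, b, c, d):
    """Step (5): probability of a table at least as extreme in cell a,
    with all row and column totals fixed (assumed one-sided form)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return favourable / comb(n, col1)


def normalized_score(p, n):
    """Step (6): -log_10 P / (N log_10 2). Requires p > 0."""
    return -log10(p) / (n * log10(2))


first_event = [1, 1, 1, 0, 0, 0, 1, 0]
second_event = [1, 1, 1, 0, 0, 0, 0, 1]
table = tally_2x2(first_event, second_event)
p = fisher_exact_upper_tail(*table)
score = normalized_score(p, sum(table))
print(table, p, score)
```

Exact integer arithmetic with `math.comb` avoids factorial overflow; for very large N the probability can still underflow to zero as a float, in which case the logarithm would have to be accumulated directly.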
The program can be executed by loading it into a computer and having the computer's hardware resources and the loaded software operate in cooperation. Examples of the hardware resources include arithmetic means such as a CPU and storage means such as a memory.
The computer program may be stored on a recording medium. Examples of the recording medium include optically read media such as CD-ROMs and DVDs, and information storage means such as semiconductor memories, flexible disks, and hard disks.
Example 1:
Data for breast invasive carcinoma patients (BRCA), comprising 1019 samples, were downloaded from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/) in the United States. The data contained information on approximately 20,000 genes. With CLSTN3 (Calsyntenin 3) as the target gene, each patient was classified into one of two types according to whether CLSTN3 mRNA expression was more than 2-fold, or 2-fold or less, relative to the wild type. Similarly, for the mRNA expression of each of the remaining genes, each patient was classified into one of two types according to whether expression was more than 2-fold, or 2-fold or less, relative to the wild type. Based on the classified data and the above criteria, the number of patients was tallied in a 2 × 2 contingency table for CLSTN3 and each of the remaining genes. From the tallied counts, the mutual information with CLSTN3 was calculated for each gene using the defining formula for mutual information described above. Fisher's exact probability p was also calculated for each gene from the tallied counts. For each gene, the calculated mutual information with CLSTN3 and the value of -log(p) obtained from Fisher's exact probability p were plotted on a graph.
The results are shown in FIG. 1. As shown in FIG. 1, with 1019 samples there was a linear relationship between the mutual information and -log(p). Thus, when N is large, -log(p) for Fisher's exact probability p is proportional to the mutual information.
Similar results were obtained when the breast invasive carcinoma patients were classified according to the presence or absence of point mutations.
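The relationship reported in this Example can be explored numerically with a small sketch. The defining formula for mutual information appears earlier in the specification and is not reproduced here, so the code assumes the usual definition computed from the cell frequencies of a 2 × 2 table, in bits. The table counts are invented for illustration, and the one-sided form of Fisher's exact probability is again an assumption.

```python
from math import comb, log2, log10


def mutual_information_bits(a, b, c, d):
    """Mutual information (bits) from the cell frequencies of a 2x2 table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    cells = ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1))
    mi = 0.0
    for count, i, j in cells:
        if count:  # empty cells contribute nothing to the sum
            mi += (count / n) * log2(count * n / (rows[i] * cols[j]))
    return mi


def normalized_fisher_score(a, b, c, d):
    """-log_10 P / (N log_10 2) with P as an upper-tail Fisher probability."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    p = tail / comb(n, col1)
    return -log10(p) / (n * log10(2))


table = (300, 200, 200, 300)  # hypothetical counts, N = 1000
mi = mutual_information_bits(*table)
score = normalized_fisher_score(*table)
print(mi, score)
```

Dividing by N log_10 2 expresses -log_10 P in bits per sample, which is the scale on which the two printed quantities can be compared.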
Example 2:
Data were downloaded from TCGA (http://cancergenome.nih.gov/) for each of a total of 19 sample types: acute myeloid leukemia, bladder urothelial carcinoma, breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney renal cell carcinoma, kidney papillary cell carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, prostate cancer, rectal adenocarcinoma, cutaneous melanoma, gastric adenocarcinoma, thyroid cancer, endometrial cancer, and cancer cell lines (CCLE). The CCLE data are not case data but data obtained from 1021 established cancer cell lines. The data for each sample type comprised 66 to 1021 cases as samples and contained information on approximately 20,000 genes.
With EGFR (epidermal growth factor receptor) as the target gene, for each of the 19 sample types and for each of the remaining genes, the number of samples was tallied in a 2 × 2 contingency table against EGFR by the same method as in Example 1, and Fisher's exact probability P was calculated from the table.
The Fisher's exact probabilities P between EGFR and each gene calculated for the individual sample types were integrated using a meta-analysis method (Rosenthal, 1984). Specifically, each P value was converted into a Z value, the Z values were integrated to calculate a Z_overall value, and the Z_overall value was converted to give a P_overall value for each gene. From the P_overall value obtained and the total number N_all of samples used in the integration, -log_10 P_overall / (N_all log_10 2) was calculated for each gene.
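The per-gene integration and scoring described in this paragraph can be outlined as below. The sketch is self-contained and makes the same assumption as before about the one-sided P-to-Z conversion; the gene labels, p values, and sample counts are made-up placeholders, not data from this Example.

```python
from math import log10, sqrt
from statistics import NormalDist


def rank_by_integrated_score(per_gene):
    """Rank genes by -log_10 P_overall / (N_all log_10 2), highest first.

    per_gene maps a gene label to a list of (p, n) pairs, one pair per
    sample type: the Fisher's exact probability and the sample count.
    """
    normal = NormalDist()
    scores = {}
    for gene, results in per_gene.items():
        # Assumed one-sided conversion of each P value to a Z value.
        z_overall = sum(-normal.inv_cdf(p) for p, _ in results) / sqrt(len(results))
        p_overall = normal.cdf(-z_overall)
        n_all = sum(n for _, n in results)  # all samples used in the integration
        scores[gene] = -log10(p_overall) / (n_all * log10(2))
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical inputs for two genes across two sample types.
toy = {
    "GENE_A": [(0.001, 100), (0.01, 200)],
    "GENE_B": [(0.4, 100), (0.6, 200)],
}
ranking, scores = rank_by_integrated_score(toy)
print(ranking, scores)
```

Normalizing by N_all puts genes evaluated on different total sample counts on a common per-sample scale before they are ranked.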
FIG. 2 shows the genes arranged in descending order of the calculated value. In addition, 2001 genes, consisting of the 2000 genes with the largest calculated values plus EGFR, were analyzed with Ingenuity Pathway Analysis (IPA) (registered trademark) analysis software from Qiagen. The top five canonical pathways (Canonical Pathways) in IPA are shown in Table 4 below.
Figure JPOXMLDOC01-appb-T000015
The third of the predicted pathways was EGF signaling. Thus, when Fisher's exact probabilities for the 19 sample types were integrated by meta-analysis, the interdependence between EGFR and each gene could be identified accurately.
Example 3:
The same method as in Example 2 was carried out, except that RB1 (RB Transcriptional Corepressor 1), IFNG (interferon gamma), and GRM1 (glutamate metabotropic receptor 1) were each used as the target gene. For each target gene, the genes arranged in descending order of the calculated value are shown in FIGS. 3 to 5.
For IFNG, a list of the 2000 genes with the largest calculated values was analyzed with IPA (registered trademark) analysis software. As a result, the top-ranked Upstream Regulator predicted by IPA (registered trademark) was IFNG. In other words, IFNG was predicted even though IFNG itself was not included in the input. The top five canonical pathways (Canonical Pathways) in IPA (registered trademark) are shown in Table 5 below.
Figure JPOXMLDOC01-appb-T000016
The predicted pathways agree very well with those known for IFNG. These results strongly suggest that the analysis results of the present invention, which were used as the input to IPA (registered trademark), are highly accurate, and they show that the present invention is also useful in disease areas other than cancer.
Similarly, a list of the 2000 genes with the largest mutual information with GRM1 was analyzed with IPA (registered trademark) analysis software. Among the disease or function annotations (Diseases & Functions Annotation) whose activation z-score (Activation z-score) had an absolute value of 3 or more, the top 15 results are shown in Table 6 below.
Figure JPOXMLDOC01-appb-T000017
The predicted functions of GRM1 agree very closely with its known functions. Thus, when Fisher's exact probabilities for a large number of samples were integrated by meta-analysis, the interdependence between GRM1 and each gene could be identified very accurately.
Example 4:
For one week of sales at store A of a supermarket chain, a purchase history of about 5000 samples is downloaded from the POS system. The data contain information on the contents of each individual purchase. The 5000 samples are classified into two types according to whether or not a product belonging to the "rice ball" category was purchased. Similarly, for each of the other product categories (about 300 product categories), the samples are classified into two types according to whether or not a product in that category was purchased. By the same method as in Example 1, the samples are tallied in a 2 × 2 contingency table for "rice ball" and each product category, and Fisher's exact probability P is calculated from the tallied counts. This is done for all of the approximately 200 product categories.
For the other stores B to Z of the supermarket chain, Fisher's exact probability P between "rice ball" and each product category is calculated in the same way, and the values are integrated using the meta-analysis method in the same manner as in Example 2. From the total number N_all of samples used in the integration, -log_10 P_overall / (N_all log_10 2) is calculated for each product category.
Product categories for which this calculation gives a high value can be judged to be frequently purchased together with "rice ball". For example, if the analysis shows that supermarket customers who buy "rice ball" often buy "cup miso soup" at the same time, sales can be increased by displaying the two next to each other.
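For purchase data of this kind, the tallying step can be sketched as follows. The baskets, the category labels other than "rice ball" and "cup miso soup", and the function name are hypothetical; the actual POS export format is not specified in this Example.

```python
from collections import Counter


def contingency_from_baskets(baskets, target, other):
    """Tally a 2x2 table of (target purchased?, other purchased?) over baskets.

    Returns (both, target only, other only, neither).
    """
    counts = Counter((target in basket, other in basket) for basket in baskets)
    return (
        counts[(True, True)],
        counts[(True, False)],
        counts[(False, True)],
        counts[(False, False)],
    )


# Each basket is the set of product categories in one purchase (invented data).
baskets = [
    {"rice ball", "cup miso soup"},
    {"rice ball", "cup miso soup", "tea"},
    {"rice ball"},
    {"tea"},
    {"bread", "tea"},
    {"cup miso soup"},
]
table = contingency_from_baskets(baskets, "rice ball", "cup miso soup")
print(table)
```

Repeating the call for every other category gives one table per category, each of which can then be passed to the Fisher's exact probability calculation.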
Example 5:
Stock price data for 2017 are downloaded for the stocks traded on the First Section of the Tokyo Stock Exchange (about 2000 stocks). There were about 240 trading days in 2017, and each day is treated as a sample. Next, data on the dollar-yen exchange rate (the price of 1 dollar in yen) in 2017 are downloaded. Using the exchange rate data, each sample day is classified into one of two types according to whether or not the dollar-yen rate on that day is higher than the rate on the previous day. Next, using the stock price data, each sample day is classified into one of two types for each company's stock according to whether or not the price at the close of trading is higher than the price at the start of trading. By the same method as in Example 1, the samples are tallied in a 2 × 2 contingency table for the movement of the dollar-yen rate and the movement of the company's stock price, and Fisher's exact probability P is calculated from the tallied counts. This is done for the stock prices of the approximately 2000 stocks.
The P values calculated for the individual stocks are integrated for each industry, according to the middle classification of the TSE industry classification, using the meta-analysis method in the same manner as in Example 2. From the total number N_all of samples used in the integration, -log_10 P_overall / (N_all log_10 2) is calculated for each industry.
Industries for which this calculation gives a high value can be predicted to have a strong tendency for their stock prices to move in conjunction with the dollar-yen exchange rate.
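The two classification criteria in this Example can be sketched as follows. All prices and rates are invented, and the function names are hypothetical. One detail the sketch makes explicit is an assumption about alignment: the first day in the series has no previous-day rate, so it is dropped from both series.

```python
def rose_vs_previous_day(rates):
    """1 if the rate is higher than on the previous day, else 0.

    The result has one entry per day starting from the second day.
    """
    return [1 if today > previous else 0 for previous, today in zip(rates, rates[1:])]


def closed_above_open(opens, closes):
    """1 if the closing price is higher than the opening price, else 0."""
    return [1 if close > open_ else 0 for open_, close in zip(opens, closes)]


# Hypothetical values for five consecutive trading days.
usd_jpy = [110.0, 110.5, 110.2, 111.0, 110.8]
opens = [500, 502, 505, 501, 507]
closes = [503, 506, 503, 506, 505]

fx_up = rose_vs_previous_day(usd_jpy)
stock_up = closed_above_open(opens[1:], closes[1:])  # drop day 1 to align

table = (
    sum(1 for f, s in zip(fx_up, stock_up) if f and s),
    sum(1 for f, s in zip(fx_up, stock_up) if f and not s),
    sum(1 for f, s in zip(fx_up, stock_up) if not f and s),
    sum(1 for f, s in zip(fx_up, stock_up) if not f and not s),
)
print(fx_up, stock_up, table)
```

The resulting table for each stock is the input to the Fisher's exact probability calculation, after which the per-stock values are integrated by industry.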

Claims (4)

  1. A method for identifying interdependence between a first event and a second event, comprising a step of calculating -log_10 P / N on the basis of N and a Fisher's exact probability P, wherein the Fisher's exact probability P is calculated on the basis of a 2 × 2 contingency table in which the number of samples is tallied from a data set containing binary data for the first event and binary data for the second event, the data set being obtained from data containing information on the first event and information on the second event for N samples.
  2. The method according to claim 1, wherein the Fisher's exact probability P is calculated by a method comprising a step of integrating, by meta-analysis, a plurality of Fisher's exact probabilities including: (1) a Fisher's exact probability P_1 calculated on the basis of a 2 × 2 contingency table in which the number of samples is tallied from a data set containing binary data for the first event and binary data for the second event, the data set being obtained, on the basis of a first criterion for the first event and a first criterion for the second event, from data containing information on the first event and information on the second event for N_1 samples; and (2) a Fisher's exact probability P_2 calculated on the basis of a 2 × 2 contingency table in which the number of samples is tallied from a data set containing binary data for the first event and binary data for the second event, the data set being obtained, on the basis of a second criterion for the first event and a second criterion for the second event, from data containing information on the first event and information on the second event for N_2 samples.
  3. A computer program for causing a computer to execute the method according to claim 1 or 2.
  4. A recording medium storing the computer program according to claim 3.

PCT/JP2018/013877 2017-03-31 2018-03-30 Method for identifying interdependence WO2018181988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019509406A JP6820621B2 (en) 2017-03-31 2018-03-30 How to identify interdependencies

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-072904 2017-03-31
JP2017072904 2017-03-31

Publications (1)

Publication Number Publication Date
WO2018181988A1 true WO2018181988A1 (en) 2018-10-04

Family

ID=63678171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/013877 WO2018181988A1 (en) 2017-03-31 2018-03-30 Method for identifying interdependence

Country Status (2)

Country Link
JP (1) JP6820621B2 (en)
WO (1) WO2018181988A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009048455A (en) * 2007-08-21 2009-03-05 Nippon Hoso Kyokai <Nhk> Internode relationship estimation apparatus and computer program
JP2009069911A (en) * 2007-09-10 2009-04-02 Mizuho Information & Research Institute Inc Gene-related analysis device and gene-related analysis program
JP2009517064A (en) * 2005-11-30 2009-04-30 アンスティテュ、ナショナル、ド、ラ、サント、エ、ド、ラ、ルシェルシュ、メディカル(アンセルム) Methods for Hepatocellular Carcinoma Classification and Prognostication
JP2013123420A (en) * 2011-12-15 2013-06-24 World Fusion Co Ltd Preparation method of gene set
US20170009277A1 (en) * 2014-01-30 2017-01-12 Siemens Healthcare Gmbh Genetic resistance testing

Also Published As

Publication number Publication date
JP6820621B2 (en) 2021-01-27
JPWO2018181988A1 (en) 2020-04-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18774622

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019509406

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18774622

Country of ref document: EP

Kind code of ref document: A1
