WO2001096861A1 - System for molecule identification - Google Patents
- Publication number
- WO2001096861A1 (application PCT/SE2001/001322)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- molecules
- mass
- masses
- molecule
- stored
- Prior art date
Classifications
-
- H—ELECTRICITY
- H01—ELECTRIC ELEMENTS
- H01J—ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
- H01J49/00—Particle spectrometers or separator tubes
- H01J49/26—Mass spectrometers or separator tubes
- H01J49/34—Dynamic spectrometers
- H01J49/40—Time-of-flight spectrometers
-
- C—CHEMISTRY; METALLURGY
- C40—COMBINATORIAL TECHNOLOGY
- C40B—COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
- C40B40/00—Libraries per se, e.g. arrays, mixtures
Definitions
- The present invention relates to a method and tools for the identification of unknown molecules and, in particular, to a method and tools for molecule identification that solve the problem of random mass matching.
- Molecule identification problems can concern e.g. the tracing of unwanted substances in the environment and the studies of metabolic pathways and disease-state markers in drug development projects. Molecule identification problems can sometimes be solved by the appropriate application of instruments and methods for the acquisition and processing of data from a sample containing the molecules to be identified.
- One kind of data from a sample is mass data.
- Molecular or molecular constituent mass data can be obtained by a variety of techniques, including ultracentrifugation, electrophoresis, and mass spectrometry.
- Experimental mass data from the sample analyzed is often compared with database-information about known or hypothetical molecules.
- Mass spectrometry (MS) of protein digests combined with searching in protein and DNA sequence databases is a method of choice for the identification of proteins in proteomics projects.
- The field of proteomics, which includes the elucidation of protein function under various cell conditions, is believed to form a future basis for drug design.
- MS-protein identification involves cleavage of proteins with an enzyme having high digestion specificity (usually trypsin), whereupon the resulting proteolytic products are subjected to mass analysis by either matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization mass spectrometry (ESI-MS).
- The experimentally determined masses are then compared with masses of peptides that individual proteins in a database would yield if they were cleaved by the same enzyme as was used in the experiment.
- Alternatively, individual proteolytic peptide ions are isolated and subjected to fragmentation and fragment mass analysis in the mass spectrometer.
- The resulting fragment masses are then compared with hypothetical proteolytic peptide fragment masses of the proteins in a database.
- The protein is identified based on an evaluation of either or both of these comparisons.
- Mass spectrometry determines a peptide mass $m_i$ to an accuracy $\pm\Delta m_i$, with $\Delta m_i/m_i$ typically >30 ppm. Within the mass range $m_i \pm \Delta m_i$, proteolytic peptide masses of several proteins in a genome database can match.
- An unmodified peptide will match randomly with several proteins in the database, in addition to the true match with the actual protein present in the sample, and a modified peptide will yield only random matches. Consequently, a database search using mass spectrometry information will not always identify a protein unambiguously. Therefore, in order to perform accurate and reliable molecule identification, instruments for obtaining mass data must be appropriately linked with the use of other technical resources for the comparison of mass data and mass information obtained from a database.
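- The comparison of measured masses with database masses within a relative accuracy can be sketched as follows. This is a minimal illustration, not the method of the invention: the function name `count_matches`, the default tolerance of 30 ppm and the example masses are assumptions made here for demonstration.

```python
from bisect import bisect_left, bisect_right

def count_matches(measured, theoretical, ppm=30.0):
    """Count measured masses that lie within +/- ppm of at least one theoretical mass.

    `measured` and `theoretical` are lists of masses in Da; `ppm` is the assumed
    relative mass accuracy (delta_m / m) in parts per million.
    """
    theo = sorted(theoretical)
    matches = 0
    for m in measured:
        delta = m * ppm * 1e-6               # absolute tolerance for this mass
        lo = bisect_left(theo, m - delta)    # first theoretical mass >= m - delta
        hi = bisect_right(theo, m + delta)   # one past the last mass <= m + delta
        if hi > lo:                          # at least one mass inside the window
            matches += 1
    return matches

# Hypothetical masses: only the first measured mass has a partner within 30 ppm.
print(count_matches([1000.0, 1500.0, 2000.0], [1000.02, 1500.1, 2500.0]))  # prints 1
```

- Widening the tolerance raises the number of matches for every database protein, which is why random matches become more frequent as mass accuracy degrades.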
- The link can be a system that makes use of a method including means for comparison of data and database information, preferably operated via a computer.
- Identification of proteins by the above-described approach requires a scheme for determining the best match between the experimental data and a sequence in the database.
- Existing schemes for determining the best match include ranking by number of matches (W.J. Henzel et al., Proc. Natl. Acad. Sci. USA 90, 5011, 1993), a scoring system based on the observed frequency of peptides from all proteins in a database in a given molecular weight range (the so-called "MOWSE score"; D.J.C. Pappin et al., Current Biology 6, 327, 1993), and a scheme based on Bayesian probabilities (W. Zhang et al., Anal. Chem. 72, 2482, 2000).
- The object of the present invention is to overcome the shortcomings of the above-mentioned schemes, i.e., to provide a method that solves the problem of random mass matching.
- This and other objects have been met by providing a system including methods of determining the probability for a particular score due to random mass matching of a molecule, and of utilizing the computed probability to rank molecules.
- The method comprises: a) determining the number of matches between a database molecule and mass data; b) computing the probability that a database molecule would yield a particular number of matches by chance; c) computing a score based on one or several probabilities computed in step (b); d) comparing the scores of molecules in a molecule database; and e) identifying the molecule or molecules that yield(s) the best score(s).
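- The steps of counting, computing a chance probability, scoring and ranking can be sketched as follows. The text does not specify the probability model, so this sketch assumes independent random matches with one per-molecule probability (a binomial model) and uses the probability of at least k chance matches as the score. The names `binomial_tail` and `rank_candidates` and the example numbers are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(at least k of n masses match by chance), assuming independent
    matches that each occur with probability p (an assumed model)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rank_candidates(match_counts, chance_probs, n):
    """Rank database molecules: the lowest chance probability gets the highest rank.

    match_counts: {name: number of matches k}        (step a)
    chance_probs: {name: per-mass chance probability} (input to step b)
    n:            number of masses in the mass data
    """
    scored = [(binomial_tail(match_counts[name], n, chance_probs[name]), name)
              for name in match_counts]               # steps b and c
    scored.sort()                                     # step d
    return [(name, score) for score, name in scored]  # step e: best first

# Hypothetical example: both molecules have the same chance probability,
# so the one with more matches is ranked first.
ranking = rank_candidates({"A": 5, "B": 2}, {"A": 0.1, "B": 0.1}, n=10)
print(ranking[0][0])  # prints A
```

- A larger molecule yields more constituent masses and therefore a higher chance probability, so under such a score it needs more matches than a small molecule to reach the same rank.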
- The invention further provides a method of generating a frequency function of the number of matches for random (false) molecule identification for any experimental condition.
- The method comprises: a) defining a sub-population of the molecules contained in a database; b) computing the probability that a molecule in this sub-population would yield a particular number of matches by chance; c) computing a probability that all molecules in the sub-population would yield at most a particular number of matches by chance; d) computing the probability that at least one molecule in the sub-population would yield at least a particular number of matches by chance; and e) determining the relative frequency of each number of matches by using the probability computed in step (d) for each number of matches and generating therefrom a frequency function of the number of matches for random protein identification.
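- Steps (b) to (e) can be illustrated with a small sketch. It rests on assumptions that the text does not state: all H molecules in the sub-population share one binomial match distribution and match independently of each other. Under those assumptions, the probability that all molecules yield at most k matches is the single-molecule cumulative probability raised to the power H. The function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Step b (assumed model): probability of exactly k chance matches out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def top_score_frequency(n, p, H):
    """Relative frequency of each number of matches k being the best random score
    among H independent, identically distributed molecules (an assumption)."""
    cdf, total = [], 0.0
    for k in range(n + 1):
        total += binomial_pmf(k, n, p)
        cdf.append(min(total, 1.0))
    # Step c: probability that all H molecules yield at most k matches.
    all_at_most = [c**H for c in cdf]
    # Step d: probability that at least one molecule yields at least k matches.
    at_least_one = [1.0] + [1.0 - all_at_most[k - 1] for k in range(1, n + 1)]
    # Step e: frequency of k as the top score = P(max >= k) - P(max >= k + 1).
    return [at_least_one[k] - (at_least_one[k + 1] if k < n else 0.0)
            for k in range(n + 1)]

freq = top_score_frequency(n=20, p=0.05, H=1000)
print(max(range(len(freq)), key=freq.__getitem__))  # most frequent false top score
```

- An analytical frequency function of this kind avoids repeated simulation of random identifications, which is the speed advantage the text attributes to the invention.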
- Fig. 1 shows frequencies (i.e., number of matching proteins) of various tryptic peptide masses in a database.
- Fig. 2 shows mass distribution peaks for tryptic peptides.
- Fig. 3 shows the performance of an implementation of one embodiment of the invention in comparison with state of the art systems for protein identification.
- The graph displays results from simulations employing the invention (denoted Probity), a Bayesian method, and a method based on the number of matches.
- Fig. 4 shows score frequency functions generated by the invention in comparison with score frequency functions generated by simulation.
- Examples of large-scale molecule identification can be found in proteomics projects, where thousands of proteins from cells are to be identified, or cells are screened for molecular markers of states of disease.
- The ultimate goal of molecule identification procedures is to rely on simple, rapid and automated procedures and instrumentation.
- The technical solutions of the system that links and compares mass data with database information are of key importance to the design of instruments for automated molecule identification, since the system used will strongly influence the capability of obtaining a high relative frequency of true identification results, which is particularly critical when the quality of the data is poor.
- Automated identification instrumentation demands that the quality of identification results be assessed automatically by the use of a significance test (J. Eriksson et al., Anal. Chem. 72, 999, 2000).
- One object of the present invention is to provide a system that utilizes methods that allow more accurate molecule identification and more accurate and rapid significance testing of identification results.
- The method according to the invention appropriately takes into account the phenomenon of random matching, and is therefore well suited for implementation in an automated molecule identification system.
- A particular concern regarding large-scale molecular identification is the time required to obtain the identification result together with a quality assessment of this result.
- A quality assessment can be accomplished by a significance test, which requires knowledge of frequency functions describing scores for false results.
- Such frequency functions are currently obtained by simulation of random molecular identification.
- An analytical expression for the derivation of a frequency function is provided.
- The methods according to the invention are well suited for, but not limited to, applications in which the molecules are biological molecules that can exist in cells of organisms.
- Biological molecules include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses.
- Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates.
- An experimental biological molecule is a biological molecule that is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule.
- A theoretical biological molecule is a known biological molecule described in a database.
- Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids.
- A protein typically contains at least ten amino acids, preferably at least 50 amino acids and more preferably at least 100 amino acids.
- Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides, preferably at least 500 nucleotides.
- Polysaccharides are polymers of monosaccharides. Constituent parts of polysaccharides comprise one or more monosaccharides. Typically, a polysaccharide contains at least five monosaccharides, preferably at least ten monosaccharides.
- Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule.
- Mass data include individual mass spectra and groups of mass spectra.
- The mass spectra can be in the form of peptide maps, oligonucleotide maps or oligosaccharide maps.
- The method of the present invention includes generating experimental mass data for the experimental molecule within a certain mass range. Mass data include the measured masses. The method also includes generating theoretical mass data in the same mass range. In one embodiment, the experimental mass data is a subset of the experimental mass data.
- Mass data for molecules can be generated in any manner that provides mass data within a certain accuracy. Examples include matrix-assisted laser desorption/ionization mass spectrometry, electrospray ionization mass spectrometry, chromatography and electrophoresis. Mass data can also be generated by a general-purpose computer configured by software or otherwise. For the purposes of the present invention, the mass data, for example a peptide mass $m_i$, is determined to an accuracy $\pm\Delta m_i$, with $\Delta m_i/m_i$ preferably < 10,000 ppm, more preferably < 100 ppm, and most preferably < 30 ppm.
- A step in generating mass data of a molecule may include first cleaving the molecule into constituent parts.
- Biological molecules may be cleaved by methods known in the art.
- The biological molecules are cleaved into constituent parts at predictable positions to form predictable masses.
- Methods of cleaving include chemical degradation of the biological molecules.
- Biological molecules may be degraded by contacting the biological molecule with any chemical substance.
- Proteins may be predictably degraded into peptides by means of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc.
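- Predictable cleavage can be mimicked on a database sequence. The sketch below assumes complete digestion and the commonly used trypsin rule (cleavage after K or R, except when the next residue is P), neither of which is spelled out in the text. Peptide masses are computed from standard monoisotopic residue masses plus one water. The function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(sequence):
    """Split a sequence after K or R unless the following residue is P
    (assumed rule; complete digestion, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])   # C-terminal peptide
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

peptides = tryptic_digest("AAKGGRPLLKCC")
print(peptides)  # prints ['AAK', 'GGRPLLK', 'CC']
print([round(peptide_mass(p), 4) for p in peptides])
```

- Incomplete digestion could be modeled by also emitting concatenations of adjacent peptides, which increases the number of theoretical masses per protein.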
- Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases, such as Eco RI, Sma I, BamH I, Hinc II, etc.
- Polysaccharides may be degraded into constituent parts by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.
- A mass range $(m_{\min}, m_{\max})$ is determined for the experimental mass data.
- The mass range can be any mass range of the mass data.
- For example, the mass range is bounded by the minimum and maximum measured masses of the experimental mass data for a molecule.
- A molecule database is any compilation of information about characteristics of molecules.
- A molecule database can be a biological molecule database.
- Databases are the preferred method for storing both polypeptide amino acid sequences and the nucleic acid sequences that code for these polypeptides.
- The databases come in a variety of different types that have advantages and disadvantages when viewed as the hypothesis for a polypeptide identification experiment.
- Although a database entry for an amino acid sequence may appear to be a simple text file to a user browsing for a particular polypeptide, many databases are organized into very flexible, complicated structures.
- The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database). Protein mass data may be predicted from nucleic acid sequence databases.
- Alternatively, protein mass data may be obtained directly from protein sequence databases that contain a collection of amino acid sequences represented by a string of single-letter or three-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (such as "B" indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)).
- The sequences typically have a unique number-letter combination associated with them that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence.
- Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence.
- A database that contains these elements is referred to as "annotated".
- Annotated databases are used if some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence.
- Non-annotated databases contain only the sequence, an accession number, and a descriptive title.
- The background information known about an experimental molecule by which the database search can be constrained can include any information.
- Some examples of background information include information about the species of an experimental biological molecule, knowledge or an assumption about the mass of the experimental biological molecule and the isoelectric point of the experimental biological molecule.
- The observed molecular mass or the observed isoelectric point of a protein can be used in combination with the measured masses of peptides generated by proteolysis to constrain the search for a polypeptide.
- The comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen mass range.
- The chosen mass range is preferably within 50% of the mass of the unknown protein, more preferably within 35%, most preferably within 25%.
- The comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen isoelectric point range.
- The isoelectric point (pI) of a protein is the pH at which its net charge is zero.
- The chosen isoelectric point range is preferably within 50% of the isoelectric point of the unknown protein, more preferably within 35%, most preferably within 25%.
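- Constraining the search by mass and isoelectric point amounts to filtering the database before any mass comparison. The following sketch assumes database entries are dictionaries with hypothetical keys `name`, `mass` and `pI`, and interprets "within 25%" as a relative deviation from the observed value.

```python
def within_fraction(value, reference, fraction):
    """True if value deviates from reference by at most the given relative fraction."""
    return abs(value - reference) <= fraction * abs(reference)

def constrain_candidates(candidates, observed_mass=None, observed_pi=None, fraction=0.25):
    """Keep only database proteins whose mass and/or pI fall within the chosen
    relative window around the observed values. A constraint is skipped when
    the corresponding observation is None."""
    kept = []
    for c in candidates:
        if observed_mass is not None and not within_fraction(c["mass"], observed_mass, fraction):
            continue
        if observed_pi is not None and not within_fraction(c["pI"], observed_pi, fraction):
            continue
        kept.append(c)
    return kept

# Hypothetical database entries (mass in Da).
database = [
    {"name": "P1", "mass": 40000.0, "pI": 5.0},
    {"name": "P2", "mass": 52000.0, "pI": 6.0},
    {"name": "P3", "mass": 80000.0, "pI": 5.5},
]
kept = constrain_candidates(database, observed_mass=50000.0)
print([c["name"] for c in kept])  # prints ['P1', 'P2']
```

- A narrower window shrinks the searched sub-population, which reduces the number of molecules that can produce random matches.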
- Fragment mass data for a peptide can be generated in any manner which provides fragment mass data within a certain accuracy.
- Experimental conditions include the type of energy used to generate the fragment mass data.
- Vibrational excitation energy can be used.
- The vibrational excitation may be generated by collisions of the peptide with electrons, photons, gas molecules or a surface.
- Electronic excitation can be used.
- The electronic excitation may be generated by collisions of the peptide with electrons, photons, gas molecules (e.g. argon) or a surface.
- The experimental fragment mass spectrum of a peptide from an enzymatically digested unknown protein is compared with the theoretical masses calculated by applying the rules for the specificity of the enzyme, and the rules for the fragmentation as known to those of ordinary skill in the art, to the amino acid sequence of a database protein.
- Fragment mass data for the purposes of this invention can be generated by using multidimensional mass spectrometry (MS/MS), also known as tandem mass spectrometry.
- A number of types of mass spectrometers can be used, including a triple-quadrupole mass spectrometer, a Fourier-transform cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer.
- A single peptide from a protein digest is subjected to MS/MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.
- The invention provides a method to determine the probabilities for the scores that a particular molecule in a database can yield by chance when compared with mass data.
- The method can operate under a variety of experimental and database search constraints.
- The score can be the number of matches between masses derived from known or hypothetical molecules or molecular constituents in a database and masses in mass data from one or several known or unknown molecules, or molecular constituents.
- The score can also result from a computation that utilizes the number of matches.
- The invention provides a method to extract information about the molecules in a database.
- Examples of information that can be extracted from a database are total molecular mass, charge, isoelectric point, hydrophobicity and known or hypothetical chemical modification, as well as the mass, charge, isoelectric point, hydrophobicity and known or hypothetical chemical modification of molecular constituents.
- The invention provides a method to perform actions on molecules in the database that are intended to mimic actions occurring in an experiment.
- Examples of such actions are degradation of molecules into molecular constituents by hydrolysis, where hydrolysis can result from the activity of chemicals or enzymes.
- The method can also perform actions that mimic experimental actions on molecular constituents, for example the fragmentation of an excited molecular constituent into smaller pieces.
- The invention provides a method to derive a number of molecular pieces, $k_u$, resulting from an action assumed to mimic an experimental situation.
- The pieces can be molecular constituents, such as proteolytic peptides resulting from enzymatic digestion of a protein, where different assumptions can be made concerning the degree of completeness of the enzymatic digestion.
- The pieces can be molecular constituents in the form of fragments of molecular constituents, e.g. fragments of proteolytic peptides.
- The invention provides a method to organize the masses of molecules or molecular constituents or fragments thereof. Examples of such organization are given in Figs. 1 and 2, where Fig. 1 displays the number of proteins in a database that match a given proteolytic peptide mass and Fig. 2 displays the clustered distribution of proteolytic peptide masses. Masses clustering in this or a similar fashion will be referred to as a mass distribution peak. Mass distribution peaks can be found for all molecules that contain a limited number of different atoms (e.g. C, H, N, O, S).
- The invention provides a method for defining mass regions wherein the frequency of various masses can be determined. The method defines $f_i$ as the fraction of masses of molecular constituents or fragments that falls into a mass region $i$.
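- The fraction of masses falling into each mass region can be obtained by simple binning. The sketch below assumes equal-width regions between a lower and an upper mass limit; the text does not say how the regions are chosen, and the function name `region_fractions` is hypothetical.

```python
def region_fractions(masses, m_min, m_max, n_regions):
    """Return, for each of n_regions equal-width mass regions between m_min and
    m_max, the fraction of the in-range masses that falls into that region."""
    width = (m_max - m_min) / n_regions
    counts = [0] * n_regions
    for m in masses:
        if m_min <= m <= m_max:
            # Clamp so that m == m_max lands in the last region.
            idx = min(int((m - m_min) / width), n_regions - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_regions

# Hypothetical constituent masses in Da, split into two regions of 1100 Da.
print(region_fractions([800.0, 900.0, 1500.0, 2500.0, 3000.0], 800.0, 3000.0, 2))
# prints [0.6, 0.4]
```

- Because small peptides are far more numerous than large ones, the fractions typically fall off with increasing mass, so a match at high mass is less likely to be random than one at low mass.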
- The invention provides a method that determines a probability $p_i$ that a particular molecule in a database will be found in a randomly chosen mass distribution peak in the mass region $i$:
- $p_i = F(k_u, m_i, c)$, where $F$ is a function, $m_i$ is a mass region, and $c$ denotes experimental and database search constraints.
- In one embodiment, $p_i$ is given by an expression (the equation is missing from this text) that describes the probability that a molecular constituent from a particular molecule characterized by $k_u$ will be found in a single randomly chosen mass distribution peak.
- The denominator of that expression represents the number of mass distribution peaks within the mass region $i$.
- ⁇($m_i$, $\Delta m$) can be interpreted as a statistical measure of the number of molecular constituent masses that can be found within $\pm\Delta m$ of a randomly chosen molecular constituent mass.
- The mass accuracy $\Delta m$ can be different for different mass regions, in which case it is denoted by $\Delta m_i$.
- The invention provides a method to determine ⁇($m_i$, $\Delta m$) by simulation of the relative frequency of masses around a randomly chosen mass in a mass distribution.
- Alternatively, ⁇($m_i$, $\Delta m$) is determined by integration of a function describing molecular constituent mass distributions and normalization to the total number of molecular constituent masses in a mass distribution peak.
- Alternatively, ⁇($m_i$, $\Delta m$) is determined by direct counting followed by normalization.
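- Direct counting followed by normalization can be pictured as follows. This is only one possible reading of the text: the sketch counts the masses of a single mass distribution peak that lie within the mass accuracy of a chosen mass and divides by the number of masses in that peak. The function name and the example masses are hypothetical.

```python
def local_mass_fraction(peak_masses, center, delta_m):
    """Fraction of the masses in one mass distribution peak that lie within
    +/- delta_m of the chosen mass `center` (direct counting, then normalization)."""
    if not peak_masses:
        return 0.0
    inside = sum(1 for m in peak_masses if abs(m - center) <= delta_m)
    return inside / len(peak_masses)

# Hypothetical constituent masses belonging to one peak near 1000.5 Da.
peak = [1000.40, 1000.45, 1000.50, 1000.55, 1000.90]
print(local_mass_fraction(peak, center=1000.50, delta_m=0.06))  # prints 0.6
```

- Averaging this fraction over many randomly chosen centers would correspond to the simulation route mentioned in the text.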
- A finite number of mass regions between $m_{\min}$ and $m_{\max}$ is employed, each having an individually defined $p_i'$.
- The probabilities $p_i'$ are employed to compute a total probability, $p(k)$, for an individual molecule in the database to match randomly $k$ out of $n$ masses, where $n$ is the number of masses in the mass data.
- $p(k) = G(p_i', k, n, c')$, where $G$ is a function and $c'$ denotes experimental and database search constraints.
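- The text leaves the function G unspecified. One plausible stand-in, shown here purely as an assumption, treats each of the n measured masses as an independent trial whose chance-match probability is that of the mass region it falls in. The number of random matches then follows a Poisson-binomial distribution, which a short dynamic program computes exactly. The function name is hypothetical.

```python
def match_count_distribution(probs):
    """Return [p(0), p(1), ..., p(n)], the probability of exactly k chance matches
    among n independent masses with individual match probabilities `probs`
    (an assumed stand-in for the unspecified function G)."""
    dist = [1.0]                         # zero masses processed: p(0) = 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)      # this mass does not match
            new[k + 1] += q * p          # this mass matches by chance
        dist = new
    return dist

# Two masses, each with a 50% chance of matching at random.
print(match_count_distribution([0.5, 0.5]))  # prints [0.25, 0.5, 0.25]
```

- When all per-mass probabilities are equal, the result reduces to the ordinary binomial distribution.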
- A score related to random matching is employed in the process of ranking molecules in a database.
- The probability $p(k)$ is employed in the process of ranking molecules in a database.
- A whole database or a fraction of a database is processed and organized to allow the computation of $p(k)$ for molecules in the database. Here $k$ denotes the number of matches between the masses of molecular constituents of each database molecule investigated and masses in the mass data.
- The molecules in the database can be known or hypothetical.
- The molecule or molecules producing the mass data can be known or unknown.
- The ranking of the molecules in a database is based on the score $S(p(k))$, where $S$ is a function.
- The molecule in the database that yields the lowest $S(p(k))$ for $k$ matches with the mass data is given the highest rank.
- The molecule in the database yielding the second lowest $S(p(k))$ for $k$ matches is given the second highest rank, and so on.
- The identified molecule or molecules are among the molecules having the highest ranks.
- The highest ranks can comprise the top-ranked molecule only, but can also comprise more molecules, e.g. the top two, top three, top four, top five, top ten, or top 100.
- The number of ranked molecules that are considered as identification results can also be determined by the use of a significance test.
- the invention provides a method of generating a frequency distribution of scores for a particular experimental condition, wherein the scores relate to random identifications of proteins.
- a frequency distribution is any compilation of the observed values of the variable being studied and how many times each value is observed.
- Frequency distributions can be in the form of a table of listings, a bar graph, a histogram, a frequency polygon, or a continuous curve.
- Functions derived from frequency distributions can be continuous (probability density functions) or discrete (probability mass functions). Cumulative distribution functions of each type of function can also be derived.
- the frequency function is generated for a sub-population with H members from a database.
- the sub-population is selected based upon values of k_u.
- the frequency function is generated for molecules ranked upon their number of matches.
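The ranking concepts listed above can be illustrated with a minimal Python sketch. It is not the method claimed in the patent: the source leaves the functions G and S unspecified, so the sketch assumes that each measured mass matches at random independently of the others, that G is the resulting probability of exactly k successes in n independent trials, and that S is the identity. The per-mass random match probability is estimated by direct counting followed by normalization over all constituent masses in a toy database. All function names and the example data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def random_match_probability(pool, mass, delta_m):
    """Direct counting followed by normalization: the fraction of all
    constituent masses in the sorted pool lying within mass +/- delta_m."""
    lo = bisect_left(pool, mass - delta_m)
    hi = bisect_right(pool, mass + delta_m)
    return (hi - lo) / len(pool)


def prob_exactly_k(probs, k):
    """Assumed form of G: probability that exactly k of n independent
    trials succeed, where trial i succeeds with probability probs[i]."""
    dist = [1.0]  # dist[j] = probability of j random matches so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this mass does not match
            nxt[j + 1] += q * p       # this mass matches at random
        dist = nxt
    return dist[k]


def count_matches(constituents, measured, delta_m):
    """k: number of measured masses that have at least one constituent
    mass of the candidate molecule within +/- delta_m."""
    s = sorted(constituents)
    return sum(
        1 for m in measured
        if bisect_left(s, m - delta_m) < bisect_right(s, m + delta_m)
    )


def rank_candidates(database, measured, delta_m):
    """Return (score, name, k) tuples sorted so that the lowest score,
    here S(p(k)) = p(k) by assumption, receives the highest rank."""
    pool = sorted(m for masses in database.values() for m in masses)
    probs = [random_match_probability(pool, m, delta_m) for m in measured]
    scored = []
    for name, masses in database.items():
        k = count_matches(masses, measured, delta_m)
        scored.append((prob_exactly_k(probs, k), name, k))
    return sorted(scored)


# Toy database: molecule name -> constituent masses (hypothetical values).
toy_database = {
    "A": [500.0, 800.0, 1200.0],
    "B": [500.1, 950.0, 1500.0],
    "C": [600.0, 700.0, 1300.0],
}
ranked = rank_candidates(toy_database, [500.05, 800.02, 1200.01], 0.1)
```

In this toy run, molecule A matches all three measured masses and therefore has the smallest probability of matching by chance, so it is ranked first. A real implementation would use mass-region-dependent accuracies Δm_i and the search constraints c', which this sketch omits.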
Landscapes
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2001264517A AU2001264517A1 (en) | 2000-06-14 | 2001-06-12 | System for molecule identification |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE0002214A SE517259C2 (sv) | 2000-06-14 | 2000-06-14 | System for molecule identification |
SE0002214-5 | 2000-06-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001096861A1 (fr) | 2001-12-20 |
WO2001096861A8 (fr) | 2002-08-01 |
Family
ID=20280077
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SE2001/001322 WO2001096861A1 (fr) | 2000-06-14 | 2001-06-12 | System for molecule identification |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU2001264517A1 (fr) |
SE (1) | SE517259C2 (fr) |
WO (1) | WO2001096861A1 (fr) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005031343A1 (fr) * | 2003-10-01 | 2005-04-07 | Proteome Systems Intellectual Property Pty Ltd | Method for determining the biological likelihood of candidate compositions or structures |
US7349809B2 (en) | 2000-02-02 | 2008-03-25 | Yol Bolsum Canada Inc. | Method of non-targeted complex sample analysis |
US8478762B2 (en) | 2009-05-01 | 2013-07-02 | Microsoft Corporation | Ranking system |
EP1376651B1 (fr) * | 2002-06-25 | 2014-06-11 | Hitachi, Ltd. | Method and device for analyzing mass spectrometry data |
US10697969B2 (en) | 2005-09-12 | 2020-06-30 | Med-Life Discoveries Lp | Methods for diagnosing a colorectal cancer (CRC) health state or change in CRC health state, or for diagnosing risk of developing CRC or the presence of CRC in a subject |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5538897A (en) * | 1994-03-14 | 1996-07-23 | University Of Washington | Use of mass spectrometry fragmentation patterns of peptides to identify amino acid sequences in databases |
JP2000048765A (ja) * | 1998-07-24 | 2000-02-18 | Jeol Ltd | Time-of-flight mass spectrometer |
EP1047107A2 (fr) * | 1999-04-06 | 2000-10-25 | Micromass Limited | Method for identifying peptides and proteins by mass spectrometry |
WO2000073787A1 (fr) * | 1999-05-27 | 2000-12-07 | Rockefeller University | Expert system for protein identification using mass spectrometry information combined with database searching |
- 2000-06-14: SE, application SE0002214A, patent SE517259C2 (sv), not active (IP right cessation)
- 2001-06-12: AU, application AU2001264517A, patent AU2001264517A1 (en), not active (abandoned)
- 2001-06-12: WO, application PCT/SE2001/001322, patent WO2001096861A1 (fr), active (application filing)
Non-Patent Citations (1)
Title |
---|
PATENT ABSTRACTS OF JAPAN * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7349809B2 (en) | 2000-02-02 | 2008-03-25 | Yol Bolsum Canada Inc. | Method of non-targeted complex sample analysis |
US7865312B2 (en) | 2000-02-02 | 2011-01-04 | Phenomenome Discoveries Inc. | Method of non-targeted complex sample analysis |
EP1376651B1 (fr) * | 2002-06-25 | 2014-06-11 | Hitachi, Ltd. | Method and device for analyzing mass spectrometry data |
WO2005031343A1 (fr) * | 2003-10-01 | 2005-04-07 | Proteome Systems Intellectual Property Pty Ltd | Method for determining the biological likelihood of candidate compositions or structures |
US10697969B2 (en) | 2005-09-12 | 2020-06-30 | Med-Life Discoveries Lp | Methods for diagnosing a colorectal cancer (CRC) health state or change in CRC health state, or for diagnosing risk of developing CRC or the presence of CRC in a subject |
US8478762B2 (en) | 2009-05-01 | 2013-07-02 | Microsoft Corporation | Ranking system |
Also Published As
Publication number | Publication date |
---|---|
AU2001264517A1 (en) | 2001-12-24 |
SE0002214D0 (sv) | 2000-06-14 |
WO2001096861A8 (fr) | 2002-08-01 |
SE517259C2 (sv) | 2002-05-14 |
SE0002214L (sv) | 2001-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6393367B1 (en) | Method for evaluating the quality of comparisons between experimental and theoretical mass data | |
Henzel et al. | Protein identification: the origins of peptide mass fingerprinting | |
US8639447B2 (en) | Method for identifying peptides using tandem mass spectra by dynamically determining the number of peptide reconstructions required | |
Blueggel et al. | Bioinformatics in proteomics | |
US6446010B1 (en) | Method for assessing significance of protein identification | |
Liska et al. | Combining mass spectrometry with database interrogation strategies in proteomics | |
Lu et al. | A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications | |
US20020046002A1 (en) | Method to evaluate the quality of database search results and the performance of database search algorithms | |
WO2001096861A1 (fr) | System for molecule identification | |
WO2004083233A2 (fr) | Peptide identification | |
EP1820133B1 (fr) | Method and system for identifying polypeptides | |
CA2477151A1 (fr) | Method for identifying proteins using mass spectrometry data | |
US20040044481A1 (en) | Method for protein identification using mass spectrometry data | |
US20020152033A1 (en) | Method for evaluating the quality of database search results by means of expectation value | |
Hubbard | Computational approaches to peptide identification via tandem MS | |
Fridman et al. | The probability distribution for a random match between an experimental-theoretical spectral pair in tandem mass spectrometry | |
WO2002101355A9 (fr) | Improved proteomic analysis | |
Liu et al. | PRIMA: peptide robust identification from MS/MS spectra | |
US7603240B2 (en) | Peptide identification | |
Fang et al. | Feature selection in validating mass spectrometry database search results | |
WO2003087805A2 (fr) | Method for efficiently calculating the mass of modified peptides for identification by database searching and mass spectrometry | |
WO2004070643A2 (fr) | Method for predicting a protein function | |
Phanse et al. | Proteomics and Protein Identification by Mass Spectrometry | |
Yan et al. | Separation of ion types in tandem mass spectrometry data interpretation-a graph-theoretic approach | |
Wu et al. | Peptide identification via tandem mass spectrometry |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | ||
AK | Designated states | Kind code of ref document: C1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: C1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
WR | Later publication of a revised version of an international search report | ||
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
122 | EP: PCT application non-entry in European phase | ||
NENP | Non-entry into the national phase | Ref country code: JP |