WO2002008469A2 - Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques - Google Patents
Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques Download PDFInfo
- Publication number
- WO2002008469A2 WO2002008469A2 PCT/US2001/023629 US0123629W WO0208469A2 WO 2002008469 A2 WO2002008469 A2 WO 2002008469A2 US 0123629 W US0123629 W US 0123629W WO 0208469 A2 WO0208469 A2 WO 0208469A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- algorithm
- allele
- ofthe
- quality value
- data
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 161
- 238000004519 manufacturing process Methods 0.000 title claims abstract description 4
- 108700028369 Alleles Proteins 0.000 claims abstract description 464
- 230000008569 process Effects 0.000 claims description 59
- 150000007523 nucleic acids Chemical class 0.000 claims description 47
- 108020004707 nucleic acids Proteins 0.000 claims description 44
- 102000039446 nucleic acids Human genes 0.000 claims description 44
- 238000001514 detection method Methods 0.000 claims description 33
- 239000002773 nucleotide Substances 0.000 claims description 22
- 125000003729 nucleotide group Chemical group 0.000 claims description 22
- 238000006243 chemical reaction Methods 0.000 claims description 20
- 238000007781 pre-processing Methods 0.000 claims description 10
- 150000001413 amino acids Chemical class 0.000 claims description 9
- 239000012634 fragment Substances 0.000 description 47
- 208000003028 Stuttering Diseases 0.000 description 36
- 239000003550 marker Substances 0.000 description 29
- 238000012545 processing Methods 0.000 description 19
- 238000013459 approach Methods 0.000 description 15
- 230000037230 mobility Effects 0.000 description 14
- 230000006870 function Effects 0.000 description 11
- 230000000694 effects Effects 0.000 description 10
- 108020004414 DNA Proteins 0.000 description 9
- 230000003321 amplification Effects 0.000 description 8
- 238000004458 analytical method Methods 0.000 description 8
- 238000009499 grossing Methods 0.000 description 8
- 238000003199 nucleic acid amplification method Methods 0.000 description 8
- 239000000243 solution Substances 0.000 description 8
- 238000003860 storage Methods 0.000 description 8
- 238000012360 testing method Methods 0.000 description 8
- 238000012935 Averaging Methods 0.000 description 7
- 101001024425 Mus musculus Ig gamma-2A chain C region secreted form Proteins 0.000 description 7
- 238000004364 calculation method Methods 0.000 description 7
- 108091092878 Microsatellite Proteins 0.000 description 6
- 238000010586 diagram Methods 0.000 description 6
- 239000000975 dye Substances 0.000 description 6
- 230000002068 genetic effect Effects 0.000 description 6
- 108090000623 proteins and genes Proteins 0.000 description 6
- 238000005070 sampling Methods 0.000 description 6
- 238000003491 array Methods 0.000 description 5
- 230000004044 response Effects 0.000 description 5
- 230000008901 benefit Effects 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 239000011159 matrix material Substances 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 102000004169 proteins and genes Human genes 0.000 description 4
- 239000013598 vector Substances 0.000 description 4
- 238000007476 Maximum Likelihood Methods 0.000 description 3
- 108091028043 Nucleic acid sequence Proteins 0.000 description 3
- 238000012408 PCR amplification Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 3
- 239000000470 constituent Substances 0.000 description 3
- 238000001502 gel electrophoresis Methods 0.000 description 3
- 238000005259 measurement Methods 0.000 description 3
- 239000000203 mixture Substances 0.000 description 3
- 102000054765 polymorphisms of proteins Human genes 0.000 description 3
- 238000011002 quantification Methods 0.000 description 3
- 238000011144 upstream manufacturing Methods 0.000 description 3
- 241000252505 Characidae Species 0.000 description 2
- 101100480705 Cupriavidus necator (strain ATCC 17699 / DSM 428 / KCTC 22496 / NCIMB 10442 / H16 / Stanier 337) tauE gene Proteins 0.000 description 2
- 241000984642 Cura Species 0.000 description 2
- 102000053602 DNA Human genes 0.000 description 2
- 206010011878 Deafness Diseases 0.000 description 2
- 206010071602 Genetic polymorphism Diseases 0.000 description 2
- 239000007983 Tris buffer Substances 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000000546 chi-square test Methods 0.000 description 2
- 210000000349 chromosome Anatomy 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 239000012141 concentrate Substances 0.000 description 2
- 238000011109 contamination Methods 0.000 description 2
- 238000013480 data collection Methods 0.000 description 2
- 238000012217 deletion Methods 0.000 description 2
- 230000037430 deletion Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000001962 electrophoresis Methods 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 238000009472 formulation Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 238000003780 insertion Methods 0.000 description 2
- 230000037431 insertion Effects 0.000 description 2
- 238000011068 loading method Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 239000001048 orange dye Substances 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 230000008707 rearrangement Effects 0.000 description 2
- 239000001044 red dye Substances 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 238000004513 sizing Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- LENZDBCJOHFCAS-UHFFFAOYSA-N tris Chemical compound OCC(N)(CO)CO LENZDBCJOHFCAS-UHFFFAOYSA-N 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- RYGMFSIKBFXOCR-UHFFFAOYSA-N Copper Chemical compound [Cu] RYGMFSIKBFXOCR-UHFFFAOYSA-N 0.000 description 1
- 241000353097 Molva molva Species 0.000 description 1
- 101100400779 Mus musculus Mdfi gene Proteins 0.000 description 1
- 108020004682 Single-Stranded DNA Proteins 0.000 description 1
- 238000000211 autoradiogram Methods 0.000 description 1
- 238000013476 bayesian approach Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 239000001045 blue dye Substances 0.000 description 1
- 230000002759 chromosomal effect Effects 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 239000002131 composite material Substances 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 239000010432 diamond Substances 0.000 description 1
- 230000003467 diminishing effect Effects 0.000 description 1
- 201000010099 disease Diseases 0.000 description 1
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 1
- 230000008030 elimination Effects 0.000 description 1
- 238000003379 elimination reaction Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000003205 genotyping method Methods 0.000 description 1
- 239000001046 green dye Substances 0.000 description 1
- 238000002347 injection Methods 0.000 description 1
- 239000007924 injection Substances 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000012804 iterative process Methods 0.000 description 1
- 238000003064 k means clustering Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000002493 microarray Methods 0.000 description 1
- 238000000329 molecular dynamics simulation Methods 0.000 description 1
- 108090000765 processed proteins & peptides Proteins 0.000 description 1
- 238000000734 protein sequencing Methods 0.000 description 1
- 230000000630 rising effect Effects 0.000 description 1
- 238000002922 simulated annealing Methods 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 241000894007 species Species 0.000 description 1
- 238000010561 standard procedure Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000011179 visual inspection Methods 0.000 description 1
- 239000001043 yellow dye Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- This invention relates to data methods and systems for assigning values to nucleic information.
- the methods and systems are used for assigning values to nucleic information.
- nucleic acid information There are many techniques for analyzing nucleic acid information. For example, certain
- a polymorphism involves difference in a
- polymorphisms may occur in regions in which nucleic acids do not encode proteins.
- a given dinucleotide such as GC or CA
- trinucleotides or
- the larger repeat regions (larger number of nucleotide bases within a
- minisatellites have often been referred to as "minisatellites.”
- the smaller repeat regions (1, 2, 3, 4, 5, or 6 nucleotides within a repeated motif) have often been referred to as “microsatellites” or “short tandem repeats (STR's).”
- STR's short tandem repeats
- Such repeat regions can serve as genetic markers since individuals can vary in the number of repeats at a given locus (location) or at many loci (locations). Each different form at a given locus is known as an allele. These differences at a given position can serve as genetic markers that are useful for many purposes including positively identifying an individual from genetic material based on the unique genetic pattern of such an individual.
- methods of determining the number of dinucleotide repeats at a given locus include use of PCR to amplify the regions in question.
- the polymerase in the PCR reaction may slip and miss one or more ofthe repeat units that are present in the studied nucleic acid region.
- an extra A nucleotide may be included in the studied nucleic acid region.
- PCR stutter and/or plus A distortion may be added during amplification.
- amplification products typically will include not only the correct amplified allele, but also shorter
- the data may show multiple
- Certain embodiments ofthe invention provide a computer-implemented method for making allele calls.
- the method comprises:
- obtaining an allele call report comprising:
- unique calling algorithms are also provided.
- Figure 1 depicts an overview block diagram for use with methods and systems consistent with certain embodiments ofthe present invention when providing allele calls.
- Figure 2 depicts a flow chart ofthe steps performed by a data processing system in
- Figures 3 A-3D depict exemplary allele calling algorithms for use with methods and
- Figure 4 depicts a flow chart ofthe steps performed by the committee machine of Figure
- Figure 5 depicts a block diagram of a system for practicing methods and systems consistent with certain embodiments ofthe present invention.
- Figure 6 depicts data that may be generated and then interpreted using certain
- Figure 7 depicts data discussed in Example II (Envelope Caller).
- Figure 8 depicts data discussed in Example III (Optimizer Caller).
- Figure 9 depicts methods for searching for an allele that is discussed in Example III
- Figures 10 through 12 depict data that can be evaluated with the heuristic algorithm
- Figure 13 illustrates a typical standard heterozygous allele signature. (Circles denote user
- x-axis is in base pairs
- y-axis is in A/D counts (voltage intensity)
- FIG. 14 illustrates steps in the allele calling routine according to the embodiments
- Figure 15 depicts data discussed in Example V (Committee Machine Processing). It illustrates hypothesis formation in the optimizer routine. The two columns represent the optimal solution (left column) and a suboptimal solution (right column).
- Panel (a) shows the target vector with the two red lines showing the location ofthe candidate peaks.
- Panel (c) shows the hypothesis formed using different values of stutter and + A.
- the x-axis is somewhat meaningless at this point since it gets mapped back to base-pair indices after the winning hypothesis is chosen.
- Figure 16 depicts data discussed in Example V (Committee Machine Processing), and shows division of heterozygous signal into panels by the Envelope Caller algorithm.
- the panels are ranked according to signal energy and the three of interest are labeled pi, p2 and p3 with the two panels containing strong allele signatures being shaded in blue.
- Circles denote user annotated allele calls, (x-axis is in base pairs, y-axis is in A/D counts (voltage intensity))
- Figure 17 illustrates an example of how reporting could be accomplished as discussed in Example V (Committee Machine Processing). These are examples where consensus was not reached and show data that is difficult to interpret.
- Figure 18 depicts an overview block diagram ofthe system according to certain embodiments.
- Figure 19 depicts exemplary data ofthe effects of localvectorMin on baselining when the signal contains no "structure” ("structure” is "useful information” such as peaks.
- Figure 20 depicts exemplary data according to certain embodiments where the positive structure is eliminated.
- Figure 21 depicts an exemplary bottom baseline after eliminating the negative spikes.
- Figure 22 depicts exemplary data according to certain embodiments where baselining is generated by averaging the top and bottom. .
- Figure 23 depicts the baselined signal according to certain embodiments.
- Figure 24 depicts exemplary data according to certain embodiments.
- Figure 25 depicts exemplary data showing detail ofthe peak location according to certain embodiments.
- Figure 26 depicts exemplary data when the peak is symmetric.
- Figure 27 depicts exemplary data when the peak is not symmetric.
- Figure 28 depicts exemplary data when the peak is not symmetric.
- Figure 29 magnifies the region marked in red in Figure 28.
- Figure 30 depicts exemplary data by calculating the first derivative by fitting polynomials according to certain embodiments.
- Figure 31 depicts exemplary data using k to smooth the derivative according to certain embodiments.
- Figure 32 depicts peaks in certain exemplary data.
- Figure 33 depicts peaks in certain exemplary data.
- Figure 34 illustrates how, according to certain embodiments, to avoid certain artifacts.
- Figure 35 illustrates exemplary data showing a peak with shoulders.
- Figure 36 illustrates exemplary data which shows how, in certain embodiments, one may
- Figure 37 illustrates exemplary data which shows how, in certain embodiments, one may
- Figure 38 illustrates the final result ofthe peak detector's shoulder detections according
- Figure 39 depicts exemplary data of peaks, sizes, and a matching.
- Figure 40 illustrates a mesh of execution times according to certain embodiments.
- Figure 41 illustrates how each curve may hold constant the number of extra peaks
- Figure 42 illustrates how each curve may hold constant the number of sizes in the size-
- Figure 43 depicts linear interpolation according to certain embodiments.
- Figure 44 illustrates linear interpolation according to certain embodiments.
- Figure 45 illustrates exemplary data of a size calling algorithm according to certain
- Figure 46 depicts a flow chart ofthe system according to certain embodiments.
- the system may contain one or more ofthe algorithms depicted in Figure 46, which result in an allele call report.
- Allele - An allele is one of two or more alternate forms at the same locus.
- a diploid organism may be homozygous (having the same allele on each ofthe two
- Non-diploid organisms may have more than two alleles. Allele Calling - When fragment analysis is performed, the region of nucleic acid containing the marker is flanked by known primer sites which permit localization ofthe allele. For example, changes in the allele may result in different fragment lengths. Thus, for these alleles, determination ofthe length ofthe nucleic acid sequence between primers is referred to as allele calling. For example, if two alleles are present, there will be two pieces of nucleic acid with different lengths.
- Locus - A unique chromosomal location defining the position of an individual nucleic acid sequence.
- Allele Signature - During PCR amplification, PCR stutter often occurs, which results in additional peaks that emerge in a predictable pattern. Another artifact that may appear is plus A distortion. The combination ofthe original signal, the stutter, and other artifacts is referred to as the allele signature.
- Marker - Markers can be thought of as landmarks in the genome and can appear in noncoding regions of nucleic acid. Their use in linkage mapping stems from their polymorphism. Many different types of markers exist.
- Algorithm An algorithm is a process of one or more steps for accomplishing a result.
- the word "component” is used interchangeably in this application with the word “algorithm.”
- the system includes one or more ofthe algorithms or components depicted in the flowchart shown in Figure 46.
- the following sections discuss each ofthe algorithms set forth in that flowchart.
- the system in certain embodiments will include all ofthe algorithms in Figure 46. In certain embodiments, the system will not include all of those algorithms.
- the system may obtain information that has already been subjected to one or more prior algorithms set forth in Figure 46 and then proceeds with one or more ofthe subsequent algorithms set forth in Figure 46. For example, the system may start with information that has already been subjected to an offscale and multicomponenting process or similar processes, and then proceeds with one or more ofthe subsequent algorithms shown in the flowchart.
- the system may provide information obtained from algorithms to another system that then uses that information to obtain a result.
- the system allows automated scoring or sizing of DNA fragments.
- these fragments are mostly Microsatelhtes but other markers can be used (e.g. amelogenlin, snp markers).
- the scores from these markers can be used in a variety ofapplications.
- Two exemplary, but not limiting, applications for certain embodiments of the system are Linkage Mapping and Databasing for Human Identification (HID).
- the allele calls from a number of samples of related individuals are used to define a region of DNA in which a gene of interest lies.
- the allele calls for a set of markers form a profile for an individual. This can be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a crime.
- the profile of an individual may also be stored in a database and compared to the profiles obtained from crime scenes to match a suspect to a
- the system includes an offscale detection algorithm. If the data (e.g., a fluorescent signal) in any filter for a certain scan number is larger than a set maximum, an
- offscale detection algorithm will treat that position (scan number) as offscaled.
- offscale detection is performed in the
- the system need not
- the system includes a multicomponenting component for sample
- a multicomponenting algorithm is a process that converts optically filtered data to day
- the raw data may include fluorescence of different colored dyes that overlap.
- the multicomponenting purifies such signals such that a signal from a different dye
- the raw data signal F is a list of ⁇ tuples that give
- multicomponenting is performed in the data collection process. In such embodiments and in certain other embodiments, the system need not perform
- the system includes a baselining algorithm, which subtracts out certain baseline shifts from the signal.
- baseline shift may be caused by
- baseline shift may occur with different
- the baselining algorithm employs three parameters: window size,
- the system fixes the smooth size to -1 (no smoothing) and spike size to 21. In certain embodiments, the system uses different window sizes
- the baselining algorithm finds a bottom baseline that rides under
- the baselining algorithm works by finding minima and maxima in a signal.
- the baselining component defines localVectorMax to be the
- loca ⁇ VectorMax(s ⁇ gn ⁇ /, ,£) max ⁇ signal(i): x - k 2 ⁇ i ⁇ x + k, ⁇ .
- the parameter k is called the "Baseline Window Size”. Similarly, the baselining component
- loca.YVectorMin(signal,x,k) min ⁇ sign ⁇ l(i): x - k, ⁇ i ⁇ x + k 2 ⁇ .
- these operators are overloaded to provide vectors of minima and
- the signal contains no structure, but has a constantly sloping baseline.
- the baselining algorithm should leave the signal largely untouched. But consider the effect of localVectorMin in the figure. It took too much from the signal.
- the positive structure can be eliminated by executing
- such structure should not extend over any significant distance
- top localVectorMin(localVectorMax(5fgn /, ⁇ ), ⁇ );
- top localVectorMax(localVectorMin(top, k), k);
- Figure 22 shows the top baseline in green, the bottom baseline in blue, and the average baseline in black. It is a simple matter for the system to remove the baseline by subtracting it from the signal, as shown in Figure 23.
- the baselining window size is user-settable. In certain embodiments, one skilled in the art will be able to get an appropriate window size. In certain embodiments, windows that are too small will track peaks too closely, so that the baselined peaks will appear short. In certain embodiments, windows that are too large will not track baseline variations, such as the primer peak tail, closely enough, so that the baselined peaks will appear high and under-resolved.
- the system uses a peak detection algorithm. Such an algorithm helps predict where in the generated data there are actual peaks.
- such an algorithm employs four parameters: degree, window width, tauB (smallest slope at which peaks start) and tauE (smallest slope at which peaks end).
- the system uses 3 for degree, 99 for window width, 0.0 for tauB and 0.0 for tauE.
- the system uses degree 2.
- the algorithm also takes two additional parameters: min peak
- the system has a height and min peak width (full width at half maximum). In certain embodiments, the system
- the system fixes the min
- the system For the min peak height, in certain embodiments, the system
- the system uses a baselining algorithm to figure out the noise level and the min peak height is picked as 10 times that noise level. In certain embodiments, one may use the
- the user specifies the min peak heights for blue/green/yellow/red/orange dyes.
- the Size-calling peak detector is called the Savitzky-Golay
- a peak is a local maximum in a signal.
- the peak detector detects a peak when it sees a signal.
- the highest point may be used as the peak position.
- the Savitzky-Golay detector estimates the beginning and the end of a peak by thresholding the rising edges ofthe first derivative using two user-specified parameters, a non- negative TB and a non-positive r
- ⁇ B is called the "Slope Threshold for Peak Start”
- r £ is called the 'Slope Threshold for Peak End”.
- the detector finds the start of a peak by searching to the left ofthe peak position. The peak starts where the first derivative crosses ⁇ g from negative to positive.
- the detector finds the end of a peak by searching to the right ofthe peak position.
- the peak ends where the first derivative crosses r £ , also from negative to positive. If the peak is symmetric (a Gaussian, for example), typically
- ⁇ B ⁇ ⁇ r E ⁇ , as illustrated in Figure 26.
- the red curve is a quadratic fit to the 5 points
- the green curve is a cubic fit.
- the Savitzky-Golay technique may compute this derivative without having to fit a curve to every window (W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, General linear least squares, In Numerical Recipes in C, chapter 14.3, pages 528-539, Cambridge University Press, 1988.).
- the parameter d called "Polynomial Degree” in certain embodiments, determines the degree ofthe polynomial to use.
- the window size k is a control parameter for the detector. In certain embodiments, one sets k to 1.5 times the expected (not minimum) full peak width at half-max (FWHM). The effect of k may be evident in the presence of noise.
- the Savitzky- Golay technique is a kind of smoothing, with larger values for A; resulting in smoother curves. In certain embodiments, the Savitzky-Golay technique will not force peaks down (by lowering the maximum) and out (by raising the edges), in contrast with smoothing by averaging.
- sharp corners can cause artifacts in the algorithm.
- Figure 34 shows the effect of k - 11.
- the Savitzky-Golay detector will detect multiple peaks only if clear valleys separate them. For example, in such embodiments, in Figure 35, the detector will detect only one peak.
- This peak does, however, has shoulders.
- the algorithm detects left- and right-bank shoulders differently, though similarly.
- the first derivative is positive and is "trying" to cross zero (thereby causing a peak). So the position ofthe shoulder is marked by a local minimum in a positive first derivative.
- the algorithm finds this location by looking for a negative-to-positive zero crossing ofthe second derivative.
- the beginning ofthe shoulder is the point at which the slope stops increasing so quickly (in preparation for the shoulder), that is by a local maximum in the second derivative.
- the end ofthe shoulder is marked by the same condition (in preparation for the peak or another shoulder).
- Figure 36 marks these three locations (start, position, and end of shoulder) with small
- the first derivative is negative and is "trying" to cross zero (thereby causing a peak). So the position ofthe shoulder is marked by a local maximum in a negative first derivative.
- the algorithm finds this location by looking for a positive-to-negative
- the height of a peak is the maximum signal value from its start to
- the peak detecting algorithm will report a peak only if the peak's height is at least that ofthe peak amplitude threshold for that dye.
- the thresholds for the blue, green, yellow, red, and orange dyes are called respectively "B:", “G:”, “Y:”, and “R:”, and "0.”
- the peak detecting algorithm will report a peak only if the peak's width is at least that ofthe peak width threshold. In certain embodiments, this threshold
- the peak detecting algorithm measures the area of a peak to be sum ofthe
- SIZE STANDARD MATCHING Certain embodiments employ a size standard matching algorithm (which may also be referred to as a "size standard matcher” or “size matcher”). Such an algorithm matches data generated with a standard sample to actual sizes that should exist in the standard sample. For example, one may use a standard sample with nucleotide lengths 110, 114, 117, 120, and 125. One runs the standard sample and obtains several data peaks. The size standard matching algorithm predicts the peaks that correspond to the five known nucleotide lengths. Thus, one can subsequently compare data in a sample to those peaks to determine the nucleotide lengths of fragments in a sample.
- a size standard matching algorithm which may also be referred to as a "size standard matcher” or “size matcher”
- a size standard matching algorithm includes three parameters: ratio factor (the importance of peak height vs the importance ofthe local linearity), min acceptable quality (used for ending dynamic programming iteration), and number of extra peaks (the number of peaks participated in size matching is the number of size standard definition fragments plus the number of extra peaks).
- the algorithm fixes the ratio factor to 0.6 and min acceptable quality to 0.75.
- the algorithm fixes the number of extra peaks to 10 for Applied Biosystems 310/377 instrument data and 25 for Applied Biosystems 3700 instrument data.
- a statistically based quality value is generated for the matching result.
- one skilled in the art will be able to adjust the number of extra peaks that may be used with a given instrument.
- the algorithm ignores the peaks located within the offscale regions in the sample. In certain embodiment, the algorithm fails the size matching process if the size standard definitions are not fully matched in the matching process.
- the algorithm implements two primer peak detection methods.
- the first is the primer-peak-height-supression method. This method replaces the peak heights of the highest peaks with the peak height ofthe middle peak, assuming that the primer peaks are among the highest.
- the second is to find the primer peak location. The method assumes that the primer peak locates within the first half of the signal and the size standard fragments locate in the second half of the signal. For example, one takes the mean peak height of all the peaks in the second half and multiples that mean by five to get the potential primer peak height. The method works backwards in the first half of the signal to find the last primer peak.
- a size-standard matching algorithm takes as input a list of peaks (e.g., from an electropherogram) and a list of fragment sizes (e.g., in nucleotides). It produces as output a matching, that is, a list of pairs ofthe form ⁇ peak,size>, where each peak and each fragment size appears at most once.
- a size standard matching algorithm evaluates a matching, and uses an algorithm for finding good matchings.
- Certain embodiments employ an algorithm that evaluates a matching by treating its two constituent sequences as sequences of edges between points.
- a matching is also a correspondence between edges.
- a matching is also a correspondence between ratios. Under the assumption that the relation between peak position and fragment size is "more or less" linear, corresponding ratios typically should be equal.
- the algorithm derives a ratio cost to measure this property.
- the component also concentrates on big peaks by deriving a height cost. The total cost of a matching is a weighted sum of these constituent costs.
- the algorithm formulates the size standard matching problem as finding a matching with maximum cost.
- the cost is separable. That is, with some additional mathematics, the algorithm can maximize subsequences independently.
- the cost also enjoys the advantage of being local, thereby compensating for global deviations from linearity. This cost also leads to a quality value between 0 and 1.
- a size standard is a set of DNA fragments, each of known size.
- the definition of a size standard is simply a list of these sizes. Note that a size-standard definition typically does not depend on the instrument on which the size standard is run, and therefore not on any particular set of run conditions either.
- An in-lane size standard is a set of peaks resulting from running a size standard on an instrument. One determines the positions and the heights ofthe peaks.
- a size standard matching algorithm takes as input an in-lane size standard and a size standard definition. It produces as output a matching, that is, a list of pairs of the form (peak,size), where each peak and each fragment size appears at most once.
- a peak has a position (e.g., in scan numbers) and a height (e.g., in fluorescent units). Fragment sizes are given
- S [s 0 , s, , ... , s n t _, j is a list of n s fragment sizes in increasing nucleotides.
- H [2722, 6219, 1060, 5380, 7726, 1082, 7424, 1263, 7335, 7937, 1562].
- the size standard definition has n s - 5 sizes S:
- M ⁇ (3, 0), (4, 1), (6, 2), (8, 3), (9, 4) ⁇ . Big-Oh notation is used to express algorithm complexity. This notation is ubiquitous in
- the size standard matching algorithm evaluates a matching by examining its two constituent sequences. It treats the
- fragment size definition (index) sequence is [0, 1, 2, 3, 4], which also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4).
- peak edge (6, 8) corresponds to definition edge (2, 3). Assume that two edges are adjacent if they share an
- (4, 6) and (6, 8) are adjacent since they share peak 6.
- edges (i,j) and (j,k) define a ratio r lJk of lengths:
- the ratio cost of a matching is the sum of its individual costs.
- 0 ⁇ c h (i) ⁇ 1 for all peaks 0 ⁇ i ⁇ n p , and c h (i) 1, in certain embodiments, corresponds to the ideal of a maximally tall peak.
- an advantage ofthe above formulation is that the cost is separable. That is, with some additional mathematics, one can maximize subsequences independently. This property leads to an efficient dynamic programming algorithm.
- the algorithm is efficient (runs in low-order polynomial time and space) and guarantees an optimal solution.
- k,f denote the maximum cost of matching the peaks from 0 to A with the definition fragments from 0 to + 1 in such a way that peak/ matches size/and peak k matches size + 1. The cost of matching all sizes is therefore
- size standard matching algorithm computes the individual elements in a consistent order.
- the size standard matching algorithm only needs to
- the size standard matching algorithm needs to examine only subproblems
- Algorithm 2 solves Equation 10 and Algorithm 3 solves Equation 11.
- Algorithm 3 Compute Cost of Matching (whenf> 0)
- matching algorithm may reconstruct an optimal matching from Equation 9 by tracing backwards.
- the algorithm's run time complexity is dominated by the number
- a test program executes the matcher component 20 times and divides the elapsed time by 20 also, to give the time for each execution in milliseconds.
- Figures 40 through 42 show the results. The execution times themselves, rounded to the
- the arrays can be implemented as sparse arrays, so that they occupy ⁇ m 3 n s ) space as
- Another solution is to use full arrays, but to index them not with the peak indices,
- size standard matching algorithm should size.
- the size standard specifies the number of extra peaks to consider.
- the size standard specifies the number of extra peaks to consider.
- m 4. In certain embodiments,
- weighting factor ⁇ between 1/2 and 3/4.
- An analyst typically should choose a size standard definition that corresponds to the in-
- lane size standard it may be that an analyst terminated a run early, before the longer fragments have had a chance to elute. In this case, the definition is not accurate, strictly speaking. To provide some robustness in this situation, one may test if the optimal matching satisfies a minimum acceptable quality parameter. If not, one may remove the last definition size and try again, repeating this process until the quality is acceptable. Alternatively, if the quality is unacceptable, one may simply report this without returning a matching.
- the system uses a size calling algorithm.
- the size calling algorithm predicts the nucleotide size corresponding to data peaks from a sample in view ofthe standard sizes.
- such an algorithm uses at least one of five size calling algorithms: local southern, global southern, second order least square, third order least square, and cubic spline interpolation.
- the size-calling algorithm maps scan numbers (read-frames, data points, etc.) into fragment sizes.
- the size calling algorithm provides global (or least squares fit) methods and local (or interpolation) methods.
- the size calling algorithm includes three global methods (second order least squares, third order least squares, and global Southern) and two local methods (cubic spline and local Southern). Global Methods
- the global methods determine the ⁇ zef(x) of a fragment at scan number J: by evaluating a function/ The function depends on the method:
- a cubic spline is meant to simulate numerically a draftsman's mechanical spline.
- the size calling algorithm uses a so-called natural spline, for which the second derivative at the endpoints is 0.
- the size calling algorithm represents these constraints as a set of linear equations, which it then solves with Gaussian elimination (W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, General linear least squares, In Numerical Recipes in C, chapter 14.3, pages 528-539, Cambridge University Press, 1988.).
- a scan number x corresponds to time (since the capillary length, or well-to-read distance, is fixed), and so is inversely proportional to mobility.
- the size calling algorithm finds size-standard fragments a, b, c, and d so that scan x is between scans b and c. These fragments have known sizes (l/a),f s (l/c),f(l/d), mdfi(l/d) respectively.
- the size calling algorithm detects such collinear triplets and interpolates linearly in size-versus-mobility space (mobility space) to call a size. For example, suppose one has size standard fragments at 10, 20, and 30 base pairs, and that they elute at 12, 15, and 20 scans respectively. They then have mobilities of 1/12, 1/15, and 1/20 capillary lengths per scan. These points are collinear in mobility space, as shown in Figure 43.
- Figure 45 shows both cases, and shows how the size calling algorithm in certain embodiments would size a fragment at scan 17.
- the leftmost three size-standard points are collinear in mobility space, while the rightmost three points are collinear in scan space.
- the size calling algorithm obtains the blue '+' at scan 17 by linear interpolation in mobility space. In certain embodiments, it obtains the green '+' by solving a system of three Southern equations. It then sizes the fragment at scan 17 by averaging these two sizes, as shown by the black '+'.
- the system uses an allele calling component.
- an allele calling component is used to interpret what data actually corresponds to alleles.
- one uses one or more algorithms to determine the data points that actually correspond to an allele.
- one uses more than one allele calling algorithm and the component uses that combined information in a committee approach to provide the allele call. In certain embodiments, one may use a single allele calling algorithm.
- the following description of certain embodiments involves allele calling when one analyzes dinucleotide repeats at given loci using PCR amplification.
- the invention is in no way limited to such work and may involve any number of repeats or may involve other types of genetic polymorphisms.
- Other polymorphisms include, but are not limited to, SNP's (single nucleotide polymorphisms), single base insertions and deletions, insertions and deletions involving more than one base, and rearrangements.
- embodiments ofthe algorithms may be applied to other types of data in which multiple algorithms produce results that typically require interpretation and scoring in terms of their confidence values.
- Such other areas of application include, but are not limited to, the following: basecalling (de novo, mixed base and comparative sequence); SNP basecalling; spot- finding for microarrays; protein sequencing; protein gene expression; peptide searches (a noisy time series alignment problem); and modeling of biological systems.
- basecalling de novo, mixed base and comparative sequence
- SNP basecalling spot- finding for microarrays
- protein sequencing protein gene expression
- peptide searches a noisy time series alignment problem
- modeling of biological systems One skilled in the art will appreciate all ofthe many types of nucleic acid and amino acid information that may be evaluated according to the present inventions. Examples include, but are not limited to, data from any ofthe applications above and any evaluation of properties including nucleic acid or amino acid length, molecular weight, or nucleic acid or amino acid identity.
- the committee approach for all of these applications of interpreting data, one uses the output of more than one algorithm rather than relying upon but one algorithm. Often, different algorithms may have various advantages over others depending on various conditions.
- the committee approach uses different algorithms to generate a meaningful confidence value on the correct interpretation of multiple data points. According to certain embodiments, the committee approach is particularly powerful when combined with the concept of establishing the operating environment first, an example of which is illustrated by the Envelope Caller described herein.
- the stutter results in fragments that are shorter by one or more
- nucleotides may be added, which results in artifacts in Figure 6 having an extra basepair (i.e., at
- the present invention provides a heterozygous individual with alleles 93 and 103 and that includes artifacts.
- the artifacts that may be introduced, however, are not always simply disregarded when the actual alleles are closer together and allele signatures overlap.
- the present invention provides typically more reliable allele calling.
- the present invention provides typically more reliable allele calling.
- Figure 1 depicts an overview block diagram of a committee system 100 in which methods
- Data 102 includes
- Data 102 may be passed to multiple allele calling algorithms, such as the Envelope Detection Caller algorithm 104, Optimizer Caller algorithm 106, and a Heuristic Caller algorithm
- Envelope Detection Caller algorithm 104 detects a heterozygous allele pattern when alleles
- Optimizer Caller algorithm 106 identifies impulse functions (e.g.,
- a response signal e.g., a raw microsatellite signal
- Caller algorithm 108 uses multiple rules and filters to eliminate peaks that are not alleles from
- Each algorithm reports their results to a committee machine 110 that uses logic and/or
- Committee machine 110 produces robust results and can predict calls. That is, machine 110 receives call results from several callers and can
- the confidence level may
- Results 112 contain the confidence level for each test performed by committee machine 110, and results 112
- the committee system 100 provides a number of benefits over traditional allele calling
- Figure 2 depicts a flow chart ofthe steps performed by a data processing system in processing allele calls according to certain embodiments.
- the data processing system receives size-called fragment analysis data (step 202).
- the received data may then be processed using various allele calling algorithms (step 204).
- Each caller algorithm operates well in different environments and on different signals.
- committee machine 110 may assign a confidence level to the call.
- Algorithms may either examine the data's complexity, and should it pass certain requirements, make the appropriate calls or make the calls regardless of data complexity. Several exemplary calling algorithms are described in Figures.3A-3D.
- the results of each call are transferred to a committee machine 110 (step 206).
- Committee machine 110 processes the results ofthe calls (step 208) and arbitrates a decision and assigns an appropriate confidence value for the results ofthe calling algorithms.
- the results of this arbitration are reported to a user as the fragment lengths (calls) accompanied by a confidence value (step 210).
- FIG. 3 A depicts a flow chart ofthe steps performed by a data processing system when processing alleles with the Envelope Caller algorithm according to certain embodiments.
- the Envelope Caller algorithm typically is used to detect a heterozygous allele pattern where the alleles are well separated spatially.
- the Envelope Caller assesses the complexity of a signal from the nucleic acid sequencer prior to making a call and if the signal's complexity is below a threshold (i.e., the signal is in the caller's operating region) it makes the call.
- a threshold i.e., the signal is in the caller's operating region
- the algorithm may perform preprocessing such as smoothing (step 302).
- preprocessing such as smoothing (step 302).
- the algorithm may use N-point smoothing replacing each point with a local average over itself and N points on either side. By replacing each point with such a mean, noise is removed from the signal, and a smoother signal remains.
- the minima and maxima ofthe signal are determined (step 303) using a technique such as the Savitzky-Golay algorithm (See, e.g., Numerical Recipes in C: The Art of Scientific Computing, William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992, pages 650-655) which uses calculation ofthe derivatives of
- step 304 a new signal is formed by retaining only the maxima. This has the effect of determining the envelope ofthe signal. In Figure 7, this signal is shown as the dotted line.
- the signal is passed back through the algorithm that determines the minima and maxima (step 305). With this new representation the original signal is then divided into panels at each minimum (step 306).
- a panel is a large section ofthe signal that is bounded by the signal's deep local minima. In Figure 7, 6 panels exist and are bounded as outlined in table 1.
- the algorithm In order to determine the signal complexity and whether or not the algorithm should make a call, the algorithm first determines if three panels exist (step 308). If, at least three panels exist, the algorithm computes an energy level for each panel, for example, by summing the square of each element in the panel (step 312). Other methods of assessing the signal's energy in defined regions may be used. Since the algorithm is searching for the envelope characteristic of two well separated alleles, one typically uses three panels to ascertain if two distinct allele signatures exist.
- the Envelope Caller algorithm performs a "threshold" determination (step 314). That is, using the three energy levels (El , E2, and E3), the algorithm
- certain embodiments ofthe Envelope Caller may include the following:
- the energy in these regions is
- the allele calls are the maxima in the two panels
- the following code may be used according to certain embodiments ofthe Envelope Caller methods.
- Line 6 calls the subroutine envelope (lines 21-53) which divides the signal into the panels
- Line 10 tests the condition given in step 4. If these conditions are met, line 11
- % function hl divider(cur, plotflag) % % cur - structure containing the data % plotflag - set to anything to plot the process % % hi - vector of division points % %
- Optimizer Caller operates as follows.
- the algorithm operates on the principle of deconvolution that identifies the impulse functions
- routine uses model fit optimization to effect deconvolution.
- the model parameters optimized are peak location, peak height, and stutter percentage.
- the algorithm first performs dimensionality reduction by sampling at bins and then identifies the largest peak as the dominant allele. Bins are locations where one would expect to find alleles. Due to the way the data is generated, fragment lengths
- this threshold is +/- 0.15
- the main Optimizer Caller algorithm steps are
- the algorithm searches for the location, height, and stutter percentage ofthe B allele peak
- the B peak may in fact be the same as the
- a peak i.e. a homozygote.
- FIG 9 illustrates two different attempts in the search for the B allele. Recall that the A
- the hypothesized signal is then compared to the A and B hypotheses.
- Figure 3C depicts a flow chart ofthe steps performed by a data processing system when processing alleles with the Heuristic Caller algorithm according to certain embodiments.
- Heuristic Caller algorithm uses multiple rules (filters) to eliminate peaks that are not alleles.
- the remaining peak(s) may be alleles.
- any of a number of preprocessing steps may be performed. Examples include the N-
- Noise quantification is used to assess the quality ofthe signal.
- Quantification includes:
- the SSE will be low and more faith can be placed in the calls. If the SSE is high then the user is alerted that it might be wise to look at the signal and make calls manually.
- the process includes step 342 where the Heuristic Caller algorithm forms a peak list using a peak detection algorithm such as the Savitzky-Golay algorithm.
- a list is formed with an entry for each peak that contains the following three pieces of information; peak location, peak height, and peak width.
- various filters are applied to remove peaks that are not the correct allele calls (step 344).
- Nonlimiting examples of one or more rules that may be employed include:
- Split peaks are two peaks that appear in the peak list that are similar in height (for example, at least about 70%) and typically less than about 0.1 basepairs apart. They typically are caused by a mixture of double and single stranded DNA. According to certain embodiments, if split peaks are detected, only the highest ofthe split peaks is preserved.
- Background peaks are spurious peaks that do not have any significant stutter. Stutter almost always occurs for dinucleotide markers. Thus, peaks that do not have any significant stutter are considered background peaks and are removed from the list. Background peaks are typically due to sample contamination.
- Spikey peaks are spurious peaks that are tall but have a width that is not typical ofthe other peaks.
- a peak list has height, width and location data. Thus, an average peak width can be determined and any peaks that are too narrow compared to the rest ofthe population are removed. They are typically caused by sample contamination.
- Shoulder peaks are peaks that appear very close to another peak and thus have the appearance of a shoulder. They are similar to spikey peaks except typically are lower in height, greater than 0.1 bp away, and less than 1 bp away. These are often caused by instrument noise. In certain embodiments, the shoulder peaks are removed.
- the filters applied in step 344 include at least one of those shown in the flow chart of Figure 3D.
- the One basepair Checker checks neighboring peaks to see whether there are one basepair peaks present.
- the Plus A checker and the Shoulder peak checker are switched with one another in the flowchart of Figure 3D. (The Final Assembler shown in Figure 3D assembles the final results and calls the alleles.)
- the Heuristic Caller algorithm determines if there are one or two remaining peaks (step 346). If there are more than two remaining peaks, additional filters are applied (step 348) in order to reduce the number of peaks to one or two.
- rule when four peaks remain - generally, the lowest two can be removed.
- Figures 10 through 12 depict data that can be evaluated with the heuristic algorithm
- the heuristic caller assumes that there are a maximum of two alleles for a given marker. In certain embodiments, there is no such assumption for a maximum
- Figure 4 depicts the steps performed by committee machine 110 according to certain aspects
- Committee machine 110 arbitrates the calls by using a set of rules.
- committee machine 110 determines the correct calls to transmit and assigns a
- the confidence level for these calls (step 404). According to certain embodiments, the confidence
- committee machine 110 defines the confidence value as 0.970. If there is no call for the
- Heuristic algorithm, and the same call for the Envelope method and the Optimizer, committee machine 110 passes those calls to the user and assigns a confidence value of 0.621. If only the
- Optimizer produces a call, committee machine 110 assigns a confidence value of 0.692 correct. And finally, any cases that do not fit into the above scenarios are assigned the calls given by the
- Confidence levels can also be assigned by a person who is familiar with use ofthe
- envelope refuses to make a call.
- hypotheses based on parameterization of an allele signal using allele location, amount of stutter and +A artifact.
- the hypothesis that best explains the signal's energy content is declared the
- winner and the allele calls are those used in forming the winning hypothesis.
- This threshold is determined as cutoffValue * the peak's maximum
- Figure 13 illustrates a typical standard heterozygous allele signature.
- Cercles denote user annotated allele calls, x-axis is in base pairs, y-axis is in A/D counts (voltage intensity))
- Certain embodiments of this algorithm include five parameters: cutoffValue, + A distance, + A ratio, stutter distance and stutter ratio.
- the program provides default values for these parameters and allows the user to adjust these values in the User Interface.
- Bleedthrough (pullup) peak Peaks exists due to strong neighboring color peaks and multicomponenting inaccuracy. This may be less than optimal for HID applications.
- Background peak One single background peak exists due to poor gel slabs.
- Spiky stutter peak Abnormally high and narrow stutter peaks.
- the heuristic algorithm addresses these potential sources of error.
- the heuristic algorithm includes additional rules. According to certain embodiments,
- the algorithm proceeds as follows.
- Noise Checker The noise level in the signal is checked. If the signal is too noisy, the
- preprocessing simply involves sampling the original signal to
- the signal By representing the signal in such a compact form, the search space is reduced significantly.
- the peaks form the set of candidate allele peaks that will be
- top frame show the signal that is being approximated.
- the candidate alleles are given by the
- the middle frames show the hypothesized signal given different stutter
- the Envelope caller is developed on the principle that while other callers may generally
- the data is determined to be heterozygous.
- Allele calling is then simply performed by finding the maximum peak in each hump. While some simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules could be added to slightly increase the accuracy. Specifically, these simple heuristic rules
- the calling strategies should be fundamentally different in order that they are to increase confidence to the close to one hundred percent mark in this subset ofthe data.
- the calling strategies should be fundamentally different in order that they are to increase confidence to the close to one hundred percent mark in this subset ofthe data.
- the panel marked p3 contains the third largest energy content.
- the algorithm proceeds to make a call if the following two criteria are met (1) > 0.2
- the call is made by finding the maximums in each of panels 1 and 2.
- the individual algorithms may not be optimal when employed alone.
- the degree of confidence for a call is based on the statistical probability of an answer being correct given the degree on consensus between the different callers. This is a particularly apt approach when one considers that one ofthe callers
- Table 3 Results illustrating confidence values that are created by considering agreement between the calling algorithms.
- the final two rows are for the same chunk of data. They show that the default caller should be the heuristic as it has a higher percentage of correct calls.
- the multi-caller approach is significant in that it provides hard numbers for the confidence in the calls. As well, by partitioning the data into different categories based on how easily the data is classified, it does well in providing a method for checking results.
- FIG. 5 is a block diagram that illustrates a computer system 500, according to certain embodiments, upon which embodiments ofthe invention may be implemented.
- Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a processor 504 coupled with bus 502 for processing information.
- Computer system 500 also includes a memory 506, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for determining allele calls, and instructions to be executed by processor 504.
- Memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504.
- Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504.
- ROM read only memory
- a storage device 510 such as a magnetic disk or optical disk, is provided and coupled to bus 502 for storing information and instructions.
- Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user.
- a display 512 such as a cathode ray tube (CRT) or liquid crystal display (LCD)
- An input device 514 is coupled to bus 502 for communicating information and command selections to processor 504.
- cursor control 516 is Another type of user input device, such as a mouse, a trackball or cursor direction keys for
- This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to
- a computer system 500 provides allele calls and provides a level of confidence for the
- an allele call is provided by computer system 500 in response to processor 504 executing one or
- memory 506 reads into memory 506 from another computer-readable medium, such as storage device 510.
- Such a medium may take
- Non-volatile media includes, for example, optical or magnetic disks, such as storage
- Volatile media includes dynamic memory, such as memory 506.
- media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise
- Transmission media can also take the form of acoustic or light waves, such as those
- Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, papertape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
- Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution.
- the instructions may initially be carried on magnetic disk of a remote computer.
- the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
- a modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
- An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502.
- Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions.
- the instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
- systems consistent with certain embodiments ofthe present invention provide a committee machine that receives calls as input from at least two different allele calling algorithms. By receiving these calls, the committee machine is able to determine a level of confidence in a variety of conditions.
- implementation includes software but the present invention may be implemented as a combination of hardware and software or in hardware alone.
- the invention may be implemented
- the system uses a bin assigning algorithm. In certain embodiments, the system uses a bin assigning algorithm.
- a bin typically is composed of a center point ofthe known allele size
- embodiments will include a center point of a known allele size and include 0.5 points on either
- Bins it is assigned as an unknown allele. Bins can be any appropriate size.
- the system includes an auto binning algorithm. In such embodiments, the system includes an auto binning algorithm.
- such an algorithm may be used to establish alleles size bins when no allele sizes are yet known for a population. In certain embodiments, such an algorithm may be used to add more allele bins
- an auto binning component may use an auto binning component to determine additional allele size bins for a population when alleles have
- the auto binning algorithm collects all alleles that have been
- the algorithm also calculates a cost for each allele
- the auto binning algorithm then calculates a total cost for all ofthe alleles.
- the chosen bin centers are finalized. If the total cost is below a certain threshold level, then the chosen bin centers are finalized. If the
- the auto binning algorithm reassigns bin centers to each allele and calculates the total cost in an iterative process until the total cost is below the threshold level.
- a given value on either side ofthe bin centers is added to obtain the final bin.
- 0.5 is added to each side ofthe bin center to obtain the final bin width.
- the auto binning algorithm uses a classic k-means clustering algorithm for auto binning.
- the algorithm collects all the alleles from the input samples, removes the alleles whose quality values are less than certain number (0.1) (see quality value discussion below) and feeds them into a iteration process to find the bins as shown in Table 4 below .
- the system includes a preprocessing algorithm, which comprises at least one of an offscale detection algorithm, a multicomponenting algorithm, and a baselining
- the system includes a data conversion algorithm, which
- a peak detecting algorithm comprises at least one of a peak detecting algorithm, a size standard matching algorithm, a ladder
- the system includes a allele call reporting algorithm, comprising at
- the allele call report is the reported allele call that may be
- the allele call report may be provided after an allele calling algorithm and one or more subsequent algorithms have been applied.
- the allele call report may be
- the predicted accuracy of an allele call report may be generated in view of certain quality values as discussed below. In certain embodiments, the predicted
- the system uses one or more quality values (QVs) and/or a
- quality values may be used to predict the accuracy ofthe called alleles in the data.
- quality values and/or warning flags may be used to predict the accuracy ofthe called alleles in the data.
- the predicted prediction is used to predict the accuracy ofthe allele call report.
- the predicted prediction is used to predict the accuracy of the allele call report.
- Quality values may be used for any or all ofthe algorithms within a system. Exemplary
- the system uses a multicomponenting QV to determine the quality value ofthe multicomponenting result. In certain embodiments, one may employ
- the system uses a baselining QV to determine the quality value
- the residual signal is an indication ofthe error ofthe baseline.
- the system uses a size standard matching QV to determine the
- the size standard is a size standard
- the matching QV is determined using two processes.
- the first process is that it calculates the scan
- the matching result is not correct and the quality value is 0.0.
- the second process is based on chi-square test. From the size standard definition, it calculates the theoretical (expected) distances (in base pair) among all these fragments. From the size standard definition, it calculates the theoretical (expected) distances (in base pair) among all these fragments. From the size standard definition, it calculates the theoretical (expected) distances (in base pair) among all these fragments. From the size standard definition, it calculates the theoretical (expected) distances (in base pair) among all these fragments. From
- peaks After sizing, peaks have two values: Size (the size ofthe peak) and scan number (the time
- the system uses an allele calling QV to determine the quality
- the more than one allele calling algorithm is the more than one allele calling algorithm.
- an allele calling QV that is based on
- an allele calling QV is generated for each allele calling
- each allele calling algorithm that makes an allele call may generate
- the quality value of that allele calling algorithm may be used as the overall quality value.
- allele calling quality value may be generated in view ofthe percent of correct calls that fit into
- the heuristic allele calling algorithm uses some heuristic rules to
- Quality value starts with value 1.0; 2. For the Noise Checker, the quality value is multiplied by (1.0 - noiseLevel);
- the quality value is further decreased if the called allele peaks violate the user settable peak
- the system uses an auto binning QV to determine the quality value ofthe auto binning component.
- the auto binning QV is
- the auto binning component iterates through all the alleles involved and bin centers to calculate the residue (mean square error). This residue is adjusted by the marker repeat.
- adjusted residue AR is used as a determinant for binning quality value.
- the quality value is set at 1.0.
- the system uses a bin assigning QV to determine the quality
- the bin assigning assigning
- QV is determined by the distance given alleles are located from the bin centers.
- the bin assigning QV is set at 1 if the allele falls within a bin, and is set at 0.1 if
- the allele does not fall within a bin.
- the system reports to the user multiple warning flags.
- warning flags alert the user that there could be potential problems with the accuracy ofthe data. Certain embodiments employ the following warning flags:
- Offscale - The flag is set when there is an offscale peak within the calling range.
- Spiky Peak - The flag is set when there is spiky peak present in the marker signal.
- the flag is set if the narrowest peak in a cluster has a width 50 % less than the
- the flag is set in certain embodiments when there are two called alleles that
- Peak Height Ratio - The flag is set when there are two alleles present and the ratio between
- this level is set to 0.5.
- Peak Absolute Height - The flag is set when the alleles are lower than the specified values.
- these values are set to 200 if homozgyous and 100 if heterozygous.
- Bleedthrough - The flag is set when the marker signal contains bleed through peaks (pull up
- bleed through is detected when there is a peak in a different color within 1 scan and that peak is less than 20 % ofthe larger one.
- this value is set to 1.5 base pair. In certain embodiments, one measures the peak width at half of the peak's height.
- a background peak is one that does not fit into a cluster.
- a background peak is determined to exist when there is a small peak beside a large
- the flag is set when the number of alleles exceed the maximum number possible for the species or no alleles are found.
- Offscale is also used for all three (Linkage Dinucleotide, Linkage Tris and Tetras, and HID Tris and Tetras) according to certain embodiments.
- the system uses an allele call report QV (also called an overall QV)
- allele call report may be provided after an allele calling algorithm and a bin assigning algorithm
- qvSizeMatch comes from the size matching algorithm.
- QvAllelePeakPick comes from the allele peak picking algorithm. It may be a consensus value if
- the system uses more than one allele peak picking algorithm.
- QvBin comes from the setting ofthe bins for a population. In certain embodiments, this value is
- Allele Peak Picking QV (In certain embodiments, it may be the consensus value, which is
- Marker Quality Value may be used rather than the consensus value.
- the quality value is set at 0.
- the value is set at 1 for the factor that is edited.
- each allele has its own
- the system is used for human identification. In certain such embodiments, the system is used for human identification.
- the known alleles are provided to the user as a ladder to which the generated data can be compared.
- the ladder is a sample that includes differently sized nucleotides, each corresponding to a particular allele for a given marker.
- the user is also apprised of bins that have bin centers that each correspond to the expected size of each ofthe differently sized nucleotides for each allele in the ladder. From run to run and instrument to instrument, when one employs the ladders in a process, there may exist some shifts for these ladder locations. In other words, the data generated when one uses the ladders in an experiment may include ladder peak sizes that do not correspond exactly with the expected bin centers and may include more peaks than expected bin centers. Thus, in certain embodiments, one may use a ladder shift algorithm to adjust the bin locations to account for these ladder shifts and/or additional peaks to provide bins that may provide more accurate results for determining the size of alleles in an experimental sample than unadjusted bin locations.
- the system finds the locations of the ladders (by searching bin definitions, which are the expected bin centers for the alleles of a ladder that are reported to the user) and uses a dynamic programming algorithm to match the bin locations to the peaks ofthe ladder signal.
- bin definitions which are the expected bin centers for the alleles of a ladder that are reported to the user
- the matching algorithm employs a minimum peak height of 100 to 150 rfu since the ladders typically are very strong signals.).
- the shifts are calculated for each ladder bin definition/peak pair.
- Each ladder is then provided revised bins for assigning peaks obtained from a sample. For example, after the system calls alleles in a sample, the alleles are assigned to bins that have been adjusted using the shifts.
- the process proceeds using the flow chart in Table 5:
- the size standard matching component discussed above is used for ladder shifts as follows.
- alleles within a given ladder are assigned to bins.
- the user is also alerted to virtual bins.
- Virtual bins are bins in which an allele may occur, but that possible allele is not provided in the ladder.
- the virtual bins may need to be shifted when there is a shift determined for the actual alleles in the ladder.
- the shifts are detected for the ladder of each marker independently from other ladders for other markers.
- the size standard matching algorithm discussed above in the Size Standard Matching section is used to evaluate the data generated with the ladders by matching the peaks expected in the ladder to the peaks actually observed (in certain embodiments, one uses peaks > 100 rfu).
- the size standard matching component will attempt to fit the observed pattern to the expected pattern.
- the matching will generate a marker quality value (also called a ladder shift quality value).
- the marker quality value is generated using the same technique that is discussed above in the Size Standard Matching QV section. (This marker quality value may be used instead of the allele calling quality value in the overall genotyping quality.)
- the algorithm is now aware of which ladder peak represents which bin. It takes the allelic ladder peak size calculated above and subtracts from it the value ofthe expected bin center. This gives a bin shift for that bin in that allelic ladder file. Any virtual bins are given the same shift as the closest ladder bin to its left. Thus, if a ladder file has a shifted an allele bin center + 0.2 from the expected bin center, a virtual bin to the right of such a ladder bin will also
- this shift is calculated for each marker, and bin shifts from each
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Complex Calculations (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002416764A CA2416764A1 (fr) | 2000-07-21 | 2001-07-23 | Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques |
AU2002211212A AU2002211212A1 (en) | 2000-07-21 | 2001-07-23 | Methods, systems, and articles of manufacture for evaluating biological data |
EP01979226A EP1349960A2 (fr) | 2000-07-21 | 2001-07-23 | Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques |
JP2002513951A JP2004516455A (ja) | 2000-07-21 | 2001-07-23 | 生物学的データを評価するための方法、システム、および製品 |
Applications Claiming Priority (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US21969700P | 2000-07-21 | 2000-07-21 | |
US60/219,697 | 2000-07-21 | ||
US22755600P | 2000-08-23 | 2000-08-23 | |
US60/227,556 | 2000-08-23 | ||
US72491000A | 2000-11-28 | 2000-11-28 | |
US09/724,910 | 2000-11-28 | ||
US29012901P | 2001-05-10 | 2001-05-10 | |
US60/290,129 | 2001-05-10 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2002008469A2 true WO2002008469A2 (fr) | 2002-01-31 |
WO2002008469A3 WO2002008469A3 (fr) | 2003-07-17 |
WO2002008469A9 WO2002008469A9 (fr) | 2003-11-20 |
Family
ID=27499158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/023629 WO2002008469A2 (fr) | 2000-07-21 | 2001-07-23 | Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1349960A2 (fr) |
JP (1) | JP2004516455A (fr) |
AU (1) | AU2002211212A1 (fr) |
CA (1) | CA2416764A1 (fr) |
WO (1) | WO2002008469A2 (fr) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004003234A3 (fr) * | 2002-06-28 | 2005-04-07 | Applera Corp | Systeme et methode de regroupement de genotypes snp |
WO2007119779A1 (fr) | 2006-04-14 | 2007-10-25 | Nec Corporation | Méthode de discrimination individuelle et appareil |
EP1862929A1 (fr) * | 2006-02-28 | 2007-12-05 | Hitachi Software Engineering Co., Ltd. | Procédé et dispositif d'évaluation de résultat de genotypage |
RU2635954C2 (ru) * | 2013-08-27 | 2017-11-17 | Ниссан Мотор Ко., Лтд. | Многозвенный поршневой кривошипно-шатунный механизм для двигателя внутреннего сгорания |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4713138B2 (ja) * | 2004-12-06 | 2011-06-29 | 株式会社日立ソリューションズ | 遺伝子情報の表示方法及び表示装置及びプログラム |
JP2017532699A (ja) * | 2014-09-05 | 2017-11-02 | ナントミクス,エルエルシー | 起源の判定のためのシステムと方法 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5580728A (en) * | 1994-06-17 | 1996-12-03 | Perlin; Mark W. | Method and system for genotyping |
US6019896A (en) * | 1998-03-06 | 2000-02-01 | Molecular Dynamics, Inc. | Method for using a quality metric to assess the quality of biochemical separations |
WO1999053423A1 (fr) * | 1998-04-16 | 1999-10-21 | Northeastern University | Systeme expert d'analyse des electrophoregrammes d'adn |
-
2001
- 2001-07-23 JP JP2002513951A patent/JP2004516455A/ja active Pending
- 2001-07-23 EP EP01979226A patent/EP1349960A2/fr not_active Withdrawn
- 2001-07-23 AU AU2002211212A patent/AU2002211212A1/en not_active Abandoned
- 2001-07-23 CA CA002416764A patent/CA2416764A1/fr not_active Abandoned
- 2001-07-23 WO PCT/US2001/023629 patent/WO2002008469A2/fr active Search and Examination
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004003234A3 (fr) * | 2002-06-28 | 2005-04-07 | Applera Corp | Systeme et methode de regroupement de genotypes snp |
EP1862929A1 (fr) * | 2006-02-28 | 2007-12-05 | Hitachi Software Engineering Co., Ltd. | Procédé et dispositif d'évaluation de résultat de genotypage |
WO2007119779A1 (fr) | 2006-04-14 | 2007-10-25 | Nec Corporation | Méthode de discrimination individuelle et appareil |
RU2635954C2 (ru) * | 2013-08-27 | 2017-11-17 | Ниссан Мотор Ко., Лтд. | Многозвенный поршневой кривошипно-шатунный механизм для двигателя внутреннего сгорания |
Also Published As
Publication number | Publication date |
---|---|
WO2002008469A3 (fr) | 2003-07-17 |
AU2002211212A1 (en) | 2002-02-05 |
CA2416764A1 (fr) | 2002-01-31 |
EP1349960A2 (fr) | 2003-10-08 |
JP2004516455A (ja) | 2004-06-03 |
WO2002008469A9 (fr) | 2003-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20020116135A1 (en) | Methods, systems, and articles of manufacture for evaluating biological data | |
US10347365B2 (en) | Systems and methods for visualizing a pattern in a dataset | |
US11371074B2 (en) | Method and system for determining copy number variation | |
US20040117130A1 (en) | System and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms | |
US8392126B2 (en) | Method and system for determining the accuracy of DNA base identifications | |
Ewing et al. | Base-calling of automated sequencer traces usingPhred. I. Accuracy assessment | |
US6236944B1 (en) | Expert system for analysis of DNA sequencing electropherograms | |
US7406385B2 (en) | System and method for consensus-calling with per-base quality values for sample assemblies | |
US20220101944A1 (en) | Methods for detecting copy-number variations in next-generation sequencing | |
CN111868832A (zh) | 识别拷贝数异常的方法 | |
EP3884502B1 (fr) | Procede et produit programme informatique pour l'analyse de l'adn foetal par sequencage massif | |
WO2002008469A2 (fr) | Procedes, systemes et articles manufactures destines a evaluer des donnees biologiques | |
US7912652B2 (en) | System and method for mutation detection and identification using mixed-base frequencies | |
Walther et al. | Basecalling with lifetrace | |
WO2024140881A1 (fr) | Procédé et dispositif de détermination de la concentration d'adn fœtal | |
Vranckx et al. | Analysis of MALDI‐TOF MS Spectra using the BioNumerics Software | |
WO2003102211A2 (fr) | Méthode de détection des variations de l'adn dans des données de séquences | |
Hellenthal | Population structure, demography and recent admixture | |
KR20200085144A (ko) | 모체 시료 중 태아 분획을 결정하는 방법 | |
Parsley | Benchmark of Tools and Methods Used in the Analysis of Massively Parallel Reporter Assays | |
Gafurov et al. | Probabilistic Models of k-mer Frequencies | |
Zavala et al. | Benchmarking for genotyping and imputation using degraded DNA for forensic applications across diverse populations | |
Puurand et al. | Y-mer: A k-mer based method for determining human Y chromosome haplogroups from ultra-low sequencing depth data | |
CN117393054A (zh) | 鉴定核酸样本拷贝数变异真假阳性和细胞分裂来源的方法及装置 | |
KR20180094498A (ko) | 핵산 시퀀스를 분석하는 방법 및 장치 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2416764 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001979226 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002211212 Country of ref document: AU |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWP | Wipo information: published in national office |
Ref document number: 2001979226 Country of ref document: EP |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/40-40/40, DRAWINGS, REPLACED BY NEW PAGES 1/36-36/36; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2001979226 Country of ref document: EP |
|
DPE2 | Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101) |