WO2018148260A1 - Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) - Google Patents
Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) Download PDFInfo
- Publication number
- WO2018148260A1 WO2018148260A1 PCT/US2018/017193 US2018017193W WO2018148260A1 WO 2018148260 A1 WO2018148260 A1 WO 2018148260A1 US 2018017193 W US2018017193 W US 2018017193W WO 2018148260 A1 WO2018148260 A1 WO 2018148260A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dna
- digital data
- information
- error correction
- codes
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 126
- 108020004414 DNA Proteins 0.000 title claims description 268
- 102000053602 DNA Human genes 0.000 title claims description 267
- 108091034117 Oligonucleotide Proteins 0.000 claims abstract description 234
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 claims abstract description 73
- 230000008569 process Effects 0.000 claims abstract description 62
- 239000002773 nucleotide Substances 0.000 claims description 160
- 125000003729 nucleotide group Chemical group 0.000 claims description 160
- 238000012937 correction Methods 0.000 claims description 117
- 238000003786 synthesis reaction Methods 0.000 claims description 71
- 230000015572 biosynthetic process Effects 0.000 claims description 70
- 230000006820 DNA synthesis Effects 0.000 claims description 57
- 230000002255 enzymatic effect Effects 0.000 claims description 31
- 238000012163 sequencing technique Methods 0.000 claims description 27
- 238000013507 mapping Methods 0.000 claims description 24
- 230000007704 transition Effects 0.000 claims description 21
- 239000003550 marker Substances 0.000 claims description 20
- 238000002887 multiple sequence alignment Methods 0.000 claims description 15
- 238000012545 processing Methods 0.000 claims description 15
- 230000002194 synthesizing effect Effects 0.000 claims description 7
- 230000007423 decrease Effects 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 4
- 238000012217 deletion Methods 0.000 description 76
- 230000037430 deletion Effects 0.000 description 76
- 238000003780 insertion Methods 0.000 description 33
- 230000037431 insertion Effects 0.000 description 33
- 230000006870 function Effects 0.000 description 31
- 238000013500 data storage Methods 0.000 description 27
- 238000009826 distribution Methods 0.000 description 21
- 238000006467 substitution reaction Methods 0.000 description 17
- 238000001712 DNA sequencing Methods 0.000 description 16
- 238000012408 PCR amplification Methods 0.000 description 16
- 238000004422 calculation algorithm Methods 0.000 description 14
- 238000013459 approach Methods 0.000 description 11
- 238000010586 diagram Methods 0.000 description 11
- 238000011084 recovery Methods 0.000 description 11
- 238000002474 experimental method Methods 0.000 description 10
- 239000011159 matrix material Substances 0.000 description 9
- 230000000875 corresponding effect Effects 0.000 description 7
- 230000007246 mechanism Effects 0.000 description 7
- 238000003752 polymerase chain reaction Methods 0.000 description 7
- 230000003321 amplification Effects 0.000 description 6
- 230000005540 biological transmission Effects 0.000 description 6
- 238000003199 nucleic acid amplification method Methods 0.000 description 6
- 108091028043 Nucleic acid sequence Proteins 0.000 description 5
- 230000015654 memory Effects 0.000 description 5
- 230000008439 repair process Effects 0.000 description 5
- 230000001360 synchronised effect Effects 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 230000031018 biological processes and functions Effects 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 238000007672 fourth generation sequencing Methods 0.000 description 3
- 239000000126 substance Substances 0.000 description 3
- 241000169170 Boreogadus saida Species 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000013523 data management Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 238000007726 management method Methods 0.000 description 2
- 238000013178 mathematical model Methods 0.000 description 2
- 238000002493 microarray Methods 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000010076 replication Effects 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 230000004544 DNA amplification Effects 0.000 description 1
- 102000004190 Enzymes Human genes 0.000 description 1
- 108090000790 Enzymes Proteins 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000003851 biochemical process Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000003467 diminishing effect Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000001962 electrophoresis Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000007717 exclusion Effects 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 230000012010 growth Effects 0.000 description 1
- 229920001519 homopolymer Polymers 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000036961 partial effect Effects 0.000 description 1
- 230000000135 prohibitive effect Effects 0.000 description 1
- 102000004169 proteins and genes Human genes 0.000 description 1
- 108090000623 proteins and genes Proteins 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 230000002040 relaxant effect Effects 0.000 description 1
- 230000033458 reproduction Effects 0.000 description 1
- 230000002441 reversible effect Effects 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 238000009827 uniform distribution Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/29—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes
- H03M13/2906—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes using block codes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C13/00—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
- G11C13/0002—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
- G11C13/0009—RRAM elements whose operation depends upon chemical change
- G11C13/0014—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
- G11C13/0019—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/03—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
- H03M13/05—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
- H03M13/11—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
- H03M13/1102—Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/03—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
- H03M13/05—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
- H03M13/13—Linear codes
- H03M13/15—Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
- H03M13/151—Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
- H03M13/1515—Reed-Solomon codes
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/03—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
- H03M13/05—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
- H03M13/13—Linear codes
- H03M13/15—Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
- H03M13/151—Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
- H03M13/152—Bose-Chaudhuri-Hocquenghem [BCH] codes
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/37—Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
- H03M13/3761—Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 using code combining, i.e. using combining of codeword portions which may have been transmitted separately, e.g. Digital Fountain codes, Raptor codes or Luby Transform [LT] codes
Definitions
- the present principles relate generally to digital information storage in deoxyribonucleic acid (DNA).
- DNA contains the genetic program for the biological development of life.
- DNA may be used as a compact storage medium for petabytes of information by encoding digital information into sequences of DNA nucleotides.
- the potential benefits of DNA storage include: (1) Extremely high-density storage beyond the order of terabytes within 1 gram of DNA; (2) Decades of stability and durability at moderate temperatures; (3) Biological replication due to PCR-amplification; (4) Rapid biological search and indexing mechanisms via primers; (5) Biological editing and re-encoding of segments via enzymes.
- An end-to-end system for DNA storage comprises at least several components.
- Source information e.g., a movie file
- Source information is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments.
- the DNA segments must be sequenced, demodulated, assembled, and decoded while addressing errors that may occur.
- an embodiment of a sytem for storing digital data in DNA using a DNA process that may be error prone, e.g., comprising at least one of a lossy or noise DNA synthesis process and a lossy or noisy DNA sequencing process includes decoding data encoded into DNA oligonucleotides using coded multiple sequence alignment wherein a decoder includes a block which processes one or more versions of each DNA oligonucleotide using synchronization information included in the oligonucleotides, e.g., synchronization markers corresponding to nucleotides inserted into the oligonucleotides during encoding of the digital data.
- synchronization may be striped synchronization markers
- the synchronization marker may be a simple code or more complex synchronization codes.
- the coded alignment may be based on dynamic programming algorithms to efficiently compute the alignment. Coded alignment may also involve multiple versions of an oligonucleotide.
- an embodiment of a sytem for storing digital data in DNA comprises harnessing diversity of multiple synthesized sequences.
- two or more versions of an oligo may be used for alignment.
- Diversity may provide various advantages including boosting the performance of coded alignment (dynamic programming solution) and enabling use of consensus algorithms which correct errors in the oligonucleotides using techniques such as majority voting mechanisms.
- an embodiment of a sytem for storing digital data in DNA comprises using consensus, e.g., consensus algorithms.
- Consensus utilizes diversity (e.g., two or more versions of an oligo) to correct nucleotide errors . If one version of an oligo is corrupted in a certain position, other versions may not be corrupted in the same position. Thus, if synchronization/alignment is obtained, consensus over multiple diverse versions can greatly improve decoding accuracy.
- Consensus algorithms can scale efficiently to utilize 10, 100, or 1000 versions of an oligo because because relatively simple majority voting style approaches may be applied.
- an embodiment comprises consensus utilizing probabilistic information (e.g., similar to votes) from many versions of an oligo to correct errors in nucleotides.
- an embodiment of a sytem for storing digital data in DNA comprises error-correction features, e.g., an error correction pipeline.
- error-correction features e.g., an error correction pipeline.
- an individual error-correction code ECC
- oligo ECC error-correction code
- block error correction may be included by adding oligo s representing block error correction codes during encoding (block ECC).
- block ECC block error correction feature enables correcting errors across multiple oligos, in case any oligo is missing due to the lossy characteristics of the DNA synthesis process.
- Alignment and consensus may provide synchronization and partial error-correction in accordance with an aspect of the present principles.
- the oligo ECC and also block ECC over multiple oligos provide further error-correction of bits.
- an embodiment of a sytem for storing digital data in DNA comprises modulation blocks.
- an encoder includes a final block wherein bits are converted to nucleotides.
- Another aspect comprises a modular structure enabling the system to accommodate different modulation schemes.
- a decoder comprises demodulation as a first block or function.
- the modular structure enables organizing error-correction for DNA storage. For example, traditional error-correction occurs over bits (not nucleotides) so in an embodiment error correction occurs before modulation in the encoder, and after demodulation in the decoder.
- one or more embodiments as described herein may provide one or more advantages such as the following.
- Systems in accordance with the present principles are highly error-tolerant, e.g., capable of correcting for nucleotide errors on the order of 20% or more produced by a lossy or error- prone DNA system, e.g., lossy synthesis and/or sequencing; correcting for nucleotide insertion erros and substitution errors; and correcting for missing oligos.
- Systems incoroporating coding in accordance with the present principles may reduce the required number of reads of DNA for a lossy DNA synthesis process due at least in part to the error tolerance of an embodiment in accordance with the present principles.
- Systems incoroporating coding in accordance with the present principles and, in particular, error tolerant aspects of coding in accordance with the present principles may enable increasing the length of each oligo produced by synthesis, thereby allowing more bits to be encoded per oligo, despite the synthesis process being lossy or hyper-lossy such as may be the case with low cost synthesis approaches such as enzymatic synthesis and/or lossy DNA sequencing approaches during data reads.
- embodiments in accordance with the present principles make use of low-cost lossy or error-prone DNA systems including synthesis and/or sequencing processes for reliable storage of large volumes of data.
- an embodiment of a method of encoding digital data comprises adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide using coded alignment.
- DNA deoxyribonucleic acid
- an embodiment of a method of encoding digital data including adding synchronization information to the digital data may include the synchronization information comprising a synchronization marker.
- an embodiment of a method of encoding digital data wherein a plurality of codes may include synchronization information, the synchronization information may be further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotides produced by a lossy or error-prone DNA synthesis process during decoding.
- an embodiment of a method of encoding digital data wherein a plurality of codes may include synchronization information, the synchronization information may be further configured to enable a use of diversity including consensus.
- an embodiment of a method of encoding digital data may further include adding error correction information to the digital data after adding address information and before converting the digital data to codes, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes.
- the error correction information may include a block error correction code and a word error correction code.
- a block error correction code may include one of a Reed- Solomon code and a Fountain code.
- a word error correction code may include one of a BCH code and a LDPC code.
- an embodiment of a method of encoding digital data may comprise adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
- DNA deoxyribonucleic acid
- synchronization information may comprise a synchronization marker.
- a plurality of codes including synchronization information may be further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotide produced by the lossy or error-prone DNA synthesis process during the subsequent decoding.
- a plurality of codes including synchronization information may be further configured to enable the use of diversity to include a use of consensus.
- error correction information may include a block error correction code and a word error correction code.
- a block error correction code may include one of a Reed- Solomon code and a Fountain code.
- a word error correction code may include one of a BCH code and a LDPC code.
- converting digital data to a plurality of codes may comprise a mapping of the digital data to nucleotides according to a modulation map.
- converting digital data to a plurality of codes using a modulation map may comprise the modulation map being in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
- converting digital data to a plurality of codes using a modulation map may comprise the modulation map writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
- an embodiment of a method of decoding digital data stored in DNA may comprise aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy or erro- prone DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
- synchronization information may comprise a synchronization marker.
- aligning may further comprise using a diversity among the plurality of oligonucleotides.
- aligning using the diversity may further include a use of consensus.
- each of the plurality of binary words may include error correction information, and extracting the digital data may include error correction based on the error correction information.
- error correction information may include a block error correction code and a word error correction code.
- a block error correction code may include one of a Reed- Solomon code and a Fountain code.
- a word error correction code may include one of a BCH code and a LDPC code.
- a modulation map may be in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
- a modulation map may comprise writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
- an embodiment of a lossy or error-prone DNA synthesis process for digital data storage may comprise an enzymatic molecular synthesis process.
- an embodiment of an encoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
- an embodiment of a decoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
- an embodiment may comprise a non-transitory computer-readable medium storing computer-executable instructions executable to perform one or more aspects of embodiments of methods as described herein.
- an embodiment of a method of decoding digital data stored in DNA molecules may comprise accessing a plurality of DNA molecules storing encoded digital data, wherein the encoded digital data includes a code component comprising an address, synchronization information, and error correction information, and the DNA molecules were synthesized using a lossy or error- prone DNA synthesis process; sequencing the plurality of DNA molecules; merging and assembling the plurality of DNA molecules to form a plurality of DNA oligonucleotides; and decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes applying a coded multiple sequence alignment (MSA) procedure to the plurality of DNA segments using the marker symbol in each of the plurality of segments to align the plurality of segments, processing the aligned plurality of segments using a diversity scheme to extract digital information from the aligned plurality of segments, and performing error correction on the extracted digital information using the error correction component to produce the decoded digital data.
- MSA coded multiple sequence alignment
- an embodiment of an encoder comprises encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and an archive to store the plurality of DNA segments.
- an embodiment comprises a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder aligns the plurality of oligonucleotides using coded alignment based on synchronization information included in the plurality of oligonucleotides; demodulates the aligned plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and extracts the digital data from the binary words.
- An aspect of the present principles comprises an embodiment of a system for storage of digital information in DNA including components comprising: (1) Source data in bits; (2) Encoding mechanism including all error-correction codes and modulation from bits to nucleotides; (3) DNA synthesis of multiple DNA segments; (4) PCR-amplification of DNA pools; (5) DNA archival storage; (6) DNA sequencing; (7) Merging and assembling multiple DNA segments; (8) Demodulation and decoding of all codes for reliable recovery.
- an embodiment addresses a challenge of designing a coded DNA storage system involving ensuring the efficiency and compatibility of the different components which correct diverse errors.
- An aspect of the present disclosure involves error-correction being introduced for DNA storage based on low-cost enzymatic molecular synthesis.
- An embodiment of a modular pipelined system provides encoding and decoding digital information reliably in DNA molecules in the presence of a large fraction of differentiated errors.
- a source of data is efficiently encoded, modulated, and stored in a set of oligonucleotide segments.
- DNA sequencing the information stored on multiple segments is assembled and decoded reliably.
- a classification of errors is provided for the biological processes of low-cost DNA synthesis, PCR-amplification, and DNA sequencing.
- Error-correction is designed to recover data corrupted by insertions, deletions, and substitutions of nucleotides. Low-cost synthesis also places constraints on the modulation of bits to nucleotides.
- a flexible system is presented comprised of error-correction codes (e.g., Reed-Solomon, LDPC, or polar codes), synchronization codes, and constrained modulation codes.
- the encoding step comprises modulation, synchronization, addressing, and error-correction within each oligonucleotide and per block of multiple oligonucleotides.
- the decoding step comprises reciprocal measures to assemble source data from multiple segments. Error-correction is designed to recover data corrupted by several types of errors including missing segments. The amount of overhead redundancy envisioned for reliable storage is specified and compared to information-theoretic bounds.
- an enzymatic synthesizer which is intended to produce one oligonucleotide may actually produce multiple output strands which have varying amounts of insertions, deletions, and substitutions.
- Another aspect relates to an error-correction system suitable for low-cost enzymatic synthesis.
- Traditional error-correction systems for DNA storage are based on high-fidelity molecular synthesis (e.g., using Agilent micro-array platforms). Such synthesis machines are too costly for commercial purposes, and cannot enable ubiquitous storage.
- Enzymatic synthesis in accordance with aspects of the present disclosure produces DNA oligonucleotides using low-cost chemicals. The low-cost property allows eventually scaling up to terabytes of synthesized data.
- an aspect of the present disclosure involves introducing a constraint graph for nucleotide patterns.
- a modulation scheme implements the constraint graph. The purpose of modulation is to maximize the capacity of information (how many bits per nucleotide stored), and to minimize the probability of insertions/deletions (due to less reliable nucleotide transitions). Modulation in accordance with the present principles avoids less reliable transitions entirely. It is also possible to minimize the usage of the less reliable transitions.
- Another aspect of the present principles addresses an issue with low-cost enzymatic synthesis involving multiple oligonucleotide segments produced all of which do not have the same lengths.
- An aspect of the present disclosure involves using synchronization codes to address distribution of lengths of oligonucleotides, otherwise the storage capacity may be significantly reduced.
- One type of synchronization code is a marker code.
- a marker code may comprise known marker symbols which are inserted in the bit stream prior to modulation.
- synchronization errors insertion and deletion errors
- oligonucleotide segments are synchronized, the positions which have been deleted may be corrected.
- error-correction within each oligonucleotide is provided to solve this problem.
- BCH codes are applied for error-correction after the oligonucleotide has been synchronized to the correct length. That is, in accordance with the present principles, first synchronize and then correct.
- synchronization provides (posterior) probabilities for the sequence of bits stored in each oligonucleotide.
- other codes such as LDPC (low-density parity-check codes), polar codes, short Turbo codes, or convolutional codes could be applied as well.
- Another aspect of the present principles addresses an issue with DNA storage involving assembling the original data from multiple, short, sequenced oligonucleotides.
- an address for each oligonucleotide is provided for assembly.
- address bits are concatenated with data bits prior to applying error-correction for each oligonucleotide.
- decoding after an oligonucleotide has been synchronized and error-corrected, its address is retrieved and thus its position in assembling all of the original data is known.
- Another aspect of the present disclosure involves addressing an issue with DNA storage involving that some oligonucleotides might be corrupted beyond repair, and/or lost in the DNA solution (e.g., if PCR-amplification does not yield enough copies for sequencing).
- a code e.g., Reed-Solomon code
- Another aspect of the present disclosure involves correcting for multiple types of errors that may occur simultaneously with DNA storage based on low-cost enzymatic synthesis.
- this problem is addressed using a modular error-correction pipeline.
- the order of the blocks in the pipeline provides for correcting diverse types of errors. It is possible to jointly design the blocks in the pipeline to improve efficiency, but a modular approach allows to improve each block separately.
- Encoding in accordance with the present principles involves the following in order: (1) Store data bits in multiple oligonucleotides with redundancy per block of oligonucleotides; (2) Add an address per oligonucleotide; (3) Add error-correction per oligonucleotide; (4) Add synchronization per oligonucleotide; (5) Transform bits to nucleotides using modulation scheme.
- Decoding in accordance with the present principles involves the following in order: (1) Demodulate each oligonucleotide from nucleotides to bits; (2) De- synchronize each oligonucleotide; (3) Error-correction by decoding each oligonucleotide; (4) retrieve address for each oligonucleotide; (5) If any oligonucleotide in a block is corrupted beyond repair or missing, utilize the redundancy per block of oligonucleotides to reconstruct the original source data.
- an embodiment of an encoder for encoding digital data comprises at least one processor configured at least one processor configured for adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide using coded alignment.
- a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process
- an embodiment of a device for encoding digital data comprises at least one processor configured for adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
- the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
- an embodiment of a device for decoding digital data stored in DNA comprises at least one processor configured for aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
- an embodiment of a system for storing digital information in DNA comprises: an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; an archive to store the plurality of DNA segments; a sequencer sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder: aligns the plurality of
- FIG. 1 illustrates, in block diagram form, a DNA storage system in accordance with the present principles
- Figure 2 illustrates, in state diagram form, a system for DNA synthesis
- Figure 3 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles
- Figure 4 illustrates, in state diagram form, an exemplary embodiment of constraints for synthesizing DNA
- Figure 5 illustrates, in block diagram form, an exemplary embodiment of molecular-level DNA storage
- Figure 6 illustrates, in block diagram form, an exemplary embodiment of an error- correction pipeline for DNA storage of digital data
- Figure 7 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles
- Figure 8 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 9 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 10 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 11 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 12 illustrates, in chart form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 13 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 14 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles
- FIG. 15 illustrates, in block diagram form, an aspect of DNA digital data storage system in accordance with the present principles
- FIG. 16 illustrates, in block diagram form, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 17 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 18A and 18B illustrate, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 19 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 20 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles
- Figure 21 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles.
- Figure 22 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles.
- the present principles are generally directed to storage of digital information in deoxyribonucleic acid (DNA).
- DNA deoxyribonucleic acid
- Figure 1 depicts an embodiment of a system, method or apparatus providing a complete storage cycle for storing and retrieving data using DNA as the storage medium.
- the system includes at least the following components: (1) Source data in bits; (2) Encoding mechanism including all error-correction codes and modulation from bits to nucleotides; (3) DNA synthesis of multiple DNA segments; (4) PCR-amplification of DNA pools; (5) DNA archival storage; (6) DNA sequencing; (7) Merging and assembling multiple DNA segments; (8) Demodulation and decoding of all codes for reliable recovery.
- a system, apparatus and method in accordance with the present principles address a challenge of designing a coded DNA storage system involving ensuring the efficiency and compatibility of the different components which correct diverse errors.
- An end-to-end system for DNA storage includes several components.
- Source information e.g., a movie file
- Source information is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments.
- the DNA segments must be sequenced, demodulated, assembled, and decoded while also addressing errors that may occur.
- Figure 1 depicts the complete digital data storage and recovery system.
- the system involves a digital data management and processing portion 101 including encoding, decoding and error management of digital data and a DNA storage and recovery portion 102 including DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing.
- the system is comprised of the following components as shown in Fig.
- source data 110 in bits in bits
- encoding mechanism 120 including all error- correction codes and modulation from bits to nucleotides
- DNA synthesis 130 of multiple DNA segments PCR-amplification 140 of DNA pools
- DNA archival storage 150 DNA sequencing 160; merging and assembling multiple DNA segments 170; demodulation and decoding 180 of all codes for reliable recovery.
- One challenge of designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
- DNA storage In DNA storage, several types of errors occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines.
- DNA storage channels By modeling the errors of DNA processing technologies, it is possible to define "DNA storage channels", many of which have only approximately known information-theoretic capacities by contrast to standardized, precisely-mapped wireless communication channels. For example, the capacity of the deletion channel is only known to within upper and lower bounds for independent and identically distributed deletion.
- errors such as deletions must be addressed to provide a reliable data storage system, especially when utilizing low-cost, next-generation DNA processes that may be lossy or noisy.
- a DNA process that is lossy or noisy as referenced herein is intended to encompass various DNA synthesis and/or sequencing processes that may be error prone, e.g., produce errors such as deletions, insertions and substitutions as described herein. Such errors may occur as a result of using low cost DNA processing as part of a DNA data storage system.
- a low cost system may include at least one of a lossy or noisy DNA synthesis process and a lossy or noisy DNA sequencing process.
- a lossy or noisy DNA synthesis process may include a low-cost enzymatic synthesis process.
- Such processes use low cost chemicals and, therefore, may advantageously reduce the cost of a DNA data storage system.
- such low cost synthesis may introduce diverse errors at a relatively high rate.
- nucleotide deletions may occur at error rates on the order of 25% per nucleotide.
- high-fidelity DNA synthesis machines utilizing chemical synthesis methods may provide a combined error rate one or two orders of magnitude less, e.g., under 1%, but at a much higher cost.
- Errors may also be introduced during sequencing of DNA oligonucleotides when reading or recovering data, e.g., by using a sequencing process such as nanopore sequencing that may tend to be lossy or noisy.
- oligonucleotide segments are equipped with an address code which is a unique identifier.
- Digital payload information is stored across multiple segments, protected by modern codes organized in both the "horizontal" dimension (per oligonucleotide), and the "vertical" dimension (across multiple oligonucleotides). While accuracy in retrieval is an important consideration, so also is efficiency of such codes to reduce overhead costs in DNA synthesis and sequencing.
- a fundamental constraint of storing information in DNA molecules is the limited length of oligonucleotide segments.
- Current DNA synthesis machines synthesize short segments nucleotide by nucleotide. Each oligonucleotide may be of variable length, depending on the number of synthesis errors incurred.
- a DNA synthesizer produces M oligonucleotide segments of variable lengths.
- the fixed-length input x m and variable-length output w m of a synthesis machine are denoted as follows for the m th oligonucleotide where
- Figure 2 illustrates a Markov chain mathematical model which includes various types of errors. Insertion, deletion, substitution, and burst errors have associated probabilities denoted by respectively.
- the proof is available in Appendix I, and may be adapted to several variations of the Markov chain model.
- the matrix E of efficiencies may be asymmetric.
- a constraint graph may be constructed based on nucleotide efficiencies; i.e., only transitions which have a higher probability of success for biological synthesis are included as valid transitions.
- a constraint graph is a directed graph.
- An irreducible constraint graph contains a path between any ordered pair of nodes.
- Example 3 (Constraint Graph): Figure 4 depicts a constraint graph which outlines valid and invalid transitions between nucleotides.
- the adjacency matrix of the constraint graph is
- edge transitions (A, A), (T, T), (C, C), (G, G) are not included in the graph.
- edge transitions (A, G), (T, G), (G, T), (C, T) are invalid transitions.
- nucleotides Beyond constraints on transitions between nucleotides, constraints on longer sequences of nucleotides are often necessitated to ensure proper PCR-amplification and DNA sequencing. For example, the exclusion of reverse complementary subsequences might be required, and/or the avoidance of specific nucleotide patterns reserved for primers. To encourage the correct binding of primers, a balanced distribution of nucleotides might also be enforced.
- a modulation code maps sequences of bits to valid sequences of nucleotides.
- a demodulation function maps sequences of nucleotides to sequences of bits. The demodulation function may correct nucleotide errors partially if they occur, and/or defer error-correction to other decoding units. However, in either case, the demodulation function should not drastically affect the synchronization of the sequences.
- the modulation function ⁇ £ maps a sequence of k mod bits to a sequence of n mod nucleotides. If nucleotide insertions and deletions occur, the demodulation function may accept a sequence of nucleotides of length.
- demodulation function produces a sequence of bits of length k m ' od where it may be the case that if error correction is deferred.
- the modulation map ⁇ £ writes an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
- ⁇ £ (00111010) ATCGCACA.
- Nucleotide transitions (A, G), (G, T), (C, T), (T, G) are not permitted in the modulation, as indicated by the constraint graph of Figure 4.
- the demodulation function ⁇ t> D maps both A and T to a 0, and both C and G to a 1.
- the mapping functions are optimal- each nucleotide may transition to two other nucleotides which is the maximum allowed by the constraint graph. It is observed that Furthermore, a
- nucleotide insertion, deletion, or substitution error after modulation results in only a single bit of error in the demodulated binary sequence.
- variable-length segments are stored compactly in an ultra-dense DNA solution.
- Figure 5 illustrates a macro-level view of DNA storage.
- the input to the system is
- each short oligonucleotide is re-assembled from reads.
- the modeling of nucleotide errors such as insertions, deletions, substitutions was already included mathematically for DNA synthesis, and is therefore not covered again at this stage.
- a new type of error occurs during sequencing.
- each oligonucleotide is absent from the read output due to low coverage in its DNA solution. The low coverage may be the result of a lack of PCR-amplification for a particular oligonucleotide.
- the mathematical model for sequencing each oligonucleotide is,
- variable-length segments After sequencing, the final output of the storage system is variable-length segments
- the merging of segment copies may be done easily if each segment contains a protected address as a unique identifier.
- the merging is combined with a consensus approach based on edit distances.
- the edit distance between two strings is the minimum number of symbol insertions, deletions, or substitutions needed to transform one of the strings to the other string.
- the edit distance for variable-length segments is computed efficiently via dynamic programming which is essentially optimal in terms of its quadratic complexity in the length of the segments.
- the DNA molecular channel contains several sources of error.
- the inputs are M oligonucleotide segments each of fixed size K nucleotides.
- the capacity for the storage channel is at most 2MK bits, if each nucleotide represents 2 bits. This upper bound assumes no constraints for modulation, and no sources of error. When incorporating individual types of error in the system, the upper bound of 2MK bits may be tightened.
- C ST (M, K) be the storage capacity of the DNA molecular-level storage system described herein.
- the storage system contains M input oligonucleotides each of length K nucleotides.
- the capacity of the DNA storage system is upper-bounded as follows.
- the adjacency matrix A e corresponds to an irreducible constraint graph specifying valid nucleotide sequences.
- the alphabet ⁇ ⁇ A, T, C, G ⁇ .
- the upper bound due to the permutation channel in Eqn. (5) is important to show that both the number of oligonucleotide segments M and the length of each oligonucleotide K must grow in order to increase storage capacity adequately.
- the upper bound is computed as follows for a few pairs (M, K) .
- Oligonucleotide segments must not be too short, despite biological constraints due to low- cost DNA synthesis.
- the length K must grow at least as log 2 M which is the number of bits to uniquely address each segment. As expected, the capacity is upper bounded by a rate of growth linear in K.
- the upper bound relaxations provided in Theorem 2 are not achievable in most cases.
- the capacity of the deletion channel is not known exactly. Therefore, the upper bound in Eqn. (6) is not attainable.
- Figure 6 illustrates a pipeline for error-correction and processing comprised of the following components: (1) Modulation and demodulation between bits and nucleotides; (2) Encoding and decoding for synchronizing sequences of bits; (3) Error-correction across each oligonucleotide; (4) Error-correction across multiple oligonucleotides.
- the information stream is in bits.
- the modulation map converts the stream of bits to a stream of nucleotides obeying the constraint graph described above.
- Example 4 defined a modulation map in which 1 bit mapped to 1 nucleotide.
- the input to the synchronization encoding block in Figure 6 is a stream of bits for each oligonucleotide.
- a stream of input bits may be grouped into symbols.
- 2 bits corresponds to 1 symbol.
- a mechanism to achieve synchronization is to insert a marker symbol in the input stream. Markers delineate boundaries to detect insertions and deletions. For every k sync information symbols, (n sync — k sync ) marker symbols may be inserted, resulting in an overhead of Thus, an exemplary embodiment of a synchronization
- the input symbol stream is modulated into nucleotides, transmitted over the DNA channel, and demodulated back into bits which are then grouped back into symbols.
- a random output symbol stream denote a random output symbol stream as with a
- a synchronization decoder A v may feed the raw probabilities to an outer code, or directly estimate
- the decoder A D is able to efficiently compute posterior probabilities via dynamic programming.
- Appendix III details the / ⁇ recursions involved based on the Markov chain model discussed above.
- error-correction may be applied per oligonucleotide.
- a deletion error causes a synchronization error, but even if the position of the deletion is known via synchronization, correction of the deleted symbol may be needed.
- An example of a short block code is the Bose-Chaudhuri-Hocquenghem (BCH) code.
- BCH Bose-Chaudhuri-Hocquenghem
- Primitive BCH codes over the Galois field JF 2 are a standard class of BCH codes.
- a positive integer m > 3 a t-error-correcting code has parameters
- d min is the minimum distance of the linear code.
- m 6
- t 4.
- BCH codes of longer lengths exist: BCH(127,78,7), BCH(127,71,9), BCH(127,64,10), BCH(127,57,11), BCH(127,50,13).
- BCH codes are applicable for DNA storage due to their short block lengths and error-correcting abilities per oligonucleotide. For a t- error-correcting BCH code with storage overhead the code corrects a fraction of errors.
- error-correction across multiple oligonucleotide segments provides protection against missing segments, and segments that have an individual decoding error.
- a basic kind of error-correction code is a Reed-Solomon (RS) code.
- An code has a minimum Hamming distance of n rs — k rs + 1.
- the storage overhead ratio is The code is able to correct up to U errors with unknown locations and E errors with known locations, where 2U + E ⁇ n— k.
- a Reed-Solomon code operates over a Galois field where m is a positive integer.
- n rs 2 m — 1.
- m 8
- a fraction ⁇ of the oligonucleotide segments have low-coverage during PCR-amplification. Thus, these segments are missing for decoding and reconstruction. Furthermore, despite synchronization and error-correction within each oligonucleotide, a fraction of oligonucleotide segments may be corrupted beyond repair. For these dual purposes, an error-correction code across blocks of oligonucleotides is crucial.
- An RS (n rs , k rs ) block is comprised of n rs oligonucleotide segments of which k rs segments hold information, and (n rs — k rs ) segments hold redundant parity information.
- each oligonucleotide may be allocated address bits in addition to information bits.
- the address indicates not only the RS -block for each oligonucleotide, but also the specific position of the segment within the RS-block. Only correctly decoded oligonucleotide segments labeled by addresses are included for RS-block decoding.
- Each oligonucleotide is equipped with an address since the DNA channel is partially described by a permutation channel as described herein. Thus, an overhead is incurred for DNA storage of where k addr bits per oligonucleotide are reserved for
- the end-to-end coding system aims to achieve zero error (perfect reliability) in order to extract data from DNA.
- an amount of overhead is selected in order for the overall system to have a very low probability of error in reconstruction.
- Each block of Figure 6 has an associated storage overhead.
- the total storage overhead v incurred by the system is given by,
- the overall storage ratio includes overhead for the addressing of each oligonucleotide error-correction across multiple oligonucleotide segments error- correction within each oligonucleotide synchronization within each oligonucleotide
- the overhead ratio for modulation and demodulation includes a factor of
- Example 5 (Synchronization Overhead vs. Reliability): To illustrate the effect of synchronization on decoding error within each oligonucleotide, the following experiment was simulated based on all code blocks of Figure 6, excluding error-correction across multiple oligonucleotides.
- each oligonucleotide is able to store 50 bits comprised of both address and information bits. An extra fixed bit was added to the end of the stream to allow for an even length of 128 bits.
- a varying amount of overhead was selected. After every k sync symbols in which each symbol is set equivalent to 2 bits, a marker symbol was inserted. For example, for a synchronization overhead of
- lengths of the symbol streams were 128,95,85,79 symbols respectively.
- Each symbol of k-mod— 2 bits was modulated onto mod — 1 nucleotide, i.e., unconstrained modulation with .
- mod — 1 nucleotide i.e., unconstrained modulation with .
- the oligonucleotide lengths were 128,95,85,79 nucleotides
- Figure 7 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of deletions. An increasing fraction of synchronization markers increases the probability of correct decoding at the expense of extra redundancy.
- the results of decoding individual oligonucleotides is plotted in Figure 7 for different probabilities of deletion p del in the DNA channel. Other types of error were excluded in the simulation.
- Each data-point in the curves is the result of decoding 10 4 oligonucleotides.
- the synchronization code with the highest overhead yields the best performance curve at the cost of synthesizing a longer oligonucleotide.
- each oligonucleotide is able to hold 50 bits of combined address and payload data.
- An extra fixed bit was added to each bit stream to allow for an even length of 128 bits.
- the synchronization code was selected with so that the total number of symbols was
- Each symbol was mapped to a nucleotide with unconstrained modulation yielding oligonucleotide segments of length 95 nucleotides.
- Figure 8 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of insertions, deletions, and substitutions.
- the plot shown in Figure 8 illustrates the decoding success probability under varying probabilities for across the DNA channel. A combined presence of diverse
- deletions and insertions errors are costly to accommodate in terms of storage overhead.
- the addressing overhead is costly if a large number of oligonucleotide segments is necessary.
- the following parameters would enable a terabyte storage system for the DNA storage channel. Assuming unconstrained modulation in which a BCH (127,50) error-
- each oligonucleotide is of length 95 nucleotides, and stores 8 bits ( 1 byte) of payload information.
- the storage overhead is costly due primarily in terms of the address space overhead of
- Each oligonucleotide is able to store 16 bits (2 bytes) of information which allows 2 34 addresses possible.
- the parameters for both exemplary terabyte and gigabyte DNA storage systems were selected to achieve high reliability in the presence of diverse errors in the DNA storage channel.
- the exact probability of perfect reconstruction may be estimated based on decoding curves for decoding individual oligonucleotide segments, as illustrated in Example 5 and Example 6. For example, for a rate one-half Reed-Solomon code per block of oligonucleotide segments, if more than one -half of the segments are not missing and decoded correctly, the code recovers the entire block with zero error.
- an embodiment comprising coded multiple sequence alignment substantially reduces the overhead and/or corrects for hyper-lossy DNA channels.
- Enzymatic synthesis of an oligonucleotide occurs incrementally with the addition of each nucleotide.
- the probability of deletion p del as defined in the Markov chain model described above may be greater than 5%, i.e., p del ⁇ 0.05.
- the biochemical process of synthesis produces thousands of parallel versions of the same oligonucleotide. Each parallel version provides biological diversity. The diverse versions may be corrupted by errors in different locations along the strand. In accordance with an aspect of the present principles, an embodiment harnesses this diversity as an aspect of error-correction. It will be assumed that the errors in multiple versions are statistically independent and identically distributed, although in practice the errors may be correlated.
- Table 2 provides an example of multiple versions of a 16-nt oligonucleotide after enzymatic synthesis, PCR-amplification, and sequencing.
- the sequencing is conducted by reading several thousand times.
- Each parallel version accumulates a count of reads proportional to its prevalence relative to other versions synthesized.
- the table lists the versions ranked in order of the highest number of reads. Versions corrupted by rare types of errors do not rank highly in terms of the number of reads. Thus, filtering and sorting versions according to the number of reads improves decoding. However, as shown in Table 2, the versions with the highest number of reads do not necessarily have the lowest number of deletions or errors.
- Multiple sequence alignment may be applied across the top-ranked candidate versions.
- Aligning protein and nucleotide sequences is a classic problem in bioinformatics. For aligning more than two sequences, the original problem is NP-hard in time complexity.
- well-known techniques include the Needleman-Wunsch algorithm for global alignment, and the Smith- Waterman algorithm for local alignment. Both algorithms apply dynamic programming to compare sequences and produce optimal local or globalmatchings. Given two sequences of lengths / and P, each algorithm has a time complexity of 0 J X P) .
- a straightforward generalization of the recursive algorithms to K > 2 sequences results in a prohibitive time complexity of 0(J a ) where / is the maximum length of all sequences.
- synchronization codes described above may be extended to incorporate multiple versions of an oligonucleotide.
- Each oligonucleotide is coded with synchronization symbols.
- multiple versions are produced with different types of errors but each version has an identical synchronization pattern.
- Example 7 Dynamic Programming— Two Versions To simulate improvements in decoding due to alignment with two versions, the parameters of the DNA storage system of Example 6 are replicated: ⁇ addition, only deletion
- coded MSA with two parallel versions improves the probability of correct decoding of an oligonucleotide by an order of magnitude.
- the diversity in two parallel versions improves synchronization.
- an embodiment may comprise coded alignment of multiple sequences as described herein.
- an embodiment may comprise consensus via product rule. For multiple short DNA sequences, decoding via dynamic programming is feasible and yields the optimal per-symbol decoding rule. However, in general, aligning ⁇ > 2 sequences of moderate lengths requires efficient consensus methods. Assume that ⁇ versions of an oligonucleotide are sequenced, where ⁇ is an even integer. Denote all paired versions as
- Equation (10) In order to combine the probabilities across all pairs, assuming that each symbol transmitted is equally likely, one method is to apply the product rule over all pairs,
- the product rule is well-known to those skilled in the art.
- the assumption is that each version of an oligonucleotide is independently synthesized in parallel, similar to the case of combining classifier outputs based onindependent measurements.
- Example 8 Coded MSA With 10 Versions
- probability p del is varied. As shown in Figure 10, coded MSA with ten parallel versions improves the probability of correct decoding of an oligonucleotide by several orders of magnitude. The diversity in pairs of sequences is exploited via dynamic programming algorithms. The product rule of Equation 11 is applied for consensus over five pairs of sequences.
- an embodiment may comprise consensus via majority v.oting
- Theorem 3 shows that as long as the individual estimates based on each pair of versions are correct more than half the time, then a majority consensus over all pairs boosts the probability of correct decoding rapidly. Conversely, if the individual estimates are too noisy, then consensus via majority voting will not help improve decoding.
- Theorem 4 is valid for any P de i, it is an asymptotic bound in the length of the binary string.
- the constant C may be large depending on p ⁇ e an d the theorem does not provide a characterization for reconstruction from a fixed number of samples.
- the number of diverse sequences per oligonucleotide might not equal a sub -exponential amount.
- coded alignment allows for recovery from fewer total samples. It is also possible that existing algorithms for trace reconstruction could be modified to include synchronization markers and error-correction codes. The exact tradeoff of the amount of redundancy vs. accuracy in decoding is an open question.
- Figure 12 diagrams a DNA storage experiment with four oligonucleotide templates denoted by H2, H3, H4, H5.
- Each oligonucleotide stores 16 bits comprising 14 bits of data and 2 bits of address. The 16 bits are modulated onto 16 nucleotides.
- all nucleotide sequences include a synchronization pattern C, T, A, C. Aside from addressing, synchronization, and simple modulation, there is no extra redundancy added for error- correction. The total overhead redundancy is or slightly more than twice the
- the dotted-dashed line of each bar plot indicates a mixture distribution 0.2Q x (x) + 0.8P x (x) where Q x (x) is a uniform distribution over short oligo lengths, and P x (x) is a binomial distribution in which only deletions are modeled for the particular Eureka experiment.
- the distribution Qx(x) represents the possibility of early termination of synthesis which results in short oligo lengths, whereas the binomial distribution P x (x) is the expected distribution if expecting only deletion errors.
- the H4 data is the most noisy.
- the statistics of the read distribution reveal that the synthesis as well as the subsequent amplification and sequencing are hyper-lossy, and that the binomial models the tail of the data distribution.
- the synthesis process is able to produce many longer lengths before the target length which is difficult to obtain.
- the synthesized versions of each template H2-H5 contain deletions, but such versions may be utilized for decoding using diversity. If given enough synthesized versions of adequate lengths, the original template may be decoded with high probability.
- nucleotides in length in length.
- a few versions may be synthesized perfectly if enough reads are performed during sequencing.
- synthesizing a perfect oligonucleotide has a very low probability of occurrence which becomes even less as K increases. As the input length K increases, it is more and more likely to read versions which have a few deletions, and to read versions which are diverse.
- the diversity in the types of versions may be harnessed for decoding the original template correctly.
- decoding the H4 template is analyzed. Based on the original data set of 107034 sequenced reads, multiple bootstrapped data sets are created depending on a target number of total sequenced reads. The versions which have high read counts also have a high probability of appearing in bootstrapped data sets. For a particular bootstrapped data sample for the H4 template, Table 3 lists the top 10 filtered versions. Dynamic programming is utilized to decode the top 5 versions and the bottom 5 versions separately, followed by a product rulefor consensus. Such an algorithm is successful for the particular example. In general, the probability of correctly decoding each of the H2-H5 templates for a different number of target reads is plotted in Figure 14. The overall decoding illustrates that a decreased number of total reads is necessary to decode the original templates since synchronization with diversity and consensus is able to compensate for hyper-lossy conditions.
- Figure 12 illustrates a nucleotide template for the modulation used for the oligos of shown in Figure 12.
- Figure 18A there are 16 blocks representing 16 nucleotides of oligos produced according to the scheme.
- the four blocks labeled C, A, T, C are encoded as these types of nucleotides to provide a synchronization pattern as described herein.
- the eight unshaded and unlabeled blocks correspond to nucleotides encoding trits rather than bits. That is, 12 bits are encoded into 8 nucleotides or approximately 1.5 bits per nucleotide, i.e., one trit per nucleotide.
- the four shaded blocks, i.e., blocks 8, 11, 14 and 16, correspond to nucleotides that will encode one bit per nucleotide. Therefore, the 12 nucleotides other than the synchronization nucleotides encode a total of 16 bits., i.e., 12 bits encoded using trits into 8 nucleotides plus 4 bits encoded into 4 nucleotides.
- each nucleotide could encode two bits because there are four possible types of nucleotides, i.e., A, T, C, G, which might appear to enable encoding 24 bits into 12 nucleotides, not all transitions are permitted as explained previously in regard to constraints associated with the modulation scheme of Figure 4. Similar constraints apply to the modulation scheme of Figure 12. For example, two adjacent nucleotides of the same type are not permitted.
- Figure 18B shows the modulation mapping to produce the oligo H4 shown in Figure 12 and based on the overall approach explained above in regard to Figure 18A. Details of the mapping shown in Figure 18B may be better understood by referring to the tables shown in Figures 19, 20 and 21.
- the table on the left side of Figure 19 shows a mapping of bits to trits used to encode 12 bits into the eight nucleotides encoding trits.
- the table on the right side shows the inverse mapping, i.e., trits to bits, that may be used for decoding.
- the table in Figure 20 shows the mapping of bits to nucleotides for the four nucleotides encoding bits.
- Figure 20 defines the current nucleotide to write given the previous nucleotide (left most column in Figure 20), the next nucleotide (second column from the left), and the current value of the bit to be encoded, i.e., 0 or 1 (two columns on the right side of the table).
- the table in Figure 20 takes into account the transitions forbidden by the constraints.
- Figure 21 illustrates the encoding of trits to nucleotides. That is, using the mapping of bits to trits shown in Figure 19, the resulting trits are then encoded into nucleotides using the table of Figure 21.
- the mapping associated with the table shown in Figure 21 takes into account the transitions forbidden by the constraints.
- the nucleotide to be written depends on both the trit to be encoded and the previous nucleotide as shown in Figure 21.
- Figure 22 shows another exemplary embodiment of modulation mapping using a striped synchronization pattern and 24 nucleotides total of which 7 nucleotides form a synchronization pattern (e.g., C, T, A, C, T, A, C in Figure 22), 13 nucleotides encode bits, and 4 nucleotides encode trits (6 bits).
- Figure 15 illustrates a high-level block diagram of another embodiment of the overall encoding and decoding process, apparatus or system for storage of data in DNA and retrieval of that data.
- the system illustrated in Figure 15 includes certain functions that are similar to those illustrated in Figure 1. That is, certain blocks of Figure 15 correspond to blocks of Figure 1. Exceptions are explained below. Those that correspond have like reference designators.
- Figure 15 depicts a complete digital data storage and recovery system.
- a digital data management and processing portion 101 indicated in phantom includes encoding, decoding and error management of digital data
- a DNA storage and recovery portion 102 also indicated in phantom includes DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing.
- the system of Figure 15 is comprised of the following components.
- Block 110 indicates digital data or information existing in digital data bits both for input data to be stored and output data recovered from storage.
- Encoding mechanism 120 encodes the input information in bits including all error-correction codes and modulation from bits provided by block 110 to oligonucleotides referred to as oligos in Figure 15.
- a DNA synthesis function 130 synthesizes or writes the oligos encoded at 120 to multiple DNA segments.
- DNA synthesis 130 is followed by PCR amplification of the DNA segments 140 and storage of the amplified DNA segments 150.
- Recovery or retrieval of the data in the embodiment of Figure 1 begins by DNA sequencing of the amplified DNA strands at 160.
- the embodiment of Figure 15 differs from that of Figure 1 in that synthesis of the DNA strands 130, i.e., writing the encoded oligo information to DNA molecules, is followed by storage of the synthesized strands at 1550. Then, recovery of data from the DNA stored at 1550 begins by selecting the DNA to be recovered and PCR amplification of the selected DNA at 1540. Storage immediately following synthesis and PCR amplification of only DNA selected from storage as in Figure 15 is referred to as random-access amplification.
- DNA sequencing occurs in Figure 15 at 160 as in Figure 1 and is followed by clustering and filtering of the DNA reads at 170 to produce recovered oligos (corresponds to merging at 170 in Figure 1).
- the recovered oligos produced at 170 are decoded at 180 in Figure 15 (and in Figure 1) to produce recovered data bits or payload information at 110 in both Figure 15 and Figure 1.
- Figure 16 shows a block diagram of another embodiment of an overall encoding and decoding process, apparatus or system in accordance with the present principles and provides additional information regarding each step of the process or component of the system.
- Input data also referred to herein as information or payload
- Input data 1605 is first processed in an encoder 1601 that encodes the input digital bits to oligos. Encoding 1601 begins at 1610 with addition of address bits to the digital data words or rows.
- the address bits enable identification of the sequence of words when decoded.
- the result of adding address bits is shown at 1612 where two address bits are prepended to each digital input or payload row or word. Although shown Figure 16 as prepended to the digital words, it will be apparent to one skilled in the art that the added bits may be added at other locations in the digital words.
- an additional row or word is added to provide block error correction capability, e.g., Reed-Solomon or Fountain, as illustrated at 1617. Adding a row enables recovery of a missing row if two of three rows are recovered.
- the data from 1615 is processed at 1620 to add bits to each row to implement error correction within one word or row, e.g., BCH or LDPC error correction.
- the added bits in each row are illustrated at 1622 as three bits appended to each row. Although shown in Figure 16 as appended to the digital words, it will be apparent to one skilled in the art that the added bits may be added at other locations in the digital words.
- the rows of bits ouput from 1620 are mapped to nucleotides at 1625 according to a modulation map such as that described above in Figure 4 or that shown in Figure 12. In addition to modulation, processing at 1625 adds nucleotides beyond those produced by the modulation to provide for synchronization as described herein.
- each row provides an oligo code to be input to an enzymatic DNA synthesis process for synthesis or writing of a DNA strand for the oligo code.
- the rows shown at 1627 illustrate the synchronization feature added at 1625.
- each row or oligo code includes C, A and T nucleotides at the fifth, eighth, and eleventh positions. This pattern of CAT nucleotides at specific positions in each row enables implementation of synchronizing or alignment as described herein of the oligos that are recovered or retrieved from the pool of stored DNA strands. Other synchronization patterns may be used as described herein.
- decoding occurs in decoder 1602.
- DNA oligos to be decoded are illustrated at 1652.
- the oligos available for decoding at 1652 are produced by selecting and retrieving DNA strands from the DNA pool or solution, PCR amplifying the selected DNA (e.g., random access amplification as described in regard to 1540 of Figure 15), and sequencing the amplified DNA.
- the result from sequencing that is illustrated at 1652 indicates that a plurality of versions of the first oligo and of the third oligo (e.g., ten of each as indicated by "10X" at 1652) have been selected from the amplified material.
- the second oligo was not available or is missing, e.g., as a result of the lossy characteristics of the low-cost enzymatic DNA synthesis process.
- the available oligos are processed at 1660 applying coded alignment, consensus and diversity techniques as described herein to the plurality of versions of each of the first and third oligos.
- the processing produces one version of each of the first and third oligos at 1662 that statistically are the closest match to the correct oligos. However, typically errors remain within each oligo.
- the third, ninth and twelfth (final) nucleotides may be erroneous.
- the second oligo continues to be missing, i.e., there is no information regarding the second oligo at 1662.
- Demodulation occurs at 1665 using the synchronization information provided during encoding and by applying the inverse of the modulation mapping used during encoding at 1625. That is, 1665 maps the nucleotides of the oligos at 1662 to data bits creating data words or rows as illustrated at 1667. Due to the errors remaining in the rows or oligos available at 1662, corresponding errors exist in the words or rows of data bits produced by demodulation.
- the data bits of the binary words produced during demodulation corresponding to the erroneous nucleotides will also be erroneous.
- These errors within the binary rows or words at 1667 are corrected at 1670 by applying the oligo error correction feature implemented during encoding at 1620, e.g., BCH/LDPC codes.
- the error corrected binary rows produced by 1670 are illustrated at 1672. That is, the first and third rows at 1672 are correct or bit accurate. However, the second row remains missing at 1672.
- the missing second row at 1672 is recovered at 1675 by using the block error correction information of the third row added during encoding at 1615, e.g., RS/Fountain coding.
- the result of 1675 is two bit accurate rows of digital information or data illustrated at 1677.
- Each row of binary data at 1677 includes the addressing information added to the payload data during encoding at 1610.
- the payload data is extracted at 1680 to produce the two output binary words at 1682 that are bit accurate reproductions of the original input data at 1605.
- Figure 17 shows charts illustrating the multiple diverse versions of an oligo that may be produced by a lossy DNA synthesis procedure such as enzymatic synthesis.
- the top of Figure 17 illustrates a correct intended version of an oligonucleotide designated "H4" that might be produced by encoder 1601 of Figure 16 as described above.
- Nucleotides inserted into H4 as synchronization markers during modulation are illustrated in the upper chart of Figure 17 by the highlighted sixth, ninth, twelfth and fifteenth nucleotides in H4.
- the chart in the lower portion of Figure 17 illustrates ten versions of oligo H4 that may be available in the DNA pool or solution following enzymatic synthesis and PCR amplification as described herein (e.g., following 130, 1550 and 1540 in Figure 15).
- the ten versions illustrated in Figure 17 corrrespond to the ten versions listed in Table III above.
- the versions further illustrate various errors that may be produced by the lossy enzymatic synthesis process, e.g., nucleotide deletions and/or replacements and/or insertions. That is, one or more versions may be shorter or longer than H4 due to deletions or insertions; one or more versions may have incorrect nucleotides in certain positions due to replacments.
- a diverse group of erroneous oligo versions as illustrated in Figure 17 may be processed in accordance with the present principles using coded multiple sequence alignment and consensus techniques as described herein, e.g., the description above regarding Table III, to recover a sufficiently accurate version of H4 that may be used to recover the original payload data.
- an embodiment comprises a method for correcting hyper-lossy DNA storage systems as described herein.
- An aspect of the method comprises harnessing the diversity of enzymatic synthesis to assemble each oligonucleotide.
- the exemplary DNA storage method is robust to the various types of errors introduced by low-cost biological processes.
- an embodiment comprises: (1) coded multiple sequence alignment; (2) harnessing diversity in sequences; (3) harnessing consensus and combining classifiers; (4) harnessing error- correction codes; (5) synchronization codes; (6) specialized modulation from bits to nucleotides; (7) dynamic programming methods for efficient synchronization.
- an embodiment decreases the number of reads necessary for reliable decoding, and allows for enzymatic synthesis to write longer oligonucleotide segments.
- the present principles enable use of low-cost, lossy or error-prone DNA systems for reliable data storage enabling use of such systems for various data storage applications in any system processing and storing data to improve the operation and data storage capabilities of such systems.
- coded alignment as described herein provides one way of coding to at least partially address use of a losy or error-prone DNA system, e.g., low cost, enzymatic synthesis and/or nanopore sequencing.
- replication or redundancy of nucleotides may be used to address high error rates that persist.
- Duplication can be detected in stored oligos similar to the detection of synchronization markers described herein.
- Duplication may be combined with consensus techniques as described herein to mitigate lossy or error-prone processing, e.g., synthesis such as low-cost enzymatic synthesis and/or sequencing such as nanopore sequencing.
- random variable T the number of nucleotides written starting from state s k and ending at state s k+1 in the Markov chain.
- random variable U the number of nucleotides written starting from either the WRITE state or the WRITE ERROR state and ending at the state s k+1 .
- the random variable U has a geometric distribution:
- the moments of T may be generated from the function G T ( ⁇ ). For example, the mean and variance of the number of nucleotides written starting at state s k and ending at state s k+1 are
- permutations destroy ordering of oligonucleotide segments.
- the total capacity in bits of the permutation channel follows directly, and is as claimed in Eqn. (5).
- the conditional distribution of the quaternary channel is as follows. If a nucleotide is transmitted, then with probability 1— p sub , the correct nucleotide is received.
- the Markov chain as explained above may be approximated by "unwrapping" it.
- the assumed modulation is unconstrained with 2 bits mapping to 1 nucleotide.
- a symbol is assumed to be 2 bits. Insertions, deletions, or substitutions of nucleotides correspond to insertions, deletions, or substitutions of symbols.
- the / ⁇ probabilities may be computed recursively. Denote by the probability of a uniformly random symbol. In the case of symbols of 2 bits, Furthermore, define the following function involving the
- Equation (14) the forward probabilities defined in Equation (14) may be determined recursively as follows,
- the posterior probabilities may be computed based on the ⁇ / ⁇ recursions.
- a simple estimator is able to decode each input 7 ⁇ by selecting the symbol with the highest posterior probability.
- an outer code could utilize the full sequence of raw posterior probabilities to jointly decode the input symbols.
- the probability represents a deletion event in the S sequence and a deletion event in the V sequence.
- the ⁇ 2 probability represents a deletion event in the S sequence and a transmission event in the V sequence.
- the ⁇ 3 probability represents a transmission event in the S sequence and a deletion event in the V sequence.
- the ⁇ probability represents transmission events in both sequences.
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Genetics & Genomics (AREA)
- Probability & Statistics with Applications (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
A system to store digital information in DNA using a lossy DNA process involves encoding (1601) synchronization and addressing information with the digital information (1605) into DNA oligonucleotides (1627) and decoding (1602) the digital information from multiple oligonucleotide segments using coded alignment based on the synchronization information.
Description
APPARATUS, METHOD AND SYSTEM FOR
DIGITAL INFORMATION STORAGE IN DEOXYRIBONUCLEIC ACID (DNA)
TECHNICAL FIELD
The present principles relate generally to digital information storage in deoxyribonucleic acid (DNA).
BACKGROUND
Deoxyribonucleic acid (DNA) contains the genetic program for the biological development of life. However, DNA may be used as a compact storage medium for petabytes of information by encoding digital information into sequences of DNA nucleotides. The potential benefits of DNA storage include: (1) Extremely high-density storage beyond the order of terabytes within 1 gram of DNA; (2) Decades of stability and durability at moderate temperatures; (3) Biological replication due to PCR-amplification; (4) Rapid biological search and indexing mechanisms via primers; (5) Biological editing and re-encoding of segments via enzymes.
An end-to-end system for DNA storage comprises at least several components. Source information (e.g., a movie file) is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments. To access and reconstruct the source data reliably, the DNA segments must be sequenced, demodulated, assembled, and decoded while addressing errors that may occur.
One issue associated with storing data reliably in DNA molecules is a high cost of typical approaches that produce low error rates, i.e., substantially lossless synthesis. Relatively low-cost DNA processes exist such as enzymatic synthesis utilizing low-cost chemicals for growing DNA oligonucleotides of lengths under 103 base pairs. However, such low-cost approaches may be lossy or error-prone and produce a large fraction of errors such as insertion, deletion, or substitution errors.
SUMMARY
These and other drawbacks and disadvantages of the prior art are addressed by the present principles, which are generally directed to digital information storage in deoxyribonucleic acid (DNA).
In accordance with an aspect of the present principles, an embodiment of a sytem for storing digital data in DNA using a DNA process that may be error prone, e.g., comprising at least one of a lossy or noise DNA synthesis process and a lossy or noisy DNA sequencing process includes decoding data encoded into DNA oligonucleotides using coded multiple sequence alignment wherein a decoder includes a block which processes one or more versions of each DNA oligonucleotide using synchronization information included in the oligonucleotides, e.g., synchronization markers corresponding to nucleotides inserted into the oligonucleotides during encoding of the digital data. By coding (e.g., synchronization may be striped synchronization markers) the original synthesized DNA oligo, it is possible to align one or more corrupted versions of the original oligo, using the synchronization markers. The synchronization marker may be a simple code or more complex synchronization codes. The coded alignment may be based on dynamic programming algorithms to efficiently compute the alignment. Coded alignment may also involve multiple versions of an oligonucleotide.
In accordance with another aspect of the present principles, an embodiment of a sytem for storing digital data in DNA comprises harnessing diversity of multiple synthesized sequences. For example, two or more versions of an oligo may be used for alignment. Diversity may provide various advantages including boosting the performance of coded alignment (dynamic programming solution) and enabling use of consensus algorithms which correct errors in the oligonucleotides using techniques such as majority voting mechanisms.
In accordance with another aspect of the present principles, an embodiment of a sytem for storing digital data in DNA comprises using consensus, e.g., consensus algorithms. Consensus utilizes diversity (e.g., two or more versions of an oligo) to correct nucleotide errors . If one version of an oligo is corrupted in a certain position, other versions may not be corrupted in the same position. Thus, if synchronization/alignment is obtained, consensus over multiple diverse versions can greatly improve decoding accuracy. Consensus algorithms can scale efficiently to utilize 10, 100, or 1000 versions of an oligo because because relatively simple majority voting style approaches may be applied. In accordance with another aspect of the present principles, an embodiment comprises
consensus utilizing probabilistic information (e.g., similar to votes) from many versions of an oligo to correct errors in nucleotides.
In accordance with another aspect of the present principles, an embodiment of a sytem for storing digital data in DNA comprises error-correction features, e.g., an error correction pipeline. For example, an individual error-correction code (ECC) may be added to each oligo during encoding (oligo ECC). If alignment and consensus do not correct an oligo entirely, oligo error-correction will then correct errors within an oligo. In accordance with another aspect, block error correction may be included by adding oligo s representing block error correction codes during encoding (block ECC). The block error correction feature enables correcting errors across multiple oligos, in case any oligo is missing due to the lossy characteristics of the DNA synthesis process. Alignment and consensus may provide synchronization and partial error-correction in accordance with an aspect of the present principles. However, the oligo ECC and also block ECC over multiple oligos, provide further error-correction of bits.
In accordance with another aspect of the present principles, an embodiment of a sytem for storing digital data in DNA comprises modulation blocks. In accordance with an aspect, an encoder includes a final block wherein bits are converted to nucleotides. Another aspect comprises a modular structure enabling the system to accommodate different modulation schemes. In accordance with another aspect, a decoder comprises demodulation as a first block or function. In accordance with another aspect, the modular structure enables organizing error-correction for DNA storage. For example, traditional error-correction occurs over bits (not nucleotides) so in an embodiment error correction occurs before modulation in the encoder, and after demodulation in the decoder.
In accordance with an aspect of the present principles, one or more embodiments as described herein may provide one or more advantages such as the following. Systems in accordance with the present principles are highly error-tolerant, e.g., capable of correcting for nucleotide errors on the order of 20% or more produced by a lossy or error- prone DNA system, e.g., lossy synthesis and/or sequencing; correcting for nucleotide insertion erros and substitution errors; and correcting for missing oligos. Systems incoroporating coding in accordance with the present principles may reduce the required number of reads of DNA for a lossy DNA synthesis process due at least in part to the error
tolerance of an embodiment in accordance with the present principles. Systems incoroporating coding in accordance with the present principles and, in particular, error tolerant aspects of coding in accordance with the present principles may enable increasing the length of each oligo produced by synthesis, thereby allowing more bits to be encoded per oligo, despite the synthesis process being lossy or hyper-lossy such as may be the case with low cost synthesis approaches such as enzymatic synthesis and/or lossy DNA sequencing approaches during data reads. Thus, embodiments in accordance with the present principles make use of low-cost lossy or error-prone DNA systems including synthesis and/or sequencing processes for reliable storage of large volumes of data.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data comprises adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide using coded alignment.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data including adding synchronization information to the digital data may include the synchronization information comprising a synchronization marker.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data wherein a plurality of codes may include synchronization information, the synchronization information may be further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotides produced by a lossy or error-prone DNA synthesis process during decoding.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data wherein a plurality of codes may include synchronization information, the synchronization information may be further configured to enable a use of diversity including consensus.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data may further include adding error correction information to the digital data after adding address information and before converting the digital data to codes, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes. In accordance with another aspect, the error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed- Solomon code and a Fountain code. In accordance with another aspect, a word error correction code may include one of a BCH code and a LDPC code.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data may comprise adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment. In accordance with another aspect, synchronization information may comprise a synchronization marker. In accordance with another aspect, a plurality of codes including synchronization information may be further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotide produced by the lossy or error-prone DNA synthesis process during the subsequent decoding. In accordance with another aspect, a plurality of codes including synchronization information may be further configured to enable the use of diversity to include a use of consensus. In accordance with another aspect, error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed- Solomon code and a Fountain code. In accordance with another aspect, a word error correction code
may include one of a BCH code and a LDPC code. In accordance with another aspect, converting digital data to a plurality of codes may comprise a mapping of the digital data to nucleotides according to a modulation map.
In accordance with another aspect, converting digital data to a plurality of codes using a modulation map may comprise the modulation map being in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions. In accordance with another aspect, converting digital data to a plurality of codes using a modulation map may comprise the modulation map writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
In accordance with another aspect of the present principles, an embodiment of a method of decoding digital data stored in DNA may comprise aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy or erro- prone DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words. In accordance with another aspect, synchronization information may comprise a synchronization marker. In accordance with another aspect, aligning may further comprise using a diversity among the plurality of oligonucleotides. In accordance with another aspect, aligning using the diversity may further include a use of consensus. In accordance with another aspect, each of the plurality of binary words may include error correction information, and extracting the digital data may include error correction based on the error correction information. In accordance with another aspect, error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed- Solomon code and a Fountain code. In accordance with another aspect, a word error correction code may include one of a BCH code and a LDPC code. In accordance with another aspect, a modulation map may be in accordance with a constraint graph configuring the mapping to decrease a likelihood of
unreliable transitions. In accordance with another aspect, a modulation map may comprise writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
In accordance with another aspect of the present principles, an embodiment of a lossy or error-prone DNA synthesis process for digital data storage may comprise an enzymatic molecular synthesis process.
In accordance with another aspect of the present principles, an embodiment of an encoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
In accordance with another aspect of the present principles, an embodiment of a decoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
In accordance with another aspect of the present principles, an embodiment may comprise a non-transitory computer-readable medium storing computer-executable instructions executable to perform one or more aspects of embodiments of methods as described herein.
In accordance with another aspect of the present principles, an embodiment of a method of decoding digital data stored in DNA molecules may comprise accessing a plurality of DNA molecules storing encoded digital data, wherein the encoded digital data includes a code component comprising an address, synchronization information, and error correction information, and the DNA molecules were synthesized using a lossy or error- prone DNA synthesis process; sequencing the plurality of DNA molecules; merging and assembling the plurality of DNA molecules to form a plurality of DNA oligonucleotides; and decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes applying a coded multiple sequence alignment (MSA) procedure to the plurality of DNA segments using the marker symbol in each of the plurality of segments to align the plurality of segments, processing the aligned plurality of segments using a diversity scheme to extract digital information from the aligned plurality of segments, and performing error correction on the extracted digital information using the error correction component to produce the decoded digital data.
In accordance with another aspect of the present principles, an embodiment of an encoder comprises encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and an archive to store the plurality of DNA segments.
In accordance with another aspect of the present principles, an embodiment comprises a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder aligns the plurality of oligonucleotides using coded alignment based on synchronization information included in the plurality of oligonucleotides; demodulates the aligned plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and extracts the digital data from the binary words.
An aspect of the present principles comprises an embodiment of a system for storage of digital information in DNA including components comprising: (1) Source data in bits; (2) Encoding mechanism including all error-correction codes and modulation from bits to nucleotides; (3) DNA synthesis of multiple DNA segments; (4) PCR-amplification of DNA pools; (5) DNA archival storage; (6) DNA sequencing; (7) Merging and assembling multiple DNA segments; (8) Demodulation and decoding of all codes for reliable recovery.
In accordance with another aspect of the present principles, an embodiment addresses a challenge of designing a coded DNA storage system involving ensuring the efficiency and compatibility of the different components which correct diverse errors. An aspect of the present disclosure involves error-correction being introduced for DNA storage based on low-cost enzymatic molecular synthesis. An embodiment of a modular pipelined system provides encoding and decoding digital information reliably in DNA molecules in the presence of a large fraction of differentiated errors. A source of data is efficiently
encoded, modulated, and stored in a set of oligonucleotide segments. Using DNA sequencing, the information stored on multiple segments is assembled and decoded reliably. A classification of errors is provided for the biological processes of low-cost DNA synthesis, PCR-amplification, and DNA sequencing. Error-correction is designed to recover data corrupted by insertions, deletions, and substitutions of nucleotides. Low-cost synthesis also places constraints on the modulation of bits to nucleotides. A flexible system is presented comprised of error-correction codes (e.g., Reed-Solomon, LDPC, or polar codes), synchronization codes, and constrained modulation codes.
Another aspect involves addressing the problem of noisy synthesis by providing modular error-correction codes. In an embodiment, the encoding step comprises modulation, synchronization, addressing, and error-correction within each oligonucleotide and per block of multiple oligonucleotides. After DNA amplification and sequencing, the decoding step comprises reciprocal measures to assemble source data from multiple segments. Error-correction is designed to recover data corrupted by several types of errors including missing segments. The amount of overhead redundancy envisioned for reliable storage is specified and compared to information-theoretic bounds. To boost the reliability of decoding information in the presence of synchronization errors, for each oligonucleotide synthesized, it is also possible to utilize multiple outputs of an enzymatic synthesizer (e.g., an enzymatic synthesizer which is intended to produce one oligonucleotide may actually produce multiple output strands which have varying amounts of insertions, deletions, and substitutions).
Another aspect relates to an error-correction system suitable for low-cost enzymatic synthesis. Traditional error-correction systems for DNA storage are based on high-fidelity molecular synthesis (e.g., using Agilent micro-array platforms). Such synthesis machines are too costly for commercial purposes, and cannot enable ubiquitous storage. Enzymatic synthesis in accordance with aspects of the present disclosure produces DNA oligonucleotides using low-cost chemicals. The low-cost property allows eventually scaling up to terabytes of synthesized data.
Another aspect of the current disclosure involves addressing an issue with low-cost enzymatic synthesis regarding when writing successive nucleotides (e.g., ATCAGTGAGCTAGC...). Not all transitions have a high probability of success. To avoid
unreliable transitions, an aspect of the present disclosure involves introducing a constraint graph for nucleotide patterns. A modulation scheme implements the constraint graph. The purpose of modulation is to maximize the capacity of information (how many bits per nucleotide stored), and to minimize the probability of insertions/deletions (due to less reliable nucleotide transitions). Modulation in accordance with the present principles avoids less reliable transitions entirely. It is also possible to minimize the usage of the less reliable transitions.
Another aspect of the present principles addresses an issue with low-cost enzymatic synthesis involving multiple oligonucleotide segments produced all of which do not have the same lengths. An aspect of the present disclosure involves using synchronization codes to address distribution of lengths of oligonucleotides, otherwise the storage capacity may be significantly reduced. One type of synchronization code is a marker code. A marker code may comprise known marker symbols which are inserted in the bit stream prior to modulation.
Another aspect of the present disclosure addresses an issue with low-cost enzymatic synthesis wherein synchronization errors (insertion and deletion errors) may be corrected. Once oligonucleotide segments are synchronized, the positions which have been deleted may be corrected. In accordance with the present principles, error-correction within each oligonucleotide is provided to solve this problem. For example, BCH codes are applied for error-correction after the oligonucleotide has been synchronized to the correct length. That is, in accordance with the present principles, first synchronize and then correct. In addition, synchronization provides (posterior) probabilities for the sequence of bits stored in each oligonucleotide. As a result, other codes such as LDPC (low-density parity-check codes), polar codes, short Turbo codes, or convolutional codes could be applied as well.
Another aspect of the present principles addresses an issue with DNA storage involving assembling the original data from multiple, short, sequenced oligonucleotides. In accordance with the present principles, an address for each oligonucleotide is provided for assembly. During encoding, address bits are concatenated with data bits prior to applying error-correction for each oligonucleotide. During decoding, after an oligonucleotide has been synchronized and error-corrected, its address is retrieved and thus its position in assembling all of the original data is known.
Another aspect of the present disclosure involves addressing an issue with DNA storage involving that some oligonucleotides might be corrupted beyond repair, and/or lost in the DNA solution (e.g., if PCR-amplification does not yield enough copies for sequencing). We solve this problem by introducing a code (e.g., Reed-Solomon code) for a block of multiple oligonucleotides. After synchronization of each oligonucleotide, error- correction of each oligonucleotide, and retrieving the address of each oligonucleotide, multiple oligonucleotides in a block are reinforced by extra redundancy in case any of them is lost or corrupted beyond repair.
Another aspect of the present disclosure involves correcting for multiple types of errors that may occur simultaneously with DNA storage based on low-cost enzymatic synthesis. In accordance with the present principles, this problem is addressed using a modular error-correction pipeline. The order of the blocks in the pipeline provides for correcting diverse types of errors. It is possible to jointly design the blocks in the pipeline to improve efficiency, but a modular approach allows to improve each block separately.
Another aspect involves the approach to ordering of encoding and decoding for low-cost enzymatic synthesis. Encoding in accordance with the present principles involves the following in order: (1) Store data bits in multiple oligonucleotides with redundancy per block of oligonucleotides; (2) Add an address per oligonucleotide; (3) Add error-correction per oligonucleotide; (4) Add synchronization per oligonucleotide; (5) Transform bits to nucleotides using modulation scheme. Decoding in accordance with the present principles involves the following in order: (1) Demodulate each oligonucleotide from nucleotides to bits; (2) De- synchronize each oligonucleotide; (3) Error-correction by decoding each oligonucleotide; (4) Retrieve address for each oligonucleotide; (5) If any oligonucleotide in a block is corrupted beyond repair or missing, utilize the redundancy per block of oligonucleotides to reconstruct the original source data.
In accordance with another aspect, an embodiment of an encoder for encoding digital data comprises at least one processor configured at least one processor configured for adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information
is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide using coded alignment.
In accordance with another aspect, an embodiment of a device for encoding digital data comprises at least one processor configured for adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
In accordance with another aspect, an embodiment of a device for decoding digital data stored in DNA comprises at least one processor configured for aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
In accordance with another aspect, an embodiment of a system for storing digital information in DNA comprises: an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; an archive to store the plurality of DNA segments; a sequencer
sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder: aligns the plurality of oligonucleotides using coded alignment based on synchronization information included in the plurality of oligonucleotides; demodulates the aligned plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and extracts the digital data from the binary words.
BRIEF DESCRIPTION OF THE DRAWING
The present principles can be readily understood by considering the detailed description below in conjunction with the accompanying drawings wherein:
Figure 1 illustrates, in block diagram form, a DNA storage system in accordance with the present principles;
Figure 2 illustrates, in state diagram form, a system for DNA synthesis;
Figure 3 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles;
Figure 4 illustrates, in state diagram form, an exemplary embodiment of constraints for synthesizing DNA;
Figure 5 illustrates, in block diagram form, an exemplary embodiment of molecular-level DNA storage;
Figure 6 illustrates, in block diagram form, an exemplary embodiment of an error- correction pipeline for DNA storage of digital data;
Figure 7 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles;
Figure 8 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 9 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 10 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 11 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 12 illustrates, in chart form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 13 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 14 illustrates, in graphical form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 15 illustrates, in block diagram form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 16 illustrates, in block diagram form, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 17 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 18A and 18B illustrate, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 19 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 20 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles;
Figure 21 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles; and
Figure 22 illustrates, in the form of a table, an aspect of DNA digital data storage system in accordance with the present principles.
It should be understood that the drawings are for purposes of illustrating exemplary aspects of the present principles and are not necessarily the only possible configurations for illustrating the present principles. To facilitate understanding, throughout the various figures like reference designators refer to the same or similar features.
DETAILED DESCRIPTION
The present principles are generally directed to storage of digital information in deoxyribonucleic acid (DNA).
While one of ordinary skill in the art will readily contemplate various applications to which the present principles can be applied, the following description will focus on embodiments of the present principles applied to improving systems for processing and storing digital information or data. Processing and storing digital information in accordance with the present principles may be used in various embodiments and devices in accordance with the present principles such as archival, backup or cloud data storage systems, head-end systems such as those operated by a service provider to provide digital content, etc. However, one of ordinary skill in the art will readily contemplate other sytems, devices and applications to which the present principles can be applied, given the teachings of the present principles provided herein, while maintaining the spirit of the present principles. For example, the present principles have potential application for any system or device that has data processing capability and stores data. It is to be appreciated that the preceding description of systems and of devices is merely exemplary and illustrative and not exhaustive.
In accordance with the present principles, Figure 1 depicts an embodiment of a system, method or apparatus providing a complete storage cycle for storing and retrieving data using DNA as the storage medium. As explained in more detail below, the system includes at least the following components: (1) Source data in bits; (2) Encoding mechanism including all error-correction codes and modulation from bits to nucleotides; (3) DNA synthesis of multiple DNA segments; (4) PCR-amplification of DNA pools; (5) DNA archival storage; (6) DNA sequencing; (7) Merging and assembling multiple DNA segments; (8) Demodulation and decoding of all codes for reliable recovery. A system, apparatus and method in accordance with the present principles address a challenge of designing a coded DNA storage system involving ensuring the efficiency and compatibility of the different components which correct diverse errors.
An end-to-end system for DNA storage includes several components. Source information (e.g., a movie file) is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments. To access and reconstruct the source data reliably, the
DNA segments must be sequenced, demodulated, assembled, and decoded while also addressing errors that may occur. Figure 1 depicts the complete digital data storage and recovery system. The system involves a digital data management and processing portion 101 including encoding, decoding and error management of digital data and a DNA storage and recovery portion 102 including DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing. The system is comprised of the following components as shown in Fig. 1: source data 110 in bits; encoding mechanism 120 including all error- correction codes and modulation from bits to nucleotides; DNA synthesis 130 of multiple DNA segments; PCR-amplification 140 of DNA pools; DNA archival storage 150; DNA sequencing 160; merging and assembling multiple DNA segments 170; demodulation and decoding 180 of all codes for reliable recovery. One challenge of designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
In DNA storage, several types of errors occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines. By modeling the errors of DNA processing technologies, it is possible to define "DNA storage channels", many of which have only approximately known information-theoretic capacities by contrast to standardized, precisely-mapped wireless communication channels. For example, the capacity of the deletion channel is only known to within upper and lower bounds for independent and identically distributed deletion. However, errors such as deletions must be addressed to provide a reliable data storage system, especially when utilizing low-cost, next-generation DNA processes that may be lossy or noisy.
A DNA process that is lossy or noisy as referenced herein is intended to encompass various DNA synthesis and/or sequencing processes that may be error prone, e.g., produce errors such as deletions, insertions and substitutions as described herein. Such errors may occur as a result of using low cost DNA processing as part of a DNA data storage system. For example, a low cost system may include at least one of a lossy or noisy DNA synthesis process and a lossy or noisy DNA sequencing process. As a specific example, a lossy or
noisy DNA synthesis process may include a low-cost enzymatic synthesis process. Such processes use low cost chemicals and, therefore, may advantageously reduce the cost of a DNA data storage system. However, such low cost synthesis may introduce diverse errors at a relatively high rate. For example, diverse errors including nucleotide deletions, substitutions and insertions may occur at error rates on the order of 25% per nucleotide. In comparison, high-fidelity DNA synthesis machines utilizing chemical synthesis methods may provide a combined error rate one or two orders of magnitude less, e.g., under 1%, but at a much higher cost. Errors may also be introduced during sequencing of DNA oligonucleotides when reading or recovering data, e.g., by using a sequencing process such as nanopore sequencing that may tend to be lossy or noisy.
A feasible system should provide error resilience and reliable reconstruction from imperfectly synthesized and/or sequenced DNA strands. For example, in accordance with an aspect of the present disclosure, multiple levels of hybrid error-protection may be involved. In the following analysis, oligonucleotide segments are equipped with an address code which is a unique identifier. Digital payload information is stored across multiple segments, protected by modern codes organized in both the "horizontal" dimension (per oligonucleotide), and the "vertical" dimension (across multiple oligonucleotides). While accuracy in retrieval is an important consideration, so also is efficiency of such codes to reduce overhead costs in DNA synthesis and sequencing.
TABLE 1. MODEL PARAMETERS FOR DNA SYNTHESIS
Several biological constraints distinguish the storage of digital information in DNA molecules from storage on traditional data devices: tape drives, disk drives, read-only memory (ROM) cartridges, USB memory, and flash memories.
A fundamental constraint of storing information in DNA molecules is the limited length of oligonucleotide segments. Current DNA synthesis machines synthesize short segments nucleotide by nucleotide. Each oligonucleotide may be of variable length, depending on the number of synthesis errors incurred. To formalize the process, a DNA synthesizer produces M oligonucleotide segments of variable lengths. The fixed-length input xm and variable-length output wm of a synthesis machine are denoted as follows for the mth oligonucleotide where
of K < 1000 base pairs. Low-cost synthesizers may only function adequately over a smaller range of input lengths such as 100 < K < 250 base pairs. While the fixed input length is K, a synthesized oligonucleotide wm is of variable length Lm. The parameters listed in Table 1 specify a statistical model for the synthesis of molecules which is diagrammed in Figure 2. A key element is the presence of memory in the synthesis channel due to deletion errors.
It is possible that low-cost DNA synthesis exhibits insertion, deletion, substitution, and burst nucleotide errors. For example, in the case of an exemplary enzymatic synthesis, deletion errors are prevalent. The notion of memory in a deletion channel differs from memoryless channel models. A deletion occurring at any nucleotide within an oligonucleotide affects the position of all subsequent nucleotides.
Figure 2 illustrates a model for DNA synthesis. In state sk, the DNA synthesis machine aims to write the kth nucleotide of an oligonucleotide where Depending on
the number of insertions, deletions, and burst errors, the output oligonucleotide is of variable length. In more detail, Figure 2 illustrates a Markov chain mathematical model which includes various types of errors. Insertion, deletion, substitution, and burst errors have associated probabilities denoted by respectively. The
following theorem derives the mean and variance of the length of the mth synthesized oligonucleotide.
Theorem 1. (Variable -Length Oligonucleotide Segments):
Consider the Markov chain of Figure 2. Let T be the random number of nucleotides written starting from state sk and ending at state sk+1 for k G [1, ... , K] . Based on the nucleotide error probabilities the probability generating function of
For the mth output oligonucleotide, the expected length is
The proof is available in Appendix I, and may be adapted to several variations of the Markov chain model.
More specifically, consider an input length of K = 100 nucleotides to the synthesis machine. If the deletion probability is pdel = 0.10, then E[Lm] = 90, and VAR[Lm] = 9.
Thus, if a high probability of deletion exists, the output oligonucleotide segments will be of many different sizes. The information in each variable-length segment must be decoded properly. Example 2 (Distribution of Lengths): The distribution of the lengths of synthesized DNA segments is simulated and illustrated in Figure 3. For non-negligible insertion and deletion probabilities, a large fraction of the segments have unequal, variable lengths. Conventional DNA storage systems rely on high-fidelity micro-array synthesizers which produce segments of correct lengths. For example, in an experiment to store 22 mega-bytes of information in DNA, high-cost synthesis yielded a distribution spiked at the intended template length. Low-cost enzymatic synthesis may have a non-negligible deletion probability pdel which, in accordance with an aspect of the present disclosure, may be addressed by synchronization codes to correct for variable lengths. As shown in Figure 3, the distribution of the length of synthesized oligonucleotides varies according to insertion and deletion probabilities. Even for the strands have variable lengths
that, in accordance with an aspect of the present disclosure, may be synchronized.
There exists no qualitative difference between storing information in bits versus nucleotides. However, the limitations of biological processes for molecular- level storage constrain the set of valid nucleotide sequences. Low-cost DNA synthesis, PCR- amplification, and DNA sequencing require information to be stored in valid nucleotide patterns; invalid patterns are highly prone to error. In the case of low-cost DNA synthesis, an efficiency matrix characterizes the accuracy of writing successive nucleotides.
of efficiencies (probabilities) for Element is the probability of
successfully writing a nucleotide j given that a nucleotide i has been written.
The matrix E of efficiencies may be asymmetric. A constraint graph may be constructed based on nucleotide efficiencies; i.e., only transitions which have a higher probability of success for biological synthesis are included as valid transitions.
Definition 2 (Constraint Graph, Adjacency Matrix): A constraint graph is a directed graph. An irreducible constraint graph contains a path between any ordered pair of nodes. Let Ae denote the adjacency matrix of a constraint graph.
Example 3 (Constraint Graph): Figure 4 depicts a constraint graph which outlines valid and invalid transitions between nucleotides. The adjacency matrix of the constraint graph is
Homopolymer sequences of arbitrary lengths are invalid since edge transitions (A, A), (T, T), (C, C), (G, G) are not included in the graph. In addition, edge transitions (A, G), (T, G), (G, T), (C, T) are invalid transitions.
Beyond constraints on transitions between nucleotides, constraints on longer sequences of nucleotides are often necessitated to ensure proper PCR-amplification and DNA sequencing. For example, the exclusion of reverse complementary subsequences might be required, and/or the avoidance of specific nucleotide patterns reserved for primers. To encourage the correct binding of primers, a balanced distribution of nucleotides might also be enforced.
Even if there occur no nucleotide errors as described above, DNA storage requires constraints on input sequences of nucleotides. A modulation code maps sequences of bits to valid sequences of nucleotides. A demodulation function maps sequences of nucleotides to sequences of bits. The demodulation function may correct nucleotide errors partially if they occur, and/or defer error-correction to other decoding units. However, in either case,
the demodulation function should not drastically affect the synchronization of the sequences.
Definition 3 (Modulation and Demodulation Functions): Denote a modulation function Φ£ and a demodulation function as follows, where
The modulation function Φ£ maps a sequence of kmod bits to a sequence of nmod nucleotides. If nucleotide insertions and deletions occur, the demodulation function
may accept a sequence of nucleotides of length The
demodulation function produces a sequence of bits of length km' od where it may be the case that
if error correction is deferred.
Example 4 (Constrained Modulation and Demodulation): Consider the constraint graph of Figure 4. To construct a modulation map, define an auxiliary function
The auxiliary sequence is dx = 1, and for ί > 1,
Adhering to the constraint graph, an efficient modulation map
Similarly, a simple decoding and demodulation function
is given as follows for l≤j≤ km' od,
Regarding Example 4 and Figure 4, the modulation map Φ£ writes an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1. For example, Φ£(00111010) = ATCGCACA. Nucleotide transitions (A, G), (G, T), (C, T), (T, G) are not permitted in the modulation, as indicated by the constraint graph of Figure 4. The demodulation function <t>D maps both A and T to a 0, and both C and G to a 1. The mapping functions are optimal- each nucleotide may transition to two other nucleotides which is the maximum allowed by the constraint graph. It is observed that Furthermore, a
nucleotide insertion, deletion, or substitution error after modulation results in only a single bit of error in the demodulated binary sequence.
After biological synthesis of each oligonucleotide, the variable-length segments are stored compactly in an ultra-dense DNA solution. In the
absence of specially-allocated addresses for each segment, storing DNA in a solution implies the loss of ordering between segments. More precisely, the set of oligonucleotide segments {wm} is reordered arbitrarily via a noiseless permutation channel, yielding where π is a random one-to-one permutation of the original
order.
Figure 5 illustrates a macro-level view of DNA storage. The input to the system is
M oligonucleotide segments {xm} each of fixed length K. Synthesis produces variable- length segments {wm}, and storage in solution yields {w^^}. Prior to sequencing, the stored segments may be PCR-amplified to aid in the read process. The final step is DNA sequencing of all detectable strands.
Whether DNA sequencing occurs via the process of electrophoresis through a gel, via an automated DNA sequencer, or via next-generation microfluidic and nanopore devices, the information in each short oligonucleotide is re-assembled from reads. The modeling of nucleotide errors such as insertions, deletions, substitutions was already included mathematically for DNA synthesis, and is therefore not covered again at this stage. However, a new type of error occurs during sequencing. With a probability y, each
oligonucleotide is absent from the read output due to low coverage in its DNA solution. The low coverage may be the result of a lack of PCR-amplification for a particular oligonucleotide. The mathematical model for sequencing each oligonucleotide is,
After sequencing, the final output of the storage system is variable-length segments
In describing the sequencing, the process of merging copies of oligonucleotides due to PCR-amplification is not explored in further detail. The merging of segment copies may be done easily if each segment contains a protected address as a unique identifier. The merging is combined with a consensus approach based on edit distances. The edit distance between two strings is the minimum number of symbol insertions, deletions, or substitutions needed to transform one of the strings to the other string. The edit distance for variable-length segments is computed efficiently via dynamic programming which is essentially optimal in terms of its quadratic complexity in the length of the segments.
distance D(t, r, V, V) may be defined recursively. In the base case, if min = 0 , D(t, r, = max ( Otherwise, in the recursive case,
As displayed in Figure 5, the DNA molecular channel contains several sources of error. The inputs are M oligonucleotide segments each of fixed size K nucleotides. Thus, the capacity for the storage channel is at most 2MK bits, if each nucleotide represents 2 bits. This upper bound assumes no constraints for modulation, and no sources of error. When incorporating individual types of error in the system, the upper bound of 2MK bits may be tightened.
Theorem 2. (Upper Bounds on Storage Capacity): Let CST(M, K) be the storage capacity of the DNA molecular-level storage system described herein. The storage system contains M input oligonucleotides each of length K nucleotides. The capacity of the DNA storage system is upper-bounded as follows.
The adjacency matrix Ae corresponds to an irreducible constraint graph specifying valid nucleotide sequences. The alphabet∑ = {A, T, C, G}. The function
eigenvalue of a matrix is denoted by The proof is provided in Appendix II.
The upper bound due to the permutation channel in Eqn. (5) is important to show that both the number of oligonucleotide segments M and the length of each oligonucleotide K must grow in order to increase storage capacity adequately. The upper bound is computed as follows for a few pairs (M, K) .
For a constant K, the increase in storage capacity only grows at most as 0(log2 ) .
Oligonucleotide segments must not be too short, despite biological constraints due to low- cost DNA synthesis. To increase capacity at a rate linear in , the length K must grow at least as log2 M which is the number of bits to uniquely address each segment. As expected, the capacity is upper bounded by a rate of growth linear in K.
The upper bound relaxations provided in Theorem 2 are not achievable in most cases. For example, the capacity of the deletion channel is not known exactly. Therefore, the upper bound in Eqn. (6) is not attainable. For binary deletion channels, the capacity bounded as
random codes with block length approaching infinity. Even at finite lengths, the exponential complexity of decoding is impractical. Moreover, less is known for quaternary or z-ary deletion channels applicable for DNA storage. Beyond deletions, it is theoretically possible to analyze multiple errors occurring simultaneously to tighten the bounds jointly.
Measured with respect to upper bounds, practical codes for DNA storage must be efficient even in the case of short block lengths due to short lengths of each oligonucleotide. A key emphasis is handling a larger fraction of diverse errors. Figure 6 illustrates a pipeline for error-correction and processing comprised of the following components: (1) Modulation and demodulation between bits and nucleotides; (2) Encoding and decoding for synchronizing sequences of bits; (3) Error-correction across each oligonucleotide; (4) Error-correction across multiple oligonucleotides.
Prior to the modulation and demodulation blocks in Figure 6, the information stream is in bits. The modulation map converts the stream of bits to a stream of nucleotides
obeying the constraint graph described above. Example 4 defined a modulation map in which 1 bit mapped to 1 nucleotide. To simplify the discussion, assume an optimal unconstrained modulation map in which 2 bits map to 1 nucleotide. The modulation function is defined with parameters kmod = 2andnmod = 1. More precisely, Φ£ (01) =
The modulation map Φ£ is applied multiple times to pairs of bits in an even-length bit sequence; e.g.,
The demodulation function is defined by
10. During the demodulation of a nucleotide sequence to a bit sequence, the presence of an insertion or deletion of a nucleotide results in an insertion or deletion of 2 bits in the demodulated bit stream.
The input to the synchronization encoding block in Figure 6 is a stream of bits for each oligonucleotide. A stream of input bits may be grouped into symbols. For purposes of the present discussion, due to the subsequent unconstrained modulation, assume 2 bits corresponds to 1 symbol. A mechanism to achieve synchronization is to insert a marker symbol in the input stream. Markers delineate boundaries to detect insertions and deletions. For every ksync information symbols, (nsync— ksync) marker symbols may be inserted, resulting in an overhead of Thus, an exemplary embodiment of a synchronization
encoder Λ£ inserts marker symbols.
A synchronization decoder is more complex. Denote a random input symbol stream with markers as with a particular realization t =
For information symbols comprised of 2 bits, any symbol is assumed equally likely, For marker symbols if the correct
oligonucleotide, the input symbol stream is modulated into nucleotides, transmitted over the DNA channel, and demodulated back into bits which are then grouped back into symbols. Denote a random output symbol stream as with a
deletions occur, J≠ I. Despite this misalignment, to achieve synchronization it is possible to compute as the posterior probability of an input symbol for each ί
within range 1 < i < /. A synchronization decoder Av may feed the raw probabilities to an outer code, or directly estimate
The decoder AD is able to efficiently compute posterior probabilities via dynamic programming. Appendix III details the /β recursions involved based on the Markov chain model discussed above.
In addition to the synchronization achieved by the synchronization blocks in Figure 6, error-correction may be applied per oligonucleotide. For example, a deletion error causes a synchronization error, but even if the position of the deletion is known via synchronization, correction of the deleted symbol may be needed. An example of a short block code is the Bose-Chaudhuri-Hocquenghem (BCH) code. For this code, kbch bits are information bits, and bits are coded redundant bits. Thus, a storage overhead
of is incurred.
Primitive BCH codes over the Galois field JF2 are a standard class of BCH codes. For a positive integer m > 3, a t-error-correcting code has parameters
where dmin is the minimum distance of the linear code. As a concrete example, for m = 6, a code exists for which t > 4.
The following t -error-correcting BCH codes exist: BCH (63,57,1) , BCH(63,45,3), BCH(63,39,4), BCH(63,36,5), BCH(63,30,6). Similarly, the following BCH codes of longer lengths exist: BCH(127,78,7), BCH(127,71,9), BCH(127,64,10), BCH(127,57,11), BCH(127,50,13). These BCH codes are applicable for DNA storage due to their short block lengths and error-correcting abilities per oligonucleotide. For a t- error-correcting BCH code with storage overhead the code corrects a fraction of
errors.
In accordance with an aspect of the present disclosure, error-correction across multiple oligonucleotide segments provides protection against missing segments, and segments that have an individual decoding error. A basic kind of error-correction code is a
Reed-Solomon (RS) code. An
code has a minimum Hamming distance of nrs— krs + 1. The storage overhead ratio is The code is able to correct up to U errors with unknown locations and E errors with known locations, where 2U + E < n— k. A Reed-Solomon code operates over a Galois field where m is a positive integer. The
block length is nrs = 2m— 1. For example, setting m = 8, the well-known RS (nrs = 255, krs = 223) code corrects up to 16 errors with unknown locations, and up to 32 errors with known locations.
For the DNA storage system of Figure 6, a fraction γ of the oligonucleotide segments have low-coverage during PCR-amplification. Thus, these segments are missing for decoding and reconstruction. Furthermore, despite synchronization and error-correction within each oligonucleotide, a fraction of oligonucleotide segments may be corrupted beyond repair. For these dual purposes, an error-correction code across blocks of oligonucleotides is crucial. An RS (nrs, krs) block is comprised of nrs oligonucleotide segments of which krs segments hold information, and (nrs— krs) segments hold redundant parity information. The segments which are missing or corrupted beyond repair are detectable because each oligonucleotide may be allocated address bits in addition to information bits. The address indicates not only the RS -block for each oligonucleotide, but also the specific position of the segment within the RS-block. Only correctly decoded oligonucleotide segments labeled by addresses are included for RS-block decoding.
Each oligonucleotide is equipped with an address since the DNA channel is partially described by a permutation channel as described herein. Thus, an overhead is incurred for DNA storage of where kaddr bits per oligonucleotide are reserved for
oligonucleotide.
The end-to-end coding system aims to achieve zero error (perfect reliability) in order to extract data from DNA. Depending on the number of insertions, deletions, and substitutions expected from the DNA channel, an amount of overhead is selected in order for the overall system to have a very low probability of error in reconstruction. Each block of Figure 6 has an associated storage overhead. The total storage overhead v incurred by the system is given by,
The overall storage ratio includes overhead for the addressing of each oligonucleotide error-correction across multiple oligonucleotide segments error-
correction within each oligonucleotide synchronization within each oligonucleotide
2 since the conversion is from kmod bits to nmod nucleotides.
Example 5 (Synchronization Overhead vs. Reliability): To illustrate the effect of synchronization on decoding error within each oligonucleotide, the following experiment was simulated based on all code blocks of Figure 6, excluding error-correction across multiple oligonucleotides. For each oligonucleotide, the BCH (nbch, kbch) code was selected with parameters kbch = 50 bits, and nbch = 127 bits. Thus each oligonucleotide is able to store 50 bits comprised of both address and information bits. An extra fixed bit was added to the end of the stream to allow for an even length of 128 bits.
In addition, for the synchronization code, a varying amount of overhead was selected. After every ksync symbols in which each symbol is set equivalent to 2 bits, a marker symbol was inserted. For example, for a synchronization overhead of
after every 2 symbols or 4 bits, a marker symbol is inserted. The beginning and end of the stream were without marker symbols. For synchronization ratios of tne overa^
lengths of the symbol streams were 128,95,85,79 symbols respectively. Each symbol of k-mod— 2 bits was modulated onto mod— 1 nucleotide, i.e., unconstrained modulation with . Thus, the oligonucleotide lengths were 128,95,85,79 nucleotides
respectively. An increasing oligo length accommodates more synchronization overhead. Figure 7 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of deletions. An increasing fraction of synchronization markers increases the probability of correct decoding at the expense of extra redundancy.
The results of decoding individual oligonucleotides is plotted in Figure 7 for different probabilities of deletion pdel in the DNA channel. Other types of error were excluded in the simulation. Each data-point in the curves is the result of decoding 104 oligonucleotides. The synchronization code with the highest overhead yields the best performance curve at the cost of synthesizing a longer oligonucleotide.
Example 6 (Effect of Diverse Errors on Reliability): In this experiment, a code was selected with a t = 13 error-correction ability, so
that each oligonucleotide is able to hold 50 bits of combined address and payload data. An extra fixed bit was added to each bit stream to allow for an even length of 128 bits. The synchronization code was selected with so that the total number of symbols was
95. Each symbol was mapped to a nucleotide with unconstrained modulation yielding oligonucleotide segments of length 95 nucleotides.
Figure 8 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of insertions, deletions, and substitutions. The plot shown in Figure 8 illustrates the decoding success probability under varying probabilities for across the DNA channel. A combined presence of diverse
errors is more challenging to accommodate compared to the presence of only one type of error. Nevertheless, the storage system is robust to varying storage conditions. Each data- point in the curves is the result of decoding 104 oligonucleotides.
Based on the simulations provided, deletions and insertions errors are costly to accommodate in terms of storage overhead. In addition, the addressing overhead is costly if a large number of oligonucleotide segments is necessary. As a basic calculation, the following parameters would enable a terabyte storage system for the DNA storage channel. Assuming unconstrained modulation in which a BCH (127,50) error-
correction code within each oligonucleotide (plus 1 bit of padding), a synchronization code of rate a rate one-half Reed-Solomon code with and an addressing
overhead of 42 bits of address vs. 8 bits ( 1 byte) of payload information per oligonucleotide, the total overhead is computed as follows.
In the above calculation, each oligonucleotide is of length 95 nucleotides, and stores 8 bits ( 1 byte) of payload information. The system is able to store potentially 242 X 1 = 4 terabytes of information since 242 oligonucleotide addresses are allocated. The storage overhead is costly due primarily in terms of the address space overhead of
5.25 which is fundamental to the problem of assembling segments in DNA storage.
To adjust the above calculations, it is possible to write more bits per oligonucleotide. For example, if 34 bits are reserved for addresses and 16 bits for payload information per oligonucleotide, then The storage overhead is then given by
Each oligonucleotide is able to store 16 bits (2 bytes) of information which allows 234 addresses possible. In this case, the storage system is able to store only a maximum of 234 X 2 = 32 gigabytes of information.
The above designs for DNA storage assume unconstrained modulation, with oligonucleotide lengths of 95 nucleotides. If modulation is constrained under low-cost molecular synthesis, the overhead ratio for modulation could rise up to in which
1 bit is written for every 1 nucleotide. In addition to increasing the total storage overhead ratio v, the lengths of each oligonucleotide would increase two-fold to 190 nucleotides.
The parameters for both exemplary terabyte and gigabyte DNA storage systems were selected to achieve high reliability in the presence of diverse errors in the DNA storage channel. The exact probability of perfect reconstruction may be estimated based on decoding curves for decoding individual oligonucleotide segments, as illustrated in Example 5 and Example 6. For example, for a rate one-half Reed-Solomon code per block
of oligonucleotide segments, if more than one -half of the segments are not missing and decoded correctly, the code recovers the entire block with zero error.
The preceding description illustrates that a much higher overhead redundancy ratio may be necessary to correct for synchronization errors. In accordance with an aspect of the present principles, an embodiment comprising coded multiple sequence alignment substantially reduces the overhead and/or corrects for hyper-lossy DNA channels.
Enzymatic synthesis of an oligonucleotide occurs incrementally with the addition of each nucleotide. In the hyper-lossy regime, the probability of deletion pdel as defined in the Markov chain model described above may be greater than 5%, i.e., pdel≥ 0.05. Fortunately, the biochemical process of synthesis produces thousands of parallel versions of the same oligonucleotide. Each parallel version provides biological diversity. The diverse versions may be corrupted by errors in different locations along the strand. In accordance with an aspect of the present principles, an embodiment harnesses this diversity as an aspect of error-correction. It will be assumed that the errors in multiple versions are statistically independent and identically distributed, although in practice the errors may be correlated.
Table 2 provides an example of multiple versions of a 16-nt oligonucleotide after enzymatic synthesis, PCR-amplification, and sequencing. The sequencing is conducted by reading several thousand times. Each parallel version accumulates a count of reads proportional to its prevalence relative to other versions synthesized. The table lists the versions ranked in order of the highest number of reads. Versions corrupted by rare types of errors do not rank highly in terms of the number of reads. Thus, filtering and sorting versions according to the number of reads improves decoding. However, as shown in Table 2, the versions with the highest number of reads do not necessarily have the lowest number of deletions or errors. Multiple sequence alignment (MSA) may be applied across the top-ranked candidate versions.
Aligning protein and nucleotide sequences is a classic problem in bioinformatics. For aligning more than two sequences, the original problem is NP-hard in time complexity. In the case of two sequences, well-known techniques include the Needleman-Wunsch algorithm for global alignment, and the Smith- Waterman algorithm for local alignment. Both algorithms apply dynamic programming to compare sequences and produce optimal local or globalmatchings. Given two sequences of lengths / and P, each algorithm has a time complexity of 0 J X P) . A straightforward generalization of the recursive algorithms to K > 2 sequences results in a prohibitive time complexity of 0(Ja) where / is the maximum length of all sequences. Several low-complexity algorithms for MSA with κ > 2 sequences have been developed such as T/M-COFFEE. Such algorithms reduce the time complexity from exponential to polynomial in the input sequence lengths. However, heuristics for aligning more than two sequences using consensus methods are sub-optimal.
The analysis of synchronization codes described above may be extended to incorporate multiple versions of an oligonucleotide. Each oligonucleotide is coded with synchronization symbols. During enzymatic synthesis, multiple versions are produced with different types of errors but each version has an identical synchronization pattern. A compatible decoder is able to differentiate between errors across parallel versions and improve its accuracy under certain restrictions. As described above, let T = T1T2 ...
denote the random input symbols, and S = S-^■■■Sj_1Sj the output symbol sequence. In
addition, assume that a second version of output symbols isavailable, denoted by V = V1V2 ... Vp-jVp. The sequences T, S, and V may have different lengths depending on the number of deletions and insertions that occur. In Appendix 10, the following posterior probability for the i -th transmitted symbol is derived, The
posterior probabilities are efficiently computable for two versions in a time complexity of 0(J X P). Similar to Eqn. (17) for single coded sequences, the following decoding rule may be applied by a synchronization decoder utilizing two parallel versions,
Example 7. Dynamic Programming— Two Versions To simulate improvements in decoding due to alignment with two versions, the parameters of the DNA storage system of Example 6 are replicated: ^ addition, only deletion
errors are simulated with pdel varied. As shown in Figure ?, coded MSA with two parallel versions improves the probability of correct decoding of an oligonucleotide by an order of magnitude. The diversity in two parallel versions improves synchronization.
In accordance with an aspect of the present principles, an embodiment may comprise coded alignment of multiple sequences as described herein. In accordance with another aspect, an embodiment may comprise consensus via product rule. For multiple short DNA sequences, decoding via dynamic programming is feasible and yields the optimal per-symbol decoding rule. However, in general, aligning κ > 2 sequences of moderate lengths requires efficient consensus methods. Assume that κ versions of an oligonucleotide are sequenced, where κ is an even integer. Denote all paired versions as
compute the per-symbol probability given in Equation (10). In order to combine the probabilities across all pairs, assuming that each symbol transmitted is equally likely, one method is to apply the product rule over all pairs,
For combining classifiers, the product rule is well-known to those skilled in the art. The assumption is that each version of an oligonucleotide is independently synthesized in parallel, similar to the case of combining classifier outputs based onindependent measurements.
Example 8. (Coded MSA With 10 Versions) To simulate improvements in decoding due to alignment with multiple versions, the parameters of the DNA storage system of Example 6 are replicated again: ^6 deletion
probability pdel is varied. As shown in Figure 10, coded MSA with ten parallel versions improves the probability of correct decoding of an oligonucleotide by several orders of magnitude. The diversity in pairs of sequences is exploited via dynamic programming algorithms. The product rule of Equation 11 is applied for consensus over five pairs of sequences.
In accordance with another aspect of the present principles, an embodiment may comprise consensus via majority v.oting
To further understand the product rule of Equation 11, it is possible to analyze a related rule, majority voting, as a method of consensus across multiple pairs of sequences. In majority voting, based on the r-th pair of output versions, an estimate ii r is computed of the i-th input symbol tj,
symbol based on the r-th pair of output versions. Define the following independent Bernoulli random variables,
The independence of the Zix variables follows from the assumption that the multiple versions are independent. Under certain restrictions, majority voting boosts the probability of correct decoding via consensus.
Theorem 3. Consensus Via Majority Voting Consider independent Bernoulli random variables Zix as in Equation (13). For each i-th input symbol and r-th pair of output versions, assume that the estimate is incorrect less than half the time, i.e.
versions κ increases, the probability of majority error decreases to zero exponentially fast. Proof. Provided in Appendix V.
Theorem 3 shows that as long as the individual estimates based on each pair of versions are correct more than half the time, then a majority consensus over all pairs boosts the probability of correct decoding rapidly. Conversely, if the individual estimates are too noisy, then consensus via majority voting will not help improve decoding.
Without the element of synchronization codes, the sample complexity of uncoded alignment in the presence of deletion errors was recently solved theoretically. The following theorem establishes that a sub-exponential number of diverse sequences is required at a minimum to decode a binary string corrupted by deletions.
Theorem 4. Trace Reconstruction For any deletion probability pdel < 1, and any δ > 0, there exists a finite constant C such that, for any original binary string of length n, trace reconstruction succeeds with probability at least 1— δ if given
i.i.d. samples of the deletion channel applied to the string.
Thus, without coding, a significant number of diverse sequences may be necessary for alignment in the presence of deletions. Although Theorem 4 is valid for any Pdei, it is an asymptotic bound in the length of the binary string. The constant C may be large depending on p^e and the theorem does not provide a characterization for reconstruction from a fixed number of samples. In our DNA storage system, the number of diverse
sequences per oligonucleotide might not equal a sub -exponential amount. Moreover, although very short sequences are numerous, the number of sequences of adequate length is low for enzymatic synthesis. Therefore, coded alignment allows for recovery from fewer total samples. It is also possible that existing algorithms for trace reconstruction could be modified to include synchronization markers and error-correction codes. The exact tradeoff of the amount of redundancy vs. accuracy in decoding is an open question.
Experimental verification of theory is provided in this section. It is shown empirically that coded alignment utilizing diversity and consensus allows for correct decoding of oligonucleotides with a fewer number of reads. In addition, by using redundancy, codes enable the synthesis of longer oligonucleotides by compensating for hyper- lossy conditions.
Figure 12 diagrams a DNA storage experiment with four oligonucleotide templates denoted by H2, H3, H4, H5. Each oligonucleotide stores 16 bits comprising 14 bits of data and 2 bits of address. The 16 bits are modulated onto 16 nucleotides. In addition, all nucleotide sequences include a synchronization pattern C, T, A, C. Aside from addressing, synchronization, and simple modulation, there is no extra redundancy added for error- correction. The total overhead redundancy is or slightly more than twice the
information content. In an experiment designated "Eureka experiment", a total of 7 character bytes of the message "Eureka!" are synthesized and decoded correctly.
The read distribution indicates the number of reads for each length of oligonucleotide. As displayed in Figure 11 for each template oligo H2-H5, out of the total number of reads, a significant portion of the reads are short, and a diminishing fraction reach the target length of K = 16. While the exact distribution of the data is provided, it is helpful to model the distribution mathematically. The dotted-dashed line of each bar plot indicates a mixture distribution 0.2Qx(x) + 0.8Px(x) where Qx(x) is a uniform distribution over short oligo lengths, and Px(x) is a binomial distribution in which only deletions are modeled for the particular Eureka experiment.
The distribution Qx(x) represents the possibility of early termination of synthesis which results in short oligo lengths, whereas the binomial distribution Px(x) is the expected distribution if expecting only deletion errors. The probability of deletion pdel for each template H2-H5 was slightly different in the fitted mixture distribution. For H2 template, pdel = 0.59; for the H3 template, pdel = 0.55; for the H4 template, pdel = 0.65; and for the H5 template pdel = 0.56. The H4 data is the most noisy.
The statistics of the read distribution reveal that the synthesis as well as the subsequent amplification and sequencing are hyper-lossy, and that the binomial models the tail of the data distribution. The synthesis process is able to produce many longer lengths before the target length which is difficult to obtain. For the Eureka experiment, the synthesized versions of each template H2-H5 contain deletions, but such versions may be utilized for decoding using diversity. If given enough synthesized versions of adequate lengths, the original template may be decoded with high probability.
Assuming a length of K nucleotides per oligonucleotide, there exist
different types of synthesized versions of length x if only deletion errors occur. In actual experiments with H2-H5 templates, deletion errors were the most prominent errors. Mathematically, the number of diverse types of synthesized versions is very large. For example, assuming a template of K nucleotides which undergoes deletions during synthesis, the number of types of versions of length K— 2 grows as 0(K2) due to a combinatorial number of ways in which 2 deletions may occur. Similarly, the number of types of versions of length K— 3 grows as 0(K3). Due to the binomial theorem, a total of
diverse types of all possible lengths exist for 0 < x < K. Thus, the mathematical diversity of types provides hope to synthesize different versions for decoding purposes.
In biological experiments, the biological diversity in types for an input length of K = 16 nucleotides is graphed in Figure 13. Although the biological diversity is far less than the combinatorial amount predicted by mathematics, a Gaussian distribution in visible in which the number of diverse types of versions is highest at approximately
nucleotides in length. In the tail of the Gaussian bell curve, a few versions may be
synthesized perfectly if enough reads are performed during sequencing. However, synthesizing a perfect oligonucleotide has a very low probability of occurrence which becomes even less as K increases. As the input length K increases, it is more and more likely to read versions which have a few deletions, and to read versions which are diverse.
The diversity in the types of versions may be harnessed for decoding the original template correctly. In Table 3, decoding the H4 template is analyzed. Based on the original data set of 107034 sequenced reads, multiple bootstrapped data sets are created depending on a target number of total sequenced reads. The versions which have high read counts also have a high probability of appearing in bootstrapped data sets. For a particular bootstrapped data sample for the H4 template, Table 3 lists the top 10 filtered versions. Dynamic programming is utilized to decode the top 5 versions and the bottom 5 versions separately, followed by a product rulefor consensus. Such an algorithm is successful for the particular example. In general, the probability of correctly decoding each of the H2-H5 templates for a different number of target reads is plotted in Figure 14. The overall decoding illustrates that a decreased number of total reads is necessary to decode the original templates since synchronization with diversity and consensus is able to compensate for hyper-lossy conditions.
TABLE ΙII
In regard to the H4 template and the oligo patterns shown in Figure 12, the modulation scheme or mapping applied in Figure 12 differs from that described above and illustrated in Figure 4. The modulation scheme of Figure 12 may be understood from the diagrams and tables in Figures 18 A, 18B, and 19 through 21. Figure 18A illustrates a nucleotide template for the modulation used for the oligos of shown in Figure 12. In Figure 18 A, there are 16 blocks representing 16 nucleotides of oligos produced according to the scheme. The four blocks labeled C, A, T, C are encoded as these types of nucleotides to provide a synchronization pattern as described herein. The eight unshaded and unlabeled blocks, i.e., blocks 1 through 5, 7, 10, and 13 from left to right, correspond to nucleotides encoding trits rather than bits. That is, 12 bits are encoded into 8 nucleotides or approximately 1.5 bits per nucleotide, i.e., one trit per nucleotide. The four shaded blocks, i.e., blocks 8, 11, 14 and 16, correspond to nucleotides that will encode one bit per nucleotide. Therefore, the 12 nucleotides other than the synchronization nucleotides encode a total of 16 bits., i.e., 12 bits encoded using trits into 8 nucleotides plus 4 bits encoded into 4 nucleotides. The described approach for encoding 16 bits into 12 nucleotides advantageously provides greater data encoding density than bits alone. However, note that although it may appear that each nucleotide could encode two bits because there are four possible types of nucleotides, i.e., A, T, C, G, which might appear to enable encoding 24 bits into 12 nucleotides, not all transitions are permitted as explained previously in regard to constraints associated with the modulation scheme of Figure 4. Similar constraints apply to the modulation scheme of Figure 12. For example, two adjacent nucleotides of the same type are not permitted.
Figure 18B shows the modulation mapping to produce the oligo H4 shown in Figure 12 and based on the overall approach explained above in regard to Figure 18A. Details of the mapping shown in Figure 18B may be better understood by referring to the tables shown in Figures 19, 20 and 21. The table on the left side of Figure 19 shows a mapping of bits to trits used to encode 12 bits into the eight nucleotides encoding trits. The table on the right side shows the inverse mapping, i.e., trits to bits, that may be used for decoding. The table in Figure 20 shows the mapping of bits to nucleotides for the four nucleotides encoding bits. Figure 20 defines the current nucleotide to write given the previous
nucleotide (left most column in Figure 20), the next nucleotide (second column from the left), and the current value of the bit to be encoded, i.e., 0 or 1 (two columns on the right side of the table). The table in Figure 20 takes into account the transitions forbidden by the constraints. Figure 21 illustrates the encoding of trits to nucleotides. That is, using the mapping of bits to trits shown in Figure 19, the resulting trits are then encoded into nucleotides using the table of Figure 21. The mapping associated with the table shown in Figure 21 takes into account the transitions forbidden by the constraints. Thus, the nucleotide to be written depends on both the trit to be encoded and the previous nucleotide as shown in Figure 21.
As the modulation mappings of Figure 4 and Figure 12 demonstrate, the present principles envision and encompass various embodiments of modulation approaches or mappings. As another example, Figure 22 shows another exemplary embodiment of modulation mapping using a striped synchronization pattern and 24 nucleotides total of which 7 nucleotides form a synchronization pattern (e.g., C, T, A, C, T, A, C in Figure 22), 13 nucleotides encode bits, and 4 nucleotides encode trits (6 bits).
Figure 15 illustrates a high-level block diagram of another embodiment of the overall encoding and decoding process, apparatus or system for storage of data in DNA and retrieval of that data. The system illustrated in Figure 15 includes certain functions that are similar to those illustrated in Figure 1. That is, certain blocks of Figure 15 correspond to blocks of Figure 1. Exceptions are explained below. Those that correspond have like reference designators. As in Figure 1, Figure 15 depicts a complete digital data storage and recovery system. Also as in Figure 1, a digital data management and processing portion 101 indicated in phantom includes encoding, decoding and error management of digital data and a DNA storage and recovery portion 102 also indicated in phantom includes DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing. The system of Figure 15 is comprised of the following components. Block 110 indicates digital data or information existing in digital data bits both for input data to be stored and output data recovered from storage. Encoding mechanism 120 encodes the input information in bits including all error-correction codes and modulation from bits provided by block 110 to oligonucleotides referred to as oligos in Figure 15. A DNA synthesis function 130 synthesizes or writes the oligos encoded at 120 to multiple
DNA segments. In Figure 1, DNA synthesis 130 is followed by PCR amplification of the DNA segments 140 and storage of the amplified DNA segments 150. Recovery or retrieval of the data in the embodiment of Figure 1 begins by DNA sequencing of the amplified DNA strands at 160. The embodiment of Figure 15 differs from that of Figure 1 in that synthesis of the DNA strands 130, i.e., writing the encoded oligo information to DNA molecules, is followed by storage of the synthesized strands at 1550. Then, recovery of data from the DNA stored at 1550 begins by selecting the DNA to be recovered and PCR amplification of the selected DNA at 1540. Storage immediately following synthesis and PCR amplification of only DNA selected from storage as in Figure 15 is referred to as random-access amplification. DNA sequencing occurs in Figure 15 at 160 as in Figure 1 and is followed by clustering and filtering of the DNA reads at 170 to produce recovered oligos (corresponds to merging at 170 in Figure 1). The recovered oligos produced at 170 are decoded at 180 in Figure 15 (and in Figure 1) to produce recovered data bits or payload information at 110 in both Figure 15 and Figure 1.
Figure 16 shows a block diagram of another embodiment of an overall encoding and decoding process, apparatus or system in accordance with the present principles and provides additional information regarding each step of the process or component of the system. Input data, also referred to herein as information or payload, is shown in Figure 16 as two rows or words of digital data bits 1605. As will be apparent to one skilled in the art, input information to the system might also be provided in analog form and undergo analog to digital conversion, e.g., processed through an analog to digital converter (ADC), prior to processing for storage in DNA by the system in Figure 16. Input data 1605 is first processed in an encoder 1601 that encodes the input digital bits to oligos. Encoding 1601 begins at 1610 with addition of address bits to the digital data words or rows. The address bits enable identification of the sequence of words when decoded. The result of adding address bits is shown at 1612 where two address bits are prepended to each digital input or payload row or word. Although shown Figure 16 as prepended to the digital words, it will be apparent to one skilled in the art that the added bits may be added at other locations in the digital words. At 1615, an additional row or word is added to provide block error correction capability, e.g., Reed-Solomon or Fountain, as illustrated at 1617. Adding a row enables recovery of a missing row if two of three rows are recovered. The data from 1615
is processed at 1620 to add bits to each row to implement error correction within one word or row, e.g., BCH or LDPC error correction. The added bits in each row are illustrated at 1622 as three bits appended to each row. Although shown in Figure 16 as appended to the digital words, it will be apparent to one skilled in the art that the added bits may be added at other locations in the digital words. The rows of bits ouput from 1620 are mapped to nucleotides at 1625 according to a modulation map such as that described above in Figure 4 or that shown in Figure 12. In addition to modulation, processing at 1625 adds nucleotides beyond those produced by the modulation to provide for synchronization as described herein. The output of encoder 1601, i.e., the oligo codes output from 1625, are illustrated at 1627 where each row provides an oligo code to be input to an enzymatic DNA synthesis process for synthesis or writing of a DNA strand for the oligo code. The rows shown at 1627 illustrate the synchronization feature added at 1625. In particular, in the example of 1627, each row or oligo code includes C, A and T nucleotides at the fifth, eighth, and eleventh positions. This pattern of CAT nucleotides at specific positions in each row enables implementation of synchronizing or alignment as described herein of the oligos that are recovered or retrieved from the pool of stored DNA strands. Other synchronization patterns may be used as described herein.
Continuing with Figure 16, decoding occurs in decoder 1602. DNA oligos to be decoded are illustrated at 1652. The oligos available for decoding at 1652 are produced by selecting and retrieving DNA strands from the DNA pool or solution, PCR amplifying the selected DNA (e.g., random access amplification as described in regard to 1540 of Figure 15), and sequencing the amplified DNA. The result from sequencing that is illustrated at 1652 indicates that a plurality of versions of the first oligo and of the third oligo (e.g., ten of each as indicated by "10X" at 1652) have been selected from the amplified material. The second oligo was not available or is missing, e.g., as a result of the lossy characteristics of the low-cost enzymatic DNA synthesis process. The available oligos are processed at 1660 applying coded alignment, consensus and diversity techniques as described herein to the plurality of versions of each of the first and third oligos. The processing produces one version of each of the first and third oligos at 1662 that statistically are the closest match to the correct oligos. However, typically errors remain within each oligo. For example, in the first oligo at 1662 the third, ninth and twelfth (final) nucleotides may be erroneous.
Also, the second oligo continues to be missing, i.e., there is no information regarding the second oligo at 1662. Demodulation occurs at 1665 using the synchronization information provided during encoding and by applying the inverse of the modulation mapping used during encoding at 1625. That is, 1665 maps the nucleotides of the oligos at 1662 to data bits creating data words or rows as illustrated at 1667. Due to the errors remaining in the rows or oligos available at 1662, corresponding errors exist in the words or rows of data bits produced by demodulation. For example, if the third, ninth and twelfth nucleotides of the first oligo are erroneous as in the example mentioned above then the data bits of the binary words produced during demodulation corresponding to the erroneous nucleotides will also be erroneous. These errors within the binary rows or words at 1667 are corrected at 1670 by applying the oligo error correction feature implemented during encoding at 1620, e.g., BCH/LDPC codes. The error corrected binary rows produced by 1670 are illustrated at 1672. That is, the first and third rows at 1672 are correct or bit accurate. However, the second row remains missing at 1672. The missing second row at 1672 is recovered at 1675 by using the block error correction information of the third row added during encoding at 1615, e.g., RS/Fountain coding. The result of 1675 is two bit accurate rows of digital information or data illustrated at 1677. Each row of binary data at 1677 includes the addressing information added to the payload data during encoding at 1610. The payload data is extracted at 1680 to produce the two output binary words at 1682 that are bit accurate reproductions of the original input data at 1605.
Figure 17 shows charts illustrating the multiple diverse versions of an oligo that may be produced by a lossy DNA synthesis procedure such as enzymatic synthesis. The top of Figure 17 illustrates a correct intended version of an oligonucleotide designated "H4" that might be produced by encoder 1601 of Figure 16 as described above. Nucleotides inserted into H4 as synchronization markers during modulation (e.g., 1625 in Figure 16) are illustrated in the upper chart of Figure 17 by the highlighted sixth, ninth, twelfth and fifteenth nucleotides in H4. The chart in the lower portion of Figure 17 illustrates ten versions of oligo H4 that may be available in the DNA pool or solution following enzymatic synthesis and PCR amplification as described herein (e.g., following 130, 1550 and 1540 in Figure 15). The ten versions illustrated in Figure 17 corrrespond to the ten versions listed in Table III above. The versions further illustrate various errors that
may be produced by the lossy enzymatic synthesis process, e.g., nucleotide deletions and/or replacements and/or insertions. That is, one or more versions may be shorter or longer than H4 due to deletions or insertions; one or more versions may have incorrect nucleotides in certain positions due to replacments. A diverse group of erroneous oligo versions as illustrated in Figure 17 may be processed in accordance with the present principles using coded multiple sequence alignment and consensus techniques as described herein, e.g., the description above regarding Table III, to recover a sufficiently accurate version of H4 that may be used to recover the original payload data.
In accordance with an aspect of the present principles, an embodiment comprises a method for correcting hyper-lossy DNA storage systems as described herein. An aspect of the method comprises harnessing the diversity of enzymatic synthesis to assemble each oligonucleotide. The exemplary DNA storage method is robust to the various types of errors introduced by low-cost biological processes. In accordance with other aspects, an embodiment comprises: (1) coded multiple sequence alignment; (2) harnessing diversity in sequences; (3) harnessing consensus and combining classifiers; (4) harnessing error- correction codes; (5) synchronization codes; (6) specialized modulation from bits to nucleotides; (7) dynamic programming methods for efficient synchronization. In accordance with another aspect, an embodiment decreases the number of reads necessary for reliable decoding, and allows for enzymatic synthesis to write longer oligonucleotide segments. Thus, the present principles enable use of low-cost, lossy or error-prone DNA systems for reliable data storage enabling use of such systems for various data storage applications in any system processing and storing data to improve the operation and data storage capabilities of such systems.
In accordance with another aspect, in an embodiment coded alignment as described herein provides one way of coding to at least partially address use of a losy or error-prone DNA system, e.g., low cost, enzymatic synthesis and/or nanopore sequencing. In another embodiment, replication (duplication) or redundancy of nucleotides may be used to address high error rates that persist. Duplication can be detected in stored oligos similar to the detection of synchronization markers described herein. Duplication may be combined with consensus techniques as described herein to mitigate lossy or error-prone processing, e.g.,
synthesis such as low-cost enzymatic synthesis and/or sequencing such as nanopore sequencing.
The present description illustrates the present principles. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the present principles and are included within its spirit and scope.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the present principles and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. For example, use in the description when referring to the drawings of "top", "bottom", "left", "right" and other such terms indicating an orientation or relative relationship between areas of the Figures are illustrative only and not limiting as to the present principles.
Moreover, all statements herein reciting principles, aspects, and embodiments of the present principles, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Reference in the specification to "one embodiment" or "an embodiment" of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the present principles are not limited to those precise embodiments, and that various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the
scope or spirit of the present principles. All such changes and modifications are intended to be included within the scope of the present principles.
APPENDIX I: Proof of Theorem 1
Define the random variable T as the number of nucleotides written starting from state sk and ending at state sk+1 in the Markov chain. Define random variable U as the number of nucleotides written starting from either the WRITE state or the WRITE ERROR state and ending at the state sk+1. The random variable U has a geometric distribution:
generating function of U is given by
for n > 1 with a base case of PT(0) = Paei- The probabilities Pj(n) for n > 1 are defined recursively. The generating function associated to random variable T is
The moments of T may be generated from the function GT(ω). For example, the mean and variance of the number of nucleotides written starting at state sk and ending at state sk+1 are
Higher-order moments of T may also be computed from the generating function GT(ω).
APPENDIX II: Proof of Theorem 2
Consider an upper bound due to the permutation channel. A related mathematical problem is first presented. Let set {xj for contain non-negative integers. Let
permutations destroy ordering of oligonucleotide segments. Information is only transmitted in the number of different types of segments possible. In this case, there are 4K unique oligonucleotide segments possible of length K , and M total segments. This problem corresponds to letting / = 4K and Y = M. The total capacity in bits of the permutation channel follows directly, and is as claimed in Eqn. (5).
Consider an upper bound due to individual nucleotide deletions with probability In addition, there exist a fraction γ of full oligonucleotide segments which are omitted.
An upper bound on the deletion channel is possible by relaxing the channel to an erasure channel. Assuming an overall fraction °f nucleotides are erased, the upper
bound is as claimed in Equation (6).
Consider an upper bound due to a quaternary symmetric channel with input and output alphabet Let X be a random input to the channel, and Y be a
random output. The conditional distribution of the quaternary channel is as follows. If a nucleotide is transmitted, then with probability 1— psub , the correct nucleotide is received.
is the z-ary generalization of the entropy function. An upper bound on the overall storage capacity is equivalent to MK uses of this noisy discrete memoryless quaternary channel as claimed in Equation (7).
APPENDIX III: /β Recursions
To compute (14) and (15) recursively, the Markov chain as explained above may be approximated by "unwrapping" it. The assumed modulation is unconstrained with 2 bits mapping to 1 nucleotide. A symbol is assumed to be 2 bits. Insertions, deletions, or substitutions of nucleotides correspond to insertions, deletions, or substitutions of symbols.
To simplify the Markov model further, assume pbur = 0, and consider a model in which only a maximum of two insertions is allowed. Consequently, defining the probability of symbol transmission be defined as there exist six possible error
events in the model:
• A symbol deletion occurs with probability
• A symbol is transmitted with probability
• A random symbol is inserted followed by a symbol transmission, with probability
• Two random symbols are inserted followed by a deletion, with probability
· Two random symbols are inserted followed by a symbol transmission, with probability
Based on the six possibilities of the DNA channel, the /β probabilities may be computed recursively. Denote by
the probability of a uniformly random symbol. In the case of symbols of 2 bits, Furthermore, define the following function involving the
symbol substitution error probability
Then the forward probabilities defined in Equation (14) may be determined recursively as follows,
Similarly the probabilities of Eqn. (15) may be determined recursively as follows,
The posterior probabilities may be computed based on the α/β recursions.
The posterior probability
channel, it is possible to estimate the posterior probability via dynamic programming. A simple estimator is able to decode each input 7^ by selecting the symbol with the highest posterior probability. In addition, an outer code could utilize the full sequence of raw posterior probabilities to jointly decode the input symbols.
APPENDIX IV: Coded MSA - Harnessing the Diversity of Two Parallel Versions
To compute posterior probabilities with two parallel versions of an oligonucleotide, define the event to represent that j symbols were received of the S version after i
symbols were transmitted of the input T sequence. Similarly, define the event to
represent that p symbols were received of the V sequence after i symbols were transmitted of the input T sequence. As in Appendix III, the alpha and beta probabilities for the S version are given by,
To compute the posterior probability of an input symbol and a specific position, to simplify exposition of the formulas, assume that pins = 0, i.e., no insertions occur. In this case, tallying all probabilities for both received sequences S and V results in the following computation.
where the probabilities are defined as follows.
The probability represents a deletion event in the S sequence and a deletion event in the V sequence. The η2 probability represents a deletion event in the S sequence and a transmission event in the V sequence. Similarly, the η3 probability represents a transmission event in the S sequence and a deletion event in the V sequence. The η probability represents transmission events in both sequences. After computing the alpha and beta probabilities, the computation for the posterior requires 0(J X P) time complexity.
APPENDIX V: Chernoff Bound for Majority Voting Rule
The following binary divergence function is defined for
Consider n independent and identically distributed Bernoulli random variables
where Chernoff s bound states that for a > p,
To prove Theorem 3, note that the probability of majority error over all - versions is given by,
Note that Zi r are independent Bernoulli-^ random variables and Chernoff s bound may be applied. Setting and p = η in Eqn. (28) proves the result. The result is
estimate must be correct greater than 50% of the time in order for consensus to improve decoding accuracy.
Claims
1. A method of encoding digital data comprising:
adding address information to the digital data;
converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable:
- synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and
- decoding the digital data from the DNA oligonucleotide using coded alignment.
2. An encoder for encoding digital data comprising at least one processor configured for
adding address information to the digital data;
converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable:
- synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and
- decoding the digital data from the DNA oligonucleotide using coded alignment.
3. The method of claim 1 or the encoder of claim 2 wherein the synchronization information comprises a synchronization marker.
4. The method of claim 1 or 3 the encoder of claim 2 or 3 wherein the plurality of codes including the synchronization information is further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotide produced by the lossy DNA synthesis process during decoding.
5. The method of claim 1, 3 or 4 or the encoder of claim 2, 3 or 4 wherein the plurality of codes including the synchronization information is further configured to enable the use of diversity to include consensus.
6. The method of any of claims 1, 3, 4, or 5 or the encoder of any of claims 2,
3, 4, or 5 further comprising adding error correction information to the digital data following adding the address information and before converting, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes.
7. The method or encoder of claim 6 wherein the error correction information includes a block error correction code and a word error correction code.
8. The method or encoder of claim 7 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
9. The method or encoder of claim 7 or 8 wherein the word error correction code includes one of a BCH code and a LDPC code.
10. A method of encoding digital data comprising:
adding address information to the digital data;
adding error correction information to the digital data including the address information;
converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a
deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
11. A device for encoding digital data comprising at least one processor configured for adding address information to the digital data;
adding error correction information to the digital data including the address information;
converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide using coded alignment.
12. The method of claim 10 or the device of claim 11 wherein the
synchronization information comprises a synchronization marker.
13. The method of claim 10 or 12 or the device of claim 11 or 12 wherein the plurality of codes including the synchronization information is further configured to enable a use of diversity among a plurality of versions of the DNA oligonucleotide produced by the lossy DNA synthesis process during the subsequent decoding.
14. The method of claim 10, 12, or 13 or the device of claim 11, 12, or 13 wherein the plurality of codes including the synchronization information is further configured to enable the use of diversity to include a use of consensus.
15. The method of any of claim 10, 12, 13, or 14 or the device of claim 11, 12, 13 or 14 wherein the error correction information includes a block error correction code and a word error correction code.
16. The method or device of claim 15 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
17. The method or device of claim 15 or 16 wherein the word error correction code includes one of a BCH code and a LDPC code.
18. The method of any of claims 1, 3 to 10, and 12 to 17 or device of claims 2 to 9 and 11 to 17 wherein converting comprises a mapping of the digital data to nucleotides according to a modulation map.
19. The method or device of claim 18 wherein the modulation map is in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
20. The method or device of claim 18 or 19 wherein the modulation map comprises writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
21. A method of decoding digital data stored in DNA comprising:
aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides;
demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and
extracting the digital data from the plurality of binary words.
22. A device for decoding digital data stored in DNA comprising at least one processor configured for:
aligning a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the aligning uses coded alignment based on synchronization information included in each of the plurality of oligonucleotides;
demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and
extracting the digital data from the plurality of binary words.
23. The method of claim 21 or the device of claim 22 wherein the synchronization information comprises a synchronization marker.
24. The method of claim 21 or 23 or the device of claim 22 or 23 wherein aligning further comprises using a diversity among the plurality of oligonucleotides.
25. The method or device of claim 24 wherein aligning using the diversity further includes a use of consensus.
26. The method of any of claims 21 or 23 to 25 or the device of any of claims
22 to 25 wherein each of the plurality of binary words comprises error correction information, and extracting the digital data includes error correction based on the error correction information.
27. The method or device of claim 26 wherein the error correction information includes a block error correction code and a word error correction code.
28. The method or device of claim 27 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
29. The method or device of claim 27 or 28 wherein the word error correction code includes one of a BCH code and a LDPC code.
30. The method of any of claims 21 or 23 to 29 or the device of any of claims 22 to 29 wherein the modulation map is in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
31. The method of any of claims 21 or 23 to 30 or the device of any of claims 22 to 30 wherein the modulation map comprises writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
32. The method or device of any of the preceding claims wherein the lossy synthesis process comprises an enzymatic molecular synthesis process.
33. A non-transitory computer-readable medium storing computer-executable instructions executable to perform a method according to any of claims 1, 3 to 10, 12 to 21, or 23 to 32.
A method of decoding digital data stored in DNA molecules comprising: sing a plurality of DNA molecules storing encoded digital data, wherein: the encoded digital data includes a code component comprising an address, synchronization information, and error correction
information, and
the DNA molecules were synthesized using a lossy DNA synthesis process;
sequencing the plurality of DNA molecules;
merging and assembling the plurality of DNA molecules to form a plurality of
DNA oligonucleotides; and
decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes
applying a coded multiple sequence alignment (MSA) procedure to the plurality of DNA segments using the marker symbol in each of the plurality of segments to align the plurality of segments,
processing the aligned plurality of segments using a diversity scheme to extract digital information from the aligned plurality of segments, and
performing error correction on the extracted digital information using the error correction component to produce the decoded digital data.
34. The method of claim 33 wherein the lossy DNA synthesis process comprises an enzymatic molecular synthesis process.
35. Apparatus comprising:
an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map;
a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process;
a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and
an archive to store the plurality of DNA segments.
36. Apparatus comprising:
a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and
a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder:
aligns the plurality of oligonucleotides using coded alignment based on synchronization information included in the plurality of oligonucleotides;
demodulates the aligned plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and
extracts the digital data from the binary words.
37. A system comprising:
an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map;
a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process;
a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments;
an archive to store the plurality of DNA segments;
a sequencer sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and
a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder:
aligns the plurality of oligonucleotides using coded alignment based on synchronization information included in the plurality of oligonucleotides;
demodulates the aligned plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and
- extracts the digital data from the binary words.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762458093P | 2017-02-13 | 2017-02-13 | |
US62/458,093 | 2017-02-13 | ||
US201762565142P | 2017-09-29 | 2017-09-29 | |
US201762565138P | 2017-09-29 | 2017-09-29 | |
US62/565,142 | 2017-09-29 | ||
US62/565,138 | 2017-09-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018148260A1 true WO2018148260A1 (en) | 2018-08-16 |
Family
ID=61244788
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/017193 WO2018148260A1 (en) | 2017-02-13 | 2018-02-07 | Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) |
PCT/US2018/017188 WO2018148257A1 (en) | 2017-02-13 | 2018-02-07 | Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/017188 WO2018148257A1 (en) | 2017-02-13 | 2018-02-07 | Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) |
Country Status (1)
Country | Link |
---|---|
WO (2) | WO2018148260A1 (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10650312B2 (en) | 2016-11-16 | 2020-05-12 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
WO2020239806A1 (en) * | 2019-05-27 | 2020-12-03 | Vib Vzw | A method of storing digital information in pools of nucleic acid molecules |
WO2021045816A1 (en) * | 2019-09-05 | 2021-03-11 | Microsoft Technology Licensing, Llc | Reverse concatenation of error-correcting codes in dna data storage |
WO2021045830A1 (en) | 2019-09-06 | 2021-03-11 | Microsoft Technology Licensing, Llc | Array-based enzymatic oligonucleotide synthesis |
US10956806B2 (en) | 2019-06-10 | 2021-03-23 | International Business Machines Corporation | Efficient assembly of oligonucleotides for nucleic acid based data storage |
KR20210055071A (en) * | 2018-09-07 | 2021-05-14 | 이리디아, 인크. | Improved systems and methods for recording and reading data stored on polymers |
US11227219B2 (en) | 2018-05-16 | 2022-01-18 | Catalog Technologies, Inc. | Compositions and methods for nucleic acid-based data storage |
US11286479B2 (en) | 2018-03-16 | 2022-03-29 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
US11306353B2 (en) | 2020-05-11 | 2022-04-19 | Catalog Technologies, Inc. | Programs and functions in DNA-based data storage |
EP3904527A4 (en) * | 2018-12-26 | 2022-08-10 | BGI Shenzhen | METHOD AND APPARATUS FOR FIXED-POINT EDITING OF A DATA-STORED NUCLEOTIDE SEQUENCE |
US11535842B2 (en) | 2019-10-11 | 2022-12-27 | Catalog Technologies, Inc. | Nucleic acid security and authentication |
KR20230022510A (en) * | 2021-08-09 | 2023-02-16 | 서울대학교산학협력단 | Method, program and apparatus for encoding and decoding dna data using low density parity check code |
US11610651B2 (en) | 2019-05-09 | 2023-03-21 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
US11763169B2 (en) | 2016-11-16 | 2023-09-19 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
WO2025058684A1 (en) * | 2023-09-11 | 2025-03-20 | Western Digital Technologies, Inc. | Multi-tier error correction codes for dna data storage |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11781169B2 (en) | 2018-10-05 | 2023-10-10 | Microsoft Technology Licensing, Llc | Enzymatic DNA repair |
CN111680797B (en) * | 2020-05-08 | 2023-06-06 | 中国科学院计算技术研究所 | A DNA movable type printing machine, DNA-based data storage device and method |
CN112079893B (en) * | 2020-09-23 | 2022-05-03 | 南京原码科技合伙企业(有限合伙) | Method for synthesizing text required by DNA storage based on solid phase chemical synthesis method |
CN116226049B (en) * | 2022-12-19 | 2023-11-10 | 武汉大学 | Method, system and equipment for storing information by using DNA based on large and small fountain codes |
-
2018
- 2018-02-07 WO PCT/US2018/017193 patent/WO2018148260A1/en active Application Filing
- 2018-02-07 WO PCT/US2018/017188 patent/WO2018148257A1/en active Application Filing
Non-Patent Citations (5)
Title |
---|
BLAWAT MEINOLF ET AL: "Forward Error Correction for DNA Data Storage", PROCEDIA COMPUTER SCIENCE, ELSEVIER, AMSTERDAM, NL, vol. 80, 1 June 2016 (2016-06-01), pages 1011 - 1022, XP029565777, ISSN: 1877-0509, DOI: 10.1016/J.PROCS.2016.05.398 * |
ERIC L ANSON ET AL: "ReAligner", PROCEEDINGS OF THE FIRST ANNUAL INTERNATIONAL CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY, SANTA FE, NEW YORK, ACM, US, 19 January 1997 (1997-01-19), pages 9 - 16, XP058359793, ISBN: 978-0-89791-882-4, DOI: 10.1145/267521.267524 * |
GOELA NAVEEN ET AL: "Advances in DNA storage", PROC. 2017 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), IEEE, 12 February 2017 (2017-02-12), pages 1, XP033146241, DOI: 10.1109/ITA.2017.8023453 * |
GOELA NAVEEN ET AL: "Encoding movies and data in DNA storage", PROC. 2016 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), IEEE, 31 January 2016 (2016-01-31), pages 1, XP033082410, DOI: 10.1109/ITA.2016.7888163 * |
MEINOLF BLAWAT ET AL: "Storing Movies in DNA", FOCUS ON FUTURE DISRUPTIVE TECHNOLOGIES - INNOVATION IN MOTION SUMMER 2015, 1 January 2015 (2015-01-01), pages 38 - 42, XP055462207, Retrieved from the Internet <URL:https://www.afcinema.com/IMG/pdf/technicolor_storing_movies_in_dna.pdf> [retrieved on 20180323] * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11763169B2 (en) | 2016-11-16 | 2023-09-19 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
US12236354B2 (en) | 2016-11-16 | 2025-02-25 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
US10650312B2 (en) | 2016-11-16 | 2020-05-12 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
US12001962B2 (en) | 2016-11-16 | 2024-06-04 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
US11379729B2 (en) | 2016-11-16 | 2022-07-05 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
US12006497B2 (en) | 2018-03-16 | 2024-06-11 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
US11286479B2 (en) | 2018-03-16 | 2022-03-29 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
US11227219B2 (en) | 2018-05-16 | 2022-01-18 | Catalog Technologies, Inc. | Compositions and methods for nucleic acid-based data storage |
KR102705160B1 (en) | 2018-09-07 | 2024-09-09 | 이리디아, 인크. | Improved system and method for recording and reading data stored in polymers |
KR20210055071A (en) * | 2018-09-07 | 2021-05-14 | 이리디아, 인크. | Improved systems and methods for recording and reading data stored on polymers |
EP3904527A4 (en) * | 2018-12-26 | 2022-08-10 | BGI Shenzhen | METHOD AND APPARATUS FOR FIXED-POINT EDITING OF A DATA-STORED NUCLEOTIDE SEQUENCE |
US12002547B2 (en) | 2019-05-09 | 2024-06-04 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
US11610651B2 (en) | 2019-05-09 | 2023-03-21 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
WO2020239806A1 (en) * | 2019-05-27 | 2020-12-03 | Vib Vzw | A method of storing digital information in pools of nucleic acid molecules |
US10956806B2 (en) | 2019-06-10 | 2021-03-23 | International Business Machines Corporation | Efficient assembly of oligonucleotides for nucleic acid based data storage |
WO2021045816A1 (en) * | 2019-09-05 | 2021-03-11 | Microsoft Technology Licensing, Llc | Reverse concatenation of error-correcting codes in dna data storage |
US12260937B2 (en) | 2019-09-05 | 2025-03-25 | Microsoft Technology Licensing, Llc | Reverse concatenation of error-correcting codes in DNA data storage |
WO2021045830A1 (en) | 2019-09-06 | 2021-03-11 | Microsoft Technology Licensing, Llc | Array-based enzymatic oligonucleotide synthesis |
US11535842B2 (en) | 2019-10-11 | 2022-12-27 | Catalog Technologies, Inc. | Nucleic acid security and authentication |
US11306353B2 (en) | 2020-05-11 | 2022-04-19 | Catalog Technologies, Inc. | Programs and functions in DNA-based data storage |
KR20230022510A (en) * | 2021-08-09 | 2023-02-16 | 서울대학교산학협력단 | Method, program and apparatus for encoding and decoding dna data using low density parity check code |
WO2023018157A1 (en) * | 2021-08-09 | 2023-02-16 | 서울대학교 산학협력단 | Dna data encoding and decoding method using low-density parity check code, program, and device |
KR102574250B1 (en) | 2021-08-09 | 2023-09-06 | 서울대학교산학협력단 | Method, program and apparatus for encoding and decoding dna data using low density parity check code |
WO2025058684A1 (en) * | 2023-09-11 | 2025-03-20 | Western Digital Technologies, Inc. | Multi-tier error correction codes for dna data storage |
Also Published As
Publication number | Publication date |
---|---|
WO2018148257A1 (en) | 2018-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018148260A1 (en) | Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) | |
Chandak et al. | Improved read/write cost tradeoff in DNA-based data storage using LDPC codes | |
Chandak et al. | Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes | |
Heckel et al. | Fundamental limits of DNA storage systems | |
Shomorony et al. | DNA-based storage: Models and fundamental limits | |
Shomorony et al. | Information-theoretic foundations of DNA data storage | |
US10027347B2 (en) | Methods for storing and reading digital data on a set of DNA strands | |
US20200211677A1 (en) | Dna-based data storage and retrieval | |
Cai et al. | Coding for sequence reconstruction for single edits | |
CN110569974B (en) | Hierarchical representation and interleaving encoding method for DNA storage that can contain artificial bases | |
Song et al. | Sequence-subset distance and coding for error control in DNA-based data storage | |
Shomorony et al. | Torn-paper coding | |
Heinis et al. | Survey of information encoding techniques for dna | |
Schoeny et al. | Novel combinatorial coding results for DNA sequencing and data storage | |
US20240184666A1 (en) | Preprocessing for Correcting Insertions and Deletions in DNA Data Storage | |
Yan et al. | A segmented-edit error-correcting code with re-synchronization function for DNA-based storage systems | |
Marcovich et al. | Reconstruction of strings from their substrings spectrum | |
Sabary et al. | Survey for a Decade of Coding for DNA Storage | |
Sokolovskii et al. | Coding over coupon collector channels for combinatorial motif-based DNA storage | |
CN116707541B (en) | A DNA storage method with pseudo-noise sequence encoding | |
CN116564424A (en) | DNA data storage method, reading method and terminal based on erasure codes and assembly technology | |
Tang et al. | Error-correcting codes for short tandem duplications and at most $ p $ substitutions | |
Tang et al. | Low-redundancy codes for correcting multiple short-duplication and edit errors | |
CN118841094A (en) | Space layered DNA storage method of large-scale oligonucleotide pool | |
Schwarz et al. | Optimizing fountain codes for DNA data storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18706118 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18706118 Country of ref document: EP Kind code of ref document: A1 |