WO2009005564A2 - Cellulose- and hemicellulose-degradation enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same - Google Patents
- Publication number
- WO2009005564A2 (PCT/US2008/006379)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nucleotides
- replaced
- amino acids
- seq
- codon
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/14—Hydrolases (3)
- C12N9/24—Hydrolases (3) acting on glycosyl compounds (3.2)
- C12N9/2402—Hydrolases (3) acting on glycosyl compounds (3.2) hydrolysing O- and S- glycosyl compounds (3.2.1)
- C12N9/2405—Glucanases
- C12N9/2434—Glucanases acting on beta-1,4-glucosidic bonds
- C12N9/2437—Cellulases (3.2.1.4; 3.2.1.74; 3.2.1.91; 3.2.1.150)
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/11—DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
- C12N15/52—Genes encoding for enzymes or proenzymes
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/0004—Oxidoreductases (1.)
- C12N9/0055—Oxidoreductases (1.) acting on diphenols and related substances as donors (1.10)
- C12N9/0057—Oxidoreductases (1.) acting on diphenols and related substances as donors (1.10) with oxygen as acceptor (1.10.3)
- C12N9/0061—Laccase (1.10.3.2)
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/14—Hydrolases (3)
- C12N9/24—Hydrolases (3) acting on glycosyl compounds (3.2)
- C12N9/2402—Hydrolases (3) acting on glycosyl compounds (3.2) hydrolysing O- and S- glycosyl compounds (3.2.1)
- C12N9/2477—Hemicellulases not provided in a preceding group
- C12N9/248—Xylanases
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Y—ENZYMES
- C12Y302/00—Hydrolases acting on glycosyl compounds, i.e. glycosylases (3.2)
- C12Y302/01—Glycosidases, i.e. enzymes hydrolysing O- and S-glycosyl compounds (3.2.1)
- C12Y302/01004—Cellulase (3.2.1.4), i.e. endo-1,4-beta-glucanase
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Y—ENZYMES
- C12Y302/00—Hydrolases acting on glycosyl compounds, i.e. glycosylases (3.2)
- C12Y302/01—Glycosidases, i.e. enzymes hydrolysing O- and S-glycosyl compounds (3.2.1)
- C12Y302/01091—Cellulose 1,4-beta-cellobiosidase (3.2.1.91)
Definitions
- the present invention relates to refining the translational kinetics of an mRNA into polypeptide, and polypeptide-encoding nucleotide sequences which have refined translational properties.
- Saccharomyces yeasts have proven to be safe, effective and user-friendly microorganisms for large-scale production of industrial ethanol from glucose-based feedstocks. Recently, efforts have been made to use cellulosic biomass as feedstock for producing ethanol.
- the major fermentable sugars from hydrolysis of these feedstocks such as rice and wheat straw, sugarcane bagasse, corn stover, corn fibre, softwood, hardwood and grasses
- lignin a major component of such feedstocks.
- Lignin minimizes the accessibility of cellulose and hemicellulose to microbial enzymes.
- lignin is generally associated with reduced digestibility of the overall plant biomass.
- yeast and other microorganisms that can degrade cellulose, hemicellulose and lignin. Many such pathways have been identified in organisms such as white-rot fungi.
- Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation and poor expression. Even when the gene is translated efficiently enough that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pause structures coded for by specific di-codon nucleotide sequences in the open reading frame (ORF) can improve protein expression.
- ORF open reading frame
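The notion of a codon pair (a di-codon, i.e. two adjacent in-frame codons) can be illustrated with a minimal Python sketch. The function name `codon_pairs` and the toy ORF are assumptions made for illustration only; the toy sequence is not taken from any SEQ ID NO, and 1-based inclusive coordinates are assumed to match the "nucleotides M-N" notation used in this document.

```python
from typing import Iterator, Tuple


def codon_pairs(orf: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (first_nt, last_nt, hexamer) for every pair of adjacent in-frame codons.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Slide one codon at a time so that consecutive pairs share a codon.
    for i in range(0, len(orf) - 3, 3):
        yield i + 1, i + 6, orf[i:i + 6]


# Hypothetical toy ORF: ATG GGC CAA TTT TAA
toy_orf = "ATGGGCCAATTTTAA"
pairs = list(codon_pairs(toy_orf))
for first_nt, last_nt, hexamer in pairs:
    print(f"{hexamer} (nucleotides {first_nt}-{last_nt})")
```

Because consecutive pairs overlap by one codon, changing a single codon alters two codon pairs at once, which is worth keeping in mind when reading the replacement lists.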
- hydrolysis enzyme-encoding nucleotide sequences with refined translational kinetics and methods of designing and synthesizing the same.
- a hydrolysis enzyme-encoding nucleotide sequence wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme polypeptide, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing original codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the resultant hydrolysis enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
- Expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
- expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression products in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated enzyme.
- hydrolysis enzyme-encoding nucleotide sequences wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme-encoding nucleotide sequence and is adapted for expression in a heterologous host organism, wherein at least 1, 2, or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
- the host organism is not human, E. coli or S. cerevisiae.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CCCTCT (nucleotides 463-468); GGCCAA (nucleotides 94-99); CAGTTT (nucleotides 565-570); GATATC (nucleotides 703-708); GTGGAA (nucleotides 691-696); GGATTT (nucleotides 1192-1197); GGTATT (nucleotides 1198-1203).
- CCCTCT nucleotides 463-468
- GGCCAA nucleotides 94-99
- CAGTTT nucleotides 565-570
- GATATC nucleotides 703-708
- GTGGAA nucleotides 691-696
- GGATTT nucleotides 1192-1197
- GGTATT nucleotides 1198-1203
- CCCTCT nucleotides 463-468 replaced with CCTTCT
- GGCCAA nucleotides 94-99 replaced with GGTCAA
- CAGTTT nucleotides 565-570 replaced with CAATTT
- GATATC nucleotides 703-708 replaced with GACATT
- GTGGAA nucleotides 691-696 replaced with GTTGAA
- GGATTT nucleotides 1192-1197 replaced with GGTTTC
- GGTATT nucleotides 1198-1203 replaced with GGAATT.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
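Each replacement listed above is meant to encode identical amino acids. The following Python sketch checks this for the seven S. cerevisiae replacements, assuming the standard genetic code (an assumption; the source does not name a translation table). The names `CODON_TABLE`, `translate_pair` and `REPLACEMENTS` are illustrative only.

```python
from itertools import product

# Standard genetic code laid out in TCAG order (assumed translation table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(hexamer: str) -> str:
    """Translate a six-nucleotide codon pair into its two amino acids."""
    if len(hexamer) != 6:
        raise ValueError("a codon pair must be exactly six nucleotides")
    return CODON_TABLE[hexamer[:3]] + CODON_TABLE[hexamer[3:]]


# (original, replacement) codon pairs copied from the list above.
REPLACEMENTS = [
    ("CCCTCT", "CCTTCT"),
    ("GGCCAA", "GGTCAA"),
    ("CAGTTT", "CAATTT"),
    ("GATATC", "GACATT"),
    ("GTGGAA", "GTTGAA"),
    ("GGATTT", "GGTTTC"),
    ("GGTATT", "GGAATT"),
]

# True for a pair when original and replacement encode the same dipeptide.
synonymous = {
    orig: translate_pair(orig) == translate_pair(new) for orig, new in REPLACEMENTS
}
for (orig, new) in REPLACEMENTS:
    print(orig, "->", new, translate_pair(orig), translate_pair(new), synonymous[orig])
```

A check of this kind covers only the "identical amino acids" case; conservative substitutions would need a separately defined similarity grouping, which the text above does not specify.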
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTCGGT (nucleotides 760-765); ATTGCC (nucleotides 631-636); GACAGC (nucleotides 1285-1290); GTCTGG (nucleotides 88-93); GTCTGG (nucleotides 1246-1251); TTGCTG (nucleotides 1231-1236); GTGGTG (nucleotides 571-576); ACGCTG (nucleotides 22-27); ACGCTG (nucleotides 31-36); GACTGG (nucleotides 1168-1173); GCCGGA (nucleotides 559-564); CTGGTG (nucleotides 748-753).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
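The nucleotide ranges in these lists can be mapped to codon numbers, which also confirms that each range is in frame. The sketch below assumes that nucleotide 1 is the first base of the first codon of SEQ ID NO: 1 and that coordinates are 1-based and inclusive; both are assumptions, and `codon_numbers` is a hypothetical helper name.

```python
from typing import List, Tuple


def codon_numbers(first_nt: int, last_nt: int) -> Tuple[int, int]:
    """Return the 1-based numbers of the two codons covered by a six-nucleotide range."""
    if last_nt - first_nt != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (first_nt - 1) % 3 != 0:
        raise ValueError("range does not start on a codon boundary")
    first_codon = (first_nt - 1) // 3 + 1
    return first_codon, first_codon + 1


# Codon pairs and nucleotide ranges copied from the list above.
PAIRS: List[Tuple[str, int, int]] = [
    ("CTCGGT", 760, 765), ("ATTGCC", 631, 636), ("GACAGC", 1285, 1290),
    ("GTCTGG", 88, 93), ("GTCTGG", 1246, 1251), ("TTGCTG", 1231, 1236),
    ("GTGGTG", 571, 576), ("ACGCTG", 22, 27), ("ACGCTG", 31, 36),
    ("GACTGG", 1168, 1173), ("GCCGGA", 559, 564), ("CTGGTG", 748, 753),
]

codon_map = [(pair, codon_numbers(first, last)) for pair, first, last in PAIRS]
for pair, (c1, c2) in codon_map:
    print(f"{pair}: codons {c1}-{c2}")
```

Under these assumptions every listed range starts on a codon boundary, so no `ValueError` is raised for this list.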
- at least 3 of the following codon pair replacements have been made: CTCGGT (nucleotides 760-765) replaced with CTGGGT; ATTGCC (nucleotides 631-636) replaced with ATTGCG; GACAGC (nucleotides 1285-1290) replaced with GACTCT; GTCTGG (nucleotides 88-93) replaced with GTTTGG; GTCTGG (nucleotides 1246-1251) replaced with GTTTGG; TTGCTG (nucleotides 1231-1236) replaced with CTGCTG; GTGGTG (nucleotides 571-576) replaced with GTTGTT; ACGCTG (nucleotides 22-27) replaced with ACCCTC; ACGCTG (nucleotides 3
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CAGTTT (nucleotides 565-570); TTTGAC (nucleotides 1303-1308); TCGTTT (nucleotides 1240-1245); GGCCAA (nucleotides 94-99); AAGAAT (nucleotides 541-546); AAGAAT (nucleotides 934-939); GCCAAA (nucleotides 649-654); GTCAAG (nucleotides 1252-1257); GGTATT (nucleotides 1198-1203); ATCAAC (nucleotides 808-813); GGCCAT (nucleotides 865-870); CTTCCA (nucleotides 835-840); GATATC (nucleotides 703-708); TCGTTG (nucleotides 1228-1233).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CAGTTT (nucleotides 565-570) replaced with CAATTT; TTTGAC (nucleotides 1303-1308) replaced with TTTGAT; TCGTTT (nucleotides 1240-1245) replaced with TCTTTT; GGCCAA (nucleotides 94-99) replaced with GGACAA; AAGAAT (nucleotides 541-546) replaced with AAAAAT; AAGAAT (nucleotides 934-939) replaced with AAAAAC; GCCAAA (nucleotides 649-654) replaced with GCTAAA; GTCAAG (nucleotides 1252-1257) replaced with GTTAAA; GGTATT
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GGCCAA (nucleotides 94-99); CAGTTT (nucleotides 565-570); GATATC (nucleotides 703-708); TATTTG (nucleotides 853-858); GGCCAT (nucleotides 865-870); TCGTTG (nucleotides 1228-1233); TTTGTC (nucleotides 1243-1248); TTCCAA (nucleotides 1363-1368).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGCCAA (nucleotides 94-99) replaced with GGTCAA; CAGTTT (nucleotides 565-570) replaced with CAATTC; GATATC (nucleotides 703-708) replaced with GACATT; TATTTG (nucleotides 853-858) replaced with TATTTA; GGCCAT (nucleotides 865-870) replaced with GGACAT; TCGTTG (nucleotides 1228-1233) replaced with TCTTTA; TTTGTC (nucleotides 1243-1248) replaced with TTCGTT; TTCCAA (nucleotides 1363-1368) replaced with TTCCAG.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GTGCCT (nucleotides 55-60); GCCAAT (nucleotides 370-375); GCTATT (nucleotides 406-411); GCCGGA (nucleotides 559-564); GCCAAT (nucleotides 778-783); TTGGCA (nucleotides 967-972); AAGCTG (nucleotides 1051-1056); GCTATT (nucleotides 1066-1071); GCCAAT (nucleotides 1084-1089); ACCGGA (nucleotides 1147-1152); ACCGGA (nucleotides 1189-1194); GGTATT (nucleotides 1198-1203); GACAGC (nucleotides 1285-1290); GATGCC (nucleotides 1327-1332); GCCTTG (nucleotides 1285-1290
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GTGCCT (nucleotides 55-60) replaced with GTTCCG; GCCAAT (nucleotides 370-375) replaced with GCTAAT; GCTATT (nucleotides 406-411) replaced with GCCATT; GCCGGA (nucleotides 559-564) replaced with GCTGGT; GCCAAT (nucleotides 778-783) replaced with GCGAAT; TTGGCA (nucleotides 967-972) replaced with TTGGCT; AAGCTG (nucleotides 1051-1056) replaced with AAATTG; GCTATT (nucleotides 1066-1071) replaced with GCCATT; GCCAAT (nucleotides
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
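The "highly-overrepresented" criterion above (a translational kinetics value greater than 5, 3, 2.5 or 2 times the standard deviation of such values for the host) can be sketched as a simple threshold test. The values below are hypothetical placeholders, not data from the source, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the text does not say how the standard deviation is computed.

```python
from statistics import pstdev
from typing import Dict, Set


def highly_overrepresented(values: Dict[str, float], k: float = 3.0) -> Set[str]:
    """Return codon pairs whose value exceeds k times the standard deviation.

    `values` maps each codon pair to its translational kinetics value for one
    host organism; k is one of the multipliers named in the text (5, 3, 2.5, 2).
    """
    sd = pstdev(list(values.values()))  # population SD: an assumption of this sketch
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical translational kinetics values for a handful of codon pairs.
example_values = {
    "AAAAAA": 0.5,
    "CCCCCC": -0.5,
    "GGGGGG": 1.0,
    "TTTTTT": -1.0,
    "GATATC": 6.0,
}

flagged_k2 = highly_overrepresented(example_values, k=2.0)
flagged_k3 = highly_overrepresented(example_values, k=3.0)
print(flagged_k2, flagged_k3)
```

In practice the standard deviation would be taken over all codon pairs observed in the host genome rather than over a five-entry toy table.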
- a cellobiohydrolase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for degrading cellulose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the exo-1,4-β-D-glucanase retains at least 75% of the enzymatic activity of wild-type TrCBH-II (SEQ ID NO: 2) under normal physiological conditions.
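The 75% amino acid sequence identity threshold recited above can be illustrated with a small Python sketch that works on two pre-aligned sequences. The convention used here (matches divided by the number of non-gap reference positions) is an assumption, since the text does not state how identity is computed, and the two toy peptides are hypothetical rather than fragments of SEQ ID NO: 2.

```python
def percent_identity(aligned_ref: str, aligned_query: str) -> float:
    """Percent identity of a query to a reference, given equal-length aligned strings.

    Gaps are written as '-'. The denominator is the number of reference
    residues (non-gap positions in the reference), an assumed convention.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    ref_positions = 0
    matches = 0
    for r, q in zip(aligned_ref, aligned_query):
        if r == "-":
            continue  # insertion relative to the reference: not counted
        ref_positions += 1
        if r == q:
            matches += 1
    if ref_positions == 0:
        raise ValueError("reference contains no residues")
    return 100.0 * matches / ref_positions


# Hypothetical aligned peptides differing at two of eight positions.
identity = percent_identity("MKTAYIAK", "MKTAYLAR")
meets_threshold = identity >= 75.0
print(identity, meets_threshold)
```

Producing the alignment itself is outside the scope of this sketch; any standard pairwise aligner could supply the two aligned strings.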
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 27-62 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 27-62 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 27-62 when expressed in the native organism.
- no replacement codon encoding amino acids 27-62 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAC when expressed in the native organism.
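The z score limits above compare each replacement codon pair's z score in the heterologous host with a reference derived from the wild-type codon pairs in the native organism. A minimal sketch follows, with hypothetical z scores; the helper names are illustrative, and the comparison assumes a positive reference value.

```python
from statistics import mean, median
from typing import Iterable, List


def top5_reference(wild_type_z: Iterable[float], use_median: bool = False) -> float:
    """Mean (or median) of the five highest wild-type z scores in the native organism."""
    top5: List[float] = sorted(wild_type_z, reverse=True)[:5]
    return median(top5) if use_median else mean(top5)


def within_ratio(replacement_z: float, reference_z: float, max_ratio: float = 1.5) -> bool:
    """True if the replacement z score is no more than max_ratio times the reference.

    max_ratio=1.5 corresponds to the 150% limit; 4.0, 3.0, 2.0 and 1.0 correspond
    to the other percentages listed. Assumes reference_z is positive.
    """
    return replacement_z <= max_ratio * reference_z


# Hypothetical z scores, not taken from the source.
wild_type_scores = [4.0, 3.0, 2.0, 1.0, 5.0, 0.5, -1.0]
replacement_scores = [2.5, 4.4, -0.3]

reference = top5_reference(wild_type_scores)
ok_at_150 = all(within_ratio(z, reference, 1.5) for z in replacement_scores)
ok_at_100 = all(within_ratio(z, reference, 1.0) for z in replacement_scores)
print(reference, ok_at_150, ok_at_100)
```

The same `within_ratio` check can take a single wild-type codon pair's z score as the reference, as in the embodiment that names one specific wild-type pair.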
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 107-471 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 107-471 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 107-471 when expressed in the native organism.
- no replacement codon encoding amino acids 107-471 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GCAAAG when expressed in the native organism.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 62-107 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 62-107 when expressed in the native organism.
- At least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TCTACT when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 1474 - 1479); TTGAAT (nucleotides 802 - 807); ATCAAG (nucleotides 1477 - 1482); GCCAAG (nucleotides 526 - 531).
- GATATC nucleotides 1474 - 1479
- TTGAAT nucleotides 802 - 807
- ATCAAG nucleotides 1477 - 1482
- GCCAAG nucleotides 526 - 531.
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- At least 3 of the following codon pair replacements have been made: GATATC (nucleotides 1474 - 1479) replaced with GATATA; TTGAAT (nucleotides 802 - 807) replaced with TTAAAT; ATCAAG (nucleotides 1477 - 1482) replaced with ATAAAA; GCCAAG (nucleotides 526 - 531) replaced with GCAAAA.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TTCCTC (nucleotides 1405 - 1410); ATCCTC (nucleotides 892 - 897); TTCCAG (nucleotides 190 - 195); TTCCAG (nucleotides 265 - 270); GACAGC (nucleotides 1360 - 1365); TTCCCG (nucleotides 544 - 549); CAGGCG (nucleotides 457 - 462); GCGGCA (nucleotides 589 - 594); TTCCGC (nucleotides 1327 - 1332).
- TTCCTC nucleotides 1405 - 1410 replaced with TTCCTG
- ATCCTC nucleotides 892 - 897 replaced with ATCCTG
- TTCCAG nucleotides 190 - 195 replaced with TTCCAA
- TTCCAG nucleotides 265 - 270 replaced with TTTCAG
- GACAGC nucleotides 1360 - 1365 replaced with GATTCT
- TTCCCG nucleotides 544 - 549 replaced with TTCCCA
- CAGGCG nucleotides 457 - 462 replaced with CAAGCG
- GCGGCA nucleotides 589 -
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 1474 - 1479); ATCAAG (nucleotides 1477 - 1482); TTCAAC (nucleotides 1051 - 1056); ATCAAC (nucleotides 205 - 210); ATCAAC (nucleotides 571 - 576); ATCAAC (nucleotides 880 - 885); ATCAAC (nucleotides 1078 - 1083).
- GATATC nucleotides 1474 - 1479
- ATCAAG nucleotides 1477 - 1482
- TTCAAC nucleotides 1051 - 1056
- ATCAAC nucleotides 205 - 210
- ATCAAC nucleotides 571 - 576
- ATCAAC nucleotides 880 - 885
- ATCAAC nucleotides 1078 - 1083
- At least 3 of the following codon pair replacements have been made: GATATC (nucleotides 1474 - 1479) replaced with GACATT; ATCAAG (nucleotides 1477 - 1482) replaced with ATTAAA; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; ATCAAC (nucleotides 205 - 210) replaced with ATTAAT; ATCAAC (nucleotides 571 - 576) replaced with ATTAAT; ATCAAC (nucleotides 880 - 885) replaced with ATTAAT; ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT.
- the nucleotide sequence is optimized for expression in P. pastoris.
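Two of the P. pastoris replacements above overlap: GATATC (nucleotides 1474 - 1479) and ATCAAG (nucleotides 1477 - 1482) share the codon at nucleotides 1477 - 1479. The sketch below breaks codon pair replacements into per-codon edits, checks that overlapping replacements agree on the shared codon, and applies them to the nine-nucleotide fragment implied by those two pairs. The helper names are hypothetical, and the fragment is reconstructed from the two listed hexamers rather than read from SEQ ID NO: 25.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str]  # (original codon, replacement codon)


def codon_edits(replacements: List[Tuple[int, str, str]]) -> Dict[int, Edit]:
    """Split (first_nt, original_pair, replacement_pair) entries into per-codon edits.

    Raises ValueError when two overlapping pairs disagree about a shared codon.
    """
    edits: Dict[int, Edit] = {}
    for first_nt, original, replacement in replacements:
        for offset in (0, 3):
            pos = first_nt + offset
            edit = (original[offset:offset + 3], replacement[offset:offset + 3])
            if edits.setdefault(pos, edit) != edit:
                raise ValueError(f"conflicting edits at nucleotide {pos}")
    return edits


def apply_edits(fragment: str, fragment_start: int, edits: Dict[int, Edit]) -> str:
    """Apply per-codon edits to a fragment whose first base is nucleotide fragment_start."""
    seq = list(fragment)
    for pos, (old, new) in sorted(edits.items()):
        i = pos - fragment_start
        if "".join(seq[i:i + 3]) != old:
            raise ValueError(f"expected {old} at nucleotide {pos}")
        seq[i:i + 3] = new
    return "".join(seq)


# Overlapping P. pastoris replacements copied from the list above.
pichia = [(1474, "GATATC", "GACATT"), (1477, "ATCAAG", "ATTAAA")]
edited = apply_edits("GATATCAAG", 1474, codon_edits(pichia))
print(edited)
```

Mixing replacements intended for different hosts can fail this consistency check; for example, combining GATATC replaced with GATATA and ATCAAG replaced with ATTAAA would assign two different codons to nucleotides 1477 - 1479.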
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: AAGAAG (nucleotides 175 - 180); TTCCAT (nucleotides 349 - 354); GCCAAG (nucleotides 526 - 531); TTCCAT (nucleotides 1426 - 1431); GATATC (nucleotides 1474 - 1479).
- AAGAAG nucleotides 175 - 180
- TTCCAT nucleotides 349 - 354
- GCCAAG nucleotides 526 - 531
- TTCCAT nucleotides 1426 - 1431
- GATATC nucleotides 1474 - 1479.
- at least 3 of the following codon pair replacements have been made: AAGAAG (nucleotides 175 - 180 ) replaced with AAAAAG; TTCCAT (nucleotides 349
- the nucleotide sequence is optimized for expression in K. lactis.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TCCGGT (nucleotides 7 - 12 ); ATCGGG (nucleotides 64 - 69 ); CACAGC (nucleotides 385 - 390 ); GCCAAG (nucleotides 526 - 531 ); AAGCTG (nucleotides 529 - 534 ); CGCTAT (nucleotides 643 - 648 ); GTCGAT (nucleotides 727 - 732 ); AACAGC (nucleotides 739 - 744 ); GATGCC (nucleotides 916 - 921 ); GCACCG (nucleotides 940
- GTGCCT nucleotides 1000 - 1005
- GTCGAT nucleotides 1027 - 1032
- GCAGGG nucleotides 1165 - 1170
- CACAGC nucleotides 1192 - 1197
- GACAGC nucleotides 1360 - 1365.
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- TCCGGT codon pair replacements
- ATCGGG nucleotides 64 - 69
- CACAGC nucleotides 385 - 390
- CATTCT
- GCCAAG nucleotides 526 - 531
- AAGCTG nucleotides 529 - 534
- AAATTG
- CGCTAT nucleotides 643 - 648
- GTCGAT nucleotides 727 - 732
- AACAGC nucleotides 739 - 744
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- a laccase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 26) under normal physiological conditions.
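The 75% amino acid sequence identity threshold recited above can be checked with a short calculation. The sketch below assumes the two protein sequences have already been aligned to equal length (with `-` marking gaps) and counts identity over all alignment columns; identity conventions differ between tools, so this is one possible definition rather than the one used in the disclosure. The peptide strings are invented.

```python
def percent_identity(aligned_ref, aligned_query):
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_ref:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1 for a, b in zip(aligned_ref, aligned_query) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_ref)


def meets_identity(aligned_ref, aligned_query, threshold=75.0):
    """True when the identity is at least `threshold` percent."""
    return percent_identity(aligned_ref, aligned_query) >= threshold


# Invented eight-residue peptides, for illustration only.
close_variant = percent_identity("MKTAYIAK", "MKTAYLAK")
distant_variant = percent_identity("MKTAYIAK", "MRSAYLGK")
```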
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 28-152 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 28-152 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 28-152 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 161-305 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 161-305 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 161-305 when expressed in the native organism.
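The two z-score conditions recited above (a replacement pair at no more than 150% of the wild-type pair's z score, and no replacement above a stated percentage of the mean or median of the five highest wild-type z scores) can be sketched as follows. The function names and the numbers are hypothetical, and the sketch assumes positive reference z scores, since a percentage of a negative value would reverse the comparison.

```python
from statistics import mean, median


def within_percent_of_wild_type(replacement_z, wild_type_z, limit_pct=150.0):
    """True if the replacement z score is no more than limit_pct of the wild-type z score."""
    return replacement_z <= wild_type_z * limit_pct / 100.0


def top_five_reference(wild_type_z_scores, use_median=False):
    """Mean (or median) of the five highest wild-type z scores in a region."""
    top_five = sorted(wild_type_z_scores, reverse=True)[:5]
    return median(top_five) if use_median else mean(top_five)


def no_replacement_exceeds(replacement_z_scores, wild_type_z_scores,
                           cap_pct=100.0, use_median=False):
    """True if every replacement z score is at or below cap_pct of the reference."""
    cap = top_five_reference(wild_type_z_scores, use_median) * cap_pct / 100.0
    return all(z <= cap for z in replacement_z_scores)


# Invented z scores for one region, for illustration only.
wild_type_region = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0]
reference = top_five_reference(wild_type_region)  # mean of 6, 5, 4, 3, 2
```

Choosing a larger `cap_pct` (for example 400 rather than 100) corresponds to the looser alternatives in the text.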
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 364-493 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 364-493 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-28 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-28 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-28 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 152-161 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 152-161 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 152-161 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 305-364 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 305-364 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 305-364 when expressed in the native organism.
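The laccase embodiments above treat some amino acid ranges as regions where pauses are reduced (28-152, 161-305, 364-493) and others as regions where pauses are kept or strengthened (1-28, 152-161, 305-364). A minimal sketch of how a codon pair's 1-based nucleotide start position could be mapped to those ranges is given below. The labels "reduce" and "preserve" are shorthand introduced here, and the rule that both residues of a pair must fall inside a range is an assumption, since the text does not say how boundary pairs are assigned.

```python
# Amino acid ranges taken from the laccase embodiments above; the labels
# "reduce" and "preserve" are shorthand introduced for this sketch.
REGIONS = [
    ((28, 152), "reduce"),
    ((161, 305), "reduce"),
    ((364, 493), "reduce"),
    ((1, 28), "preserve"),
    ((152, 161), "preserve"),
    ((305, 364), "preserve"),
]


def codon_pair_residues(start_nt):
    """Return the two 1-based amino acid positions encoded by an in-frame codon pair."""
    if start_nt < 1 or (start_nt - 1) % 3 != 0:
        raise ValueError("start position must be the first base of a codon")
    first = (start_nt - 1) // 3 + 1
    return first, first + 1


def classify_codon_pair(start_nt):
    """List every region whose range contains both residues of the codon pair."""
    first, second = codon_pair_residues(start_nt)
    return [
        (span, label)
        for span, label in REGIONS
        if span[0] <= first and second <= span[1]
    ]


example = classify_codon_pair(82)  # nucleotides 82-87 encode residues 28 and 29
```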
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 901 - 906); CTTTCT (nucleotides 19 - 24); GACCGT (nucleotides 547 - 552); TTCCCC (nucleotides 301 - 306); TTCCCC (nucleotides 730 - 735); TTCCCC (nucleotides 988 - 993); TTCCCC (nucleotides 1051 - 1056).
- At least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 901 - 906) replaced with TTGTCT; CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; GACCGT (nucleotides 547 - 552) replaced with GATAGA; TTCCCC (nucleotides 301 - 306) replaced with TTTCCA; TTCCCC (nucleotides 730 - 735) replaced with TTTCCA; TTCCCC (nucleotides 988 - 993) replaced with TTTCCA; TTCCCC (nucleotides 1051 - 1056) replaced with TTTCCA.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
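The lignin peroxidase replacements listed above for S. cerevisiae are described as encoding identical amino acids or conservative substitutions. A short sketch using the standard genetic code can confirm the "identical amino acids" case for each listed pair. The helper names are hypothetical, and the check does not attempt to judge conservative substitutions.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(dna) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(wild_type_pair, replacement_pair):
    """True if both codon pairs encode the same two amino acids."""
    return translate(wild_type_pair) == translate(replacement_pair)


# (wild-type pair, replacement pair) as listed in the text above.
LISTED_REPLACEMENTS = [
    ("CTTTCC", "TTGTCT"),
    ("CTTTCT", "TTGTCT"),
    ("GACCGT", "GATAGA"),
    ("TTCCCC", "TTTCCA"),
]

all_silent = all(is_silent(old, new) for old, new in LISTED_REPLACEMENTS)
```

Because a replacement pair is compared only by its translation, the same helper can be reused for the other host-specific lists in this section.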
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 901 - 906); TTCCTC (nucleotides 700 - 705); CTCGAC (nucleotides 340 - 345); CTTTCT (nucleotides 19 - 24); TTCCAG (nucleotides 880 - 885); GTCTGG (nucleotides 595 - 600); TTCCCG (nucleotides 1042 - 1047); ATCGCC (nucleotides 229 - 234); ATCGCC (nucleotides 373 - 378).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 901 - 906) replaced with CTGTCT; TTCCTC (nucleotides 700 - 705) replaced with TTCTTG; CTCGAC (nucleotides 340 - 345) replaced with CTGGAC; CTTTCT (nucleotides 19 - 24) replaced with CTGTCT; TTCCAG (nucleotides 880 - 885) replaced with TTCCAA; GTCTGG (nucleotides 595 - 600) replaced with GTTTGG; TTCCCG (nucleotides 1042 - 1047) replaced with TTCCCA; ATCGCC (nucleotides 229 - 234) replaced with ATTGC
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TTCAAG (nucleotides 7 - 12); ATCAAC (nucleotides 922 - 927); GACGAA (nucleotides 343 - 348); CTTTCC (nucleotides 901 - 906).
- TTCAAG (nucleotides 7 - 12)
- ATCAAC (nucleotides 922 - 927) replaced with ATTAAT
- GACGAA (nucleotides 343 - 348)
- CTTTCC (nucleotides 901 - 906) replaced with TTGTCT.
- the nucleotide sequence is optimized for expression in P. pastoris.
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTTTCT (nucleotides 19 - 24); TTTGTC (nucleotides 25 - 30); TTCCCC (nucleotides 301 - 306); GACCGT (nucleotides 547 - 552); TTCCCC (nucleotides 730 - 735); CTTTCC (nucleotides 901 - 906); TTCCCC (nucleotides 988 - 993); TTCCCC (nucleotides 1051 - 1056).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; TTTGTC (nucleotides 25 - 30) replaced with TTCGTT; TTCCCC (nucleotides 301 - 306) replaced with TTCCCT; GACCGT (nucleotides 547 - 552) replaced with GATAGA; TTCCCC (nucleotides 730 - 735) replaced with TTCCCT; CTTTCC (nucleotides 901 - 906) replaced with TTGTCT; TTCCCC (nucleotides 988 - 993) replaced with TTTCCT; TTCCCC (nucleotides 1988 - 993) replaced with TTTC
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTTTCT (nucleotides 19 - 24); ACGGCT (nucleotides 184 - 189); CTGACC (nucleotides 211 - 216); GCCCGT (nucleotides 376 - 381); ATCGGT (nucleotides 424 - 429); CTGACC (nucleotides 604 - 609); AAGGCT (nucleotides 865 - 870); CTTTCC (nucleotides 901 - 906); CCCGGA (nucleotides 1063 - 1068).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; ACGGCT (nucleotides 184 - 189) replaced with ACCGCT; CTGACC (nucleotides 211 - 216) replaced with TTGACC; GCCCGT (nucleotides 376 - 381) replaced with GCTCGT; ATCGGT (nucleotides 424 - 429) replaced with ATTGGA; CTGACC (nucleotides 604 - 609) replaced with TTGACA; AAGGCT (nucleotides 865 - 870) replaced with AAAGCC; CTTTCC (nucleot
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- a lignin peroxidase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the lignin peroxidase retains at least 75% of the enzymatic activity of wild-type LIP (SEQ ID NO: 50) under normal physiological conditions.
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 46-287 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 46-287 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 46-287 when expressed in the native organism.
- a lignin peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 1-46 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-46 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-46 when expressed in the native organism.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TTCCCC (nucleotides 130 - 135); TTCCCC (nucleotides 721 - 726); TTCCCC (nucleotides 979 - 984); TTCCCC (nucleotides 1033 - 1038); GCCAAG (nucleotides 247 - 252). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- TTCCCC (nucleotides 130 - 135) replaced with TTTCCG
- TTCCCC (nucleotides 721 - 726) replaced with TTCCCA
- TTCCCC (nucleotides 979 - 984) replaced with TTTCCG
- TTCCCC (nucleotides 1033 - 1038) replaced with TTCCCA
- GCCAAG (nucleotides 247 - 252) replaced with GCGAAG.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: ATTGCC (nucleotides 289 - 294); CAGGCG (nucleotides 358 - 363); CAGGCG (nucleotides 850 - 855); CAGGCG (nucleotides 1012 - 1017); CTCTCC (nucleotides 991 - 996); ATCGCC (nucleotides 244 - 249); ATCGCC (nucleotides 370 - 375); ATCGCC (nucleotides 610 - 615).
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- At least 3 of the following codon pair replacements have been made: ATTGCC (nucleotides 289 - 294) replaced with ATCGCT; CAGGCG (nucleotides 358 - 363) replaced with CAGGCT; CAGGCG (nucleotides 850 - 855) replaced with CAGGCT; CAGGCG (nucleotides 1012 - 1017) replaced with CAGGCT; CTCTCC (nucleotides 991 - 996) replaced with CTGTCT; ATCGCC (nucleotides 244 - 249) replaced with ATTGCG; ATCGCC (nucleotides 370 - 375) replaced with ATCGCT; ATCGCC (nucleotides 610 - 615) replaced with ATTGCT.
- the nucleotide sequence is optimized for expression in E. coli.
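The "at least 3 of the following codon pair replacements" condition for the E. coli list above can be illustrated by applying replacements at their 1-based nucleotide coordinates and counting how many are present in a candidate sequence. SEQ ID NO: 73 is not reproduced in this section, so the sketch builds a synthetic placeholder sequence with the wild-type pairs planted at the listed positions; the placeholder, the helper names, and the choice of which three pairs to replace are all assumptions made for illustration.

```python
# (1-based start, wild-type pair, replacement pair) from the E. coli list above.
REPLACEMENTS = [
    (289, "ATTGCC", "ATCGCT"),
    (358, "CAGGCG", "CAGGCT"),
    (850, "CAGGCG", "CAGGCT"),
    (1012, "CAGGCG", "CAGGCT"),
    (991, "CTCTCC", "CTGTCT"),
    (244, "ATCGCC", "ATTGCG"),
    (370, "ATCGCC", "ATCGCT"),
    (610, "ATCGCC", "ATTGCT"),
]


def placeholder_wild_type(length=1020):
    """Build a synthetic stand-in sequence (NOT SEQ ID NO: 73) with the listed pairs planted."""
    seq = "GGT" * (length // 3)
    for start, old, _new in REPLACEMENTS:
        seq = seq[:start - 1] + old + seq[start + 5:]
    return seq


def apply_replacement(seq, start, old, new):
    """Replace the six nucleotides at a 1-based start, checking the expected wild-type pair."""
    i = start - 1
    if seq[i:i + 6] != old:
        raise ValueError(f"expected {old} at nucleotides {start}-{start + 5}")
    return seq[:i] + new + seq[i + 6:]


def count_replacements(candidate, replacements=REPLACEMENTS):
    """Count how many listed replacement pairs are present at their positions."""
    return sum(candidate[s - 1:s + 5] == new for s, _old, new in replacements)


wild_type = placeholder_wild_type()
candidate = wild_type
for start, old, new in REPLACEMENTS[:3]:
    candidate = apply_replacement(candidate, start, old, new)

meets_at_least_three = count_replacements(candidate) >= 3
```

Checking the expected wild-type pair before each substitution guards against off-by-one errors when converting the 1-based coordinates used in the text to 0-based string indices.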
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are as follows: TTCAAG (nucleotides 7 - 12); GACGAG (nucleotides 340 - 345); ACCAAG (nucleotides 532 - 537); GAGCTG (nucleotides 670 - 675); TCTCCC (nucleotides 757 - 762); GTCAAC (nucleotides 841 - 846); TTCAAG (nucleotides 871 - 876).
- at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- TTCAAG (nucleotides 7 - 12)
- GACGAG (nucleotides 340 - 345)
- ACCAAG (nucleotides 532 - 537)
- GAGCTG (nucleotides 670 - 675)
- TCTCCC (nucleotides 757 - 762)
- GTCAAC (nucleotides 841 - 846) replaced with GTTAAT
- TTCAAG (nucleotides 871 - 876)
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 2 codon pairs to be replaced are as follows: TTCCCC (nucleotides 130 - 135); GCCAAG (nucleotides 247 - 252); TTCCCC (nucleotides 721 - 726); TTCCCC (nucleotides 979 - 984); TTCCCC (nucleotides 1033 - 1038). In some such nucleotide sequences, at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- TTCCCC nucleotides 130 - 135 replaced with TTTCCA
- GCCAAG nucleotides 247 - 252 replaced with GCTAAA
- TTCCCC nucleotides 721 - 726 replaced with TTTCCA
- TTCCCC nucleotides 979 - 984 replaced with TTTCCA
- TTCCCC nucleotides 1033 - 1038 replaced with TTCCCT.
- the nucleotide sequence is optimized for expression in K. lactis.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 2 codon pairs to be replaced are as follows: GCCAAG (nucleotides 247 - 252); GCCGGT (nucleotides 412 - 417); ATCGGT (nucleotides 421 - 426); GATGCC (nucleotides 556 - 561); GGAACG (nucleotides 646 - 651); CCCGGA (nucleotides 1054 - 1059). In some such nucleotide sequences, at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- GCCAAG (nucleotides 247 - 252)
- GCCGGT (nucleotides 412 - 417)
- ATCGGT (nucleotides 421 - 426) replaced with ATAGGT
- GATGCC (nucleotides 556 - 561) replaced with GATGCT
- GGAACG (nucleotides 646 - 651)
- the nucleotide sequence is optimized for expression in Z. mobilis.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- a Mn-dependent peroxidase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the Mn-dependent peroxidase retains at least 75% of the enzymatic activity of wild-type MnP (SEQ ID NO: 74) under normal physiological conditions.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
- a Mn-dependent peroxidase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
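The two z-score limits in this group of embodiments can be read as simple numeric checks. The sketch below is illustrative only: the function names and the example z scores are hypothetical, and the text does not specify how z scores are computed. It also assumes the wild-type reference z scores are positive, so that "150% of" a z score has an unambiguous meaning.

```python
from statistics import mean, median

def pair_within_limit(replacement_z, wild_type_z, limit=1.5):
    """True if the replacement pair's z score (in the host) is no more than
    `limit` times the wild-type pair's z score (in the native organism)."""
    return replacement_z <= limit * wild_type_z

def top_five_reference(wild_type_zs, use_median=False):
    """Mean (or median) of the five highest wild-type z scores in a region."""
    top = sorted(wild_type_zs, reverse=True)[:5]
    return median(top) if use_median else mean(top)

def region_within_limit(replacement_zs, wild_type_zs, fraction=1.0, use_median=False):
    """True if no replacement z score exceeds `fraction` times the reference."""
    reference = top_five_reference(wild_type_zs, use_median)
    return all(z <= fraction * reference for z in replacement_zs)

# Hypothetical z scores for one region (not taken from the patent).
wild_type_zs = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0]
replacement_zs = [3.5, 0.2, -2.0]
```

Here `fraction` stands for the alternative thresholds listed in the text (4.0 for 400%, 1.0 for 100%, and so on), and `limit=1.5` corresponds to the 150% figure.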
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
- a Mn-dependent peroxidase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GGGTTC (nucleotides 1246 - 1251); GCAAGA (nucleotides 1834 - 1839); TTGAAC (nucleotides 1540 - 1545); TCTCCA (nucleotides 193 - 198); GACCGT (nucleotides 694 - 699); TTCCCC (nucleotides 1795 - 1800); GCCAAG (nucleotides 763 - 768); GCCAAG (nucleotides 1585 - 1590).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGGTTC (nucleotides 1246 - 1251) replaced with GGTTTT; GCAAGA (nucleotides 1834 - 1839) replaced with GCTAGA; TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT; TCTCCA (nucleotides 193 - 198) replaced with TCACCA; GACCGT (nucleotides 694 - 699) replaced with GATAGA; TTCCCC (nucleotides 1795 - 1800) replaced with TTTCCA; GCCAAG (nucleotides 763
- the nucleotide sequence is optimized for expression in S. cerevisiae.
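The codon pair replacements listed above are given as 1-based nucleotide ranges with a wild-type hexamer and a replacement hexamer. A minimal sketch of applying such a replacement and confirming that it is silent is shown below. The helper names and the short toy sequence are hypothetical (the actual SEQ ID NO: 97 is not reproduced here), and the check covers only identical amino acids under the standard genetic code, not conservative substitutions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate an in-frame DNA string; trailing partial codons are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

def replace_codon_pair(seq, start_nt, wild_type, replacement):
    """Replace the hexamer at 1-based position start_nt, insisting that the
    wild-type hexamer is present and that the change is silent."""
    i = start_nt - 1
    if seq[i:i + 6] != wild_type:
        raise ValueError(f"expected {wild_type} at nucleotides {start_nt}-{start_nt + 5}")
    if translate(wild_type) != translate(replacement):
        raise ValueError("replacement changes the encoded amino acids")
    return seq[:i] + replacement + seq[i + 6:]

# Toy sequence (an assumption for illustration): ATG + two listed hexamers + stop.
toy = "ATG" + "GGGTTC" + "GCAAGA" + "TAA"
edited = replace_codon_pair(toy, 4, "GGGTTC", "GGTTTT")
edited = replace_codon_pair(edited, 10, "GCAAGA", "GCTAGA")
```

The two hexamer swaps used here are taken from the replacement list above; only their positions in the toy sequence differ from the positions in the patent sequence.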
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTGGTG (nucleotides 877 - 882); CTCGAC (nucleotides 1240 - 1245); ATCCTC (nucleotides 1462 - 1467); CTCGGC (nucleotides 652 - 657); CTCGGC (nucleotides 952
- GTCTGG nucleotides 1252 - 1257
- GACAGC nucleotides 940 - 945
- AGCCAG nucleotides 1495 - 1500
- TTCCCG nucleotides 661 - 666
- ATTGCC nucleotides 16 - 21
- ATTGCC nucleotides 1651 - 1656
- CTCGGT nucleotides 58 - 63
- CTCGGT nucleotides 1465 - 1470
- GCCTGG nucleotides 1654 - 1659
- TCGCTG nucleotides 874 - 879
- GTGATG nucleotides 1312 - 1317
- TTCCGC nucleotides 1609 - 1614
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTGGTG (nucleotides 877 - 882) replaced with CTGGTT; CTCGAC (nucleotides 1240 - 1245) replaced with CTGGAC; ATCCTC (nucleotides 1462 - 1467) replaced with ATCCTG; CTCGGC (nucleotides 652 - 657) replaced with CTGGGT; CTCGGC (nucleotides 952 - 957) replaced with CTGGGT; GTCTGG (nucleotides 1252
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: AAACTG (nucleotides 403 - 408); TTCAAC (nucleotides 202 - 207); TTCAAC (nucleotides 751 - 756); ATCAAC (nucleotides 208 - 213); ATCAAC (nucleotides 397 - 402); ATCAAC (nucleotides 616 - 621); ATCAAC (nucleotides 841 - 846); ATCAAC (nucleotides 1276 - 1281); ATCAAC (nucleotides 1282 - 1287); GTCAAG (nucleotides 1828 - 1833); GGGTTC (nucleotides 1246 - 1251); TTGAAC (nucleotides 1540 - 1545); TTTGAC (nucleotides 1513 - 1518).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: AAACTG (nucleotides 403 - 408) replaced with AAATTA; TTCAAC (nucleotides 202 - 207) replaced with TTTAAC; TTCAAC (nucleotides 751 - 756) replaced with TTTAAT; ATCAAC (nucleotides 208 - 213) replaced with ATTAAT; ATCAAC (nucleotides 397 - 402) replaced with ATTAAT; ATCAAC (nucleotides 616 - 621) replaced with ATTAAC; ATCAAC (nucleotides 841 - 846) replaced with ATTAAT; ATCAAC (nucleotides 1276 - 1281) replaced with ATTA
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GACCGT (nucleotides 694 - 699); GCCAAG (nucleotides 763 - 768); AAGAAG (nucleotides 820 - 825); TTCCAA (nucleotides 865 - 870); GGTACC (nucleotides 1048
- GGGTTC nucleotides 1246 - 1251
- GTGTTT nucleotides 1510 - 1515
- TTGAAC nucleotides 1540 - 1545
- GCCAAG nucleotides 1585 - 1590
- AAGAAG nucleotides 1735 - 1740
- TTCCCC nucleotides 1795 - 1800.
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- At least 3 of the following codon pair replacements have been made: GACCGT (nucleotides 694 - 699) replaced with GACAGA; GCCAAG (nucleotides 763 - 768) replaced with GCTAAA; AAGAAG (nucleotides 820 - 825) replaced with AAAAAG; TTCCAA (nucleotides 865 - 870) replaced with TTTCAG; GGTACC (nucleotides 1048
- the nucleotide sequence is optimized for expression in K. lactis.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GCCAAG (nucleotides 763 - 768); GACAGC (nucleotides 940 - 945); AACAGC (nucleotides 1198 - 1203); GCCTTT (nucleotides 1414 - 1419); GCCAAG (nucleotides 1585 - 1590); GCCTTT (nucleotides 1741 - 1746). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- GCCAAG nucleotides 763 - 768
- GACAGC nucleotides 940 - 945
- AACAGC nucleotides 1198 - 1203
- GCCTTT nucleotides 1414 - 1419
- GCCAAG nucleotides 1585 - 1590
- GCCTTT nucleotides 1741 - 1746
- the nucleotide sequence is optimized for expression in Z. mobilis.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
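The definition of a highly-overrepresented codon pair above amounts to a threshold test against a multiple of the standard deviation. The following sketch shows one way to express it. The function name and the translational kinetics values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say how the deviation is estimated.

```python
from statistics import pstdev

def highly_overrepresented(kinetics_values, k=3.0):
    """Return the codon pairs whose translational kinetics value exceeds
    k times the standard deviation of all values for the host organism."""
    sd = pstdev(list(kinetics_values.values()))
    return {pair for pair, value in kinetics_values.items() if value > k * sd}

# Hypothetical values for five codon pairs in some host (not from the patent).
values = {
    "GCCAAG": 4.0,
    "TTCCCC": 1.0,
    "ATCAAC": -1.0,
    "GGGTTC": 0.5,
    "TTGAAC": -0.5,
}
flagged_at_2 = highly_overrepresented(values, k=2.0)
flagged_at_3 = highly_overrepresented(values, k=3.0)
```

With these toy numbers only one pair clears the 2x threshold and none clears the 3x threshold, which shows how the alternative multipliers (5, 3, 2.5 or 2) widen or narrow the set of pairs selected for replacement.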
- a laccase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 98) under normal physiological conditions.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 90-212 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 90-212 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 90-212 when expressed in the native organism.
- no replacement codon encoding amino acids 90-212 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GTCAAC when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 216-367 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 216-367 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 216-367 when expressed in the native organism.
- no replacement codon encoding amino acids 216-367 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GCCGAC when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 426-570 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 426-570 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 426-570 when expressed in the native organism.
- no replacement codon encoding amino acids 426-570 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TTCCGC when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 1-90 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-90 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-90 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-90 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GGTGGT when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 212-216 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 212-216 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 212-216 when expressed in the native organism.
- At least one replacement codon encoding amino acids 212-216 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCCAAC when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 367-426 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 367-426 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 367-426 when expressed in the native organism.
- At least one replacement codon encoding amino acids 367-426 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CTCGAC when expressed in the native organism.
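The laccase embodiments above assign a direction to each amino acid range of SEQ ID NO: 98: replacements in some ranges are predicted to be less likely to cause a pause, and replacements in the others equally or more likely. The sketch below maps a codon pair's nucleotide position to the range it falls in. It assumes that nucleotide 1 of SEQ ID NO: 97 is the first base of the codon for amino acid 1 and that the ranges are inclusive; both are assumptions, and the names are hypothetical.

```python
# Amino acid ranges of SEQ ID NO: 98 named in the embodiments above, with the
# direction given for replacement codon pairs in each range.
REGIONS = [
    ((1, 90), "equal_or_more"),
    ((90, 212), "less"),
    ((212, 216), "equal_or_more"),
    ((216, 367), "less"),
    ((367, 426), "equal_or_more"),
    ((426, 570), "less"),
]

def residues_for_pair(start_nt):
    """Amino acid positions encoded by the codon pair starting at 1-based start_nt."""
    if (start_nt - 1) % 3 != 0:
        raise ValueError("codon pair does not start on a codon boundary")
    first = (start_nt - 1) // 3 + 1
    return first, first + 1

def pause_rules(start_nt):
    """Directions of every listed range that contains both residues of the pair."""
    first, second = residues_for_pair(start_nt)
    return [rule for (low, high), rule in REGIONS if low <= first and second <= high]
```

For example, under these assumptions a pair at nucleotides 694 - 699 encodes residues 232 and 233 and falls in the 216-367 range. Because adjacent ranges share their end residues, a pair sitting exactly on a boundary can match two ranges, which is why the function returns a list.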
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 235 - 240); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); TTCCCC (nucleotides 1240 - 1245); ATCAAG (nucleotides 625 - 630); GCCAAG (nucleotides 529 - 534).
- TTGAAA nucleotides 235 - 240
- CTTTCT nucleotides 670 - 675
- TTTGCC nucleotides 778 - 783
- TTCCCC nucleotides 1240 - 1245
- ATCAAG nucleotides 625 - 630
- GCCAAG nucleotides 529 - 534
- TTGAAA nucleotides 235 - 240
- CTTTCT nucleotides 670 - 675 replaced with TTGTCT
- TTTGCC nucleotides 778 - 783
- TTCCCC nucleotides 1240 - 1245
- ATCAAG nucleotides 625 - 630 replaced with ATTAAA
- GCCAAG nucleotides 529 - 534 replaced with GCTAAA.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: TTCCTC (nucleotides 1405 - 1410); CTCGAC (nucleotides 1432 - 1437); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); ATCCTC (nucleotides 1126 - 1131); ACGCTG (nucleotides 502 - 507); TTCCAG (nucleotides 10 - 15); TTCCAG (nucleotides 193 - 198); TTCCAG (nucleotides 268 - 273); GTGGTG (nucleotides 139 - 144); GTCAGC (nucleotides 106 - 111); GTCAGC (nucleotides 1339 - 1344); AGCCAG (nucleotides 814 - 819); GCCGGG (nucleotides 1405
- TTCCTC nucleotides 1405 - 1410 replaced with TTCCTG
- CTCGAC nucleotides 1432 - 1437 replaced with CTGGAT
- CTTTCT nucleotides 670 - 675 replaced with CTGTCT
- TTTGCC nucleotides 778 - 783 replaced with TTCGCT
- ATCCTC nucleotides 1126 - 1131 replaced with ATTCTG
- ACGCTG nucleotides 502 - 507 replaced with ACCCTC
- TTCCAG nucleotides 10 - 15 replaced with TTTCAG
- TTCCAG nucleotides 193 - 198
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: ATCAAG (nucleotides 625 - 630); TTTGCC (nucleotides 778 - 783); TTGAAA (nucleotides 235 - 240); TTCAAC (nucleotides 1051 - 1056); TTCAAC (nucleotides 1057 - 1062); ATCAAC (nucleotides 739 - 744); ATCAAC (nucleotides 1078 - 1083); GGTATC (nucleotides 148 - 153).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: ATCAAG (nucleotides 625 - 630) replaced with ATTAAA; TTTGCC (nucleotides 778 - 783) replaced with TTTGCA; TTGAAA (nucleotides 235 - 240) replaced with TTAAAA; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; TTCAAC (nucleotides 1057 - 1062) replaced with TTTAAC; ATCAAC (nucleotides 739 - 744) replaced with ATTAAT; ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT; GGTATC (nucleotides 625 - 630) replaced with ATTAAA; TTTGCC (
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 148 - 153); TTGAAA (nucleotides 235 - 240); GCCAAG (nucleotides 529 - 534); TTCCCA (nucleotides 547 - 552); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); TTTGCT (nucleotides 871 - 876); TTTGTC (nucleotides 1093 - 1098); TTCCCC (nucleotides 1240 - 1245); TTTGCT (nucleotides 1444 - 1449). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino
- GGTATC nucleotides 148 - 153
- TTGAAA nucleotides 235 - 240
- GCCAAG nucleotides 529 - 534
- TTCCCA nucleotides 547 - 552
- CTTTCT nucleotides 670 - 675
- TTTGCC nucleotides 778 - 783
- TTCGCT nucleotides 871 - 876
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 148 - 153); GCAGGG (nucleotides 370 - 375); GCCAAG (nucleotides 529 - 534); ATCAAT (nucleotides 574 - 579); GCACCG (nucleotides 604 - 609); TTGGCA (nucleotides 616 - 621); ATCAAT (nucleotides 883 - 888); GTGCCT (nucleotides 1000 - 1005); GCGGCT (nucleotides 1144 - 1149); GCCAAT (nucleotides 1225 - 1230).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 148 - 153) replaced with GGCATT; GCAGGG (nucleotides 370 - 375) replaced with GCTGGA; GCCAAG (nucleotides 529 - 534) replaced with GCTAAA; ATCAAT (nucleotides 574 - 579) replaced with ATTAAT; GCACCG (nucleotides 604 - 609) replaced with GCCCCA; TTGGCA (nucleotides 616 - 621) replaced with TTGGCT; ATCAAT (nucleotides 883 - 888) replaced with ATAAAT; GTGCCT
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
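The threshold test described above can be sketched as follows. This is only an illustration: the function name `highly_overrepresented` is hypothetical, the numeric values are made up for the example, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import pstdev


def highly_overrepresented(values, k=3.0):
    """Return the codon pairs whose value exceeds k standard deviations.

    `values` maps a codon pair to its translational kinetics value in the
    host organism; k is one of the multipliers named in the text (5, 3, 2.5, 2).
    """
    sd = pstdev(list(values.values()))  # assumption: population standard deviation
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical values for illustration only; they are not data from the source.
example = {"GGTATC": 0, "GCAGGG": 0, "ATCAAT": 0, "TTGGCA": 0, "GCCAAG": 10}
print(highly_overrepresented(example, k=2.0))
print(highly_overrepresented(example, k=3.0))
```

In practice the standard deviation would be taken over all codon pairs scored for the host organism, not over a handful of pairs as in this toy input.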
- the host organism is not human, E. coli or S. cerevisiae.
- a laccase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
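The 75% identity threshold can be illustrated with a short calculation. The sketch below assumes the two amino acid sequences have already been aligned to equal length, with `-` marking gaps, and it counts identity over all aligned columns. That denominator is an assumption, since the text does not define how identity is computed, and `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    # Ignore columns that are gaps in both sequences.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Toy peptides for illustration only; they are not taken from any SEQ ID NO.
print(percent_identity("MKLV", "MKLI") >= 75.0)
```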
- the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 122) under normal physiological conditions.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 29-153 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 29-153 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 29-153 when expressed in the native organism.
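The cap described above compares each replacement z score against a percentage of the mean or median of the five highest wild-type z scores in the same region. A minimal sketch under stated assumptions follows: the function name `within_cap` and all numbers are hypothetical, and the z scores are taken as already computed for the relevant organisms.

```python
from statistics import mean, median


def within_cap(replacement_z, wild_type_z, percent=150.0, use_median=False):
    """True when no replacement z score exceeds the stated percentage cap.

    replacement_z: z scores of the replacement codon pairs in the heterologous host.
    wild_type_z:   z scores of the wild-type codon pairs in the native organism.
    percent:       one of the percentages named in the text (400, 300, 200, 150, 100).
    """
    top_five = sorted(wild_type_z, reverse=True)[:5]
    reference = median(top_five) if use_median else mean(top_five)
    cap = reference * percent / 100.0
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores for illustration only.
wild_type = [6, 5, 4, 3, 2, 1, 0]
print(within_cap([5.9, 1.0], wild_type, percent=150.0))
print(within_cap([6.5], wild_type, percent=150.0))
```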
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 162-306 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 162-306 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 162-306 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 364-493 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 364-493 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 1-30 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-30 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-30 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 153-162 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 153-162 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 153-162 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 306-364 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 306-364 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 306-364 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO:146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 397 - 402); TTGAAG (nucleotides 235 - 240); GGGTTC (nucleotides 868 - 873); ATCAAA (nucleotides 625 - 630); ACTTTG (nucleotides 502 - 507); GACCGT (nucleotides 187 - 192); GGCCAA (nucleotides 148 - 153); AGCGAT (nucleotides 1546 - 1551).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 397 - 402) replaced with CTGTCT; TTGAAG (nucleotides 235 - 240) replaced with CTGAAA; GGGTTC (nucleotides 868 - 873) replaced with GGTTTC; ATCAAA (nucleotides 625 - 630) replaced with ATCAAA; ACTTTG (nucleotides 502 - 507) replaced with ACCCTG; GACCGT (nucleotides 187 - 192) replaced with GACCGT; GGCCAA (nucleotides 148 - 153) replaced with GGTCAA; AGCGAT (nucleotides 1546 - 1551) replaced
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GCCAGC (nucleotides 811 - 816); CTTTCC (nucleotides 397 - 402); TTCCTC (nucleotides 1405 - 1410); ATCCTC (nucleotides 895 - 900); TTCCAG (nucleotides 10 - 15); TTCCAG (nucleotides 193 - 198); TTCCAG (nucleotides 268 - 273); TTCCAG (nucleotides 1378 - 1383); CTCTCT (nucleotides 670 - 675); GTCAGC (nucleotides 106
- GTCAGC nucleotides 1339 - 1344
- AGCCAG nucleotides 814 - 819
- TTCCCG nucleotides 547 - 552
- ATTGCC nucleotides 169 - 174
- GATCTC nucleotides 1549 - 1554
- CTCGGT nucleotides 583 - 588
- TTCCGC nucleotides 655
- TTCCGC nucleotides 1327 - 1332
- TTCTGG nucleotides 379 - 384
- CTCTCC nucleotides 22 - 27.
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- GCCAGC (nucleotides 811 - 816) replaced with GCTTCT; CTTTCC (nucleotides 397 - 402) replaced with CTGTCT; TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG; ATCCTC (nucleotides 895 - 900) replaced with ATTCTG; TTCCAG (nucleotides 10 - 15) replaced with TTCCAA; TTCCAG (nucleotides 193 - 198) replaced with TTTCAG; TTCCAG (nucleotides 268 - 273) replaced with TTTCAG; TTCCAG (nucleotides 1378 - 1383) replaced with TTCCAA; CTCTCT (nucleotides 670 - 675) replaced with CTGTCT; GTCAGC (nucleotides 106 - 111
- the nucleotide sequence is optimized for expression in E. coli.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: AAACTG (nucleotides 532 - 537); TTCAAC (nucleotides 1051 - 1056); ATCAAC (nucleotides 307 - 312); ATCAAC (nucleotides 1078 - 1083); ATCAAA (nucleotides 625 - 630); GGCCGT (nucleotides 1006 - 1011); GGGTTC (nucleotides 868 - 873); GGCCAA (nucleotides 148 - 153); CTTTCC (nucleotides 397 - 402).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: AAACTG (nucleotides 532 - 537) replaced with AAATTG; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; ATCAAC (nucleotides 307 - 312) replaced with ATTAAT; ATCAAC (nucleotides 1078
- the nucleotide sequence is optimized for expression in P. pastoris.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: GGCCAA (nucleotides 148 - 153 ); GACCGT (nucleotides 187 - 192 ); TTGAAG (nucleotides 235 - 240 ); CTTTCC (nucleotides 397 - 402 ); ATCAAA (nucleotides 625 - 630 ); GGGTTC (nucleotides 868 - 873 ); GGCCGT (nucleotides 1006 - 1011 ); TTTGCT (nucleotides 1444 - 1449 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGCCAA (nucleotides 148 - 153 ) replaced with GGTCAA; GACCGT (nucleotides 187 - 192 ) replaced with GATAGA; TTGAAG (nucleotides 235 - 240 ) replaced with TTAAAA; CTTTCC (nucleotides 397 - 402 ) replaced with TTGTCT; ATCAAA (nucleotides 625 - 630 ) replaced with ATTAAA; GGGTTC (nucleotides 868 - 873 ) replaced with GGTTTC; GGCCGT (nucleotides 1006
- the nucleotide sequence is optimized for expression in K. lactis.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO:146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the at least 3 codon pairs to be replaced are selected from the following: AGCCGT (nucleotides 124 - 129 ); GCCGGT (nucleotides 172 - 177 ); GGCCCC (nucleotides 295 - 300 ); TCCGGT (nucleotides 328 - 333 ); GCAGGG (nucleotides 370
- CACAGC nucleotides 388 - 393
- CTCTAT nucleotides 469 - 474
- ACTTTG nucleotides 502 - 507
- ATCAAT nucleotides 574 - 579
- GCGGCT nucleotides 607 - 612
- GATGCC nucleotides 808 - 813
- GCCAAT nucleotides 844 - 849
- GCCGGT nucleotides 874 - 879
- GTGCCT nucleotides 1000 - 1005
- GCCAAT nucleotides 1225 - 1230
- GATGCC nucleotides 1435 - 1440 .
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: AGCCGT (nucleotides 124 - 129 ) replaced with TCTCGT; GCCGGT (nucleotides 172 - 177 ) replaced with GCTGGT; GGCCCC (nucleotides 295 - 300 ) replaced with GGACCT; TCCGGT (nucleotides 328 - 333 ) replaced with TCTGGT; GCAGGG (nucleotides 370 - 375 ) replaced with GCTGGT; CACAGC (nucleotides 388 - 393 ) replaced with CATTCT; CTCTAT (nucleotides 469 - 474 ) replaced with TTGTAT; ACTTTG
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- a laccase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for metabolizing lignin comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 146) under normal physiological conditions.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 29-153 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 29-153 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 29-153 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO:146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 162-306 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 162-306 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 162-306 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 364-493 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 364-493 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:145 and which encode amino acids 1-29 of SEQ ID NO:146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-29 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-29 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 153-162 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 153-162 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 153-162 when expressed in the native organism.
- a laccase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 306-364 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 306-364 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 306-364 when expressed in the native organism.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTGAAC (nucleotides 421 - 426 ); GCCAAG (nucleotides 496 - 501 ); GATATC (nucleotides 643 - 648 ); AAGAAA (nucleotides 859 - 864 ); GCCAAG (nucleotides 1243 - 1248 ); ATCAAG (nucleotides 1264 - 1269 ); GGTATT (nucleotides 1411 - 1416 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT; GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAA; GATATC (nucleotides 643 - 648 ) replaced with GACATT; AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG; GCCAAG (nucleotides 1243 - 1248 ) replaced with GCTAAG; ATCAAG (nucleotides 1264 - 1269 ) replaced with ATTAAA; GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATA.
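Replacements of this kind are given as nucleotide coordinates plus the original and replacement hexamers. The sketch below shows one way such a list could be applied to a coding sequence. It assumes the coordinates are 1-indexed, inclusive and in frame with a coding sequence that starts at nucleotide 1. The function name `apply_replacements` and the 12-nucleotide toy sequence are hypothetical; SEQ ID NO: 169 itself is not reproduced here.

```python
def apply_replacements(seq, edits):
    """Apply codon pair replacements given as (start, end, expected, new).

    Coordinates are 1-indexed and inclusive. Each edit is checked against the
    sequence before it is applied, so a wrong coordinate fails loudly.
    """
    bases = list(seq.upper())
    for start, end, expected, new in edits:
        if (start - 1) % 3 != 0 or end - start + 1 != 6 or len(new) != 6:
            raise ValueError("codon pairs must be in frame and six nucleotides long")
        found = "".join(bases[start - 1:end])
        if found != expected:
            raise ValueError(f"expected {expected} at {start}-{end}, found {found}")
        bases[start - 1:end] = new
    return "".join(bases)


# Toy sequence in which TTGAAC sits at nucleotides 4-9 (not its real position).
toy = "ATGTTGAACGCC"
print(apply_replacements(toy, [(4, 9, "TTGAAC", "TTAAAT")]))
```

Because edits are applied one after another, codon pairs that overlap by one codon would need to be merged into a single edit before being passed to a routine like this.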
- the following codon pair replacements have been made: TTGAAC (nucleo
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CTCTCC (nucleotides 274 - 279 ); GACAGC (nucleotides 520 - 525 ); AGCCAG (nucleotides 523 - 528 ); GACTGG (nucleotides 787
- TTCCAG nucleotides 934 - 939
- GCCAGC nucleotides 1441 - 1446 .
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- At least 3 of the following codon pair replacements have been made: CTCTCC (nucleotides 274 - 279 ) replaced with TTATCT; GACAGC (nucleotides 520 - 525 ) replaced with GATTCT; AGCCAG (nucleotides 523 - 528 ) replaced with TCTCAA; GACTGG (nucleotides 787 - 792 ) replaced with GATTGG; TTCCAG (nucleotides 934 - 939 ) replaced with TTCCAG; GCCAGC (nucleotides 1441 - 1446 ) replaced with GCTTCG.
- the nucleotide sequence is optimized for expression in E. coli.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTGAAC (nucleotides 421 - 426 ); GATATC (nucleotides 643 - 648 ); AAGAAA (nucleotides 859 - 864 ); ATCAAC (nucleotides 901
- TTCAAG nucleotides 1057 - 1062
- ATCAAG nucleotides 1264 - 1269
- GGTATT nucleotides 1411 - 1416 .
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- TTGAAC nucleotides 421 - 426
- GATATC nucleotides 643 - 648
- GACATT nucleotides 643 - 648
- AAGAAA nucleotides 859 - 864
- AAAAAG AAAAAG
- ATCAAC nucleotides 901 - 906
- TTCAAG nucleotides 1057 - 1062
- ATCAAG nucleotides 1264 - 1269
- GGTATT nucleotides 1411 - 1416 replaced with GGAATT.
- the nucleotide sequence is optimized for expression in P. pastoris.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTTGTC (nucleotides 286 - 291 ); TTGAAC (nucleotides 421 - 426 ); GCCAAG (nucleotides 496 - 501 ); GATATC (nucleotides 643 - 648 ); AAGAAA (nucleotides 859 - 864 ); AAGAAG (nucleotides 1060 - 1065 ); GCCAAG (nucleotides 1243 - 1248 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: TTTGTC (nucleotides 286 - 291 ) replaced with TTCGTT; TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT; GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAA; GATATC (nucleotides 643 - 648 ) replaced with GACATT; AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG; AAGAAG (nucleotides 1060 - 1065 ) replaced with AAAAAG; GCCAAG (nucleotides 1243 - 1248 ) replaced with GCTAAA.
- TTTGTC nucleotides 286 - 291
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: ACATGG (nucleotides 46 - 51 ); AACAGC (nucleotides 136 - 141 ); AACAGC (nucleotides 268 - 273 ); CTTTAC (nucleotides 325 - 330 ); GCCAAG (nucleotides 496 - 501 ); GACAGC (nucleotides 520 - 525 ); ATCAAT (nucleotides 550 - 555 ); CTCGAT (nucleotides 847 - 852
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: ACATGG (nucleotides 46 - 51 ) replaced with ACCTGG; AACAGC (nucleotides 136 - 141 ) replaced with AATAGT; AACAGC (nucleotides 268
- nucleotide sequence is optimized for expression in Z.
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
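The "highly-overrepresented" threshold described above can be sketched as follows. This is an illustrative example only: the function name is hypothetical, the numeric values are made up, and the use of the population standard deviation is an assumption, since the text does not say which estimator is intended.

```python
from statistics import pstdev


def highly_overrepresented(kinetics_values, k=2.0):
    """Return codon pairs whose value exceeds k standard deviations.

    kinetics_values maps a six-nucleotide codon pair to its translational
    kinetics value in the host organism. k corresponds to the multipliers
    named in the text (5, 3, 2.5 or 2).
    """
    sd = pstdev(list(kinetics_values.values()))  # assumption: population SD
    return sorted(pair for pair, v in kinetics_values.items() if v > k * sd)


# Made-up values for illustration; not taken from any host organism.
toy_values = {
    "AAAAAA": -1.0,
    "CCCCCC": 0.0,
    "GGGGGG": 1.0,
    "TTTTTT": 0.0,
    "CTCTCC": 10.0,
}
flagged_at_2 = highly_overrepresented(toy_values, k=2.0)
flagged_at_3 = highly_overrepresented(toy_values, k=3.0)
```

Lowering k flags more codon pairs as candidates for replacement, which is why the text recites a series of progressively smaller multipliers.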
- a cellobiohydrolase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for degrading cellulose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the exo-1,4-β-D-glucanase retains at least 75% of the enzymatic activity of wild-type TrCBH-I (SEQ ID NO: 170) under normal physiological conditions.
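The "at least a 75% amino acid sequence identity" condition recited above can be illustrated with a short sketch. It assumes the two sequences have already been aligned to equal length with "-" as the gap character, and it counts identity over all alignment columns that are not gaps in both sequences. The text does not specify an alignment method or a denominator convention, so both are assumptions, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned amino acid sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" or b != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy peptides for illustration only.
identity = percent_identity("MLSDSQ", "MLSDTQ")
meets_threshold = identity >= 75.0
```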
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 465-493 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 465-493 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 465-493 when expressed in the native organism.
- no replacement codon encoding amino acids 465-493 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair ATTGGC when expressed in the native organism.
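The z score ceilings described above (a percentage of the mean or median of the five highest wild-type z scores) can be sketched as below. The function names and the numbers are hypothetical, and the sketch assumes positive z scores so that a percentage of the reference value is a meaningful upper bound.

```python
from statistics import mean, median


def z_ceiling(wild_type_z, percent, use_median=False):
    """Ceiling = percent of the mean (or median) of the five highest z scores."""
    top5 = sorted(wild_type_z, reverse=True)[:5]
    center = median(top5) if use_median else mean(top5)
    return center * percent / 100.0


def replacements_within_ceiling(replacement_z, wild_type_z, percent=150,
                                use_median=False):
    """True if no replacement z score is more than the ceiling."""
    limit = z_ceiling(wild_type_z, percent, use_median)
    return all(z <= limit for z in replacement_z)


# Made-up z scores for a region in the native organism.
wild = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ok_at_100 = replacements_within_ceiling([4.9, 5.0], wild, percent=100)
too_high_at_100 = replacements_within_ceiling([5.1], wild, percent=100)
ok_at_150 = replacements_within_ceiling([7.4], wild, percent=150)
```

The same check applies to the single-reference variant by passing a one-element list containing the z score of the named wild type codon pair.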
- a cellobiohydrolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 435-464 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 435-464 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 435-464 when expressed in the native organism.
- At least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CCTACC when expressed in the native organism.
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CAGTTT (nucleotides 445 - 450 ); CAGTAC (nucleotides 571 - 576 ); CAGTAC (nucleotides 685 - 690 ); AAGGGC (nucleotides 793 - 798 ); GAGTTT (nucleotides 808 - 813 ).
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CAGTTT (nucleotides 445 - 450 ) replaced with CAATTT; CAGTAC (nucleotides 571 - 576 ) replaced with CAATAT; CAGTAC (nucleotides 685 - 690 ) replaced with CAATAT; AAGGGC (nucleotides 793 - 798 ) replaced with AAGGGA; GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CTCGGC (nucleotides 7 - 12 ); AGCCAG (nucleotides 142 - 147 ); CTGGCA (nucleotides 301 - 306 ); GATCTC (nucleotides 307 - 312 ); TTCCAG (nucleotides 415 - 420 ); TTCTGG (nucleotides 424 - 429 ); GCCGGA (nucleotides 556 - 561 ); GTCTGG (nucleotides 886
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CTCGGC (nucleotides 7 - 12 ) replaced with CTGGGT; AGCCAG (nucleotides 142 - 147 ) replaced with AGCCAA; CTGGCA (nucleotides 301 - 306 ) replaced with CTCGCG; GATCTC (nucleotides 307 - 312 ) replaced with GACCTG; TTCCAG (nucleotides 415 - 420 ) replaced with TTCCAA; TTCTGG (nucleotides 424 - 429 ) replaced with TTTTGG; GCCGGA (nucleotides 556 - 561 ) replaced with GCGGGT; GTCTGG (nucleot
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTCT (nucleotides 10 - 15 ); ACCAAG (nucleotides 82 - 87 ); CTTCCA (nucleotides 151 - 156 ); GGCTCT (nucleotides 280 - 285 ); CAGTTT (nucleotides 445 - 450 ); CACGAT (nucleotides 493 - 498 ); AAGAAG (nucleotides 790
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- GGCTCT nucleotides 10 - 15
- ACCAAG nucleotides 82 - 87
- CTTCCA nucleotides 151 - 156
- GGCTCT nucleotides 280 - 285
- CAGTTT nucleotides 445 - 450
- CACGAT nucleotides 493 - 498
- CACGAT nucleotides 493 - 498
- CACGAT nucleotides 493 - 498
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TCCGGT (nucleotides 124 - 129 ); GTCGAT (nucleotides 358 - 363 ); GCCGGA (nucleotides 556 - 561 ); GGGGCA (nucleotides 604 - 609 ); GCATGG (nucleotides 607 - 612 ).
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: TCCGGT (nucleotides 124 - 129 ) replaced with TCTGGT; GTCGAT (nucleotides 358 - 363 ) replaced with GTTGAT; GCCGGA (nucleotides 556 - 561 ) replaced with GCTGGT; GGGGCA (nucleotides 604 - 609 ) replaced with GGCGCG; GCATGG (nucleotides 607 - 612 ) replaced with GCGTGG.
- the nucleotide sequence is optimized for expression in Z. mobilis.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- an endoglucanase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a system for degrading cellulose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
- the endo-1,4-β-glucanase retains at least 75% of the enzymatic activity of wild-type endoglucanase (SEQ ID NO: 182) under normal physiological conditions.
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 32-276 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 32-276 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 32-276 when expressed in the native organism.
- no replacement codon encoding amino acids 32-276 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair with the highest z score when expressed in the native organism.
- an endoglucanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 1-32 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-32 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-32 when expressed in the native organism.
- At least one replacement codon encoding amino acids 1-32 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair with the highest z score when expressed in the native organism.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: AGTGAC (nucleotides 58 - 63 ); AAGGGC (nucleotides 148 - 153 ); GCAAGA (nucleotides 172 - 177 ); GACCAA (nucleotides 406 - 411 ); AGCGGT (nucleotides 442 - 447 ); TTGAAT (nucleotides 493 - 498 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT; AAGGGC (nucleotides 148 - 153 ) replaced with AAAGGT; GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA; GACCAA (nucleotides 406 - 411 ) replaced with GATCAA; AGCGGT (nucleotides 442 - 447 ) replaced with TCTGGA; TTGAAT (nucleotides 493 - 498 ) replaced with TTAAAC.
- the nucleotide sequence is optimized for expression in S. cerevisiae.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTGG (nucleotides 25 - 30 ); CTGGAA (nucleotides 91 - 96 ); GGCGGT (nucleotides 127 - 132 ); GGCTGG (nucleotides 151 - 156 ); CTCGGC (nucleotides 352 - 357 ); TACTGG (nucleotides 412 - 417 ); CGCCAG (nucleotides 424 - 429 ); ACCAGC (nucleotides 4
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG; CTGGAA (nucleotides 91 - 96 ) replaced with CTGGAG; GGCGGT (nucleotides 127 - 132 ) replaced with GGCGGC; GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG; CTCGGC (nucleotides 352 - 357 ) replaced with CTGGGT; TACTGG (nucleotides 412 - 417 ) replaced with TATTGG; CGCCAG (nucleotides 424 - 429 ) replaced with CGTCAG; ACCAGC (nucleotides 4
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CACGAT (nucleotides 31 - 36 ); AGTGAC (nucleotides 58 - 63 ); GAGTAT (nucleotides 259 - 264 ); AACTTT (nucleotides 277 - 282 ); GTCAAC (nucleotides 370 - 375 ); GTCAAC (nucleotides 499 - 504 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: CACGAT (nucleotides 31 - 36 ) replaced with CATGAT; AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT; GAGTAT (nucleotides 259 - 264 ) replaced with GAATAT; AACTTT (nucleotides 277 - 282 ) replaced with AATTTC; GTCAAC (nucleotides 370 - 375 ) replaced with GTTAAT; GTCAAC (nucleotides 499 - 504 ) replaced with GTGAAT.
- the nucleotide sequence is optimized for expression in P. pastoris.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTGG (nucleotides 25 - 30 ); GGCTGG (nucleotides 151 - 156 ); GCAAGA (nucleotides 172 - 177 ); GGTGTT (nucleotides 193 - 198 ); AACTTT (nucleotides 277 - 282 ); GACCAA (nucleotides 406 - 411 ); GGTACC (nucleotides 445 - 450 ); TTGAAT (nucleotides
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG; GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG; GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA; GGTGTT (nucleotides 193 - 198 ) replaced with GGTGTT; AACTTT (nucleotides 277 - 282 ) replaced with AATTTC; GACCAA (nucleotides 406 - 411 ) replaced with GATCAA; GGTACC (nucleotides 445 - 450 ) replaced with GGTACA; TTGAAT (nucleotides 4
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GAAGGC (nucleotides 94 - 99 ); GCAAGA (nucleotides 172 - 177 ); AACAGC (nucleotides 214 - 219 ); ACCTAT (nucleotides 286 - 291 ); TCCGGT (nucleotides 301 - 306 ); GCAACG (nucleotides 529 - 534 ); GGCTAT (nucleotides 553 - 558 ).
- At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3 of the following codon pair replacements have been made: GAAGGC (nucleotides 94 - 99 ) replaced with GAAGGA; GCAAGA (nucleotides 172 - 177 ) replaced with GCTCGT; AACAGC (nucleotides 214 - 219 ) replaced with AATTCT; ACCTAT (nucleotides 286 - 291 ) replaced with ACGTAT; TCCGGT (nucleotides 301 - 306 ) replaced with TCTGGT; GCAACG (nucleotides 529 - 534 ) replaced with GCCACC; GGCTAT (nucleotides 553 - 558 ) replaced with GGTTAT.
- the nucleotide sequence is optimized for expression in Z. mobilis.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- a xylanase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 31-221 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
- no replacement codon encoding amino acids 31-221 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 31-221 when expressed in the native organism.
- no replacement codon encoding amino acids 31-221 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair with the highest z score when expressed in the native organism.
- a xylanase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 1-31 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
- at least one replacement codon encoding amino acids 1-31 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-31 when expressed in the native organism.
- At least one replacement codon encoding amino acids 1-31 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair with the highest z score when expressed in the native organism.
- isolated polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203.
- isolated polypeptides encoded by any of the nucleotide sequences provided herein, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194.
- expression systems comprising: an expression vector in a host organism, wherein the expression vector includes any of the polynucleotides provided herein operably linked to an expression control sequence. Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides provided herein, each polynucleotide being operably linked to the same or different expression control sequences.
- expression systems for degrading cellulose comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
- expression systems for metabolizing lignin comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
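The replacement of pause-predicted codon pairs by silent permutation, as recited in the two expression systems above, can be sketched as follows. This is a simplified greedy illustration, not the method of the specification: the z score table `pair_z`, the threshold, and the function names are hypothetical, the codon table is deliberately partial, and a fuller implementation would also re-score the neighbouring pairs that share a replaced codon.

```python
from itertools import product

# Partial codon-to-amino-acid table (standard genetic code), limited to
# the residues used in the toy example below.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
}

def synonymous(codon):
    """All codons in the table that encode the same amino acid as `codon`."""
    aa = CODON_TO_AA[codon]
    return [c for c, a in CODON_TO_AA.items() if a == aa]

def replace_pausing_pairs(codons, pair_z, threshold):
    """Greedily replace codon pairs whose host z score exceeds `threshold`
    with the lowest-scoring synonymous pair that has a known score.
    Returns the new codon list and the number of pairs replaced."""
    out = list(codons)
    replaced = 0
    for i in range(len(out) - 1):
        z = pair_z.get((out[i], out[i + 1]))
        if z is None or z <= threshold:
            continue  # no score available, or no pause predicted
        candidates = [
            p for p in product(synonymous(out[i]), synonymous(out[i + 1]))
            if p in pair_z
        ]
        best = min(candidates, key=pair_z.get)
        if pair_z[best] < z:
            out[i], out[i + 1] = best
            replaced += 1
    return out, replaced

# Hypothetical host z scores, for illustration only.
pair_z = {
    ("GCT", "AAA"): 4.0,   # predicted pause
    ("GCC", "AAG"): -0.5,
    ("GCA", "AAA"): 1.0,
    ("AAA", "GAA"): 0.1,
    ("AAG", "GAA"): 0.2,
}
new_codons, n = replace_pausing_pairs(["GCT", "AAA", "GAA"], pair_z, 3.0)
print(new_codons, n)  # ['GCC', 'AAG', 'GAA'] 1
```

Because every candidate is drawn from codons encoding the same amino acids, the encoded peptide (Ala-Lys-Glu in the example) is unchanged; conservative substitutions would require a separate substitution table.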
- one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191.
- Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191.
- one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
- Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme.
- each encoded enzyme retains at least 75% of the enzymatic activity of the wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194) under normal physiological conditions.
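The 75% amino acid sequence identity criterion can be illustrated with a short Python sketch. It assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it counts identity over all aligned columns; both the function name `percent_identity` and this choice of denominator are assumptions, since the passage above does not specify how identity is computed.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.
    Columns where both sequences have a gap are ignored; a gap opposite a
    residue counts as a mismatch."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Toy peptides, for illustration only.
print(percent_identity("MKTA", "MKSA"))  # 75.0
print(percent_identity("MK-A", "MKTA"))  # 75.0
```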
- cells comprising any of the polynucleotides provided herein.
- the cell expresses the polypeptide encoded by said polynucleotide.
- Also provided herein are methods of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with any of the polynucleotides provided herein under conditions that permit the polynucleotide to be introduced into the host cell.
- Also provided herein are methods of expressing a polypeptide comprising: providing a cell comprising any of the polynucleotides provided herein; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
- Also provided herein are methods of hydrolyzing a carbohydrate comprising: providing a carbohydrate comprising at least one glycosidic bond; providing a polypeptide encoded by any of the polynucleotides provided herein; and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one covalent bond of said carbohydrate, whereby at least one covalent bond of said carbohydrate is hydrolyzed.
- integrable polynucleotides for modifying an endogenous nucleotide sequence in a cell comprising: a removable selectable marker cassette comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site, wherein said removable selectable marker cassette is flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
- integrable polynucleotides further comprise a heterologous nucleic acid flanked by said 5' nucleic acid sequence with homology to an endogenous sequence and said 3' nucleic acid sequence with homology to an endogenous sequence.
- the heterologous nucleic acid comprises a sequence encoding a polypeptide.
- the heterologous nucleic acid comprises a regulatory sequence.
- the sequence encoding a polypeptide is operatively linked to said regulatory sequence.
- the regulatory sequence comprises a promoter sequence and a terminator sequence.
- the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some embodiments, the heterologous nucleic acid encodes a polypeptide that degrades cellulose and/or lignin.
- the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203.
- the selectable marker can be selected for or can be selected against. In some such integrable polynucleotides, the selectable marker can be selected for and can be selected against. In some such integrable polynucleotides, the selectable marker is selected from the group consisting of URA3, TRP1, CAN1, KlURA3, CYH2, LYS2 and MET15. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises a genomic repetitive element. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises Ty1 DNA or Ty3 DNA.
- the site-specific recombinase recognition site comprises a loxP sequence. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a frt sequence. In some such integrable polynucleotides, the integrable polynucleotide comprises a PCR product.
- cells comprising any of the integrable polynucleotides provided herein. Some such cells comprise a gene encoding a site-specific recombinase. In some such cells, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. Some such cells are S. cerevisiae cells.
- Also provided herein are methods of modifying an endogenous sequence in a cell comprising: providing a cell with at least one of the integrable polynucleotides provided; and selecting for a cell comprising said at least one integrable polynucleotide integrated therein to the genome of the cell. Some such methods further comprise excising at least one selectable marker from said at least one cell comprising said at least one integrable polynucleotide integrated therein; and selecting for a cell in which said at least one selectable marker has been excised. In some such methods, the excising said selectable marker comprises providing said cell with a site-specific recombinase.
- the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. In some such methods, the site-specific recombinase is expressed from an endogenous gene or from a heterologous nucleic acid.
- the providing a cell with at least one integrable polynucleotide comprises providing a cell with a plurality of integrable polynucleotides, wherein said plurality of integrable polynucleotides comprises at least a first integrable polynucleotide comprising a first selectable marker and a second integrable polynucleotide comprising a second selectable marker.
- the plurality comprises 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
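The layout of an integrable polynucleotide and the marker excision step described above can be modeled with a small Python sketch. The `Element` type, the function names and the element labels are hypothetical; the model assumes the two recombinase recognition sites are identical and directly repeated, in which case recombination removes the intervening marker and leaves a single site behind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    kind: str   # "homology", "payload", "site" or "marker"
    name: str

def build_integrable(five_prime, payload, site, marker, three_prime):
    """Ordered elements: 5' homology, heterologous nucleic acid, then the
    removable marker cassette (site, marker, site), then 3' homology."""
    return [
        Element("homology", five_prime),
        Element("payload", payload),
        Element("site", site),
        Element("marker", marker),
        Element("site", site),
        Element("homology", three_prime),
    ]

def excise_marker(construct):
    """Simulate site-specific recombination between the first two sites:
    everything between them is removed and one site remains."""
    sites = [i for i, e in enumerate(construct) if e.kind == "site"]
    if len(sites) < 2:
        return list(construct)  # nothing to excise
    first, second = sites[0], sites[1]
    if construct[first].name != construct[second].name:
        raise ValueError("recognition sites do not match")
    return construct[:first + 1] + construct[second + 1:]

# Labels are illustrative only.
integrated = build_integrable("Ty1 5'", "TrCBH-II", "loxP", "URA3", "Ty1 3'")
after = excise_marker(integrated)
print([e.name for e in after])  # ["Ty1 5'", 'TrCBH-II', 'loxP', "Ty1 3'"]
```

Once the marker is excised it can be reused, which is one way to read the embodiments that supply a plurality of integrable polynucleotides.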
- cells comprising an endogenous sequence modified by any of such methods provided herein.
- the modified endogenous sequence comprises an insertion, a deletion or a mutation.
- cells comprising a removable selectable marker cassette integrated into said cell comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site; and a heterologous nucleic acid integrated into said cell, wherein said removable selectable marker is juxtaposed to said heterologous nucleic acid.
- cells comprising: a heterologous nucleic acid integrated into said cell, and a site-specific recombinase recognition site integrated into said cell, wherein said site-specific recombinase recognition site is juxtaposed to said heterologous nucleic acid.
- the site-specific recombinase recognition site comprises a loxP or frt sequence.
- the cell is a S. cerevisiae cell.
- the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some such cells, the heterologous nucleic acid encodes a polypeptide that degrades cellulose and/or lignin.
- the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203.
- Figure 1 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the cellobiohydrolase-II enzyme of T. reesei (TrCBH-II), plotted as a function of codon pair position.
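The kind of data behind such a display, z scores indexed by codon pair position, can be sketched in Python as follows. The lookup table `pair_z` and the function names are hypothetical, the 1-based position numbering is an assumption, and how the z scores themselves are derived for a given organism is outside this sketch.

```python
def split_codons(seq):
    """Split an in-frame coding sequence into codons (RNA input is accepted)."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def z_profile(seq, pair_z):
    """Return (position, z score) for each adjacent codon pair.
    Position 1 is the pair formed by codons 1 and 2; pairs absent from
    the table get None."""
    codons = split_codons(seq)
    return [
        (i + 1, pair_z.get((codons[i], codons[i + 1])))
        for i in range(len(codons) - 1)
    ]

# Hypothetical scores, for illustration only.
pair_z = {("ATG", "GCT"): 1.2, ("GCT", "AAA"): -0.3}
print(z_profile("ATGGCTAAA", pair_z))  # [(1, 1.2), (2, -0.3)]
```

Plotting the second element of each tuple against the first gives a profile of the form described for the figures, computed once per host organism.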
- Figures 2-6 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 2-6 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TrCBH-II, plotted as a function of codon pair position.
- Figure 2A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TrCBH-II protein.
- Figure 2B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 3A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TrCBH-II protein.
- Figure 3B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 4A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TrCBH-II protein.
- Figure 4B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 5A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TrCBH-II protein.
- Figure 5B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 6A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TrCBH-II protein.
- Figure 6B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 7-11 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 7-11 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. sanguineus (LCC), plotted as a function of codon pair position.
- Figure 7A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 7B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 8A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 8B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 9A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 9B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 10A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 10B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 11A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 11B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 12-16 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 12-16 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the lignin peroxidase enzyme of T. versicolor (LIP), plotted as a function of codon pair position.
- Figure 12A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LIP protein.
- Figure 12B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 13A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LIP protein.
- Figure 13B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 14A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LIP protein.
- Figure 14B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 15A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LIP protein.
- Figure 15B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 16A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LIP protein.
- Figure 16B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 17-21 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 17-21 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the Mn-dependent peroxidase enzyme of T. versicolor (MnP), plotted as a function of codon pair position.
- Figure 17A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the MnP protein.
- Figure 17B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 18A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the MnP protein.
- Figure 18B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 19A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the MnP protein.
- Figure 19B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 20A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the MnP protein.
- Figure 20B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 21A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the MnP protein.
- Figure 21B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figure 22 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in N. crassa of nucleic acid sequences encoding the laccase enzyme of N. crassa (LCC), plotted as a function of codon pair position.
- Figures 23-27 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 23-27 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LCC, plotted as a function of codon pair position.
- Figure 23A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 23B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 24A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 24B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 25A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 25B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 26A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 26B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 27A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 27B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 28-32 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 28-32 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. cinnabarinus (LCC), plotted as a function of codon pair position.
- Figure 28A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 28B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 29A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 29B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 30A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 30B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 31A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 31B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 32A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 32B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 33-37 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 33-37 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. coccineus (LCC), plotted as a function of codon pair position.
- Figure 33A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 33B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 34A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 34B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 35A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 35B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 36A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 36B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 37A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein.
- Figure 37B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figure 38 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the cellobiohydrolase-I enzyme of T. reesei (TrCBH-I), plotted as a function of codon pair position.
- Figures 39-43 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 39-43 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TrCBH-I, plotted as a function of codon pair position.
- Figure 39A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TrCBH-I protein.
- Figure 39B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 40A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TrCBH-I protein.
- Figure 40B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 41A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TrCBH-I protein.
- Figure 41B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 42A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TrCBH-I protein.
- Figure 42B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 43A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TrCBH-I protein.
- Figure 43B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 44-48 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 44-48 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the endoglucanase enzyme of T. aurantiacus (EG1), plotted as a function of codon pair position.
- Figure 44A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the EG1 protein.
- Figure 44B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the EG1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 45A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the EG1 protein.
- Figure 45B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the EG1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 46A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the EG1 protein.
- Figure 46B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the EG1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 47A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the EG1 protein.
- Figure 47B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the EG1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 48A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the EG1 protein.
- Figure 48B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the EG1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figures 49-53 depict effects of Translational Engineering™ on protein expression levels.
- Each of Figures 49-53 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the xylanase enzyme of T. lanuginosus (XynA), plotted as a function of codon pair position.
- XynA T. lanuginosis
- Figure 49A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 49B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
- Figure 50A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 50B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
- Figure 51A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 51B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
- Figure 52A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 52B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
- Figure 53A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 53B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Figure 54A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein.
- Figure 54B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
- Biomass is the earth's most attractive alternative among fuel sources and most sustainable energy resource and is reproduced by the bioconversion of carbon dioxide.
- Ethanol produced from biomass is today the most widely used biofuel when blended with gasoline.
- the use of biofuels can significantly reduce the accumulation of greenhouse gas.
- Ethanol is just one example of the uses of biomass harvesting using industrial enzymes. The technologies associated with biomass harvesting are similarly applicable in the production of other biofuels, fine chemicals as well as other diverse applications.
- a variety of highly specialized microorganisms have evolved to produce enzymes that either synergistically or in complexes can carry out the complete hydrolysis of cellulose.
- the anaerobic bacteria Clostridium thermocellum and Clostridium cellulovorans and the filamentous fungus Trichoderma reesei are known as cellulolytic and xylanolytic microorganisms.
- The bacteria C. thermocellum and C. cellulovorans produce a cellulosome complex consisting of cellulase and hemicellulase organized on the cell surface (Doi and Tamaru (2001) Chem. Rec. 1:24-32; Shoham et al. (1999) Trends Microbiol. 7:275-281).
- In T. reesei, three types of cellulolytic enzyme are secreted extracellularly, including five endoglucanases (EG [EC 3.2.1.4]) (Okada et al. (1998) Appl. Environ. Microbiol. 64:555-563), two cellobiohydrolases (CBH [EC 3.2.1.91]) (Henrissat et al. (1985) Bio/Technology 3:722-726; Teeri et al. (1987) Gene 51:43-52), and two β-glucosidases (BGL [EC 3.2.1.21]) (Chen et al. (1992) Biochim. Biophys.
- EG [EC 3.2.1.4] endoglucanases
- CBH [EC 3.2.1.91] two cellobiohydrolases
- BGL [EC 3.2.1.21] two β-glucosidases
- Endoglucanases act randomly against the amorphous region of the cellulose chain to produce reducing and nonreducing ends for cellobiohydrolases, which produce cellobiose from reducing or nonreducing ends of crystalline cellulose.
- Exoglucanase enzymes, including CBH-I and CBH-II, liberate the disaccharide D-cellobiose from 1,4-β-glucans.
- Cellulose chains are thus efficiently degraded to soluble cellobiose and cellooligosaccharides by the endo-exo synergism of EG and CBH (Henrissat et al. (1985) Bio/Technology 3:722-726).
- the predominant polysaccharide in the primary cell wall of biomass is cellulose, the second most abundant is hemi-cellulose, and the third is pectin.
- The secondary cell wall, produced after the cell has stopped growing, also contains polysaccharides and is strengthened through polymeric lignin covalently cross-linked to hemicellulose.
- Cellulose is a homopolymer of anhydrocellobiose and thus a linear β-(1-4)-D-glucan, while hemicelluloses include a variety of compounds, such as xylans, xyloglucans, arabinoxylans, and mannans in complex branched structures with a spectrum of substituents.
- cellulose is found in plant tissue primarily as an insoluble crystalline matrix of parallel glucan chains. Hemicelluloses usually hydrogen bond to cellulose, as well as to other hemicelluloses, which helps stabilize the cell wall matrix.
- DNA constructs encoding cellulase enzymes are known in the art.
- U.S. Patent No. 5,686,593 relates to cellulose- or hemicellulose-degrading enzymes that are derivable from a fungus other than Trichoderma or Phanerochaete, and which comprise a carbohydrate binding domain homologous to a terminal A region of T. reesei cellulases.
- Lignocellulosic biomass is composed predominantly of cellulose, hemicellulose, and lignin.
- Lignin is a complex, highly cross-linked polyphenolic heteropolymer, and is naturally resistant to chemical and biologic conversion.
- An economical biomass-to-ethanol process critically depends on the rapid and efficient conversion of all of the sugars present in both its cellulose and hemicellulose fractions.
- Although cellulose and hemicellulose are readily degraded by fungal and bacterial pathways, lignin is extremely recalcitrant. Furthermore, because of its cross-linking with the other cell wall components, lignin minimizes the accessibility of cellulose and hemicellulose to microbial enzymes. Hence, lignin is generally associated with reduced digestibility of the overall plant biomass.
- White rot fungi are believed to be the most effective lignin-degrading microbes in nature. These white-rot fungi secrete one or more of three extracellular enzymes that are essential for lignin degradation. They are often referred to as lignin-modifying enzymes or LMEs.
- The three enzymes comprise two glycosylated heme-containing peroxidases, lignin peroxidase (LIP) and Mn-dependent peroxidase (MNP), and a copper-containing phenoloxidase, laccase (LCC).
- LIP lignin peroxidase
- MNP Mn-dependent peroxidase
- LCC copper-containing phenoloxidase Laccase
- Laccases are copper containing oxidase enzymes that are found in many plants, fungi and microorganisms. Laccases are enzymatically active on phenols and similar molecules and perform a one electron oxidation. Laccases can be polymeric and the enzymatically active form can be a dimer or trimer.
- The enzymatic activity of Mn-dependent peroxidase (MnP) is dependent on Mn²⁺. Without being bound by theory, it has been suggested that the main role of this enzyme is to oxidize Mn²⁺ to Mn³⁺ (Glenn et al. (1986) Arch. Biochem. Biophys. 251:688-696). Subsequently, phenolic substrates are oxidized by the Mn³⁺ generated.
- Lignin peroxidase is an extracellular heme-containing peroxidase that catalyses the oxidative depolymerization of dilute solutions of polymeric lignin in vitro.
- Some of the substrates of LiP, most notably 3,4-dimethoxybenzyl alcohol (veratryl alcohol, VA), are active redox compounds that have been shown to act as redox mediators.
- VA is a secondary metabolite produced at the same time as LiP by ligninolytic cultures of P.
- Hydrolysis enzymes do not express well in host organisms such as E. coli or S. cerevisiae. Accordingly, provided herein are hydrolysis enzyme-encoding nucleotide sequences and methods of making the same for improved expression of hydrolysis enzymes.
- Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation. Even when the gene is translated efficiently enough that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pauses can improve protein expression.
- The pause(s) can serve to facilitate proper polypeptide folding, post-translational modification, re-organization/folding at protein domain boundaries, or other steps toward arriving at the native, active wild-type protein. Accordingly, in some embodiments provided herein, one or more pauses that are predicted to be present in native translation of hydrolysis enzymes are preserved in a modified hydrolysis enzyme-encoding polynucleotide provided in accordance with the teachings herein.
- A codon pair in the modified hydrolysis enzyme-encoding polynucleotide can be selected to have a predicted translational kinetics value that is at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 99% of that of the native codon pair whose predicted pause is to be preserved; further, the codon pair in the modified hydrolysis enzyme-encoding polynucleotide can be selected to be located within 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 codons of the native codon pair whose predicted pause is to be preserved.
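The two selection criteria for preserving a predicted pause (a minimum fraction of the native translational kinetics value, and a maximum distance in codons from the native pause) can be illustrated with a minimal Python sketch. The `Candidate` type, the function name, the default cut-offs and all numeric values are hypothetical and chosen only for illustration; the sketch assumes positive kinetics values where a larger value means a stronger predicted pause.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    position: int    # codon index of the first codon of the pair (hypothetical)
    codon_pair: str  # six nucleotides
    value: float     # hypothetical predicted translational kinetics value


def pick_preserving_pair(
    native_position: int,
    native_value: float,
    candidates: Sequence[Candidate],
    min_fraction: float = 0.90,  # e.g. "at least 90% of the native value"
    max_distance: int = 5,       # e.g. "within 5 codons of the native pair"
) -> Optional[Candidate]:
    """Return the candidate that best preserves the native pause, or None."""
    eligible = [
        c for c in candidates
        if c.value >= min_fraction * native_value
        and abs(c.position - native_position) <= max_distance
    ]
    if not eligible:
        return None
    # Prefer the closest candidate; break ties by the larger kinetics value.
    return min(eligible, key=lambda c: (abs(c.position - native_position), -c.value))


# Illustrative data only.
candidates = [
    Candidate(12, "GAAGGT", 4.0),
    Candidate(10, "CTGAAA", 3.2),
    Candidate(20, "AAACCC", 5.0),
]
chosen = pick_preserving_pair(11, 4.0, candidates)
```

With these illustrative values, only the pair at position 12 satisfies both criteria: the pair at position 10 falls below 90% of the native value, and the pair at position 20 is more than 5 codons away.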
- Translation Engineering™ refers to a process used to modify the translational kinetics of a polypeptide-encoding nucleic acid sequence.
- Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic acid sequence when expressed in its native organism.
- This process alters the polypeptide-encoding nucleic acid sequence to optimize codon usage and codon pair usage in the organism in which the polypeptide-encoding nucleic acid sequence is expressed.
- Sequence modifications can be made to place or prevent restriction sites in the sequence, eliminate strong RNA secondary structures and avoid inadvertent Shine-Dalgarno sequences.
- Translation Engineering™ involves modifying the translational kinetics of a polypeptide-encoding nucleic acid sequence by removing, preserving, and/or inserting translational pauses into the polypeptide-encoding nucleic acid sequence.
- hydrolysis enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same.
- a hydrolysis enzyme-encoding DNA sequence wherein the encoded sequence has amino acid sequence identity with wild-type hydrolysis enzyme, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing input-sequence codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the resultant hydrolysis enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
- expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
- expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated hydrolysis enzyme.
- expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where one or more predicted pauses are preserved from the native expression profile or are added to preserve expression of active and/or soluble hydrolysis enzyme.
- the hydrolysis enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels; higher enzymatic activity; greater protein stability and resistance to degradation; and increased solubility.
- hydrolysis enzyme refers to the enzymes encoded by the nucleotide sequences provided herein, and includes cellobiohydrolase-II, laccase, lignin peroxidase, Mn-dependent peroxidase, cellobiohydrolase-I, endoglucanase and xylanase enzymes.
- nucleic acid sequences encoding the cellobiohydrolase-II enzyme of T. reesei are provided.
- the nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 1) which encodes the TrCBH-II amino acid sequence (SEQ ID NO: 2).
- nucleic acid sequences encoding the laccase enzyme of P. sanguineus are provided.
- the nucleotide sequences provided herein include the native sequence from P. sanguineus shown in the sequence listing (SEQ ID NO: 25) which encodes the LCC amino acid sequence (SEQ ID NO: 26).
- nucleic acid sequences encoding the lignin peroxidase enzyme of T. versicolor are provided.
- the nucleotide sequences provided herein include the native sequence from T. versicolor shown in the sequence listing (SEQ ID NO: 49) which encodes the LIP amino acid sequence (SEQ ID NO: 50).
- nucleic acid sequences encoding the Mn-dependent peroxidase enzyme of T. versicolor (MnP) are provided.
- the nucleotide sequences provided herein include the native sequence from T. versicolor shown in the sequence listing (SEQ ID NO: 73) which encodes the MnP amino acid sequence (SEQ ID NO: 74).
- nucleic acid sequences encoding the laccase enzyme of N. crassa are provided.
- the nucleotide sequences provided herein include the native sequence from N. crassa shown in the sequence listing (SEQ ID NO: 97) which encodes the LCC amino acid sequence (SEQ ID NO: 98).
- nucleic acid sequences encoding the laccase enzyme of P. cinnabarinus are provided.
- the nucleotide sequences provided herein include the native sequence from P. cinnabarinus shown in the sequence listing (SEQ ID NO: 121) which encodes the LCC amino acid sequence (SEQ ID NO: 122).
- nucleic acid sequences encoding the laccase enzyme of P. coccineus are provided.
- the nucleotide sequences provided herein include the native sequence from P. coccineus shown in the sequence listing (SEQ ID NO: 145) which encodes the LCC amino acid sequence (SEQ ID NO: 146).
- nucleic acid sequences encoding the cellobiohydrolase-I enzyme of T. reesei are provided.
- the nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 169) which encodes the TrCBH-I amino acid sequence (SEQ ID NO: 170).
- nucleic acid sequences encoding the endoglucanase enzyme of T. aurantiacus are provided.
- the nucleotide sequences provided herein include the native sequence from T. aurantiacus shown in the sequence listing (SEQ ID NO: 181) which encodes the endoglucanase amino acid sequence (SEQ ID NO: 182).
- nucleic acid sequences encoding the xylanase enzyme of T. lanuginosus are provided.
- the nucleotide sequences provided herein include the native sequence from T. lanuginosus shown in the sequence listing (SEQ ID NO: 193) which encodes the XynA amino acid sequence (SEQ ID NO: 194).
- nucleic acid sequences encoding hydrolysis enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 3, 27, 51, 75, 99, 123, 147, 171, 183 and 195), E. coli (SEQ ID NOS: 9, 33, 57, 81, 105, 129, 153, 173, 185 and 197), P. pastoris (SEQ ID NOS: 15, 39, 63, 87, 111, 135, 159, 175, 187 and 199), K. lactis (SEQ ID NOS: 21, 45, 69, 93, 117, 141, 165, 177, 189 and 201).
- nucleotide sequences may be added 3' or 5' of any nucleic acid, for example, to facilitate hybridization of PCR primers, to add cloning restriction sites or other sites that facilitate cloning and/or expression. Accordingly, provided in the sequence listing are nucleic acid sequences with additional 5' and 3' cloning and/or PCR sequences, and which encode hydrolysis enzymes with refined translational kinetics for expression in S.
- Also provided are hydrolysis enzyme amino acid sequences encoded by the nucleotide sequences with refined translational kinetics described herein.
- hydrolysis enzyme nucleic acid sequences with refined translational kinetics: SEQ ID NOS: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157
- hydrolysis enzyme-encoding DNA sequences wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme polypeptide and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
- the host organism is not human, E. coli or S. cerevisiae.
- A laccase nucleotide sequence encodes a polypeptide having laccase activity.
- Laccase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin.
- A method for measuring laccase activity is exemplified by a known method in which an enzymatic reaction is carried out using 2,6-dimethoxyphenol (DMP) as a substrate and 2,2',6,6'-demethoxydiphenoquinone absorbance at 468 nm is monitored by spectrophotometry, as described in de Jong et al. ((1992) Mycol. Res. 96:1098-1104), hereby incorporated by reference in its entirety.
- DMP 2,6-dimethoxyphenol
- A cellobiohydrolase nucleotide sequence encodes a polypeptide having cellobiohydrolase activity.
- Cellobiohydrolase, exoglucanase, exo-1,4-β-D-glucanase and like terms refer to enzymes that catalyze the hydrolysis of a glucoside bond in a polysaccharide or an oligosaccharide containing D-glucose subunits bonded through β-1,4 bonds, to release cellobiose, a disaccharide in which D-glucose is bonded through a β-1,4 bond.
- A method for measuring cellobiohydrolase activity is exemplified by a known method in which an enzymatic reaction is carried out using phosphoric acid-swollen cellulose as a substrate and the existence of cellobiose in the reaction is confirmed by thin-layer silica gel chromatography, as described in U.S. Patent No. 6,566,113, hereby incorporated by reference in its entirety.
- A lignin peroxidase nucleotide sequence encodes a polypeptide having lignin peroxidase activity.
- Lignin peroxidase, diarylpropane peroxidase, ligninase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin.
- A method for measuring lignin peroxidase activity is exemplified by a known method in which an enzymatic reaction is carried out and veratryl alcohol absorbance at 310 nm is monitored by spectrophotometry, as described by Linko and Haapala ((1993) Biotechnol. Techniques 7:75-80), hereby incorporated by reference in its entirety.
- A Mn-dependent peroxidase nucleotide sequence encodes a polypeptide having Mn-dependent peroxidase activity.
- Mn-dependent peroxidase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin.
- A method for measuring Mn-dependent peroxidase activity is exemplified by a known method in which an enzymatic reaction is carried out and the absorbance at 590 nm of the oxidized product of 3-methyl-2-benzothiazolinone hydrazone hydrochloride (MBTH) plus 3-dimethylaminobenzoic acid (DMAB) is monitored by spectrophotometry, as described in Daniel et al.
- An endoglucanase nucleotide sequence encodes an endo-1,4-β-glucanase polypeptide having endo-1,4-β-glucanase activity.
- Endoglucanase and like terms refer to the enzymes involved in the enzymatic hydrolysis of a glucoside bond in a polysaccharide or an oligosaccharide containing D-glucose subunits bonded through β-1,4 bonds, to release cellobiose, a disaccharide in which D-glucose is bonded through a β-1,4 bond.
- Endoglucanases act randomly against the amorphous region of the cellulose chain to produce reducing and nonreducing ends for cellobiohydrolases, which produce cellobiose from reducing or nonreducing ends of crystalline cellulose.
- a xylanase nucleotide sequence encodes a xylanase polypeptide having xylanase activity.
- Xylanase and like terms refer to a class of enzymes which degrade the linear polysaccharide β-1,4-xylan into xylose, thus breaking down hemicellulose, which is a major component of the cell wall of plants.
- Polynucleotides provided herein encode polypeptides that have hydrolysis activity.
- a hydrolysis enzyme-encoding polynucleotide comprising any of the DNA sequences provided herein can be transcribed and the resulting RNA translated to produce a polypeptide with hydrolysis enzyme activity.
- nucleotide sequence is used to refer to any polynucleotide sequence.
- DNA sequence is used herein to refer to the nucleotide sequences presented herein.
- RNA equivalent nucleotide sequences are also described by DNA sequences presented herein.
- An equivalent RNA sequence can be substituted for a DNA sequence by a T to U substitution (i.e., replacing thymine in the DNA sequence with uracil in the RNA sequence).
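The T to U substitution can be shown in a few lines of Python. This is a minimal sketch; the function name `dna_to_rna` is illustrative and the input is assumed to be a coding-strand sequence written in the letters A, C, G and T.

```python
def dna_to_rna(dna: str) -> str:
    """Return the equivalent RNA sequence by replacing thymine (T) with uracil (U)."""
    # Handle both upper- and lower-case input without changing the case.
    return dna.translate(str.maketrans("Tt", "Uu"))


rna = dna_to_rna("ATGGCTTAA")
```

Here `rna` holds the sequence `AUGGCUUAA`; every other base is left unchanged.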
- the hydrolysis enzyme-encoding DNA sequence is adapted for expression in a heterologous host organism.
- a DNA sequence that has been adapted for expression is a DNA sequence that has been inserted into an expression vector or otherwise modified to contain regulatory elements necessary for expression of the DNA in the host cell, positioned in such a manner as to permit expression of the DNA in the host cell.
- regulatory elements required for expression include promoter sequences, transcription initiation sequences and, optionally, enhancer sequences.
- a DNA sequence may be inserted into a plasmid vector adapted for expression in a bacterial cell, such as E. coli, or a eukaryotic cell, such as S. cerevisiae or other yeast, or any other host organism.
- a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism. In certain aspects, the host organism is not human, E. coli or S. cerevisiae.
- polynucleotides provided herein also encode polypeptides that have other lignin-metabolizing activities such as a lignin peroxidase and a Mn-dependent peroxidase activity.
- translational kinetics of an mRNA into polypeptide can be changed in order to achieve any of a variety of expression profiles. For example, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all translational pauses predicted to occur within an autonomous folding unit of a nascent protein. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all over-represented codon pairs.
- A pause or translation-slowing codon pair can queue ribosomes back to the beginning of the coding sequence, thereby inhibiting further ribosome attachment to the message, which can result in down-regulation of protein expression levels as the rate of translation initiation readily saturates and the slowest translation step time becomes rate limiting. It is also proposed herein that the presence of a pause or translational slowing codon pair can stall or detach a ribosome. It is also proposed herein that the presence of a pause or translational slowing codon pair can expose naked mRNA, which is then subject to message degradation.
- Organism-specific codon usage and codon pair usage, and the presence of organism-specific pause sites result in gene translation that is highly adapted to the original host organism.
- ribosomal pausing sites that may be functional in a human cell will typically be scrambled, random, or not appropriate or not recognized in the proper context in a bacterium or other non-native host.
- a heterologous cDNA or synthetic polynucleotide has a random but high probability of inadvertently encoding a pause site somewhere, often leading to protein expression and/or activity failure.
- Methods for refining translational kinetics of an mRNA into polypeptide can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2008/0046192, published on February 21, 2008, which is incorporated by reference herein in its entirety.
- a polypeptide-encoding nucleotide can be designed to be predicted to be translated rapidly along its entire length.
- some polypeptide-encoding nucleotides provided herein are those that have been engineered to remove all predicted pauses. Expression of such a polypeptide-encoding nucleotide can result in improved protein expression levels and improved levels of active and/or natively folded polypeptide expression.
- a test of translation pausing or slowing as a result of codon pair usage can be performed by comparing a series of genes that have random pauses with modified genes where codon pairs predicted to cause translational pauses are replaced. Unmodified genes moved from their source organism and expressed in a heterologous host can have an altered set of codon pairs predicted to cause a translational pause or ribosomal slowing (e.g., an altered set of over-represented codon pairs), resulting in altered configuration and location of presumed pause sites.
- translational kinetics of an mRNA into hydrolysis enzyme-encoding polypeptide can be changed in order to remove some or all translational pauses or replace other codon pairs that cause translational slowing, message instability and degradation, and poor protein translation, expression, and functional properties. While not intending to be limited to the following, it is believed that, for at least some proteins, reduction or elimination of translational pauses can serve to increase the expression level and/or quality and characteristics of the protein. Accordingly, by removing some or all translational pauses or replacing other codon pairs that cause translational slowing, the expression levels and/or quality of an expressed protein can be increased.
- hydrolysis enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels, higher enzymatic activity, greater protein stability, resistance to degradation, and increased solubility compared to the original native gene when expressed in a heterologous host.
- Provided herein are hydrolysis enzyme-encoding nucleotide sequences that have been modified to have one or more translational pauses or slowing sites removed by modifying one or more codon pairs to a corresponding codon pair that is less likely to cause a translational pause or slowing. While in some embodiments it is preferred to replace all codon pairs predicted to cause a translational pause or slowing, in other embodiments, it is sufficient to replace a subset of codon pairs predicted to cause a translational pause or slowing. For example, expression levels can be increased by replacing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more codon pairs predicted to cause a translational pause or slowing.
- At least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs predicted to cause a translational pause or slowing are replaced by, for example, substituting different codon pairs that encode the same amino acids.
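Substituting a different codon pair that encodes the same amino acids can be sketched in Python as follows. The standard genetic code is built in compact form; the function names and the per-pair scores are hypothetical (a lower score is assumed to mean a lower predicted likelihood of a pause), and the scoring itself is outside the scope of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_pairs(codon_pair: str) -> list[str]:
    """List every codon pair encoding the same two amino acids as codon_pair."""
    aa1 = CODON_TABLE[codon_pair[:3]]
    aa2 = CODON_TABLE[codon_pair[3:]]
    firsts = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    seconds = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return [a + b for a, b in product(firsts, seconds)]


def least_pausing_synonym(codon_pair: str, scores: dict[str, float]) -> str:
    """Pick the synonymous pair with the lowest (hypothetical) pause score."""
    return min(synonymous_pairs(codon_pair), key=scores.__getitem__)


# Illustrative scores only: every Ala-Glu pair gets 1.0 except one favoured pair.
options = synonymous_pairs("GCTGAA")   # Ala (4 codons) x Glu (2 codons) = 8 pairs
scores = {pair: 1.0 for pair in options}
scores["GCCGAG"] = -2.0
replacement = least_pausing_synonym("GCTGAA", scores)
```

Because adjacent codon pairs overlap by one codon, changing a codon also changes the neighbouring pair, so a practical design would need to re-score the neighbours after each substitution.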
- translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses predicted to occur within an autonomous folding unit of a protein.
- An autonomous folding unit of a protein refers to an element of the overall protein structure that is self-stabilizing and often folds independently of the rest of the protein chain. Such autonomous folding units typically correspond to a protein domain.
- expression of a gene in a heterologous host organism can result in translational pauses located in regions that inhibit protein expression and/or protein folding.
- preserving or inserting a translational pause in a region predicted to separate autonomous folding units of a protein can result in improved folding and/or solubility of expressed proteins.
- Provided herein are methods of changing translational kinetics of an mRNA into polypeptide by preserving, relative to native, or inserting one or more translational pauses in one or more regions predicted to separate autonomous folding units of a protein, thereby improving the folding and/or solubility of the expressed protein.
- one step can include identifying predicted autonomous folding units of a protein.
- Methods for identifying predicted autonomous folding units of a protein or protein domains are known in the art, and include alignment of amino acid sequences with protein sequences having known structures, and threading amino acid sequences against template protein domain databases. Such methods can employ any of a variety of software algorithms in searching any of a variety of databases known in the art for predicting the location of protein domains. The results of such methods will typically include an identification of the amino acids predicted to be present in a particular domain, and also can include an identification of the domain itself, and an identification of the secondary structural element, if any, in which each amino acid sequence of a domain is located.
- In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to remove a translational pause not present in the expression profile of the polypeptide in the native host organism. For example, there may be no codon pairs that are not predicted to cause a translational pause or slowing and that encode the corresponding pair of amino acids. In such instances, several options are available: the codon pair that is least likely to cause a translational pause or slowing can be selected; an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
- One option in a computational method is to request human input in order to resolve the issue.
- the computational method may, for example, involve the use of a computer that is programmed to request human input.
- the computer may be programmed to make a selection, or combination of selections, such that multiple genes, or Ordered Gene Sets or small permutation libraries are designed and synthetically produced for use in expression analysis.
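The fallback described above, in which the least pause-prone codon pair is selected when no synonymous pair falls below the pause threshold, can be sketched in Python. The function name, the score dictionary and the threshold are hypothetical; the returned flag merely signals that the case is unresolved, so that a program could request human input, consider a conservative amino acid substitution, or leave the pair unchanged.

```python
def choose_replacement(scores: dict[str, float], threshold: float) -> tuple[str, bool]:
    """Return (best_pair, resolved).

    scores maps each synonymous codon pair to a hypothetical pause score
    (lower = less likely to pause). resolved is False when even the best
    pair is still at or above the pause threshold.
    """
    best = min(scores, key=scores.get)
    return best, scores[best] < threshold


# Met-Trp has a single synonymous codon pair, so no alternative exists.
unresolved = choose_replacement({"ATGTGG": 6.1}, threshold=3.0)

# Here one synonymous pair falls below the threshold.
resolved = choose_replacement({"GCTGAA": 4.0, "GCCGAG": 1.2}, threshold=3.0)
```

In the first call the only available pair is returned with the flag set to `False`; in the second, `GCCGAG` is returned with the flag set to `True`.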
- When an amino acid insertion, deletion or mutation is made in order to change translational kinetics, it is preferable to select a change that is predicted not to substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
- Such an amino acid insertion, deletion or mutation can include, for example, a conservative amino acid substitution such as the conservative substitutions shown in Table 1.
- the substitutions shown are based on amino acid physical-chemical properties, and as such, are independent of organism.
- the conservative amino acid substitution is a substitution listed under the heading of exemplary substitutions.
- In some embodiments, all codon pairs predicted to cause a translational pause or slowing are treated equally.
- one or more different threshold levels can be established for differential treatment of codon pairs, where codon pairs above a highest threshold are the codon pairs most likely to cause a translational pause or slowing, and succeedingly lower codon pair threshold-based groups correspond to succeedingly lower likelihoods of the respective codon pairs causing a translational pause or slowing.
- different numbers or percentages of codon pairs can be replaced for each of these different threshold-based groups. For example, 95% or more codon pairs above a highest threshold level can be replaced, while 90% or less of all codon pairs between that level and an intermediate threshold level are replaced.
- codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, three or more different threshold-based groups, four or more different threshold-based groups, five or more different threshold-based groups, six or more different threshold-based groups, or more. Discussion of specific thresholds is provided elsewhere herein; however, typically the higher the threshold, the higher the likelihood of a translational pause or slowing caused by a codon pair with a translational kinetics value greater than the threshold. In embodiments in which codon pairs likely to cause a translational pause or slowing are segregated into two or more different threshold-based groups, different numbers or percentages of codon pairs can be replaced for each codon pair group.
- codon pairs above a highest threshold are replaced, while the same or a lower percentage of codon pairs are replaced from codon pair groups corresponding to one or more lower thresholds.
- all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair is located within an autonomous folding unit.
- all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair can be replaced without requiring a change in the encoded polypeptide sequence.
- all codon pairs above a highest threshold are replaced, while a codon pair above a first higher intermediate threshold is replaced only if the codon pair can be replaced without changing the encoded polypeptide sequence or with only a conservative change to the encoded polypeptide sequence, while a codon pair above a second lower intermediate threshold is replaced only if the codon pair can be replaced without requiring any change in the encoded polypeptide sequence.
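- The tiered rule in the preceding paragraph can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the three threshold values and the labels used for the kind of amino acid change are hypothetical.

```python
def should_replace(value, change, highest=3.0, upper_mid=2.0, lower_mid=1.5):
    """Decide whether a codon pair is replaced under a three-tier policy.

    value  -- translational kinetics value of the codon pair
    change -- effect of the best available replacement on the encoded
              polypeptide: "synonymous", "conservative" or "other"
    The default thresholds are illustrative assumptions.
    """
    if value > highest:
        # Above the highest threshold: always replaced.
        return True
    if value > upper_mid:
        # First (higher) intermediate tier: no change or a conservative change.
        return change in ("synonymous", "conservative")
    if value > lower_mid:
        # Second (lower) intermediate tier: only if no change is required.
        return change == "synonymous"
    # At or below the lowest threshold: left untouched.
    return False
```

- For example, with these assumed thresholds a codon pair with a value of 2.5 would be replaced if a conservative substitution suffices, whereas a pair with a value of 1.7 would be replaced only by a synonymous codon pair.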
- an evaluation method can be used that determines the degree to which a codon pair should be replaced according to the translational kinetics value of the codon pair, where the degree to which the codon pair should be replaced can be counterbalanced by any of a variety of user-determined factors such as, for example, presence of the codon pair within or between autonomous folding units, and degree of change to the encoded polypeptide sequence.
- a translational kinetics value of a codon pair is a representation of the degree to which it is expected that a codon pair is associated with a translational pause. Methods of determining the translational kinetics value of a codon pair are discussed elsewhere herein. Such translational kinetics values can be normalized to facilitate comparison of translational kinetics values between species. In some embodiments, the translational kinetics value can be the degree of over-representation of a codon pair. An over-represented codon pair is a codon pair which is present in a protein-encoding sequence in higher abundance than would be expected if all codon pairs were statistically randomly abundant.
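- One simple way to estimate over-representation is to compare the observed count of each codon pair with the count expected from the frequencies of its two codons. The Python below is a sketch under that assumption; the function name is hypothetical, and the normalization actually used to derive translational kinetics values is discussed elsewhere herein and may differ.

```python
from collections import Counter


def codon_pair_over_representation(cds_list):
    """Return {codon_pair: observed / expected} for a set of in-frame
    coding sequences. Ratios above 1 suggest over-representation.

    The expected count assumes the two codons of a pair occur
    independently, which is an illustrative simplification.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for cds in cds_list:
        # Split into complete codons, ignoring any trailing partial codon.
        codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
        codon_counts.update(codons)
        # Adjacent, overlapping codon pairs: (1,2), (2,3), (3,4), ...
        pair_counts.update(a + b for a, b in zip(codons, codons[1:]))

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    ratios = {}
    for pair, observed in pair_counts.items():
        p_first = codon_counts[pair[:3]] / total_codons
        p_second = codon_counts[pair[3:]] / total_codons
        expected = p_first * p_second * total_pairs
        ratios[pair] = observed / expected
    return ratios
```

- In practice such ratios would be computed over a large set of coding sequences from the organism of interest; a handful of genes is too small a sample to be meaningful.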
- a codon pair predicted to cause a translational pause or slowing is a codon pair whose likelihood of causing a translational pause or slowing is at least one standard deviation above the mean translational kinetics value, where a particular translational kinetics value above the mean translational kinetics value in this context refers to a translational kinetics value indicative of a greater likelihood of causing translational pausing or slowing, relative to a mean translational kinetics value, and is not strictly limited to a particular mathematical relationship (e.g., greater than the mean) since the depiction of propensity to cause a translational pause by a translational kinetics value can be selected to be negative or positive, based on the selected implementation by one skilled in the art.
- over-represented codon pairs may be graphically displayed as a positive function in a SpeedPlot™, as depicted in Figure 1, where a positive deflection or peak above a selected threshold describes a translational pause or slowing at the exact nucleotide location as defined by the abscissa.
- a threshold for the translational kinetics value of codon pairs that are predicted to cause a translational pause or slowing can be set in accordance with the method and level of stringency desired by one skilled in the art.
- a threshold value can be set to 5, or 3, or 2, or 1.5 standard deviations or more above the mean.
- Typical threshold values can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
- a plurality of thresholds can be applied in the herein-provided methods in segregating codon pairs into a plurality of groups. Each threshold of such a plurality can be a different value selected from 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
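- As an illustration of thresholds expressed in standard deviations above the mean, the sketch below converts translational kinetics values to z scores and assigns each codon pair to a threshold-based group. The function name, the default thresholds and the use of the population standard deviation are assumptions made for this example, which also takes higher values to indicate a greater likelihood of pausing.

```python
import statistics
from bisect import bisect_left


def group_by_threshold(values, thresholds=(1.5, 2.0, 2.5)):
    """Map each codon pair to the number of thresholds its z score exceeds.

    values     -- map of codon pair -> translational kinetics value
    thresholds -- cut-offs in standard deviations above the mean
    Group 0 means no threshold is exceeded; the highest group number
    contains the pairs most likely to cause a pause or slowing.
    """
    data = list(values.values())
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    if sd == 0:
        raise ValueError("all values are identical; z scores are undefined")
    cuts = sorted(thresholds)
    groups = {}
    for pair, value in values.items():
        z = (value - mean) / sd
        # bisect_left counts the thresholds strictly below z.
        groups[pair] = bisect_left(cuts, z)
    return groups
```

- Different replacement percentages can then be applied per group, for example replacing nearly all pairs in the highest group and a smaller fraction of the pairs in lower groups.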
- translational kinetics of an mRNA into polypeptide can be changed to add or retain one or more translational pauses predicted to occur before, after or within an autonomous folding unit of a protein, or between autonomous folding units. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure in the domain prior to further downstream translation and reorganization or reconfiguration of the growing polypeptide or domain. By modifying the translational kinetics of complex multi-domain proteins it may be possible to experimentally alter the time each domain has available to organize.
- Folding of a heterologously-expressed gene having two or more independent domains can be altered by the presence of pause sites between the domains. Refolding studies indicate that the time it takes for a protein to settle into its final configuration may take longer than the translation of the protein. Pausing may allow each domain to partially organize and commit to a particular, independent fold. Other co-translational events, such as those associated with co-factors, protein subunits, protein complexes, membranes, chaperones, secretion, or proteolysis complexes, also can depend on the kinetics of the emerging nascent polypeptide. Pauses can be introduced by engineering one codon pair predicted to cause a translational pause or slowing, or two or more such codon pairs into the sequence to facilitate these co-translational interactions.
- typically a translational pause is preserved, which refers to maintaining the same codon pair for a polypeptide-encoding nucleotide sequence that is expressed in the native host organism, or, when the polypeptide-encoding nucleotide sequence is heterologously expressed, changing the codon pair as appropriate to have a translational kinetics value comparable to or closest to the translational kinetics value of the native codon pair in the native host organism.
- proximal codon pairs can be selected to be replaced in order to introduce a translational pause or slowing.
- one of the 1, 2, 3, 4 or 5 most proximal codon pairs upstream (5' of the desired pause site) or one of the 1, 2, 3, 4 or 5 most proximal codon pairs downstream (3' of the desired pause site) can be chosen for replacement to introduce the translational pause or slowing.
- the selected codon pair for replacement to introduce the translational pause or slowing is the codon pair closest to the originally desired codon pair location of the translational pause or slowing, provided the desired translational pause or slowing can be attained (e.g., 1 codon pair upstream or downstream is typically selected instead of 2 codon pairs upstream or downstream, provided the desired translational pause or slowing can be attained).
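- The nearest-first search described above might be sketched as follows. The function and parameter names are hypothetical, `can_pause` stands in for whatever test decides that a pause can be attained at a given codon pair, and checking the upstream neighbor before the downstream neighbor at equal distance is an arbitrary choice made for this example.

```python
def pick_proximal_site(desired, can_pause, n_pairs, max_offset=5):
    """Return the index of the codon pair closest to `desired` at which a
    pause can be introduced, or None if none is found within range.

    desired    -- 0-based index of the originally desired codon pair
    can_pause  -- callable(index) -> bool (hypothetical feasibility test)
    n_pairs    -- total number of codon pairs in the sequence
    max_offset -- how many codon pairs upstream or downstream to search
    """
    for offset in range(max_offset + 1):
        # At each distance, try upstream (5') first, then downstream (3').
        for index in (desired - offset, desired + offset):
            if 0 <= index < n_pairs and can_pause(index):
                return index
    return None
```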
- a translational pause or slowing can be introduced by selecting a replacement codon pair encoding a conservative amino acid substitution, such as the conservative substitutions shown in Table 1.
- replacement of a proximal codon pair to introduce a translational pause or slowing is preferred over replacement of a codon pair resulting in a change in the encoded amino acid sequence.
- graphical displays of translational kinetics values of one or more proteins can be used to provide information to assist in the selection of a translational pause or slowing to preserve or insert in a redesigned polypeptide-encoding nucleotide sequence.
- graphical displays of translational kinetics values can permit, for example, alignment of homologous proteins from different species and an identification, based on this alignment, of predicted translational pause or slowing sites that are conserved in the aligned proteins.
- Such predicted translational pause or slowing sites can be preserved or inserted in a redesigned polypeptide-encoding nucleotide sequence.
- regions between autonomous folding units in one or more proteins within a particular species can be graphically examined for the presence or absence of predicted pause sites.
- Such graphical display methods can result in an identification of a region between autonomous folding units in which a translational pause or slowing is desirably preserved in a redesigned polypeptide-encoding sequence.
- Methods for identifying and selecting conserved translational pauses can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007.
- the codon pair translation kinetics values can be compared with a database of related gene sequences and conserved pause sites can be identified.
- a synthetic gene can be designed wherein at least one conserved pause site is maintained to provide a synthetic gene with modified translation kinetics.
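- A minimal sketch of identifying conserved pause sites from aligned homologs is given below. It assumes that each homolog has already been reduced to a per-position list of translational kinetics values in a shared alignment coordinate system, with `None` marking gaps; the function name, the threshold and the `min_fraction` parameter are illustrative assumptions.

```python
def conserved_pause_sites(aligned_profiles, threshold=2.0, min_fraction=1.0):
    """Return alignment positions at which a predicted pause is conserved.

    aligned_profiles -- list of equal-length lists of kinetics values,
                        one per homolog, with None at gap positions
    threshold        -- value above which a pause is predicted
    min_fraction     -- fraction of homologs that must show the pause
    """
    n_profiles = len(aligned_profiles)
    length = len(aligned_profiles[0])
    sites = []
    for pos in range(length):
        hits = sum(
            1
            for profile in aligned_profiles
            if profile[pos] is not None and profile[pos] > threshold
        )
        if hits / n_profiles >= min_fraction:
            sites.append(pos)
    return sites
```

- Positions returned by such a comparison are candidates for preservation in a redesigned sequence; lowering `min_fraction` relaxes the requirement that every aligned homolog show the pause.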
- codon pairs are associated with translational pauses, and can thereby influence translational kinetics of an mRNA into polypeptide.
- the methods of changing translational kinetics provided herein will typically be performed by modifying or designing one or more nucleotide sequences encoding a polypeptide to be expressed.
- methods of modifying a gene or designing a synthetic nucleotide sequence encoding the polypeptide encoded by the gene are collectively referred to herein as redesigning a polypeptide-encoding gene sequence or redesigning a polypeptide-encoding nucleotide sequence.
- Also included in the various embodiments provided herein are redesigned gene sequences encoding polypeptides that are not identical to the original gene.
- a hydrolysis enzyme-encoding DNA sequence wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type hydrolysis polypeptide sequence as set forth in SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194.
- At least 1, 2 or 3 codon pairs of a polynucleotide sequence encoding the hydrolysis enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the DNA sequence is optimized for expression in S. cerevisiae, E. coli, P. pastoris, K. lactis or Z. mobilis.
- a hydrolysis enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in the wild-type nucleotide sequence and which encode a functional domain of the hydrolysis enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for functional domains are known in the art.
- the replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. That is, the embodiments in which one or more codon pairs encoding amino acids of a functional domain of one of the encoded polypeptides provided herein have been replaced include embodiments in which the nucleotide sequence encoding the functional domain is changed to increase the predicted speed of translation of the functional domain. As provided herein, incomplete translation, improper folding, or other protein expression shortcomings can result from the presence of one or more translational pauses in a heterologously-expressed polypeptide. In some embodiments, removal of one or more of these pauses can increase the speed of translation of the functional domain, and thereby increase the quantity of protein produced and/or increase the amount of stable, properly folded, active, and/or soluble protein produced.
- the replacement codons (i.e., the codons added as replacements for the wild type codons) are typically predicted to be less likely to cause a translational pause.
- the replacement codon can have a translational kinetics value in the heterologous host organism that is 95%, 90%, 85%, 80%, 75%, 70%, or less, than the translational kinetics value of the wild type codon pair when expressed in the heterologous host organism.
- the replacement codon is selected to have a translational kinetics value similar to the translational kinetics value of the wild type codon pair in the native organism.
- the z score of at least one replacement codon pair when expressed in the heterologous host organism can be no more than 250%, 200%, 150%, 125% or 100% of the z score for the wild type codon pair when expressed in the native organism.
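- The two preceding criteria (a replacement whose value is similar to that of the wild type codon pair in the native organism, and a cap expressed as a percentage of the native z score) can be combined in a short selection routine. The sketch below is illustrative only: the function name, the candidate z scores and the default cap are assumptions, and it presumes a positive native z score.

```python
def closest_to_native(candidate_host_z, native_z, max_percent=150):
    """Pick the candidate codon pair whose z score in the heterologous
    host is closest to the native z score, subject to a percentage cap.

    candidate_host_z -- map of candidate codon pair -> z score in the host
    native_z         -- z score of the wild type pair in the native organism
                        (assumed positive in this sketch)
    max_percent      -- candidates above this percentage of native_z
                        are excluded
    Returns None when no candidate satisfies the cap.
    """
    cap = native_z * max_percent / 100.0
    allowed = {pair: z for pair, z in candidate_host_z.items() if z <= cap}
    if not allowed:
        return None
    return min(allowed, key=lambda pair: abs(allowed[pair] - native_z))
```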
- a hydrolysis enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between domains of the hydrolysis enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the domains are known in the art and are described in detail below.
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the cellulose binding domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for cellulose binding domains are known in the art.
- the cellulose binding domain includes at least amino acids 35-58, 30-61 or 27-62.
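- To relate an amino acid range such as those recited for the domains herein to the codon pairs that encode it, the sketch below lists the codon pairs lying entirely within a domain. It assumes 1-based amino acid numbering that starts at the first codon of the supplied coding sequence; the function name and the short example sequence are hypothetical.

```python
def codon_pairs_in_domain(cds, first_aa, last_aa):
    """Return [(aa_position, six_nt_codon_pair), ...] for codon pairs that
    lie entirely within amino acids first_aa..last_aa (1-based, inclusive).

    Each pair is labelled by the position of its first amino acid, so a
    domain of n residues contains n - 1 codon pairs.
    """
    pairs = []
    for aa in range(first_aa, last_aa):
        start = (aa - 1) * 3  # 0-based nucleotide index of the first codon
        pairs.append((aa, cds[start:start + 6]))
    return pairs


# Toy sequence encoding Met-Lys-Glu-Leu followed by a stop codon.
example_pairs = codon_pairs_in_domain("ATGAAAGAACTGTAA", 2, 4)
```

- Under this numbering, a domain spanning amino acids 35-58 contains 23 codon pairs, the first of which starts at nucleotide 103 of the coding sequence (1-based).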
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art.
- the glycosyl hydrolase domain includes at least amino acids 124-437, 115-450 or 107-471.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art.
- the Cu-oxidase-3 domain includes at least amino acids 29-151 or 28-152.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art.
- the Cu-oxidase domain includes at least amino acids 162-304 or 161-305.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art.
- the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
- a lignin peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the haem peroxidase domain of the lignin peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for haem peroxidase domains are known in the art.
- the haem peroxidase domain includes at least amino acids 47-286 or 46-287.
- a Mn-dependent peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the haem peroxidase domain of the Mn-dependent peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for haem peroxidase domains are known in the art.
- the haem peroxidase domain includes at least amino acids 46-283 or 45-284.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art.
- the Cu-oxidase-3 domain includes at least amino acids 91-211 or 90-212.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art.
- the Cu-oxidase domain includes at least amino acids 217-366 or 216-367.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art.
- the Cu-oxidase-2 domain includes at least amino acids 427-569 or 426-570.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art.
- the Cu-oxidase-3 domain includes at least amino acids 30-152 or 29-153.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art.
- the Cu-oxidase domain includes at least amino acids 163-305 or 162-306.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art.
- the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art.
- the Cu-oxidase-3 domain includes at least amino acids 30-152 or 29-153.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art.
- the Cu-oxidase domain includes at least amino acids 163-305 or 162-306.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art.
- the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the cellulose binding domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for cellulose binding domains are known in the art.
- the cellulose binding domain includes at least amino acids 465-493.
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art.
- the glycosyl hydrolase domain includes at least amino acids 1-434.
- an endoglucanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the endoglucanase domain of the endoglucanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for endoglucanase domains are known in the art.
- the endoglucanase domain includes at least amino acids 32-276.
- a xylanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the xylanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art. In the case of the xylanase of SEQ ID NO: 193, the glycosyl hydrolase domain includes at least amino acids 31-221.
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the cellulose binding domain and the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the cellulose binding domain and glycosyl hydrolase domain are described hereinabove.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-3 domain are described hereinabove.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase-3 and the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase domain are described hereinabove.
- a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase and the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-2 domain are described hereinabove.
- a lignin peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the haem peroxidase domain of the lignin peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the haem peroxidase domain are described hereinabove.
- a Mn-dependent peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the haem peroxidase domain of the Mn-dependent peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the haem peroxidase domain are described hereinabove.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-3 domain are described hereinabove.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase domain are described hereinabove.
- the conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-2 domain are described hereinabove.
- a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the cellulose binding domain and the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the cellulose binding domain and glycosyl hydrolase domain are described hereinabove.
- an endoglucanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the endoglucanase domain of the endoglucanase enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the endoglucanase domain are described hereinabove.
- a xylanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the glycosyl hydrolase domain of the xylanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
- the conserved amino acid sequence pattern and domain boundaries for the glycosyl hydrolase domain are described hereinabove.
- polypeptide-encoding nucleotide sequence provided herein to modify the translational kinetics of the polypeptide-encoding nucleotide sequence, where the polypeptide-encoding nucleotide sequence is altered such that one or more codon pairs have a decreased likelihood of causing a translational pause or slowing relative to the unaltered polypeptide-encoding nucleotide sequence.
- one or more nucleotides of a polypeptide-encoding nucleotide sequence can be changed such that a codon pair containing the changed nucleotides has a translational kinetics value indicative of a decreased likelihood of causing a translational pause or slowing relative to the unchanged polypeptide-encoding nucleotide sequence.
- the redesigned polypeptide-encoding nucleotide sequence need not possess a high degree of identity to the polypeptide-encoding nucleotide sequence of the original gene; in some embodiments, the redesigned polypeptide-encoding nucleotide sequence will have at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% nucleotide identity with the polypeptide-encoding nucleotide sequence of the original gene.
- an original gene refers to a gene for which codon pair refinement is to be performed; such original genes can be, for example, wild type genes, native genes, naturally occurring mutant genes, other mutant genes such as site-directed mutant genes or engineered or completely synthetic genes.
- the polynucleotide sequence will be completely synthetic, and will bear much lower identity with the original gene, e.g., no more than 90%, 80%, 70%, 60%, 50%, 40%, or lower.
- the resulting sequence can be designed to: (1) reduce or eliminate translational problems caused by inappropriate ribosome pausing, such as those caused by over-represented codon pairs or other codon pairs with translational values predictive of a translational pause; (2) have codon usage refined to avoid over-reliance on rare codons; (3) reduce in number or remove particular restriction sites, splice sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism; or (4) have controlled RNA secondary structure to avoid detrimental translational termination effects, translation initiation effects, or RNA processing, which can arise from, for example, RNA self-hybridization.
- this sequence also can be designed to avoid oligonucleotides that mis-hybridize, resulting in genes that can be assembled from refined oligonucleotides that by thermodynamic necessity only pair up in the desired manner, using methods known in the art, as exemplified in U.S. Patent Publication No. 2005/0106590, which is hereby incorporated by reference in its entirety.
- polypeptide-encoding nucleotide sequence it is not possible to modify the polypeptide-encoding nucleotide sequence to suitably modify the translational kinetics of the mRNA into polypeptide without modifying the amino acid sequence of the encoded polypeptide.
- an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
- the change is preferably predicted to not substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
- Such non-identical polypeptides can vary by containing one or more insertions, deletions and/or mutations.
- polypeptide sequence can vary according to the purpose of the change; typically, such a change results in a polypeptide that is at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the wild type polypeptide sequence.
- the sequence of the polynucleotide can be generated, optionally in conjunction with optimization of a plurality of parameters where one such parameter can be codon pair usage, where the resultant polynucleotide can be prepared by assembly of a plurality of oligonucleotides sufficiently small to be synthesized by known oligonucleotide synthetic methods.
- Methods known in the art for optimizing multiple parameters in synthetic nucleotide sequences can be applied to optimizing the parameters recited in the present claims. Such methods may advantageously include those exemplified in U.S. Patent App. Publication No. 2005/0106590, U.S. Patent App. Publication No. 2007/0009928, and R. H.
- an exemplary method for generating a sequence can also include dividing the desired sequence into a plurality of partially overlapping segments; optimizing the melting temperatures of the overlapping regions of each segment to disfavor hybridization to the overlapping segments which are non-adjacent in the desired sequence; allowing the overlapping regions of single stranded segments which are adjacent to one another in the desired sequence to hybridize to one another under conditions which disfavor hybridization of non-adjacent segments; and filling in, ligating, or repairing the gaps between the overlapping regions, thereby forming a double-stranded DNA with the desired sequence.
- This process can be performed manually or can be automated, e.g., in a general purpose digital computer.
- the search of possible codon assignments is mapped into an anytime branch and bound computerized algorithm developed for biological applications.
- a synthetic nucleotide sequence for the polynucleotides provided herein, where the synthetic nucleotide sequence also is typically designed to have desirable translational kinetics properties, such as the removal of some or all codon pairs predicted to result in a translational pause or slowing.
- Such design methods include determining a set of partially overlapping segments with optimized melting temperatures, and determining the translational kinetics of the synthetic sequence, where if it is desired to change the translational kinetics of the synthetic gene, the sequences of the overlapping segments are modified and refined in order to approximate the desired translational kinetics while still possessing acceptable hybridization properties. In some embodiments, this process is performed iteratively.
- a criterion is established for selecting codon pairs having high translational kinetics values to be replaced with codon pairs having lower translational kinetics values unless a codon pair of this group is the site of a planned pause.
- the top 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% of codon pairs ranked by translational kinetics values can be replaced by codon pairs having lower translational kinetics values, such as a translational kinetics value below a user-defined level that can be, for example, a translational kinetics value equal to or below the translational kinetics values of codon pairs not in the top selected percentage, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
- all codon pairs above a user-selected translational kinetics value such as more than 5, 4.5, 4, 3.5, 3, 2.5 or 2 standard deviations above the mean translational kinetics value can be replaced by codon pairs having lower translational kinetics values, such as a translational kinetics value below a user-defined level that can be, for example, a translational kinetics value that is 4, 3.5, 3, 2.5, 2, 1.5 or 1 standard deviations less than the mean translational kinetics value, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
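The two selection criteria above (a top percentage, or a cutoff in standard deviations above the mean) can be sketched as follows. This is only an illustrative sketch: the function names, the choice to rank codon pairs within a single sequence, and the representation of each codon pair by a z score at a position index are assumptions, not details taken from the text.

```python
import math

def select_top_fraction(z_scores, fraction, planned_pauses=frozenset()):
    """Positions of the highest-scoring codon pairs, skipping planned pauses.

    z_scores: one translational kinetics value per codon-pair position,
    expressed in standard deviations above the mean (assumed input format).
    """
    n_top = math.ceil(len(z_scores) * fraction)
    ranked = sorted(range(len(z_scores)), key=lambda i: z_scores[i], reverse=True)
    # Planned pause sites stay in place even if they rank in the top group.
    return sorted(i for i in ranked[:n_top] if i not in planned_pauses)

def select_above_cutoff(z_scores, cutoff, planned_pauses=frozenset()):
    """Positions whose value exceeds `cutoff` standard deviations above the mean."""
    return [i for i, z in enumerate(z_scores)
            if z > cutoff and i not in planned_pauses]

# Hypothetical values for a ten-pair stretch; position 9 is a planned pause.
scores = [0.1, 5.2, -1.0, 3.4, 0.0, 2.9, -0.3, 1.1, 0.4, 4.0]
print(select_top_fraction(scores, 0.2, planned_pauses={9}))   # [1]
print(select_above_cutoff(scores, 3.0, planned_pauses={9}))   # [1, 3]
```

The positions returned are candidates for replacement; choosing the replacement codon pairs themselves is a separate step.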
- polynucleotide sequence design methods provided herein can be employed where a plurality of properties of the polynucleotide sequences can be refined in addition to codon pair usage properties, where such properties can include, but are not limited to, melting temperature gap between oligonucleotides of synthetic gene, average codon usage, average codon pair chi-squared (e.g., z score), worst codon usage, worst codon pair (e.g., z score), maximum usage in adjacent codons, Shine-Dalgarno sequence (for E. coli expression), occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, user-prohibited sequences (e.g., other restriction sites), codon usage of a specific codon above user-specified limit, and out-of-frame stop codons (framecatchers).
- additional properties that can be considered in a process of designing a polynucleotide sequence include, but are not limited to, occurrences of RNA splice sites, occurrences of polyA sites, and occurrence of ribosome binding sequence.
- a process of designing a polynucleotide sequence can include constraints including, but not limited to, minimum melting temperature gap between oligonucleotides of synthetic gene, minimum average codon usage, maximum average codon pair chi-squared (z score), minimum absolute codon usage, maximum absolute codon pair (z score), minimum maximum usage in adjacent codons, no Shine-Dalgarno sequence (for E.
- additional constraints can include, but are not limited to, minimum occurrences of RNA splice sites, minimum occurrences of polyA sites, and occurrence of ribosome binding sequence.
- a process of designing a polynucleotide sequence can include preferences including, but not limited to, prefer high average codon usage, prefer low average codon pair chi-squared, prefer larger melting temperature gap, prefer more out of frame stop codons (framecatchers), and optionally prefer evenly distributed codon usage.
- Any of a variety of nucleotide sequence refinement/optimization methods known in the art can be used to refine the polynucleotide sequence according to the codon pair usage properties, and according to any of the additional properties specifically described above, or other properties that are refined in nucleotide sequence redesign methods known in the art.
- a branch and bound method is employed to refine the polynucleotide sequence according to codon pair usage properties and at least one additional property, such as codon usage.
- the methods provided herein can further include analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that stop codons are added to at least one said frame shift.
- the generating step further includes analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that one or more stop codons in one, two or three reading frames are added downstream of polypeptide-encoding region of the nucleotide sequence.
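Counting out-of-frame stop codons ("framecatchers") in a candidate sequence can be sketched as below. The helper names and the example sequences are hypothetical, and only the +1 and +2 frames of the given strand are examined; the text does not specify how frame shifts are to be analyzed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def out_of_frame_stops(seq):
    """Count stop codons in the +1 and +2 reading frames of `seq`."""
    seq = seq.upper()
    counts = {}
    for shift in (1, 2):
        codons = (seq[i:i + 3] for i in range(shift, len(seq) - 2, 3))
        counts[shift] = sum(codon in STOP_CODONS for codon in codons)
    return counts

def prefer_more_framecatchers(candidates):
    """Pick the candidate with the most out-of-frame stop codons."""
    return max(candidates, key=lambda s: sum(out_of_frame_stops(s).values()))

# Two synonymous encodings of Met-Leu-Lys-stop (CTG and CTC both encode Leu).
with_catcher = "ATGCTGAAATAG"     # +1 frame reads TGC TGA AAT
without_catcher = "ATGCTCAAATAG"  # +1 frame reads TGC TCA AAT
print(out_of_frame_stops(with_catcher))  # {1: 1, 2: 0}
print(prefer_more_framecatchers([without_catcher, with_catcher]))
```

In a full design process this count would be one preference among several, weighed against codon pair and codon usage properties.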
- methods for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
- Also provided herein are methods for redesigning a polypeptide-encoding gene for expression in a host organism by providing a first data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a second data set representative of at least one additional desired property of the synthetic gene, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, both (i) codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the first data set, and (ii) nucleotides that provide a desired property, with reference to the second data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
- a branch and bound method is employed to refine the polypeptide-encoding nucleotide sequence according to codon pair usage properties of the first data set and according to the properties of the second data set.
- the second data set contains codon preferences representative of codon usage by the host organism, including the most common codons used by the host organism for a given amino acid.
- a hydrolysis enzyme-encoding DNA sequence wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type hydrolysis enzyme polypeptide sequence as set forth in the sequence listing.
- the polynucleotide provided herein is adapted for expression in a heterologous host organism.
- a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism.
- the host organism is not human, E. coli or S. cerevisiae.
- At least 1, 2 or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
- the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
- a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than a designated threshold, wherein a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
- a hydrolysis enzyme-encoding DNA sequence having at least a 75% sequence identity with an original hydrolysis enzyme polypeptide sequence as set forth in the sequence listing and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organisms are selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E.
- the methods provided herein can include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold.
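A confirmation pass of this kind can be sketched as a scan over adjacent codons. The table of z scores, the function name, and the treatment of codon pairs missing from the table as having a value of zero are all assumptions made for illustration.

```python
def pairs_over_threshold(seq, z_scores, threshold):
    """List (position, codon pair, value) for every pair above `threshold`.

    z_scores maps a six-nucleotide codon pair to its translational kinetics
    value in standard deviations above the mean; absent pairs count as 0.0
    (an assumption of this sketch).
    """
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    hits = []
    for k in range(len(codons) - 1):
        pair = codons[k] + codons[k + 1]
        value = z_scores.get(pair, 0.0)
        if value > threshold:
            hits.append((k, pair, value))
    return hits

# Hypothetical values, not measured data.
toy_z = {"ATGGCT": 0.5, "GCTGCC": 3.2}
print(pairs_over_threshold("ATGGCTGCCAAA", toy_z, threshold=2.0))
# [(1, 'GCTGCC', 3.2)]
```

An empty result indicates that the candidate sequence passes the check at the chosen threshold.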
- the likelihood that a particular codon pair will cause translational pausing or slowing in an organism can be represented by a translational kinetics value.
- the translational kinetics value can be expressed in any of a variety of manners in accordance with the guidance provided herein. In one example, a translational kinetics value can be expressed in terms of the mean translational kinetics value and the corresponding standard deviation for all codon pairs in an organism.
- the translational kinetics value for a particular codon pair can be expressed in terms of the number of standard deviations that separate the translational kinetics value of the codon pair from the mean translational kinetics value.
- a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
- the methods provided herein also include generating a candidate nucleotide sequence according to codon usage.
- As is known in the art, different organisms can have different preferences for the three-nucleotide codon sequence encoding a particular amino acid. As a result, translation can often be improved by using the most common three-nucleotide codon sequence encoding a particular amino acid.
- some methods provided herein also include generating a candidate nucleotide sequence such that codon utilization is non-randomly biased in favor of codons most commonly used by the host organism. Codon usage preferences are known in the art for a variety of organisms and methods for selecting the more commonly used codons are well known in the art.
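Biasing a candidate sequence toward the host's most common codons can be sketched as a simple back-translation. The usage table below is a hypothetical toy table covering three amino acids, not data for any real organism, and the function name is illustrative.

```python
def back_translate_most_common(protein, usage):
    """Encode each residue with its most frequently used codon.

    usage maps an amino acid letter to {codon: relative frequency}.
    """
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)

# Hypothetical relative codon frequencies for illustration only.
TOY_USAGE = {
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.2, "GCC": 0.5, "GCA": 0.3},
    "K": {"AAA": 0.7, "AAG": 0.3},
}
print(back_translate_most_common("MAK", TOY_USAGE))  # ATGGCCAAA
```

A sequence produced this way considers each codon in isolation, so it may still contain codon pairs predicted to cause a translational pause.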
- the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize the predicted translational kinetics. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, the conflict is resolved by selecting the nucleotide sequence predicted to be translated more rapidly, for example, due to fewer predicted translational pauses.
- the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize codon pair usage preferences. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, codon pair usage will be accorded more weight in order to resolve the conflict between the more than one possible nucleotide sequences.
- the methods provided herein can include identifying at least one instance of a conflict between selecting common codons and avoiding codon pairs predicted to cause a translational pause; in such instances, the conflict is resolved in favor of avoiding codon pairs predicted to cause a translational pause.
- Some embodiments provided herein include generating a candidate polynucleotide sequence encoding the polypeptide sequence, the candidate polynucleotide sequence having a non-random codon pair usage, such that the codon pairs encoding any particular pair of amino acids have the lowest translational kinetics values.
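One way to choose synonymous codons so that codon pairs have low translational kinetics values is sketched below. It uses dynamic programming over codon choices rather than the branch and bound method mentioned elsewhere in this description, and it minimizes the sum of pair values, which is an assumed objective. The codon table and pair values are hypothetical toy data.

```python
def lowest_pair_score_sequence(protein, codons_for, pair_score):
    """Choose one codon per residue so the summed codon-pair score is minimal.

    codons_for: amino acid -> list of synonymous codons.
    pair_score: six-nucleotide codon pair -> value (absent pairs count as 0.0).
    Assumes `protein` is non-empty.
    """
    # best[c] = (total score, codon list) of the best encoding ending in codon c
    best = {c: (0.0, [c]) for c in codons_for[protein[0]]}
    for aa in protein[1:]:
        step = {}
        for c in codons_for[aa]:
            step[c] = min(
                ((score + pair_score.get(prev + c, 0.0), path + [c])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    total, path = min(best.values(), key=lambda item: item[0])
    return "".join(path), total

toy_codons = {"M": ["ATG"], "A": ["GCT", "GCC"], "K": ["AAA", "AAG"]}
toy_scores = {"ATGGCT": 2.5, "ATGGCC": -0.5, "GCCAAA": 1.0,
              "GCCAAG": -1.0, "GCTAAA": -3.0}
print(lowest_pair_score_sequence("MAK", toy_codons, toy_scores))
# ('ATGGCCAAG', -1.5)
```

Because every choice is synonymous, the encoded amino acid sequence is unchanged; additional properties such as codon usage would need to be folded into the score or handled in a later refinement step.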
- the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the encoded amino acid sequence is not altered.
- the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the three dimensional structure of the encoded polypeptide is not substantially altered.
- the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that no more than conservative amino acid changes are made to the encoded polypeptide.
- the methods provided herein can further include a step of refining or altering the candidate polynucleotide sequence in accordance with a second nucleotide sequence property to be refined.
- the methods further include generating or refining a candidate polynucleotide sequence encoding a polypeptide sequence such that the candidate polynucleotide sequence has a non-random codon usage, where the most common codons used by the host organism are over-represented in the candidate polynucleotide sequence.
- the methods can include refining or altering the candidate polynucleotide sequence in accordance with any of a variety of additional properties provided herein, including but not limited to, melting temperature gap between oligonucleotides of synthetic gene, Shine-Dalgarno sequence, occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, or any other user-prohibited sequences. Further, any of a variety of combinations of these properties can be additionally included in the nucleotide sequence refinement methods provided herein.
- the method provided herein can further include an evaluation step in which after the candidate polynucleotide sequence is altered, the sequence is compared with at least a portion of a data set of a property against which the sequence was refined.
- the candidate nucleotide sequence can be compared to each property considered in the refinement, and, if the values for all properties are deemed to be acceptable or desired, no further sequence alteration is required. If the values for fewer than all properties are deemed to be acceptable or desired, the candidate nucleotide sequence can be subjected to further sequence alteration and evaluation.
- sequence alteration steps of methods provided herein can be performed iteratively. That is, one or more steps of altering the nucleotide sequence can be performed, and the candidate nucleotide sequence can be evaluated to determine whether or not further sequence alteration is necessary and/or desirable. These steps can be repeated until values for all properties are deemed to be acceptable or desired, or until no further improvement can be achieved.
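The alternate-then-evaluate loop described above can be sketched generically. Everything here is an assumption for illustration: the `evaluate` callback returns the number of property values that are not yet acceptable, `propose` yields altered candidates, and the toy synonym table and prohibited codon pairs are invented.

```python
def refine(seq, evaluate, propose, max_rounds=100):
    """Alter `seq` repeatedly until it is acceptable or stops improving."""
    penalty = evaluate(seq)
    for _ in range(max_rounds):
        if penalty == 0:          # all properties acceptable
            break
        best = min(propose(seq), key=evaluate, default=None)
        if best is None or evaluate(best) >= penalty:
            break                 # no further improvement can be achieved
        seq, penalty = best, evaluate(best)
    return seq, penalty

# Toy setup: sequences are tuples of codons; alterations are synonymous swaps.
SYNONYMS = {"GCT": ["GCC"], "GCC": ["GCT"], "AAA": ["AAG"], "AAG": ["AAA"]}
PROHIBITED = {("GCT", "GCC"), ("GCC", "AAA")}

def count_prohibited(codons):
    return sum(pair in PROHIBITED for pair in zip(codons, codons[1:]))

def single_swaps(codons):
    for i, codon in enumerate(codons):
        for alt in SYNONYMS.get(codon, []):
            yield codons[:i] + (alt,) + codons[i + 1:]

start = ("ATG", "GCT", "GCC", "AAA")
print(refine(start, count_prohibited, single_swaps))
# (('ATG', 'GCT', 'GCT', 'AAA'), 0)
```

A greedy loop like this can stall in a local optimum; that corresponds to the "until no further improvement can be achieved" stopping condition rather than a guarantee of an acceptable sequence.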
- the methods and sequences provided herein include determination and use of translational kinetics values for codon pairs. As provided herein, such a translational kinetics value can be calculated and/or empirically measured, and the final translational kinetics value used in graphical displays and methods of predicting translational kinetics can be a refined value resultant from two or more types of codon pair translational kinetics information.
- codon pair translational kinetics information that can be used in refining or replacing a translational kinetics value for a codon pair include, for example, values of observed versus expected codon pair frequencies in a particular organism, normalized values of observed versus expected codon pair frequencies in a particular organism, the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species, the degree to which observed versus expected codon pair frequency values are conserved at predicted pause sites such as boundaries between autonomous folding units in related proteins across two or more species, the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, and empirical measurement of translational kinetics for a codon pair.
- the values of observed versus expected codon pair frequencies in a host organism can be determined by any of a variety of methods known in the art for statistically evaluating observed occurrences relative to expected occurrences. Regardless of the statistical method used, this typically involves obtaining codon sequence data for the organism, for example, on a gene-by-gene basis. In some embodiments, the analysis is focused only on the coding regions of the genome. Because the analysis is a statistical one, a large database is preferred. Initially, the total number of codons is determined and the number of times each of the 61 non-terminating codons appears is determined.
- the expected frequency of each of the 3721 ($61^2$) possible non-terminating codon pairs is calculated, typically by multiplying together the frequencies with which each of the component codons appears.
- This frequency analysis can be carried out on a global basis, analyzing all of the sequences in the database together; however, it is typically done on a local basis, analyzing each sequence individually. This will tend to minimize the statistical effect of an unusually high proportion of rare codons in a sequence.
- the expected number of occurrences of each codon pair is calculated by, for example, multiplying the expected frequency by the number of pairs in the sequence. This information can then be added to a global table, and each next succeeding sequence can be analyzed in like manner.
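The per-sequence ("local") tally described above can be sketched as follows. The sketch assumes each input is an in-frame coding sequence with its terminating codon already removed; the function name and the use of `Counter` objects as the global table are illustrative choices.

```python
from collections import Counter

def tally_codon_pairs(sequences):
    """Accumulate observed and expected codon-pair counts over all sequences."""
    observed, expected = Counter(), Counter()
    for seq in sequences:
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        if len(codons) < 2:
            continue
        # Codon frequencies are computed within this sequence only.
        freq = {c: k / len(codons) for c, k in Counter(codons).items()}
        n_pairs = len(codons) - 1
        observed.update(zip(codons, codons[1:]))
        for first, f1 in freq.items():
            for second, f2 in freq.items():
                # Expected count = expected pair frequency x number of pairs.
                expected[(first, second)] += f1 * f2 * n_pairs
    return observed, expected

obs, exp = tally_codon_pairs(["GCTGCTAAA"])
print(obs[("GCT", "GCT")], round(exp[("GCT", "GCT")], 3))  # 1 0.889
```

For each sequence the expected counts sum to the number of codon pairs in that sequence, which is a convenient sanity check on the table.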
- the values of observed versus expected codon pair frequencies are chi-squared values, such as chi-squared 2 (chisq2) values or chi-squared 3 (chisq3) values.
- Methods for calculating chi-squared values can be performed according to any method known in the art, as exemplified in U.S. Patent No. 5,082,767, which is incorporated by reference herein in its entirety.
- $\mathrm{chisq2} = (\mathrm{observed} - \mathrm{expected})^2 / \mathrm{expected}$
- a new value chi-squared 2 (chisq2) can be calculated as follows. For each group of codon pairs encoding the same amino acid pair (i.e., 400 groups), the sums of the expected and observed values are tallied; any non-randomness in amino acid pairs is reflected in the difference between these two values.
- each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal.
- the new chi-squared, chisq2, is evaluated using these new expected values. Calculation methods for removing the contribution to chi-squared of non-randomness in amino acid pairs are known in the art, as exemplified in Gutman and Hatfield, Proc. Natl. Acad. Sci. USA, (1989) 86:3699-3703.
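The rescaling of expected values within each amino acid pair group can be sketched as below. The function name, the toy codon-to-amino-acid map, and the counts are hypothetical; codon pairs with a zero rescaled expectation are simply skipped, which is a choice made for this sketch.

```python
from collections import defaultdict

def chisq2_values(observed, expected, aa_of):
    """Chi-squared per codon pair after rescaling expected counts by group.

    observed, expected: {(codon1, codon2): count}
    aa_of: codon -> amino acid; codon pairs encoding the same amino acid
    pair form one group.
    """
    sum_obs, sum_exp = defaultdict(float), defaultdict(float)
    for pair, e in expected.items():
        group = (aa_of[pair[0]], aa_of[pair[1]])
        sum_exp[group] += e
        sum_obs[group] += observed.get(pair, 0)
    result = {}
    for pair, e in expected.items():
        group = (aa_of[pair[0]], aa_of[pair[1]])
        if sum_exp[group] == 0:
            continue
        # Scale so that expected and observed sums agree within the group.
        e_new = e * sum_obs[group] / sum_exp[group]
        if e_new == 0:
            continue
        result[pair] = (observed.get(pair, 0) - e_new) ** 2 / e_new
    return result

toy_aa = {"GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}
toy_obs = {("GCT", "AAA"): 30, ("GCC", "AAA"): 10,
           ("GCT", "AAG"): 5, ("GCC", "AAG"): 5}
toy_exp = {pair: 10.0 for pair in toy_obs}
print(chisq2_values(toy_obs, toy_exp, toy_aa)[("GCT", "AAA")])  # 24.5
```

In the toy data the Ala-Lys group has 50 observed and 40 expected occurrences, so each expected value is scaled from 10 to 12.5 before the chi-squared term is computed.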
- a new value chi-squared 3 (chisq3) can be calculated. Correction is made only for those dinucleotides formed between adjacent codon pairs; any bias of dinucleotides within codons (codon triplet positions I-II and II-III) will directly affect codon usage and is, therefore, automatically taken into account in the underlying calculations.
- the sums of the expected and observed values are tallied; any non-randomness in dinucleotide pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal.
- the new chi-squared, chisq3, is evaluated using these new expected values.
- Dinucleotide bias represents a smaller effect in yeast, and only a very minor one in E. coli.
- Although the predominant dinucleotide bias in human is the well-known CpG deficit, other dinucleotides are also very highly biased. For example, there is a deficit of TA, as well as an excess of TG, CA and CT. Overall, the deficit of CpG contributes only 35% of the total dinucleotide bias in the human database, and 17% in yeast.
- the values of observed versus expected codon pair frequencies in a host organism herein can be normalized. Normalization permits different sets of values of observed versus expected codon pair frequencies to be compared by placing these values on the same numerical scale. For example, normalized codon pair frequency values can be compared between different organisms, or can be compared for different codon pair frequency value calculations within a particular organism (e.g., different calculations based on input sequence information or based on different calculations such as chisq1 or chisq2 or chisq3). Typically, normalization results in codon pair frequency values that are described in terms of their mean and standard deviation from the mean.
- An exemplary method for normalizing codon pair frequency values is the calculation of z scores.
- the z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
- the mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will have a mean of zero and a standard deviation of one.
- the z score transformation can be especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. z scores are especially informative when the distribution to which they refer is normal. In a normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.
- An exemplary method for determining z scores for codon pair chi-squared values is as follows: First, a list of all 3721 possible non-terminating codon pairs is generated. Second, for the $i$-th codon pair, the $i$-th chi-squared value is calculated, where the $i$-th chi-squared value is denoted $c_i$. The chi-squared value, $c_i$, is given the sign of (observed - expected), so that over-represented codon pairs are assigned a positive $c_i$ and under-represented codon pairs are assigned a negative $c_i$.
- $c_i = \operatorname{sgn}(\mathrm{obs}_i - \mathrm{exp}_i) \cdot (\mathrm{obs}_i - \mathrm{exp}_i)^2 / \mathrm{exp}_i$
- $m = \left(\sum_i c_i\right) / 3721$, where $m$ is the mean of the chi-squared values and $\sum_i$ means sum over $i$.
- the standard deviation of the chi-squared values is calculated, where the standard deviation is denoted $s$.
- a z score is calculated by subtracting the mean then dividing by the standard deviation, wherein the $i$-th z score is denoted $z_i$.
- the formula for the z score is: $z_i = (c_i - m) / s$
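The z score steps above can be sketched in a few lines. The function names and toy counts are hypothetical, and the sketch uses the population standard deviation (`statistics.pstdev`), which is an assumption because the text does not say which form of standard deviation is intended.

```python
from statistics import mean, pstdev

def signed_chisq(obs, exp):
    """Chi-squared term carrying the sign of (observed - expected)."""
    diff = obs - exp
    sign = (diff > 0) - (diff < 0)
    return sign * diff ** 2 / exp

def codon_pair_z_scores(observed, expected):
    """Map each codon pair to z = (c - m) / s over all pairs with exp > 0."""
    c = {pair: signed_chisq(observed.get(pair, 0), e)
         for pair, e in expected.items() if e > 0}
    values = list(c.values())
    m = mean(values)
    s = pstdev(values)  # zero if all values are equal; not handled here
    return {pair: (ci - m) / s for pair, ci in c.items()}

# Hypothetical counts for three codon pairs, labeled abstractly.
toy_observed = {"pair1": 20, "pair2": 5, "pair3": 10}
toy_expected = {"pair1": 10.0, "pair2": 10.0, "pair3": 10.0}
z = codon_pair_z_scores(toy_observed, toy_expected)
print(z["pair1"] > 0 > z["pair2"])  # True
```

With a full table the mean would be taken over all 3721 non-terminating codon pairs, as in the formula for $m$; the resulting z scores have a mean of zero and a standard deviation of one.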
- provided herein are methods of refining the predictive capability of a translational kinetics value of a codon pair in a host organism by providing an initial translational kinetics value based on the value of observed codon pair frequency versus expected codon pair frequency for a codon pair in a host organism, providing additional translational kinetics data for the codon pair in the host organism, and modifying the initial translational kinetics value according to the additional codon pair translational kinetics data to generate a refined translational kinetics value for the codon pair in the host organism.
- the translational kinetics data that can be used to refine translational kinetics values and methods of modifying translational kinetics values according to such additional translational kinetics data to generate a refined translational kinetics value for a codon pair in a host organism are provided below.
- translational kinetics data that can be used to refine translational kinetics values are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair.
- Recurrence-based refinement of translational kinetics values is based on the investigation of multiple polypeptide-encoding nucleotide sequences to determine whether or not there are multiple occurrences of either codon pairs or predicted translational kinetics values in those sequences.
- Recurrence-based refinement of translational kinetics can be performed using any of a variety of known sequence comparison methods consistent with the examples provided herein. For purposes of exemplification, and not for limitation, the following example of recurrence-based refinement of translational kinetics is provided.
- the predicted translational kinetics value for a codon pair can be refined according to the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species.
- related proteins are proteins having homologous amino acid sequences and/or similar three dimensional structures.
- Related proteins having homologous amino acid sequences will typically have at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% sequence identity.
- Related proteins having similar three dimensional structures will typically share similar secondary structure topology and similar relative positioning of secondary structural elements; exemplary related proteins having similar three dimensional structures are members of the same SCOP-classified Family (see, e.g., Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.).
- the observed versus expected codon pair frequency values for any given codon pair can vary from species to species. However, as provided herein, evolutionarily related proteins in different species will typically conserve some or all translational pause or slowing sites. Based on this, an observed conservation of one or more predicted translational pause or slowing sites in evolutionarily related proteins of different species can confirm or increase the likelihood that a translational pause or slowing site is a functional translational kinetics signal.
- the codon pair located at the position on a protein that is confirmed as, or considered to have an increased likelihood of, containing an actual translational pause or slowing can itself be confirmed as being, or considered to have an increased likelihood of being, a functional translational kinetics signal.
- a codon pair located at a position on a protein that is confirmed as not containing, or considered to have a decreased likelihood of containing, an actual translational pause or slowing, can itself be confirmed as not acting, or considered to have a decreased likelihood of acting, as a functional translational kinetics signal.
- initially predicted translational kinetics data e.g., data based on values of observed codon pair frequency versus expected codon pair frequency
- the predicted translational kinetics value for a codon pair can be refined according to the presence of the codon pair at a location predicted by methods other than codon pair frequency methods to contain a translational pause or slowing site.
- a predicted location is a boundary location between autonomous folding units of a protein.
- translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a secondary structural element of a protein and/or a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure by the nascent protein prior to further downstream translation, and thereby allowing each domain to partially organize and commit to a particular, independent fold.
- codon pairs can be associated with translational pauses between autonomous folding units of a protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain.
- the presence of a codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likelihood that the codon pair acts to pause or slow translation.
- predicted translational kinetics data e.g., data based on values of observed codon pair frequency versus expected codon pair frequency
- predicted translational kinetics data can be modified according to the presence of the codon pair at a boundary location between autonomous folding units of a protein, which can increase the likelihood that the codon pair acts to pause or slow translation.
- an over-represented codon pair that is present at a boundary location between autonomous folding units of a protein can be confirmed as acting as a translational pause or slowing codon pair.
- a single observation of the codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likely translational pause or slowing properties of a codon pair.
- typically a plurality of observations will be used to more accurately estimate the translational pause or slowing properties of a codon pair.
- methods of using, for example, predicted boundary locations can be combined with methods that are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair in methods of refining a predicted translational kinetics value for a codon pair.
- a protein present in two or more species can have conserved boundary locations between autonomous folding units of the protein, and recurrent presence of an over-represented codon pair at the boundary locations can confirm the likelihood of an actual translational pause at that boundary location, leading to confirmation, or increased likelihood, that the corresponding codon pair for the respective species acts as a translational pause or slowing codon pair.
- two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of an over-represented codon pair at the boundary locations can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
- Such recurrence-based methods also can be used to confirm or indicate increased likelihood that a non-over-represented codon pair (e.g., an under-represented codon pair or a represented-as-expected codon pair) acts as a translational pause or slowing codon pair.
- two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of a non-over-represented codon pair at the boundary locations, particularly if no over-represented codon pair is present, can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
- Such recurrence-based methods also can be used to confirm or indicate the likelihood that a codon pair, such as an over-represented codon pair, does not act as a translational pause or slowing codon pair.
- two or more proteins of the same species can have boundary locations between autonomous folding units, and consistent absence of a non-over-represented codon pair at the boundary locations can confirm or indicate increased likelihood that the codon pair does not act as a translational pause or slowing codon pair.
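The recurrence-based reasoning above can be illustrated with a short sketch that counts, across several proteins of one species, how many proteins carry a given codon pair at a boundary between autonomous folding units. The data layout (a list of codons plus a list of boundary indices per protein), the function name, and the optional `window` parameter are all hypothetical choices made for this illustration.

```python
from collections import Counter

def boundary_recurrence(proteins, window=0):
    """Count in how many proteins each codon pair occurs at a boundary.

    `proteins` is a list of (codons, boundary_positions) tuples, where a
    boundary position is the index of the first codon of the pair that
    sits at the boundary. `window` widens the search on either side.
    """
    recurrence = Counter()
    for codons, boundaries in proteins:
        seen = set()
        for pos in boundaries:
            for i in range(pos - window, pos + window + 1):
                if 0 <= i < len(codons) - 1:
                    seen.add((codons[i], codons[i + 1]))
        recurrence.update(seen)  # each pair counts once per protein
    return recurrence

# Three toy coding sequences with one annotated boundary each.
proteins = [
    (["ATG", "GCT", "CCG", "AAA"], [1]),
    (["ATG", "TTT", "GCT", "CCG"], [2]),
    (["ATG", "GCT", "GGG"], [0]),
]
recurrence = boundary_recurrence(proteins)
```

A pair with a high count recurs at boundaries in many proteins, while a pair with a count of zero is consistently absent from them; how such counts are turned into a confirmation or an adjusted likelihood is left open here.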
- the predicted translational kinetics value for a codon pair can be refined according to empirical measurement of translational kinetics for a codon pair.
- the influence of a codon pair on translational kinetics can be experimentally measured, and these experimental measurements can be used to refine or replace the predicted translational kinetics values for a codon pair.
- Several methods of experimentally measuring the translational kinetics of a codon pair are known in the art, and can be used herein, as exemplified in Irwin et al., J. Biol. Chem., (1995) 270:22801.
- One such exemplary assay is based on the observation that a ribosome pausing at a site near the beginning of an mRNA coding sequence can inhibit translation initiation by physically interfering with the attachment of a new ribosome to the message, and, thus, the codon pair to be assayed can be placed at the beginning of a polypeptide-encoding nucleotide sequence and the effect of the codon pair on translational initiation can be measured as an indication of the ability of the codon pair to cause a translational pause.
- Another such exemplary assay is based on the fact that the transit time of a ribosome through the leader polypeptide coding region of the leader RNA of the trp operon sets the basal level of transcription through the trp attenuator, and, thus, the codon pair to be assayed can be placed into a trpLep leader polypeptide codon region, and level of expression can be inversely indicative of the translational pause properties of the codon pair, due to a faster translation causing formation of a stem-loop attenuator in the leader RNA, which results in transcriptional attenuation.
- the methods provided herein for calculation of translational kinetics values can be applied to the native organism of the polypeptide of SEQ ID NOS: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194, and also can be applied to a selected organism in which the polypeptide of SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194, or a modification thereof, is to be heterologously expressed.
- the nucleotide sequence information of an organism can be used to calculate chi-squared values in accordance with the methods provided herein, and the translational kinetics values can be based on these chi-squared values as well as on additional translational kinetics information provided herein, including, but not limited to, codon pairs conserved in domain boundaries and empirically measured translational kinetics for a codon pair.
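As a rough sketch of how observed and expected codon pair counts might be derived from an organism's coding sequences, the following Python counts adjacent codon pairs and estimates the expected count of each pair from the individual codon frequencies. The independence model used for the expected count (total pairs times the product of the two codon frequencies) is an assumption of this sketch, as are the function name and the handling of the terminal stop codon.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_pair_stats(coding_sequences):
    """Return {pair: (observed, expected)} for adjacent codon pairs.

    Expected counts assume the two codons of a pair occur independently:
    expected = total_pairs * f(first codon) * f(second codon).
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for seq in coding_sequences:
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        if codons and codons[-1] in STOP_CODONS:
            codons = codons[:-1]  # keep only non-terminating codons
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    stats = {}
    for (a, b), observed in pair_counts.items():
        expected = (total_pairs
                    * (codon_counts[a] / total_codons)
                    * (codon_counts[b] / total_codons))
        stats[(a, b)] = (observed, expected)
    return stats

stats = codon_pair_stats(["ATGGCTGCTGCTTAA"])
```

The resulting observed and expected counts are the inputs needed for a signed chi-squared value per codon pair.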
- the translational kinetics data described herein can be combined in such a manner as to provide a refined translational kinetics value for a codon pair in a host organism.
- Methods of combining predictive data to arrive at a refined predictive value are known in the art and can be used herein.
- an hypothesis H is that a given sequence feature, e.g., a given codon pair, has utility for translational kinetics engineering, e.g., creates a translational pause site.
- H) P(D1 & D2 & D3 & D4
- H) P(D1 & D2 & D3 & D4
- H) P(D1 & D2 & D3 & D4
- P(Di is correct) and P(Di is not correct) can be estimated a priori by the correlation of Di with previous experimental measurements.
- H) are obtained by observing whether or not hypothesis H is consistent with observed data item Di. More complex and powerful Bayesian approaches are also well known to the art. The fully general approach rewrites P(D
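One simple way to combine several data items D1 to D4 into a probability for hypothesis H is a naive Bayes update. The sketch below is offered only as an illustration of that general idea: it assumes the data items are conditionally independent given H, which is an assumption of this sketch and not something the text states, and the function name and the numeric likelihoods are hypothetical.

```python
def posterior_probability(prior, likelihoods):
    """Posterior P(H | D1..Dn), assuming the Di are independent given H.

    `likelihoods` is a list of (P(Di | H), P(Di | not H)) pairs.
    """
    p_h = prior
    p_not_h = 1.0 - prior
    for p_d_given_h, p_d_given_not_h in likelihoods:
        p_h *= p_d_given_h
        p_not_h *= p_d_given_not_h
    # Bayes' theorem: normalize over H and not-H.
    return p_h / (p_h + p_not_h)

# One supporting data item, then two supporting data items.
one_item = posterior_probability(0.5, [(0.8, 0.2)])
two_items = posterior_probability(0.5, [(0.8, 0.2), (0.6, 0.3)])
```

Each additional data item that is more probable under H than under not-H raises the posterior, which matches the intuition that agreement among independent lines of evidence strengthens the prediction for a codon pair.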
- the translational kinetics values for a codon pair can be refined by consideration of, for example, chi-squared value of observed versus expected codon pair frequency and the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, for example, at protein structure domain boundaries.
- An over-represented codon pair which is present with above-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting higher predicted translational pause properties of the codon pair.
- an over-represented codon pair which is present with below-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting lower predicted translational pause properties of the codon pair.
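The above-random and below-random cases can be made concrete with a small sketch that compares how often a codon pair appears at boundary locations with how often it appears overall, and then shifts the pair's value up or down accordingly. The adjustment rule (adding a weighted log2 of the enrichment ratio to a z score) is purely a hypothetical choice for illustration; the text does not prescribe a particular formula.

```python
import math

def boundary_enrichment(boundary_count, boundary_total,
                        overall_count, overall_total):
    """Ratio of a codon pair's frequency at boundaries to its overall frequency."""
    boundary_freq = boundary_count / boundary_total
    overall_freq = overall_count / overall_total
    return boundary_freq / overall_freq

def refine_value(z, enrichment, weight=1.0):
    """Raise z when enrichment > 1 and lower it when enrichment < 1.

    The log2-based shift and the `weight` parameter are illustrative only.
    """
    return z + weight * math.log2(enrichment)

# A pair seen 6 times among 100 boundary pairs and 30 times among 1000 pairs overall.
enrichment = boundary_enrichment(6, 100, 30, 1000)
raised = refine_value(1.5, enrichment)
lowered = refine_value(1.5, 0.5)
```

An enrichment above 1 corresponds to above-random presence at boundaries and yields a higher refined value; an enrichment below 1 yields a lower one.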
- the translational kinetics values for a codon pair can be refined by consideration of, for example, experimentally measured translation step times in one species and the degree to which codon pairs that correspond to measured pause sites in the first species are conserved across homologous proteins in other species, for example, in a multiple sequence alignment.
- when an over-represented codon pair in another species is aligned with above-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting higher predicted translational pause properties of that codon pair in the other species.
- when an over-represented codon pair in another species is aligned with below-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting lower predicted translational pause properties of that codon pair in the other species.
- translational kinetics values for codon pairs can be determined.
- the translational kinetics values can be organized according to the likelihood of causing a translational pause or slowing based on any method known in the art.
- the translational kinetics values for two or more codon pairs, up to all codon pairs, in an organism are determined, and the mean translational kinetics value and associated standard deviation are calculated. Based on this, the translational kinetics value for a particular codon pair can be described in terms of the number of standard deviations by which the translational kinetics value for the particular codon pair differs from the mean translational kinetics value.
- Such a graphical display provides a visual display of the predicted translational influence, including translational pause or slowing for numerous or all codon pairs of a polypeptide-encoding nucleotide sequence.
- This visual display can be used in methods of modifying polypeptide-encoding nucleotide sequences in order to thereby modify the predicted translational kinetics of the mRNA into polypeptide in methods such as those provided herein.
- the graphical displays can be used to identify one or more codon pairs to be modified in a polypeptide-encoding nucleotide sequence.
- the graphical displays can be used in analyzing a polypeptide-encoding nucleotide sequence prior to modifying the polypeptide-encoding nucleotide sequence, or can be used in analyzing a modified polypeptide-encoding nucleotide sequence to determine, for example, whether or not further modifications are desired.
- Methods for creating and using graphical displays can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, which are incorporated by reference herein in their entireties.
- graphical displays as described therein can be created to illustrate the translational kinetics of an original or redesigned polypeptide-encoding nucleotide sequence in the native or a heterologous organism, or to illustrate differences and/or similarities of translational kinetics of a polypeptide-encoding nucleotide sequence in which one or more codon pairs have been modified.
- numerous normalized graphical displays can be created to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence when expressed in two or more different organisms.
- the graphical displays can be created using translational kinetics values based on any of the methods for determining translational kinetics values provided herein or otherwise known in the art, for example, chi-squared as a function of codon pair position, chi-squared 2 as a function of codon position, or chi-squared 3 as a function of codon pair position, translational kinetics values thereof, empirical measurement of translational pause of codon pairs in a host organism, estimated translational pause capability based on observed presence and/or recurrence of a codon pair at a predicted pause site, and variations and combinations thereof as provided herein.
- the exact format of the graphical displays can take any of a variety of forms, and the specific form is typically selected for ease of analysis and comparison between plots.
- the abscissa typically lists the position along the nucleotide sequence or polypeptide sequence, and can be represented by nucleotide position, codon position, codon pair position, amino acid position, or amino acid pair position.
- the ordinate typically lists the translational kinetics value of the codon pair, such as, but not limited to, a translational kinetics value of codon pair frequency, including, but not limited to, the z score of chisq1, the z score of chisq2, the z score of chisq3, the empirically measured value, and the refined translational kinetics value.
- the sequence position can be plotted along the ordinate and the translational kinetics value can be plotted along the abscissa.
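A display of this kind can be approximated even without a plotting library. The following sketch prints one text row per codon pair position, with the position as the row index and a bar whose length is proportional to the translational kinetics value; it is a minimal stand-in for a real plot, and the function name, the bar characters, and the optional `threshold` marker are hypothetical.

```python
def text_profile(values, threshold=None, width=20):
    """Render one line per codon pair position: index, value, and a bar.

    Bars are scaled to the largest absolute value; positions whose value
    is at or above `threshold` are flagged with a trailing asterisk.
    """
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    lines = []
    for pos, v in enumerate(values, start=1):
        bar = "#" * round(abs(v) / scale * width)
        side = "+" if v >= 0 else "-"
        flag = " *" if threshold is not None and v >= threshold else ""
        lines.append(f"{pos:4d} {v:7.2f} {side}{bar}{flag}")
    return lines

# Toy z scores for three consecutive codon pairs.
lines = text_profile([0.5, 3.0, -1.5], threshold=2.0, width=10)
for line in lines:
    print(line)
```

Swapping the roles of rows and bars corresponds to plotting the sequence position along the ordinate instead of the abscissa.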
- a set of graphical displays including at least a first graphical display and a second graphical display, are prepared. These sets of displays can be compared in order to determine the difference in predicted translational efficiency or translational kinetics of the two plots.
- the plots can differ according to any of a variety of criteria. For example, each plot can represent a different polypeptide-encoding nucleotide sequence, each plot can represent a different host organism, each plot can represent differently determined translational kinetics values, or any combination thereof.
- any number of different graphical displays can be compared in accordance with the methods provided herein, for example, 2, 3, 4, 5, 6, 7, 8 or more different graphical displays can be compared.
- two plots will represent different polypeptide-encoding nucleotide sequences, the same sequence in different host organisms, or different sequences in different host organisms.
- Comparison of different graphical displays can be used to analyze the predicted change in translational kinetics as a result of the difference represented by the graphical displays. For example, comparison of the same polypeptide-encoding nucleotide sequence in different host organisms can be used to analyze any predicted translational pauses that can be removed. Accordingly, provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide in a host organism by comparing two graphical displays to understand or predict the differences in translational kinetics of the mRNA into polypeptide, where the differences in the graphical displays can be as a result of, for example, a difference in the polypeptide-encoding nucleotide sequence or a difference in the host organism.
- a graphical display of the translational kinetics values of codon pairs for the original polypeptide-encoding nucleotide sequence in the heterologous host can be compared to a graphical display of the translational kinetics values of codon pairs for a modified polypeptide-encoding nucleotide sequence in the heterologous host, and it can be determined whether or not the modification to the polypeptide-encoding nucleotide sequence resulted in improved translational kinetics.
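The comparison of an original and a modified sequence in the same host can also be summarized numerically. The sketch below takes two equal-length lists of per-position translational kinetics values and reports which predicted pause positions were removed or introduced by the modification. The cutoff of 3.0 for calling a position a predicted pause is an assumed, illustrative threshold, and the function name and return format are hypothetical.

```python
def compare_profiles(original, modified, threshold=3.0):
    """Summarize how a modification changed predicted pause positions.

    A position counts as a predicted pause when its value is at or above
    `threshold` (an illustrative cutoff). Positions are 0-based indices.
    """
    if len(original) != len(modified):
        raise ValueError("profiles must cover the same positions")
    pairs = list(enumerate(zip(original, modified)))
    removed = [i for i, (a, b) in pairs if a >= threshold and b < threshold]
    introduced = [i for i, (a, b) in pairs if a < threshold and b >= threshold]
    return {
        "removed": removed,
        "introduced": introduced,
        "max_before": max(original),
        "max_after": max(modified),
    }

summary = compare_profiles([0.2, 4.1, 1.0, 3.5], [0.2, 1.2, 3.2, 3.5])
```

A modification that empties the "removed" list's counterpart, that is, one that introduces no new predicted pauses while removing the unwanted ones, would be a candidate for improved predicted translational kinetics under this simple criterion.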
- the nucleic acid sequences provided herein can be present in a polynucleotide (e.g., DNA or RNA molecule).
- the polynucleotides can be inserted into a replicable vector for cloning (e.g., amplification of the DNA) or for expression.
- Various vectors are publicly available and are known in the art.
- the vector can, for example, be in the form of a plasmid, cosmid, viral particle, or phage.
- the appropriate nucleic acid sequence can be inserted into the vector by any of a variety of procedures known in the art.
- Vector components can generally include, but are not limited to, one or more of a signal sequence, an origin of replication, one or more marker genes, an enhancer element, a promoter, and a transcription termination sequence. Construction of suitable vectors containing one or more of these components employs standard ligation techniques which are known to the skilled artisan.
- the encoded polypeptide can be produced recombinantly not only directly, but also as a fusion polypeptide with a heterologous polypeptide, which can be, e.g., a signal sequence or other polypeptide having a specific cleavage site at the N-terminus of the mature protein or polypeptide.
- the signal sequence can be a component of the vector, or it can be a part of the polynucleotide that is inserted into the vector.
- the signal sequence can be a prokaryotic signal sequence selected, for example, from the group of the alkaline phosphatase, penicillinase, lpp, or heat-stable enterotoxin II leaders.
- the signal sequence can be, e.g., the yeast invertase leader, alpha factor leader (including Saccharomyces and Kluyveromyces α-factor leaders, the latter described in U.S. Patent No. 5,010,182), or acid phosphatase leader, the C. albicans glucoamylase leader (EP 362,179 published 4 April 1990), or the signal described in WO 90/13646 published 15 November 1990.
- mammalian signal sequences can be used to direct secretion of the protein, such as signal sequences from secreted polypeptides of the same or related species, as well as viral secretory leaders.
- Both expression and cloning vectors contain a polynucleotide that permits the vector to replicate in one or more selected host cells. Such sequences are well known for a variety of bacteria, yeast, and viruses.
- the origin of replication from the plasmid pBR322 is suitable for most Gram-negative bacteria, the 2μ plasmid origin is suitable for yeast, and various viral origins (SV40, polyoma, adenovirus, VSV or BPV) are useful for cloning vectors in mammalian cells.
- Expression and cloning vectors will typically contain a selection gene, also termed a selectable marker.
- Typical selection genes encode proteins that (a) confer resistance to antibiotics or other toxins, e.g., ampicillin, neomycin, methotrexate, or tetracycline, (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli.
- Suitable selectable markers for mammalian cells are those that enable the identification of cells competent to take up the polynucleotide- containing vector, such as DHFR or thymidine kinase.
- An appropriate host cell when wild-type DHFR is employed is the CHO cell line deficient in DHFR activity, prepared and propagated as described by Urlaub et al., Proc. Natl. Acad. Sci. USA, 77:4216 (1980).
- a suitable selection gene for use in yeast is the trp1 gene present in the yeast plasmid YRp7 [Stinchcomb et al., Nature, 282:39 (1979); Kingsman et al., Gene, 7:141 (1979); Tschemper et al., Gene, 10:157 (1980)].
- the trp1 gene provides a selection marker for a mutant strain of yeast lacking the ability to grow in tryptophan, for example, ATCC No. 44076 or PEP4-1 [Jones, Genetics, 85:12 (1977)].
- Expression and cloning vectors usually contain a promoter operably linked to the polynucleotide provided herein to direct mRNA synthesis. Promoters recognized by a variety of potential host cells are well known. Promoters suitable for use with prokaryotic hosts include the β-lactamase and lactose promoter systems [Chang et al., Nature, 275:615 (1978); Goeddel et al., Nature, 281:544 (1979)], alkaline phosphatase, a tryptophan (trp) promoter system [Goeddel, Nucleic Acids Res., 8:4057 (1980); EP 36,776], and hybrid promoters such as the tac promoter [deBoer et al., Proc. Natl. Acad. Sci. USA, 80:21-25 (1983)]. Promoters for use in bacterial systems also will contain a Shine-Dalgarno (S. D.) sequence operably linked to the poly
- Suitable promoting sequences for use with yeast hosts include the promoters for 3-phosphoglycerate kinase [Hitzeman et al., J. Biol. Chem., 255:2073 (1980)] or other glycolytic enzymes [Hess et al., J. Adv.
- Other yeast promoters, which are inducible promoters having the additional advantage of transcription controlled by growth conditions, are the promoter regions for alcohol dehydrogenase 2, isocytochrome C, acid phosphatase, degradative enzymes associated with nitrogen metabolism, metallothionein, glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Suitable vectors and promoters for use in yeast expression are further described in EP 73,657.
- Transcription from vectors in mammalian host cells is controlled, for example, by promoters obtained from the genomes of viruses such as polyoma virus, fowlpox virus (UK 2,211,504 published 5 July 1989), adenovirus (such as Adenovirus 2), bovine papilloma virus, avian sarcoma virus, cytomegalovirus, a retrovirus, hepatitis-B virus and Simian Virus 40 (SV40), from heterologous mammalian promoters, e.g., the actin promoter or an immunoglobulin promoter, and from heat-shock promoters, provided such promoters are compatible with the host cell systems.
- Enhancers are cis-acting elements of DNA, usually from about 10 to 300 bp, that act on a promoter to increase its transcription.
- Many enhancer sequences are now known from mammalian genes (globin, elastase, albumin, α-fetoprotein, and insulin).
- an enhancer from a eukaryotic cell virus. Examples include the SV40 enhancer on the late side of the replication origin (bp 100-270), the cytomegalovirus early promoter enhancer, the polyoma enhancer on the late side of the replication origin, and adenovirus enhancers.
- the enhancer can be spliced into the vector at a position 5' or 3' to the polynucleotide provided herein, but is preferably located at a site 5' from the promoter.
- Expression vectors used in eukaryotic host cells will also contain sequences necessary for the termination of transcription and for stabilizing the mRNA. Such sequences are commonly available from the 5' and, occasionally 3', untranslated regions of eukaryotic or viral DNAs or cDNAs. These regions contain nucleotide segments transcribed as polyadenylated fragments in the untranslated portion of the mRNA transcribed from the polynucleotide provided herein.
- Host cells are transfected or transformed with expression or cloning vectors described herein for polypeptide production and cultured in conventional nutrient media modified as appropriate for inducing promoters, selecting transformants, or amplifying the genes encoding the desired sequences.
- the culture conditions such as media, temperature, pH and the like, can be selected by the skilled artisan without undue experimentation. In general, principles, protocols, and practical techniques for maximizing the productivity of cell cultures can be found in Mammalian Cell Biotechnology: a Practical Approach, M. Butler, ed. (IRL Press, 1991) and Sambrook et al., supra.
- Methods of eukaryotic cell transfection and prokaryotic cell transformation are known to the ordinarily skilled artisan, for example, CaCl2, CaPO4, liposome-mediated and electroporation. Depending on the host cell used, transformation is performed using standard techniques appropriate to such cells.
- the calcium treatment employing calcium chloride, as described in Sambrook et al., supra, or electroporation is generally used for prokaryotes.
- Infection with Agrobacterium tumefaciens is used for transformation of certain plant cells, as described by Shaw et al., Gene, 23:315 (1983) and WO 89/05859 published 29 June 1989.
- Suitable host cells for cloning or expressing the DNA in the vectors herein include prokaryote, yeast, or higher eukaryote cells.
- Suitable prokaryotes include but are not limited to eubacteria, such as Gram-negative or Gram-positive organisms, for example, Enterobacteriaceae such as E. coli.
- Various E. coli strains are publicly available, such as E. coli K12 strain MM294 (ATCC 31,446); E. coli X1776 (ATCC 31,537); E. coli strain W3110 (ATCC 27,325) and K5 772 (ATCC 53,635).
- suitable prokaryotic host cells include Enterobacteriaceae such as Escherichia, e.g., E. coli, Enterobacter, Erwinia, Klebsiella, Proteus, Salmonella, e.g., Salmonella typhimurium, Serratia, e.g., Serratia marcescens, and Shigella, as well as Bacilli such as B. subtilis and B. licheniformis (e.g., B. licheniformis 41P disclosed in DD 266,710 published 12 April 1989), Pseudomonas such as P. aeruginosa, and Streptomyces. These examples are illustrative rather than limiting.
- Strain W3110 is one particularly preferred host or parent host because it is a common host strain for recombinant DNA product fermentations. Preferably, the host cell secretes minimal amounts of proteolytic enzymes.
- strain W3110 can be modified to effect a genetic mutation in the genes encoding proteins endogenous to the host, with examples of such hosts including E. coli W3110 strain 1A2, which has the complete genotype tonA; E. coli W3110 strain 9E4, which has the complete genotype tonA ptr3; E. coli W3110 strain 27C7 (ATCC 55,244), which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT kanr; E. coli W3110 strain 37D6, which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT rbs7 ilvG kanr; E. coli W3110 strain 40B4, which is strain 37D6 with a non-kanamycin resistant degP deletion mutation; and an E. coli strain having mutant periplasmic protease disclosed in U.S. Patent No. 4,946,783 issued 7 August 1990.
- in vitro methods of cloning, e.g., PCR or other nucleic acid polymerase reactions, are suitable.
- eukaryotic microbes such as filamentous fungi or yeast are suitable cloning or expression hosts for polynucleotide-containing vectors.
- Saccharomyces cerevisiae is a commonly used lower eukaryotic host microorganism.
- Others include Schizosaccharomyces pombe (Beach and Nurse, Nature, 290:140 [1981]; EP 139,383 published 2 May 1985); Kluyveromyces hosts (U.S. Patent No. 4,943,529; Fleer et al., Bio/Technology, 9:968-975 (1991)) such as, e.g., K. lactis (MW98-8C, CBS683, CBS4574; Louvencourt et al., J. Bacteriol., 154(2):737-742 [1983]), K. fragilis (ATCC 12,424), K. bulgaricus (ATCC 16,045), K. wickeramii (ATCC 24,178), K. waltii (ATCC 56,500), K. drosophilarum (ATCC 36,906; Van den Berg et al., Bio/Technology, 8:135 (1990)), K. thermotolerans, and K. marxianus; Yarrowia (EP 402,226); Pichia pastoris (EP 183,070; Sreekrishna et al., J.
- Candida; Trichoderma reesia (EP 244,234); Neurospora crassa (Case et al., Proc. Natl. Acad. Sci. USA, 76:5259-5263 [1979]); Schwanniomyces such as Schwanniomyces occidentalis (EP 394,538 published 31 October 1990); and filamentous fungi such as, e.g., Neurospora, Penicillium, Tolypocladium (WO 91/00357 published 10 January 1991), and Aspergillus hosts such as A. nidulans (Ballance et al., Biochem. Biophys. Res.
- Methylotrophic yeasts are suitable herein and include, but are not limited to, yeast capable of growth on methanol selected from the genera consisting of Hansenula, Candida, Kloeckera, Pichia, Saccharomyces, Torulopsis, and Rhodotorula. A list of specific species that are exemplary of this class of yeasts can be found in C. Anthony, The Biochemistry of Methylotrophs, 269 (1982).
- Suitable host cells for the expression of glycosylated polypeptides are derived from multicellular organisms.
- invertebrate cells include insect cells such as Drosophila S2 and Spodoptera Sf9, as well as plant cells.
- useful mammalian host cell lines include Chinese hamster ovary (CHO) and COS cells. More specific examples include monkey kidney CV1 line transformed by SV40 (COS-7, ATCC CRL 1651); human embryonic kidney line (293 or 293 cells subcloned for growth in suspension culture, Graham et al., J. Gen Virol., 36:59 (1977)); Chinese hamster ovary cells/-DHFR (CHO, Urlaub and Chasin, Proc. Natl. Acad. Sci.
- mice Sertoli cells TM4, Mather, Biol. Reprod., 23:243-251 (1980)
- human lung cells Wl 38, ATCC CCL 75
- human liver cells Hep G2, HB 8065
- mouse mammary tumor MMT 060562, ATCC CCL51. The selection of the appropriate host cell is deemed to be within the skill in the art.
- Gene amplification and/or expression can be measured in a sample directly, for example, by conventional Southern blotting, Northern blotting to quantitate the transcription of mRNA [Thomas, Proc. Natl. Acad. Sci. USA, 77:5201-5205 (1980)], dot blotting (DNA analysis), or in situ hybridization, using an appropriately labeled probe, based on the sequences provided herein.
- Antibodies can be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes. The antibodies in turn can be labeled and the assay can be carried out where the duplex is bound to a surface, so that upon the formation of duplex on the surface, the presence of antibody bound to the duplex can be detected.
- Gene expression can be measured by immunological methods, such as immunohistochemical staining of cells or tissue sections and assay of cell culture or body fluids, to quantitate directly the expression of gene product.
- Antibodies useful for immunohistochemical staining and/or assay of sample fluids can be either monoclonal or polyclonal, and can be prepared in any mammal. Conveniently, the antibodies can be prepared against any polypeptide provided herein or against a synthetic peptide based on the sequences provided herein or against exogenous sequence fused to the polypeptide or fragment thereof and encoding a specific antibody epitope.
- Polypeptides can be recovered from culture medium or from host cell lysates. If membrane-bound, the polypeptide can be released from the membrane using a suitable detergent solution (e.g. Triton-X 100) or by enzymatic cleavage. Cells employed in expression of polypeptides can be disrupted by various physical or chemical means, such as freeze-thaw cycling, sonication, mechanical disruption, or cell lysing agents, as is known in the art. It may be desired to purify polypeptides.
- The following procedures are exemplary of suitable purification procedures: fractionation on an ion-exchange column; ethanol precipitation; reverse phase HPLC; chromatography on silica or on a cation-exchange resin such as DEAE; chromatofocusing; SDS-PAGE; ammonium sulfate precipitation; gel filtration using, for example, Sephadex G-75; protein A Sepharose columns to remove contaminants such as IgG; and metal chelating columns to bind epitope-tagged forms of the polypeptide.
- Various additional known methods of protein purification can be employed; exemplary methods are described in Deutscher, Methods in Enzymology, 182 (1990); Scopes, Protein Purification: Principles and Practice, Springer-Verlag, New York (1982).
- The purification step(s) selected will depend, for example, on the nature of the production process used and the particular polypeptide produced.
- an expression system comprising an expression vector in a host organism, wherein the expression vector includes a DNA sequence of the embodiments provided herein operably linked to an expression control sequence.
- an expression vector is a DNA or RNA vector that is capable of transforming a host cell and of effecting expression of a specified nucleic acid molecule.
- the expression vector is also capable of replicating within the host cell.
- Expression vectors can be either prokaryotic or eukaryotic, and are typically viruses or plasmids.
- operably linked refers to functional linkage between a nucleic acid expression control sequence (such as a promoter, or array of transcription factor binding sites) and a second nucleic acid sequence, wherein the expression control sequence directs transcription of the nucleic acid corresponding to the second sequence.
- An operably linked expression vector can also include secretion signals and other modifying sequences, and can encode chaperones and proteins for a variety of organisms and systems.
- Methods of expressing polypeptides from polypeptide-encoding nucleotide sequences are known in the art, as exemplified, for example, by the techniques described in Maniatis et al., 1989, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, N.Y. and Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y.
- the methods include inserting a polypeptide-encoding nucleotide sequence designed by the methods provided herein into a cell, and expressing the polypeptide-encoding nucleotide sequence under conditions suitable for gene expression. Additionally provided expression methods include cell-free expression systems as known in the art, where such methods include providing a polypeptide-encoding nucleotide sequence designed by the methods provided herein and contacting the polypeptide-encoding nucleotide sequence with a cell-free expression system under conditions suitable for protein translation.
- the expression levels of one or more enzymes in a metabolic pathway are individually manipulated. Differential metabolic expression levels can be manipulated using methods known in the art. For example, by selecting a specific promoter with a desired transcriptional level, one can vary the expression level of the gene that is operably linked to the promoter. Similarly, one may select an expression vector that produces the desired levels of expression.
- Endogenous sequences include genomic sequences of a cell. Such genomic sequences can include sequences previously modified by the constructs, methods and systems provided herein. Modifications of endogenous sequences can include insertions, deletions and mutations. In some embodiments, a modification can include the insertion of a heterologous sequence. Heterologous sequences include exogenous nucleic acid sequences and can include sequences with homology to endogenous sequences.
- Integrable polynucleotides for modifying endogenous nucleotide sequences in a cell are provided.
- Such integrable polynucleotides can contain sequences with homology to endogenous sequences and a removable selectable marker cassette.
- the removable selectable marker cassette can include a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
- integrable polynucleotides can also contain heterologous sequences.
- the heterologous sequences and removable selectable marker cassette can be flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
- integrable polynucleotides can include episomal nucleic acids, such as plasmids and YACs.
- integrable polynucleotides can include autonomous replication sequences such as ColE1, ori, oriT, 2 µm, and CEN/ARS.
- integrable polynucleotides can include linearized episomal nucleic acids, for example, plasmids cut with a restriction enzyme.
- integrable polynucleotides can include PCR products.
- a removable selectable cassette can contain a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
- Removable selectable marker cassettes can be used to select for integration of an integrable polynucleotide into the genome of a cell. Subsequent to integration of the integrable polynucleotide, the removable selectable marker cassette can be excised, if desired, from the genome of the cell. Because the number of known selectable markers is limited, one advantage of excising a selectable marker from the genome of a cell is that the selectable marker can be used repeatedly.
- the same selectable marker can be used in a second integrable polynucleotide to modify the genome of a cell previously modified by the first integrable polynucleotide.
- the selectable marker can allow selection for a cell in which the selectable marker has integrated into the cell's genome.
- Selectable markers can be antibiotic resistance genes against compounds, for example, kanamycin, ampicillin, tetracycline, chloramphenicol, spectinomycin, gentamycin, zeomycin, or streptomycin. More selectable markers can be genes capable of complementing strains of yeast having well characterized metabolic deficiencies, for example, tryptophan or histidine deficient mutants.
- a selectable marker can be used to select against cells that retain the selectable marker. In such embodiments, cells which do not express the selectable marker will be selected for.
- a selectable marker can be selected for and against.
- examples of selectable markers include, but are not limited to, URA3 (Boeke, J. D., LaCroute, F., and Fink, G. R. (1984).
- A counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance.
- Yeast 16, 553-560), CAN1 (Whelan, W. L., Gocke, E., and Manney, T. R. (1979).
- The CAN1 locus of Saccharomyces cerevisiae: fine-structure analysis and forward mutation rates. Genetics 35-51), KlURA3, CYH2, LYS2 and MET15 (Singh, A. and Sherman, F. (1975). Genetic and physiological characterization of met15 mutants of Saccharomyces cerevisiae: a selective system for forward and reverse mutations. Genetics 75-97).
- Such examples can typically be used in conjunction with specific strains of Saccharomyces cerevisiae which are non-functional for specific genes.
- a first selection of the selectable marker can be made to select for incorporation of the selectable marker and a second selection of the selectable marker can be made to select against maintaining the selectable marker.
- Such embodiments can find particular application when the same selectable marker is utilized iteratively, namely, two or more times, for the separate incorporation of two or more heterologous polynucleotides into the host organism.
- the selectable marker can be flanked by site-specific recombinase recognition sequences.
- site-specific recombinase recognition sequences allow a site-specific recombinase to excise the selectable marker from an integrable polynucleotide integrated into the genome of a cell.
- site-specific recombinase target sites include, but are not limited to, loxP sites, frt sites, att sites and dif sites.
- the site-specific recombinase recognition sequences can be loxP sites recognized by the CRE recombinase.
- the CRE recombinase can be a CRE recombinase optimized for expression in a particular organism, for example, S. cerevisiae, using methods known in the art.
- the site-specific recombinase recognition sequences can be frt sites recognized by the FLP recombinase.
- flanking loxP sites or flanking frt sites should be in the same orientation, that is, the sites should be in tandem orientation.
- CRE recombinase or FLP recombinase expressed in a cell can excise the sequence between loxP sites or frt sites, respectively.
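- By way of illustration only, the excision of a sequence lying between two recognition sites in tandem orientation can be modeled on plain strings. The sketch below is not part of the described methods; it assumes the commonly cited 34-bp loxP sequence, and the function name `excise_between_sites` and the short flanking sequences are hypothetical. Because the loxP spacer is asymmetric, a site in the opposite orientation would not match the string search, which mirrors the tandem-orientation requirement.

```python
# Illustrative model only: DNA is a plain string and recombination is string surgery.
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"  # commonly cited 34-bp loxP site


def excise_between_sites(genome: str, site: str = LOXP) -> str:
    """Remove the segment between the first two same-orientation copies of
    `site`, leaving a single copy of the site behind."""
    first = genome.find(site)
    if first == -1:
        return genome  # no site present, nothing to excise
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome  # only one site present, nothing to excise
    # Keep everything through the first site, then resume after the second site.
    return genome[: first + len(site)] + genome[second + len(site):]


# Hypothetical toy sequences standing in for flanking DNA and a marker gene.
upstream, marker, downstream = "GGGCCC", "ATGAAAGGTTAA", "TTTAAA"
integrated = upstream + LOXP + marker + LOXP + downstream
excised = excise_between_sites(integrated)
```

In this model one recognition site remains at the locus after excision, and a second call on the excised sequence leaves it unchanged.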
- the site-specific recombinase can be expressed from a plasmid. In other embodiments, the site-specific recombinase can be expressed from an inducible endogenous gene.
- integration of an integrable polynucleotide into the genome of a cell can be mediated by a variety of processes.
- Such processes can include, but are not limited to, random integration, homologous recombination, or site-specific recombination.
- integrable polynucleotides can contain sequences with homology to endogenous sequences. Such sequences with homology to endogenous sequences can direct integration of integrable polynucleotides to certain locations in a cell's genome, specifically, the location of the endogenous sequence.
- One advantage of directing integration of integrable polynucleotides to particular locations of the genome is that the integrable polynucleotides can be directed to locations of the genome that, for example, can contain enhancer elements, locus control regions, or can be more permissive for expression of a heterologous sequence contained within an integrable polynucleotide.
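- As a rough sketch of how flanking homology can direct an integrable polynucleotide to one genomic location, the following toy model replaces whatever lies between two homology arms with a payload. It is an assumption-laden simplification (exact string matching stands in for homologous recombination), and the function name `integrate_by_homology` and all sequences are hypothetical.

```python
from typing import Optional


def integrate_by_homology(genome: str, five_arm: str, three_arm: str,
                          payload: str) -> Optional[str]:
    """Return the genome with the segment between the two homology arms
    replaced by `payload`, or None if either arm is not found."""
    start = genome.find(five_arm)
    if start == -1:
        return None
    # The 3' arm must lie downstream of the 5' arm.
    end = genome.find(three_arm, start + len(five_arm))
    if end == -1:
        return None
    return genome[: start + len(five_arm)] + payload + genome[end:]


# Hypothetical toy locus: arm, six bases to be replaced, arm.
five_arm, three_arm = "ACGTACGTAC", "TGCATGCATG"
genome = "TTTT" + five_arm + "GGGGGG" + three_arm + "CCCC"
modified = integrate_by_homology(genome, five_arm, three_arm, "ATGCCCTAA")
```

A genome lacking either arm returns `None`, which corresponds loosely to the absence of a targeted integration event in this model.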
- sequences with homology to endogenous sequences can be more than about 5 nucleotides, more than about 10 nucleotides, more than about 15 nucleotides, more than about 20 nucleotides, more than about 25 nucleotides, more than about 30 nucleotides, more than about 35 nucleotides, more than about 40 nucleotides, more than about 45 nucleotides, more than about 50 nucleotides, more than about 100 nucleotides, more than about 500 nucleotides, more than about 1 kilobase, more than about 2 kilobases, more than about 3 kilobases, more than about 4 kilobases, or more than about 5 kilobases in length.
- Sequences with homology to endogenous sequences can be 100% identical or can have at least 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, or 70% identity to the endogenous sequence.
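- For orientation, percent identity between two sequences can be computed as sketched below. This minimal example assumes the sequences are already aligned, gap-free, and of equal length; real comparisons would normally use an alignment tool first. The function name `percent_identity` is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two pre-aligned, equal-length
    sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# One mismatch in ten positions.
identity = percent_identity("ACGTACGTAC", "ACGTACGTTC")
```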
- sequences with homology to endogenous sequences can contain sequences with homology to genomic repetitive elements, such as long interspersed repeats (LINEs), short interspersed repeats (SINEs), or retrotransposon DNA, such as long terminal repeats (LTR).
- genomic repetitive elements can be Ty1 or Ty3 elements.
- integrable polynucleotides containing sequences with homology to genomic repetitive elements may integrate at more than one site in the genome of a cell.
- sequences with homology to endogenous sequences can contain δ sequences. δ sequences are a component of the LTR of the Ty1 retrotransposon and are distributed throughout the S. cerevisiae genome.
- Vectors containing δ sequences for integration into S. cerevisiae are known in the art, as exemplified in Lee F.W. and Da Silva N.A., Sequential delta-integration for the regulated insertion of cloned genes in Saccharomyces cerevisiae. Biotechnol Prog. (1997) 13(4): 368-373.
- the 5' nucleic acid sequence with homology to an endogenous sequence and the 3' nucleic acid sequence with homology to an endogenous sequence can contain δ sequences.
- Vectors containing heterologous sequences flanked by δ sequences are known in the art to have an increased stability for expression of heterologous sequences contained therein (Lee F.W.
- an integrable polynucleotide can contain heterologous sequences.
- Such heterologous sequences can include sequences encoding polypeptides.
- the heterologous sequences can encode genes important in sugar metabolism, cellulose metabolism, arabinose metabolism, and xylose metabolism.
- heterologous sequences can contain regulatory elements operatively linked to a sequence encoding a polypeptide.
- regulatory elements can include, for example, promoters, enhancers, and terminator sequences. Promoters may be constitutive or inducible. Suitable promoters for use in prokaryotic hosts include, but are not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters.
- Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase and the enzymes responsible for maltose and galactose utilization.
- Appropriate mammalian promoters include, but are not limited to, the early and late promoters from SV40 and promoters derived from murine Moloney leukemia virus (MLV), mouse mammary tumor virus (MMTV), avian sarcoma viruses, adenovirus II, bovine papilloma virus and polyomas.
- a heterologous sequence can contain the PGK1 promoter, the TEF1 promoter, the CYC1 terminator, and combinations thereof.
- heterologous sequences encode and express the gene of interest in a cell in which the heterologous sequence has integrated.
- a cell can contain any of the integrable polynucleotides described herein.
- a cell can be a prokaryotic cell or a eukaryotic cell.
- prokaryotic cells include Escherichia coli, and Clostridium species.
- eukaryotic cells include, but are not limited to, fungi and yeast cells, such as Saccharomyces cerevisiae, Pichia pastoris, Zymomonas mobilis, Kluyveromyces lactis, Kluyveromyces marxianus, Trichoderma species, and Aspergillus species; mammalian cells, such as Chinese hamster cells; avian cells; and insect cells.
- the cell can contain an integrable polynucleotide integrated into the genome of a cell.
- a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which the removable selectable marker is juxtaposed to said heterologous nucleic acid.
- a removable selectable marker can be juxtaposed to a heterologous nucleic acid where the removable selectable marker and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the removable selectable marker and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobases, less than about 2 kilobases, less
- a cell can contain an integrable polynucleotide integrated into the genome of the cell where the removable selectable cassette has been excised from the integrated polynucleotide.
- a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which a site-specific recombinase recognition site is juxtaposed to the heterologous nucleic acid.
- a site-specific recombinase recognition site can be juxtaposed to a heterologous nucleic acid where the site-specific recombinase recognition site and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the site-specific recombinase recognition site and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilob
- a cell can contain a plurality of integrable polynucleotides.
- a cell can contain a plurality of different integrable polynucleotides containing different selectable markers.
- a cell contains no more than about 1, no more than about 2, no more than about 3, no more than about 4, no more than about 5, no more than about 6, no more than about 7, no more than about 8, no more than about 9, or no more than about 10 different selectable markers.
- the number of selectable markers a cell can contain can include the number of different selectable markers compatible with the methods and compositions described herein.
- a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell.
- a cell can contain 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 45 or more, or 50 or more different integrable polynucleotides that have integrated into the genome of the cell.
- a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell where some integrable polynucleotides contain selectable markers, and some integrable polynucleotides have no selectable marker. In even more embodiments, a cell can contain a plurality of different integrable polynucleotides where some or all of the selectable markers have been excised.
- methods to modify an endogenous sequence in a cell can include providing a cell with any integrable polynucleotide described herein, and selecting for at least one cell containing the integrable polynucleotide integrated into the genome of the cell.
- a plurality of different integrable polynucleotides can be provided to a cell.
- the plurality of different integrable polynucleotides can include 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
- the plurality of integrable polynucleotides can include integrable polynucleotides with different selectable markers.
- One advantage of providing a cell with a plurality of polynucleotides with different selectable markers includes the ability to make more than one modification to endogenous sequences in a cell simultaneously.
- the plurality of integrable polynucleotides can include integrable polynucleotides with different heterologous sequences.
- the plurality of integrable polynucleotides can include integrable polynucleotides with different flanking sequences with homology to endogenous sequences.
- at least one selectable marker can be used iteratively.
- a cell can be produced from a first round of modification(s) using the methods described herein.
- a cell can be provided with a first integrable polynucleotide containing a selectable marker, a cell can be selected for containing the integrable polynucleotide integrated into the cell's genome, the selection cassette can be excised from a cell containing an integrated integrable polynucleotide, and a cell can be selected for having the selection cassette excised. Subsequent to the first round of modifications, a cell containing the modifications of the first round, can undergo at least a second round of modifications using a second integrable polynucleotide containing the same selectable marker as the first integrable polynucleotide. As such, a selectable marker can be reused and is used iteratively.
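- The iterative reuse of a single selectable marker can be pictured with a small bookkeeping model. The sketch below is illustrative only: the genome is a list of named elements, selection steps are simple membership checks, and all names (`integrate`, `excise_marker`, `modification_round`, "geneA", "geneB") are hypothetical. As a stated simplification, only a loxP-marker-loxP run is treated as excisable, so recombination between a leftover site and a newly introduced site is ignored.

```python
LOX = "loxP"


def integrate(genome, payload, marker):
    """Append a payload followed by a loxP-flanked marker cassette."""
    return genome + [payload, LOX, marker, LOX]


def excise_marker(genome, marker):
    """Collapse each loxP-marker-loxP run into a single leftover loxP site."""
    out, i = [], 0
    while i < len(genome):
        if genome[i:i + 3] == [LOX, marker, LOX]:
            out.append(LOX)
            i += 3
        else:
            out.append(genome[i])
            i += 1
    return out


def modification_round(genome, payload, marker):
    """One round: integrate, select for the marker, excise, select against it."""
    genome = integrate(genome, payload, marker)
    if marker not in genome:  # stands in for positive selection
        raise RuntimeError("integration not detected")
    genome = excise_marker(genome, marker)
    if marker in genome:  # stands in for negative selection
        raise RuntimeError("marker still present after excision")
    return genome


# Two rounds of modification reusing the same marker.
genome = ["chromosome"]
for payload in ("geneA", "geneB"):
    genome = modification_round(genome, payload, "URA3")
```

After both rounds the model genome carries both payloads and no copy of the marker, which is the property that allows the marker to be used again.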
- a cell can be provided with a plurality of integrable polynucleotides containing a set of different selectable markers in a first round of modifications.
- a cell containing the modifications of the first round of modifications can be provided with a plurality of integrable polynucleotides containing the same set of different selectable markers as the first round of modifications.
- the integrable polynucleotide can be provided to a cell as a linearized plasmid.
- the integrable polynucleotide can be provided to a cell as a PCR product.
- Methods of PCR are well known in the art.
- the template for the PCR can comprise a sequence for an integrable polynucleotide, for example, a vector containing the integrable polynucleotide sequence.
- the initial template for PCR may not contain the entire sequence for an integrable polynucleotide.
- One advantage of using PCR to generate the integrable polynucleotide includes the ability to incorporate additional sequences to the ends of the initial PCR template.
- PCR primers with tails can be designed and used to amplify the initial PCR template and incorporate the additional sequences in the tails into the amplified product.
- Such additional tail sequences can be 2 nucleotides, 3 nucleotides, 4 nucleotides, 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 31 nucleotides, 32 nucleotides, 33 nucleotides, 34 nucleotides, 35 nucleotides, 36 nucleotides, 37 nucleotides, 38
- primers for the PCR can be designed to add sequences with homology to endogenous sequences to the initial PCR template.
- an integrable polynucleotide with flanking sequences with homology to endogenous sequences can be generated.
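- The use of tailed primers to add flanking homology to a PCR template can be sketched as follows. This is a simplified illustration under stated assumptions: primers anneal perfectly, the annealing length of 18 nucleotides is an arbitrary choice, and the function names and example sequences are hypothetical rather than taken from the methods described herein.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tailed_primers(template, five_homology, three_homology, anneal_len=18):
    """Design a forward and a reverse primer whose 5' tails carry the
    homology sequences and whose 3' ends anneal to the template."""
    if len(template) < anneal_len:
        raise ValueError("template shorter than annealing length")
    forward = five_homology + template[:anneal_len]
    reverse = reverse_complement(template[-anneal_len:] + three_homology)
    return forward, reverse


def predicted_product(template, forward, reverse, anneal_len=18):
    """Top-strand product expected when both tails are incorporated."""
    five_tail = forward[:-anneal_len]
    three_tail = reverse_complement(reverse)[anneal_len:]
    return five_tail + template + three_tail


# Hypothetical template and homology tails.
template = "ATGGCTAGCTAGGATCCGATCGTTAACCGGTTAA"
five_h, three_h = "AAAACCCC", "GGGGTTTT"
fwd, rev = tailed_primers(template, five_h, three_h)
product = predicted_product(template, fwd, rev)
```

The predicted product is the original template flanked by the two homology sequences, which is the form in which it could serve as an integrable polynucleotide.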
- additional tail sequences can include Ty1 sequences.
- methods to modify an endogenous sequence in a cell can also include excising the selectable marker from the integrable polynucleotide integrated into the genome of the cell.
- One advantage of excising a selectable marker integrated into the genome of a cell is that the selectable marker can be re-used to select for another modification in a subsequent round of modifications.
- a selectable marker can be excised from an integrated site by site-specific recombination using a site-specific recombinase expressed in the cell.
- Site-specific recombinases can include CRE recombinase to excise sequences between tandem loxP sites, and FLP recombinase to excise sequences between tandem frt sites.
- the site-specific recombinase can be expressed from a plasmid transformed into the cell.
- the site-specific recombinase can be expressed from an inducible endogenous gene. It is contemplated that in instances where more than one type of selectable marker has integrated into the cell's genome, all the different selectable markers can be excised simultaneously by the expression of at least one type of site-specific recombinase.
- the selectable markers of an integrable polynucleotide containing the URA3 marker flanked by loxP sites, and an integrable polynucleotide containing the TRP1 marker flanked by loxP sites can both be excised from sites where the integrable polynucleotides have integrated into the cell by expression in the cell of CRE recombinase.
- a cell can be provided with a plurality of integrable polynucleotides which contain different recombinase recognition sequences.
- the plurality of integrable polynucleotides can include some integrable polynucleotides that contain one type of recombinase recognition sequences, such as loxP sites, and some integrable polynucleotides can contain another type of recombinase recognition sequences, such as frt sites.
- a cell in which a selectable marker has been excised can be identified by selecting against cells that retain the marker. Methods for such negative selection are well known in the art.
- one or more, or all of the enzymes are heterologous to the one or more host organisms.
- the translational kinetics of each of the DNA sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, or 3 codon pairs present in the original sequence for each enzyme.
- a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
- the at least 1, 2, or 3 substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
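- To make the notion of a silent permutation of a codon pair concrete, the sketch below enumerates synonymous codon pairs under the standard genetic code and applies a replacement only when the encoded amino acids are unchanged. It does not predict which pairs cause a translational pause; that prediction is outside this illustration. The function names and the short example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate complete codons of an in-frame DNA string."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def synonymous_pairs(pair: str):
    """All other codon pairs encoding the same two amino acids as `pair`."""
    aa1, aa2 = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    firsts = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    seconds = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return [a + b for a in firsts for b in seconds if a + b != pair]


def replace_pair(seq: str, codon_index: int, new_pair: str) -> str:
    """Replace the codon pair starting at `codon_index` (0-based) with
    `new_pair`, refusing any change that alters the encoded amino acids."""
    start = 3 * codon_index
    old_pair = seq[start:start + 6]
    if translate(old_pair) != translate(new_pair):
        raise ValueError("replacement is not silent")
    return seq[:start] + new_pair + seq[start + 6:]


original = "ATGGCTAAATAA"  # encodes Met-Ala-Lys-stop
modified = replace_pair(original, 1, "GCCAAG")  # Ala-Lys pair, different codons
```

In use, a candidate would be drawn from `synonymous_pairs` according to whatever pause prediction is available, then applied with `replace_pair`.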
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
- each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the original sequence of the enzyme.
- one or more of the endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for degradation of cellulose.
- Methods for measuring the activity of the enzymes in the system are known in the art.
- the incorporated materials of U.S. Patent No. 6,566,113 provide methods for measuring the activity of cellobiohydrolases that have been recombinantly expressed.
- Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed.
- the carbohydrate is cellulose.
- the carbohydrate comprises two or more β-1,4-linked glucose units.
- Such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
- An exemplary system for lignin metabolism is a cassette of enzymes that can include laccase (LCC), Mn-dependent peroxidase (MnP), and lignin peroxidase (LiP).
- one or more, or all of the enzymes are heterologous to the one or more host organisms.
- the translational kinetics of each of the DNA sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme.
- a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
- the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
- a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
- the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster and Schizosaccharomyces pombe.
- each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the original sequence of the enzyme.
- one or more of the enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for metabolism of lignin. Methods for measuring the activity of the enzymes in the system are known in the art.
- Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed.
- the carbohydrate is cellulose.
- the carbohydrate comprises two or more β-1,4-linked glucose units.
- Such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
- a polynucleotide containing an improved-expression nucleotide sequence calculated in accordance with the teachings herein can be prepared by known methods, such as, for example, assembly of overlapping oligonucleotides which can be solid phase synthesized, as is described in U.S. Patent Number 7,262,031, and U.S. Patent Publication Numbers 2005/0106590 and 2007/0009928.
- the prepared polynucleotide can then be amplified by PCR methodologies or by insertion into a vector, transformation into cells, and subsequent harvesting of the vector from the cells. Examples of such methods for amplification of a polynucleotide are provided in Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y.
- the polynucleotide itself or amplicon thereof can be inserted into an expression vector configured to produce the polypeptide encoded by the inserted polynucleotide.
- the expression vector is then inserted into cells, and according to the expression vector used, the cells are treated under conditions suitable for polypeptide expression.
- the expressed polypeptide can be analyzed and manipulated as desired.
- the expressed polypeptide can be analyzed by Western blot analysis using a known antibody to the expressed polypeptide or using an anti-polypeptide antibody generated by known methods.
- the expressed polypeptide also can be subjected to one or more purification steps to increase the purity of the expressed polypeptide.
- Various analytical and purification methods, as well as antibody-generation methods, are known in the art, as exemplified in Ausubel, supra.
- This example describes optimization of a DNA sequence encoding TrCBH-II for expression in yeast.
- the chi-squared value "chisq1" was generated from the expected and observed values determined.
- the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
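The chi-squared and z-score steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the chi-squared term is taken as (O - E)^2 / E and given the sign of (O - E) so that over-represented pairs score positive, the population standard deviation is used, and the corrections that turn the first chi-squared value into the second and third are not reproduced. The function names are hypothetical.

```python
from statistics import mean, pstdev


def signed_chi_squared(observed: float, expected: float) -> float:
    """(O - E)^2 / E carrying the sign of (O - E).

    Assumption: the sign convention is illustrative; the source only states
    that the value is derived from observed and expected counts.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    diff = observed - expected
    value = diff * diff / expected
    return value if diff >= 0 else -value


def z_scores(values: dict) -> dict:
    """Express each value as standard deviations from the mean of all values."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {key: (value - mu) / sigma for key, value in values.items()}
```

A codon pair whose normalized value exceeds 3 lies more than three standard deviations above the mean of all codon pairs.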
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for S. cerevisiae.
- the DNA sequence encoding TrCBH-II (SEQ ID NO: 1) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position.
- the graphical display is provided in Figure 1.
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 2A.
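The data behind such a display can be sketched as a list of (codon pair position, z score) points. The sketch assumes a lookup table `z_table` keyed by (codon, codon) tuples has already been computed for the host organism; the table contents, the default of 0.0 for pairs absent from the table, and the function names are all hypothetical.

```python
def split_codons(cds: str) -> list:
    """Split an in-frame coding sequence into codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_pair_profile(cds: str, z_table: dict, default: float = 0.0) -> list:
    """Return (position, z score) for each adjacent codon pair, 1-indexed."""
    codons = split_codons(cds.upper())
    return [
        (i + 1, z_table.get((codons[i], codons[i + 1]), default))
        for i in range(len(codons) - 1)
    ]


def positions_above(profile: list, threshold: float = 3.0) -> list:
    """Positions whose z score exceeds the threshold."""
    return [position for position, z in profile if z > threshold]
```

Plotting the first element of each tuple against the second gives a display of z score as a function of codon pair position.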
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 3) was found to encode a protein (SEQ ID NO: 4) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 3) encoding the TrCBH-II protein (SEQ ID NO: 4) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2B.
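One way to remove codon pairs with z scores above 3 while leaving the encoded protein unchanged is a greedy synonymous-substitution pass, sketched below. This is not presented as the method used to produce the modified sequences; it assumes the standard genetic code, a precomputed `z_table` keyed by (codon, codon) tuples with missing pairs scored 0.0, and hypothetical function names.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def split_codons(cds):
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def pair_z(codons, i, z_table):
    """z score of the pair formed by codons i and i + 1 (0.0 if unlisted)."""
    return z_table.get((codons[i], codons[i + 1]), 0.0)


def remove_high_z_pairs(cds, z_table, threshold=3.0, max_passes=10):
    """Greedily swap in synonymous codons until no pair exceeds the threshold."""
    codons = split_codons(cds.upper())
    for _ in range(max_passes):
        changed = False
        for i in range(len(codons) - 1):
            if pair_z(codons, i, z_table) <= threshold:
                continue
            best = None
            # Try the second codon of the offending pair first, then the first.
            for pos in (i + 1, i):
                for alt in SYNONYMS[CODON_TABLE[codons[pos]]]:
                    trial = codons[:]
                    trial[pos] = alt
                    # Re-score only the pairs that contain the changed codon.
                    lo, hi = max(pos - 1, 0), min(pos, len(trial) - 2)
                    worst = max(pair_z(trial, j, z_table) for j in range(lo, hi + 1))
                    if worst <= threshold and (best is None or worst < best[0]):
                        best = (worst, pos, alt)
            if best is not None:
                codons[best[1]] = best[2]
                changed = True
        if not changed:
            break
    return "".join(codons)
```

Because only synonymous codons are substituted, the output encodes the same amino acid sequence as the input. A pair with no acceptable synonymous alternative is left in place, so the result should be re-scored before use.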
- This example describes optimization of a DNA sequence encoding TrCBH-II for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
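The observed and expected counts can be sketched as follows. The sketch reads "used randomly" as meaning that adjacent codons are independent given the overall codon frequencies in the data set; that interpretation, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def codon_pair_counts(coding_sequences):
    """Count codons and adjacent codon pairs across in-frame coding sequences."""
    codon_counts = Counter()
    pair_counts = Counter()
    for cds in coding_sequences:
        if len(cds) % 3 != 0:
            raise ValueError("coding sequence length must be a multiple of 3")
        codons = [cds[i:i + 3].upper() for i in range(0, len(cds), 3)]
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))
    return codon_counts, pair_counts


def expected_pair_count(pair, codon_counts, total_pairs):
    """Expected occurrences of a pair if adjacent codons were independent."""
    total_codons = sum(codon_counts.values())
    freq_first = codon_counts[pair[0]] / total_codons
    freq_second = codon_counts[pair[1]] / total_codons
    return total_pairs * freq_first * freq_second
```

The observed count for a pair is `pair_counts[pair]`, and `total_pairs` is `sum(pair_counts.values())`. Comparing observed with expected counts gives the inputs to the chi-squared calculation.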
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3A.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 9) was found to encode a protein (SEQ ID NO: 10) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2).
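A check of this kind, confirming that a modified nucleotide sequence still encodes the original protein, can be sketched by translating both sequences and comparing the results. The sketch assumes the standard genetic code and in-frame sequences; the function names are hypothetical, and the short sequences in the usage note are toy inputs rather than any SEQ ID NO.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence into one-letter amino acids."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def encodes_same_protein(native: str, modified: str) -> bool:
    """True when both sequences translate to the identical protein."""
    return translate(native) == translate(modified)
```

For instance, `encodes_same_protein("ATGGCTAAATAA", "ATGGCCAAGTGA")` is true because every changed codon is synonymous, whereas changing GCT to GAT replaces alanine with aspartate and the check fails.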
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 9) encoding the TrCBH-II protein (SEQ ID NO: 10) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3B.
- This example describes optimization of a DNA sequence encoding TrCBH-II for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for P. pastoris.
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4A.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 15) was found to encode a protein (SEQ ID NO: 16) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 15) encoding the TrCBH-II protein (SEQ ID NO: 16) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4B.
- This example describes optimization of a DNA sequence encoding TrCBH-II for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5A.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the TrCBH-II protein (SEQ ID NO: 22) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5B.
- This example describes optimization of a DNA sequence encoding TrCBH-II for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for Z. mobilis.
- a graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6A.
- the nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 23) was found to encode a protein (SEQ ID NO: 24) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 23) encoding the TrCBH-II protein (SEQ ID NO: 24) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6B.
- Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
- the chi-squared value "chisq1" was generated from the expected and observed values determined.
- the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
- a graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 7A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 27) was found to encode a protein (SEQ ID NO: 28) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 27) encoding the LCC protein (SEQ ID NO: 28) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 7B.
- This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 8A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 33) was found to encode a protein (SEQ ID NO: 34) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 33) encoding the LCC protein (SEQ ID NO: 34) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 8B.
- This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris.
- a graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 9A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 39) was found to encode a protein (SEQ ID NO: 40) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 39) encoding the LCC protein (SEQ ID NO: 40) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 9B.
- This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 10A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 45) was found to encode a protein (SEQ ID NO: 46) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 45) encoding the LCC protein (SEQ ID NO: 46) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 10B.
- This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis.
- a graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 11A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 47) was found to encode a protein (SEQ ID NO: 48) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 47) encoding the LCC protein (SEQ ID NO: 48) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 11B.
- Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 8 and of the native LCC protein is examined by Western blot analysis.
- Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
- An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
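The quantities implied by the culture and induction steps can be worked out with a short calculation. This is a sketch under assumptions not stated in the text: the L-arabinose percentages are read as weight per volume, the 1:100 inoculum is approximated as 1/100 of the 5 ml culture volume, and volume changes on addition are ignored. The function names are hypothetical.

```python
from fractions import Fraction


def inoculum_ul(culture_ml: int, dilution: int = 100) -> Fraction:
    """Microlitres of overnight culture for a 1:dilution inoculation."""
    return Fraction(culture_ml * 1000, dilution)


def mass_ug_for_percent_wv(percent: str, volume_ml: int) -> Fraction:
    """Micrograms of solute giving the stated % (w/v) in the stated volume.

    1% (w/v) is 1 g per 100 ml, i.e. 10,000 micrograms per ml.
    The percentage is passed as a string so the arithmetic stays exact.
    """
    return Fraction(percent) * 10_000 * volume_ml


def mass_ug_for_concentration(ug_per_ml: int, volume_ml: int) -> int:
    """Micrograms of solute for a concentration given in micrograms per ml."""
    return ug_per_ml * volume_ml
```

Under these assumptions a 5 ml culture takes 50 μl of overnight culture and 500 μg of ampicillin, and induction at 0.002% or 0.02% L-arabinose corresponds to 100 μg or 1,000 μg of L-arabinose, respectively.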
- This example describes optimization of a DNA sequence encoding LIP for expression in yeast.
- the chi-squared value "chisq1" was generated from the expected and observed values determined.
- the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
- the nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for S. cerevisiae.
- a graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 12A.
- the nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 51) was found to encode a protein (SEQ ID NO: 52) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 51) encoding the LIP protein (SEQ ID NO: 52) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 12B.
- This example describes optimization of a DNA sequence encoding LIP for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 13A.
- the nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 57) was found to encode a protein (SEQ ID NO: 58) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 57) encoding the LIP protein (SEQ ID NO: 58) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 13B.
- This example describes optimization of a DNA sequence encoding LIP for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for P. pastoris.
- a graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 14A.
- the nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the LIP protein (SEQ ID NO: 64) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 14B.
- This example describes optimization of a DNA sequence encoding LIP for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 15A.
- the nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 69) was found to encode a protein (SEQ ID NO: 70) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 69) encoding the LIP protein (SEQ ID NO: 70) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 15B.
- This example describes optimization of a DNA sequence encoding LIP for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for Z. mobilis.
- a graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position.
- the graphical display is provided in Figure 16A.
- the nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 71) was found to encode a protein (SEQ ID NO: 72) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 71) encoding the LIP protein (SEQ ID NO: 72) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 16B.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LIP antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding MnP for expression in yeast.
- the chi-squared value "chisq1" was generated from the expected and observed values determined.
- the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
- a graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 17A.
- the nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 75) was found to encode a protein (SEQ ID NO: 76) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 75) encoding the MnP protein (SEQ ID NO: 76) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 17B.
- EXAMPLE 20
- This example describes optimization of a DNA sequence encoding MnP for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 18A.
- the nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 81) was found to encode a protein (SEQ ID NO: 82) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74).
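- A check of this kind can be sketched by translating both nucleotide sequences and comparing the encoded proteins residue by residue. The sketch assumes the standard genetic code and equal-length proteins; the function names and the short test sequences are hypothetical and are not the sequences of the Sequence Listing.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence; '*' marks a stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def percent_identity(native_cds, modified_cds):
    """Percent of positions at which the two encoded proteins agree."""
    native, modified = translate(native_cds), translate(modified_cds)
    if len(native) != len(modified):
        raise ValueError("proteins differ in length; an alignment would be needed")
    matches = sum(a == b for a, b in zip(native, modified))
    return 100.0 * matches / len(native)


# Toy sequences: synonymous changes only, then one non-synonymous change.
same = percent_identity("ATGGCTGAATAA", "ATGGCCGAGTAA")
different = percent_identity("ATGGCTGAATAA", "ATGGCTGATTAA")
```

A modified sequence that differs from the native sequence only by synonymous codon substitutions gives 100% identity under this check.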
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 81) encoding the MnP protein (SEQ ID NO: 82) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 18B.
- This example describes optimization of a DNA sequence encoding MnP for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for P. pastoris.
- A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 19A.
- the nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 87) was found to encode a protein (SEQ ID NO: 88) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74).
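- One way to remove codon pairs with z scores above 3 while keeping the encoded protein unchanged is a greedy synonymous substitution, sketched below. This heuristic is an assumption for illustration and is not presented as the method used to obtain the modified sequences; `remove_high_z_pairs`, the z score table, and the default z score of 0 for pairs missing from the table are all hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def remove_high_z_pairs(codons, z_table, threshold=3.0):
    """Left-to-right pass: replace the second codon of each flagged pair
    with a synonym that keeps both neighbouring pairs at or below threshold."""
    codons = list(codons)

    def z(first, second):
        return z_table.get((first, second), 0.0)  # unknown pairs assumed neutral

    for i in range(len(codons) - 1):
        if z(codons[i], codons[i + 1]) <= threshold:
            continue
        following = codons[i + 2] if i + 2 < len(codons) else None
        for alt in SYNONYMS[CODON_TABLE[codons[i + 1]]]:
            if z(codons[i], alt) > threshold:
                continue
            if following is not None and z(alt, following) > threshold:
                continue
            codons[i + 1] = alt
            break  # a pair with no acceptable synonym is left unchanged
    return codons


native = ["ATG", "GCT", "GAA", "TAA"]
z_table = {("GCT", "GAA"): 3.6, ("GCT", "GAG"): 0.2}  # hypothetical z scores
fixed = remove_high_z_pairs(native, z_table)
```

Because only synonymous codons are substituted, the translated protein is unchanged. Pairs whose second codon has no acceptable synonym (for example Met or Trp codons) remain flagged and would need a different strategy, such as changing the first codon of the pair.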
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 87) encoding the MnP protein (SEQ ID NO: 88) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 19B.
- This example describes optimization of a DNA sequence encoding MnP for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for K. lactis.
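- The text does not specify how codon usage was optimized. One common strategy, sketched here purely as an assumption, is to back-translate the protein using the most frequently used host codon for each residue. The function name and the usage fractions below are hypothetical and are not measured K. lactis values.

```python
def optimize_codon_usage(protein, usage):
    """Back-translate a protein with the most frequent host codon per residue.

    usage maps a one-letter amino acid code to {codon: relative frequency}.
    """
    best = {aa: max(freqs, key=freqs.get) for aa, freqs in usage.items()}
    missing = sorted(set(protein) - set(best))
    if missing:
        raise ValueError(f"no codon usage data for residues: {missing}")
    return "".join(best[aa] for aa in protein)


usage = {  # hypothetical fractions, for illustration only
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.40, "GCC": 0.22, "GCA": 0.28, "GCG": 0.10},
    "E": {"GAA": 0.70, "GAG": 0.30},
    "*": {"TAA": 0.50, "TAG": 0.20, "TGA": 0.30},
}
optimized = optimize_codon_usage("MAE*", usage)
```

A sequence optimized in this way considers single codons only, which is why the codon pair analysis in the subsequent steps is applied separately.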
- A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 20A.
- the nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 93) was found to encode a protein (SEQ ID NO: 94) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 93) encoding the MnP protein (SEQ ID NO: 94) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 20B.
- This example describes optimization of a DNA sequence encoding MnP for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for Z. mobilis.
- A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 21A.
- the nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 95) was found to encode a protein (SEQ ID NO: 96) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 95) encoding the MnP protein (SEQ ID NO: 96) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 21B.
- Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 20 and of native MnP protein is examined by Western blot analysis.
- Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) Φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
- An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate-buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-MnP antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
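- The quantities implied by the culture and induction conditions can be worked out with a short calculation. The sketch assumes that "1:100" means one part overnight culture per 100 parts final volume and that the L-arabinose percentages are weight per volume (grams per 100 ml); neither convention is stated in the text, and the function name is hypothetical.

```python
def culture_amounts(volume_ml=5.0, dilution=100, amp_ug_per_ml=100.0,
                    arabinose_percent=(0.002, 0.02)):
    """Return (inoculum in µl, ampicillin in µg, {percent: arabinose in µg})."""
    inoculum_ul = volume_ml * 1000 / dilution
    ampicillin_ug = amp_ug_per_ml * volume_ml
    # Assumed % (w/v): 1% = 1 g per 100 ml = 10,000 µg per ml.
    arabinose_ug = {
        percent: round(percent * 10_000 * volume_ml, 6)
        for percent in arabinose_percent
    }
    return inoculum_ul, ampicillin_ug, arabinose_ug


inoculum_ul, ampicillin_ug, arabinose_ug = culture_amounts()
```

Under these assumptions a 5 ml culture takes 50 µl of overnight culture and 500 µg of ampicillin, and induction requires 100 µg or 1000 µg of L-arabinose for the 0.002% and 0.02% conditions respectively.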
- This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
- The chi-squared value "chisq1" was generated from the expected and observed values determined.
- The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- Z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value so that it is reported as the number of standard deviations from the mean chisq3 value.
- a graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in N. crassa was prepared by plotting z scores of translational kinetics values for codon pair utilization in N. crassa as a function of codon pair position. The graphical display is provided in Figure 22.
- A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 23A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 99) was found to encode a protein (SEQ ID NO: 100) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 99) encoding the LCC protein (SEQ ID NO: 100) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 23B.
- This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 24A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 105) was found to encode a protein (SEQ ID NO: 106) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 105) encoding the LCC protein (SEQ ID NO: 106) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 24B.
- This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris.
- a graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 25A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 111) was found to encode a protein (SEQ ID NO: 112) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 111) encoding the LCC protein (SEQ ID NO: 112) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 25B.
- EXAMPLE 28
- This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 26A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 117) was found to encode a protein (SEQ ID NO: 118) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 117) encoding the LCC protein (SEQ ID NO: 118) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 26B.
- This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis.
- A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 27A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- The resulting nucleotide sequence (SEQ ID NO: 119) was found to encode a protein (SEQ ID NO: 120) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 119) encoding the LCC protein (SEQ ID NO: 120) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 27B.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate-buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
- The chi-squared value "chisq1" was generated from the expected and observed values determined.
- The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- Z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value so that it is reported as the number of standard deviations from the mean chisq3 value.
- The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
- A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 28A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 123) was found to encode a protein (SEQ ID NO: 124) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 123) encoding the LCC protein (SEQ ID NO: 124) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 28B.
- This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 29A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 129) was found to encode a protein (SEQ ID NO: 130) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 129) encoding the LCC protein (SEQ ID NO: 130) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 29B.
- This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris.
- A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 30A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 135) was found to encode a protein (SEQ ID NO: 136) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 135) encoding the LCC protein (SEQ ID NO: 136) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 30B.
- This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 31A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 141) was found to encode a protein (SEQ ID NO: 142) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 141) encoding the LCC protein (SEQ ID NO: 142) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 31B.
- This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis.
- A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 32A.
- The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 143) was found to encode a protein (SEQ ID NO: 144) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 143) encoding the LCC protein (SEQ ID NO: 144) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 32B.
- Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 32 and of native LCC protein is examined by Western blot analysis.
- Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) Φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
- An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate-buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
- The chi-squared value "chisq1" was generated from the expected and observed values determined.
- The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- Z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value so that it is reported as the number of standard deviations from the mean chisq3 value.
- A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 33A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 147) was found to encode a protein (SEQ ID NO: 148) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 147) encoding the LCC protein (SEQ ID NO: 148) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 33B.
- This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 34A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 153) was found to encode a protein (SEQ ID NO: 154) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 153) encoding the LCC protein (SEQ ID NO: 154) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 34B.
- This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris.
- A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 35A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 159) was found to encode a protein (SEQ ID NO: 160) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 159) encoding the LCC protein (SEQ ID NO: 160) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 35B.
- EXAMPLE 40
- This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 36A.
- the nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 165) was found to encode a protein (SEQ ID NO: 166) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 165) encoding the LCC protein (SEQ ID NO: 166) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 36B.
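The modification step in these examples (removing codon pairs with z scores greater than 3 while keeping 100% amino acid identity) can be illustrated with a simple greedy pass. This is a hedged sketch, not the method actually used to produce the disclosed sequences: the function `remove_high_z_pairs`, the `z_scores` lookup table, and the choice to minimise the worst z score among the neighbouring pairs are all assumptions made for illustration. The standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(codons):
    return "".join(CODON_TO_AA[c] for c in codons)


def remove_high_z_pairs(orf, z_scores, threshold=3.0):
    """Replace codon pairs whose z score exceeds `threshold` with synonymous pairs.

    `z_scores` maps (codon, codon) tuples to z scores for the host organism
    (hypothetical input); pairs missing from the table are treated as 0.
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    original_protein = translate(codons)

    def worst_local_z(i, left, right):
        # Worst z score among the pairs that touch positions i and i + 1.
        window = codons[max(i - 1, 0):i] + [left, right] + codons[i + 2:i + 3]
        return max(z_scores.get(p, 0.0) for p in zip(window, window[1:]))

    for i in range(len(codons) - 1):
        if z_scores.get((codons[i], codons[i + 1]), 0.0) <= threshold:
            continue
        options = product(SYNONYMS[CODON_TO_AA[codons[i]]],
                          SYNONYMS[CODON_TO_AA[codons[i + 1]]])
        codons[i], codons[i + 1] = min(
            options, key=lambda pair: worst_local_z(i, *pair))

    # The encoded protein must be unchanged (100% amino acid identity).
    assert translate(codons) == original_protein
    return "".join(codons)


if __name__ == "__main__":
    # Hypothetical z score for a single pair, for demonstration only.
    toy_z = {("GGC", "CAA"): 4.2}
    print(remove_high_z_pairs("ATGGGCCAATAA", toy_z))
```

A single greedy pass can leave a pair above the threshold when every synonymous alternative is also flagged; a fuller implementation would re-scan the sequence and report any remaining pairs.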
- This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis.
- A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position.
- The graphical display is provided in Figure 37A.
- The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 167) was found to encode a protein (SEQ ID NO: 168) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 167) encoding the LCC protein (SEQ ID NO: 168) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 37B.
- Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 38 and of native LCC protein is examined by Western blot analysis.
- Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
- An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the cultures are grown for 3 h at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate-buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding T. reesei cellobiohydrolase-I (TrCBH-I) for expression in yeast.
- The chi-squared value "chisq1" was generated from the expected and observed values determined.
- The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for S. cerevisiae.
- The DNA sequence encoding TrCBH-I (SEQ ID NO: 169) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
- a graphical display for the native gene (SEQ ID NO: 169) encoding the protein (SEQ ID NO: 170) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position.
- the graphical display is provided in Figure 38.
- a graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 39A.
- The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- The resulting nucleotide sequence (SEQ ID NO: 171) was found to encode a protein (SEQ ID NO: 172) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 171) encoding the TrCBH-I protein (SEQ ID NO: 172) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 39B.
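The z score normalization and the per-position profile that underlies the graphical displays in this example can be sketched as below. This is an illustrative sketch under stated assumptions: the `chisq3` input table is hypothetical, the population standard deviation is used (the text does not say whether a sample or population estimate was taken), and codon pairs missing from the table are given a default of 0.

```python
from statistics import mean, pstdev


def z_scores(chisq3):
    """Express each chisq3 value as standard deviations from the mean chisq3."""
    values = list(chisq3.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all chisq3 values are identical; z scores undefined")
    return {pair: (value - mu) / sigma for pair, value in chisq3.items()}


def codon_pair_profile(orf, z, default=0.0):
    """Return [(codon pair position, z score)] along an in-frame coding sequence.

    Positions are 1-based; the resulting list is what a plot of z score
    versus codon pair position would display.
    """
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    return [(position, z.get(pair, default))
            for position, pair in enumerate(zip(codons, codons[1:]), start=1)]


if __name__ == "__main__":
    # Hypothetical chisq3 values for three codon pairs.
    toy_chisq3 = {("AAA", "AAA"): 1.0, ("AAA", "CCC"): 2.0, ("CCC", "GGG"): 3.0}
    z = z_scores(toy_chisq3)
    for position, score in codon_pair_profile("AAAAAACCCGGG", z):
        print(position, round(score, 3))
```

Plotting the returned list with any charting tool, before and after modification, gives the paired "A" and "B" style displays referred to in the examples.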
- This example describes optimization of a DNA sequence encoding TrCBH-I for expression in bacteria.
- Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for E. coli.
- a graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 40A.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 173) was found to encode a protein (SEQ ID NO: 174) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 173) encoding the TrCBH-I protein (SEQ ID NO: 174) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 40B.
- EXAMPLE 45
- This example describes optimization of a DNA sequence encoding TrCBH-I for expression in P. pastoris.
- Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for P. pastoris.
- A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 41A.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 175) was found to encode a protein (SEQ ID NO: 176) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 175) encoding the TrCBH-I protein (SEQ ID NO: 176) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 41B.
- This example describes optimization of a DNA sequence encoding TrCBH-I for expression in K. lactis.
- Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for K. lactis.
- a graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 42A.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 177) was found to encode a protein (SEQ ID NO: 178) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 177) encoding the TrCBH-I protein (SEQ ID NO: 178) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 42B.
- This example describes optimization of a DNA sequence encoding TrCBH-I for expression in Z. mobilis.
- Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
- the nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for Z. mobilis.
- A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 43A.
- The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 179) was found to encode a protein (SEQ ID NO: 180) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170).
- A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 179) encoding the TrCBH-I protein (SEQ ID NO: 180) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 43B.
- Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the cultures are grown for 3 h at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate-buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
- This example describes optimization of a DNA sequence encoding T. aurantiacus endoglucanase (EGl) for expression in yeast.
- The chi-squared value "chisq1" was generated from the expected and observed values determined.
- The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
- The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
- z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
- the nucleotide sequence for the gene encoding the EGl protein was modified to optimize codon usage for S. cerevisiae.
- the DNA sequence encoding EGl (SEQ ID NO: 181) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
- a graphical display for the native gene (SEQ ID NO: 181) encoding the EGl protein (SEQ ID NO: 182) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
- the graphical display is provided in Figure 44A.
- the nucleotide sequence for the gene encoding the EGl protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
- the resulting nucleotide sequence (SEQ ID NO: 183) was found to encode a protein (SEQ ID NO: 184) with 100% amino acid sequence identity to wild-type EGl (SEQ ID NO: 182).
- a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 183) encoding the EGl protein (SEQ ID NO: 184) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 44B.
Abstract
Provided are polynucleotide sequences and synthetic genes encoding cellulose- and hemicellulose-degradation enzymes for expression in a host organism with improved and/or refined translational kinetics, and methods of making same. The resultant cellulose- and hemicellulose-degradation enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant cellulose- and hemicellulose-degradation enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant cellulose- and hemicellulose-degradation enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded and functional polypeptide expression in cases where inappropriate or excessive translational pauses cause expression of inactive, insoluble, aggregated or otherwise dysfunctional or minimally active cellulose- and hemicellulose-degradation enzyme.
Description
CELLULOSE- AND HEMICELLULOSE-DEGRADATION ENZYME-ENCODING NUCLEOTIDE SEQUENCES WITH REFINED TRANSLATIONAL KINETICS AND METHODS OF MAKING SAME
BACKGROUND
Field of the Invention
[0001] The present invention relates to refining the translational kinetics of an mRNA into polypeptide, and polypeptide-encoding nucleotide sequences which have refined translational properties.
Description of the Related Art
[0002] Recent innovations have shown that enzymes can be useful for industrial applications. However, production of large amounts of functional enzyme is often limited. Despite the burgeoning knowledge of expression systems and recombinant DNA, significant obstacles remain when one attempts to express a foreign or synthetic gene in a non-native host organism. Often, a synthetic gene, even when coupled with a strong promoter, is inefficiently translated and can produce a low yield of protein, a faulty protein, or in many cases, low yields of an inactive protein. The same is frequently true of exogenous genes foreign to the expression organism. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in structural and activity properties from the native protein expressed in the native organism.
[0003] The Saccharomyces yeasts have proven to be safe, effective and user-friendly microorganisms for large-scale production of industrial ethanol from glucose-based feedstocks. Recently, efforts have been made to use cellulosic biomass as feedstock for producing ethanol. However, the major fermentable sugars from hydrolysis of these feedstocks (such as rice and wheat straw, sugarcane bagasse, corn stover, corn fibre, softwood, hardwood and grasses) are cross-linked with lignin, a major component of such feedstocks. Lignin minimizes the accessibility of cellulose and hemicellulose to microbial enzymes. Hence, lignin is generally associated with reduced digestibility of the overall plant biomass. Thus, there is a need for recombinant yeast and other microorganisms that can degrade cellulose, hemicellulose and lignin. Many such pathways have been identified in organisms such as white-rot fungi.
[0004] Despite knowledge in the art related to expression of a foreign or synthetic gene in a host organism, many hydrolysis enzymes do not express well in host organisms such as Escherichia coli or Saccharomyces cerevisiae. As a result, large-scale production is limited. Therefore, there is a continued need for improved expression of these enzymes.
SUMMARY
[0005] Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation and poor expression. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pause structures coded for by specific di-codon nucleotide sequences in the open reading frame (ORF) can improve protein expression.
[0006] In accordance with the above, provided herein are hydrolysis enzyme-encoding nucleotide sequences with refined translational kinetics and methods of designing and synthesizing the same. In one embodiment is provided a hydrolysis enzyme-encoding nucleotide sequence, wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme polypeptide, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing original codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The resultant hydrolysis enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression products in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated enzyme.
[0007] Also provided herein are hydrolysis enzyme-encoding nucleotide sequences, wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme-encoding nucleotide sequence and is adapted for expression in a heterologous host organism, wherein at least 1, 2, or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. In some embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0008] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CCCTCT (nucleotides 463-468); GGCCAA (nucleotides 94-99); CAGTTT (nucleotides 565-570); GATATC (nucleotides 703-708); GTGGAA (nucleotides 691-696); GGATTT (nucleotides 1192-1197); GGTATT (nucleotides 1198-1203). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CCCTCT (nucleotides 463-468) replaced with CCTTCT; GGCCAA (nucleotides 94-99) replaced with GGTCAA; CAGTTT (nucleotides 565-570) replaced with CAATTT; GATATC (nucleotides 703-708) replaced with GACATT; GTGGAA (nucleotides 691-696) replaced with GTTGAA; GGATTT (nucleotides 1192-1197) replaced with GGTTTC; GGTATT (nucleotides 1198-1203) replaced with GGAATT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
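As a worked check of the replacements listed in this paragraph for S. cerevisiae, the following sketch confirms that each replacement codon pair encodes the same two amino acids as the original pair and that each stated nucleotide range spans exactly two in-frame codons. It assumes the standard genetic code and assumes that nucleotide 1 is the first base of the open reading frame of SEQ ID NO: 1; the names `REPLACEMENTS`, `translate_pair` and `check` are illustrative only.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# (original pair, first nucleotide, last nucleotide, replacement pair),
# transcribed from the paragraph above; positions are 1-based.
REPLACEMENTS = [
    ("CCCTCT", 463, 468, "CCTTCT"),
    ("GGCCAA", 94, 99, "GGTCAA"),
    ("CAGTTT", 565, 570, "CAATTT"),
    ("GATATC", 703, 708, "GACATT"),
    ("GTGGAA", 691, 696, "GTTGAA"),
    ("GGATTT", 1192, 1197, "GGTTTC"),
    ("GGTATT", 1198, 1203, "GGAATT"),
]


def translate_pair(pair):
    """Translate a six-nucleotide codon pair into two amino acid letters."""
    return CODON_TO_AA[pair[:3]] + CODON_TO_AA[pair[3:]]


def check(original, start, end, replacement):
    """Return (in_frame, synonymous) for one listed replacement."""
    in_frame = (start - 1) % 3 == 0 and end - start + 1 == 6
    synonymous = translate_pair(original) == translate_pair(replacement)
    return in_frame, synonymous


if __name__ == "__main__":
    for original, start, end, replacement in REPLACEMENTS:
        print(original, translate_pair(original), "->", replacement,
              translate_pair(replacement), check(original, start, end, replacement))
```

The same check can be applied to the replacement lists given for the other host organisms by substituting the corresponding tuples.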
[0009] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTCGGT (nucleotides 760-765); ATTGCC (nucleotides 631-636); GACAGC (nucleotides 1285-1290); GTCTGG (nucleotides 88-93); GTCTGG (nucleotides 1246-1251); TTGCTG (nucleotides 1231-1236); GTGGTG (nucleotides 571-576); ACGCTG (nucleotides 22-27); ACGCTG (nucleotides 31-36); GACTGG (nucleotides 1168-1173); GCCGGA (nucleotides 559-564); CTGGTG (nucleotides 748-753). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTCGGT (nucleotides 760-765) replaced with CTGGGT; ATTGCC (nucleotides 631-636) replaced with ATTGCG; GACAGC (nucleotides 1285-1290) replaced with GACTCT; GTCTGG (nucleotides 88-93) replaced with GTTTGG; GTCTGG (nucleotides 1246-1251) replaced with GTTTGG; TTGCTG (nucleotides 1231-1236) replaced with CTGCTG; GTGGTG (nucleotides 571-576) replaced with GTTGTT; ACGCTG (nucleotides 22-27) replaced with ACCCTC; ACGCTG (nucleotides 31-36) replaced with ACCCTG; GACTGG (nucleotides 1168-1173) replaced with GATTGG; GCCGGA (nucleotides 559-564) replaced with GCGGGC; CTGGTG (nucleotides 748-753) replaced with CTGGTT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0010] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CAGTTT (nucleotides 565-570); TTTGAC (nucleotides 1303-1308); TCGTTT (nucleotides 1240-1245); GGCCAA (nucleotides 94-99); AAGAAT (nucleotides 541-546); AAGAAT (nucleotides 934-939); GCCAAA (nucleotides 649-654); GTCAAG (nucleotides 1252-1257); GGTATT (nucleotides 1198-1203); ATCAAC (nucleotides 808-813); GGCCAT (nucleotides 865-870); CTTCCA (nucleotides 835-840); GATATC (nucleotides 703-708); TCGTTG (nucleotides 1228-1233). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CAGTTT (nucleotides 565-570) replaced with CAATTT; TTTGAC (nucleotides 1303-1308) replaced with TTTGAT; TCGTTT (nucleotides 1240-1245) replaced with TCTTTT; GGCCAA (nucleotides 94-99) replaced with GGACAA; AAGAAT (nucleotides 541-546) replaced with AAAAAT; AAGAAT (nucleotides 934-939) replaced with AAAAAC; GCCAAA (nucleotides 649-654) replaced with GCTAAA; GTCAAG (nucleotides 1252-1257) replaced with GTTAAA; GGTATT (nucleotides 1198-1203) replaced with GGAATC; ATCAAC (nucleotides 808-813) replaced with ATTAAT; GGCCAT (nucleotides 865-870) replaced with GGACAC; CTTCCA (nucleotides 835-840) replaced with TTGCCT; GATATC (nucleotides 703-708) replaced with GATATA; TCGTTG (nucleotides 1228-1233) replaced with TCATTG. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0011] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGCCAA (nucleotides 94-99); CAGTTT (nucleotides 565-570); GATATC (nucleotides 703-708); TATTTG (nucleotides 853-858); GGCCAT (nucleotides 865-870); TCGTTG (nucleotides 1228-1233); TTTGTC (nucleotides 1243-1248); TTCCAA (nucleotides 1363-1368). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGCCAA (nucleotides 94-99) replaced with GGTCAA; CAGTTT (nucleotides 565-570) replaced with CAATTC; GATATC (nucleotides 703-708) replaced with GACATT; TATTTG (nucleotides 853-858) replaced with TATTTA; GGCCAT (nucleotides 865-870) replaced with GGACAT; TCGTTG (nucleotides 1228-1233) replaced with TCTTTA; TTTGTC (nucleotides 1243-1248) replaced with TTCGTT; TTCCAA (nucleotides 1363-1368) replaced with TTCCAG. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0012] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GTGCCT (nucleotides 55-60); GCCAAT (nucleotides 370-375); GCTATT (nucleotides 406-411); GCCGGA (nucleotides 559-564); GCCAAT (nucleotides 778-783); TTGGCA (nucleotides 967-972); AAGCTG (nucleotides 1051-1056); GCTATT (nucleotides 1066-1071); GCCAAT (nucleotides 1084-1089); ACCGGA (nucleotides 1147-1152); ACCGGA (nucleotides 1189-1194); GGTATT (nucleotides 1198-1203); GACAGC (nucleotides 1285-1290); GATGCC (nucleotides 1327-1332); GCCTTG (nucleotides 1330-1335); CAGCTT (nucleotides 1381-1386). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GTGCCT (nucleotides 55-60) replaced with GTTCCG; GCCAAT (nucleotides 370-375) replaced with GCTAAT; GCTATT (nucleotides 406-411) replaced with GCCATT; GCCGGA (nucleotides 559-564) replaced with GCTGGT; GCCAAT (nucleotides 778-783) replaced with GCGAAT; TTGGCA (nucleotides 967-972) replaced with TTGGCT; AAGCTG (nucleotides 1051-1056) replaced with AAATTG; GCTATT (nucleotides 1066-1071) replaced with GCCATT; GCCAAT (nucleotides 1084-1089) replaced with GCTAAT; ACCGGA (nucleotides 1147-1152) replaced with ACCGGT; ACCGGA (nucleotides 1189-1194) replaced with ACAGGT; GGTATT (nucleotides 1198-1203) replaced with GGAATC; GACAGC (nucleotides 1285-1290) replaced with GATTCT; GATGCC (nucleotides 1327-1332) replaced with GACGCC; GCCTTG (nucleotides 1330-1335) replaced with GCCCTT; CAGCTT (nucleotides 1381-1386) replaced with CAGTTG. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0013] Also provided herein is a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
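The threshold test for a highly-overrepresented codon pair can be sketched as follows. This is a minimal illustration under stated assumptions: the translational kinetics values are invented for the example, the function name `highly_overrepresented` is hypothetical, and the population standard deviation is used because the text does not specify which estimator applies.

```python
from statistics import pstdev


def highly_overrepresented(values, k=3.0):
    """Return codon pairs whose value exceeds k standard deviations.

    values: mapping of codon pair -> translational kinetics value for one host.
    k: multiplier taken from the text (5, 3, 2.5 or 2).
    """
    sd = pstdev(list(values.values()))  # population SD is an assumption
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical values for illustration only; not measured data.
EXAMPLE_VALUES = {
    "GTGCCT": 6.0,
    "GCCAAT": 0.5,
    "GTTCCG": -0.5,
    "GCTAAT": -1.0,
    "AAGCTG": 1.0,
}

if __name__ == "__main__":
    for k in (5, 3, 2.5, 2):
        print(k, sorted(highly_overrepresented(EXAMPLE_VALUES, k)))
```

The comparison follows the wording above literally (value greater than k times the standard deviation); if the values are not centered on zero, a comparison against the mean plus k standard deviations might be preferred.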
[0014] Also provided herein is a cellobiohydrolase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0015] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0016] In some embodiments, provided herein is a system for degrading cellulose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the exo-1,4-β-D-glucanase retains at least 75% of the enzymatic activity of wild-type TrCBH-II (SEQ ID NO: 2) under normal physiological conditions.
[0017] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 27-62 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 27-62 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 27-62 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 27-62 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAC when expressed in the native organism.
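The z score criteria of this paragraph can be expressed as simple comparisons. The sketch below is illustrative only: the function names and the numeric z scores are hypothetical, and the percentage comparisons are assumed to be applied to positive reference z scores, since a percentage of a negative z score is not defined in the text.

```python
from statistics import mean, median


def within_percent(replacement_z, reference_z, percent):
    """True if replacement_z is no more than `percent` % of reference_z."""
    return replacement_z <= reference_z * percent / 100.0


def top_five_reference(wild_type_z, use_median=False):
    """Mean (or median) of the five highest wild-type z scores in a region."""
    top = sorted(wild_type_z, reverse=True)[:5]
    return median(top) if use_median else mean(top)


def all_below_cap(replacement_z, wild_type_z, percent, use_median=False):
    """True if no replacement z score exceeds `percent` % of the reference."""
    cap = top_five_reference(wild_type_z, use_median) * percent / 100.0
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores for the codon pairs of one region; not measured data.
WILD_TYPE_Z = [4.0, 3.0, 2.0, 1.0, 5.0, 0.5, -1.0]

if __name__ == "__main__":
    print(top_five_reference(WILD_TYPE_Z))
    for percent in (400, 300, 200, 150, 100):
        print(percent, all_below_cap([3.5], WILD_TYPE_Z, percent))
```

In this sketch the wild-type z scores would be computed for the native organism and the replacement z scores for the heterologous host, as the paragraph requires; how those z scores are derived is outside the scope of the example.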
[0018] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 107-471 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 107-471 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 107-471 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 107-471 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GCAAAG when expressed in the native organism.
[0019] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 27-471 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 62-107 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 62-107 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TCTACT when expressed in the native organism.
[0020] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 1474 - 1479); TTGAAT (nucleotides 802 - 807); ATCAAG (nucleotides 1477 - 1482); GCCAAG (nucleotides 526 - 531). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 1474 - 1479) replaced with GATATA; TTGAAT (nucleotides 802 - 807) replaced with TTAAAT; ATCAAG (nucleotides 1477 - 1482) replaced with ATAAAA; GCCAAG (nucleotides 526 - 531) replaced with GCAAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
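Two of the codon pairs listed above, GATATC (nucleotides 1474 - 1479) and ATCAAG (nucleotides 1477 - 1482), share the codon at nucleotides 1477 - 1479, so their replacements must agree on that codon. The following Python sketch, offered as an illustration only, merges replacements by nucleotide position and reports any disagreement; the function name `merge_replacements` is hypothetical.

```python
def merge_replacements(replacements):
    """Merge codon pair replacements into a position -> base mapping.

    replacements: iterable of (start, new_pair), where start is the 1-based
    position of the first nucleotide of the replaced codon pair.
    Raises ValueError if two overlapping replacements disagree on a base.
    """
    merged = {}
    for start, new_pair in replacements:
        for offset, base in enumerate(new_pair):
            position = start + offset
            if merged.setdefault(position, base) != base:
                raise ValueError(f"conflicting bases at nucleotide {position}")
    return merged


# Overlapping replacements taken from paragraph [0020].
OVERLAPPING = [(1474, "GATATA"), (1477, "ATAAAA")]

if __name__ == "__main__":
    merged = merge_replacements(OVERLAPPING)
    # Nucleotides 1474-1482 after both replacements are applied.
    print("".join(merged[p] for p in sorted(merged)))
```

Here both replacements change the shared codon ATC to ATA, so the merge succeeds; a pair of replacements that assigned different bases to the shared positions would be rejected.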
[0021] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTCCTC (nucleotides 1405 - 1410); ATCCTC (nucleotides 892 - 897); TTCCAG (nucleotides 190 - 195); TTCCAG (nucleotides 265 - 270); GACAGC (nucleotides 1360 - 1365); TTCCCG (nucleotides 544 - 549); CAGGCG (nucleotides 457 - 462); GCGGCA (nucleotides 589 - 594); TTCCGC (nucleotides 1327 - 1332). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG; ATCCTC (nucleotides 892 - 897) replaced with ATCCTG; TTCCAG (nucleotides 190 - 195) replaced with TTCCAA; TTCCAG (nucleotides 265 - 270) replaced with TTTCAG; GACAGC (nucleotides 1360 - 1365) replaced with GATTCT; TTCCCG (nucleotides 544 - 549) replaced with TTCCCA; CAGGCG (nucleotides 457 - 462) replaced with CAAGCG; GCGGCA (nucleotides 589 - 594) replaced with GCGGCT; TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0022] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 1474 - 1479); ATCAAG (nucleotides 1477 - 1482); TTCAAC (nucleotides 1051 - 1056); ATCAAC (nucleotides 205 - 210); ATCAAC (nucleotides 571 - 576); ATCAAC (nucleotides 880 - 885); ATCAAC (nucleotides 1078 - 1083). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 1474 - 1479) replaced with GACATT; ATCAAG (nucleotides 1477 - 1482) replaced with ATTAAA; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; ATCAAC (nucleotides 205 - 210) replaced with ATTAAT; ATCAAC (nucleotides 571 - 576) replaced with ATTAAT; ATCAAC (nucleotides 880 - 885) replaced with ATTAAT; ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0023] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAGAAG (nucleotides 175 - 180); TTCCAT (nucleotides 349 - 354); GCCAAG (nucleotides 526 - 531); TTCCAT (nucleotides 1426 - 1431); GATATC (nucleotides 1474 - 1479). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAGAAG (nucleotides 175 - 180) replaced with AAAAAG; TTCCAT (nucleotides 349 - 354) replaced with TTTCAT; GCCAAG (nucleotides 526 - 531) replaced with GCCAAA; TTCCAT (nucleotides 1426 - 1431) replaced with TTCCAC; GATATC (nucleotides 1474 - 1479) replaced with GACATT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0024] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCCGGT (nucleotides 7 - 12); ATCGGG (nucleotides 64 - 69); CACAGC (nucleotides 385 - 390); GCCAAG (nucleotides 526 - 531); AAGCTG (nucleotides 529 - 534); CGCTAT (nucleotides 643 - 648); GTCGAT (nucleotides 727 - 732); AACAGC (nucleotides 739 - 744); GATGCC (nucleotides 916 - 921); GCACCG (nucleotides 940 - 945); GTGCCT (nucleotides 1000 - 1005); GTCGAT (nucleotides 1027 - 1032); GCAGGG (nucleotides 1165 - 1170); CACAGC (nucleotides 1192 - 1197); GACAGC (nucleotides 1360 - 1365). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCCGGT (nucleotides 7 - 12) replaced with TCTGGT; ATCGGG (nucleotides 64 - 69) replaced with ATTGGT; CACAGC (nucleotides 385 - 390) replaced with CATTCT; GCCAAG (nucleotides 526 - 531) replaced with GCGAAA; AAGCTG (nucleotides 529 - 534) replaced with AAATTG; CGCTAT (nucleotides 643 - 648) replaced with CGTTAT; GTCGAT (nucleotides 727 - 732) replaced with GTTGAT; AACAGC (nucleotides 739 - 744) replaced with AATTCT; GATGCC (nucleotides 916 - 921) replaced with GATGCA; GCACCG (nucleotides 940 - 945) replaced with GCTCCG; GTGCCT (nucleotides 1000 - 1005) replaced with GTCCCT; GTCGAT (nucleotides 1027 - 1032) replaced with GTTGAT; GCAGGG (nucleotides 1165 - 1170) replaced with GCTGGC; CACAGC (nucleotides 1192 - 1197) replaced with CATTCT; GACAGC (nucleotides 1360 - 1365) replaced with GACTCT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0025] Also provided herein is a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0026] Also provided herein is a laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0027] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0028] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 26) under normal physiological conditions.
[0029] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 28-152 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 28-152 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 28-152 when expressed in the native organism.
[0030] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 161-305 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 161-305 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 161-305 when expressed in the native organism.
[0031] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 364-493 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 364-493 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
[0032] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-28 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-28 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-28 when expressed in the native organism.
[0033] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 152-161 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 152-161 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 152-161 when expressed in the native organism.
[0034] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 305-364 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 305-364 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 305-364 when expressed in the native organism.
[0035] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 901 - 906); CTTTCT (nucleotides 19 - 24); GACCGT (nucleotides 547 - 552); TTCCCC (nucleotides 301 - 306); TTCCCC (nucleotides 730 - 735); TTCCCC (nucleotides 988 - 993); TTCCCC (nucleotides 1051 - 1056). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 901 - 906) replaced with TTGTCT; CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; GACCGT (nucleotides 547 - 552) replaced with GATAGA; TTCCCC (nucleotides 301 - 306) replaced with TTTCCA; TTCCCC (nucleotides 730 - 735) replaced with TTTCCA; TTCCCC (nucleotides 988 - 993) replaced with TTTCCA; TTCCCC (nucleotides 1051 - 1056) replaced with TTTCCA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0036] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 901 - 906); TTCCTC (nucleotides 700 - 705); CTCGAC (nucleotides 340 - 345); CTTTCT (nucleotides 19 - 24); TTCCAG (nucleotides 880 - 885); GTCTGG (nucleotides 595 - 600); TTCCCG (nucleotides 1042 - 1047); ATCGCC (nucleotides 229 - 234); ATCGCC (nucleotides 373 - 378). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 901 - 906) replaced with CTGTCT; TTCCTC (nucleotides 700 - 705) replaced with TTCTTG; CTCGAC (nucleotides 340 - 345) replaced with CTGGAC; CTTTCT (nucleotides 19 - 24) replaced with CTGTCT; TTCCAG (nucleotides 880 - 885) replaced with TTCCAA; GTCTGG (nucleotides 595 - 600) replaced with GTTTGG; TTCCCG (nucleotides 1042 - 1047) replaced with TTCCCA; ATCGCC (nucleotides 229 - 234) replaced with ATTGCG; ATCGCC (nucleotides 373 - 378) replaced with ATCGCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0037] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTCAAG (nucleotides 7 - 12); ATCAAC (nucleotides 922 - 927); GACGAA (nucleotides 343 - 348); CTTTCC (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTCAAG (nucleotides 7 - 12) replaced with TTTAAA; ATCAAC (nucleotides 922 - 927) replaced with ATTAAT; GACGAA (nucleotides 343 - 348) replaced with GATGAA; CTTTCC (nucleotides 901 - 906) replaced with TTGTCT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0038] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCT (nucleotides 19 - 24); TTTGTC (nucleotides 25 - 30); TTCCCC (nucleotides 301 - 306); GACCGT (nucleotides 547 - 552); TTCCCC (nucleotides 730 - 735); CTTTCC (nucleotides 901 - 906); TTCCCC (nucleotides 988 - 993); TTCCCC (nucleotides 1051 - 1056). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; TTTGTC (nucleotides 25 - 30) replaced with TTCGTT; TTCCCC (nucleotides 301 - 306) replaced with TTCCCT; GACCGT (nucleotides 547 - 552) replaced with GATAGA; TTCCCC (nucleotides 730 - 735) replaced with TTCCCT; CTTTCC (nucleotides 901 - 906) replaced with TTGTCT; TTCCCC (nucleotides 988 - 993) replaced with TTTCCT; TTCCCC (nucleotides 1051 - 1056) replaced with TTTCCA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0039] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCT (nucleotides 19 - 24); ACGGCT (nucleotides 184 - 189); CTGACC (nucleotides 211 - 216); GCCCGT (nucleotides 376 - 381); ATCGGT (nucleotides 424 - 429); CTGACC (nucleotides 604 - 609); AAGGCT (nucleotides 865 - 870); CTTTCC (nucleotides 901 - 906); CCCGGA (nucleotides 1063 - 1068). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCT (nucleotides 19 - 24) replaced with TTGTCT; ACGGCT (nucleotides 184 - 189) replaced with ACCGCT; CTGACC (nucleotides 211 - 216) replaced with TTGACC; GCCCGT (nucleotides 376 - 381) replaced with GCTCGT; ATCGGT (nucleotides 424 - 429) replaced with ATTGGA; CTGACC (nucleotides 604 - 609) replaced with TTGACA; AAGGCT (nucleotides 865 - 870) replaced with AAAGCC; CTTTCC (nucleotides 901 - 906) replaced with TTGTCT; CCCGGA (nucleotides 1063 - 1068) replaced with CCTGGT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0040] Also provided herein is a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
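The threshold test for a highly-overrepresented codon pair described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `highly_overrepresented`, the use of the population standard deviation, and the example values are all hypothetical; the source does not specify how the standard deviation is computed.

```python
from statistics import pstdev


def highly_overrepresented(values, k=2.0):
    """Return codon pairs whose translational kinetics value exceeds k standard deviations.

    `values` maps each codon pair (six-nucleotide string) to its translational
    kinetics value for one host organism. Population standard deviation is an
    assumption made for this sketch.
    """
    threshold = k * pstdev(list(values.values()))
    return sorted(pair for pair, value in values.items() if value > threshold)


# Made-up values for four codon pairs, for illustration only.
example_values = {"AAAAAA": 0.0, "CCCCCC": 1.0, "GGGGGG": -1.0, "TTTTTT": 4.0}
flagged_at_2 = highly_overrepresented(example_values, k=2.0)
flagged_at_2_5 = highly_overrepresented(example_values, k=2.5)
```

With these made-up values the standard deviation is about 1.87, so one pair exceeds the 2-times threshold and none exceeds the 2.5-times threshold, which shows how the choice among the recited multipliers (5, 3, 2.5 or 2) changes which pairs are candidates for replacement.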
[0041] Also provided herein is a lignin peroxidase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0042] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0043] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the lignin peroxidase retains at least 75% of the enzymatic activity of wild-type LIP (SEQ ID NO: 50) under normal physiological conditions.
[0044] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 46-287 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 46-287 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 46-287 when expressed in the native organism.
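The last condition above, in which no replacement exceeds a stated percentage of the mean or median of the five highest wild-type z scores, can be illustrated with a short Python sketch. The function name `within_top5_limit` and the example z scores are hypothetical, and the sketch assumes the reference value is positive so that scaling by a percentage is meaningful.

```python
from statistics import mean, median


def within_top5_limit(replacement_z, wild_type_z, percent=150.0, use_median=False):
    """Check that no replacement z score exceeds `percent` of the top-five reference.

    `wild_type_z` holds z scores of the wild type codon pairs in the native
    organism; `replacement_z` holds z scores of the replacement codon pairs
    in the heterologous host. The reference is the mean (or median) of the
    five highest wild-type z scores.
    """
    top_five = sorted(wild_type_z, reverse=True)[:5]
    reference = median(top_five) if use_median else mean(top_five)
    limit = reference * percent / 100.0
    return all(z <= limit for z in replacement_z)


# Made-up z scores, for illustration only.
wild_type_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ok_at_150 = within_top5_limit([5.9, 2.0], wild_type_scores, percent=150.0)
too_high_at_150 = within_top5_limit([6.1], wild_type_scores, percent=150.0)
ok_at_100 = within_top5_limit([5.9], wild_type_scores, percent=100.0)
```

Here the five highest wild-type scores average 4.0, so the 150% limit is 6.0 and the 100% limit is 4.0; tightening the percentage rejects replacements that the looser limit accepts.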
[0045] In some embodiments are provided a lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 1-46 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-46 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-46 when expressed in the native organism.
[0046] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTCCCC (nucleotides 130 - 135); TTCCCC (nucleotides 721 - 726); TTCCCC (nucleotides 979 - 984); TTCCCC (nucleotides 1033 - 1038); GCCAAG (nucleotides 247 - 252). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTCCCC (nucleotides 130 - 135) replaced with TTTCCG; TTCCCC (nucleotides 721 - 726) replaced with TTCCCA; TTCCCC (nucleotides 979 - 984) replaced with TTTCCG; TTCCCC (nucleotides 1033 - 1038) replaced with TTCCCA; GCCAAG (nucleotides 247 - 252) replaced with GCGAAG. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
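Where the same codon pair occurs at several positions, as with TTCCCC above, the in-frame occurrences can be located programmatically. The following Python sketch is illustrative only: the function name `find_codon_pair` and the toy sequences are hypothetical and not taken from SEQ ID NO: 73, and it assumes that the nucleotide ranges recited above are 1-based, inclusive and aligned to the reading frame.

```python
def find_codon_pair(seq, pair):
    """Return 1-based inclusive (start, end) ranges of in-frame occurrences of `pair`.

    Only positions aligned to the reading frame (0-based index divisible by 3)
    are examined, so an occurrence that straddles codon boundaries is ignored.
    """
    hits = []
    for i in range(0, len(seq) - 5, 3):
        if seq[i:i + 6] == pair:
            hits.append((i + 1, i + 6))
    return hits


# Hypothetical toy sequence: ATG TTC CCC GCA TTC CCC TAA
toy = "ATGTTCCCCGCATTCCCCTAA"
in_frame_hits = find_codon_pair(toy, "TTCCCC")

# Here TTCCCC appears only out of frame (ATT CCC CGA), so nothing is reported.
out_of_frame_hits = find_codon_pair("ATTCCCCGA", "TTCCCC")
```

Restricting the search to in-frame positions matters because the six nucleotides only constitute a codon pair when they are read as two consecutive codons.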
[0047] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ATTGCC (nucleotides 289 - 294); CAGGCG (nucleotides 358 - 363); CAGGCG (nucleotides 850 - 855); CAGGCG (nucleotides 1012 - 1017); CTCTCC (nucleotides 991 - 996); ATCGCC (nucleotides 244 - 249); ATCGCC (nucleotides 370 - 375); ATCGCC (nucleotides 610 - 615). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ATTGCC (nucleotides 289 - 294) replaced with ATCGCT; CAGGCG (nucleotides 358 - 363) replaced with CAGGCT; CAGGCG (nucleotides 850 - 855) replaced with CAGGCT; CAGGCG (nucleotides 1012 - 1017) replaced with CAGGCT; CTCTCC (nucleotides 991 - 996) replaced with CTGTCT; ATCGCC (nucleotides 244 - 249) replaced with ATTGCG; ATCGCC (nucleotides 370 - 375) replaced with ATCGCT; ATCGCC (nucleotides 610 - 615) replaced with ATTGCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0048] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 2 codon pairs to be replaced are as follows: TTCAAG (nucleotides 7 - 12); GACGAG (nucleotides 340 - 345); ACCAAG (nucleotides 532 - 537); GAGCTG (nucleotides 670 - 675); TCTCCC (nucleotides 757 - 762); GTCAAC (nucleotides 841 - 846); TTCAAG (nucleotides 871 - 876). In some such nucleotide sequences, at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 2 of the following codon pair replacements have been made: TTCAAG (nucleotides 7 - 12) replaced with TTTAAA; GACGAG (nucleotides 340 - 345) replaced with GATGAA; ACCAAG (nucleotides 532 - 537) replaced with ACTAAA; GAGCTG (nucleotides 670 - 675) replaced with GAATTG; TCTCCC (nucleotides 757 - 762) replaced with TCACCA; GTCAAC (nucleotides 841 - 846) replaced with GTTAAT; TTCAAG (nucleotides 871 - 876) replaced with TTTAAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0049] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 2 codon pairs to be replaced are as follows: TTCCCC (nucleotides 130 - 135); GCCAAG (nucleotides 247 - 252); TTCCCC (nucleotides 721 - 726); TTCCCC (nucleotides 979 - 984); TTCCCC (nucleotides 1033 - 1038). In some such nucleotide sequences, at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 2 of the following codon pair replacements have been made: TTCCCC (nucleotides 130 - 135) replaced with TTTCCA; GCCAAG (nucleotides 247 - 252) replaced with GCTAAA; TTCCCC (nucleotides 721 - 726) replaced with TTTCCA; TTCCCC (nucleotides 979 - 984) replaced with TTTCCA; TTCCCC (nucleotides 1033 - 1038) replaced with TTCCCT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0050] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 2 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 2 codon pairs to be replaced are as follows: GCCAAG (nucleotides 247 - 252); GCCGGT (nucleotides 412 - 417); ATCGGT (nucleotides 421 - 426); GATGCC (nucleotides 556 - 561); GGAACG (nucleotides 646 - 651); CCCGGA (nucleotides 1054 - 1059). In some such nucleotide sequences, at least 2 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 2 of the following codon pair replacements have been made: GCCAAG (nucleotides 247 - 252) replaced with GCGAAA; GCCGGT (nucleotides 412 - 417) replaced with GCTGGT; ATCGGT (nucleotides 421 - 426) replaced with ATAGGT; GATGCC (nucleotides 556 - 561) replaced with GATGCT; GGAACG (nucleotides 646 - 651) replaced with GGCACA; CCCGGA (nucleotides 1054 - 1059) replaced with CCTGGT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0051] Also provided herein is a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0052] Also provided herein is a Mn-dependent peroxidase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0053] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0054] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the Mn-dependent peroxidase retains at least 75% of the enzymatic activity of wild-type MnP (SEQ ID NO: 74) under normal physiological conditions.
[0055] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
[0056] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
[0057] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 45-284 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 45-284 when expressed in the native organism.
[0058] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
[0059] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
[0060] In some embodiments are provided a Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-45 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-45 when expressed in the native organism.
[0061] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGGTTC (nucleotides 1246 - 1251); GCAAGA (nucleotides 1834 - 1839); TTGAAC (nucleotides 1540 - 1545); TCTCCA (nucleotides 193 - 198); GACCGT (nucleotides 694 - 699); TTCCCC (nucleotides 1795 - 1800); GCCAAG (nucleotides 763 - 768); GCCAAG (nucleotides 1585 - 1590). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGGTTC (nucleotides 1246 - 1251) replaced with GGTTTT; GCAAGA (nucleotides 1834 - 1839) replaced with GCTAGA; TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT; TCTCCA (nucleotides 193 - 198) replaced with TCACCA; GACCGT (nucleotides 694 - 699) replaced with GATAGA; TTCCCC (nucleotides 1795 - 1800) replaced with TTTCCA; GCCAAG (nucleotides 763 - 768) replaced with GCTAAA; GCCAAG (nucleotides 1585 - 1590) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0062] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGGTG (nucleotides 877 - 882); CTCGAC (nucleotides 1240 - 1245); ATCCTC (nucleotides 1462 - 1467); CTCGGC (nucleotides 652 - 657); CTCGGC (nucleotides 952 - 957); GTCTGG (nucleotides 1252 - 1257); GACAGC (nucleotides 940 - 945); AGCCAG (nucleotides 1495 - 1500); TTCCCG (nucleotides 661 - 666); ATTGCC (nucleotides 16 - 21); ATTGCC (nucleotides 1651 - 1656); CTCGGT (nucleotides 58 - 63); CTCGGT (nucleotides 1465 - 1470); GCCTGG (nucleotides 1654 - 1659); TCGCTG (nucleotides 874 - 879); GTGATG (nucleotides 1312 - 1317); TTCCGC (nucleotides 1609 - 1614). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGGTG (nucleotides 877 - 882) replaced with CTGGTT; CTCGAC (nucleotides 1240 - 1245) replaced with CTGGAC; ATCCTC (nucleotides 1462 - 1467) replaced with ATCCTG; CTCGGC (nucleotides 652 - 657) replaced with CTGGGT; CTCGGC (nucleotides 952 - 957) replaced with CTGGGT; GTCTGG (nucleotides 1252 - 1257) replaced with GTTTGG; GACAGC (nucleotides 940 - 945) replaced with GACTCT; AGCCAG (nucleotides 1495 - 1500) replaced with TCTCAG; TTCCCG (nucleotides 661 - 666) replaced with TTCCCA; ATTGCC (nucleotides 16 - 21) replaced with ATCGCG; ATTGCC (nucleotides 1651 - 1656) replaced with ATCGCG; CTCGGT (nucleotides 58 - 63) replaced with CTGGGT; CTCGGT (nucleotides 1465 - 1470) replaced with CTGGGT; GCCTGG (nucleotides 1654 - 1659) replaced with GCGTGG; TCGCTG (nucleotides 874 - 879) replaced with AGCCTG; GTGATG (nucleotides 1312 - 1317) replaced with GTTATG; TTCCGC (nucleotides 1609 - 1614) replaced with TTTCGT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
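Several of the codon pairs listed for E. coli overlap by one codon (for example, nucleotides 1462 - 1467 and 1465 - 1470), so both replacements must assign the same codon to the shared position. The following Python sketch checks this under stated assumptions: the function name `shared_codon_conflicts` and the tuple layout are illustrative only and are not part of the specification.

```python
def shared_codon_conflicts(replacements):
    """Return codon start positions where overlapping pairs disagree.

    replacements: iterable of (wild_type_hexamer, start_nt, new_hexamer),
    where start_nt is the 1-based first nucleotide of the codon pair.
    """
    codon_at = {}   # codon start position -> (wild-type codon, new codon)
    conflicts = []
    for wild_type, start, new in replacements:
        for offset in (0, 3):  # first and second codon of the pair
            position = start + offset
            entry = (wild_type[offset:offset + 3], new[offset:offset + 3])
            previous = codon_at.setdefault(position, entry)
            if previous != entry:
                conflicts.append(position)
    return conflicts


# Overlapping codon pairs taken from the E. coli replacement list.
E_COLI_OVERLAPS = [
    ("ATCCTC", 1462, "ATCCTG"),
    ("CTCGGT", 1465, "CTGGGT"),
    ("TCGCTG", 874, "AGCCTG"),
    ("CTGGTG", 877, "CTGGTT"),
    ("ATTGCC", 1651, "ATCGCG"),
    ("GCCTGG", 1654, "GCGTGG"),
]

print(shared_codon_conflicts(E_COLI_OVERLAPS))
```

An empty result means that every codon shared by two listed pairs receives the same replacement codon from both entries.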
[0063] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAACTG (nucleotides 403 - 408); TTCAAC (nucleotides 202 - 207); TTCAAC (nucleotides 751 - 756); ATCAAC (nucleotides 208 - 213); ATCAAC (nucleotides 397 - 402); ATCAAC (nucleotides 616 - 621); ATCAAC (nucleotides 841 - 846); ATCAAC (nucleotides 1276 - 1281); ATCAAC (nucleotides 1282 - 1287); GTCAAG (nucleotides 1828 - 1833); GGGTTC (nucleotides 1246 - 1251); TTGAAC (nucleotides 1540 - 1545); TTTGAC (nucleotides 1513 - 1518). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAACTG (nucleotides 403 - 408) replaced with AAATTA; TTCAAC (nucleotides 202 - 207) replaced with TTTAAC; TTCAAC (nucleotides 751 - 756) replaced with TTTAAT; ATCAAC (nucleotides 208 - 213) replaced with ATTAAT; ATCAAC (nucleotides 397 - 402) replaced with ATTAAT; ATCAAC (nucleotides 616 - 621) replaced with ATTAAC; ATCAAC (nucleotides 841 - 846) replaced with ATTAAT; ATCAAC (nucleotides 1276 - 1281) replaced with ATTAAC; ATCAAC (nucleotides 1282 - 1287) replaced with ATTAAT; GTCAAG (nucleotides 1828 - 1833) replaced with GTTAAA; GGGTTC (nucleotides 1246 - 1251) replaced with GGATTT; TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT; TTTGAC (nucleotides 1513 - 1518) replaced with TTTGAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
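Whether a replacement codon pair encodes identical amino acids can be checked by translating both hexamers. The Python sketch below is illustrative only: it assumes the standard genetic code, the helper names `translate` and `is_silent` are hypothetical, and it does not evaluate conservative amino acid substitutions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string whose length is a multiple of 3."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(wild_type, replacement):
    """True if both codon pairs encode the same two amino acids."""
    return translate(wild_type) == translate(replacement)


# Two replacements from the P. pastoris list.
print(is_silent("AAACTG", "AAATTA"))
print(is_silent("GGGTTC", "GGATTT"))
```

Both example pairs translate to the same dipeptide before and after replacement (Lys-Leu and Gly-Phe, respectively).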
[0064] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GACCGT (nucleotides 694 - 699); GCCAAG (nucleotides 763 - 768); AAGAAG (nucleotides 820 - 825); TTCCAA (nucleotides 865 - 870); GGTACC (nucleotides 1048 - 1053); GGGTTC (nucleotides 1246 - 1251); GTGTTT (nucleotides 1510 - 1515); TTGAAC (nucleotides 1540 - 1545); GCCAAG (nucleotides 1585 - 1590); AAGAAG (nucleotides 1735 - 1740); TTCCCC (nucleotides 1795 - 1800). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GACCGT (nucleotides 694 - 699) replaced with GACAGA; GCCAAG (nucleotides 763 - 768) replaced with GCTAAA; AAGAAG (nucleotides 820 - 825) replaced with AAAAAG; TTCCAA (nucleotides 865 - 870) replaced with TTTCAG; GGTACC (nucleotides 1048 - 1053) replaced with GGAACT; GGGTTC (nucleotides 1246 - 1251) replaced with GGTTTT; GTGTTT (nucleotides 1510 - 1515) replaced with GTTTTC; TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT; GCCAAG (nucleotides 1585 - 1590) replaced with GCTAAA; AAGAAG (nucleotides 1735 - 1740) replaced with AAAAAG; TTCCCC (nucleotides 1795 - 1800) replaced with TTTCCA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
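Applying a set of codon pair replacements at stated nucleotide positions can be sketched as follows. This is a minimal illustration, not the method of the specification: the function name is hypothetical, and because SEQ ID NO: 97 is not reproduced here, the example uses a short toy sequence whose coordinates are invented for the demonstration. Positions are treated as 1-based and inclusive, as in the lists above.

```python
def apply_codon_pair_replacements(sequence, replacements):
    """Replace 6-nt codon pairs at 1-based start positions.

    replacements: iterable of (wild_type_hexamer, start_nt, new_hexamer).
    Each wild-type hexamer is verified against the unmodified sequence,
    so overlapping entries are checked against the original codons.
    """
    bases = list(sequence)
    for wild_type, start, new in replacements:
        begin = start - 1  # convert 1-based coordinate to 0-based index
        found = sequence[begin:begin + 6]
        if found != wild_type:
            raise ValueError(
                f"expected {wild_type} at nucleotide {start}, found {found}"
            )
        bases[begin:begin + 6] = new
    return "".join(bases)


# Toy sequence (hypothetical): start codon, two codon pairs, stop codon.
TOY = "ATG" + "GACCGT" + "GCCAAG" + "TAA"
edited = apply_codon_pair_replacements(
    TOY, [("GACCGT", 4, "GACAGA"), ("GCCAAG", 10, "GCTAAA")]
)
print(edited)
```

Verifying the wild-type hexamer before editing guards against off-by-one coordinate errors, which would otherwise silently corrupt the reading frame.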
[0065] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCAAG (nucleotides 763 - 768); GACAGC (nucleotides 940 - 945); AACAGC (nucleotides 1198 - 1203); GCCTTT (nucleotides 1414 - 1419); GCCAAG (nucleotides 1585 - 1590); GCCTTT (nucleotides 1741 - 1746). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCAAG (nucleotides 763 - 768) replaced with GCCAAA; GACAGC (nucleotides 940 - 945) replaced with GATTCT; AACAGC (nucleotides 1198 - 1203) replaced with AACTCT; GCCTTT (nucleotides 1414 - 1419) replaced with GCTTTC; GCCAAG (nucleotides 1585 - 1590) replaced with GCGAAA; GCCTTT (nucleotides 1741 - 1746) replaced with GCCTTC. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
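Replacement lists written in the form used above can be parsed and sanity-checked mechanically. The Python sketch below is an illustration under stated assumptions: the regular expression, the helper names, and the frame test (a pair spans six nucleotides and starts at a position of the form 3n + 1, assuming nucleotide 1 is the first base of the first codon) are not taken from the specification.

```python
import re

# Matches entries such as "GCCAAG (nucleotides 763 - 768) replaced with GCCAAA".
ENTRY = re.compile(
    r"([ACGT]{6}) \(nucleotides (\d+) - (\d+) ?\) replaced with ([ACGT]{6})"
)


def parse_replacements(text):
    """Return (wild_type, start, end, replacement) tuples found in text."""
    return [(wt, int(a), int(b), new) for wt, a, b, new in ENTRY.findall(text)]


def is_in_frame_pair(start, end):
    """True if start..end spans exactly two codons in reading frame 1."""
    return end - start == 5 and (start - 1) % 3 == 0


TEXT = (
    "GCCAAG (nucleotides 763 - 768) replaced with GCCAAA; "
    "GACAGC (nucleotides 940 - 945) replaced with GATTCT"
)
for wild_type, start, end, new in parse_replacements(TEXT):
    print(wild_type, start, end, new, is_in_frame_pair(start, end))
```

Comparing `len(parse_replacements(text))` with `text.count("replaced with")` is a simple way to detect entries whose coordinates were damaged in transcription, since such entries fail to match the pattern.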
[0066] Also provided herein is a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
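The threshold for a highly-overrepresented codon pair can be illustrated with a short Python sketch. The values below are hypothetical placeholders, not measured translational kinetics values, and the function name and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import pstdev


def highly_overrepresented(kinetics_values, multiple=3.0):
    """Return codon pairs whose value exceeds `multiple` standard deviations.

    kinetics_values: dict mapping codon pair -> translational kinetics value
    for one host organism. `multiple` corresponds to the 5, 3, 2.5 or 2
    recited in the text.
    """
    sd = pstdev(list(kinetics_values.values()))
    return {pair for pair, value in kinetics_values.items() if value > multiple * sd}


# Hypothetical values: one strongly overrepresented pair among nine neutral ones.
values = {"AAACTG": 6.0}
values.update({f"pair{i}": 0.0 for i in range(9)})

print(highly_overrepresented(values, multiple=3.0))
print(highly_overrepresented(values, multiple=5.0))
```

With these placeholder numbers the standard deviation is 1.8, so the pair valued at 6.0 exceeds the 3-fold threshold (5.4) but not the 5-fold threshold (9.0).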
[0067] Also provided herein is a laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0068] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0069] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 98) under normal physiological conditions.
[0070] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 90-212 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 90-212 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 90-212 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 90-212 of SEQ ID NO: 98 has a z score for
expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GTCAAC when expressed in the native organism.
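The amino acid ranges recited above can be related to nucleotide coordinates with simple arithmetic. The Python sketch below assumes that nucleotide 1 of SEQ ID NO: 97 is the first base of the codon for amino acid 1 of SEQ ID NO: 98; that assumption, and the helper names, are illustrative and not stated in the specification.

```python
def aa_range_to_nt(first_aa, last_aa):
    """Return the 1-based, inclusive nucleotide range encoding first_aa..last_aa."""
    return 3 * (first_aa - 1) + 1, 3 * last_aa


def pair_encodes_within(start_nt, first_aa, last_aa):
    """True if the codon pair starting at start_nt encodes two residues
    that both lie within first_aa..last_aa."""
    first_codon_aa = (start_nt - 1) // 3 + 1
    return first_aa <= first_codon_aa and first_codon_aa + 1 <= last_aa


print(aa_range_to_nt(90, 212))
print(pair_encodes_within(403, 90, 212))
print(pair_encodes_within(202, 90, 212))
```

Under this assumption, amino acids 90-212 correspond to nucleotides 268 - 636, so a codon pair at nucleotides 403 - 408 (amino acids 135-136) lies inside the region, whereas one at nucleotides 202 - 207 (amino acids 68-69) does not.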
[0071] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 216-367 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 216-367 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 216-367 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 216-367 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GCCGAC when expressed in the native organism.
[0072] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 426-570 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain
aspects, no replacement codon encoding amino acids 426-570 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 426-570 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 426-570 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TTCCGC when expressed in the native organism.
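The cap on replacement z scores described above can be expressed as a small check. The following Python sketch uses hypothetical z scores, assumes the reference value is positive, and uses a non-strict comparison ("no more than"); the function name and parameters are illustrative and not part of the specification.

```python
from statistics import mean, median


def within_cap(replacement_z, wild_type_z, percent=150, use_median=False):
    """True if no replacement z score exceeds `percent` of the reference.

    The reference is the mean (or median) of the five highest wild-type
    z scores in the region, as measured in the native organism.
    """
    top_five = sorted(wild_type_z, reverse=True)[:5]
    reference = median(top_five) if use_median else mean(top_five)
    cap = reference * percent / 100
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores for one region.
wild_type_scores = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
print(within_cap([1.0, 5.9], wild_type_scores, percent=150))
print(within_cap([6.5], wild_type_scores, percent=150))
```

Here the five highest wild-type scores average 4.0, so the 150% cap is 6.0; the first set of replacements stays under it and the second does not.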
[0073] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 1-90 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-90 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-90 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-90 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GGTGGT when expressed in the native organism.
[0074] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 212-216 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 212-216 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 212-216 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 212-216 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCCAAC when expressed in the native organism.
[0075] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 367-426 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 367-426 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 367-426 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 367-426 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CTCGAC when expressed in the native organism.
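Paragraphs [0070] to [0075] treat six regions of SEQ ID NO: 98 differently: in three regions pauses are reduced, and in three they are retained. The Python sketch below records those regions and the 75% retention test. It is illustrative only: the labels "reduce" and "preserve", the function names, and the z scores in the example are assumptions; only the residue ranges and the 75% figure come from the text.

```python
# Residue ranges of SEQ ID NO: 98 as recited in paragraphs [0070]-[0075].
REGION_POLICY = [
    ((90, 212), "reduce"),
    ((216, 367), "reduce"),
    ((426, 570), "reduce"),
    ((1, 90), "preserve"),
    ((212, 216), "preserve"),
    ((367, 426), "preserve"),
]


def policies_for(residue):
    """Return the set of policies whose range contains the residue.

    Boundary residues (e.g. 90 or 212) fall in two recited ranges.
    """
    return {
        policy
        for (first, last), policy in REGION_POLICY
        if first <= residue <= last
    }


def preserves_pause(wild_type_z_native, replacement_z_host, fraction=0.75):
    """True if the replacement z score is at least `fraction` of the
    wild-type z score (assumes a positive wild-type z score)."""
    return replacement_z_host >= fraction * wild_type_z_native


print(policies_for(150), policies_for(50), policies_for(90))
print(preserves_pause(4.0, 3.0))
```

Because the recited ranges share their end points, a residue such as 90 is reported under both policies; how such boundary residues should be handled is left open here.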
[0076] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 235 - 240); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); TTCCCC (nucleotides 1240 - 1245); ATCAAG (nucleotides 625 - 630); GCCAAG (nucleotides 529 - 534). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAA (nucleotides 235 - 240) replaced with TTAAAA; CTTTCT (nucleotides 670 - 675) replaced with TTGTCT; TTTGCC (nucleotides 778 - 783) replaced with TTTGCT; TTCCCC (nucleotides 1240 - 1245) replaced with TTTCCA; ATCAAG (nucleotides 625 - 630) replaced with ATTAAA; GCCAAG (nucleotides 529 - 534) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
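The "at least a 75% amino acid sequence identity" criterion can be illustrated for the simple case of two sequences of equal length compared position by position. This Python sketch is a simplification made for illustration: a real comparison would normally use a sequence alignment that allows gaps, the function name is hypothetical, and the short peptides are invented.

```python
def percent_identity(sequence_a, sequence_b):
    """Percent of positions with identical residues in two equal-length,
    pre-aligned amino acid sequences (no gap handling)."""
    if len(sequence_a) != len(sequence_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not sequence_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(sequence_a, sequence_b))
    return 100.0 * matches / len(sequence_a)


# Invented four-residue peptides differing at one position.
print(percent_identity("MKLV", "MKIV"))
```

A sequence in which only silent codon replacements were made would score 100% by this measure, since the encoded amino acids are unchanged.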
[0077] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTCCTC (nucleotides 1405 - 1410); CTCGAC (nucleotides 1432 - 1437); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); ATCCTC (nucleotides 1126 - 1131); ACGCTG (nucleotides 502 - 507); TTCCAG (nucleotides 10 - 15); TTCCAG (nucleotides 193 - 198); TTCCAG (nucleotides 268 - 273); GTGGTG (nucleotides 139 - 144); GTCAGC (nucleotides 106 - 111); GTCAGC (nucleotides 1339 - 1344); AGCCAG (nucleotides 814 - 819); GCCGGG (nucleotides 1291 - 1296); CAGGCG (nucleotides 1141 - 1146); CAGGCG (nucleotides 1501 - 1506); GGCGCA (nucleotides 910 - 915); TTCCGC (nucleotides 655 - 660); TTCCGC (nucleotides 1327 - 1332); TTCTGG (nucleotides 379 - 384); CTCTCC (nucleotides 397 - 402). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG; CTCGAC (nucleotides 1432 - 1437) replaced with CTGGAT; CTTTCT (nucleotides 670 - 675) replaced with CTGTCT; TTTGCC (nucleotides 778 - 783) replaced with TTCGCT; ATCCTC (nucleotides 1126 - 1131) replaced with ATTCTG; ACGCTG (nucleotides 502 - 507) replaced with ACCCTC; TTCCAG (nucleotides 10 - 15) replaced with TTTCAG; TTCCAG (nucleotides 193 - 198) replaced with TTCCAA; TTCCAG (nucleotides 268 - 273) replaced with TTCCAA; GTGGTG (nucleotides 139 - 144) replaced with GTTGTT; GTCAGC (nucleotides 106 - 111) replaced with GTTAGC; GTCAGC (nucleotides 1339 - 1344) replaced with GTGTCT; AGCCAG (nucleotides 814 - 819) replaced with TCTCAG; GCCGGG (nucleotides 1291 - 1296) replaced with GCTGGT; CAGGCG (nucleotides 1141 - 1146) replaced with CAAGCT; CAGGCG (nucleotides 1501 - 1506) replaced with CAGGCT; GGCGCA (nucleotides 910 - 915) replaced with GGTGCT; TTCCGC (nucleotides 655 - 660) replaced with TTTCGT; TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT; TTCTGG (nucleotides 379 - 384) replaced with TTTTGG; CTCTCC (nucleotides 397 - 402) replaced with CTGTCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0078] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ATCAAG (nucleotides 625 - 630); TTTGCC (nucleotides 778 - 783); TTGAAA (nucleotides 235 - 240); TTCAAC (nucleotides 1051 - 1056); TTCAAC (nucleotides 1057 - 1062); ATCAAC (nucleotides 739 - 744); ATCAAC (nucleotides 1078 - 1083); GGTATC (nucleotides 148 - 153). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ATCAAG (nucleotides 625 - 630) replaced with ATTAAA; TTTGCC (nucleotides 778 - 783) replaced with TTTGCA; TTGAAA (nucleotides 235 - 240)
replaced with TTAAAA; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; TTCAAC (nucleotides 1057 - 1062) replaced with TTTAAC; ATCAAC (nucleotides 739 - 744) replaced with ATTAAT; ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT; GGTATC (nucleotides 148 - 153) replaced with GGAATT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0079] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 148 - 153); TTGAAA (nucleotides 235 - 240); GCCAAG (nucleotides 529 - 534); TTCCCA (nucleotides 547 - 552); CTTTCT (nucleotides 670 - 675); TTTGCC (nucleotides 778 - 783); TTTGCT (nucleotides 871 - 876); TTTGTC (nucleotides 1093 - 1098); TTCCCC (nucleotides 1240 - 1245); TTTGCT (nucleotides 1444 - 1449). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 148 - 153) replaced with GGAATT; TTGAAA (nucleotides 235 - 240) replaced with TTAAAA; GCCAAG (nucleotides 529 - 534) replaced with GCTAAA; TTCCCA (nucleotides 547 - 552) replaced with TTCCCG; CTTTCT (nucleotides 670 - 675) replaced with CTTAGT; TTTGCC (nucleotides 778 - 783) replaced with TTCGCT; TTTGCT (nucleotides 871 - 876) replaced with TTCGCT; TTTGTC (nucleotides 1093 - 1098) replaced with TTCGTT; TTCCCC (nucleotides 1240 - 1245) replaced with TTTCCA; TTTGCT (nucleotides 1444 - 1449) replaced with TTCGCA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0080] In some embodiments are provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following:
GGTATC (nucleotides 148 - 153); GCAGGG (nucleotides 370 - 375); GCCAAG (nucleotides 529 - 534); ATCAAT (nucleotides 574 - 579); GCACCG (nucleotides 604 - 609); TTGGCA (nucleotides 616 - 621); ATCAAT (nucleotides 883 - 888); GTGCCT (nucleotides 1000 - 1005); GCGGCT (nucleotides 1144 - 1149); GCCAAT (nucleotides 1225 - 1230). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 148 - 153) replaced with GGCATT; GCAGGG (nucleotides 370 - 375) replaced with GCTGGA; GCCAAG (nucleotides 529 - 534) replaced with GCTAAA; ATCAAT (nucleotides 574 - 579) replaced with ATTAAT; GCACCG (nucleotides 604 - 609) replaced with GCCCCA; TTGGCA (nucleotides 616 - 621) replaced with TTGGCT; ATCAAT (nucleotides 883 - 888) replaced with ATAAAT; GTGCCT (nucleotides 1000 - 1005) replaced with GTACCA; GCGGCT (nucleotides 1144 - 1149) replaced with GCTGCC; GCCAAT (nucleotides 1225 - 1230) replaced with GCCAAC. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0081] Also provided herein is a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0082] Also provided herein is a laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0083] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0084] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 122) under normal physiological conditions.
[0085] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 29-153 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 29-153 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 29-153 when expressed in the native organism.
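The percentage limit described above, a cap computed from the five highest wild-type z scores, can be illustrated with a short Python sketch. It assumes the z scores are already available as plain positive numbers; the function names and the sample values are hypothetical and are not taken from the specification.

```python
from statistics import mean, median


def z_score_cap(wild_type_z, percent, use_median=False):
    """Return percent% of the mean (or median) of the five highest wild-type z scores."""
    top_five = sorted(wild_type_z, reverse=True)[:5]
    centre = median(top_five) if use_median else mean(top_five)
    return centre * percent / 100.0


def replacements_within_cap(replacement_z, wild_type_z, percent=150, use_median=False):
    """True when no replacement codon pair has a z score above the cap."""
    cap = z_score_cap(wild_type_z, percent, use_median)
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores for the wild-type codon pairs of one region (native organism).
wild_type = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical z scores for the replacement pairs in the heterologous host.
candidate = [5.9, 1.0]

print(z_score_cap(wild_type, 150))                    # cap at 150% of the top-five mean
print(replacements_within_cap(candidate, wild_type))  # checked against that cap
```

Lowering `percent` to 300, 200, 150 or 100 tightens the cap in the same way the alternative percentages in the paragraph do.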
[0086] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 162-306 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 162-306 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 162-306 when expressed in the native organism.
[0087] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 364-493 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 364-493 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
[0088] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 1-30 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-30 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-30 when expressed in the native organism.
[0089] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 153-162 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 153-162 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 153-162 when expressed in the native organism.
[0090] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 306-364 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 306-364 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 306-364 when expressed in the native organism.
[0091] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 397 - 402); TTGAAG (nucleotides 235 - 240); GGGTTC (nucleotides 868 - 873); ATCAAA (nucleotides 625 - 630); ACTTTG (nucleotides 502 - 507); GACCGT (nucleotides 187 - 192); GGCCAA (nucleotides 148 - 153); AGCGAT (nucleotides 1546 - 1551). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 397 - 402) replaced with CTGTCT; TTGAAG (nucleotides 235 - 240) replaced with CTGAAA; GGGTTC (nucleotides 868 - 873) replaced with GGTTTC; ATCAAA (nucleotides 625 - 630) replaced with ATCAAA; ACTTTG (nucleotides 502 - 507) replaced with ACCCTG; GACCGT (nucleotides 187 - 192) replaced with GACCGT; GGCCAA (nucleotides 148 - 153) replaced with GGTCAA; AGCGAT (nucleotides 1546 - 1551) replaced with TCTGAC. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
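Whether a listed replacement encodes identical amino acids can be checked by translating both hexamers. The following is a minimal Python sketch, assuming the standard genetic code; the helper names are hypothetical, and only the entries whose hexamer actually changes in the list above are included.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(original, replacement):
    """True when both codon pairs encode the same two amino acids."""
    return translate(original) == translate(replacement)


# (original, replacement) hexamers copied from the list above.
pairs = [
    ("CTTTCC", "CTGTCT"),
    ("TTGAAG", "CTGAAA"),
    ("GGGTTC", "GGTTTC"),
    ("ACTTTG", "ACCCTG"),
    ("GGCCAA", "GGTCAA"),
    ("AGCGAT", "TCTGAC"),
]

for original, replacement in pairs:
    print(original, "->", replacement, translate(original), is_silent(original, replacement))
```

A conservative amino acid substitution would fail this check by design; testing for one would need a substitution-class table, which is not sketched here.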
[0092] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCAGC (nucleotides 811 - 816); CTTTCC (nucleotides 397 - 402); TTCCTC (nucleotides 1405 - 1410); ATCCTC (nucleotides 895 - 900); TTCCAG (nucleotides 10 - 15); TTCCAG (nucleotides 193 - 198); TTCCAG (nucleotides 268 - 273); TTCCAG (nucleotides 1378 - 1383); CTCTCT (nucleotides 670 - 675); GTCAGC (nucleotides 106 - 111); GTCAGC (nucleotides 1339 - 1344); AGCCAG (nucleotides 814 - 819); TTCCCG (nucleotides 547 - 552); ATTGCC (nucleotides 169 - 174); GATCTC (nucleotides 1549 - 1554); CTCGGT (nucleotides 583 - 588); TTCCGC (nucleotides 655 - 660); TTCCGC (nucleotides 1327 - 1332); TTCTGG (nucleotides 379 - 384); CTCTCC (nucleotides 22 - 27). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCAGC (nucleotides 811 - 816) replaced with GCTTCT; CTTTCC (nucleotides 397 - 402) replaced with CTGTCT; TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG; ATCCTC (nucleotides 895 - 900) replaced with ATTCTG; TTCCAG (nucleotides 10 - 15) replaced with TTCCAA; TTCCAG (nucleotides 193 - 198) replaced with TTTCAG; TTCCAG (nucleotides 268 - 273) replaced with TTTCAG; TTCCAG (nucleotides 1378 - 1383) replaced with TTCCAA; CTCTCT (nucleotides 670 - 675) replaced with CTGTCT; GTCAGC (nucleotides 106 - 111) replaced with GTTAGC; GTCAGC (nucleotides 1339 - 1344) replaced with GTTTCG; AGCCAG (nucleotides 814 - 819) replaced with TCTCAG; TTCCCG (nucleotides 547 - 552) replaced with TTTCCG; ATTGCC (nucleotides 169 - 174) replaced with ATCGCG; GATCTC (nucleotides 1549 - 1554) replaced with GACCTG; CTCGGT (nucleotides 583 - 588) replaced with CTGGGT; TTCCGC (nucleotides 655 - 660) replaced with TTTCGT; TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT; TTCTGG (nucleotides 379 - 384) replaced with TTTTGG; CTCTCC (nucleotides 22 - 27) replaced with CTGTCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
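The position-based replacements listed above use 1-based, inclusive nucleotide coordinates that fall on codon boundaries. A minimal Python sketch of applying such a list follows. The function name is hypothetical, and the 27-nucleotide input is a made-up toy sequence, not SEQ ID NO: 145; only the two hexamers placed at nucleotides 10 - 15 and 22 - 27 mirror entries from the list.

```python
def apply_replacements(sequence, edits):
    """Apply codon pair replacements given as (start, end, original, replacement).

    Coordinates are 1-based and inclusive. Each edit is validated against the
    unmodified input so that a wrong coordinate is caught instead of silently
    corrupting the sequence.
    """
    result = list(sequence)
    for start, end, original, replacement in edits:
        if (start - 1) % 3 != 0 or end - start + 1 != 6:
            raise ValueError(f"{start}-{end} is not an in-frame codon pair")
        if len(replacement) != 6:
            raise ValueError(f"replacement {replacement!r} is not six nucleotides")
        if sequence[start - 1:end] != original:
            raise ValueError(f"expected {original} at {start}-{end}")
        result[start - 1:end] = replacement
    return "".join(result)


# Toy sequence (hypothetical): nucleotides 10-15 are TTCCAG, 22-27 are CTCTCC.
toy = "ATGAAAGCT" + "TTCCAG" + "GGTGGT" + "CTCTCC"

edits = [
    (10, 15, "TTCCAG", "TTCCAA"),
    (22, 27, "CTCTCC", "CTGTCT"),
]

optimized = apply_replacements(toy, edits)
print(optimized)
```

Because every edit is checked against the unmodified input, overlapping entries such as nucleotides 811 - 816 and 814 - 819 can both be applied; in the list above their replacements agree on the shared codon (TCT), so the order of application does not matter.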
[0093] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAACTG (nucleotides 532 - 537); TTCAAC (nucleotides 1051 - 1056); ATCAAC (nucleotides 307 - 312); ATCAAC (nucleotides 1078 - 1083); ATCAAA (nucleotides 625 - 630); GGCCGT (nucleotides 1006 - 1011); GGGTTC (nucleotides 868 - 873); GGCCAA (nucleotides 148 - 153); CTTTCC (nucleotides 397 - 402). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAACTG (nucleotides 532 - 537) replaced with AAATTG; TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT; ATCAAC (nucleotides 307 - 312) replaced with ATTAAT; ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT; ATCAAA (nucleotides 625 - 630) replaced with ATTAAA; GGCCGT (nucleotides 1006 - 1011) replaced with GGTAGA; GGGTTC (nucleotides 868 - 873) replaced with GGATTC; GGCCAA (nucleotides 148 - 153) replaced with GGACAA; CTTTCC (nucleotides 397 - 402) replaced with TTGTCT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0094] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGCCAA (nucleotides 148 - 153); GACCGT (nucleotides 187 - 192); TTGAAG (nucleotides 235 - 240); CTTTCC (nucleotides 397 - 402); ATCAAA (nucleotides 625 - 630); GGGTTC (nucleotides 868 - 873); GGCCGT (nucleotides 1006 - 1011); TTTGCT (nucleotides 1444 - 1449). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGCCAA (nucleotides 148 - 153) replaced with GGTCAA; GACCGT (nucleotides 187 - 192) replaced with GATAGA; TTGAAG (nucleotides 235 - 240) replaced with TTAAAA; CTTTCC (nucleotides 397 - 402) replaced with TTGTCT; ATCAAA (nucleotides 625 - 630) replaced with ATTAAA; GGGTTC (nucleotides 868 - 873) replaced with GGTTTC; GGCCGT (nucleotides 1006 - 1011) replaced with GGTAGA; TTTGCT (nucleotides 1444 - 1449) replaced with TTTGCG. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0095] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AGCCGT (nucleotides 124 - 129); GCCGGT (nucleotides 172 - 177); GGCCCC (nucleotides 295 - 300); TCCGGT (nucleotides 328 - 333); GCAGGG (nucleotides 370 - 375); CACAGC (nucleotides 388 - 393); CTCTAT (nucleotides 469 - 474); ACTTTG (nucleotides 502 - 507); ATCAAT (nucleotides 574 - 579); GCGGCT (nucleotides 607 - 612); GATGCC (nucleotides 808 - 813); GCCAAT (nucleotides 844 - 849); GCCGGT (nucleotides 874 - 879); GTGCCT (nucleotides 1000 - 1005); GCCAAT (nucleotides 1225 - 1230); GATGCC (nucleotides 1435 - 1440). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AGCCGT (nucleotides 124 - 129) replaced with TCTCGT; GCCGGT (nucleotides 172 - 177) replaced with GCTGGT; GGCCCC (nucleotides 295 - 300) replaced with GGACCT; TCCGGT (nucleotides 328 - 333) replaced with TCTGGT; GCAGGG (nucleotides 370 - 375) replaced with GCTGGT; CACAGC (nucleotides 388 - 393) replaced with CATTCT; CTCTAT (nucleotides 469 - 474) replaced with TTGTAT; ACTTTG (nucleotides 502 - 507) replaced with ACCTTG; ATCAAT (nucleotides 574 - 579) replaced with ATTAAT; GCGGCT (nucleotides 607 - 612) replaced with GCTGCT; GATGCC (nucleotides 808 - 813) replaced with GACGCC; GCCAAT (nucleotides 844 - 849) replaced with GCTAAT; GCCGGT (nucleotides 874 - 879) replaced with GCTGGT; GTGCCT (nucleotides 1000 - 1005) replaced with GTTCCT; GCCAAT (nucleotides 1225 - 1230) replaced with GCTAAC; GATGCC (nucleotides 1435 - 1440) replaced with GATGCT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0096] Also provided herein is a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
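The threshold test for a highly-overrepresented codon pair can be sketched in Python as a scan over adjacent in-frame codons. The lookup table, its values, the standard deviation, and the function names below are all hypothetical placeholders; how translational kinetics values are derived for a host is not shown, and treating pairs absent from the table as 0.0 is an assumption of this sketch.

```python
def codon_pairs(sequence):
    """Yield (1-based start nucleotide, hexamer) for each pair of adjacent in-frame codons."""
    for i in range(0, len(sequence) - 5, 3):
        yield i + 1, sequence[i:i + 6]


def highly_overrepresented(sequence, tk_values, host_sd, multiple=2.0):
    """List codon pairs whose translational kinetics value exceeds multiple * host_sd."""
    threshold = multiple * host_sd
    return [
        (start, pair)
        for start, pair in codon_pairs(sequence)
        if tk_values.get(pair, 0.0) > threshold  # missing pairs treated as 0.0 (assumption)
    ]


# Hypothetical toy coding sequence and hypothetical host-specific values.
toy = "ATGGCCAGCCAGTAA"
tk_values = {"ATGGCC": 0.3, "GCCAGC": 6.1, "AGCCAG": 4.2}
host_sd = 2.0

print(highly_overrepresented(toy, tk_values, host_sd, multiple=2.0))
print(highly_overrepresented(toy, tk_values, host_sd, multiple=3.0))
```

Raising `multiple` from 2 toward 5 narrows the set of flagged pairs, matching the alternative thresholds recited in the paragraph.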
[0097] Also provided herein is a laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis and Schizosaccharomyces pombe.
[0098] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0099] In some embodiments, provided herein is a system for metabolizing lignin, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the laccase retains at least 75% of the enzymatic activity of wild-type LCC (SEQ ID NO: 146) under normal physiological conditions.
[0100] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 29-153 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 29-153 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 29-153 when expressed in the native organism.
[0101] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 162-306 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 162-306 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 162-306 when expressed in the native organism.
[0102] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 364-493 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 364-493 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 364-493 when expressed in the native organism.
[0103] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 1-29 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-29 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-29 when expressed in the native organism.
[0104] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 153-162 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 153-162 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 153-162 when expressed in the native organism.
[0105] In some embodiments is provided a laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 306-364 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 306-364 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 306-364 when expressed in the native organism.
[0106] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTGAAC (nucleotides 421 - 426); GCCAAG (nucleotides 496 - 501); GATATC (nucleotides 643 - 648); AAGAAA (nucleotides 859 - 864); GCCAAG (nucleotides 1243 - 1248); ATCAAG (nucleotides 1264 - 1269); GGTATT (nucleotides 1411 - 1416). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 421 - 426) replaced with TTAAAT; GCCAAG (nucleotides 496 - 501) replaced with GCTAAA; GATATC (nucleotides 643 - 648) replaced with GACATT; AAGAAA (nucleotides 859 - 864) replaced with AAAAAG; GCCAAG (nucleotides 1243 - 1248) replaced with GCTAAG; ATCAAG (nucleotides 1264 - 1269) replaced with ATTAAA; GGTATT (nucleotides 1411 - 1416) replaced with GGAATA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0107] In some embodiments is provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CTCTCC (nucleotides 274 - 279); GACAGC (nucleotides 520 - 525); AGCCAG (nucleotides 523 - 528); GACTGG (nucleotides 787 - 792); TTCCAG (nucleotides 934 - 939); GCCAGC (nucleotides 1441 - 1446). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTCTCC (nucleotides 274 - 279) replaced with TTATCT; GACAGC (nucleotides 520 - 525) replaced with GATTCT; AGCCAG (nucleotides 523 - 528) replaced with TCTCAA; GACTGG (nucleotides 787 - 792) replaced with GATTGG; TTCCAG (nucleotides 934 - 939) replaced with TTCCAG; GCCAGC (nucleotides 1441 - 1446) replaced with GCTTCG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0108] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTGAAC (nucleotides 421 - 426 ); GATATC (nucleotides 643 - 648 ); AAGAAA (nucleotides 859 - 864 ); ATCAAC (nucleotides 901
- 906 ); TTCAAG (nucleotides 1057 - 1062 ); ATCAAG (nucleotides 1264 - 1269 ); GGTATT (nucleotides 1411 - 1416 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT; GATATC (nucleotides 643 - 648 ) replaced with GACATT; AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG; ATCAAC (nucleotides 901 - 906 ) replaced with ATTAAT; TTCAAG (nucleotides 1057 - 1062 ) replaced with TTTAAA; ATCAAG (nucleotides 1264 - 1269 ) replaced with ATTAAA; GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0109] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TTTGTC (nucleotides 286 - 291 ); TTGAAC (nucleotides 421 - 426 ); GCCAAG (nucleotides 496 - 501 ); GATATC (nucleotides 643 - 648 ); AAGAAA (nucleotides 859 - 864 ); AAGAAG (nucleotides 1060 - 1065 ); GCCAAG (nucleotides 1243 - 1248 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTTGTC (nucleotides 286 - 291 ) replaced with TTCGTT; TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT; GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAA; GATATC (nucleotides 643 - 648 ) replaced with GACATT; AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG; AAGAAG (nucleotides 1060 - 1065 ) replaced with AAAAAG; GCCAAG (nucleotides 1243 - 1248 ) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
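By way of non-limiting illustration, a percent amino acid sequence identity such as the 75% threshold recited above can be computed from a pairwise alignment as sketched below. The sketch assumes the two sequences have already been aligned to equal length with `-` as the gap character, and it counts gapped columns in the denominator; other conventions exist, and the names `percent_identity` and `meets_identity_threshold` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the columns of a pre-computed pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a, aligned_b, threshold=75.0):
    """True if the aligned sequences are at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy peptides (hypothetical), differing at one of eight positions.
print(percent_identity("MKTAYIAK", "MKTAYIAR"))
print(meets_identity_threshold("MKTA", "MK-A"))
```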
[0110] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: ACATGG (nucleotides 46 - 51 ); AACAGC (nucleotides 136 - 141 ); AACAGC (nucleotides 268 - 273 ); CTTTAC (nucleotides 325 - 330 ); GCCAAG (nucleotides 496 - 501 ); GACAGC (nucleotides 520 - 525 ); ATCAAT (nucleotides 550 - 555 ); CTCGAT (nucleotides 847 - 852 ); TCCGGT (nucleotides 1204
- 1209 ); GCCAAG (nucleotides 1243 - 1248 ); GGTATT (nucleotides 1411 - 1416 ); GGCCCC (nucleotides 1426 - 1431 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ACATGG (nucleotides 46 - 51 ) replaced with ACCTGG; AACAGC (nucleotides 136 - 141 ) replaced with AATAGT; AACAGC (nucleotides 268
- 273 ) replaced with AACTCC; CTTTAC (nucleotides 325 - 330 ) replaced with TTATAT; GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAG; GACAGC (nucleotides 520 - 525 ) replaced with GATAGC; ATCAAT (nucleotides 550 - 555 ) replaced with ATCAAC; CTCGAT (nucleotides 847 - 852 ) replaced with TTAGAT; TCCGGT (nucleotides 1204 - 1209 ) replaced with AGCGGT; GCCAAG (nucleotides 1243 - 1248 ) replaced with GCAAAG; GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATT; GGCCCC (nucleotides 1426 - 1431 ) replaced with GGTCCG. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0111] Also provided herein is a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
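By way of non-limiting illustration, the threshold recited above, under which a codon pair is highly-overrepresented when its translational kinetics value exceeds a chosen multiple of the standard deviation of such values for the host organism, can be sketched as follows. The values below are invented for the example and are not measured data; the use of the population standard deviation and the name `highly_overrepresented` are likewise assumptions.

```python
from statistics import pstdev


def highly_overrepresented(values, multiple=2.0):
    """Return codon pairs whose value exceeds `multiple` standard deviations.

    `values` maps a codon pair to its translational kinetics value in the
    host organism. The population standard deviation is an assumption here.
    """
    spread = pstdev(list(values.values()))
    return sorted(pair for pair, value in values.items()
                  if value > multiple * spread)


# Hypothetical values for illustration only.
toy_values = {
    "GCCAAG": -1.0,
    "GATATC": 1.0,
    "AAGAAA": -1.0,
    "ATCAAG": 1.0,
    "TTGAAC": 6.0,
}

print(highly_overrepresented(toy_values, multiple=2.0))
print(highly_overrepresented(toy_values, multiple=2.5))
```

With these toy values the standard deviation is about 2.56, so the pair valued at 6.0 is flagged at a multiple of 2 but not at a multiple of 2.5.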
[0112] Also provided herein is a cellobiohydrolase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause
therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0113] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0114] In some embodiments, provided herein is a system for degrading cellulose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the exo-1,4-β-D-glucanase retains at least 75% of the enzymatic activity of wild-type TrCBH-I (SEQ ID NO: 170) under normal physiological conditions.
[0115] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 465-493 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational
pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 465-493 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 465-493 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 465-493 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair ATTGGC when expressed in the native organism.
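By way of non-limiting illustration, the percentage criteria recited above can be sketched as follows: a reference is taken as the mean or median of the five highest wild-type z scores, and every replacement z score is required not to exceed a chosen percentage of that reference. The z scores shown are invented for the example, the function names are hypothetical, and the sketch assumes a positive reference value, since a percentage of a negative z score is ambiguous.

```python
from statistics import mean, median


def reference_z(wild_type_z_scores, use_median=False, top_n=5):
    """Mean or median of the `top_n` highest wild-type z scores."""
    top = sorted(wild_type_z_scores, reverse=True)[:top_n]
    return median(top) if use_median else mean(top)


def all_within_percent(replacement_z_scores, reference, percent):
    """True if no replacement z score exceeds `percent` percent of `reference`."""
    if reference <= 0:
        raise ValueError("this sketch assumes a positive reference z score")
    limit = reference * percent / 100.0
    return all(z <= limit for z in replacement_z_scores)


# Hypothetical z scores for wild-type codon pairs in the native organism.
toy_wild_type_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy_reference = reference_z(toy_wild_type_z)

print(toy_reference)
print(all_within_percent([2.0, 7.0], toy_reference, 150))
print(all_within_percent([8.0], toy_reference, 150))
```

The same helper covers the single-pair comparison by passing the z score of one wild-type codon pair as `reference`.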
[0116] In some embodiments are provided a cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 435-464 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 435-464 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 435-464 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 62-107 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CCTACC when expressed in the native organism.
[0117] In some embodiments are provided an endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid
sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CAGTTT (nucleotides 445 - 450 ); CAGTAC (nucleotides 571 - 576 ); CAGTAC (nucleotides 685 - 690 ); AAGGGC (nucleotides 793 - 798 ); GAGTTT (nucleotides 808 - 813 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CAGTTT (nucleotides 445 - 450 ) replaced with CAATTT; CAGTAC (nucleotides 571 - 576 ) replaced with CAATAT; CAGTAC (nucleotides 685 - 690 ) replaced with CAATAT; AAGGGC (nucleotides 793 - 798 ) replaced with AAGGGA; GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0118] In some embodiments are provided a endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CTCGGC (nucleotides 7 - 12 ); AGCCAG (nucleotides 142 - 147 ); CTGGCA (nucleotides 301 - 306 ); GATCTC (nucleotides 307 - 312 ); TTCCAG (nucleotides 415 - 420 ); TTCTGG (nucleotides 424 - 429 ); GCCGGA (nucleotides 556 - 561 ); GTCTGG (nucleotides 886 - 891 ); GCCGGG (nucleotides 913 - 918 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made:CTCGGC (nucleotides 7 - 12 ) replaced with CTGGGT; AGCCAG (nucleotides 142 - 147 ) replaced with AGCCAA; CTGGCA (nucleotides 301 - 306 ) replaced with CTCGCG; GATCTC (nucleotides 307 - 312 ) replaced with GACCTG; TTCCAG (nucleotides 415 - 420 ) replaced with TTCCAA; TTCTGG (nucleotides 424 - 429 ) replaced with TTTTGG; GCCGGA (nucleotides 556 - 561 ) replaced with GCGGGT; GTCTGG (nucleotides 886 - 891 ) replaced with GTTTGG; GCCGGG (nucleotides 913 -
918 ) replaced with GCAGGT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
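By way of non-limiting illustration, the "at least 3 of the following codon pairs" condition can be checked by counting how many listed positions differ between the wild-type and candidate sequences, as sketched below. The toy sequences and coordinates are hypothetical and do not reproduce SEQ ID NO: 181; the name `count_replaced_pairs` is likewise an assumption. The sketch counts differences only and does not itself confirm that a replacement is silent or conservative.

```python
def count_replaced_pairs(wild_type, candidate, listed_pairs):
    """Count listed codon pairs that differ between wild type and candidate.

    `listed_pairs` holds (start, wt_pair) tuples, where `start` is the
    1-based coordinate of the first nucleotide of the wild-type pair.
    """
    if len(wild_type) != len(candidate):
        raise ValueError("sequences must have equal length")
    replaced = 0
    for start, wt_pair in listed_pairs:
        index = start - 1
        if wild_type[index:index + 6] != wt_pair:
            raise ValueError(f"expected {wt_pair} at nucleotide {start}")
        if candidate[index:index + 6] != wt_pair:
            replaced += 1
    return replaced


# Toy data (hypothetical coordinates): four listed pairs placed back to back.
toy_wild_type = "ATGAAA" + "CTCGGC" + "AGCCAG" + "TTCTGG" + "GTCTGG"
toy_candidate = "ATGAAA" + "CTGGGT" + "AGCCAA" + "TTTTGG" + "GTCTGG"
toy_listed = [(7, "CTCGGC"), (13, "AGCCAG"), (19, "TTCTGG"), (25, "GTCTGG")]

replaced = count_replaced_pairs(toy_wild_type, toy_candidate, toy_listed)
print(replaced, replaced >= 3)
```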
[0119] In some embodiments are provided a endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTCT (nucleotides 10 - 15 ); ACCAAG (nucleotides 82 - 87 ); CTTCCA (nucleotides 151 - 156 ); GGCTCT (nucleotides 280 - 285 ); CAGTTT (nucleotides 445 - 450 ); CACGAT (nucleotides 493 - 498 ); AAGAAG (nucleotides 790
- 795 ); GAGTTT (nucleotides 808 - 813 ); CTTCCT (nucleotides 982 - 987 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGCTCT (nucleotides 10 - 15 ) replaced with GGATCT; ACCAAG (nucleotides 82 - 87 ) replaced with ACTAAA; CTTCCA (nucleotides 151 - 156 ) replaced with TTGCCA; GGCTCT (nucleotides 280 - 285 ) replaced with GGATCA; CAGTTT (nucleotides 445 - 450 ) replaced with CAATTC; CACGAT (nucleotides 493 - 498 ) replaced with CATGAT; AAGAAG (nucleotides 790 - 795 ) replaced with AAAAAG; GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT; CTTCCT (nucleotides 982 - 987 ) replaced with TTGCCA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0120] In some embodiments are provided a endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTCT (nucleotides 10 - 15 ); ACCAAG (nucleotides 82 - 87 ); CTTCCA (nucleotides 151 - 156 ); GGCTCT (nucleotides 280 - 285 ); CAGTTT (nucleotides 445 - 450 ); CACGAT (nucleotides 493 - 498 ); AAGAAG (nucleotides 790
- 795 ); GAGTTT (nucleotides 808 - 813 ); CTTCCT (nucleotides 982 - 987 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or
conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made:GGCTCT (nucleotides 10 - 15 ) replaced with GGATCT; ACCAAG (nucleotides 82 - 87 ) replaced with ACTAAA; CTTCCA (nucleotides 151 - 156 ) replaced with TTGCCA; GGCTCT (nucleotides 280 - 285 ) replaced with GGATCA; CAGTTT (nucleotides 445 - 450 ) replaced with CAATTC; CACGAT (nucleotides 493 - 498 ) replaced with CATGAT; AAGAAG (nucleotides 790 - 795 ) replaced with AAAAAG; GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT; CTTCCT (nucleotides 982 - 987 ) replaced with TTGCCA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0121] In some embodiments are provided a endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: TCCGGT (nucleotides 124 - 129 ); GTCGAT (nucleotides 358 - 363 ); GCCGGA (nucleotides 556 - 561 ); GGGGCA (nucleotides 604 - 609 ); GCATGG (nucleotides 607 - 612 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCCGGT (nucleotides 124 - 129 ) replaced with TCTGGT; GTCGAT (nucleotides 358 - 363 ) replaced with GTTGAT; GCCGGA (nucleotides 556 - 561 ) replaced with GCTGGT; GGGGCA (nucleotides 604 - 609 ) replaced with GGCGCG; GCATGG (nucleotides 607 - 612 ) replaced with GCGTGG. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0122] Also provided herein is an endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism
are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0123] Also provided herein is an endoglucanase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0124] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0125] In some embodiments, provided herein is a system for degrading cellulose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the endo-1,4-β-glucanase retains at least 75% of the enzymatic activity of wild-type endoglucanase (SEQ ID NO: 182) under normal physiological conditions.
[0126] In some embodiments are provided an endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 32-276 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 32-276 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 32-276 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 32-276 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair with the highest z score when expressed in the native organism.
[0127] In some embodiments are provided an endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 1-32 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-32 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-32 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-32 of SEQ ID NO: 182 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair with the highest z score when expressed in the native organism.
[0128] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: AGTGAC (nucleotides 58 - 63 ); AAGGGC (nucleotides 148 - 153 ); GCAAGA (nucleotides 172 - 177 ); GACCAA (nucleotides 406 - 411 ); AGCGGT (nucleotides 442 - 447 ); TTGAAT (nucleotides 493 - 498 ). In some such sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT; AAGGGC (nucleotides 148 - 153 ) replaced with AAAGGT; GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA; GACCAA (nucleotides 406 - 411 ) replaced with GATCAA; AGCGGT (nucleotides 442 - 447 ) replaced with TCTGGA; TTGAAT (nucleotides 493 - 498 ) replaced with TTAAAC. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0129] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTGG (nucleotides 25 - 30 ); CTGGAA (nucleotides 91 - 96 ); GGCGGT (nucleotides 127 - 132 ); GGCTGG (nucleotides 151 - 156 ); CTCGGC (nucleotides 352 - 357 ); TACTGG (nucleotides 412 - 417 ); CGCCAG (nucleotides 424 - 429 ); ACCAGC
(nucleotides 439 - 444 ); GCCTGG (nucleotides 475 - 480 ). In some such sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG; CTGGAA (nucleotides 91 - 96 ) replaced with CTGGAG; GGCGGT (nucleotides 127 - 132 ) replaced with GGCGGC; GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG; CTCGGC (nucleotides 352 - 357 ) replaced with CTGGGT; TACTGG (nucleotides 412 - 417 ) replaced with TATTGG; CGCCAG (nucleotides 424 - 429 ) replaced with CGTCAG; ACCAGC (nucleotides 439 - 444 ) replaced with ACCTCT; GCCTGG (nucleotides 475 - 480 ) replaced with GCGTGG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0130] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: CACGAT (nucleotides 31 - 36 ); AGTGAC (nucleotides 58 - 63 ); GAGTAT (nucleotides 259 - 264 ); AACTTT (nucleotides 277 - 282 ); GTCAAC (nucleotides 370 - 375 ); GTCAAC (nucleotides 499 - 504 ). In some such sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CACGAT (nucleotides 31 - 36 ) replaced with CATGAT; AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT; GAGTAT (nucleotides 259 - 264 ) replaced with GAATAT; AACTTT (nucleotides 277 - 282 ) replaced with AATTTC; GTCAAC (nucleotides 370 - 375 ) replaced with GTTAAT; GTCAAC (nucleotides 499 - 504 ) replaced with GTGAAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0131] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GGCTGG (nucleotides 25 - 30 ); GGCTGG (nucleotides 151 - 156 ); GCAAGA (nucleotides 172 - 177 ); GGTGTT (nucleotides 193 - 198 ); AACTTT (nucleotides 277 - 282 ); GACCAA (nucleotides 406 - 411 ); GGTACC (nucleotides 445 - 450 ); TTGAAT (nucleotides 493 - 498 ); ACCGTT (nucleotides 568 - 573 ). In some such sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG; GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG; GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA; GGTGTT (nucleotides 193 - 198 ) replaced with GGTGTT; AACTTT (nucleotides 277 - 282 ) replaced with AATTTC; GACCAA (nucleotides 406 - 411 ) replaced with GATCAA; GGTACC (nucleotides 445 - 450 ) replaced with GGTACA; TTGAAT (nucleotides 493 - 498 ) replaced with TTAAAT; ACCGTT (nucleotides 568 - 573 ) replaced with ACTGTT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0132] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof: GAAGGC (nucleotides 94 - 99 ); GCAAGA (nucleotides 172 - 177 ); AACAGC (nucleotides 214 - 219 ); ACCTAT (nucleotides 286 - 291 ); TCCGGT (nucleotides 301 - 306 ); GCAACG (nucleotides 529 - 534 ); GGCTAT (nucleotides 553 - 558 ). In some such sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGGC (nucleotides 94 - 99 ) replaced with GAAGGA; GCAAGA (nucleotides 172 - 177 ) replaced with GCTCGT; AACAGC (nucleotides 214 - 219 ) replaced with AATTCT; ACCTAT (nucleotides 286 - 291 ) replaced with ACGTAT; TCCGGT (nucleotides 301 - 306 ) replaced with TCTGGT; GCAACG (nucleotides 529 - 534 ) replaced with GCCACC; GGCTAT (nucleotides 553 - 558 ) replaced with GGTTAT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
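The codon pair replacements recited above can be checked mechanically. The following is a minimal sketch, not the applicants' method: it assumes the standard genetic code, treats the recited nucleotide ranges as 1-based and in frame, and uses a toy sequence rather than SEQ ID NO: 193, which is not reproduced here. The names `translate`, `replace_codon_pair` and `P_PASTORIS_REPLACEMENTS` are hypothetical, and only silent replacements are accepted (conservative amino acid substitutions are not modeled).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


def replace_codon_pair(seq, start, old, new):
    """Swap the codon pair at 1-based nucleotide `start`; only silent changes pass."""
    i = start - 1
    if i % 3 != 0:
        raise ValueError("codon pair must start on a codon boundary")
    if seq[i:i + 6] != old:
        raise ValueError(f"expected {old} at nucleotides {start}-{start + 5}")
    if translate(old) != translate(new):
        raise ValueError("replacement changes the encoded amino acids")
    return seq[:i] + new + seq[i + 6:]


# Replacements as recited in [0130]: (first nucleotide, original pair, replacement).
P_PASTORIS_REPLACEMENTS = [
    (31, "CACGAT", "CATGAT"), (58, "AGTGAC", "TCTGAT"), (259, "GAGTAT", "GAATAT"),
    (277, "AACTTT", "AATTTC"), (370, "GTCAAC", "GTTAAT"), (499, "GTCAAC", "GTGAAT"),
]

# Toy sequence (NOT SEQ ID NO: 193): ten GCT codons, then CACGAT at nucleotides 31-36.
toy = "GCT" * 10 + "CACGAT" + "TAA"
edited = replace_codon_pair(toy, 31, "CACGAT", "CATGAT")
```

Under these assumptions each of the six replacements recited in [0130] encodes the same two amino acids as the pair it replaces, so the edited toy sequence translates identically to the original.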
[0133] Also provided herein is a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly- overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
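The threshold in the preceding paragraph can be illustrated with a short sketch. It assumes that translational kinetics values for the host organism have already been computed for each codon pair, and it uses the population standard deviation; neither the function name `highly_overrepresented` nor the numbers below come from the specification, and the values are invented purely for illustration.

```python
import statistics


def highly_overrepresented(tk_values, k=2.0):
    """Return codon pairs whose value exceeds k standard deviations.

    tk_values maps a six-nucleotide codon pair to its translational
    kinetics value in the host organism (assumed precomputed).
    """
    sd = statistics.pstdev(tk_values.values())
    return {pair for pair, value in tk_values.items() if value > k * sd}


# Hypothetical values, for illustration only.
example_values = {"AAAAAA": 0.0, "CCCCCC": 1.0, "GGGGGG": -1.0, "TTTTTT": 10.0}
flagged_at_2 = highly_overrepresented(example_values, k=2.0)
flagged_at_2_5 = highly_overrepresented(example_values, k=2.5)
```

With these invented values the standard deviation is about 4.39, so TTTTTT is flagged at a multiplier of 2 but not at 2.5, which shows how the choice among the recited multipliers (5, 3, 2.5 or 2) changes which pairs are treated as highly overrepresented.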
[0134] Also provided herein is a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis and Schizosaccharomyces pombe.
[0135] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0136] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 31-221 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or
conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 31-221 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 31-221 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 31-221 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair with the highest z score when expressed in the native organism.
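The percentage limits in the preceding paragraph reduce to simple arithmetic on z scores. The sketch below is one possible reading, assuming that z scores for the wild type codon pairs (in the native organism) and for the replacement codon pairs (in the heterologous host) are already available as plain numbers. The function names `cap_from_top_five` and `within_cap` and all values are hypothetical.

```python
import statistics


def cap_from_top_five(wild_type_z, percent, use_median=False):
    """Cap = percent of the mean (or median) of the five highest wild type z scores."""
    top_five = sorted(wild_type_z, reverse=True)[:5]
    center = statistics.median(top_five) if use_median else statistics.mean(top_five)
    return center * percent / 100


def within_cap(replacement_z, wild_type_z, percent, use_median=False):
    """True if no replacement z score is more than the cap."""
    cap = cap_from_top_five(wild_type_z, percent, use_median)
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores, for illustration only.
wild_type = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cap_150 = cap_from_top_five(wild_type, 150)  # five highest are 6, 5, 4, 3, 2
```

Here the five highest wild type scores average 4.0, so a 150% limit gives a cap of 6.0; a replacement pair scoring 6.1 in the host would fall outside that aspect, while one scoring 6.0 would not.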
[0137] In some embodiments are provided a xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 1-31 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-31 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-31 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-31 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair with the highest z score when expressed in the native organism.
[0138] Also provided herein are isolated polynucleotides comprising any of the nucleotide sequences provided herein. Also provided herein are isolated polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203. Also provided herein are isolated polypeptides encoded by any of the nucleotide sequences provided herein, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194.
[0139] Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes any of the polynucleotides provided herein operably linked to an expression control sequence. Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides provided herein, each polynucleotide being operably linked to the same or different expression control sequences. Also provided herein are expression systems for degrading cellulose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
Also provided herein are expression systems for metabolizing lignin, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191. Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167. Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167. In some such systems, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some such systems, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme. In some such systems, each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194) under normal physiological conditions.
[0140] Also provided herein are cells comprising any of the polynucleotides provided herein. In some such cells, the cell expresses the polypeptide encoded by said polynucleotide.
[0141] Also provided herein are methods of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with any of the polynucleotides provided herein under conditions that permit the polynucleotide to be introduced into the host cell.
[0142] Also provided herein are methods of expressing a polypeptide comprising: providing a cell comprising any of the polynucleotides provided herein; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
[0143] Also provided herein are methods of hydrolyzing a carbohydrate comprising: providing a carbohydrate comprising at least one glycosidic bond; providing a polypeptide encoded by any of the polynucleotides provided herein; and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one covalent bond of said carbohydrate, whereby at least one covalent bond of said carbohydrate is hydrolyzed.
[0144] Also provided herein are integrable polynucleotides for modifying an endogenous nucleotide sequence in a cell comprising: a removable selectable marker cassette comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site, wherein said removable selectable marker cassette is flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence. Some such integrable polynucleotides further comprise a heterologous nucleic acid flanked by said 5' nucleic acid sequence with homology to an endogenous sequence and said 3' nucleic acid sequence with homology to an endogenous sequence. In some such integrable polynucleotides, the heterologous nucleic acid comprises a sequence encoding a polypeptide. In some such integrable polynucleotides, the heterologous nucleic acid comprises a regulatory sequence. In some such integrable polynucleotides, the sequence encoding a polypeptide is operatively linked to said regulatory sequence. In some such integrable polynucleotides, the regulatory sequence comprises a promoter sequence and a terminator sequence. In some such integrable polynucleotides, the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some embodiments, the heterologous nucleic acid encodes a polypeptide that degrades cellulose and/or lignin. 
In some such integrable polynucleotides, the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203. In some such integrable polynucleotides, the selectable marker can be selected for or can be selected against. In some such integrable polynucleotides, the selectable marker can be selected for and can be selected against. In some such integrable polynucleotides, the selectable marker is selected from the group consisting of URA3, TRP1, CAN1, KlURA3, CYH2, LYS2 and MET15. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises a genomic repetitive element. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises Ty1 DNA or Ty3 DNA. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a loxP sequence. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a frt sequence. In some such integrable polynucleotides, the integrable polynucleotide comprises a PCR product.
[0145] Also provided herein are cells comprising any of the integrable polynucleotides provided herein. Some such cells comprise a gene encoding a site-specific recombinase. In some such cells, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. Some such cells are S. cerevisiae cells.
[0146] Also provided herein are methods of modifying an endogenous sequence in a cell comprising: providing a cell with at least one of the integrable polynucleotides provided; and selecting for a cell comprising said at least one integrable polynucleotide integrated therein to the genome of the cell. Some such methods further comprise excising at least one selectable marker from said at least one cell comprising said at least one integrable polynucleotide integrated therein; and selecting for a cell in which said at least one selectable marker has been excised. In some such methods, the excising said selectable marker comprises providing said cell with a site-specific recombinase. In some such methods, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. In some such methods, the site-specific recombinase is expressed from an endogenous gene or from a heterologous nucleic acid. In some such methods, the providing a cell with at least one integrable polynucleotide comprises providing a cell with a plurality of integrable polynucleotides, wherein said plurality of integrable polynucleotides comprises at least a first integrable polynucleotide comprising a first selectable marker and a second integrable polynucleotide comprising a second selectable marker. In some such methods, the plurality comprises 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides. Also provided are cells comprising an endogenous sequence modified by any of such methods provided herein. In some such cells, the modified endogenous sequence comprises an insertion, a deletion or a mutation.
[0147] Also provided are cells comprising a removable selectable marker cassette integrated into said cell comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site; and a heterologous nucleic acid integrated into said cell, wherein said removable selectable marker is juxtaposed to said heterologous nucleic acid. Also provided are cells comprising: a heterologous nucleic acid integrated into said cell, and a site-specific recombinase recognition site integrated into said cell, wherein said site-specific recombinase recognition site is juxtaposed to said heterologous nucleic acid. In some such cells, the site-specific recombinase recognition site comprises a loxP or frt sequence. In some such cells, the cell is an S. cerevisiae cell. In some such cells, the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some such cells, the heterologous nucleic acid encodes a polypeptide that degrades cellulose and/or lignin. In some such cells, the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203.
BRIEF DESCRIPTION OF THE DRAWINGS
[0148] Figure 1 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the cellobiohydrolase-II enzyme of T. reesei (TrCBH-II), plotted as a function of codon pair position.
[0149] Figures 2-6 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 2-6 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TrCBH-II, plotted as a function of codon pair position.
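The quantity plotted in these figures, a z score at each codon pair position, can be tabulated as follows. This is a hedged sketch: it assumes codon pairs overlap (codons 1-2, 2-3, and so on), which is consistent with the nucleotide ranges recited above, and that a lookup table of host-specific z scores exists. The names `codon_pairs` and `z_profile`, the default score of 0.0 for pairs missing from the table, and the single table entry are all invented for illustration.

```python
def codon_pairs(seq):
    """Return (position, pair) for each overlapping codon pair, positions 1-based."""
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [(i + 1, codons[i] + codons[i + 1]) for i in range(len(codons) - 1)]


def z_profile(seq, z_table, default=0.0):
    """Pair each codon pair position with its z score from a lookup table."""
    return [(pos, z_table.get(pair, default)) for pos, pair in codon_pairs(seq)]


# Hypothetical table entry and toy sequence, for illustration only.
toy_table = {"CACGAT": 4.2}
profile = z_profile("ATGGCTCACGAT", toy_table)
```

Plotting the second element of each tuple against the first would give a display of the kind described, with peaks marking codon pairs predicted to cause a translational pause in the chosen host.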
[0150] Figure 2A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TrCBH-II protein. Figure 2B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0151] Figure 3A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TrCBH-II protein. Figure 3B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the
TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0152] Figure 4A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TrCBH-II protein. Figure 4B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0153] Figure 5A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TrCBH-II protein. Figure 5B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0154] Figure 6A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TrCBH-II protein. Figure 6B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the TrCBH-II which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0155] Figures 7-11 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 7-11 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. sanguineus (LCC), plotted as a function of codon pair position.
[0156] Figure 7A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein. Figure 7B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0157] Figure 8A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein. Figure 8B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0158] Figure 9A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein. Figure 9B depicts a graphical
display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0159] Figure 10A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein. Figure 10B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0160] Figure 11A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein. Figure 11B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0161] Figures 12-16 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 12-16 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the lignin peroxidase enzyme of T. versicolor (LIP), plotted as a function of codon pair position.
[0162] Figure 12A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LIP protein. Figure 12B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0163] Figure 13A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LIP protein. Figure 13B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0164] Figure 14A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LIP protein. Figure 14B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0165] Figure 15A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LIP protein. Figure 15B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0166] Figure 16A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LIP protein. Figure 16B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LIP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0167] Figures 17-21 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 17-21 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the Mn-dependent peroxidase enzyme of T. versicolor (MnP), plotted as a function of codon pair position.
[0168] Figure 17A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the MnP protein. Figure 17B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0169] Figure 18A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the MnP protein. Figure 18B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0170] Figure 19A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the MnP protein. Figure 19B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0171] Figure 20A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the MnP protein. Figure 20B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the MnP
which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0172] Figure 21A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the MnP protein. Figure 21B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the MnP which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0173] Figure 22 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in N. crassa of nucleic acid sequences encoding the laccase enzyme of N. crassa (LCC), plotted as a function of codon pair position.
[0174] Figures 23-27 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 23-27 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LCC, plotted as a function of codon pair position.
[0175] Figure 23A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein. Figure 23B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0176] Figure 24A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein. Figure 24B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0177] Figure 25A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein. Figure 25B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0178] Figure 26A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein. Figure 26B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0179] Figure 27A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein. Figure 27B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0180] Figures 28-32 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 28-32 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. cinnabarinus (LCC), plotted as a function of codon pair position.
[0181] Figure 28A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein. Figure 28B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0182] Figure 29A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein. Figure 29B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0183] Figure 30A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein. Figure 30B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0184] Figure 31A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein. Figure 31B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0185] Figure 32A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein. Figure 32B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0186] Figures 33-37 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 33-37 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the laccase enzyme of P. coccineus (LCC), plotted as a function of codon pair position.
[0187] Figure 33A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LCC protein. Figure 33B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0188] Figure 34A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LCC protein. Figure 34B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0189] Figure 35A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LCC protein. Figure 35B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0190] Figure 36A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LCC protein. Figure 36B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0191] Figure 37A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LCC protein. Figure 37B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LCC which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0192] Figure 38 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the cellobiohydrolase-I enzyme of T. reesei (TrCBH-I), plotted as a function of codon pair position.
[0193] Figures 39-43 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 39-43 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TrCBH-I, plotted as a function of codon pair position.
[0194] Figure 39A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TrCBH-I protein. Figure 39B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0195] Figure 40A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TrCBH-I protein. Figure 40B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0196] Figure 41A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TrCBH-I protein. Figure 41B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0197] Figure 42A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TrCBH-I protein. Figure 42B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0198] Figure 43A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TrCBH-I protein. Figure 43B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the TrCBH-I which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0199] Figures 44-48 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 44-48 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the endoglucanase enzyme of T. aurantiacus (EGl), plotted as a function of codon pair position.
[0200] Figure 44A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the EGl protein. Figure 44B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the EGl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0201] Figure 45A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the EGl protein. Figure 45B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the EGl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0202] Figure 46A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the EGl protein. Figure 46B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the EGl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0203] Figure 47A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the EGl protein. Figure 47B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the EGl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0204] Figure 48A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the EGl protein. Figure 48B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the EGl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0205] Figures 49-53 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 49-53 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the xylanase enzyme of T. lanuginosus (XynA), plotted as a function of codon pair position.
[0206] Figure 49A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XynA protein. Figure 49B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0207] Figure 50A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XynA protein. Figure 50B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0208] Figure 51A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XynA protein. Figure 51B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0209] Figure 52A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XynA protein. Figure 52B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0210] Figure 53A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein. Figure 53B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0211] Figure 54A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein. Figure 54B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
DETAILED DESCRIPTION
[0212] Biomass is the earth's most attractive alternative fuel source and most sustainable energy resource, and is regenerated through the bioconversion of carbon dioxide. Ethanol produced from biomass is today the most widely used biofuel when blended with gasoline. As the carbon dioxide released by combustion is recycled into biomass, the use of biofuels can significantly reduce the accumulation of greenhouse gas. Ethanol production is just one example of the use of industrial enzymes in biomass harvesting. The technologies associated with biomass harvesting are similarly applicable to the production of other biofuels and fine chemicals, as well as to other diverse applications.
[0213] Cellulose is the major polysaccharide of plants, where it plays a predominantly structural role. In recent years, it has been proposed that waste cellulosic biomass could be used as a cheap and readily available sugar source to replace starchy materials in fermentation. Many researchers have previously tried to develop an efficient and inexpensive process for the production of ethanol and other biofuels from such waste by using recombinant bacteria and yeast (e.g., Saccharomyces cerevisiae) (Aristidou and Penttila (2000) Curr. Opin. Biotechnol. 11:187-198; Bothast et al. (1999) Biotechnol. Prog. 15:867-875; Ingram et al. (1998) Biotechnol. Bioeng. 58:204-214), but so far with limited success. A process of this kind can address environmental problems such as global warming and lessen dependence on fossil fuels.
[0214] A variety of highly specialized microorganisms have evolved to produce enzymes that either synergistically or in complexes can carry out the complete hydrolysis of cellulose. The anaerobic bacteria Clostridium thermocellum and Clostridium cellulovorans and the filamentous fungus Trichoderma reesei are known as cellulolytic and xylanolytic microorganisms. The bacteria C. thermocellum and C. cellulovorans produce a cellulosome complex consisting of cellulase and hemicellulase organized on the cell surface (Doi and Tamaru (2001) Chem. Rec. 1:24-32; Shoham et al. (1999) Trends Microbiol. 7:275-281). In an exemplary, well-characterized organism, T. reesei, three types of cellulolytic enzyme are extracellularly secreted, including five endoglucanases (EG [EC 3.2.1.4]) (Okada et al. (1998) Appl. Environ. Microbiol. 64:555-563), two cellobiohydrolases (CBH [EC 3.2.1.91]) (Henrissat et al. (1985) Bio/Technology 3:722-726; Teeri et al. (1987) Gene 51:43-52), and two β-glucosidases (BGL [EC 3.2.1.21]) (Chen et al. (1992) Biochim. Biophys. Acta 1121:54-60). Endoglucanases act randomly against the amorphous region of the cellulose chain to produce reducing and nonreducing ends for cellobiohydrolases, which produce cellobiose from reducing or nonreducing ends of crystalline cellulose. Exoglucanase enzymes, including CBH-I and CBH-II, liberate the disaccharide D-cellobiose from 1,4-β-glucans. Cellulose chains are thus efficiently degraded to soluble cellobiose and cellooligosaccharides by the endo-exo synergism of EG and CBH (Henrissat et al. (1985) Bio/Technology 3:722-726). In the last step of enzymatic cellulose degradation, cellooligosaccharides are hydrolyzed to glucose by β-glucosidase. In addition to endo-exo synergism, exo-exo synergism between the two cellobiohydrolases has also been reported (Teeri, T. T. (1997) Trends Biotechnol. 15:160-167).
[0215] The predominant polysaccharide in the primary cell wall of biomass is cellulose, the second most abundant is hemicellulose, and the third is pectin. The secondary cell wall, produced after the cell has stopped growing, also contains polysaccharides and is strengthened through polymeric lignin covalently cross-linked to hemicellulose. Cellulose is a homopolymer of anhydrocellobiose and thus a linear β-(1-4)-D-glucan, while hemicelluloses include a variety of compounds, such as xylans, xyloglucans, arabinoxylans, and mannans in complex branched structures with a spectrum of substituents. Although generally polymorphous, cellulose is found in plant tissue primarily as an insoluble crystalline matrix of parallel glucan chains. Hemicelluloses usually hydrogen bond to cellulose, as well as to other hemicelluloses, which helps stabilize the cell wall matrix.
[0216] DNA constructs encoding cellulase enzymes, including cellobiohydrolases, are known in the art. For example, U.S. Patent No. 5,686,593 relates to cellulose- or hemicellulose-degrading enzymes that are derivable from a fungus other than Trichoderma or Phanerochaete, and which comprise a carbohydrate binding domain homologous to a terminal A region of T. reesei cellulases.
[0217] Such cellulolytic enzymes have been expressed in bacteria (Wood et al. (1992) Appl. Environ. Microbiol. 58:2103-2110) and yeast (Cho et al. (1999) J. Microbiol. Biotechnol. 9:340-345) as a way of reducing the cost of cellulase production and other pretreatments in the process of ethanol production from cellulosic materials. Ethanologenic bacteria (Guedon et al. (2002) Appl. Environ. Microbiol. 68:53-58) and yeast (Fujita et al. (2002) Appl. Environ. Microbiol. 68:5136-5141) have been prepared that can produce ethanol from cellulosic materials, although ethanol yield is poor (Zhou and Ingram (2001) Biotechnol. Lett. 23:1455-1462).
[0218] It is known that when using other recombinant ethanologenic bacteria or yeast to ferment cellulose, addition of commercial cellulase is necessary for ethanol production. For example, when T. reesei endoglucanase II and CBH-II, and Aspergillus aculeatus β-glucosidase 1, were simultaneously co-displayed on the cell surface of a yeast strain, the yeast strain was able to directly produce ethanol from cellulose (whereas a yeast strain co-displaying only β-glucosidase 1 and endoglucanase II could not), indicating the key role of CBH-II in the industrial conversion of cellulose to ethanol (Fujita et al. (2004) Appl. Environ. Microbiol. 70:1207-1212).
[0219] Commercial bioconversion of lignocellulosic biomass to ethanol requires the efficient fermentation of sugar mixtures. Lignocellulosic biomass is composed predominantly of cellulose, hemicellulose, and lignin. Lignin is a complex, highly cross-linked polyphenolic heteropolymer, and is naturally resistant to chemical and biologic conversion. An economical biomass-to-ethanol process critically depends on the rapid and efficient conversion of all of the sugars present in both its cellulose and hemicellulose fractions.
[0220] Although cellulose and hemicellulose are readily degraded by fungal and bacterial pathways, lignin is extremely recalcitrant. Furthermore, because of its cross-linking with the other cell wall components, lignin minimizes the accessibility of cellulose and hemicellulose to microbial enzymes. Hence, lignin is generally associated with reduced digestibility of the overall plant biomass.
[0221] Because of the importance of wood and other lignocellulosics as a renewable resource for the production of paper products, feeds, chemicals, and fuels, there has been an increasing research emphasis on the fungal degradation of lignin. White-rot fungi are believed to be the most effective lignin-degrading microbes in nature. These white-rot fungi secrete one or more of three extracellular enzymes that are essential for lignin degradation, often referred to as lignin-modifying enzymes (LMEs). The three enzymes comprise two glycosylated heme-containing peroxidases, lignin peroxidase (LIP) and Mn-dependent peroxidase (MNP), and a copper-containing phenoloxidase, laccase (LCC).
[0222] Although the details of the reaction scheme of lignin biodegradation are not fully understood to date, without being bound by theory, it is suggested that these enzymes employ free radicals for depolymerization reactions.
[0223] Laccase. Laccases are copper-containing oxidase enzymes that are found in many plants, fungi and microorganisms. Laccases are enzymatically active on phenols and similar molecules and perform a one-electron oxidation. Laccases can be polymeric and the enzymatically active form can be a dimer or trimer.
[0224] Mn-dependent peroxidase. The enzymatic activity of Mn-dependent peroxidase (MnP) is dependent on Mn2+. Without being bound by theory, it has been suggested that the main role of this enzyme is to oxidize Mn2+ to Mn3+ (Glenn et al. (1986) Arch. Biochem. Biophys. 251:688-696). Subsequently, phenolic substrates are oxidized by the Mn3+ generated.
[0225] Lignin peroxidase. Lignin peroxidase (LiP) is an extracellular heme-containing enzyme that catalyses the oxidative depolymerization of dilute solutions of polymeric lignin in vitro. Some of the substrates of LiP, most notably 3,4-dimethoxybenzyl alcohol (veratryl alcohol, VA), are active redox compounds that have been shown to act as redox mediators. VA is a secondary metabolite produced at the same time as LiP by ligninolytic cultures of P. chrysosporium and, without being bound by theory, has been proposed to function as a physiological redox mediator in the LiP-catalysed oxidation of lignin in vivo (Harvey et al. (1986) FEBS Lett. 195:242-246).
[0226] Despite knowledge in the art related to expression of a foreign or synthetic gene in a host organism, many hydrolysis enzymes do not express well in host organisms such as E. coli or S. cerevisiae. Accordingly, provided herein are hydrolysis enzyme-encoding nucleotide sequences and methods of making the same for improved expression of hydrolysis enzymes.
[0227] Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation. Even when the gene is translated efficiently enough that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pauses can improve protein expression.
[0228] Methods of determining patterns of codon pair utilization are known in the art, as exemplified by U.S. Patent Number 5,082,767 (which is incorporated by reference herein in its entirety), which describes analysis of patterns of nonrandom codon pair usage. The information obtained from codon pair utilization analysis can be used to construct and express altered or synthetic genes having desired levels of translational efficiency, to introduce translational pause sites into heterologous genes, and to ascertain relationship or ancestral origin of nucleotide sequences in accordance with the methods provided herein and the knowledge in the art.
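By way of non-limiting illustration, the following Python sketch shows one simple way in which nonrandom codon pair usage could be scored. It counts adjacent codon pairs in a set of in-frame coding sequences and compares each observed count with the count expected if the two codons were paired independently. The normal-approximation z score, the function names and the input layout are assumptions made for this example only; they are not taken from U.S. Patent Number 5,082,767 and are not represented to be the statistic plotted in the figures.

```python
from collections import Counter
from math import sqrt


def codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_pair_z_scores(coding_sequences):
    """Return {(codon1, codon2): z} for every adjacent codon pair observed.

    Illustrative assumption: the expected count is E = N * f1(codon1) * f2(codon2),
    where N is the total number of pairs and f1/f2 are the frequencies of a codon
    in the first/second position of a pair. z = (O - E) / sqrt(E * (1 - E / N)).
    """
    pair_counts = Counter()
    first_counts = Counter()
    second_counts = Counter()
    for seq in coding_sequences:
        cs = codons(seq)
        for a, b in zip(cs, cs[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
            second_counts[b] += 1

    total = sum(pair_counts.values())
    scores = {}
    for (a, b), observed in pair_counts.items():
        expected = first_counts[a] * second_counts[b] / total
        variance = expected * (1 - expected / total)
        # A zero variance means the pair is fully determined by the data set.
        scores[(a, b)] = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores
```

In this sketch a positive score marks a codon pair that occurs more often than independent pairing would predict, and a negative score marks one that occurs less often. A meaningful analysis would require a large set of coding sequences from the organism of interest.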
[0229] A translational pause can serve to slow translation of the nascent amino acid chain. In some instances when such translational pauses arise in translation in native genes in the native organism, the pause(s) can serve to facilitate proper polypeptide folding, post-translational modification, re-organization/folding at protein domain boundaries, or other steps toward arriving at the native, active wild type protein. Accordingly, in some embodiments provided herein, one or more pauses that are predicted to be present in native translation of hydrolysis enzymes is/are preserved in a modified hydrolysis enzyme-encoding polynucleotide provided in accordance with the teachings herein. For example, a codon pair in the modified hydrolysis enzyme-encoding polynucleotide can be selected to have a predicted translational kinetics value that is at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 99% that of the native codon pair whose predicted pause is to be preserved; further, the codon pair in the modified hydrolysis enzyme-encoding polynucleotide can be selected to be located within 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 codons of the native codon pair whose predicted pause is to be preserved.
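By way of non-limiting illustration, the selection criteria described above (a minimum fraction of the native predicted translational kinetics value, and a maximum distance in codons from the native codon pair) can be sketched in Python as a simple filter. The function name, the tuple layout of the candidates, the default thresholds and the assumption that a larger value indicates a stronger predicted pause are all hypothetical choices made for this example.

```python
def pause_preserving_candidates(native_position, native_value, candidates,
                                min_fraction=0.75, max_distance=10):
    """Filter candidate codon pairs that would preserve a predicted native pause.

    candidates: iterable of (position, codon_pair, predicted_value) tuples, where
    position is a codon index in the modified sequence (hypothetical layout).
    Assumes native_value > 0 and that larger values mean a stronger predicted pause.
    """
    threshold = min_fraction * native_value
    kept = [
        (position, pair, value)
        for position, pair, value in candidates
        if value >= threshold and abs(position - native_position) <= max_distance
    ]
    # Prefer the candidate closest to the native position, then the strongest pause.
    return sorted(kept, key=lambda c: (abs(c[0] - native_position), -c[2]))


# Example with made-up values: native pause at codon 102 with value 6.0.
example = pause_preserving_candidates(
    102, 6.0,
    [(100, "GCTAAA", 4.0), (108, "GCCAAG", 5.5), (140, "GCAAAA", 6.0)],
)
```

With the default thresholds in this made-up example, only the candidate at codon 108 is retained: the candidate at codon 100 falls below 75% of the native value, and the candidate at codon 140 lies more than 10 codons away.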
[0230] Accordingly, as used herein, Translation Engineering™ refers to a process used to modify the translational kinetics of a polypeptide-encoding nucleic acid sequence. For example, Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic acid sequence when expressed in its native organism. In another example, Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic acid sequence when expressed in a heterologous organism. In some embodiments, this process alters the polypeptide-encoding nucleic acid sequence to optimize codon usage and codon pair usage in the organism in which the polypeptide-encoding nucleic acid sequence is expressed. For example, sequence modifications can be made to place or prevent restriction sites in the sequence, eliminate strong RNA secondary structures and avoid inadvertent Shine-Dalgarno sequences. Additionally, Translation Engineering™ involves modifying the translational kinetics of a polypeptide-encoding nucleic acid sequence by removing, preserving, and/or inserting translational pauses into the polypeptide-encoding nucleic acid sequence.
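By way of non-limiting illustration, a candidate sequence can be screened for unwanted restriction sites and inadvertent Shine-Dalgarno-like motifs with a plain substring scan, sketched below in Python. The function name and the particular motif set are assumptions for this example. GAATTC and GGATCC are the EcoRI and BamHI recognition sites, and AGGAGG is used here as a commonly cited Shine-Dalgarno consensus. Only the given strand is scanned, which suffices for palindromic sites such as the two shown.

```python
def find_motifs(sequence, motifs):
    """Return {name: [0-based start positions]} for each motif found in sequence."""
    sequence = sequence.upper()
    hits = {}
    for name, motif in motifs.items():
        motif = motif.upper()
        positions = []
        start = sequence.find(motif)
        while start != -1:
            positions.append(start)
            # Step by one base so that overlapping occurrences are also reported.
            start = sequence.find(motif, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Illustrative motif set; a real design would use the sites relevant to its cloning scheme.
UNWANTED_MOTIFS = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "Shine-Dalgarno-like": "AGGAGG",
}

hits = find_motifs("ATGGAATTCAAAGGAGGTTGAATTC", UNWANTED_MOTIFS)
# hits == {"EcoRI": [3, 19], "Shine-Dalgarno-like": [11]}
```

Any reported position can then be removed by a synonymous codon change, after which the scan is repeated, since the change may itself create a new motif.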
[0231] In accordance with the above, provided herein are hydrolysis enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same. In one embodiment, provided is a hydrolysis enzyme-encoding DNA sequence, wherein the encoded sequence has amino acid sequence identity with wild-type hydrolysis enzyme, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing input-sequence codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the resultant hydrolysis enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated hydrolysis enzyme. In some embodiments, expression of the resultant hydrolysis enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where one or more predicted pauses are preserved from the native expression profile or are added to preserve expression of active and/or soluble hydrolysis enzyme. Thus, the hydrolysis enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels; higher enzymatic activity; greater protein stability and resistance to degradation; and increased solubility.
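By way of non-limiting illustration, the replacement of codon pairs with different codon pairs encoding identical amino acids can be sketched in Python as follows. The sketch uses the standard genetic code and a greedy left-to-right pass; the scoring function `pause_score`, the threshold and the greedy strategy are assumptions made for this example and are not represented to be the method by which the sequences of the sequence listing were made. Conservative amino acid substitutions are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {
    codon: [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[codon]]
    for codon in CODON_TABLE
}


def translate(sequence):
    """Translate an uppercase, in-frame DNA sequence with the standard code."""
    return "".join(CODON_TABLE[sequence[i:i + 3]] for i in range(0, len(sequence), 3))


def remove_predicted_pauses(sequence, pause_score, threshold):
    """Greedy sketch: replace codon pairs scoring above threshold with synonymous pairs.

    sequence: uppercase DNA whose length is a multiple of three.
    pause_score(codon1, codon2): hypothetical callable returning the predicted
    pause propensity of an adjacent codon pair in the expression organism.
    """
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    for i in range(len(codons) - 1):
        if pause_score(codons[i], codons[i + 1]) <= threshold:
            continue
        best = None
        for a in SYNONYMS[codons[i]]:
            for b in SYNONYMS[codons[i + 1]]:
                trial = codons[:i] + [a, b] + codons[i + 2:]
                # Changing a pair also changes the pairs it forms with its neighbours,
                # so score the worst of the up to three affected pairs.
                lo, hi = max(i - 1, 0), min(i + 2, len(trial) - 1)
                worst = max(pause_score(trial[j], trial[j + 1]) for j in range(lo, hi))
                if best is None or worst < best[0]:
                    best = (worst, a, b)
        codons[i], codons[i + 1] = best[1], best[2]
    return "".join(codons)


# Toy scoring table with a single made-up "pausing" pair.
TOY_SCORES = {("GCT", "AAA"): 5.0}
refined = remove_predicted_pauses(
    "ATGGCTAAATAA", lambda a, b: TOY_SCORES.get((a, b), 0.0), threshold=3.0
)
```

In this toy example the pair GCT-AAA is replaced by a synonymous pair, and the encoded amino acid sequence is unchanged. Because the pass is greedy, it does not guarantee that every pair ends below the threshold when no synonymous alternative scores lower.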
[0232] As used herein the term hydrolysis enzyme refers to the enzymes encoded by the nucleotide sequences provided herein, and includes cellobiohydrolase-II, laccase, lignin peroxidase, Mn-dependent peroxidase, cellobiohydrolase-I, endoglucanase and xylanase enzymes.
[0233] Accordingly, nucleic acid sequences encoding the cellobiohydrolase-II enzyme of T. reesei (TrCBH-II) are provided. The nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 1) which encodes the TrCBH-II amino acid sequence (SEQ ID NO: 2).
[0234] Further, nucleic acid sequences encoding the laccase enzyme of P. sanguineus (LCC) are provided. The nucleotide sequences provided herein include the native sequence from P. sanguineus shown in the sequence listing (SEQ ID NO: 25) which encodes the LCC amino acid sequence (SEQ ID NO: 26).
[0235] Further, nucleic acid sequences encoding the lignin peroxidase enzyme of T. versicolor (LIP) are provided. The nucleotide sequences provided herein include the native sequence from T. versicolor shown in the sequence listing (SEQ ID NO: 49) which encodes the LIP amino acid sequence (SEQ ID NO: 50).
[0236] Further, nucleic acid sequences encoding the Mn-dependent peroxidase enzyme of T. versicolor (MnP) are provided. The nucleotide sequences provided herein include the native sequence from T. versicolor shown in the sequence listing (SEQ ID NO: 73) which encodes the MnP amino acid sequence (SEQ ID NO: 74).
[0237] Further, nucleic acid sequences encoding the laccase enzyme of N. crassa (LCC) are provided. The nucleotide sequences provided herein include the native sequence from N. crassa shown in the sequence listing (SEQ ID NO: 97) which encodes the LCC amino acid sequence (SEQ ID NO: 98).
[0238] Further, nucleic acid sequences encoding the laccase enzyme of P. cinnabarinus (LCC) are provided. The nucleotide sequences provided herein include the native sequence from P. cinnabarinus shown in the sequence listing (SEQ ID NO: 121) which encodes the LCC amino acid sequence (SEQ ID NO: 122).
[0239] Further, nucleic acid sequences encoding the laccase enzyme of P. coccineus (LCC) are provided. The nucleotide sequences provided herein include the native sequence from P. coccineus shown in the sequence listing (SEQ ID NO: 145) which encodes the LCC amino acid sequence (SEQ ID NO: 146).
[0240] Further, nucleic acid sequences encoding the cellobiohydrolase-I enzyme of T. reesei (TrCBH-I) are provided. The nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 169) which encodes the TrCBH-I amino acid sequence (SEQ ID NO: 170).
[0241] Further, nucleic acid sequences encoding the endoglucanase enzyme of T. aurantiacus (EGl) are provided. The nucleotide sequences provided herein include the native sequence from T. aurantiacus shown in the sequence listing (SEQ ID NO: 181) which encodes the EGl amino acid sequence (SEQ ID NO: 182).
[0242] Further, nucleic acid sequences encoding the xylanase enzyme of T. lanuginosus (XynA) are provided. The nucleotide sequences provided herein include the native sequence from T. lanuginosus shown in the sequence listing (SEQ ID NO: 193) which encodes the XynA amino acid sequence (SEQ ID NO: 194).
[0243] Further, provided herein are nucleic acid sequences encoding hydrolysis enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 3, 27, 51, 75, 99, 123, 147, 171, 183 and 195), E. coli (SEQ ID NOS: 9, 33, 57, 81, 105, 129, 153, 173, 185 and 197), P. pastoris (SEQ ID NOS: 15, 39, 63, 87, 111, 135, 159, 175, 187 and 199), K. lactis (SEQ ID NOS: 21, 45, 69, 93, 117, 141, 165, 177, 189 and 201). Also provided herein are sequences where additional sequence has been added to the 3' or 5' ends, or both. As will be understood by one of skill in the art, nucleotide sequences may be added 3' or 5' of any nucleic acid, for example, to facilitate hybridization of PCR primers, to add cloning restriction sites or other sites that facilitate cloning and/or expression. Accordingly, provided in the sequence listing are nucleic acid sequences with additional 5' and 3' cloning and/or PCR sequences, and which encode hydrolysis enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 5, 7, 29, 31, 53, 55, 77, 79, 101, 103, 125, 127, 149, 151), E. coli (SEQ ID NOS: 11, 13, 35, 37, 59, 61, 83, 85, 107, 109, 131, 133, 155, 157) and P. pastoris (SEQ ID NOS: 17, 19, 41, 43, 65, 67, 89, 91, 113, 115, 137, 139, 161, 163).
[0244] Further, provided in the sequence listing are hydrolysis enzyme amino acid sequences encoded by the nucleotide sequences with refined translational kinetics described herein. Thus, hydrolysis enzyme nucleic acid sequences with refined translational kinetics (SEQ ID NOS: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 and 203) respectively encode the amino acid sequences shown in the sequence listing (SEQ ID NOS: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 172, 174, 176, 178, 180, 184, 186, 188, 190, 192, 196, 198, 200, 202, 204).
[0245] Also provided herein are hydrolysis enzyme-encoding DNA sequences, wherein the encoded sequence has amino acid sequence identity with an original hydrolysis enzyme polypeptide and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. In some embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0246] As used herein, a laccase nucleotide sequence encodes a polypeptide having laccase activity. Laccase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin. A method for measuring laccase activity is exemplified by a known method in which an enzymatic reaction is carried out using 2,6-dimethoxyphenol (DMP) as a substrate and 2,2',6,6'-demethoxydiphenoquinone absorbance at 468 nm is monitored by spectrophotometry, as described in de Jong et al. ((1992) Mycol. Res. 96:1098-1104), hereby incorporated by reference in its entirety.
[0247] As used herein, a cellobiohydrolase nucleotide sequence encodes a polypeptide having cellobiohydrolase activity. Cellobiohydrolase, exoglucanase, exo-1,4-β-D-glucanase and like terms refer to enzymes that catalyze the hydrolysis of a glucoside bond in a polysaccharide or an oligosaccharide containing D-glucose subunits bonded through β-1,4 bonds, to release cellobiose, a disaccharide in which D-glucose is bonded through a β-1,4 bond. A method for measuring cellobiohydrolase activity is exemplified by a known method in which an enzymatic reaction is carried out using phosphoric acid-swollen cellulose as a substrate and the existence of cellobiose in the reaction is confirmed by thin-layer silica gel chromatography, as described in U.S. Patent No. 6,566,113, hereby incorporated by reference in its entirety.
[0248] As used herein, a lignin peroxidase nucleotide sequence encodes a polypeptide having lignin peroxidase activity. Lignin peroxidase, diarylpropane peroxidase, ligninase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin. A method for measuring lignin peroxidase activity is exemplified by a known method in which an enzymatic reaction is carried out and veratryl alcohol absorbance at 310 nm is monitored by spectrophotometry, as described by Linko and Haapala ((1993) Biotechnol. Techniques 7:75-80), hereby incorporated by reference in its entirety.
[0249] As used herein, a Mn-dependent peroxidase nucleotide sequence encodes a polypeptide having Mn-dependent peroxidase activity. Mn-dependent peroxidase and like terms refer to the enzymes involved in the oxidative depolymerization of lignin. A method for measuring Mn-dependent peroxidase activity is exemplified by a known method in which an enzymatic reaction is carried out and production of oxidized 3-methyl-2-benzothiazolinone hydrazone hydrochloride (MBTH) plus 3-dimethylaminobenzoic acid (DMAB) absorbance at 590 nm is monitored by spectrophotometry, as described in Daniel et al. ((1994) Appl. Environ. Microbiol. 60:2524-2532), hereby incorporated by reference in its entirety.
[0250] As used herein, an endoglucanase nucleotide sequence encodes an endo-1,4-β-glucanase polypeptide having endo-1,4-β-glucanase activity. Endoglucanase and like terms refer to the enzymes involved in the enzymatic hydrolysis of a glucoside bond in a polysaccharide or an oligosaccharide containing D-glucose subunits bonded through β-1,4 bonds, to release cellobiose, a disaccharide in which D-glucose is bonded through a β-1,4 bond. Endoglucanases act randomly against the amorphous region of the cellulose chain to produce reducing and nonreducing ends for cellobiohydrolases, which produce cellobiose from reducing or nonreducing ends of crystalline cellulose.
[0251] As used herein, a xylanase nucleotide sequence encodes a xylanase polypeptide having xylanase activity. Xylanase and like terms refer to a class of enzymes which degrade the linear polysaccharide beta-1,4-xylan into xylose, thus breaking down hemicellulose, which is a major component of the cell wall of plants.
[0252] The polynucleotides provided herein encode polypeptides that have hydrolysis activity. Thus, a hydrolysis enzyme-encoding polynucleotide comprising any of the DNA sequences provided herein can be transcribed and the resulting RNA translated to produce a polypeptide with hydrolysis enzyme activity.
[0253] As used herein, the term "nucleotide sequence" is used to refer to any polynucleotide sequence. The term "DNA sequence" is used herein to refer to the nucleotide sequences presented herein. As will be understood by one of skill in the art, RNA equivalent nucleotide sequences are also described by the DNA sequences presented herein. As is well known in the art, an equivalent RNA sequence can be obtained from a DNA sequence by a T to U substitution (i.e., replacing thymine in the DNA sequence with uracil in the RNA sequence).
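By way of illustration only, the T to U substitution described above can be sketched in Python. The function name dna_to_rna is hypothetical and is not part of the disclosure; the sketch assumes the input is a coding-strand DNA sequence written with the letters A, C, G and T.

```python
def dna_to_rna(dna: str) -> str:
    """Return the RNA equivalent of a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(dna_to_rna("ATGGCTTGA"))  # AUGGCUUGA
```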
[0254] In some embodiments, the hydrolysis enzyme-encoding DNA sequence is adapted for expression in a heterologous host organism. As used herein, a DNA sequence that has been adapted for expression is a DNA sequence that has been inserted into an expression vector or otherwise modified to contain regulatory elements necessary for expression of the DNA in the host cell, positioned in such a manner as to permit expression of the DNA in the host cell. Such regulatory elements required for expression include promoter sequences, transcription initiation sequences and, optionally, enhancer sequences. For example, a DNA sequence may be inserted into a plasmid vector adapted for expression in a bacterial cell, such as E. coli, or a eukaryotic cell, such as S. cerevisiae or other yeast, or any other host organism.
[0255] A heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism. In certain aspects, the host organism is not human, E. coli or S. cerevisiae.
[0256] In some embodiments, polynucleotides provided herein also encode polypeptides that have other lignin-metabolizing activities, such as lignin peroxidase activity and Mn-dependent peroxidase activity.
Changes to translational kinetics
[0257] The methods and sequences provided herein permit modification of the translational kinetics of an mRNA into a hydrolysis enzyme-encoding polypeptide. Translational kinetics of an mRNA into polypeptide can be changed in order to achieve any of a variety of expression profiles. For example, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all translational pauses predicted to occur within an autonomous folding unit of a nascent protein. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all over-represented codon pairs.
[0258] It is proposed herein that the presence of a pause or translational slowing codon pair can queue ribosomes back to the beginning of the coding sequence, thereby inhibiting further ribosome attachment to the message, which can result in down-regulation of protein expression levels as the rate of translation initiation readily saturates and the slowest translation step time becomes rate limiting. It is also proposed herein that the presence of a pause or translational slowing codon pair can stall or detach a ribosome. It is also proposed herein that the presence of a pause or translational slowing codon pair can expose naked mRNA, which is then subject to message degradation. It is also proposed herein that the presence of a pause or translational slowing codon pair can decouple translation from transcription, leading to protein expression failure. For these reasons and more, methods for analyzing, designing and producing gene sequences and polynucleotides to remove or decrease in number, or selectively preserve or insert, pauses, or to replace or modify translational slowing codon pairs, have great utility.
[0259] Organism-specific codon usage and codon pair usage, and the presence of organism-specific pause sites, result in gene translation that is highly adapted to the original host organism. For example, ribosomal pausing sites that may be functional in a human cell will typically be scrambled, random, or not appropriate or not recognized in the proper context in a bacterium or other non-native host. A heterologous cDNA or synthetic polynucleotide has a random but high probability of inadvertently encoding a pause site somewhere, often leading to protein expression and/or activity failure.
[0260] Differences between codon pair (pause signal) coding among bacteria or among vertebrates are sufficient to make cross-family gene expression unpredictable. For example, in various organisms such as bacteria, a significant pause or translational slowing can result in premature transcription termination and/or messenger degradation. Even in eukaryotes there is a coupling between export of mRNA from the nucleus and translation; thus a different, but still effective system of clearing untranslated mRNA exists in eukaryotes.
[0261] Methods for refining translational kinetics of an mRNA into polypeptide can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2008/0046192, published on February 21, 2008, which is incorporated by reference herein in its entirety. For example, a polypeptide-encoding nucleotide can be designed to be predicted to be translated rapidly along its entire length. Thus, some polypeptide-encoding nucleotides provided herein are those that have been engineered to remove all predicted pauses. Expression of such a polypeptide-encoding nucleotide can result in improved protein expression levels and improved levels of active and/or natively folded polypeptide expression.
[0262] Further methods of refining translational kinetic values are contemplated herein, as can be seen in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, each of which is incorporated by reference herein in its entirety.
[0263] As provided herein, a test of translation pausing or slowing as a result of codon pair usage can be performed by comparing a series of genes that have random pauses with modified genes where codon pairs predicted to cause translational pauses are replaced. Unmodified genes moved from their source organism and expressed in a heterologous host can have an altered set of codon pairs predicted to cause a translational pause or ribosomal slowing (e.g., an altered set of over-represented codon pairs), resulting in altered configuration and location of presumed pause sites. Creation of synthetic codon-pair-optimized genes can have a dramatic effect on expression: expression of difficult-to-express genes can be seen for the first time, or improved at least 2-fold, 3-fold, 4-fold, 5-fold, 6-fold, 7-fold, 8-fold, 9-fold, 10-fold, 12-fold, 15-fold, 20-fold, 25-fold, 30-fold, or more, relative to unmodified polypeptide-encoding nucleic acid sequences.
[0264] In some embodiments, translational kinetics of an mRNA into hydrolysis enzyme-encoding polypeptide can be changed in order to remove some or all translational pauses or replace other codon pairs that cause translational slowing, message instability and degradation, and poor protein translation, expression, and functional properties. While not intending to be limited to the following, it is believed that, for at least some proteins, reduction or elimination of translational pauses can serve to increase the expression level and/or quality and characteristics of the protein. Accordingly, by removing some or all translational pauses or replacing other codon pairs that cause translational slowing, the expression levels and/or quality of an expressed protein can be increased.
[0265] For example, the hydrolysis enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels, higher enzymatic activity, greater protein stability, resistance to degradation, and increased solubility compared to the original native gene when expressed in a heterologous host.
[0266] Thus, also provided herein are hydrolysis enzyme-encoding nucleotide sequences that have been modified to have one or more translational pauses or slowing sites removed by modifying one or more codon pairs to a corresponding codon pair that is less likely to cause a translational pause or slowing. While in some embodiments it is preferred to replace all codon pairs predicted to cause a translational pause or slowing, in other embodiments, it is sufficient to replace a subset of codon pairs predicted to cause a translational pause or slowing. For example, expression levels can be increased by replacing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more codon pairs predicted to cause a translational pause or slowing. In another example, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs predicted to cause a translational pause or slowing are replaced by, for example, substituting different codon pairs that encode the same amino acids.
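As a non-limiting sketch of substituting a different codon pair that encodes the same amino acids, the following Python example enumerates the synonymous codon pairs for a given pair and selects the one with the lowest pause score. The standard genetic code is used; the names synonymous_pairs and best_replacement and the values in example_scores are hypothetical and are shown only for illustration. Pairs absent from the score table are assumed to have a neutral score of 0.0.

```python
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_pairs(pair):
    """Return every codon pair (6 nt) encoding the same two amino acids as `pair`."""
    aa1, aa2 = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    firsts = [c for c, a in CODON_TABLE.items() if a == aa1]
    seconds = [c for c, a in CODON_TABLE.items() if a == aa2]
    return [f + s for f in firsts for s in seconds]


def best_replacement(pair, score):
    """Pick the synonymous pair with the lowest (least pause-prone) score."""
    return min(synonymous_pairs(pair), key=lambda p: score.get(p, 0.0))


if __name__ == "__main__":
    # Hypothetical scores: higher means more likely to pause.
    example_scores = {"GCTGAA": 3.2, "GCCGAA": 0.4, "GCAGAA": -1.1}
    print(best_replacement("GCTGAA", example_scores))  # GCAGAA
```

Because each codon belongs to two overlapping codon pairs, a practical design would also re-score the neighbouring pairs after each replacement; that step is omitted from this sketch.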
[0267] In some embodiments, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses predicted to occur within an autonomous folding unit of a protein. As used herein, an autonomous folding unit of a protein refers to an element of the overall protein structure that is self-stabilizing and often folds independently of the rest of the protein chain. Such autonomous folding units typically correspond to a protein domain. As provided herein, expression of a gene in a heterologous host organism can result in translational pauses located in regions that inhibit protein expression and/or protein folding. Since the presence of codon pairs predicted to cause a translational pause or slowing in protein-encoding regions separating regions encoding different autonomous folding units of the protein can serve to pause or slow translation, it is also contemplated that removal of translational pauses predicted to occur within an autonomous folding unit of a protein, particularly for heterologously-expressed proteins, can result in improved expression levels and/or folding of expressed proteins. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by removing some or all translational pauses predicted to occur within an autonomous folding unit of a protein, thereby increasing expression levels and/or improving the folding of the expressed protein.
[0268] It is further contemplated that preserving or inserting a translational pause in a region predicted to separate autonomous folding units of a protein, particularly for heterologously-expressed proteins, can result in improved folding and/or solubility of expressed proteins. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by preserving, relative to native, or inserting one or more translational pauses in one or more regions predicted to separate autonomous folding units of a protein, thereby improving the folding and/or solubility of the expressed protein.
[0269] In the methods provided herein that include changing translational kinetics of an mRNA into polypeptide by modifying codon pairs with regard to their location within or outside of autonomous folding units of proteins, one step can include identifying predicted autonomous folding units of a protein. Methods for identifying predicted autonomous folding units of a protein or protein domains are known in the art, and include alignment of amino acid sequences with protein sequences having known structures, and threading amino acid sequences against template protein domain databases. Such methods can employ any of a variety of software algorithms in searching any of a variety of databases known in the art for predicting the location of protein domains. The results of such methods will typically include an identification of the amino acids predicted to be present in a particular domain, and also can include an identification of the domain itself, and an identification of the secondary structural element, if any, in which each amino acid sequence of a domain is located.
[0270] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to remove a translational pause not present in the expression profile of the polypeptide in the native host organism. For example, there may be no codon pairs that are not predicted to cause a translational pause or slowing and that encode a corresponding pair of amino acids. In such instances, several options are available: the codon pair that is least likely to cause a translational pause or slowing can be selected; an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made. One option in a computational method is to request human input in order to resolve the issue. The computational method may, for example, involve the use of a computer that is programmed to request human input. Alternatively, the computer may be programmed to make a selection, or combination of selections, such that multiple genes, or Ordered Gene Sets or small permutation libraries are designed and synthetically produced for use in expression analysis. In methods in which an amino acid insertion, deletion or mutation is made in order to change translational kinetics, it is preferable to select a change that is predicted not to substantially influence the final three-dimensional structure of the protein and/or the activity of the protein. Such an amino acid insertion, deletion or mutation can include, for example, a conservative amino acid substitution such as the conservative substitutions shown in Table 1. The substitutions shown are based on amino acid physical-chemical properties, and as such, are independent of organism. In some embodiments, the conservative amino acid substitution is a substitution listed under the heading of exemplary substitutions.
Table 1
Original Residue    Conservative Substitutions    Exemplary Substitutions
Ala (A)             val; leu; ile                 val
Arg (R)             lys; gln; asn                 lys
Asn (N)             gln; his; lys; arg            gln
Asp (D)             glu                           glu
Cys (C)             ser                           ser
Gln (Q)             asn                           asn
Glu (E)             asp                           asp
Gly (G)             pro; ala                      ala
His (H)             asn; gln; lys; arg            arg
Ile (I)             leu; val; met; ala; phe       leu
Leu (L)             ile; val; met; ala; phe       ile
Lys (K)             arg; gln; asn                 arg
Met (M)             leu; phe; ile                 leu
Phe (F)             leu; val; ile; ala; tyr       leu
Pro (P)             ala                           ala
Ser (S)             thr                           thr
Thr (T)             ser                           ser
Trp (W)             tyr; phe                      tyr
Tyr (Y)             trp; phe; thr; ser            phe
Val (V)             ile; leu; met; phe; ala       leu
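For use in a computational method, the exemplary substitutions of Table 1 can be held in a simple lookup, sketched below in Python using one-letter amino acid codes. The names EXEMPLARY_SUBSTITUTION and exemplary_substitution are hypothetical; the mapping itself is transcribed from the exemplary substitutions column of Table 1.

```python
# Exemplary conservative substitutions of Table 1, as one-letter codes.
EXEMPLARY_SUBSTITUTION = {
    "A": "V", "R": "K", "N": "Q", "D": "E", "C": "S",
    "Q": "N", "E": "D", "G": "A", "H": "R", "I": "L",
    "L": "I", "K": "R", "M": "L", "F": "L", "P": "A",
    "S": "T", "T": "S", "W": "Y", "Y": "F", "V": "L",
}


def exemplary_substitution(residue: str) -> str:
    """Return the exemplary conservative substitution for a residue."""
    return EXEMPLARY_SUBSTITUTION[residue.upper()]


if __name__ == "__main__":
    print(exemplary_substitution("W"))  # Y
```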
[0271] While in some embodiments, all codon pairs predicted to cause a translational pause or slowing are treated equally, in other embodiments, one or more different threshold levels can be established for differential treatment of codon pairs, where codon pairs above a highest threshold are the codon pairs most likely to cause a translational pause or slowing, and successively lower codon pair threshold-based groups correspond to successively lower likelihoods of the respective codon pairs causing a translational pause or slowing. Based on the codon pair groupings, different numbers or percentages of codon pairs can be replaced for each of these different threshold-based groups. For example, 95% or more codon pairs above a highest threshold level can be replaced, while 90% or less of all codon pairs between that level and an intermediate threshold level are replaced. As contemplated herein, codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, three or more different threshold-based groups, four or more different threshold-based groups, five or more different threshold-based groups, six or more different threshold-based groups, or more. Discussion of specific thresholds is provided elsewhere herein; however, typically the higher the threshold, the higher the likelihood of a translational pause or slowing caused by a codon pair with a translational kinetics value greater than the threshold. In embodiments in which codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, different numbers or percentages of codon pairs can be replaced for each codon pair group. For example, in one embodiment, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs above a highest threshold are replaced, while the same or a lower percentage of codon pairs are replaced from codon pair groups corresponding to one or more lower thresholds. Typically, for each successively lower threshold group, the same or a lower percentage of codon pairs are replaced. In one example, all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair is located within an autonomous folding unit.
In another example, all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair can be replaced without requiring a change in the encoded polypeptide sequence. In another example, all codon pairs above a highest threshold are replaced, while a codon pair above a first higher intermediate threshold is replaced only if the codon pair can be replaced without changing the encoded polypeptide sequence or with only a conservative change to the encoded polypeptide sequence, while a codon pair above a second lower intermediate threshold is replaced only if the codon pair can be replaced without requiring any change in the encoded polypeptide sequence. While the above discussion has been applied to the use of a plurality of threshold levels, it will be readily apparent to one skilled in the art that, in the place of using threshold levels, an evaluation method can be used that determines the degree to which a codon pair should be replaced according to the translational kinetics value of the codon pair, where the degree to which the codon pair should be replaced can be counterbalanced by any of a variety of user-determined factors such as, for example, presence of the codon pair within or between autonomous folding units, and degree of change to the encoded polypeptide sequence.
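One of the threshold-based examples above (replace every codon pair above a highest threshold, and replace a codon pair above an intermediate threshold only if it lies within an autonomous folding unit) can be sketched in Python as follows. The class CodonPairSite, the function should_replace and the default threshold values of 3.0 and 1.5 standard deviations are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CodonPairSite:
    position: int          # index of the first codon of the pair
    value: float           # translational kinetics value, in SD above the mean
    in_folding_unit: bool  # True if the pair lies within an autonomous folding unit


def should_replace(site, high=3.0, intermediate=1.5):
    """Apply a two-threshold replacement rule to one codon pair site."""
    if site.value > high:
        return True                   # highest group: always replace
    if site.value > intermediate:
        return site.in_folding_unit   # intermediate group: only inside a folding unit
    return False                      # below both thresholds: leave unchanged


if __name__ == "__main__":
    sites = [
        CodonPairSite(10, 3.5, False),
        CodonPairSite(20, 2.0, True),
        CodonPairSite(30, 2.0, False),
    ]
    print([s.position for s in sites if should_replace(s)])  # [10, 20]
```

Additional threshold groups, or conditions on whether the encoded polypeptide sequence would change, could be added as further branches in the same manner.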
[0272] In accordance with the methods and sequences provided herein, a translational kinetics value of a codon pair is a representation of the degree to which it is expected that a codon pair is associated with a translational pause. Methods of determining the translational kinetics value of a codon pair are discussed elsewhere herein. Such translational kinetics values can be normalized to facilitate comparison of translational kinetics values between species. In some embodiments, the translational value can be the degree of over-representation of a codon pair. An over-represented codon pair is a codon pair which is present in a protein-encoding sequence in higher abundance than would be expected if all codon pairs were statistically randomly abundant. When translational kinetics values of codon pairs are determined, a codon pair predicted to cause a translational pause or slowing is a codon pair whose likelihood of causing a translational pause or slowing is at least one standard deviation above the mean translational kinetics value, where a particular translational kinetics value above the mean translational kinetics value in this context refers to a translational kinetics value indicative of a greater likelihood of causing translational pausing or slowing, relative to a mean translational kinetics value, and is not strictly limited to a particular mathematical relationship (e.g., greater than the mean) since the depiction of propensity to cause a translational pause by a translational kinetics value can be selected to be negative or positive, based on the selected implementation by one skilled in the art. For example, over-represented codon pairs may be graphically displayed as a positive function in a SpeedPlot™, as depicted in Figure 1, where a positive deflection or peak above a selected threshold describes a translational pause or slowing at the exact nucleotide location as defined by the abscissa. 
In the methods provided herein, a threshold for the translational kinetics value of codon pairs that are predicted to cause a translational pause or slowing can be set in accordance with the method and level of stringency desired by one skilled in the art. For example, when it is desired to identify only a small number of the codon pairs most likely to cause a translational pause or slowing, a threshold value can be set to 5, or 3, or 2, or 1.5 standard deviations or more above the mean. Typical threshold values can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean. As provided herein, a plurality of thresholds can be applied in the herein-provided methods in segregating codon pairs into a plurality of groups. Each threshold of such a plurality can be a different value selected from 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
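A minimal sketch of applying a standard-deviation threshold to a set of translational kinetics values is given below in Python. It assumes, for illustration only, that a higher value denotes a greater likelihood of pausing and that the population standard deviation is used; the function name predicted_pause_pairs and the example values are hypothetical.

```python
from statistics import mean, pstdev


def predicted_pause_pairs(values, threshold_sd=1.0):
    """Return {codon pair: z-score} for pairs at least `threshold_sd`
    standard deviations above the mean translational kinetics value."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return {}  # all values identical: nothing stands out
    z_scores = {pair: (v - mu) / sigma for pair, v in values.items()}
    return {pair: z for pair, z in z_scores.items() if z >= threshold_sd}


if __name__ == "__main__":
    # Hypothetical raw values for four codon pairs.
    example = {"AAAAAA": 0.0, "CCCCCC": 0.0, "GGGGGG": 0.0, "TTTTTT": 4.0}
    print(sorted(predicted_pause_pairs(example, threshold_sd=1.0)))  # ['TTTTTT']
    print(predicted_pause_pairs(example, threshold_sd=2.0))          # {}
```

Calling the function with several values of threshold_sd segregates the codon pairs into the plurality of threshold-based groups discussed above.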
[0273] In some embodiments, translational kinetics of an mRNA into polypeptide can be changed to add or retain one or more translational pauses predicted to occur before, after or within an autonomous folding unit of a protein, or between autonomous folding units. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure in the domain prior to further downstream translation and reorganization or reconfiguration of the growing polypeptide or domain. By modifying the translational kinetics of complex multi-domain proteins it may be possible to experimentally alter the time each domain has available to organize. Folding of a heterologously-expressed gene having two or more independent domains can be altered by the presence of pause sites between the domains. Refolding studies indicate that the time it takes for a protein to settle into its final configuration may take longer than the translation of the protein. Pausing may allow each domain to partially organize and commit to a particular, independent fold. Other co-translational events, such as those associated with co-factors, protein subunits, protein complexes, membranes, chaperones, secretion, or proteolysis complexes, also can depend on the kinetics of the emerging nascent polypeptide. Pauses can be introduced by engineering one codon pair predicted to cause a translational pause or slowing, or two or more such codon pairs into the sequence to facilitate these co-translational interactions.
[0274] As such, provided herein is the recognition that the presence of codon pairs predicted to cause a translational pause or slowing in protein-encoding regions separating regions encoding different autonomous folding units of the protein can serve to pause translation and facilitate folding of the nascent translated protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by including or preserving one or more translational pauses predicted to occur before, after, or between autonomous folding units of a protein, thereby increasing the likelihood that the translated protein will be properly folded. In such embodiments, typically a translational pause is preserved, which refers to maintaining the same codon pair for a polypeptide-encoding nucleotide sequence that is expressed in the native host organism, or, when the polypeptide-encoding nucleotide sequence is heterologously expressed, changing the codon pair as appropriate to have a translational kinetics value comparable to or closest to the translational kinetics value of the native codon pair in the native host organism.
[0275] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to preserve or insert a translational pause without causing a change to the encoded amino acid sequence. For example, there may be no codon pairs that are predicted to cause a translational pause or slowing and that encode the same pair of amino acids as encoded in the original sequence. In such instances, several options are available. First, proximal codon pairs can be selected to be replaced in order to introduce a translational pause or slowing. For example, one of the 1, 2, 3, 4 or 5 most proximal codon pairs upstream (5' of the desired pause site) or one of the 1, 2, 3, 4 or 5 most proximal codon pairs downstream (3' of the desired pause site) can be chosen for replacement to introduce the translational pause or slowing. Typically in such instances, the selected codon pair for replacement to introduce the translational pause or slowing is the codon pair closest to the originally desired codon pair location of the translational pause or slowing, provided the desired translational pause or slowing can be attained (e.g., 1 codon pair upstream or downstream is typically selected instead of 2 codon pairs upstream or downstream, provided the desired translational pause or slowing can be attained). Alternatively, a translational pause or slowing can be introduced by selecting a replacement codon pair encoding a conservative amino acid substitution, such as the conservative substitutions shown in Table 1. In some embodiments, replacement of a proximal codon pair to introduce a translational pause or slowing is preferred over replacement of a codon pair resulting in a change in the encoded amino acid sequence.
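The proximal codon pair selection described above can be sketched in Python as a search outward from the desired pause site. The sketch assumes a hypothetical score table in which a higher value denotes a greater likelihood of pausing, treats unlisted pairs as scoring 0.0, and, as an arbitrary choice, tries the upstream position before the downstream position at each distance. The names insert_pause and synonymous_pairs are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_pairs(pair):
    """Return every codon pair encoding the same two amino acids as `pair`."""
    aa1, aa2 = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    firsts = [c for c, a in CODON_TABLE.items() if a == aa1]
    seconds = [c for c, a in CODON_TABLE.items() if a == aa2]
    return [f + s for f in firsts for s in seconds]


def insert_pause(codons, site, score, threshold, max_offset=5):
    """Find the codon pair nearest `site` that can be swapped for a synonymous
    pair scoring at least `threshold`. Returns (pair_index, new_pair) or None."""
    for offset in range(max_offset + 1):
        candidates = [site] if offset == 0 else [site - offset, site + offset]
        for idx in candidates:
            if 0 <= idx < len(codons) - 1:
                pair = codons[idx] + codons[idx + 1]
                options = [p for p in synonymous_pairs(pair)
                           if score.get(p, 0.0) >= threshold]
                if options:
                    return idx, max(options, key=lambda p: score.get(p, 0.0))
    return None


if __name__ == "__main__":
    codons = ["ATG", "GCT", "GAA", "TGG", "AAA"]  # Met-Ala-Glu-Trp-Lys
    scores = {"GCCGAG": 2.5}                      # hypothetical pausing Ala-Glu pair
    print(insert_pause(codons, 2, scores, threshold=2.0))  # (1, 'GCCGAG')
```

In this example no pausing pair encodes Glu-Trp at the desired site, so the pair one position upstream is replaced instead, leaving the encoded amino acid sequence unchanged.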
[0276] Further methods of modifying polypeptide encoding nucleotide sequences are contemplated herein, as can be seen in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, each of which is incorporated by reference herein in its entirety.
[0277] Further, provided herein is the recognition that predicted pause sites may be conserved across different proteins in the same species, or in related proteins across two or more species. In some embodiments, graphical displays of translational kinetics values of one or more proteins can be used to provide information to assist in the selection of a translational pause or slowing to preserve or insert in a redesigned polypeptide-encoding nucleotide sequence. In particular, graphical displays of translational kinetics values can permit, for example, alignment of homologous proteins from different species and an identification, based on this alignment, of predicted translational pause or slowing sites that are conserved in the aligned proteins. Such predicted translational pause or slowing sites can be preserved or inserted in a redesigned polypeptide-encoding nucleotide sequence. In another example, regions between autonomous folding units in one or more proteins within a particular species can be graphically examined for the presence or absence of predicted pause sites. Such graphical display methods can result in an identification of a region between autonomous folding units in which a translational pause or slowing is desirably preserved in a redesigned polypeptide-encoding sequence.
[0278] Methods for identifying and selecting conserved translational pauses can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007. For example, the codon pair translation kinetics values can be compared with a database of related gene sequences and conserved pause sites can be identified. Additionally, a synthetic gene can be designed wherein at least one conserved pause site is maintained to provide a synthetic gene with modified translation kinetics.
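As a simplified illustration of identifying conserved pause sites, the following Python sketch takes per-position translational kinetics values for two or more homologous sequences and reports the positions at which every sequence exceeds a threshold. It assumes, for illustration only, that the value profiles have already been aligned without gaps so that equal indices correspond to equivalent codon pair positions; the function name conserved_pause_sites and the example values are hypothetical.

```python
def conserved_pause_sites(profiles, threshold):
    """Return aligned positions where every profile exceeds `threshold`."""
    length = min(len(p) for p in profiles)
    return [i for i in range(length)
            if all(p[i] > threshold for p in profiles)]


if __name__ == "__main__":
    # Hypothetical aligned profiles for two homologs (SD above the mean).
    homolog_a = [0.1, 2.4, 0.3, 1.9]
    homolog_b = [0.0, 2.1, 1.8, 2.2]
    print(conserved_pause_sites([homolog_a, homolog_b], threshold=1.5))  # [1, 3]
```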
Redesign of polypeptide-encoding nucleotide sequence
[0279] As provided herein, codon pairs are associated with translational pauses, and can thereby influence translational kinetics of an mRNA into polypeptide. Thus, the methods of changing translational kinetics provided herein will typically be performed by modifying or designing one or more nucleotide sequences encoding a polypeptide to be expressed. Accordingly, provided herein are methods of modifying a gene or designing a synthetic nucleotide sequence encoding the polypeptide encoded by the gene, collectively referred to herein as redesigning a polypeptide-encoding gene sequence or redesigning a polypeptide-encoding nucleotide sequence. Also included in the various embodiments provided herein are redesigned gene sequences encoding polypeptides that are not identical to the original gene.
[0280] In some embodiments, provided is a hydrolysis enzyme-encoding DNA sequence, wherein the encoded sequence has at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type hydrolysis polypeptide sequence as set forth in SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194.
[0281] In certain embodiments, at least 1, 2 or 3 codon pairs of a polynucleotide sequence encoding the hydrolysis enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In certain aspects, the DNA sequence is optimized for expression in S. cerevisiae, E. coli, P. pastoris, K. lactis or Z. mobilis.
[0282] In some embodiments, provided is a hydrolysis enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode a functional domain of the hydrolysis enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for functional domains are known in the art.
[0283] Typically in such embodiments, the replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. That is, the embodiments in which one or more codon pairs encoding amino acids of a functional domain of one of the encoded polypeptides provided herein have been replaced include embodiments in which the nucleotide sequence encoding the functional domain is changed to increase the predicted rate of translation of the functional domain. As provided herein, incomplete translation, improper folding, or other protein expression shortcomings can result from the presence of one or more translational pauses in a heterologously-expressed polypeptide. In some embodiments, removal of one or more of these pauses can increase the speed of translation of the functional domain, and thereby increase the quantity of protein produced and/or increase the amount of stable, properly folded, active, and/or soluble protein produced.
[0284] In such embodiments, the replacement codons, i.e., the codons added as replacements for the wild type codons, are typically predicted to be less likely to cause a translational pause. For example, the replacement codon pair can have a translational kinetics value in the heterologous host organism that is 95%, 90%, 85%, 80%, 75%, 70%, or less, of the translational kinetics value of the wild type codon pair when expressed in the heterologous host organism. In some embodiments, the replacement codon pair is selected to have a translational kinetics value similar to the translational kinetics value of the wild type codon pair in the native organism. For example, the z score of at least one replacement codon pair when expressed in the heterologous host organism can be no more than 250%, 200%, 150%, 125% or 100% of the z score for the wild type codon pair when expressed in the native organism.
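The two percentage criteria described above can be expressed as simple comparisons. The following Python sketch is illustrative only: the function names and default values are hypothetical, and it assumes that the wild type translational kinetics value (or z score) is positive, i.e., that the wild type codon pair is predicted to cause a pause, since a percentage of a non-positive value is not meaningful here.

```python
def is_less_pausing(wt_value_in_host: float,
                    replacement_value_in_host: float,
                    max_fraction: float = 0.95) -> bool:
    """True if the replacement pair's value in the host is at most
    max_fraction (e.g. 0.95, 0.90, ... 0.70) of the wild type pair's
    value in the same host. Assumes a positive wild type value."""
    if wt_value_in_host <= 0:
        raise ValueError("sketch assumes a positive wild type value")
    return replacement_value_in_host <= max_fraction * wt_value_in_host


def approximates_native_kinetics(wt_z_in_native: float,
                                 replacement_z_in_host: float,
                                 max_percent: float = 250.0) -> bool:
    """True if the replacement pair's z score in the host is no more than
    max_percent (e.g. 250, 200, 150, 125 or 100) of the wild type pair's
    z score in the native organism. Assumes a positive native z score."""
    if wt_z_in_native <= 0:
        raise ValueError("sketch assumes a positive native z score")
    return replacement_z_in_host <= (max_percent / 100.0) * wt_z_in_native
```

For instance, under these assumptions a wild type value of 4.0 in the host would admit a replacement value of 3.0 at the 95% level but not at the 70% level.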
[0285] In some embodiments, provided is a hydrolysis enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between domains of the hydrolysis enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the domains are known in the art and are described in detail below.
[0286] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the cellulose binding domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for cellulose binding domains are known in the art. In the case of the cellobiohydrolase of SEQ ID NO: 2, the cellulose binding domain includes at least amino acids 35-58, 30-61 or 27-62.
[0287] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art. In the case of the cellobiohydrolase of SEQ ID NO: 2, the glycosyl hydrolase domain includes at least amino acids 124-437, 115-450 or 107-471.
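To relate a domain expressed as an amino acid range to the codon pairs and nucleotides that encode it, the arithmetic can be sketched as follows. This is an illustrative Python sketch under a stated assumption: amino acid 1 is encoded by nucleotides 1-3 of the coding sequence, with no offset for untranslated regions or other numbering conventions, which may not hold for every sequence listing. The function names are hypothetical.

```python
def codon_pairs_in_range(first_aa: int, last_aa: int) -> range:
    """1-based indices of codon pairs lying wholly inside an amino acid
    range, where codon pair k consists of codons k and k + 1."""
    if first_aa < 1 or last_aa < first_aa:
        raise ValueError("expected 1 <= first_aa <= last_aa")
    return range(first_aa, last_aa)


def nucleotide_span(first_aa: int, last_aa: int) -> tuple:
    """1-based inclusive nucleotide positions encoding the amino acid
    range, assuming amino acid 1 is encoded by nucleotides 1-3."""
    if first_aa < 1 or last_aa < first_aa:
        raise ValueError("expected 1 <= first_aa <= last_aa")
    return (3 * (first_aa - 1) + 1, 3 * last_aa)


# Ranges recited above for SEQ ID NO: 2.
cbd_pairs = codon_pairs_in_range(35, 58)
gh_span = nucleotide_span(124, 437)
```

Under this assumption, the 35-58 range contains 23 codon pairs, and the 124-437 range corresponds to nucleotides 370 through 1311 of the coding sequence.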
[0288] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art. In the case of the laccase of SEQ ID NO: 26, the Cu-oxidase-3 domain includes at least amino acids 29-151 or 28-152.
[0289] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art. In the case of the laccase of SEQ ID NO: 26, the Cu-oxidase domain includes at least amino acids 162-304 or 161-305.
[0290] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art. In the case of the laccase of SEQ ID NO: 26, the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
[0291] In some embodiments, provided is a lignin peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the haem peroxidase domain of the lignin peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for haem peroxidase domains are known in the art. In the case of the lignin peroxidase of SEQ ID NO: 50, the haem peroxidase domain includes at least amino acids 47-286 or 46-287.
[0292] In some embodiments, provided is a Mn-dependent peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the haem peroxidase domain of the Mn-dependent peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for haem peroxidase domains are known in the art. In the case of the Mn-dependent peroxidase of SEQ ID NO: 74, the haem peroxidase domain includes at least amino acids 46-283 or 45-284.
[0293] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art. In the case of the laccase of SEQ ID NO: 98, the Cu-oxidase-3 domain includes at least amino acids 91-211 or 90-212.
[0294] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art. In the case of the laccase of SEQ ID NO: 98, the Cu-oxidase domain includes at least amino acids 217-366 or 216-367.
[0295] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art. In the case of the laccase of SEQ ID NO: 98, the Cu-oxidase-2 domain includes at least amino acids 427-569 or 426-570.
[0296] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art. In the case of the laccase of SEQ ID NO: 122, the Cu-oxidase-3 domain includes at least amino acids 30-152 or 29-153.
[0297] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art. In the case of the laccase of SEQ ID NO: 122, the Cu-oxidase domain includes at least amino acids 163-305 or 162-306.
[0298] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art. In the case of the laccase of SEQ ID NO: 122, the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
[0299] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-3 domains are known in the art. In the case of the laccase of SEQ ID NO: 146, the Cu-oxidase-3 domain includes at least amino acids 30-152 or 29-153.
[0300] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase domains are known in the art. In the case of the laccase of SEQ ID NO: 146, the Cu-oxidase domain includes at least amino acids 163-305 or 162-306.
[0301] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for Cu-oxidase-2 domains are known in the art. In the case of the laccase of SEQ ID NO: 146, the Cu-oxidase-2 domain includes at least amino acids 365-492 or 364-493.
[0302] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the cellulose binding domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for cellulose binding domains are known in the art. In the case of the cellobiohydrolase of SEQ ID NO: 170, the cellulose binding domain includes at least amino acids 465-493.
[0303] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art. In the case of the cellobiohydrolase of SEQ ID NO: 170, the glycosyl hydrolase domain includes at least amino acids 1-434.
[0304] In some embodiments, provided is an endoglucanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the endoglucanase domain of the endoglucanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for endoglucanase domains are known in the art. In the case of the endoglucanase of SEQ ID NO: 181, the endoglucanase domain includes at least amino acids 32-276.
[0305] In some embodiments, provided is a xylanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the glycosyl hydrolase domain of the xylanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for glycosyl hydrolase domains are known in the art. In the case of the xylanase of SEQ ID NO: 193, the glycosyl hydrolase domain includes at least amino acids 31-221.
[0306] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the cellulose binding domain and the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the cellulose binding domain and glycosyl hydrolase domain are described hereinabove.
[0307] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-3 domain are described hereinabove.
[0308] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase-3 and the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase domain are described hereinabove.
[0309] In some embodiments, provided is a laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase and the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-2 domain are described hereinabove.
[0310] In some embodiments, provided is a lignin peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the haem peroxidase domain of the lignin peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the haem peroxidase domain are described hereinabove.
[0311] In some embodiments, provided is a Mn-dependent peroxidase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the haem peroxidase domain of the Mn-dependent peroxidase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the haem peroxidase domain are described hereinabove.
[0312] In some embodiments, provided is a N. crassa, P. sanguineus, P. cinnabarinus or P. coccineus laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the Cu-oxidase-3 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-3 domain are described hereinabove.
[0313] In some embodiments, provided is a N. crassa, P. sanguineus, P. cinnabarinus or P. coccineus laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase-3 and the Cu-oxidase domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase domain are described hereinabove.
[0314] In some embodiments, provided is a N. crassa, P. sanguineus, P. cinnabarinus or P. coccineus laccase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the Cu-oxidase and the Cu-oxidase-2 domain of the laccase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the Cu-oxidase-2 domain are described hereinabove.
[0315] In some embodiments, provided is a cellobiohydrolase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the cellulose binding domain and the glycosyl hydrolase domain of the cellobiohydrolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the cellulose binding domain and glycosyl hydrolase domain are described hereinabove.
[0316] In some embodiments, provided is an endoglucanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the endoglucanase domain of the endoglucanase enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the endoglucanase domain are described hereinabove.
[0317] In some embodiments, provided is a xylanase-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the glycosyl hydrolase domain of the xylanase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the glycosyl hydrolase domain are described hereinabove.
[0318] Thus, provided herein are methods for redesigning the polypeptide-encoding nucleotide sequence provided herein to modify the translational kinetics of the polypeptide-encoding nucleotide sequence, where the polypeptide-encoding nucleotide sequence is altered such that one or more codon pairs have a decreased likelihood of causing a translational pause or slowing relative to the unaltered polypeptide-encoding nucleotide sequence. For example, one or more nucleotides of a polypeptide-encoding nucleotide sequence can be changed such that a codon pair containing the changed nucleotides has a translational kinetics value indicative of a decreased likelihood of causing a translational pause or slowing relative to the unchanged polypeptide-encoding nucleotide sequence.
[0319] While it will be understood by those of skill in the art that a redesigned polypeptide-encoding nucleotide sequence need not possess a high degree of identity to the polypeptide-encoding nucleotide sequence of the original gene, in some embodiments, the redesigned polypeptide-encoding nucleotide sequence will have at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% nucleotide identity with the polypeptide-encoding nucleotide sequence of the original gene. As used herein an original gene refers to a gene for which codon pair refinement is to be performed; such original genes can be, for example, wild type genes, native genes, naturally occurring mutant genes, other mutant genes such as site-directed mutant genes or engineered or completely synthetic genes. In other embodiments, the polynucleotide sequence will be completely synthetic, and will bear much lower identity with the original gene, e.g., no more than 90%, 80%, 70%, 60%, 50%, 40%, or lower.
[0320] Because of the redundancy of the triplet genetic code it is possible to preserve amino acid sequence coding while redesigning the polypeptide-encoding gene nucleotide sequence. Polypeptide-encoding nucleotide sequences can be redesigned to be convenient to work with and specifically tailored to a particular host and vector system of choice. The resulting sequence can be designed to: (1) reduce or eliminate translational problems caused by inappropriate ribosome pausing, such as those caused by over-represented codon pairs or other codon pairs with translational values predictive of a translational pause; (2) have codon usage refined to avoid over-reliance on rare codons; (3) reduce in number or remove particular restriction sites, splice sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism; or (4) have controlled RNA secondary structure to avoid detrimental translational termination effects, translation initiation effects, or RNA processing, which can arise from, for example, RNA self-hybridization. When a synthetic polypeptide-encoding nucleotide sequence is to be used, this sequence also can be designed to avoid oligonucleotides that mis-hybridize, resulting in genes that can be assembled from refined oligonucleotides that by thermodynamic necessity only pair up in the desired manner, using methods known in the art, as exemplified in U.S. Patent Publication No. 2005/0106590, which is hereby incorporated by reference in its entirety.
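The redundancy of the genetic code relied on above can be illustrated with a short Python sketch that enumerates every alternative codon pair encoding the same two amino acids as a given pair. The sketch uses the standard genetic code; the function names are hypothetical, and the EcoRI recognition sequence GAATTC is used only as a familiar example of a site one might wish to remove.

```python
import itertools

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence using the standard code."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def synonymous_pairs(pair: str) -> list:
    """All 6-nucleotide codon pairs encoding the same two amino acids as
    `pair`, excluding `pair` itself."""
    first, second = pair[:3], pair[3:]
    options = itertools.product(SYNONYMS[CODON_TABLE[first]],
                                SYNONYMS[CODON_TABLE[second]])
    return sorted(a + b for a, b in options if a + b != pair)


# GAA TTC (Glu-Phe) happens to spell the EcoRI site GAATTC.
alternatives = synonymous_pairs("GAATTC")
```

Each alternative encodes the same Glu-Phe dipeptide while no longer containing the restriction site; in a full design the alternatives would additionally be ranked by their translational kinetics values in the chosen host.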
[0321] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to suitably modify the translational kinetics of the mRNA into polypeptide without modifying the amino acid sequence of the encoded polypeptide. In such instances, an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made. In methods in which an amino acid insertion, deletion or mutation is made in order to change translational kinetics, the change is preferably predicted to not substantially influence the final three-dimensional structure of the protein and/or the activity of the protein. Such non-identical polypeptides can vary by containing one or more insertions, deletions and/or mutations. Although the nature and degree of change to the polypeptide sequence can vary according to the purpose of the change, typically such a change results in a polypeptide that is at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the wild type polypeptide sequence.
[0322] In some embodiments, the sequence of the polynucleotide can be generated, optionally in conjunction with optimization of a plurality of parameters where one such parameter can be codon pair usage, where the resultant polynucleotide can be prepared by assembly of a plurality of oligonucleotides sufficiently small to be synthesized by known oligonucleotide synthetic methods. Methods known in the art for optimizing multiple parameters in synthetic nucleotide sequences can be applied to optimizing the parameters recited in the present claims. Such methods may advantageously include those exemplified in U.S. Patent App. Publication No. 2005/0106590, U.S. Patent App. Publication No. 2007/0009928, and R. H. Lathrop et al. "Multi-Queue Branch-and-Bound Algorithm for Anytime Optimal Search with Biological Applications" in Proc. Intl. Conf. on Genome Informatics, Tokyo, Dec. 17-19, 2001 pp. 73-82; in Genome Informatics 2001 (Genome Informatics Series No. 12), Universal Academy Press, which are incorporated herein by reference in their entireties. Briefly, in addition to optimizing the various parameters, an exemplary method for generating a sequence can also include dividing the desired sequence into a plurality of partially overlapping segments; optimizing the melting temperatures of the overlapping regions of each segment to disfavor hybridization to the overlapping segments which are non-adjacent in the desired sequence; allowing the overlapping regions of single stranded segments which are adjacent to one another in the desired sequence to hybridize to one another under conditions which disfavor hybridization of non-adjacent segments; and filling in, ligating, or repairing the gaps between the overlapping regions, thereby forming a double-stranded DNA with the desired sequence. This process can be performed manually or can be automated, e.g., in a general purpose digital computer. In one embodiment, the search of possible codon assignments is mapped into an anytime branch and bound computerized algorithm developed for biological applications.
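The first step of the exemplary method, dividing the desired sequence into partially overlapping segments and examining the overlapping regions, can be sketched as follows. This Python sketch is illustrative only and makes simplifying assumptions: segments are cut at fixed intervals on a single strand, and melting temperature is estimated with the rough Wallace rule (2 °C per A or T, 4 °C per G or C), which stands in for the thermodynamic models used in practice. The function names are hypothetical and do not reproduce the methods of the cited publications.

```python
def overlapping_segments(seq: str, segment_len: int, overlap: int) -> list:
    """Cut seq into segments of segment_len that share `overlap`
    nucleotides with their neighbor; the last segment may be shorter."""
    if not 0 < overlap < segment_len:
        raise ValueError("expected 0 < overlap < segment_len")
    step = segment_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, len(seq))
        segments.append(seq[start:end])
        if end == len(seq):
            break
        start += step
    return segments


def overlap_regions(segments: list, overlap: int) -> list:
    """The region each segment (after the first) shares with its
    predecessor."""
    return [seg[:overlap] for seg in segments[1:]]


def wallace_tm(oligo: str) -> int:
    """Rough melting temperature estimate (Wallace rule), in degrees C."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


toy_seq = "ATGCATGCATGCATGCATGC"  # toy sequence, not a real gene
toy_segments = overlapping_segments(toy_seq, segment_len=10, overlap=4)
toy_overlaps = overlap_regions(toy_segments, overlap=4)
toy_tms = [wallace_tm(region) for region in toy_overlaps]
```

In a design loop, the spread of the overlap melting temperatures and the distinctness of the overlap regions would be evaluated, and synonymous codon choices or segment boundaries adjusted until adjacent segments are favored to hybridize over non-adjacent ones.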
[0323] Accordingly, provided herein are methods of designing a synthetic nucleotide sequence for the polynucleotides provided herein, where the synthetic nucleotide sequence also is typically designed to have desirable translational kinetics properties, such as the removal of some or all codon pairs predicted to result in a translational pause or slowing. Such design methods include determining a set of partially overlapping segments with optimized melting temperatures, and determining the translational kinetics of the synthetic sequence, where if it is desired to change the translational kinetics of the synthetic gene, the sequences of the overlapping segments are modified and refined in order to approximate the desired translational kinetics while still possessing acceptable hybridization properties. In some embodiments, this process is performed iteratively. In some embodiments, a criterion is established for selecting codon pairs having high translational kinetics values to be replaced with codon pairs having lower translational kinetics values unless a codon pair of this group is the site of a planned pause. For example, the top 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% of codon pairs ranked by translational kinetics values can be replaced by codon pairs having lower translational kinetics values, such as a translational kinetics value below a user-defined level that can be, for example, a translational kinetics value equal to or below the translational kinetics values of codon pairs not in the top selected percentage, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced). In another example, all codon pairs above a user-selected translational kinetics value, such as more than 5, 4.5, 4, 3.5, 3, 2.5 or 2 standard deviations above the mean translational kinetics value, can be replaced by codon pairs having lower translational kinetics values, such as a translational kinetics value below a user-defined level that can be, for example, a translational kinetics value that is 4, 3.5, 3, 2.5, 2, 1.5 or 1 standard deviations less than the mean translational kinetics value, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced). Further synthetic nucleotide sequence refinement methods can be employed where additional properties of the synthetic nucleotide sequence can be refined in addition to hybridization and codon pair usage properties, where such properties can include, for example, codon usage, reduced number of restriction sites or Shine-Dalgarno sequences, or reduced detrimental RNA secondary structure, as described above.
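For purposes of illustration only, and not for limitation, the two selection criteria described above (a top percentage of ranked codon pairs, or all codon pairs above a threshold number of standard deviations) can be sketched as follows. The function names, the default values, and the representation of a gene as a list of per-position z scores are assumptions made for this sketch and are not taken from the specification.

```python
def select_by_threshold(z_scores, threshold=3.0, planned_pauses=frozenset()):
    """Positions whose codon pair lies more than `threshold` standard
    deviations above the mean, excluding positions reserved as planned pauses."""
    return [i for i, z in enumerate(z_scores)
            if z > threshold and i not in planned_pauses]


def select_top_fraction(z_scores, fraction=0.05, planned_pauses=frozenset()):
    """Positions of the top `fraction` of codon pairs ranked by value,
    excluding positions reserved as planned pauses."""
    n = max(1, round(len(z_scores) * fraction))
    ranked = sorted(range(len(z_scores)), key=lambda i: z_scores[i], reverse=True)
    return sorted(i for i in ranked[:n] if i not in planned_pauses)


# Hypothetical per-position z scores for the codon pairs of a short gene.
example = [0.1, 3.5, -1.0, 4.2, 2.9, 0.0, 5.1, 1.0, 0.3, -0.4]
flagged = select_by_threshold(example, threshold=3.0, planned_pauses={3})
top = select_top_fraction(example, fraction=0.2)
```

The positions returned are candidates for replacement by synonymous codon pairs having lower values; a position listed in `planned_pauses` is left untouched.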
[0324] Those skilled in the art will recognize that various optimization methods can be used, e.g., simulated annealing, genetic algorithms, branch and bound techniques, hill-climbing, Monte Carlo methods, other search strategies, and the like. Thus, the methods provided herein for designing the polynucleotide sequences provided herein, which include optimization of a plurality of parameters, where one such parameter is codon pair usage, can be implemented by applying those parameters to art-recognized algorithms or techniques. Advantageously, sequence design is performed using an optimization method that designs a synthetic nucleotide sequence encoding the polypeptide to be expressed.
[0325] The polynucleotide sequence design methods provided herein can be employed where a plurality of properties of the polynucleotide sequences can be refined in addition to codon pair usage properties, where such properties can include, but are not limited to, melting temperature gap between oligonucleotides of synthetic gene, average codon usage, average codon pair chi-squared (e.g., z score), worst codon usage, worst codon pair (e.g., z score), maximum usage in adjacent codons, Shine-Dalgarno sequence (for E. coli expression), occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, user-prohibited sequences (e.g., other restriction sites), codon usage of a specific codon above user-specified limit, and out-of-frame stop codons (framecatchers). In embodiments that include expression in a eukaryotic host organism, additional properties that can be considered in a process of designing a polynucleotide sequence include, but are not limited to, occurrences of RNA splice sites, occurrences of polyA sites, and occurrence of ribosome binding sequence. For example, a process of designing a polynucleotide sequence can include constraints including, but not limited to, minimum melting temperature gap between oligonucleotides of synthetic gene, minimum average codon usage, maximum average codon pair chi-squared (z score), minimum absolute codon usage, maximum absolute codon pair (z score), minimum maximum usage in adjacent codons, no Shine-Dalgarno sequence (for E. coli expression), no occurrences of 5 consecutive G's or 5 consecutive C's, no occurrences of 6 consecutive A's or 6 consecutive T's, no long exactly repeated subsequences, no cloning restriction sites, no user-prohibited sequences (e.g., other restriction sites), and optionally no codon usage of a specific codon above user-specified limit. In embodiments that include expression in a eukaryotic host organism, additional constraints can include, but are not limited to, minimum occurrences of RNA splice sites, minimum occurrences of polyA sites, and occurrence of ribosome binding sequence. A process of designing a polynucleotide sequence can include preferences including, but not limited to, prefer high average codon usage, prefer low average codon pair chi-squared, prefer larger melting temperature gap, prefer more out of frame stop codons (framecatchers), and optionally prefer evenly distributed codon usage. Any of a variety of nucleotide sequence refinement/optimization methods known in the art can be used to refine the polynucleotide sequence according to the codon pair usage properties, and according to any of the additional properties specifically described above, or other properties that are refined in nucleotide sequence redesign methods known in the art. In some embodiments, a branch and bound method is employed to refine the polynucleotide sequence according to codon pair usage properties and at least one additional property, such as codon usage.
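For purposes of illustration only, several of the sequence-level constraints listed above (runs of 5 consecutive G's or C's, runs of 6 consecutive A's or T's, and user-prohibited sequences) can be checked with a short routine such as the following. The function name and the returned dictionary keys are hypothetical, and the use of AGGAGG as a stand-in Shine-Dalgarno motif is an assumption of this sketch rather than a motif recited in the specification.

```python
import re


def constraint_violations(seq, prohibited=("AGGAGG",)):
    """Report start positions (0-based) of constraint violations in `seq`.

    `prohibited` is a user-supplied tuple of forbidden motifs; the default
    AGGAGG is only an assumed example of a Shine-Dalgarno-like sequence.
    """
    seq = seq.upper()
    found = {
        # non-overlapping runs of 5 identical G or C nucleotides
        "G_or_C_run": [m.start() for m in re.finditer(r"G{5}|C{5}", seq)],
        # non-overlapping runs of 6 identical A or T nucleotides
        "A_or_T_run": [m.start() for m in re.finditer(r"A{6}|T{6}", seq)],
    }
    found["prohibited"] = [
        (motif, i)
        for motif in prohibited
        for i in range(len(seq))
        if seq.startswith(motif, i)
    ]
    return found


violations = constraint_violations("ATGGGGGGCAAAAAAGGAGGTT")
```

A candidate sequence with an empty list under every key satisfies these particular constraints; the remaining constraints (melting temperature gap, codon usage limits, and the like) would need their own checks.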
[0326] In some embodiments, the methods provided herein can further include analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that stop codons are added to at least one said frame shift. In additional embodiments, the generating step further includes analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that one or more stop codons in one, two or three reading frames are added downstream of polypeptide-encoding region of the nucleotide sequence.
[0327] In some embodiments, methods are provided for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate
nucleotides to select, where possible, codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
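For purposes of illustration only, the generation of a candidate polynucleotide sequence with reference to a codon pair data set can be sketched as follows. The sketch uses a simple dynamic programming search over synonymous codons, which is a substitute chosen here for brevity and is not the branch and bound method mentioned elsewhere herein. The function name `design_sequence`, the objective (minimizing the sum of codon pair values), and the treatment of pairs absent from the data set as having a value of zero are all assumptions of this sketch.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def design_sequence(protein, pair_score):
    """Return (total score, DNA) for the encoding of a non-empty `protein`
    that minimizes the summed values of adjacent codon pairs.

    `pair_score` maps a 6-nucleotide codon pair to its translational
    kinetics value; pairs not listed are assumed to score 0.0.
    """
    # best[codon] = (cost, sequence) of the cheapest prefix ending in codon
    best = {c: (0.0, c) for c in SYNONYMS[protein[0]]}
    for aa in protein[1:]:
        step = {}
        for codon in SYNONYMS[aa]:
            cost, prefix = min(
                ((prev_cost + pair_score.get(prev + codon, 0.0), prev_seq)
                 for prev, (prev_cost, prev_seq) in best.items()),
                key=lambda item: item[0],
            )
            step[codon] = (cost, prefix + codon)
        best = step
    return min(best.values(), key=lambda item: item[0])


# Hypothetical data set with three scored codon pairs.
scores = {"ATGAAA": 4.0, "AAGCTG": 5.0, "AAGTTA": -1.0}
result = design_sequence("MKL", scores)
```

Because the score of each pair depends only on two adjacent codons, the search cost grows linearly with protein length. Other objectives, such as minimizing only the values above a pause threshold, could be substituted by changing the cost term.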
[0328] Also provided herein are methods for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a first data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a second data set representative of at least one additional desired property of the synthetic gene, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, both (i) codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the first data set, and (ii) nucleotides that provide a desired property, with reference to the second data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide. In some embodiments, a branch and bound method is employed to refine the polypeptide-encoding nucleotide sequence according to codon pair usage properties of the first data set and according to the properties of the second data set. In some embodiments, the second data set contains codon preferences representative of codon usage by the host organism, including the most common codons used by the host organism for a given amino acid.
[0329] Accordingly, provided herein is a hydrolysis enzyme-encoding DNA sequence, wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type hydrolysis enzyme polypeptide sequence as set forth in the sequence listing. In certain aspects of the above embodiments, the polynucleotide provided herein is adapted for expression in a heterologous host organism. A heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism. In certain aspects, the host organism is not human, E. coli or S. cerevisiae.
[0330] In certain aspects of the above embodiments, at least 1, 2 or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. As described further below, a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than a designated threshold, wherein a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
[0331] Also provided herein is a hydrolysis enzyme-encoding DNA sequence that has at least a 75% sequence identity with an original hydrolysis enzyme polypeptide sequence as set forth in the sequence listing and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M. mulatta (monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster and Schizosaccharomyces pombe.
[0332] Thus, the methods provided herein can include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold. As described elsewhere herein, the likelihood that a particular codon pair will cause translational pausing or slowing in an organism (or the relative predicted magnitude thereof) can be represented by a translational kinetics value. The translational kinetics value can be expressed in any of a variety of manners in accordance with the guidance provided herein. In one example, a translational kinetics value can be expressed in terms of the mean translational kinetics value and the corresponding standard deviation for all codon pairs in an organism. For example, the translational kinetics value for a particular codon pair can be expressed in terms of the number of standard deviations that separate the translational kinetics value of the codon pair from the mean translational kinetics value. In methods that include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold, a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value. Although such a method is described in terms of a binary
scoring of a codon pair as either at least or less than the threshold value, one skilled in the art, in view of the teachings herein, will recognize that multiple thresholds can be used, or methods can be used that weight a codon pair along a continuum according to the translational kinetics value, based on the teachings provided herein and the general knowledge in the art.
[0333] In some embodiments, in addition to generating a candidate nucleotide sequence according to codon pair usage properties, the methods provided herein also include generating a candidate nucleotide sequence according to codon usage. As is known in the art, different organisms can have different preferences for the three-nucleotide codon sequence encoding a particular amino acid. As a result, translation can often be improved by using the most common three-nucleotide codon sequence encoding a particular amino acid. Thus, some methods provided herein also include generating a candidate nucleotide sequence such that codon utilization is non-randomly biased in favor of codons most commonly used by the host organism. Codon usage preferences are known in the art for a variety of organisms and methods for selecting the more commonly used codons are well known in the art.
[0334] In some embodiments, the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize the predicted translational kinetics. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, the conflict is resolved by selecting the nucleotide sequence predicted to be translated more rapidly, for example, due to fewer predicted translational pauses. In some embodiments, the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize codon pair usage preferences. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, codon pair usage will be accorded more weight in order to resolve the conflict between the more than one possible nucleotide sequences. In one example, the methods provided herein can include identifying at least one instance of a conflict between selecting common codons and avoiding codon pairs predicted to cause a translational
pause; in such instances, the conflict is resolved in favor of avoiding codon pairs predicted to cause a translational pause.
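For purposes of illustration only, resolving a conflict in favor of avoiding predicted pauses can be sketched as a two-level ranking, in which candidates are first compared by the number of codon pairs above a pause threshold and only then by average codon usage. The function names, the threshold default, and the example frequencies are hypothetical and are not taken from the specification.

```python
def rank_key(candidate, pair_z, codon_freq, pause_threshold=3.0):
    """Sort key: fewer predicted pauses first, then higher mean codon usage."""
    codons = [candidate[i:i + 3] for i in range(0, len(candidate), 3)]
    pauses = sum(
        1
        for a, b in zip(codons, codons[1:])
        if pair_z.get(a + b, 0.0) > pause_threshold
    )
    usage = sum(codon_freq.get(c, 0.0) for c in codons) / len(codons)
    return (pauses, -usage)


def pick_candidate(candidates, pair_z, codon_freq, pause_threshold=3.0):
    """Choose the candidate preferred under the two-level ranking."""
    return min(
        candidates,
        key=lambda s: rank_key(s, pair_z, codon_freq, pause_threshold),
    )


# Two synonymous encodings of Lys-Glu with assumed usage frequencies.
freq = {"AAA": 0.7, "AAG": 0.3, "GAA": 0.6}
with_conflict = pick_candidate(["AAAGAA", "AAGGAA"], {"AAAGAA": 4.0}, freq)
without_conflict = pick_candidate(["AAAGAA", "AAGGAA"], {}, freq)
```

In the first call the more common codon AAA would create a pair above the threshold, so the less common AAG is chosen; in the second call no pause is predicted and codon usage decides.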
[0335] Some embodiments provided herein include generating a candidate polynucleotide sequence encoding the polypeptide sequence, the candidate polynucleotide sequence having a non-random codon pair usage, such that the codon pairs encoding any particular pair of amino acids have the lowest translational kinetics values. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the encoded amino acid sequence is not altered. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the three dimensional structure of the encoded polypeptide is not substantially altered. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that no more than conservative amino acid changes are made to the encoded polypeptide.
[0336] The methods provided herein can further include a step of refining or altering the candidate polynucleotide sequence in accordance with a second nucleotide sequence property to be refined. For example, in embodiments in which codon usage is also refined, the methods further include generating or refining a candidate polynucleotide sequence encoding a polypeptide sequence such that the candidate polynucleotide sequence has a non-random codon usage, where the most common codons used by the host organism are over-represented in the candidate polynucleotide sequence. The methods can include refining or altering the candidate polynucleotide sequence in accordance with any of a variety of additional properties provided herein, including but not limited to, melting temperature gap between oligonucleotides of synthetic gene, Shine-Dalgarno sequence, occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, or any other user-prohibited sequences. Further, any of a variety of combinations of these properties can be additionally included in the nucleotide sequence refinement methods provided herein.
[0337] The method provided herein can further include an evaluation step in which after the candidate polynucleotide sequence is altered, the sequence is compared with at least a portion of a data set of a property against which the sequence was refined. In such methods, it is possible to compare the candidate sequence to the data set in order to determine whether or not the candidate sequence possesses the desired or acceptable
properties with respect to the data set. For example, subsequent to a round of nucleotide sequence refinement, it can be evaluated whether or not the codon pairs of the candidate sequence have acceptable translational kinetics values. If the values are deemed to be acceptable or desired, no further sequence alteration is required with respect to the property. In view of the methods provided herein which can be directed to the refinement or optimization of a plurality of properties, the candidate nucleotide sequence can be compared to each property considered in the refinement, and, if the values for all properties are deemed to be acceptable or desired, no further sequence alteration is required. If the values for fewer than all properties are deemed to be acceptable or desired, the candidate nucleotide sequence can be subjected to further sequence alteration and evaluation.
[0338] Thus, it is contemplated herein that the sequence alteration steps of methods provided herein can be performed iteratively. That is, one or more steps of altering the nucleotide sequence can be performed, and the candidate nucleotide sequence can be evaluated to determine whether or not further sequence alteration is necessary and/or desirable. These steps can be repeated until values for all properties are deemed to be acceptable or desired, or until no further improvement can be achieved.
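For purposes of illustration only, the iterative alteration and evaluation described above can be sketched as a greedy loop that stops either when every codon pair is acceptable or when no single synonymous substitution improves the candidate. The function names, the per-codon `synonyms` mapping supplied by the caller, and the use of a pause count as the sole evaluated property are assumptions of this sketch.

```python
def count_pauses(codons, pair_z, threshold):
    """Number of adjacent codon pairs whose value exceeds `threshold`."""
    return sum(
        1 for a, b in zip(codons, codons[1:])
        if pair_z.get(a + b, 0.0) > threshold
    )


def refine(codons, synonyms, pair_z, threshold=3.0, max_rounds=20):
    """Iteratively apply single synonymous substitutions that lower the
    number of predicted pauses; `synonyms` maps a codon to its alternatives."""
    codons = list(codons)
    for _ in range(max_rounds):
        current = count_pauses(codons, pair_z, threshold)
        if current == 0:
            break  # all values acceptable, no further alteration required
        improved = False
        for i, codon in enumerate(codons):
            for alt in synonyms.get(codon, ()):
                trial = codons[:i] + [alt] + codons[i + 1:]
                if count_pauses(trial, pair_z, threshold) < current:
                    codons, improved = trial, True
                    break
            if improved:
                break
        if not improved:
            break  # no further improvement can be achieved
    return codons


# Hypothetical synonym table and codon pair values for Met-Lys-Glu.
syn = {"AAA": ["AAG"], "AAG": ["AAA"], "GAA": ["GAG"], "GAG": ["GAA"]}
z = {"AAAGAA": 4.0, "ATGAAG": 3.5}
refined = refine(["ATG", "AAA", "GAA"], syn, z)
```

Here replacing AAA with AAG would merely move the predicted pause to the preceding pair, so the loop instead accepts the substitution of GAG for GAA.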
Determination of translational kinetics values for codon pairs
[0339] The methods and sequences provided herein include determination and use of translational kinetics values for codon pairs. As provided herein, such a translational kinetics value can be calculated and/or empirically measured, and the final translational kinetics value used in graphical displays and methods of predicting translational kinetics can be a refined value resultant from two or more types of codon pair translational kinetics information. The various types of codon pair translational kinetics information that can be used in refining or replacing a translational kinetics value for a codon pair include, for example, values of observed versus expected codon pair frequencies in a particular organism, normalized values of observed versus expected codon pair frequencies in a particular organism, the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species, the degree to which observed versus expected codon pair frequency values are conserved at predicted pause sites such as boundaries between autonomous folding units in related proteins across two or more species, the degree to which codon pairs are
conserved at predicted pause sites across different proteins in the same species, and empirical measurement of translational kinetics for a codon pair.
[0340] The values of observed versus expected codon pair frequencies in a host organism can be determined by any of a variety of methods known in the art for statistically evaluating observed occurrences relative to expected occurrences. Regardless of the statistical method used, this typically involves obtaining codon sequence data for the organism, for example, on a gene-by-gene basis. In some embodiments, the analysis is focused only on the coding regions of the genome. Because the analysis is a statistical one, a large database is preferred. Initially, the total number of codons is determined and the number of times each of the 61 non-terminating codons appears is determined. From this information, the expected frequency of each of the 3721 ($61^2$) possible non-terminating codon pairs is calculated, typically by multiplying together the frequencies with which each of the component codons appears. This frequency analysis can be carried out on a global basis, analyzing all of the sequences in the database together; however, it is typically done on a local basis, analyzing each sequence individually. This will tend to minimize the statistical effect of an unusually high proportion of rare codons in a sequence. After the frequency data is obtained, for each sequence in the database, the expected number of occurrences of each codon pair is calculated by, for example, multiplying the expected frequency by the number of pairs in the sequence. This information can then be added to a global table, and each next succeeding sequence can be analyzed in like manner. This analysis results in a table of expected and observed values for each of the 3721 non-terminating codon pairs. The statistical significance of the variation between the expected and observed values can then be calculated, and the resulting information can be used in further practice of the various examples and embodiments provided herein.
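For purposes of illustration only, the local (sequence-by-sequence) tabulation of observed and expected codon pair counts described above can be sketched as follows, together with the unadjusted statistic (observed - expected)^2 / expected for each pair. The function names are hypothetical, and the sketch assumes that each input is an in-frame coding sequence containing at most one stop codon, located at its end.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def pair_tables(sequences):
    """Accumulate observed and expected codon pair counts over `sequences`,
    computing codon frequencies separately within each sequence."""
    observed, expected = Counter(), Counter()
    for seq in sequences:
        seq = seq.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        if codons and codons[-1] in STOP_CODONS:
            codons.pop()  # assumption: only a trailing stop codon is present
        if len(codons) < 2:
            continue
        freq = {c: n / len(codons) for c, n in Counter(codons).items()}
        pairs = [a + b for a, b in zip(codons, codons[1:])]
        observed.update(pairs)
        # expected count = product of codon frequencies x number of pairs
        for a, b in product(freq, repeat=2):
            expected[a + b] += freq[a] * freq[b] * len(pairs)
    return observed, expected


def chi_squared(observed, expected):
    """Per-pair (observed - expected)^2 / expected."""
    return {p: (observed.get(p, 0) - e) ** 2 / e for p, e in expected.items()}


obs, exp = pair_tables(["AAAGAAAAAGAATAA"])
stat = chi_squared(obs, exp)
```

In this toy input the codons AAA and GAA each have frequency 0.5 and there are three adjacent pairs, so every one of the four possible pairs has an expected count of 0.75.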
[0341] In some embodiments, the values of observed versus expected codon pair frequencies are chi-squared values, such as chi-squared 2 (chisq2) values or chi-squared 3 (chisq3) values. Methods for calculating chi-squared values can be performed according to any method known in the art, as exemplified in U.S. Patent No. 5,082,767, which is incorporated by reference herein in its entirety. The result of chi-squared calculations is a list of 3,721 non-terminating codon pairs, each with an expected and observed value, together with a value for chi-squared (chisq1): $\mathrm{chisq1} = (\mathrm{observed} - \mathrm{expected})^2 / \mathrm{expected}$
[0342] In order to remove the contribution to chi-squared of non-randomness in amino acid pairs, a new value chi-squared 2 (chisq2) can be calculated as follows. For each group of codon pairs encoding the same amino acid pair (i.e., 400 groups), the sums of the expected and observed values are tallied; any non-randomness in amino acid pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq2, is evaluated using these new expected values. Calculation methods for removing the contribution to chi-squared of non-randomness in amino acid pairs are known in the art, as exemplified in Gutman and Hatfield, Proc. Natl. Acad. Sci. USA, (1989) 86:3699-3703.
[0343] Further, in order to remove the contribution to chi-squared of non-randomness in dinucleotides, a new value chi-squared 3 (chisq3) can be calculated. Correction is made only for those dinucleotides formed between adjacent codon pairs; any bias of dinucleotides within codons (codon triplet positions I-II and II-III) will directly affect codon usage and is, therefore, automatically taken into account in the underlying calculations. For each dinucleotide pair formed between adjacent codon pairs (i.e., 16 pairs), the sums of the expected and observed values are tallied; any non-randomness in dinucleotide pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq3, is evaluated using these new expected values.
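For purposes of illustration only, the group-wise rescaling of expected values described in the two preceding paragraphs can be sketched with a single helper that accepts a grouping function. Grouping by encoded amino acid pair corresponds to the chisq2 adjustment, and grouping by the dinucleotide spanning the codon boundary corresponds to the chisq3 adjustment. The function names and the four-pair example table are hypothetical; whether the two adjustments are applied separately or in succession is not decided by this sketch, which shows each applied on its own to the unadjusted expected values.

```python
from collections import defaultdict


def rescaled_chi_squared(observed, expected, group_key):
    """Scale expected values so that, within each group, their sum equals the
    sum of observed values; then return (obs - exp')^2 / exp' per codon pair."""
    group_obs, group_exp = defaultdict(float), defaultdict(float)
    for pair, exp in expected.items():
        key = group_key(pair)
        group_obs[key] += observed.get(pair, 0)
        group_exp[key] += exp
    result = {}
    for pair, exp in expected.items():
        key = group_key(pair)
        new_exp = exp * group_obs[key] / group_exp[key]
        if new_exp > 0:  # a group with no observations yields no statistic
            result[pair] = (observed.get(pair, 0) - new_exp) ** 2 / new_exp
    return result


# Hypothetical miniature table: two Lys codons followed by two Glu codons.
CODON_TO_AA = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}
expected = {"AAAGAA": 10.0, "AAAGAG": 10.0, "AAGGAA": 10.0, "AAGGAG": 10.0}
observed = {"AAAGAA": 30, "AAAGAG": 10, "AAGGAA": 10, "AAGGAG": 10}


def by_amino_acid_pair(pair):
    return CODON_TO_AA[pair[:3]] + CODON_TO_AA[pair[3:]]


def by_boundary_dinucleotide(pair):
    return pair[2:4]  # last base of first codon plus first base of second


chisq2_like = rescaled_chi_squared(observed, expected, by_amino_acid_pair)
chisq3_like = rescaled_chi_squared(observed, expected, by_boundary_dinucleotide)
```

In the example, all four pairs encode Lys-Glu, so the amino acid pair adjustment scales every expected value by 60/40; the dinucleotide adjustment instead scales the two AG-boundary pairs by 40/20 and leaves the two GG-boundary pairs unchanged.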
[0344] As provided herein, and as will be readily apparent to those skilled in the statistical art, further values chi-squared N (chisqN) could be calculated similarly by removing one or more other variables in like fashion.
[0345] Analyses of the E. coli, S. cerevisiae, and human databases illustrate two important features. First, there is a highly significant codon pair bias in all three species, even after the amino acid nearest neighbor bias (chisq2) and the dinucleotide bias (chisq3) are discounted. Second, the effect associated with dinucleotide bias, i.e., the difference between chisq2 and chisq3, is much more pronounced in eukaryotes than in E. coli. It is by far the predominant effect in mammals, representing two thirds of the amount of chisq2 in excess of its expectation in human. Mouse and rat data exhibit a very similar pattern. Dinucleotide bias represents a smaller effect in yeast, and only a
very minor one in E. coli. Although the predominant dinucleotide bias in human is the well-known CpG deficit, other dinucleotides are also very highly biased. For example, there is a deficit of TA, as well as an excess of TG, CA and CT. Overall, the deficit of CpG contributes only 35% of the total dinucleotide bias in the human database, and 17% in yeast.
[0346] As provided herein, the values of observed versus expected codon pair frequencies in a host organism can be normalized. Normalization permits different sets of values of observed versus expected codon pair frequencies to be compared by placing these values on the same numerical scale. For example, normalized codon pair frequency values can be compared between different organisms, or can be compared for different codon pair frequency value calculations within a particular organism (e.g., different calculations based on input sequence information or based on different calculations such as chisq1 or chisq2 or chisq3). Typically, normalization results in codon pair frequency values that are described in terms of their mean and standard deviation from the mean.
[0347] An exemplary method for normalizing codon pair frequency values is the calculation of z scores. The z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation. The mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will have a mean of zero and a standard deviation of one. The z score transformation can be especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. The z scores are especially informative when the distribution to which they refer is normal. In a normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.
[0348] An exemplary method for determining z scores for codon pair chi-squared values is as follows: First, a list of all 3721 possible non-terminating codon pairs is generated. Second, for the $i$th codon pair, the $i$th chi-squared value is calculated, where the $i$th chi-squared value is denoted $c_i$. The chi-squared value, $c_i$, is given the sign of (observed - expected), so that over-represented codon pairs are assigned a positive $c_i$ and under-represented codon pairs are assigned a negative $c_i$. The formula for $c_i$ is: $c_i = \operatorname{sgn}(\mathrm{obs}_i - \mathrm{exp}_i) \cdot (\mathrm{obs}_i - \mathrm{exp}_i)^2 / \mathrm{exp}_i$
[0349] Third, the mean chi-squared value is calculated where the mean is denoted $m$. The formula for the mean is: $m = \left(\sum_i c_i\right) / 3721$ where $\sum_i$ means sum over $i$. Fourth, the standard deviation of the chi-squared values is calculated, where the standard deviation is denoted $s$. The formula for the standard deviation is: $s = \sqrt{\sum_i (c_i - m)^2 / 3721}$. Fifth, for the $i$th chi-squared value $c_i$, a z score is calculated by subtracting the mean then dividing by the standard deviation, wherein the $i$th z score is denoted $z_i$. The formula for the z score is: $z_i = (c_i - m) / s$
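For purposes of illustration only, the five steps above can be sketched as follows. The function names are hypothetical. The sketch divides by the number of codon pairs actually supplied, which equals 3721 only when the full table of non-terminating codon pairs is provided, and it assumes every expected value is positive.

```python
from math import sqrt


def signed_chi_squared(obs, exp):
    """(obs - exp)^2 / exp carrying the sign of (obs - exp)."""
    diff = obs - exp
    sign = (diff > 0) - (diff < 0)
    return sign * diff ** 2 / exp


def z_scores(observed, expected):
    """Convert signed chi-squared values into z scores using the mean and
    the population standard deviation over all supplied codon pairs."""
    c = {pair: signed_chi_squared(observed.get(pair, 0), exp)
         for pair, exp in expected.items()}
    n = len(c)
    m = sum(c.values()) / n
    s = sqrt(sum((value - m) ** 2 for value in c.values()) / n)
    if s == 0:
        raise ValueError("all chi-squared values are identical")
    return {pair: (value - m) / s for pair, value in c.items()}


# Hypothetical four-pair table; a real table would hold 3721 entries.
expected = {"AAAGAA": 10.0, "AAAGAG": 10.0, "AAGGAA": 10.0, "AAGGAG": 10.0}
observed = {"AAAGAA": 30, "AAAGAG": 10, "AAGGAA": 10, "AAGGAG": 10}
z = z_scores(observed, expected)
```

As the preceding discussion of the z score transformation states, the resulting values have a mean of zero and a standard deviation of one, which can serve as a quick check on any implementation.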
[0350] The above-described values of observed codon pair frequency versus expected codon pair frequency can be used as first approximations of translational kinetics of a polypeptide-encoding nucleotide sequence. However, such values are not true predictors of translational kinetics, and refinement of such values to more accurately predict translational kinetics can be performed according to the methods provided herein. Thus, provided herein are methods of refining the predictive capability of a translational kinetics value of a codon pair in a host organism by providing an initial translational kinetics value based on the value of observed codon pair frequency versus expected codon pair frequency for a codon pair in a host organism, providing additional translational kinetics data for the codon pair in the host organism, and modifying the initial translational kinetics value according to the additional codon pair translational kinetics data to generate a refined translational kinetics value for the codon pair in the host organism. The translational kinetics data that can be used to refine translational kinetics values and methods of modifying translational kinetics values according to such additional translational kinetics data to generate a refined translational kinetics value for a codon pair in a host organism are provided below.
[0351] In one embodiment, translational kinetics data that can be used to refine translational kinetics values are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair. Recurrence-based refinement of translational kinetics values is based on the investigation of multiple polypeptide-encoding nucleotide sequences to determine whether or not there are multiple occurrences of either codon pairs or predicted translational kinetics values in those sequences. Recurrence-based refinement of translational kinetics can be performed
using any of a variety of known sequence comparison methods consistent with the examples provided herein. For purposes of exemplification, and not for limitation, the following example of recurrence-based refinement of translational kinetics is provided.
[0352] In one exemplary embodiment, the predicted translational kinetics value for a codon pair can be refined according to the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species. As provided herein, related proteins are proteins having homologous amino acid sequences and/or similar three dimensional structures. Related proteins having homologous amino acid sequences will typically have at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% sequence identity. Related proteins having similar three dimensional structures will typically share similar secondary structure topology and similar relative positioning of secondary structural elements; exemplary related proteins having similar three dimensional structures are members of the same SCOP-classified Family (see, e.g., Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.).
[0353] The observed versus expected codon pair frequency values for any given codon pair can vary from species to species. However, as provided herein, evolutionarily related proteins in different species will typically conserve some or all translational pause or slowing sites. Based on this, an observed conservation of one or more predicted translational pause or slowing sites in evolutionarily related proteins of different species can confirm or increase the likelihood that a translational pause or slowing site is a functional translational kinetics signal. The codon pair located at the position on a protein that is confirmed as, or considered to have an increased likelihood of, containing an actual translational pause or slowing can itself be confirmed as being, or considered to have an increased likelihood of being, a functional translational kinetics signal. Similarly, a codon pair located at a position on a protein that is confirmed as not containing, or considered to have a decreased likelihood of containing, an actual translational pause or slowing, can itself be confirmed as not acting, or considered to have a decreased likelihood of acting, as a functional translational kinetics signal. Accordingly, initially predicted translational kinetics data, e.g., data based on values of observed codon pair frequency versus expected codon pair frequency, can be modified according to conserved codon pair frequency values across two or more species, which can lead to the codon pair being confirmed as: being a functional translational kinetics
signal; being considered to have an increased likelihood of being a functional translational kinetics signal; being confirmed as not acting as an actual translational pause codon pair; or being considered to have a decreased likelihood of being a functional translational kinetics signal.
[0354] In another embodiment, the predicted translational kinetics value for a codon pair can be refined according to the presence of the codon pair at a location predicted by methods other than codon pair frequency methods to contain a translational pause or slowing site. One example of such a predicted location is a boundary location between autonomous folding units of a protein. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a secondary structural element of a protein and/or a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure by the nascent protein prior to further downstream translation, and thereby allowing each domain to partially organize and commit to a particular, independent fold. As such, it is proposed herein that codon pairs can be associated with translational pauses between autonomous folding units of a protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain. Thus, the presence of a codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likelihood that the codon pair acts to pause or slow translation. Accordingly, predicted translational kinetics data, e.g., data based on values of observed codon pair frequency versus expected codon pair frequency, can be modified according to the presence of the codon pair at a boundary location between autonomous folding units of a protein, which can increase the likelihood that the codon pair acts to pause or slow translation. For example, an over-represented codon pair that is present at a boundary location between autonomous folding units of a protein can be confirmed as acting as a translational pause or slowing codon pair.
[0355] In the above embodiment, a single observation of the codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likely translational pause or slowing properties of a codon pair. However, typically a plurality of observations will be used to more accurately estimate the translational pause or slowing properties of a codon pair. Thus, methods of using, for example, predicted boundary locations can be combined with methods that are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a
codon pair in methods of refining a predicted translational kinetics value for a codon pair. For example, a protein present in two or more species can have conserved boundary locations between autonomous folding units of the protein, and recurrent presence of an over-represented codon pair at the boundary locations can confirm the likelihood of an actual translational pause at that boundary location, leading to confirmation, or increased likelihood, that the corresponding codon pair for the respective species acts as a translational pause or slowing codon pair. In another example, two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of an over-represented codon pair at the boundary locations can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
[0356] Such recurrence-based methods also can be used to confirm or indicate increased likelihood that a non-over-represented codon pair (e.g., an under-represented codon pair or a represented-as-expected codon pair) acts as a translational pause or slowing codon pair. For example, two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of a non-over-represented codon pair at the boundary locations, particularly if no over-represented codon pair is present, can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
[0357] Such recurrence-based methods also can be used to confirm or indicate the likelihood that a codon pair, such as an over-represented codon pair, does not act as a translational pause or slowing codon pair. For example, two or more proteins of the same species can have boundary locations between autonomous folding units, and consistent absence of a non-over-represented codon pair at the boundary locations can confirm or indicate increased likelihood that the codon pair does not act as a translational pause or slowing codon pair.
[0358] In another embodiment, the predicted translational kinetics value for a codon pair can be refined according to empirical measurement of translational kinetics for a codon pair. The influence of a codon pair on translational kinetics can be experimentally measured, and these experimental measurements can be used to refine or replace the predicted translational kinetics values for a codon pair. Several methods of
experimentally measuring the translational kinetics of a codon pair are known in the art, and can be used herein, as exemplified in Irwin et al., J. Biol. Chem., (1995) 270:22801. One such exemplary assay is based on the observation that a ribosome pausing at a site near the beginning of an mRNA coding sequence can inhibit translation initiation by physically interfering with the attachment of a new ribosome to the message, and, thus, the codon pair to be assayed can be placed at the beginning of a polypeptide-encoding nucleotide sequence and the effect of the codon pair on translational initiation can be measured as an indication of the ability of the codon pair to cause a translational pause. Another such exemplary assay is based on the fact that the transit time of a ribosome through the leader polypeptide coding region of the leader RNA of the trp operon sets the basal level of transcription through the trp attenuator, and, thus, the codon pair to be assayed can be placed into a trpLep leader polypeptide codon region, and level of expression can be inversely indicative of the translational pause properties of the codon pair, due to a faster translation causing formation of a stem-loop attenuator in the leader RNA, which results in transcriptional attenuation.
[0359] As will be apparent to one skilled in the art, the methods provided herein for calculation of translational kinetics values can be applied to the native organism of the polypeptide of SEQ ID NOS: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194, and also can be applied to a selected organism in which the polypeptide of SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194, or a modification thereof, is to be heterologously expressed. For example, the nucleotide sequence information of an organism can be used to calculate chi-squared values in accordance with the methods provided herein, and the translational kinetics values can be based on these chi-squared values as well as on additional translational kinetics information provided herein, including, but not limited to, codon pairs conserved in domain boundaries and empirically measured translational kinetics for a codon pair. Exemplary organisms for which translational kinetics values can be calculated and used to prepare a nucleotide sequence encoding a hydrolysis enzyme protein provided herein include Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster and Schizosaccharomyces pombe.
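By way of illustration only, the calculation of chi-squared values from the nucleotide sequence information of an organism might be sketched as follows. The sketch assumes a simple, unnormalized expectation (the product of the two individual codon frequencies) and a signed score; the function name, the sign convention, and the input format are assumptions made for this example and are not the specific chi-squared variants referred to elsewhere herein.

```python
from collections import Counter


def codon_pair_chi_squared(coding_sequences):
    """Return {(codon_a, codon_b): signed chi-squared term} for adjacent codon pairs.

    Assumption: expected count = total pairs * freq(codon_a) * freq(codon_b).
    The sign is positive for over-represented pairs and negative otherwise.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for seq in coding_sequences:
        # Split the in-frame sequence into codons, ignoring a trailing partial codon.
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), observed in pair_counts.items():
        expected = (total_pairs
                    * (codon_counts[a] / total_codons)
                    * (codon_counts[b] / total_codons))
        chi = (observed - expected) ** 2 / expected
        scores[(a, b)] = chi if observed >= expected else -chi
    return scores
```

In practice the input would be the full set of coding sequences of the selected host organism, so that the resulting values reflect that organism rather than the native organism of the polypeptide.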
Calculation methods of modifying translational kinetics values based on additional translational kinetics data
[0360] The translational kinetics data described herein can be combined in such a manner as to provide a refined translational kinetics value for a codon pair in a host organism. Methods of combining predictive data to arrive at a refined predictive value are known in the art and can be used herein.
[0361] Estimates for translational kinetics values are informed by a number of knowledge sources known to those skilled in the art, including but not limited to experimental measurement, conservation at protein structural boundaries and across homologous families, statistical inference from genomic sequence data, and the like as provided elsewhere herein. All these disparate knowledge sources must be integrated into an overall estimate for purposes of gene design and engineering. The general problem of integrating diverse and disparate knowledge sources is ubiquitous and well-studied in many different engineering fields, e.g., distributed sensor fusion in remote sensing, bagging classifiers in machine learning, heterogeneous database integration in data warehouses, or perceptual integration in artificial intelligence. Many useful and applicable approaches are known to the art.
[0362] While many approaches are possible, those skilled in the art agree that the method of Bayes [Bayes, T., 1764. An essay toward solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53:370-418. Reprinted pp. 131-153 in "Studies in the History of Statistics and Probability," (ed. Pearson, E.S., Kendall, M. G.), Charles Griffin, London, 1970.] has rigorous foundations in probability and many successes in bioinformatics [Baldi, P., and Brunak, S., 2001. Bioinformatics: The Machine Learning Approach, MIT Press, Cambridge, MA, USA]. Using the Bayesian approach as an example here, without intending to exclude other well-known approaches, the Bayesian approach seeks to choose a hypothesis H that is most probable given the observed data D.
[0363] Operationally, this means to choose H so as to maximize the probability of H given D, written P(H|D). By Bayes's rule, this may be rewritten as P(H|D) = P(D|H) * P(H) / P(D). This is equivalent to maximizing P(D|H) * P(H) because P(D) is constant for all H. The term P(H) is identified with the degree of belief in hypothesis H before the data was observed. The term P(D|H), read "the probability of D given H," is identified with how well hypothesis H predicts the observed data D. Thus,
the Bayesian approach seeks to find an hypothesis that is a priori likely and also explains the data well.
[0364] In this example, an hypothesis H is that a given sequence feature, e.g., a given codon pair, has utility for translational kinetics engineering, e.g., creates a translational pause site. The observed data D may have several observations, e.g., D = D1 & D2 & D3 & D4, where D1 = an experimental measurement, D2 = conserved at protein structural domain boundaries, D3 = conserved across homologous protein families, and D4 = indicated as over-represented by statistical analysis that yields a high chisq3 value. In this case, the term P(D|H) = P(D1 & D2 & D3 & D4 | H), which indicates to choose an hypothesis that explains each observed datum. Of course, different data sources have different rates and magnitudes of observational error. This falls naturally into the Bayesian approach because the probability framework extends naturally to encompass the probability of observational error, as P(D|H) = P(D|H) * P(D is correct) + P(not D|H) * P(D is not correct). For example, an experimental measurement D1 that has been confirmed by replicate testing would have a very low probability of error, and therefore it would dominate the estimate if available.
[0365] In the general case, where no experimental measurement is available, several Bayesian approaches are commonly employed. The simplest, which often works well, is named "Naive Bayes" because it assumes conditional independence among the individual observed data items. In this case, P(D|H) = P(D1 & D2 & D3 & D4 | H) = P(D1|H) * P(D2|H) * P(D3|H) * P(D4|H), where each of the individual terms is further expanded as P(Di|H) = P(Di|H) * P(Di is correct) + P(not Di|H) * P(Di is not correct) as indicated above. The terms P(Di is correct) and P(Di is not correct) can be estimated a priori by the correlation of Di with previous experimental measurements. The terms P(Di|H) and P(not Di|H) are obtained by observing whether or not hypothesis H is consistent with observed data item Di. More complex and powerful Bayesian approaches are also well known to the art. The fully general approach rewrites P(D|H) = P(D1 & D2 & D3 & D4 | H) = P(D4 | D3 & D2 & D1 & H) * P(D3 | D2 & D1 & H) * P(D2 | D1 & H) * P(D1 | H). Many other approaches, both Bayesian and others, are well known to the art.
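As a non-limiting illustration, the Naive Bayes combination with an observational-error term described above might be sketched as follows. The function names, the two-hypothesis framing (H versus not H), and the numerical inputs are assumptions introduced solely for this example.

```python
def noisy_likelihood(p_obs_given_h, p_correct):
    """Error-adjusted term: P(Di|H) * P(Di is correct) + P(not Di|H) * P(Di is not correct)."""
    return p_obs_given_h * p_correct + (1.0 - p_obs_given_h) * (1.0 - p_correct)


def naive_bayes_posterior(prior_h, evidence):
    """Return P(H|D) for a binary hypothesis H (e.g., "this codon pair is a pause site").

    evidence is a list of (p_obs_given_h, p_obs_given_not_h, p_correct) tuples,
    one per data source Di; sources are treated as conditionally independent.
    """
    score_h = prior_h
    score_not_h = 1.0 - prior_h
    for p_h, p_not_h, p_correct in evidence:
        score_h *= noisy_likelihood(p_h, p_correct)
        score_not_h *= noisy_likelihood(p_not_h, p_correct)
    # Normalizing by the sum plays the role of dividing by P(D).
    return score_h / (score_h + score_not_h)
```

For instance, with a prior of 0.5 and a single data source that is fully consistent with H, inconsistent with not H, and estimated to be correct 90% of the time, `naive_bayes_posterior(0.5, [(1.0, 0.0, 0.9)])` evaluates to 0.9; adding further consistent sources raises the posterior, while a highly reliable source dominates less reliable ones.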
[0366] By way of example, the translational kinetics values for a codon pair can be refined by consideration of, for example, chi-squared value of observed versus expected codon pair frequency and the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, for example, at protein
structure domain boundaries. An over-represented codon pair which is present with above-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting higher predicted translational pause properties of the codon pair. In contrast, an over-represented codon pair which is present with below-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting lower predicted translational pause properties of the codon pair.
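One simple way to assess whether a codon pair occurs at boundary locations with above-random or below-random frequency is to compare its rate of occurrence at boundary positions with its rate elsewhere in the same proteins. The following is a minimal sketch under that assumption; the function name, the 0-based position convention, and the input format are hypothetical.

```python
def boundary_enrichment(pair, proteins):
    """Return (rate at boundary positions, rate at all other positions) for one codon pair.

    proteins is a list of (codon_pairs, boundary_positions) tuples, where
    codon_pairs is the ordered list of codon pairs of a coding sequence and
    boundary_positions is a set of 0-based indices into that list predicted
    to lie between autonomous folding units.
    """
    at_boundary = boundary_total = elsewhere = elsewhere_total = 0
    for codon_pairs, boundary_positions in proteins:
        for pos, observed in enumerate(codon_pairs):
            hit = 1 if observed == pair else 0
            if pos in boundary_positions:
                boundary_total += 1
                at_boundary += hit
            else:
                elsewhere_total += 1
                elsewhere += hit
    if boundary_total == 0 or elsewhere_total == 0:
        raise ValueError("need at least one boundary and one non-boundary position")
    return at_boundary / boundary_total, elsewhere / elsewhere_total
```

A boundary rate well above the non-boundary rate would support a higher predicted pause value for the codon pair, and a boundary rate below it would support a lower value; how large a difference is meaningful is left to the practitioner.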
[0367] As another example, the translational kinetics values for a codon pair can be refined by consideration of, for example, experimentally measured translation step times in one species and the degree to which codon pairs that correspond to measured pause sites in the first species are conserved across homologous proteins in other species, for example, in a multiple sequence alignment. When an over-represented codon pair in another species is aligned with above-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting higher predicted translational pause properties of that codon pair in the other species. In contrast, when an over-represented codon pair in another species is aligned with below-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting lower predicted translational pause properties of that codon pair in the other species.
[0368] In various embodiments described herein, translational kinetics values for codon pairs, including refined translational kinetics values, can be determined. The translational kinetic values can be organized according to the likelihood of causing a translational pause or slowing based on any method known in the art. In one example, the translational kinetic values for two or more codon pairs, up to all codon pairs, in an organism are determined, and the mean translational kinetics value and associated standard deviation are calculated. Based on this, the translational kinetics value for a particular codon pair can be described in terms of the multiple of standard deviations the translational kinetics value for the particular codon pair differs from the mean translational kinetics value. Accordingly, reference herein to mean translational kinetics values and standard deviations, whether or not applied to a particular expression of translational kinetics value, can be applied to any of a variety of expressions of translational kinetics values provided herein.
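The expression of a translational kinetics value as a multiple of standard deviations from the mean can be sketched as follows. This is an illustrative example only; the use of the population standard deviation (rather than the sample standard deviation) and the function name are assumptions.

```python
from statistics import mean, pstdev


def z_scores(values_by_pair):
    """Map each codon pair to (value - mean) / standard deviation over all supplied pairs."""
    values = list(values_by_pair.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {pair: (value - mu) / sigma for pair, value in values_by_pair.items()}
```

A codon pair whose value lies several standard deviations above the mean would then be a candidate translational pause or slowing codon pair, irrespective of which expression of translational kinetics value was used as input.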
Graphical analysis of translational kinetics
[0369] Also provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide encoded by a gene in a host organism by determining translational kinetics values for codon pairs in the host organism and generating a graphical display of the translational kinetics values of actual codon pairs of an original polypeptide-encoding nucleotide sequence of a heterologous gene as a function of codon position. Such a graphical display provides a visual display of the predicted translational influence, including translational pause or slowing for numerous or all codon pairs of a polypeptide-encoding nucleotide sequence. This visual display can be used in methods of modifying polypeptide-encoding nucleotide sequences in order to thereby modify the predicted translational kinetics of the mRNA into polypeptide in methods such as those provided herein. For example, the graphical displays can be used to identify one or more codon pairs to be modified in a polypeptide-encoding nucleotide sequence. The graphical displays can be used in analyzing a polypeptide-encoding nucleotide sequence prior to modifying the polypeptide-encoding nucleotide sequence, or can be used in analyzing a modified polypeptide-encoding nucleotide sequence to determine, for example, whether or not further modifications are desired.
[0370] Methods for creating and using graphical displays can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, which are incorporated by reference herein in their entireties. In particular, graphical displays as described therein can be created to illustrate the translational kinetics of an original or redesigned polypeptide-encoding nucleotide sequence in the native or a heterologous organism, or to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence in which one or more codon pairs have been modified. Additionally, numerous normalized graphical displays can be created to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence when expressed in two or more different organisms.
[0371] The graphical displays can be created using translational kinetics values based on any of the methods for determining translational kinetics values provided herein or otherwise known in the art. For example, chi-squared as a function of codon pair position, chi-squared 2 as a function of codon position, or chi-squared 3 as a function
of codon pair position, translational kinetics values thereof, empirical measurement of translational pause of codon pairs in a host organism, estimated translational pause capability based on observed presence and/or recurrence of a codon pair at predicted pause site, and variations and combinations thereof as provided herein.
[0372] The exact format of the graphical displays can take any of a variety of forms, and the specific form is typically selected for ease of analysis and comparison between plots. For example, the abscissa typically lists the position along the nucleotide sequence or polypeptide sequence, and can be represented by nucleotide position, codon position, codon pair position, amino acid position, or amino acid pair position. In such instances, the ordinate typically lists the translational kinetics value of the codon pair, such as, but not limited to, a translational kinetics value of codon pair frequency, including, but not limited to, the z score of chisq1, the z score of chisq2, the z score of chisq3, the empirically measured value, and the refined translational kinetics value. In alternative embodiments, the sequence position can be plotted along the ordinate and the translational kinetics value can be plotted along the abscissa.
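The data underlying such a display, i.e., a translational kinetics value for each codon pair position of a sequence, might be assembled as in the following sketch. The function name, the 1-based codon pair position, and the default value assigned to codon pairs absent from the lookup table are assumptions for illustration; the resulting list of (abscissa, ordinate) points can be passed to any plotting tool.

```python
def kinetics_profile(coding_sequence, value_by_pair, default=0.0):
    """Return [(codon_pair_position, value), ...] for an in-frame coding sequence.

    value_by_pair maps (codon_a, codon_b) to a translational kinetics value,
    e.g., a z score; pairs missing from the table receive the default value.
    """
    usable = len(coding_sequence) - len(coding_sequence) % 3
    codons = [coding_sequence[i:i + 3] for i in range(0, usable, 3)]
    return [
        (position, value_by_pair.get(pair, default))
        for position, pair in enumerate(zip(codons, codons[1:]), start=1)
    ]
```

Plotting the position along the abscissa and the value along the ordinate yields a display in which positive deflections mark predicted pause or slowing sites; swapping the two axes gives the alternative orientation.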
[0373] As an example, a graphical display of translational kinetics is depicted in Figure 1, where each positive deflection or peak describes a predicted translational pause or slowing at the nucleotide location as defined by the abscissa.
Comparing plots
[0374] Also contemplated herein are methods in which a set of graphical displays, including at least a first graphical display and a second graphical display, are prepared. These sets of displays can be compared in order to determine the difference in predicted translational efficiency or translational kinetics of the two plots. The plots can differ according to any of a variety of criteria. For example, each plot can represent a different polypeptide-encoding nucleotide sequence, each plot can represent a different host organism, each plot can represent differently determined translational kinetics values, or any combination thereof. As will be apparent to one skilled in the art, any number of different graphical displays can be compared in accordance with the methods provided herein, for example, 2, 3, 4, 5, 6, 7, 8 or more different graphical displays can be compared. Typically, two plots will represent different polypeptide-encoding nucleotide sequences, the same sequence in different host organisms, or different sequences in different host organisms.
[0375] Comparison of different graphical displays can be used to analyze the predicted change in translational kinetics as a result of the difference represented by the
graphical displays. For example, comparison of the same polypeptide-encoding nucleotide sequence in different host organisms can be used to analyze any predicted translational pauses that can be removed. Accordingly, provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide in a host organism by comparing two graphical displays to understand or predict the differences in translational kinetics of the mRNA into polypeptide, where the differences in the graphical displays can be as a result of, for example, a difference in the polypeptide-encoding nucleotide sequence or a difference in the host organism. Upon determination of the differences in translational kinetics, it can be evaluated whether or not the change in translational kinetics as a result of the underlying difference between the two graphical displays is desirable. Such comparison methods also can lead to an identification of further modifications, e.g., further modifications to the polypeptide-encoding nucleotide sequence to further improve translational kinetics. Accordingly, it is contemplated herein that such comparison methods can be carried out iteratively.
[0376] In embodiments where it is desired to improve expression of a polypeptide-encoding nucleotide sequence in a particular heterologous host, a graphical display of the translational kinetics values of codon pairs for the original polypeptide-encoding nucleotide sequence in the heterologous host can be compared to a graphical display of the translational kinetics values of codon pairs for a modified polypeptide-encoding nucleotide sequence in the heterologous host, and it can be determined whether or not the modification to the polypeptide-encoding nucleotide sequence resulted in improved translational kinetics.
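A comparison of the profile of an original sequence with that of a modified sequence in the same host might be carried out as in the following sketch, which reports codon pair positions at which a predicted pause present in the original is no longer predicted after modification. The function name and the threshold of 3.0 are hypothetical and chosen only for illustration; the sketch also assumes the two sequences encode the same polypeptide, so that positions correspond one to one.

```python
def removed_pauses(original_profile, modified_profile, threshold=3.0):
    """Return positions whose value was >= threshold originally and is < threshold after modification.

    Each profile is a list of (position, value) points of equal length.
    """
    if len(original_profile) != len(modified_profile):
        raise ValueError("profiles must cover the same codon pair positions")
    return [
        position
        for (position, before), (_, after) in zip(original_profile, modified_profile)
        if before >= threshold and after < threshold
    ]
```

Positions that remain above the threshold after modification indicate where further modifications could be considered, consistent with carrying out the comparison iteratively.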
Methods of inserting polynucleotide into vector, transforming cells, expressing polynucleotide, and purifying polypeptide
[0377] The nucleic acid sequences provided herein can be present in a polynucleotide (e.g., DNA or RNA molecule). Thus, in one embodiment, provided are polynucleotides containing the nucleic acid sequences provided herein. The polynucleotides can be inserted into a replicable vector for cloning (e.g., amplification of the DNA) or for expression. Various vectors are publicly available and are known in the art. The vector can, for example, be in the form of a plasmid, cosmid, viral particle, or phage. The appropriate nucleic acid sequence can be inserted into the vector by any of a variety of procedures known in the art. Typically, DNA is inserted into an appropriate restriction endonuclease site(s) using techniques known in the art or the DNA is inserted
by any of a variety of PCR methodologies. Vector components can generally include, but are not limited to, one or more of a signal sequence, an origin of replication, one or more marker genes, an enhancer element, a promoter, and a transcription termination sequence. Construction of suitable vectors containing one or more of these components employs standard ligation techniques which are known to the skilled artisan.
[0378] The encoded polypeptide can be produced recombinantly not only directly, but also as a fusion polypeptide with a heterologous polypeptide, which can be, e.g., a signal sequence or other polypeptide having a specific cleavage site at the N-terminus of the mature protein or polypeptide. In general, the signal sequence can be a component of the vector, or it can be a part of the polynucleotide that is inserted into the vector. The signal sequence can be a prokaryotic signal sequence selected, for example, from the group of the alkaline phosphatase, penicillinase, lpp, or heat-stable enterotoxin II leaders. For yeast secretion the signal sequence can be, e.g., the yeast invertase leader, alpha factor leader (including Saccharomyces and Kluyveromyces α-factor leaders, the latter described in U.S. Patent No. 5,010,182), or acid phosphatase leader, the C. albicans glucoamylase leader (EP 362,179 published 4 April 1990), or the signal described in WO 90/13646 published 15 November 1990. In mammalian cell expression, mammalian signal sequences can be used to direct secretion of the protein, such as signal sequences from secreted polypeptides of the same or related species, as well as viral secretory leaders.
[0379] Both expression and cloning vectors contain a polynucleotide that permits the vector to replicate in one or more selected host cells. Such sequences are well known for a variety of bacteria, yeast, and viruses. The origin of replication from the plasmid pBR322 is suitable for most Gram-negative bacteria, the 2μ plasmid origin is suitable for yeast, and various viral origins (SV40, polyoma, adenovirus, VSV or BPV) are useful for cloning vectors in mammalian cells.
[0380] Expression and cloning vectors will typically contain a selection gene, also termed a selectable marker. Typical selection genes encode proteins that (a) confer resistance to antibiotics or other toxins, e.g., ampicillin, neomycin, methotrexate, or tetracycline, (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli.
[0381] Examples of suitable selectable markers for mammalian cells are those that enable the identification of cells competent to take up the polynucleotide-containing vector, such as DHFR or thymidine kinase. An appropriate host cell when
wild-type DHFR is employed is the CHO cell line deficient in DHFR activity, prepared and propagated as described by Urlaub et al., Proc. Natl. Acad. Sci. USA, 77:4216 (1980). A suitable selection gene for use in yeast is the trp1 gene present in the yeast plasmid YRp7 [Stinchcomb et al., Nature, 282:39 (1979); Kingsman et al., Gene, 7:141 (1979); Tschumper et al., Gene, 10:157 (1980)]. The trp1 gene provides a selection marker for a mutant strain of yeast lacking the ability to grow in tryptophan, for example, ATCC No. 44076 or PEP4-1 [Jones, Genetics, 85:12 (1977)].
[0382] Expression and cloning vectors usually contain a promoter operably linked to the polynucleotide provided herein to direct mRNA synthesis. Promoters recognized by a variety of potential host cells are well known. Promoters suitable for use with prokaryotic hosts include the β-lactamase and lactose promoter systems [Chang et al., Nature, 275:615 (1978); Goeddel et al., Nature, 281:544 (1979)], alkaline phosphatase, a tryptophan (trp) promoter system [Goeddel, Nucleic Acids Res., 8:4057 (1980); EP 36,776], and hybrid promoters such as the tac promoter [deBoer et al., Proc. Natl. Acad. Sci. USA, 80:21-25 (1983)]. Promoters for use in bacterial systems also will contain a Shine-Dalgarno (S.D.) sequence operably linked to the polynucleotide provided herein.
[0383] Examples of suitable promoting sequences for use with yeast hosts include the promoters for 3-phosphoglycerate kinase [Hitzeman et al., J. Biol. Chem., 255:2073 (1980)] or other glycolytic enzymes [Hess et al., J. Adv. Enzyme Reg., 7:149 (1968); Holland, Biochemistry, 17:4900 (1978)], such as enolase, glyceraldehyde-3-phosphate dehydrogenase, hexokinase, pyruvate decarboxylase, phosphofructokinase, glucose-6-phosphate isomerase, 3-phosphoglycerate mutase, pyruvate kinase, triosephosphate isomerase, phosphoglucose isomerase, and glucokinase.
[0384] Other yeast promoters, which are inducible promoters having the additional advantage of transcription controlled by growth conditions, are the promoter regions for alcohol dehydrogenase 2, isocytochrome C, acid phosphatase, degradative enzymes associated with nitrogen metabolism, metallothionein, glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Suitable vectors and promoters for use in yeast expression are further described in EP 73,657.
[0385] Transcription from vectors in mammalian host cells is controlled, for example, by promoters obtained from the genomes of viruses such as polyoma virus, fowlpox virus (UK 2,211,504 published 5 July 1989), adenovirus (such as Adenovirus 2),
bovine papilloma virus, avian sarcoma virus, cytomegalovirus, a retrovirus, hepatitis-B virus and Simian Virus 40 (SV40), from heterologous mammalian promoters, e.g., the actin promoter or an immunoglobulin promoter, and from heat-shock promoters, provided such promoters are compatible with the host cell systems.
[0386] Transcription by higher eukaryotes can be increased by inserting an enhancer sequence into the vector. Enhancers are cis-acting elements of DNA, usually from about 10 to 300 bp, that act on a promoter to increase its transcription. Many enhancer sequences are now known from mammalian genes (globin, elastase, albumin, α-fetoprotein, and insulin). Typically, however, one will use an enhancer from a eukaryotic cell virus. Examples include the SV40 enhancer on the late side of the replication origin (bp 100-270), the cytomegalovirus early promoter enhancer, the polyoma enhancer on the late side of the replication origin, and adenovirus enhancers. The enhancer can be spliced into the vector at a position 5' or 3' to the polynucleotide provided herein, but is preferably located at a site 5' from the promoter.
[0387] Expression vectors used in eukaryotic host cells (yeast, fungi, insect, plant, animal, human, or nucleated cells from other multicellular organisms) will also contain sequences necessary for the termination of transcription and for stabilizing the mRNA. Such sequences are commonly available from the 5' and, occasionally 3', untranslated regions of eukaryotic or viral DNAs or cDNAs. These regions contain nucleotide segments transcribed as polyadenylated fragments in the untranslated portion of the mRNA transcribed from the polynucleotide provided herein.
[0388] Still other methods, vectors, and host cells suitable for adaptation to the synthesis of the encoded proteins in recombinant vertebrate cell culture are described in Gething et al., Nature, 293:620-625 (1981); Mantei et al., Nature, 281:40-46 (1979); EP 117,060; and EP 117,058.
[0389] Host cells are transfected or transformed with expression or cloning vectors described herein for polypeptide production and cultured in conventional nutrient media modified as appropriate for inducing promoters, selecting transformants, or amplifying the genes encoding the desired sequences. The culture conditions, such as media, temperature, pH and the like, can be selected by the skilled artisan without undue experimentation. In general, principles, protocols, and practical techniques for maximizing the productivity of cell cultures can be found in Mammalian Cell Biotechnology: a Practical Approach, M. Butler, ed. (IRL Press, 1991) and Sambrook et al., supra.
[0390] Methods of eukaryotic cell transfection and prokaryotic cell transformation are known to the ordinarily skilled artisan, for example, CaCl2, CaPO4, liposome-mediated and electroporation. Depending on the host cell used, transformation is performed using standard techniques appropriate to such cells. The calcium treatment employing calcium chloride, as described in Sambrook et al., supra, or electroporation is generally used for prokaryotes. Infection with Agrobacterium tumefaciens is used for transformation of certain plant cells, as described by Shaw et al., Gene, 23:315 (1983) and WO 89/05859 published 29 June 1989. For mammalian cells without such cell walls, the calcium phosphate precipitation method of Graham and van der Eb, Virology, 52:456-457 (1978) can be employed. General aspects of mammalian cell host system transfections have been described in U.S. Patent No. 4,399,216. Transformations into yeast are typically carried out according to the method of Van Solingen et al., J. Bact., 130:946 (1977) and Hsiao et al., Proc. Natl. Acad. Sci. (USA), 76:3829 (1979). However, other methods for introducing DNA into cells, such as by nuclear microinjection, electroporation, bacterial protoplast fusion with intact cells, or polycations, e.g., polybrene, polyornithine, can also be used. For various techniques for transforming mammalian cells, see Keown et al., Methods in Enzymology, 185:527-537 (1990) and Mansour et al., Nature, 336:348-352 (1988).
[0391] Suitable host cells for cloning or expressing the DNA in the vectors herein include prokaryote, yeast, or higher eukaryote cells. Suitable prokaryotes include but are not limited to eubacteria, such as Gram-negative or Gram-positive organisms, for example, Enterobacteriaceae such as E. coli. Various E. coli strains are publicly available, such as E. coli K12 strain MM294 (ATCC 31,446); E. coli X1776 (ATCC 31,537); E. coli strain W3110 (ATCC 27,325) and K5 772 (ATCC 53,635). Other suitable prokaryotic host cells include Enterobacteriaceae such as Escherichia, e.g., E. coli, Enterobacter, Erwinia, Klebsiella, Proteus, Salmonella, e.g., Salmonella typhimurium, Serratia, e.g., Serratia marcescens, and Shigella, as well as Bacilli such as B. subtilis and B. licheniformis (e.g., B. licheniformis 41P disclosed in DD 266,710 published 12 April 1989), Pseudomonas such as P. aeruginosa, and Streptomyces. These examples are illustrative rather than limiting. Strain W3110 is one particularly preferred host or parent host because it is a common host strain for recombinant DNA product fermentations. Preferably, the host cell secretes minimal amounts of proteolytic enzymes. For example, strain W3110 can be modified to effect a genetic mutation in the genes encoding proteins endogenous to the host, with examples of such hosts including E. coli
W3110 strain 1A2, which has the complete genotype tonA; E. coli W3110 strain 9E4, which has the complete genotype tonA ptr3; E. coli W3110 strain 27C7 (ATCC 55,244), which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT kanr; E. coli W3110 strain 37D6, which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT rbs7 ilvG kanr; E. coli W3110 strain 40B4, which is strain 37D6 with a non-kanamycin resistant degP deletion mutation; and an E. coli strain having mutant periplasmic protease disclosed in U.S. Patent No. 4,946,783 issued 7 August 1990. Alternatively, in vitro methods of cloning, e.g., PCR or other nucleic acid polymerase reactions, are suitable.
[0392] In addition to prokaryotes, eukaryotic microbes such as filamentous fungi or yeast are suitable cloning or expression hosts for polynucleotide-containing vectors. Saccharomyces cerevisiae is a commonly used lower eukaryotic host microorganism. Others include Schizosaccharomyces pombe (Beach and Nurse, Nature, 290:140 [1981]; EP 139,383 published 2 May 1985); Kluyveromyces hosts (U.S. Patent No. 4,943,529; Fleer et al., Bio/Technology, 9:968-975 (1991)) such as, e.g., K. lactis (MW98-8C, CBS683, CBS4574; Louvencourt et al., J. Bacteriol., 154(2):737-742 [1983]), K. fragilis (ATCC 12,424), K. bulgaricus (ATCC 16,045), K. wickerhamii (ATCC 24,178), K. waltii (ATCC 56,500), K. drosophilarum (ATCC 36,906; Van den Berg et al., Bio/Technology, 8:135 (1990)), K. thermotolerans, and K. marxianus; Yarrowia (EP 402,226); Pichia pastoris (EP 183,070; Sreekrishna et al., J. Basic Microbiol., 28:265-278 [1988]); Candida; Trichoderma reesei (EP 244,234); Neurospora crassa (Case et al., Proc. Natl. Acad. Sci. USA, 76:5259-5263 [1979]); Schwanniomyces such as Schwanniomyces occidentalis (EP 394,538 published 31 October 1990); and filamentous fungi such as, e.g., Neurospora, Penicillium, Tolypocladium (WO 91/00357 published 10 January 1991), and Aspergillus hosts such as A. nidulans (Ballance et al., Biochem. Biophys. Res. Commun., 112:284-289 [1983]; Tilburn et al., Gene, 26:205-221 [1983]; Yelton et al., Proc. Natl. Acad. Sci. USA, 81:1470-1474 [1984]) and A. niger (Kelly and Hynes, EMBO J., 4:475-479 [1985]). Methylotrophic yeasts are suitable herein and include, but are not limited to, yeast capable of growth on methanol selected from the genera consisting of Hansenula, Candida, Kloeckera, Pichia, Saccharomyces, Torulopsis, and Rhodotorula. A list of specific species that are exemplary of this class of yeasts can be found in C. Anthony, The Biochemistry of Methylotrophs, 269 (1982).
[0393] Suitable host cells for the expression of glycosylated polypeptides are derived from multicellular organisms. Examples of invertebrate cells include insect cells
such as Drosophila S2 and Spodoptera Sf9, as well as plant cells. Examples of useful mammalian host cell lines include Chinese hamster ovary (CHO) and COS cells. More specific examples include monkey kidney CV1 line transformed by SV40 (COS-7, ATCC CRL 1651); human embryonic kidney line (293 or 293 cells subcloned for growth in suspension culture, Graham et al., J. Gen Virol., 36:59 (1977)); Chinese hamster ovary cells/-DHFR (CHO, Urlaub and Chasin, Proc. Natl. Acad. Sci. USA, 77:4216 (1980)); mouse Sertoli cells (TM4, Mather, Biol. Reprod., 23:243-251 (1980)); human lung cells (W138, ATCC CCL 75); human liver cells (Hep G2, HB 8065); and mouse mammary tumor (MMT 060562, ATCC CCL51). The selection of the appropriate host cell is deemed to be within the skill in the art.
[0394] Gene amplification and/or expression can be measured in a sample directly, for example, by conventional Southern blotting, Northern blotting to quantitate the transcription of mRNA [Thomas, Proc. Natl. Acad. Sci. USA, 77:5201-5205 (1980)], dot blotting (DNA analysis), or in situ hybridization, using an appropriately labeled probe, based on the sequences provided herein. Alternatively, antibodies can be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes. The antibodies in turn can be labeled and the assay can be carried out where the duplex is bound to a surface, so that upon the formation of duplex on the surface, the presence of antibody bound to the duplex can be detected.
[0395] Gene expression, alternatively, can be measured by immunological methods, such as immunohistochemical staining of cells or tissue sections and assay of cell culture or body fluids, to quantitate directly the expression of gene product. Antibodies useful for immunohistochemical staining and/or assay of sample fluids can be either monoclonal or polyclonal, and can be prepared in any mammal. Conveniently, the antibodies can be prepared against any polypeptide provided herein or against a synthetic peptide based on the sequences provided herein or against exogenous sequence fused to the polypeptide or fragment thereof and encoding a specific antibody epitope.
[0396] Polypeptides can be recovered from culture medium or from host cell lysates. If membrane-bound, the polypeptide can be released from the membrane using a suitable detergent solution (e.g. Triton-X 100) or by enzymatic cleavage. Cells employed in expression of polypeptides can be disrupted by various physical or chemical means, such as freeze-thaw cycling, sonication, mechanical disruption, or cell lysing agents, as is known in the art.
[0397] It may be desired to purify polypeptides. The following procedures are exemplary of suitable purification procedures: fractionation on an ion-exchange column; ethanol precipitation; reverse phase HPLC; chromatography on silica or on a cation-exchange resin such as DEAE; chromatofocusing; SDS-PAGE; ammonium sulfate precipitation; gel filtration using, for example, Sephadex G-75; protein A Sepharose columns to remove contaminants such as IgG; and metal chelating columns to bind epitope-tagged forms of the polypeptide. Various additional known methods of protein purification can be employed; exemplary methods are described in Deutscher, Methods in Enzymology, 182 (1990); Scopes, Protein Purification: Principles and Practice, Springer-Verlag, New York (1982). The purification step(s) selected will depend, for example, on the nature of the production process used and the particular polypeptide produced.
[0398] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes a DNA sequence of the embodiments provided herein operably linked to an expression control sequence. As used herein, an expression vector is a DNA or RNA vector that is capable of transforming a host cell and of effecting expression of a specified nucleic acid molecule. Typically, the expression vector is also capable of replicating within the host cell. Expression vectors can be either prokaryotic or eukaryotic, and are typically viruses or plasmids.
[0399] The term operably linked refers to functional linkage between a nucleic acid expression control sequence (such as a promoter, or array of transcription factor binding sites) and a second nucleic acid sequence, wherein the expression control sequence directs transcription of the nucleic acid corresponding to the second sequence. An operably linked expression vector can also include secretion signals and other modifying sequences, and can encode chaperones and proteins for a variety of organisms and systems.
[0400] Also provided herein are methods of expressing a polypeptide-encoding nucleotide sequence generated by the methods provided herein. Methods of expressing polypeptides from polypeptide-encoding nucleotide sequences are known in the art, as exemplified, for example, by the techniques described in Maniatis et al., 1989, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, N.Y. and Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y. The methods include inserting a polypeptide-encoding nucleotide sequence designed by the methods provided herein into a cell, and
expressing the polypeptide-encoding nucleotide sequence under conditions suitable for gene expression. Additionally provided expression methods include cell-free expression systems as known in the art, where such methods include providing a polypeptide-encoding nucleotide sequence designed by the methods provided herein and contacting the polypeptide-encoding nucleotide sequence with a cell-free expression system under conditions suitable for protein translation.
Metabolic Engineering
[0401] In certain embodiments, the expression levels of one or more enzymes in a metabolic pathway are individually manipulated. Differential metabolic expression levels can be manipulated using methods known in the art. For example, by selecting a specific promoter with a desired transcriptional level, one can vary the expression level of the gene that is operably linked to the promoter. Similarly, one may select an expression vector that produces the desired levels of expression.
[0402] Accordingly, one can manipulate expression of the various components of the metabolic systems described herein by selecting a specific promoter with a desired level of transcriptional activation. Additionally, one can predict and manipulate expression of various components of the systems provided herein using a mathematical tool for modeling a metabolic pathway. Such tools are known in the art, for example, as described by Yang et al. (J. Biol. Chem (2005) 280(12):11224-32) and by Yang et al. (Bioinformatics (2005) 6:774-780), each of which is hereby incorporated by reference in its entirety.
Vectors for insertion of polynucleotide into cells
[0403] Nucleic acid constructs, methods and systems for modifying endogenous sequences also are provided herein. Endogenous sequences include genomic sequences of a cell. Such genomic sequences can include sequences previously modified by the constructs, methods and systems provided herein. Modifications of endogenous sequences can include insertions, deletions and mutations. In some embodiments, a modification can include the insertion of a heterologous sequence. Heterologous sequences include exogenous nucleic acid sequences and can include sequences with homology to endogenous sequences.
Integrable polynucleotides
[0404] In some embodiments, integrable polynucleotides for modifying endogenous nucleotide sequences in a cell are provided. Such integrable polynucleotides can contain sequences with homology to endogenous sequences and a removable selectable marker cassette. The removable selectable marker cassette can include a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence. In more embodiments, integrable polynucleotides can also contain heterologous sequences. In such embodiments, the heterologous sequences and removable selectable marker cassette can be flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
[0405] In some embodiments, integrable polynucleotides can include episomal nucleic acids, such as plasmids and YACs. In such embodiments, integrable polynucleotides can include autonomous replication sequences such as ColE1, Ori, oriT, 2 μm, CEN/ARS. In more embodiments, integrable polynucleotides can include linearized episomal nucleic acids, for example, plasmids cut with a restriction enzyme. In certain embodiments, integrable polynucleotides can include PCR products.
[0406] The following describes aspects of integrable polynucleotides, namely, removable selection cassettes, sequences with homology to endogenous sequences, and heterologous sequences contained therein.
[0407] The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification, for example, Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000).
Removable selectable marker cassettes
[0408] In some embodiments, a removable selectable cassette can contain a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence. Removable selectable marker cassettes can be used to select for integration of an integrable polynucleotide into the genome of a cell. Subsequent to integration of the integrable polynucleotide, the removable selectable marker cassette can be excised, if desired, from the genome of the cell. Because the number of known selectable markers is limited, one advantage of excising a selectable marker from the genome of a cell is that the selectable marker can be used repeatedly.
That is, after excising the selectable marker of a first integrable polynucleotide from a cell, the same selectable marker can be used in a second integrable polynucleotide to modify the genome of a cell previously modified by the first integrable polynucleotide.
[0409] In some embodiments, the selectable marker can allow selection for a cell in which the selectable marker has integrated into the cell's genome. Selectable markers can be antibiotic resistance genes against compounds, for example, kanamycin, ampicillin, tetracycline, chloramphenicol, spectinomycin, gentamycin, zeomycin, or streptomycin. More selectable markers can be genes capable of complementing strains of yeast having well characterized metabolic deficiencies, for example, tryptophan or histidine deficient mutants. In more embodiments, a selectable marker can be used to select against cells that retain the selectable marker. In such embodiments, cells which do not express the selectable marker will be selected for. In further embodiments, a selectable marker can be selected for and against. Examples of selectable markers that can be used in conjunction with the constructions and methods described herein can include, but are not limited to, URA3 (Boeke, J. D., LaCroute, F., and Fink, G. R. (1984). A positive selection for mutants lacking orotidine-5'-phosphate decarboxylase activity in yeast: 5-fluoro-orotic acid resistance. Mol. Gen. Genet. 197, 345-346), TRP1 (Toyn, J. H., Gunyuzlu, P. L., White, W. H., Thompson, L. A., and Hollis, G. F. (2000). A counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance. Yeast 16, 553-560), CAN1 (Whelan, W. L., Gocke, E., and Manney, T. R. (1979). The CAN1 locus of Saccharomyces cerevisiae: fine-structure analysis and forward mutation rates. Genetics 35-51), KlURA3, CYH2, LYS2 and MET15 (Singh, A. and Sherman, F. (1975). Genetic and physiological characterization of met15 mutants of Saccharomyces cerevisiae: a selective system for forward and reverse mutations. Genetics 75-97). Such examples can typically be used in conjunction with specific strains of Saccharomyces cerevisiae which are non-functional for specific genes.
In embodiments in which the selectable marker can be selected for or selected against, a first selection of the selectable marker can be made to select for incorporation of the selectable marker and a second selection of the selectable marker can be made to select against maintaining the selectable marker. Such embodiments can find particular application when the same selectable marker is utilized iteratively, namely, two or more times, for the separate incorporation of two or more heterologous polynucleotides into the host organism.
[0410] In some embodiments, the selectable marker can be flanked by site-specific recombinase recognition sequences. Such sequences allow a site-specific
recombinase to excise the selectable marker from an integrable polynucleotide integrated into the genome of a cell. Examples of sequence-specific recombinase target sites include, but are not limited to, loxP sites, frt sites, att sites and dif sites. In certain embodiments, the site-specific recombinase recognition sequences can be loxP sites recognized by the CRE recombinase. In further embodiments, the CRE recombinase can be a CRE recombinase optimized for expression in a particular organism, for example, S. cerevisiae, using methods known in the art. In more embodiments, the site-specific recombinase recognition sequence can be frt sites recognized by the FLP recombinase.
[0411] To excise an intervening piece of DNA, for example, DNA encoding a selectable marker, the flanking loxP sites or flanking frt sites should be in the same orientation, that is, the sites should be in tandem orientation. CRE recombinase or FLP recombinase expressed in a cell can excise the sequence between loxP sites or frt sites, respectively. In some embodiments, the site-specific recombinase can be expressed from a plasmid. In other embodiments, the site-specific recombinase can be expressed from an inducible endogenous gene. The use of an inducible CRE recombinase in yeast to delete endogenous sequences flanked by loxP sites is known in the art, as exemplified in Sauer B. Functional expression of the cre-lox site-specific recombination system in the yeast Saccharomyces cerevisiae. Mol. Cell Biol. (1987) 7, 2087-2096.
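The orientation requirement described above can be illustrated with a minimal Python sketch. It is a simplified string model and not a description of the recombination mechanism: the function name excise_between_sites and the example flanking and marker sequences are hypothetical, and the 34 bp sequence used for loxP is the commonly cited canonical site. Two sites in the same (tandem) orientation lead to removal of the intervening DNA together with one copy of the site, leaving a single site behind; a second site in the opposite orientation is not treated as an excision substrate in this model.

```python
# Commonly cited canonical 34 bp loxP site (13 bp repeat, 8 bp spacer, 13 bp repeat).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def excise_between_sites(genome, site=LOXP):
    """Model excision between two recognition sites in tandem orientation.

    If two copies of `site` occur on the same strand, the DNA between them
    and one copy of the site are removed, so a single site remains.
    Otherwise the sequence is returned unchanged (hypothetical simplification).
    """
    first = genome.find(site)
    if first == -1:
        return genome
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome
    return genome[: first + len(site)] + genome[second + len(site):]


# Illustrative (made-up) flanking and marker sequences.
left, marker, right = "GGGGCCCC", "ATGAAATTTGGGTAA", "TTTTAAAA"
tandem = left + LOXP + marker + LOXP + right
inverted = left + LOXP + marker + reverse_complement(LOXP) + right

print(excise_between_sites(tandem) == left + LOXP + right)
print(excise_between_sites(inverted) == inverted)
```

In this model the single site that remains after excision corresponds to the site-specific recombinase recognition site that stays juxtaposed to the integrated heterologous nucleic acid.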
Sequences with homology to endogenous sequences
[0412] In some embodiments, integration of an integrable polynucleotide into the genome of a cell can be mediated by a variety of processes. Such processes can include, but are not limited to, random integration, homologous recombination, or site-specific recombination.
[0413] In some embodiments, integrable polynucleotides can contain sequences with homology to endogenous sequences. Such sequences with homology to endogenous sequences can direct integration of integrable polynucleotides to certain locations in a cell's genome, specifically, the location of the endogenous sequence. One advantage of directing integration of integrable polynucleotides to particular locations of the genome is that the integrable polynucleotides can be directed to locations of the genome that, for example, can contain enhancer elements, locus control regions, or can be more permissive for expression of a heterologous sequence contained within an integrable polynucleotide. In certain embodiments, sequences with homology to endogenous sequences can be more than about 5 nucleotides, more than about 10 nucleotides, more than about 15 nucleotides, more than about 20 nucleotides, more than about 25
nucleotides, more than about 30 nucleotides, more than about 35 nucleotides, more than about 40 nucleotides, more than about 45 nucleotides, more than about 50 nucleotides, more than about 100 nucleotides, more than about 500 nucleotides, more than about 1 kilobase, more than about 2 kilobases, more than about 3 kilobases, more than about 4 kilobases, or more than about 5 kilobases in length. Sequences with homology to endogenous sequences can be 100% identical or can have at least 99 %, 98 %, 97 %, 96 %, 95 %, 94 %, 93 %, 92 %, 91 %, 90 %, 85 %, 80 %, or 70 % identity to the endogenous sequence.
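As an illustration of how the length and percent-identity thresholds above might be checked, the following is a minimal Python sketch. It assumes the homology sequence and the endogenous sequence are already aligned, ungapped, and of equal length; in practice an alignment tool would be used first. The function name percent_identity and the example sequences are hypothetical.

```python
def percent_identity(homology_arm, endogenous):
    """Percent identity of two pre-aligned, ungapped, equal-length sequences."""
    if not homology_arm or len(homology_arm) != len(endogenous):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(homology_arm.upper(), endogenous.upper()))
    return 100.0 * matches / len(homology_arm)


# Made-up 40-nucleotide example with two mismatches.
endogenous = "ATGGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAGGCT"
arm = "ATGGCTAGCTAGGATCCGTTCGATCGTAGCTAGCTAGGCA"

identity = percent_identity(arm, endogenous)
print(len(arm), identity)   # length in nucleotides and percent identity
print(identity >= 90.0)     # compare against one of the recited thresholds
```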
[0414] In particular embodiments, the sequences with homology to endogenous sequences can contain sequences with homology to genomic repetitive elements, such as long interspersed repeats (LINEs), short interspersed repeats (SINEs), or retrotransposon DNA, such as long terminal repeats (LTR). In certain embodiments, genomic repetitive elements can be Ty1 or Ty3 elements. In some embodiments, integrable polynucleotides containing sequences with homology to genomic repetitive elements may integrate at more than one site in the genome of a cell. In further embodiments, sequences with homology to endogenous sequences can contain δ sequences. δ sequences are a component of the LTR of the Ty1 retrotransposon and are distributed throughout the S. cerevisiae genome. Vectors containing δ sequences for integration into S. cerevisiae are known in the art, as exemplified in Lee F.W. and Da Silva N.A., Sequential delta-integration for the regulated insertion of cloned genes in Saccharomyces cerevisiae. Biotechnol Prog. (1997) 13(4): 368-373. In certain embodiments, the 5' nucleic acid sequence with homology to an endogenous sequence and the 3' nucleic acid sequence with homology to an endogenous sequence can contain δ sequences. Vectors containing heterologous sequences flanked by δ sequences are known in the art to have an increased stability for expression of heterologous sequences contained therein (Lee F.W. and Da Silva N.A., Improved efficiency and stability of multiple cloned gene insertions at the delta sequences of Saccharomyces cerevisiae. Appl Microbiol Biotechnol (1997) 48(3): 339-345). Without wishing to be bound to any one theory, the increased stability of integrated vectors containing two δ sequences may be due to the vector integrating into the yeast genome by double-crossover integration.
Heterologous sequences
[0415] In addition to a removable selectable cassette and sequences with homology to endogenous sequences, in some embodiments, an integrable polynucleotide can contain heterologous sequences. Such heterologous sequences can include sequences
encoding polypeptides. In more embodiments, the heterologous sequences can encode genes important in sugar metabolism, cellulose metabolism, arabinose metabolism, and xylose metabolism. In particular embodiments, a heterologous sequence can encode one or more of the nucleotide sequences provided herein, such as, for example, one or more of SEQ ID NOs:(2x+1), where x=0 to 101.
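As a quick check of the indexing expression above, 2x+1 for x = 0 to 101 enumerates the odd-numbered sequence identifiers from 1 through 203, for 102 identifiers in all. A short Python sketch (the variable name seq_id_numbers is illustrative only):

```python
# Enumerate SEQ ID NOs of the form 2x+1 for x = 0..101 (inclusive).
seq_id_numbers = [2 * x + 1 for x in range(102)]

print(seq_id_numbers[:3], seq_id_numbers[-1], len(seq_id_numbers))
# prints: [1, 3, 5] 203 102
```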
[0416] In some embodiments, heterologous sequences can contain regulatory elements operatively linked to a sequence encoding a polypeptide. Such regulatory elements can include, for example, promoters, enhancers, and terminator sequences. Promoters may be constitutive or inducible. Suitable promoters for use in prokaryotic hosts include, but are not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters. Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase and the enzymes responsible for maltose and galactose utilization. Appropriate mammalian promoters include, but are not limited to, the early and late promoters from SV40 and promoters derived from murine Moloney leukemia virus (MLV), mouse mammary tumor virus (MMTV), avian sarcoma viruses, adenovirus II, bovine papilloma virus and polyomas. In certain embodiments, a heterologous sequence can contain the PGK1 promoter, the TEF1 promoter, the CYC1 terminator, and combinations thereof.
[0417] In some embodiments, heterologous sequences encode and express the gene of interest in a cell in which the heterologous sequence has integrated.
Cells
[0418] In some embodiments, a cell can contain any of the integrable polynucleotides described herein. Such a cell can be a prokaryotic cell or a eukaryotic cell. Examples of prokaryotic cells include Escherichia coli and Clostridium species. Examples of eukaryotic cells include, but are not limited to, fungi and yeast cells, such as Saccharomyces cerevisiae, Pichia pastoris, Zymomonas mobilis, Kluyveromyces lactis, Kluyveromyces marxianus, Trichoderma species, and Aspergillus species; mammalian cells, such as Chinese hamster cells; avian cells; and insect cells.
[0419] In some embodiments, the cell can contain an integrable polynucleotide integrated into the genome of a cell. In such embodiments, a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which the removable selectable marker is juxtaposed to said heterologous nucleic acid. A removable selectable marker can be juxtaposed to a heterologous nucleic acid where the
removable selectable marker and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the removable selectable marker and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobase, less than about 2 kilobases, less than about 3 kilobases, less than about 4 kilobases, less than about 5 kilobases, or less than about 10 kilobases.
[0420] In more embodiments, a cell can contain an integrable polynucleotide integrated into the genome of the cell where the removable selectable cassette has been excised from the integrated polynucleotide. In such embodiments, a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which a site-specific recombinase recognition site is juxtaposed to the heterologous nucleic acid. A site-specific recombinase recognition site can be juxtaposed to a heterologous nucleic acid where the site-specific recombinase recognition site and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the site-specific recombinase recognition site and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobase, less than about 2 kilobases, less than about 3 kilobases, less than about 4 kilobases, less than about 5 kilobases, or less than about 10 kilobases.
[0421] In further embodiments, a cell can contain a plurality of integrable polynucleotides. In such embodiments, a cell can contain a plurality of different integrable polynucleotides containing different selectable markers. Typically, a cell contains no more than about 1, no more than about 2, no more than about 3, no more than about 4, no more than about 5, no more than about 6, no more than about 7, no more than about 8, no more than about 9, or no more than about 10 different selectable markers.
However, it is contemplated that the number of selectable markers a cell can contain can include the number of different selectable markers compatible with the methods and compositions described herein. In some embodiments, a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell. In such embodiments, a cell can contain 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 45 or more, or 50 or more different integrable polynucleotides that have integrated into the genome of the cell. In more embodiments, a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell where some integrable polynucleotides contain selectable markers, and some integrable polynucleotides have no selectable marker. In even more embodiments, a cell can contain a plurality of different integrable polynucleotides where some or all of the selectable markers have been excised.
Methods of modifying endogenous sequences
[0422] In addition to the nucleic acids and compositions described, also provided are methods of modifying endogenous sequences in cells. In some embodiments, methods to modify an endogenous sequence in a cell can include providing a cell with any integrable polynucleotide described herein, and selecting for at least one cell containing the integrable polynucleotide integrated into the genome of the cell.
[0423] In some embodiments, a plurality of different integrable polynucleotides can be provided to a cell. In such embodiments, the plurality of different integrable polynucleotides can include 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
[0424] In certain embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different selectable markers. One advantage of providing a cell with a plurality of polynucleotides with different selectable markers includes the ability to make more than one modification to endogenous sequences in a cell simultaneously. Thus, also contemplated herein, are methods that include providing a cell with a plurality of different integrable polynucleotides simultaneously. In more embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different heterologous sequences. In even more embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different flanking sequences with homology to endogenous sequences.
[0425] In some embodiments, at least one selectable marker can be used iteratively. In such embodiments, a cell can be produced from a first round of modification(s) using the methods described herein. In other words, a cell can be provided with a first integrable polynucleotide containing a selectable marker, a cell can be selected for containing the integrable polynucleotide integrated into the cell's genome, the selection cassette can be excised from a cell containing an integrated integrable polynucleotide, and a cell can be selected for having the selection cassette excised. Subsequent to the first round of modifications, a cell containing the modifications of the first round can undergo at least a second round of modifications using a second integrable polynucleotide containing the same selectable marker as the first integrable polynucleotide. As such, a selectable marker can be reused and is used iteratively. In more embodiments, a cell can be provided with a plurality of integrable polynucleotides containing a set of different selectable markers in a first round of modifications. In at least a second subsequent round of modifications, a cell containing the modifications of the first round of modifications can be provided with a plurality of integrable polynucleotides containing the same set of different selectable markers as the first round of modifications.
[0426] In certain embodiments, the integrable polynucleotide can be provided to a cell as a linearized plasmid.
[0427] In more embodiments, the integrable polynucleotide can be provided to a cell as a PCR product. Methods of PCR are well known in the art. In such embodiments, the template for the PCR can comprise a sequence for an integrable polynucleotide, for example, a vector containing the integrable polynucleotide sequence. In more embodiments, the initial template for PCR may not contain the entire sequence for an integrable polynucleotide. One advantage of using PCR to generate the integrable polynucleotide is the ability to incorporate additional sequences at the ends of the initial PCR template. This ability to incorporate additional sequences reduces the number of subcloning steps required to generate an integrable polynucleotide. For example, PCR primers with tails can be designed and used to amplify the initial PCR template and incorporate the additional sequences in the tails into the amplified product. Such additional tail sequences can be 2 nucleotides, 3 nucleotides, 4 nucleotides, 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 31 nucleotides, 32 nucleotides, 33 nucleotides, 34 nucleotides, 35 nucleotides, 36 nucleotides, 37 nucleotides, 38 nucleotides, 39 nucleotides, 40 nucleotides, or more than 40 nucleotides in length. In certain embodiments, primers for the PCR can be designed to add sequences with homology to endogenous sequences to the initial PCR template. In such embodiments, an integrable polynucleotide with flanking sequences with homology to endogenous sequences can be generated. In particular embodiments, additional tail sequences can include Ty1 sequences.
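By way of illustration only, the tailed-primer design described above can be sketched as follows. The function names `tailed_primers` and `expected_amplicon`, the default annealing length of 20 nucleotides, and the short sequences in the usage note are hypothetical and are not taken from the specification; the sketch ignores melting temperature and secondary structure, which a practical primer design would need to consider.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def tailed_primers(template: str, upstream_homology: str,
                   downstream_homology: str, anneal_len: int = 20):
    """Design a primer pair whose 5' tails add flanking homology.

    template: top strand of the initial PCR template, written 5'->3'.
    upstream_homology / downstream_homology: sequences (top strand, 5'->3')
    to be appended to the left and right ends of the amplified product.
    anneal_len: assumed length of the 3' portion that anneals to the template.
    """
    template = template.upper()
    # Forward primer: 5' tail (upstream homology) + start of the template.
    forward = upstream_homology.upper() + template[:anneal_len]
    # Reverse primer: reverse complement of (end of template + downstream
    # homology), which places the tail at the 5' end of the primer.
    reverse = reverse_complement(template[-anneal_len:] + downstream_homology)
    return forward, reverse


def expected_amplicon(template: str, upstream_homology: str,
                      downstream_homology: str) -> str:
    """Top strand of the product expected after amplification with the tails."""
    return (upstream_homology + template + downstream_homology).upper()
```

For example, `tailed_primers("ATGAAACCCGGGTTTTAA", "GGGG", "CCCC", anneal_len=6)` returns the pair `("GGGGATGAAA", "GGGGTTAAAA")`, and the expected product carries the two tails at its ends.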
[0428] In some embodiments, methods to modify an endogenous sequence in a cell can also include excising the selectable marker from the integrable polynucleotide integrated into the genome of the cell. One advantage of excising a selectable marker integrated into the genome of a cell is that the selectable marker can be re-used to select for another modification in a subsequent round of modifications. In certain embodiments, a selectable marker can be excised from an integrated site by site-specific recombination using a site-specific recombinase expressed in the cell. Site-specific recombinases can include CRE recombinase to excise sequences between tandem loxP sites, and FLP recombinase to excise sequences between tandem frt sites. In some embodiments, the site-specific recombinase can be expressed from a plasmid transformed into the cell. Alternatively, the site-specific recombinase can be expressed from an inducible endogenous gene. It is contemplated that in instances where more than one type of selectable marker has integrated into the cell's genome, all the different selectable markers can be excised simultaneously by the expression of at least one type of site-specific recombinase. For example, the selectable markers of an integrable polynucleotide containing the URA3 marker flanked by loxP sites, and an integrable polynucleotide containing the TRP1 marker flanked by loxP sites, can both be excised from sites where the integrable polynucleotides have integrated into the cell by expression in the cell of CRE recombinase. In other embodiments, a cell can be provided with a plurality of integrable polynucleotides which contain different recombinase recognition sequences. In other words, the plurality of integrable polynucleotides can include some integrable polynucleotides that contain one type of recombinase recognition sequence, such as loxP sites, and some integrable polynucleotides that contain another type of recombinase recognition sequence, such as frt sites.
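As a purely illustrative, in silico sketch of the excision described above, the outcome of recombination between two tandem (directly repeated) recognition sites can be modeled as removal of the intervening sequence together with one copy of the site. The function name `excise_between_tandem_sites` is hypothetical, the 34-nucleotide loxP sequence shown is the commonly cited one and is supplied here as an assumption rather than taken from the specification, and the flanking and marker sequences in the usage note are placeholders.

```python
# Commonly cited 34-nt loxP site (assumption; not recited in the specification).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def excise_between_tandem_sites(genome: str, site: str = LOXP) -> str:
    """Model recombinase-mediated excision between two direct-repeat sites.

    The sequence between the first two occurrences of `site` is removed
    along with one copy of the site, so a single site remains at the locus.
    If fewer than two sites are present, the sequence is returned unchanged.
    """
    first = genome.find(site)
    if first == -1:
        return genome
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome
    # Keep everything up to the first site, then resume at the second site.
    return genome[:first] + genome[second:]
```

For instance, applying the function to `"AAAA" + LOXP + "GGGGCCCC" + LOXP + "TTTT"`, where `"GGGGCCCC"` stands in for a selectable marker, yields `"AAAA" + LOXP + "TTTT"`. This string model does not represent site orientation or recombinase expression; it only illustrates the net rearrangement.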
[0429] In some embodiments, a cell in which a selectable marker has been excised can be identified by selecting against cells that retain the marker. Methods for such negative selection are well known in the art.
Systems and methods for degrading cellulose
[0430] Also provided herein are systems and methods for degrading cellulose, comprising one or more host organisms that collectively include DNA sequences operably encoding at least two of the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase. In certain aspects, one or more, or all of the enzymes are heterologous to the one or more host organisms. In certain aspects, the translational kinetics of each of the DNA sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, or 3 codon pairs present in the original sequence for each enzyme. A silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change. In certain aspects, the at least 1, 2 or 3 substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
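The definition of a silent permutation given above can be illustrated with a short check that a substituted codon pair encodes the same two amino acids as the original pair. The sketch assumes the standard genetic code; the names `CODON_TABLE`, `translate`, and `is_silent_permutation` are introduced here for illustration only.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def is_silent_permutation(original_pair: str, substituted_pair: str) -> bool:
    """True if two 6-nt codon pairs differ in sequence but not in translation."""
    if len(original_pair) != 6 or len(substituted_pair) != 6:
        raise ValueError("a codon pair is six nucleotides long")
    return (original_pair.upper() != substituted_pair.upper()
            and translate(original_pair) == translate(substituted_pair))
```

Under these assumptions, replacing `GCTGAA` (Ala-Glu) with `GCCGAG` is silent, whereas replacing it with `GCTGAT` (Ala-Asp) changes the encoded amino acid and would instead be assessed as a possible conservative substitution.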
[0431] In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
[0432] In some aspects, each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity with the original sequence of the enzyme.
[0433] In some aspects, one or more of the endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for degradation of cellulose. Methods for measuring the activity of the enzymes in the system are known in the art. For example, the incorporated materials of U.S. Patent No. 6,566,113 provide methods for measuring the activity of cellobiohydrolases that have been recombinantly expressed.
[0434] Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed. In some such embodiments, the carbohydrate is cellulose. In some such embodiments, the carbohydrate comprises two or more β-1,4-linked glucose units. Typically such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
Systems and methods for Lignin Metabolism
[0435] Also provided herein are systems and methods for lignin metabolism, comprising one or more host organisms that collectively include DNA sequences operably encoding at least two enzymes from bacterial or eukaryotic pathways. An exemplary system for lignin metabolism is a cassette of enzymes that can include laccase (LCC), Mn-dependent peroxidase (MnP), and lignin peroxidase (LiP). In certain aspects, one or more, or all of the enzymes are heterologous to the one or more host organisms. In certain aspects, the translational kinetics of each of the DNA sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme. A silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change. In certain aspects, the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism. In certain aspects, a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
[0436] In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster and Schizosaccharomyces pombe.
[0437] In some aspects, each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity with the original sequence of the enzyme.
[0438] In some aspects, one or more of the enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for metabolism of lignin. Methods for measuring the activity of the enzymes in the system are known in the art.
[0439] Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed. In some such embodiments, the carbohydrate is cellulose. In some such embodiments, the carbohydrate comprises two or more β-1,4-linked glucose units. Typically such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
[0440] The following examples are included for illustrative purposes only and are not intended to limit the scope of the invention.
EXAMPLES
[0441] The methods provided herein below exemplify calculation of nucleotide sequences for improved expression in selected heterologous organisms, and expression of polynucleotides containing such sequences. It will be understood by those skilled in the art that any of a variety of known molecular techniques can be utilized in order to implement the following examples. For example, a polynucleotide containing an improved-expression nucleotide sequence calculated in accordance with the teachings herein can be prepared by known methods, such as, for example, assembly of overlapping oligonucleotides which can be solid phase synthesized, as is described in U.S. Patent Number 7,262,031, and U.S. Patent Publication Numbers 2005/0106590 and 2007/0009928. The prepared polynucleotide can then be amplified by PCR methodologies or by insertion into a vector, transformation into cells, and subsequent harvesting of the vector from the cells. Examples of such methods for amplification of a polynucleotide are provided in Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y. The polynucleotide itself or an amplicon thereof can be inserted into an expression vector configured to produce the polypeptide encoded by the inserted polynucleotide. The expression vector is then inserted into cells, and according to the expression vector used, the cells are treated under conditions suitable for polypeptide expression. Any of a variety of expression vectors, cell types, and polypeptide expression methodologies known in the art can be used, and examples of such methodologies are provided in Ausubel, supra. The expressed polypeptide can be analyzed and manipulated as desired. For example, the expressed polypeptide can be analyzed by Western blot analysis using a known antibody to the expressed polypeptide or using an anti-polypeptide antibody generated by known methods. The expressed polypeptide also can be subjected to one or more purification steps to increase the purity of the expressed polypeptide. Various analytical and purification methods, as well as antibody-generation methods, are known in the art, as exemplified in Ausubel, supra.
EXAMPLE 1
[0442] This example describes optimization of a DNA sequence encoding TrCBH-II for expression in yeast.
[0443] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
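A simplified sketch of the first and last steps of this calculation is given below for illustration. It computes only a chisq1-like value per observed codon pair and normalizes it to a z score; the chisq2 and chisq3 corrections for amino acid pair and dinucleotide frequencies are not reproduced. The expected count (product of the two codon frequencies times the number of pairs), the form (O − E)²/E, and the use of a sign to distinguish over-represented from under-represented pairs are assumptions of this sketch, as are the function names; the cited reference should be consulted for the actual procedure.

```python
from collections import Counter
from statistics import mean, pstdev


def codons(seq: str) -> list:
    """Split an in-frame coding sequence into codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_z_scores(coding_sequences) -> dict:
    """Return {(codon, next_codon): z score} for every observed codon pair."""
    codon_counts = Counter()
    pair_counts = Counter()
    for seq in coding_sequences:
        cs = codons(seq.upper())
        codon_counts.update(cs)
        pair_counts.update(zip(cs, cs[1:]))  # adjacent codons within one gene

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())

    chisq = {}
    for (a, b), observed in pair_counts.items():
        # Expected count if codons were paired at random (assumed model).
        expected = (total_pairs
                    * (codon_counts[a] / total_codons)
                    * (codon_counts[b] / total_codons))
        sign = 1 if observed >= expected else -1  # assumed sign convention
        chisq[(a, b)] = sign * (observed - expected) ** 2 / expected

    mu = mean(chisq.values())
    sigma = pstdev(chisq.values())
    # Express each value as a number of standard deviations from the mean.
    return {pair: (value - mu) / sigma if sigma else 0.0
            for pair, value in chisq.items()}
```

In this sketch, codon pairs that never occur in the input are simply absent from the result; a fuller treatment would also score those pairs, whose observed count is zero.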
[0444] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for S. cerevisiae. The DNA sequence encoding TrCBH-II (SEQ ID NO: 1) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0445] A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position. The graphical display is provided in Figure 1.
[0446] A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2A.
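The data underlying such a graphical display can be sketched as a list of (codon pair position, z score) points. In the illustration below, `z_table` is a hypothetical dictionary mapping a (codon, next codon) tuple to its z score in the host organism, and the default of 0.0 for pairs missing from the table is an assumption of the sketch.

```python
def z_score_profile(coding_sequence: str, z_table: dict,
                    default: float = 0.0) -> list:
    """Return [(pair_position, z_score), ...] along a coding sequence.

    pair_position is 1-based: position 1 is the pair formed by codons 1 and 2.
    z_table maps (codon, next_codon) tuples to z scores for the host organism.
    """
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3
    cs = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [(position, z_table.get(pair, default))
            for position, pair in enumerate(zip(cs, cs[1:]), start=1)]


def pairs_above(profile: list, threshold: float = 3.0) -> list:
    """Positions whose z score exceeds the threshold (candidate pause sites)."""
    return [position for position, z in profile if z > threshold]
```

The resulting points can be passed to any plotting tool, with codon pair position on the horizontal axis and z score on the vertical axis; `pairs_above` lists the positions that would appear above a chosen cutoff.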
[0447] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 3) was found to encode a protein (SEQ ID NO: 4) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 3) encoding the TrCBH-II protein (SEQ ID NO: 4) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2B.
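One way such a modification could be carried out computationally is sketched below, for illustration only. The sketch is a greedy, single-pass replacement that uses silent permutations exclusively, so the encoded protein is unchanged; it is not asserted to be the procedure actually used to generate SEQ ID NO: 3. The standard genetic code is assumed, `z_table` is a hypothetical dictionary of z scores keyed by (codon, next codon), pairs missing from the table are treated as having a z score of 0.0, and the input length is assumed to be a multiple of three.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def remove_high_z_pairs(seq: str, z_table: dict, threshold: float = 3.0) -> str:
    """Greedily replace codon pairs whose z score exceeds the threshold."""
    cs = [seq.upper()[i:i + 3] for i in range(0, len(seq), 3)]

    def z(a, b):
        return z_table.get((a, b), 0.0)

    for i in range(len(cs) - 1):
        if z(cs[i], cs[i + 1]) <= threshold:
            continue

        def local_max(a, b):
            # Score the candidate pair and the two pairs it overlaps.
            scores = [z(a, b)]
            if i > 0:
                scores.append(z(cs[i - 1], a))
            if i + 2 < len(cs):
                scores.append(z(b, cs[i + 2]))
            return max(scores)

        candidates = product(SYNONYMS[CODON_TABLE[cs[i]]],
                             SYNONYMS[CODON_TABLE[cs[i + 1]]])
        cs[i], cs[i + 1] = min(candidates, key=lambda pair: local_max(*pair))
    return "".join(cs)
```

Because the pass is greedy, it does not guarantee that every pair ends at or below the threshold when no suitable synonymous alternative exists (for example, for Met-Trp pairs); in that case a conservative amino acid substitution, as discussed elsewhere herein, would be a separate design decision.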
EXAMPLE 2
[0448] This example describes optimization of a DNA sequence encoding TrCBH-II for expression in bacteria.
[0449] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0450] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3A.
[0451] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 9) was found to encode a protein (SEQ ID NO: 10) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 9) encoding the TrCBH-II protein (SEQ ID NO: 10) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3B.
EXAMPLE 3
[0452] This example describes optimization of a DNA sequence encoding TrCBH-II for expression in P. pastoris.
[0453] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0454] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4A.
[0455] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 15) was found to encode a protein (SEQ ID NO: 16) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 15) encoding the TrCBH-II protein (SEQ ID NO: 16) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4B.
EXAMPLE 4
[0456] This example describes optimization of a DNA sequence encoding TrCBH-II for expression in K. lactis.
[0457] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0458] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5A.
[0459] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the TrCBH-II protein (SEQ ID NO: 22) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5B.
EXAMPLE 5
[0460] This example describes optimization of a DNA sequence encoding TrCBH-II for expression in Z. mobilis.
[0461] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0462] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 1) encoding the TrCBH-II protein (SEQ ID NO: 2) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6A.
[0463] The nucleotide sequence for the gene encoding the TrCBH-II protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 23) was found to encode a protein (SEQ ID NO: 24) with 100% amino acid sequence identity to wild-type TrCBH-II (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 23) encoding the TrCBH-II protein (SEQ ID NO: 24) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6B.
EXAMPLE 6
[0464] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 2 and native TrCBH-II protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose and the culture is grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0465] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 7
[0466] This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
[0467] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0468] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
[0469] A graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 7A.
[0470] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 27) was found to encode a protein (SEQ ID NO: 28) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 27) encoding the LCC protein (SEQ ID NO: 28) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 7B.
EXAMPLE 8
[0471] This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
[0472] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0473] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 8A.
[0474] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 33) was found to encode a protein (SEQ ID NO: 34) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 33) encoding the LCC protein (SEQ ID NO: 34) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 8B.
EXAMPLE 9
[0475] This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
[0476] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0477] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 9A.
[0478] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 39) was found to encode a protein (SEQ ID NO: 40) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 39) encoding the LCC protein (SEQ ID NO: 40) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 9B.
EXAMPLE 10
[0479] This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
[0480] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0481] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 10A.
[0482] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 45) was found to encode a protein (SEQ ID NO: 46) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 45) encoding the LCC protein (SEQ ID NO: 46) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 10B.
EXAMPLE 11
[0483] This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
[0484] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0485] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 25) encoding the LCC protein (SEQ ID NO: 26) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 11A.
[0486] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 47) was found to encode a protein (SEQ ID NO: 48) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 47) encoding the LCC protein (SEQ ID NO: 48) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 11B.
EXAMPLE 12
[0487] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 8 and native LCC protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0488] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 13
[0489] This example describes optimization of a DNA sequence encoding LIP for expression in yeast.
[0490] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
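The first and last steps of this calculation (observed versus expected codon pair counts, then z-score normalization) can be illustrated with a minimal Python sketch. It is not the procedure of Example 1 or of Hatfield and Gutman: the expected count is assumed here to be the product of the marginal codon counts divided by the total number of pairs, the chi-squared value is assumed to carry the sign of (observed - expected), and the chisq2 and chisq3 corrections are omitted because their formulas are not given in this section. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def codon_pairs(cds):
    """Split an in-frame coding sequence into its adjacent codon pairs."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return list(zip(codons, codons[1:]))


def chisq1_table(coding_regions):
    """Signed chi-squared per observed codon pair (assumed form, see lead-in)."""
    observed, first, second = Counter(), Counter(), Counter()
    for cds in coding_regions:
        for a, b in codon_pairs(cds):
            observed[(a, b)] += 1
            first[a] += 1    # codon seen in the first position of a pair
            second[b] += 1   # codon seen in the second position of a pair
    total = sum(observed.values())
    table = {}
    for (a, b), obs in observed.items():
        # Assumption: random pairing means expected = marginal product / total.
        exp = first[a] * second[b] / total
        sign = 1 if obs >= exp else -1
        table[(a, b)] = sign * (obs - exp) ** 2 / exp
    return table


def z_scores(table):
    """Express each value as standard deviations from the mean over all pairs."""
    mu = mean(table.values())
    sd = pstdev(table.values())
    if sd == 0:
        return {pair: 0.0 for pair in table}
    return {pair: (value - mu) / sd for pair, value in table.items()}
```

In this sketch only codon pairs that occur at least once receive a value; a fuller treatment would also score pairs that are expected but never observed.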
[0491] The nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for S. cerevisiae.
[0492] A graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 12A.
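The data behind such a graphical display can be sketched as a list of (codon pair position, z score) points. The Python fragment below is an illustrative assumption rather than the tool used to prepare the figures: `z_by_pair` stands for a hypothetical lookup table of z scores keyed by codon pair, and pairs missing from the table are given a default of 0.

```python
def z_profile(cds, z_by_pair, default=0.0):
    """Return (1-based codon pair position, z score) for each adjacent codon pair."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    pairs = zip(codons, codons[1:])
    return [(pos, z_by_pair.get(pair, default))
            for pos, pair in enumerate(pairs, start=1)]


def flagged_positions(profile, threshold=3.0):
    """Positions whose z score exceeds the threshold (3 in the examples)."""
    return [pos for pos, z in profile if z > threshold]
```

The resulting list can be passed to any plotting library with position on the horizontal axis and z score on the vertical axis.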
[0493] The nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 51) was found to encode a protein (SEQ ID NO: 52) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 51) encoding the LIP protein (SEQ ID NO: 52) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 12B.
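One way to picture this kind of modification is a greedy pass that swaps in synonymous codons wherever a codon pair exceeds the threshold, then confirms that the encoded protein is unchanged. The sketch below is only an illustration under stated assumptions, not the method used to produce the sequences listed here: it uses the standard genetic code, a hypothetical `z_by_pair` table, a sequence whose length is a multiple of three, and a left-to-right strategy that can leave a pair unresolved when no synonymous codon fits.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(cds):
    """Translate an in-frame sequence whose length is a multiple of three."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def remove_high_z_pairs(cds, z_by_pair, threshold=3.0):
    """Greedy sketch: replace the second codon of each over-threshold pair."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

    def z(a, b):
        return z_by_pair.get((a, b), 0.0)

    for i in range(len(codons) - 1):
        if z(codons[i], codons[i + 1]) <= threshold:
            continue
        for alt in SYNONYMS[CODON_TO_AA[codons[i + 1]]]:
            left_ok = z(codons[i], alt) <= threshold
            # Also check the pair formed with the following codon, if any.
            right_ok = i + 2 >= len(codons) or z(alt, codons[i + 2]) <= threshold
            if left_ok and right_ok:
                codons[i + 1] = alt
                break
    return "".join(codons)
```

Comparing `translate` of the input and output is the check that corresponds to the 100% amino acid sequence identity reported for the modified sequences.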
EXAMPLE 14
[0494] This example describes optimization of a DNA sequence encoding LIP for expression in bacteria.
[0495] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0496] The nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 13A.
[0497] The nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 57) was found to encode a protein (SEQ ID NO: 58) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 57) encoding the LIP protein (SEQ ID NO: 58) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 13B.
EXAMPLE 15
[0498] This example describes optimization of a DNA sequence encoding LIP for expression in P. pastoris.
[0499] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0500] The nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 14A.
[0501] The nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the LIP protein (SEQ ID NO: 64) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 14B.
EXAMPLE 16
[0502] This example describes optimization of a DNA sequence encoding LIP for expression in K. lactis.
[0503] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0504] The nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 15A.
[0505] The nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 69) was found to encode a protein (SEQ ID NO: 70) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 69) encoding the LIP protein (SEQ ID NO: 70) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 15B.
EXAMPLE 17
[0506] This example describes optimization of a DNA sequence encoding LIP for expression in Z. mobilis.
[0507] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0508] The nucleotide sequence for the gene encoding the LIP protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 49) encoding the LIP protein (SEQ ID NO: 50) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 16A.
[0509] The nucleotide sequence for the gene encoding the LIP protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 71) was found to encode a protein (SEQ ID NO: 72) with 100% amino acid sequence identity to wild-type LIP (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 71) encoding the LIP protein (SEQ ID NO: 72) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 16B.
EXAMPLE 18
[0510] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 14 and native LIP protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LIP antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0511] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 19
[0512] This example describes optimization of a DNA sequence encoding MnP for expression in yeast.
[0513] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0514] The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for S. cerevisiae.
[0515] A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 17A.
[0516] The nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 75) was found to encode a protein (SEQ ID NO: 76) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 75) encoding the MnP protein (SEQ ID NO: 76) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 17B.
EXAMPLE 20
[0517] This example describes optimization of a DNA sequence encoding MnP for expression in bacteria.
[0518] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0519] The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 18A.
[0520] The nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 81) was found to encode a protein (SEQ ID NO: 82) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 81) encoding the MnP protein (SEQ ID NO: 82) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 18B.
EXAMPLE 21
[0521] This example describes optimization of a DNA sequence encoding MnP for expression in P. pastoris.
[0522] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0523] The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 19A.
[0524] The nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 87) was found to encode a protein (SEQ ID NO: 88) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 87) encoding the MnP protein (SEQ ID NO: 88) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 19B.
EXAMPLE 22
[0525] This example describes optimization of a DNA sequence encoding MnP for expression in K. lactis.
[0526] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0527] The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 20A.
[0528] The nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 93) was found to encode a protein (SEQ ID NO: 94) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 93) encoding the MnP protein (SEQ ID NO: 94) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 20B.
EXAMPLE 23
[0529] This example describes optimization of a DNA sequence encoding MnP for expression in Z. mobilis.
[0530] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0531] The nucleotide sequence for the gene encoding the MnP protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 73) encoding the MnP protein (SEQ ID NO: 74) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 21A.
[0532] The nucleotide sequence for the gene encoding the MnP protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 95) was found to encode a protein (SEQ ID NO: 96) with 100% amino acid sequence identity to wild-type MnP (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 95) encoding the MnP protein (SEQ ID NO: 96) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 21B.
EXAMPLE 24
[0533] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 20 and native MnP protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002% or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-MnP antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0534] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 25
[0535] This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
[0536] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0537] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
[0538] A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in N. crassa was prepared by plotting z scores of translational kinetics values for codon pair utilization in N. crassa as a function of codon pair position. The graphical display is provided in Figure 22.
[0539] A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 23A.
[0540] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 99) was found to encode a protein (SEQ ID NO: 100) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 99) encoding the LCC protein (SEQ ID NO: 100) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 23B.
EXAMPLE 26
[0541] This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
[0542] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0543] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 24A.
[0544] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 105) was found to encode a protein (SEQ ID NO: 106) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 105) encoding the LCC protein (SEQ ID NO: 106) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 24B.
EXAMPLE 27
[0545] This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
[0546] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0547] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 25A.
[0548] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 111) was found to encode a protein (SEQ ID NO: 112) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 111) encoding the LCC protein (SEQ ID NO: 112) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 25B.
EXAMPLE 28
[0549] This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
[0550] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0551] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 26A.
[0552] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 117) was found to encode a protein (SEQ ID NO: 118) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 117) encoding the LCC protein (SEQ ID NO: 118) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 26B.
EXAMPLE 29
[0553] This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
[0554] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0555] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 97) encoding the LCC protein (SEQ ID NO: 98) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 27A.
[0556] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 119) was found to encode a protein (SEQ ID NO: 120) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 119) encoding the LCC protein (SEQ ID NO: 120) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 27B.
EXAMPLE 30
[0557] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 26 and native LCC protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0558] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
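The quantities implied by the culture conditions above can be worked out with a short Python sketch. It is illustrative only and rests on assumptions not stated in the text: the 1:100 inoculation is read as 1 part overnight culture in 100 parts final volume, the L-arabinose percentages are read as weight per volume, and the volume added at induction is ignored. The function name and its parameters are hypothetical.

```python
def culture_setup(final_volume_ml=5.0, inoculum_dilution=100,
                  ampicillin_ug_per_ml=100.0, arabinose_percent_wv=0.02):
    """Return the amounts needed for one induced culture (assumptions in the lead-in)."""
    # 1:100 dilution read as 1 part inoculum per 100 parts final volume.
    inoculum_ul = final_volume_ml * 1000.0 / inoculum_dilution
    # Total ampicillin mass for the stated final concentration.
    ampicillin_ug = ampicillin_ug_per_ml * final_volume_ml
    # Percent (w/v) means grams per 100 ml, so 1% corresponds to 10 mg/ml.
    arabinose_mg = arabinose_percent_wv * 10.0 * final_volume_ml
    return {
        "inoculum_ul": inoculum_ul,
        "ampicillin_ug": ampicillin_ug,
        "arabinose_mg": arabinose_mg,
    }


high_induction = culture_setup(arabinose_percent_wv=0.02)
low_induction = culture_setup(arabinose_percent_wv=0.002)
```

Under these assumptions a 5 ml culture takes about 50 μl of overnight culture and 500 μg of ampicillin, with roughly 1 mg or 0.1 mg of L-arabinose for the two induction levels.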
EXAMPLE 31
[0559] This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
[0560] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
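The observed/expected comparison and the z score normalization described above can be sketched in Python as follows. This is a simplified illustration rather than the cited method: it computes only a first-pass value analogous to chisq1, it assumes the value is signed so that over-represented pairs score positive, and it omits the amino acid pair and dinucleotide corrections that yield chisq2 and chisq3. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def codon_pairs(cds):
    """Adjacent codon pairs of an in-frame coding sequence (partial last codon ignored)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return list(zip(codons, codons[1:]))


def codon_pair_z_scores(coding_sequences):
    """Signed chi-squared-like value per codon pair, normalized to z scores."""
    observed = Counter()
    for cds in coding_sequences:
        observed.update(codon_pairs(cds))
    total = sum(observed.values())
    if total == 0:
        raise ValueError("no codon pairs found")

    # Marginal counts of each codon in the first and second slot of a pair.
    first, second = Counter(), Counter()
    for (a, b), n in observed.items():
        first[a] += n
        second[b] += n

    signed_chisq = {}
    for a, b in product(first, second):
        # Expected count if codons were paired at random.
        expected = first[a] * second[b] / total
        diff = observed[(a, b)] - expected
        signed_chisq[(a, b)] = math.copysign(diff * diff / expected, diff)

    # Normalize to standard deviations from the mean over all scored pairs.
    values = list(signed_chisq.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("z scores are undefined when all values are equal")
    return {pair: (v - mean) / std for pair, v in signed_chisq.items()}


# Hypothetical toy input; a real analysis would use the host's coding regions.
z = codon_pair_z_scores(["AAAAAAGGGGGG"])
```

In this toy input the pair (GGG, AAA) never occurs although both codons are present, so it receives a negative z score, while pairs observed more often than expected receive positive scores.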
[0561] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
[0562] A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 28A.
[0563] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 123) was found to encode a protein (SEQ ID NO: 124) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 123) encoding the LCC protein (SEQ ID NO: 124) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 28B.
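One way to remove codon pairs above the z score threshold while preserving the encoded protein is a greedy synonymous substitution, sketched below in Python. This is an assumption-laden illustration and not the procedure used to produce SEQ ID NO: 123: it uses the standard genetic code, treats codon pairs missing from the lookup as having a z score of 0, ignores single-codon usage preferences, and leaves a pair unchanged when no synonymous choice brings it under the threshold. The names `remove_high_z_pairs` and `z_scores` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), built in TCAG base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def remove_high_z_pairs(cds, z_scores, threshold=3.0):
    """Greedily replace codon pairs whose z score exceeds the threshold."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

    def z(a, b):
        return z_scores.get((a, b), 0.0)  # missing pairs assumed neutral

    for i in range(len(codons) - 1):
        if z(codons[i], codons[i + 1]) <= threshold:
            continue
        best = None
        choices = product(SYNONYMS[CODON_TABLE[codons[i]]],
                          SYNONYMS[CODON_TABLE[codons[i + 1]]])
        for left, right in choices:
            # Score the pair itself and both neighbouring pairs it affects.
            worst = z(left, right)
            if i > 0:
                worst = max(worst, z(codons[i - 1], left))
            if i + 2 < len(codons):
                worst = max(worst, z(right, codons[i + 2]))
            if best is None or worst < best[0]:
                best = (worst, left, right)
        if best[0] <= threshold:
            codons[i], codons[i + 1] = best[1], best[2]
    return "".join(codons)


# Hypothetical toy input: one pair (GCT, AAA) scored above the threshold.
original = "ATGGCTAAAGGT"
modified = remove_high_z_pairs(original, {("GCT", "AAA"): 3.6})
```

Because every substitution is drawn from the synonyms of the original codon, the translated protein is unchanged by construction, which corresponds to the 100% identity reported for the modified sequences.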
EXAMPLE 32
[0564] This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
[0565] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0566] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 29A.
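The data behind a graphical display of this kind, z score as a function of codon pair position, can be assembled with a small Python sketch. It assumes a precomputed lookup of host-specific z scores keyed by codon pair; the lookup values, the toy sequence, and the function names are hypothetical, and codon pairs absent from the lookup are reported as None rather than guessed.

```python
def z_score_profile(cds, z_scores):
    """Return (codon pair position, z score) tuples, with positions starting at 1."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    pairs = zip(codons, codons[1:])
    return [(pos, z_scores.get(pair)) for pos, pair in enumerate(pairs, start=1)]


def flagged_positions(profile, threshold=3.0):
    """Codon pair positions whose z score exceeds the threshold."""
    return [pos for pos, z in profile if z is not None and z > threshold]


# Hypothetical z scores for a toy four-codon sequence.
toy_z_scores = {
    ("ATG", "GCT"): 0.4,
    ("GCT", "AAA"): 3.6,
    ("AAA", "GGT"): -1.2,
}
profile = z_score_profile("ATGGCTAAAGGT", toy_z_scores)
flagged = flagged_positions(profile)
```

The resulting list of positions and z scores can be passed to any plotting tool; the flagged positions are the codon pairs that a subsequent modification step would target.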
[0567] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 129) was found to encode a protein (SEQ ID NO: 130) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 129) encoding the LCC protein (SEQ ID NO: 130) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 29B.
EXAMPLE 33
[0568] This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
[0569] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0570] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 30A.
[0571] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 135) was found to encode a protein (SEQ ID NO: 136) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 135) encoding the LCC protein (SEQ ID NO: 136) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 30B.
EXAMPLE 34
[0572] This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
[0573] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0574] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 31A.
[0575] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 141) was found to encode a protein (SEQ ID NO: 142) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 141) encoding the LCC protein (SEQ ID NO: 142) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 31B.
EXAMPLE 35
[0576] This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
[0577] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0578] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 121) encoding the LCC protein (SEQ ID NO: 122) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 32A.
[0579] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 143) was found to encode a protein (SEQ ID NO: 144) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 143) encoding the LCC protein (SEQ ID NO: 144) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 32B.
EXAMPLE 36
[0580] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 32 and native LCC protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0581] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 37
[0582] This example describes optimization of a DNA sequence encoding LCC for expression in yeast.
[0583] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0584] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for S. cerevisiae.
[0585] A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 33A.
[0586] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 147) was found to encode a protein (SEQ ID NO: 148) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 147) encoding the LCC protein (SEQ ID NO: 148) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 33B.
EXAMPLE 38
[0587] This example describes optimization of a DNA sequence encoding LCC for expression in bacteria.
[0588] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0589] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 34A.
[0590] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 153) was found to encode a protein (SEQ ID NO: 154) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 153) encoding the LCC protein (SEQ ID NO: 154) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 34B.
EXAMPLE 39
[0591] This example describes optimization of a DNA sequence encoding LCC for expression in P. pastoris.
[0592] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0593] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 35A.
[0594] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 159) was found to encode a protein (SEQ ID NO: 160) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 159) encoding the LCC protein (SEQ ID NO: 160) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 35B.
EXAMPLE 40
[0595] This example describes optimization of a DNA sequence encoding LCC for expression in K. lactis.
[0596] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0597] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 36A.
[0598] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 165) was found to encode a protein (SEQ ID NO: 166) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 165) encoding the LCC protein (SEQ ID NO: 166) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 36B.
EXAMPLE 41
[0599] This example describes optimization of a DNA sequence encoding LCC for expression in Z. mobilis.
[0600] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0601] The nucleotide sequence for the gene encoding the LCC protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 145) encoding the LCC protein (SEQ ID NO: 146) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 37A.
[0602] The nucleotide sequence for the gene encoding the LCC protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 167) was found to encode a protein (SEQ ID NO: 168) with 100% amino acid sequence identity to wild-type LCC (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 167) encoding the LCC protein (SEQ ID NO: 168) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 37B.
EXAMPLE 42
[0603] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 38 and native LCC protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LCC antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0604] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 43
[0605] This example describes optimization of a DNA sequence encoding the T. reesei enzyme cellobiohydrolase-I (TrCBH-I) for expression in yeast.
[0606] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0607] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for S. cerevisiae. The DNA sequence encoding TrCBH-I (SEQ ID NO: 169) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0608] A graphical display for the native gene (SEQ ID NO: 169) encoding the protein (SEQ ID NO: 170) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position. The graphical display is provided in Figure 38.
[0609] A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 39A.
[0610] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 171) was found to encode a protein (SEQ ID NO: 172) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 171) encoding the TrCBH-I protein (SEQ ID NO: 172) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 39B.
EXAMPLE 44
[0611] This example describes optimization of a DNA sequence encoding TrCBH-I for expression in bacteria.
[0612] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0613] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 40A.
[0614] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 173) was found to encode a protein (SEQ ID NO: 174) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 173) encoding the TrCBH-I protein (SEQ ID NO: 174) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 40B.
EXAMPLE 45
[0615] This example describes optimization of a DNA sequence encoding TrCBH-I for expression in P. pastoris.
[0616] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0617] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 41A.
[0618] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 175) was found to encode a protein (SEQ ID NO: 176) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 175) encoding the TrCBH-I protein (SEQ ID NO: 176) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 41B.
EXAMPLE 46
[0619] This example describes optimization of a DNA sequence encoding TrCBH-I for expression in K. lactis.
[0620] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0621] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 42A.
[0622] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 177) was found to encode a protein (SEQ ID NO: 178) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 177) encoding the TrCBH-I protein (SEQ ID NO: 178) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 42B.
EXAMPLE 47
[0623] This example describes optimization of a DNA sequence encoding TrCBH-I for expression in Z. mobilis.
[0624] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0625] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 169) encoding the TrCBH-I protein (SEQ ID NO: 170) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 43A.
[0626] The nucleotide sequence for the gene encoding the TrCBH-I protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 179) was found to encode a protein (SEQ ID NO: 180) with 100% amino acid sequence identity to wild-type TrCBH-I (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 179) encoding the TrCBH-I protein (SEQ ID NO: 180) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 43B.
EXAMPLE 48
[0627] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 44 and native TrCBH-I protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
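For convenience, the reagent quantities implied by the culture conditions above can be worked out as in the following Python sketch. It is illustrative only and rests on assumptions not stated in the text: the 1:100 inoculation is treated as one part overnight culture per 100 parts final volume, the L-arabinose percentages are treated as weight per volume (grams per 100 ml), and all concentrations are treated as final concentrations in the 5 ml culture. The function names are hypothetical.

```python
def inoculum_ml(culture_ml, dilution_factor):
    """Volume of overnight culture for a 1:dilution_factor inoculation."""
    return culture_ml / dilution_factor


def mass_mg_from_percent_wv(percent_wv, volume_ml):
    """Mass in mg for a given % (w/v), taking 1% (w/v) as 1 g per 100 ml."""
    grams = percent_wv / 100.0 * volume_ml
    return grams * 1000.0


def mass_ug(conc_ug_per_ml, volume_ml):
    """Mass in micrograms for a concentration given in micrograms per ml."""
    return conc_ug_per_ml * volume_ml
```

Under these assumptions, a 5 ml culture would take 0.05 ml (50 μl) of overnight culture, 500 μg of ampicillin, and 0.1 mg or 1 mg of L-arabinose for the 0.002% and 0.02% inductions, respectively.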
[0628] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 49
[0629] This example describes optimization of a DNA sequence encoding T. aurantiacus endoglucanase (EG1) for expression in yeast.
[0630] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
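The final normalization step described above, in which each chisq3 value is reported as a number of standard deviations from the mean chisq3 value, can be sketched in Python as follows. This is a minimal illustration: the use of the population standard deviation (rather than the sample standard deviation) is an assumption, and the function name and the input dictionary keyed by codon pair are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(chisq3):
    """Convert {codon_pair: chisq3 value} into {codon_pair: z score}.

    Each z score is (value - mean) / standard deviation, taken over all
    codon pairs. The population standard deviation is an assumption.
    """
    values = list(chisq3.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all chisq3 values are identical; z scores undefined")
    return {pair: (value - mu) / sigma for pair, value in chisq3.items()}
```

A codon pair would then be flagged for replacement when its z score exceeds the chosen cutoff, which is 3 in the examples herein.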
[0631] The nucleotide sequence for the gene encoding the EG1 protein was modified to optimize codon usage for S. cerevisiae. The DNA sequence encoding EG1 (SEQ ID NO: 181) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0632] A graphical display for the native gene (SEQ ID NO: 181) encoding the EG1 protein (SEQ ID NO: 182) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 44A.
[0633] The nucleotide sequence for the gene encoding the EG1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 183) was found to encode a protein (SEQ ID NO: 184) with 100% amino acid sequence identity to wild-type EG1 (SEQ ID NO: 182). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 183) encoding the EG1 protein (SEQ ID NO: 184) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 44B.
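The two properties stated above for a modified sequence, namely that no codon pair has a z score above the cutoff and that the encoded protein is unchanged, can be checked as in the following Python sketch. It is a sketch only: it uses the standard genetic code, the z score table is supplied by the caller, codon pairs absent from the table are assumed to score 0, and the function names and the toy sequences and z value in the usage note are hypothetical rather than taken from the sequence listing.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in codons(cds))


def pairs_above(cds, z_table, threshold=3.0):
    """Return (pair position, codon 1, codon 2) for pairs with z > threshold."""
    cs = codons(cds)
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(cs, cs[1:]), start=1)
        if z_table.get((a, b), 0.0) > threshold  # unknown pairs assumed 0
    ]


def is_acceptable(native, modified, z_table, threshold=3.0):
    """True if the protein is unchanged and no pair exceeds the cutoff."""
    same_protein = translate(native) == translate(modified)
    return same_protein and not pairs_above(modified, z_table, threshold)
```

For example, with a hypothetical table assigning z = 4.2 to the pair GCT-GCT, the toy sequence ATGGCTGCTAAA is flagged at pair position 2, whereas the synonymous variant ATGGCTGCCAAA encodes the same peptide and contains no flagged pair. The list returned by pairs_above also provides the position and pair data from which a display of z score against codon pair position could be drawn.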
EXAMPLE 50
[0634] This example describes optimization of a DNA sequence encoding EG1 for expression in bacteria.
[0635] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0636] The nucleotide sequence for the gene encoding the EG1 protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 181) encoding the EG1 protein (SEQ ID NO: 182) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 45A.
[0637] The nucleotide sequence for the gene encoding the EG1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 185) was found to encode a protein (SEQ ID NO: 186) with 100% amino acid sequence identity to wild-type EG1 (SEQ ID NO: 182). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 185) encoding the EG1 protein (SEQ ID NO: 186) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 45B.
EXAMPLE 51
[0638] This example describes optimization of a DNA sequence encoding EG1 for expression in P. pastoris.
[0639] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0640] The nucleotide sequence for the gene encoding the EG1 protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 181) encoding the EG1 protein (SEQ ID NO: 182) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 46A.
[0641] The nucleotide sequence for the gene encoding the EG1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 187) was found to encode a protein (SEQ ID NO: 188) with 100% amino acid sequence identity to wild-type EG1 (SEQ ID NO: 182). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 187) encoding the EG1 protein (SEQ ID NO: 188) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 46B.
EXAMPLE 52
[0642] This example describes optimization of a DNA sequence encoding EG1 for expression in K. lactis.
[0643] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0644] The nucleotide sequence for the gene encoding the EG1 protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 181) encoding the EG1 protein (SEQ ID NO: 182) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 47A.
[0645] The nucleotide sequence for the gene encoding the EG1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 189) was found to encode a protein (SEQ ID NO: 190) with 100% amino acid sequence identity to wild-type EG1 (SEQ ID NO: 182). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 189) encoding the EG1 protein (SEQ ID NO: 190) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 47B.
EXAMPLE 53
[0646] This example describes optimization of a DNA sequence encoding EG1 for expression in Z. mobilis.
[0647] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0648] The nucleotide sequence for the gene encoding the EG1 protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 181) encoding the EG1 protein (SEQ ID NO: 182) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 48A.
[0649] The nucleotide sequence for the gene encoding the EG1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 191) was found to encode a protein (SEQ ID NO: 192) with 100% amino acid sequence identity to wild-type EG1 (SEQ ID NO: 182). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 191) encoding the EG1 protein (SEQ ID NO: 192) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 48B.
EXAMPLE 54
[0650] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 50 and native EG1 protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0651] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 55
[0652] This example describes optimization of a DNA sequence encoding T. lanuginosus xylanase (XynA) for expression in yeast.
[0653] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0654] The nucleotide sequence for the gene encoding the XynA protein was modified to optimize codon usage for S. cerevisiae. The DNA sequence encoding XynA (SEQ ID NO: 193) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0655] A graphical display for the native gene (SEQ ID NO: 193) encoding the XynA protein (SEQ ID NO: 194) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 49A.
[0656] The nucleotide sequence for the gene encoding the XynA protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 195) was found to encode a protein (SEQ ID NO: 196) with 100% amino acid sequence identity to wild-type XynA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 195) encoding the XynA protein (SEQ ID NO: 196) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 49B.
EXAMPLE 56
[0657] This example describes optimization of a DNA sequence encoding XynA for expression in bacteria.
[0658] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0659] The nucleotide sequence for the gene encoding the XynA protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 193) encoding the XynA protein (SEQ ID NO: 194) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 50A.
[0660] The nucleotide sequence for the gene encoding the XynA protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 197) was found to encode a protein (SEQ ID NO: 198) with 100% amino acid sequence identity to wild-type XynA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 197) encoding the XynA protein (SEQ ID NO: 198) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 50B.
EXAMPLE 57
[0661] This example describes optimization of a DNA sequence encoding XynA for expression in P. pastoris.
[0662] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0663] The nucleotide sequence for the gene encoding the XynA protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 193) encoding the XynA protein (SEQ ID NO: 194) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 51A.
[0664] The nucleotide sequence for the gene encoding the XynA protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 199) was found to encode a protein (SEQ ID NO: 200) with 100% amino acid sequence identity to wild-type XynA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 199) encoding the XynA protein (SEQ ID NO: 200) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 51B.
EXAMPLE 58
[0665] This example describes optimization of a DNA sequence encoding XynA for expression in K. lactis.
[0666] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0667] The nucleotide sequence for the gene encoding the XynA protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 193) encoding the XynA protein (SEQ ID NO: 194) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 52A.
[0668] The nucleotide sequence for the gene encoding the XynA protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 201) was found to encode a protein (SEQ ID NO: 202) with 100% amino acid sequence identity to wild-type XynA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 201) encoding the XynA protein (SEQ ID NO: 202) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 52B.
EXAMPLE 59
[0669] This example describes optimization of a DNA sequence encoding XynA for expression in Z. mobilis.
[0670] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0671] The nucleotide sequence for the gene encoding the XynA protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 193) encoding the XynA protein (SEQ ID NO: 194) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 53A.
[0672] The nucleotide sequence for the gene encoding the XynA protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 203) was found to encode a protein (SEQ ID NO: 204) with 100% amino acid sequence identity to wild-type XynA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 203) encoding the XynA protein (SEQ ID NO: 204) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 53B.
EXAMPLE 60
[0673] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 56 and native XynA protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-CBH-II antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0674] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 61
Scheme for introduction of heterologous sequences into S. cerevisiae.
[0675] Nucleic acid constructs can be prepared, for example, as shown in Figure 54A (upper panel). Figure 54A (upper panel) shows nucleic acid constructs for expressing heterologous genes in S. cerevisiae. A yeast copy-number control element (CEN/ARS or 2 μm) was introduced into the EcoRI site of the polylinker of the bacterial vector pUC18. The PGK1 promoter (PGK1p) and CYC1 terminator (CYC1t) sequences were introduced into a unique site (SspI, B) separated by a restriction site (D, SpeI/XhoI) which can be used for cloning of the heterologous gene of interest (GENE X) by ligation or recombination rescue cloning. In order to select for this gene in yeast, the desired nutritional MARKER (URA3, TRP1, CAN1, or MET15) was introduced to the polylinker in the SmaI site flanked by recognition sites for the P1 phage Cre recombinase (loxP). Figure 54B (lower panel) shows a scheme for the integration of heterologous gene expression cassettes. Stable expression of combinations of genes is achieved through sequential or simultaneous integration of heterologous genes into yeast chromosomes via recombinational replacement of Ty1 elements (ending in delta repeats, open boxes) inserted at positions which allow substantial gene expression. Primers containing outside ends with similarity to target genomic sequences (black boxes) and inside ends which overlap the PGK1p and loxP sequences are used in a PCR reaction to amplify a fragment containing GENE X and the selectable MARKER. The PCR fragment is integrated via a double crossover in terminal regions of homology with the genome and integrants are selected. In order to recycle selectable markers, cells are transformed with a plasmid expressing the Cre recombinase, and cells in which the MARKER is lost by Cre-mediated recombination between the flanking loxP sites are selected by growth on medium containing the appropriate reverse selection agent.
Construction of vectors containing removable selectable cassettes
[0676] 1. The PGK1 promoter region was amplified from genomic S. cerevisiae DNA using primers PGK1-FOR (5'-AATATTaggcattgcaagaattactcgtgagtaagg-3') and PGK1-REV (5'-ACTAGTatatttgttgtaaaaagtagataattacttcc-3'), which place an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct. Next, the CYC terminator was amplified from plasmid pNB2258 using primers CYC1-FOR (5'-ACTAGTgatatctgcgcaCTCGAGtcatgtaattagttatgtcacgc-3') and CYC1-REV (5'-AATATTggccgcaaattaaagccttcgagcgtcccaaaaccttctc-3'). This amplified product therefore has SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end. These two cassettes were then digested with SspI and SpeI and ligated together. The ligated fragment composed of the PGK1-CYCterm with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP13 (forward direction) and pXP14 (reverse direction).
[0677] 2. The TEF-1 promoter region was amplified from genomic S. cerevisiae DNA using primers TEF-1-FOR-SspI (5'-AATATTaccgcgaatccttacatcac-3') and TEF1-REV (5'-ccACTAGTtttgtaattaaaacttagattagattgctatgc-3'), which place an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct. This was digested with SspI and SpeI. Next, the CYC1 terminator fragment described in section 1, with SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end, was ligated with the TEF1 promoter fragment. The ligated fragment composed of the TEF1-CYC1term with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP17 (forward direction) and pXP18 (reverse direction).
[0678] 3. The 2 μm origin was amplified from plasmid pRS425 using primers 2um-FOR (5'-GAATTCaacgaagcatctgtgcttcattttgtagaa-3') and 2um-REV (5'-GAATTCgtatgatccaatatcaaaggaaatgatagc-3'). These primers place EcoRI sites at each side of the 2 μm origin cassette. Following sequence verification, this cassette was ligated into the pXP13 and pXP17 vectors described above, creating vectors pXP200 and pXP400, respectively.
[0679] 4. In a separate construction series, the CEN/ARS origin was amplified from plasmid pRS315 using primers CEN/ARS-FOR (5'-GAATTCatcacgtgctataaaaataattataattt-3') and CEN/ARS-REV (5'-GAATTCgtaacttacacgcgcctcgtatcttttaatg-3'). These primers place EcoRI sites at each side of the CEN/ARS origin cassette. Following sequence verification, this cassette was ligated into the pXP13 and pXP17 vectors described above, creating vectors pXP100 and pXP300, respectively.
[0680] 5. Each of the four selection markers was amplified from plasmids with the addition of loxP recombination sites at both the 5' and 3' ends, as well as SmaI restriction sites for downstream cloning.
[0681] a. The CAN1 marker was amplified from plasmid pRS319a using primers CAN1-FOR (5'-CCCGGGATACTTCGTATAGCATACATTATACGAAGTTATgggcccattatgaatacgcacctctatgtatttccg-3') and CAN1-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATggtgaatcatcgataaaaataaatatactgag-3'). This fragment was cloned into the pCRBlunt II cloning vector from Invitrogen and sequence verified. The unique SpeI site within this cloned fragment was then replaced by site-directed mutagenesis, while preserving the amino acid context, using primers CAN1-SDM-FOR (5'-cattcaaggtactgaactcgttggtatcactgctggtg-3') and CAN1-SDM-REV (5'-caccagcagtgataccaacgagttcagtaccttgaaatg-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300 creating plasmids pXP100CAN, pXP100CAN-REV, pXP300CAN, and pXP300CAN-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400 creating plasmids pXP200CAN, pXP200CAN-REV, pXP400CAN and pXP400CAN-REV.
[0682] b. The MET15 marker was amplified from plasmid pRS401 using primers MET-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgccatcctcatgaaaactgtgtaacataataaccg-3') and MET-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgtatagtacttgtgagagaaagtaggttatac-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300 creating plasmids pXP100MET, pXP100MET-REV, pXP300MET and pXP300MET-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400 creating plasmids pXP200MET, pXP200MET-REV, pXP400MET and pXP400MET-REV.
[0683] c. The TRP1 marker was amplified from plasmid pRS314 using primers TRP-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATaacgacattactatatatataatataggaagc-3') and TRP-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATcaggcaagtgcacaaacaatacttaaataaatactactc-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300 creating plasmids pXP100TRP, pXP100TRP-REV, pXP300TRP and pXP300TRP-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400 creating plasmids pXP200TRP, pXP200TRP-REV, pXP400TRP and pXP400TRP-REV.
[0684] d. The URA3 marker was amplified from plasmid pRS116 using primers URA-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATcagggtccataaagctttcaattcatc-3') and URA-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgggtaataactgatataattaaattgaagctct-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300 creating plasmids pXP100URA, pXP100URA-REV, pXP300URA and pXP300URA-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400 creating plasmids pXP200URA, pXP200URA-REV, pXP400URA and pXP400URA-REV.
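The marker primers in step 5 share a common layout: a SmaI site, a 34-bp loxP site, and a marker-specific annealing region. The following sketch splits a primer into those three parts. It assumes the standard SmaI recognition sequence (CCCGGG) and the loxP sequence as it appears in the primers; split_marker_primer is an illustrative helper, not part of the original disclosure.

```python
SMAI = "CCCGGG"
# 34-bp loxP site: 13-bp inverted repeat, 8-bp spacer, 13-bp inverted repeat.
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

# MET-FOR as given in step 5b, with spacing artifacts removed.
MET_FOR = (
    "CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTAT"
    "gccatcctcatgaaaactgtgtaacataataaccg"
)


def split_marker_primer(primer):
    """Return (SmaI site, loxP site, annealing region), or None if the
    primer does not start with SmaI immediately followed by loxP."""
    seq = "".join(primer.split()).upper()
    prefix = SMAI + LOXP
    if not seq.startswith(prefix):
        return None
    return SMAI, LOXP, seq[len(prefix):]


if __name__ == "__main__":
    print(split_marker_primer(MET_FOR))
```

A primer that returns None from this check either lacks one of the two elements or carries a transcription error in the SmaI-loxP prefix, which makes the helper a quick screen when re-keying sequences from a printed listing.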
[0685] Since modifications will be apparent to those of skill in this art, it is intended that this invention be limited only by the scope of the appended claims.
Claims
1. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATATC (nucleotides 1474 - 1479) TTGAAT (nucleotides 802 - 807) ATCAAG (nucleotides 1477 - 1482) GCCAAG (nucleotides 526 - 531).
2. The nucleotide sequence of Claim 1, in which all 4 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
3. The nucleotide sequence of Claim 1, in which at least 3 of the following codon pair replacements have been made:
GATATC (nucleotides 1474 - 1479) replaced with GATATA TTGAAT (nucleotides 802 - 807) replaced with TTAAAT ATCAAG (nucleotides 1477 - 1482) replaced with ATAAAA GCCAAG (nucleotides 526 - 531) replaced with GCAAAA.
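As an illustration of "different codon pairs encoding identical amino acids", the sketch below translates each original and replacement hexamer from Claim 3 with the standard genetic code and confirms that the encoded dipeptide is unchanged. It assumes the standard code applies to SEQ ID NO: 25; the names CODON_TABLE, translate_pair and CLAIM_3 are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# (original pair, replacement pair) as listed in Claim 3.
CLAIM_3 = [
    ("GATATC", "GATATA"),
    ("TTGAAT", "TTAAAT"),
    ("ATCAAG", "ATAAAA"),
    ("GCCAAG", "GCAAAA"),
]


def translate_pair(hexamer):
    """Translate a six-nucleotide codon pair into its two amino acids."""
    return CODON_TABLE[hexamer[:3]] + CODON_TABLE[hexamer[3:]]


def is_synonymous(original, replacement):
    """True if both codon pairs encode the same dipeptide."""
    return translate_pair(original) == translate_pair(replacement)


if __name__ == "__main__":
    for original, replacement in CLAIM_3:
        print(original, replacement, translate_pair(original),
              is_synonymous(original, replacement))
```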
4. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCCTC (nucleotides 1405 - 1410) ATCCTC (nucleotides 892 - 897) TTCCAG (nucleotides 190 - 195) TTCCAG (nucleotides 265 - 270) GACAGC (nucleotides 1360 - 1365) TTCCCG (nucleotides 544 - 549) CAGGCG (nucleotides 457 - 462) GCGGCA (nucleotides 589 - 594) TTCCGC (nucleotides 1327 - 1332).
5. The nucleotide sequence of Claim 4, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
6. The nucleotide sequence of Claim 4, in which at least 3 of the following codon pair replacements have been made:
TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG ATCCTC (nucleotides 892 - 897) replaced with ATCCTG TTCCAG (nucleotides 190 - 195) replaced with TTCCAA TTCCAG (nucleotides 265 - 270) replaced with TTTCAG GACAGC (nucleotides 1360 - 1365) replaced with GATTCT TTCCCG (nucleotides 544 - 549) replaced with TTCCCA CAGGCG (nucleotides 457 - 462) replaced with CAAGCG GCGGCA (nucleotides 589 - 594) replaced with GCGGCT TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT.
7. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATATC (nucleotides 1474 - 1479) ATCAAG (nucleotides 1477 - 1482) TTCAAC (nucleotides 1051 - 1056) ATCAAC (nucleotides 205 - 210) ATCAAC (nucleotides 571 - 576) ATCAAC (nucleotides 880 - 885) ATCAAC (nucleotides 1078 - 1083).
8. The nucleotide sequence of Claim 7, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
9. The nucleotide sequence of Claim 7, in which at least 3 of the following codon pair replacements have been made:
GATATC (nucleotides 1474 - 1479) replaced with GACATT ATCAAG (nucleotides 1477 - 1482) replaced with ATTAAA TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT ATCAAC (nucleotides 205 - 210) replaced with ATTAAT ATCAAC (nucleotides 571 - 576) replaced with ATTAAT ATCAAC (nucleotides 880 - 885) replaced with ATTAAT ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT.
10. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAGAAG (nucleotides 175 - 180 ) TTCCAT (nucleotides 349 - 354 ) GCCAAG (nucleotides 526 - 531 ) TTCCAT (nucleotides 1426 - 1431 ) GATATC (nucleotides 1474 - 1479 ).
11. The nucleotide sequence of Claim 10, in which all of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
12. The nucleotide sequence of Claim 10, in which at least 3 of the following codon pair replacements have been made:
AAGAAG (nucleotides 175 - 180 ) replaced with AAAAAG TTCCAT (nucleotides 349 - 354 ) replaced with TTTCAT GCCAAG (nucleotides 526 - 531 ) replaced with GCCAAA TTCCAT (nucleotides 1426 - 1431 ) replaced with TTCCAC GATATC (nucleotides 1474 - 1479 ) replaced with GACATT.
13. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCCGGT (nucleotides 7 - 12)
ATCGGG (nucleotides 64 - 69)
CACAGC (nucleotides 385 - 390)
GCCAAG (nucleotides 526 - 531)
AAGCTG (nucleotides 529 - 534)
CGCTAT (nucleotides 643 - 648)
GTCGAT (nucleotides 727 - 732)
AACAGC (nucleotides 739 - 744)
GATGCC (nucleotides 916 - 921)
GCACCG (nucleotides 940 - 945)
GTGCCT (nucleotides 1000 - 1005)
GTCGAT (nucleotides 1027 - 1032)
GCAGGG (nucleotides 1165 - 1170)
CACAGC (nucleotides 1192 - 1197)
GACAGC (nucleotides 1360 - 1365).
14. The nucleotide sequence of Claim 13, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
15. The nucleotide sequence of Claim 13, in which at least 3 of the following codon pair replacements have been made:
TCCGGT (nucleotides 7 - 12) replaced with TCTGGT
ATCGGG (nucleotides 64 - 69) replaced with ATTGGT
CACAGC (nucleotides 385 - 390) replaced with CATTCT
GCCAAG (nucleotides 526 - 531) replaced with GCGAAA
AAGCTG (nucleotides 529 - 534) replaced with AAATTG
CGCTAT (nucleotides 643 - 648) replaced with CGTTAT
GTCGAT (nucleotides 727 - 732) replaced with GTTGAT
AACAGC (nucleotides 739 - 744) replaced with AATTCT
GATGCC (nucleotides 916 - 921) replaced with GATGCA
GCACCG (nucleotides 940 - 945) replaced with GCTCCG
GTGCCT (nucleotides 1000 - 1005) replaced with GTCCCT
GTCGAT (nucleotides 1027 - 1032) replaced with GTTGAT
GCAGGG (nucleotides 1165 - 1170) replaced with GCTGGC
CACAGC (nucleotides 1192 - 1197) replaced with CATTCT
GACAGC (nucleotides 1360 - 1365) replaced with GACTCT.
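Some of the listed codon pairs overlap by one codon, for example nucleotides 526 - 531 and 529 - 534 in Claim 15, so the two replacements must agree on the shared codon. The sketch below merges replacements by nucleotide position and reports any conflict. It is an illustrative consistency check under the assumption of 1-based nucleotide numbering, and merge_replacements is a hypothetical helper name.

```python
def merge_replacements(replacements):
    """Merge (start, new_hexamer) replacements into {position: base}.

    Positions are 1-based nucleotide numbers. Raises ValueError if two
    replacements assign different bases to the same position.
    """
    merged = {}
    for start, new_pair in replacements:
        for offset, base in enumerate(new_pair):
            pos = start + offset
            if merged.setdefault(pos, base) != base:
                raise ValueError(
                    f"conflict at nucleotide {pos}: {merged[pos]} vs {base}"
                )
    return merged


if __name__ == "__main__":
    # Overlapping replacements from Claim 15.
    merged = merge_replacements([(526, "GCGAAA"), (529, "AAATTG")])
    print("".join(merged[p] for p in sorted(merged)))
```

For the two overlapping replacements from Claim 15, both assign AAA to nucleotides 529 - 531, so they merge without conflict into the nine-nucleotide stretch GCGAAATTG.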
16. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
17. The nucleotide sequence of Claim 16, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
18. The nucleotide sequence of Claim 16, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
19. A laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
20. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 28-152 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
21. The laccase-encoding nucleotide sequence of Claim 20, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
22. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 161-305 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
23. The laccase-encoding nucleotide sequence of Claim 22, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
24. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 364-493 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
25. The laccase-encoding nucleotide sequence of Claim 24, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
26. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-28 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
27. The laccase-encoding nucleotide sequence of Claim 26, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
28. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 152-161 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
29. The laccase-encoding nucleotide sequence of Claim 28, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
30. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 305-364 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
31. The laccase-encoding nucleotide sequence of Claim 30, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
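Claims 20 to 31 divide the laccase sequence into amino acid ranges in which predicted pauses are either reduced (Claims 20, 22, 24) or kept equal or increased (Claims 26, 28, 30). The sketch below maps a nucleotide position to its codon index and looks up the matching ranges. It assumes that nucleotide 1 is the first base of codon 1, which is consistent with the listed pair positions all starting at 1 modulo 3; the labels "reduce" and "preserve" are shorthand introduced here, not terms from the claims.

```python
# Amino acid ranges from Claims 20-31 (inclusive, boundaries shared as written).
REGIONS = [
    (1, 28, "preserve"),     # Claim 26
    (28, 152, "reduce"),     # Claim 20
    (152, 161, "preserve"),  # Claim 28
    (161, 305, "reduce"),    # Claim 22
    (305, 364, "preserve"),  # Claim 30
    (364, 493, "reduce"),    # Claim 24
]


def codon_index(nucleotide):
    """1-based codon index for a 1-based nucleotide position."""
    return (nucleotide - 1) // 3 + 1


def regions_for(amino_acid):
    """All claimed ranges that contain the given amino acid position."""
    return [r for r in REGIONS if r[0] <= amino_acid <= r[1]]


if __name__ == "__main__":
    # Codon pair GATATC at nucleotides 1474 - 1479 spans codons 492 and 493.
    first = codon_index(1474)
    print(first, codon_index(1477), regions_for(first))
```

Because the ranges share their boundary residues as written in the claims, a boundary position such as amino acid 28 falls into two ranges, which is why regions_for returns a list rather than a single entry.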
32. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 901 - 906) CTTTCT (nucleotides 19 - 24) GACCGT (nucleotides 547 - 552) TTCCCC (nucleotides 301 - 306) TTCCCC (nucleotides 730 - 735) TTCCCC (nucleotides 988 - 993) TTCCCC (nucleotides 1051 - 1056).
33. The nucleotide sequence of Claim 32, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
34. The nucleotide sequence of Claim 32, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 901 - 906) replaced with TTGTCT CTTTCT (nucleotides 19 - 24) replaced with TTGTCT GACCGT (nucleotides 547 - 552) replaced with GATAGA TTCCCC (nucleotides 301 - 306) replaced with TTTCCA TTCCCC (nucleotides 730 - 735) replaced with TTTCCA TTCCCC (nucleotides 988 - 993) replaced with TTTCCA TTCCCC (nucleotides 1051 - 1056) replaced with TTTCCA.
35. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 901 - 906) TTCCTC (nucleotides 700 - 705) CTCGAC (nucleotides 340 - 345) CTTTCT (nucleotides 19 - 24) TTCCAG (nucleotides 880 - 885) GTCTGG (nucleotides 595 - 600) TTCCCG (nucleotides 1042 - 1047) ATCGCC (nucleotides 229 - 234) ATCGCC (nucleotides 373 - 378).
36. The nucleotide sequence of Claim 35, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
37. The nucleotide sequence of Claim 35, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 901 - 906) replaced with CTGTCT TTCCTC (nucleotides 700 - 705) replaced with TTCTTG CTCGAC (nucleotides 340 - 345) replaced with CTGGAC CTTTCT (nucleotides 19 - 24) replaced with CTGTCT TTCCAG (nucleotides 880 - 885) replaced with TTCCAA GTCTGG (nucleotides 595 - 600) replaced with GTTTGG TTCCCG (nucleotides 1042 - 1047) replaced with TTCCCA ATCGCC (nucleotides 229 - 234) replaced with ATTGCG ATCGCC (nucleotides 373 - 378) replaced with ATCGCT.
38. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCAAG (nucleotides 7 - 12) ATCAAC (nucleotides 922 - 927) GACGAA (nucleotides 343 - 348) CTTTCC (nucleotides 901 - 906).
39. The nucleotide sequence of Claim 38, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
40. The nucleotide sequence of Claim 38, in which at least 3 of the following codon pair replacements have been made:
TTCAAG (nucleotides 7 - 12) replaced with TTTAAA ATCAAC (nucleotides 922 - 927) replaced with ATTAAT GACGAA (nucleotides 343 - 348) replaced with GATGAA CTTTCC (nucleotides 901 - 906) replaced with TTGTCT.
41. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCT (nucleotides 19 - 24 ) TTTGTC (nucleotides 25 - 30 ) TTCCCC (nucleotides 301 - 306 ) GACCGT (nucleotides 547 - 552 ) TTCCCC (nucleotides 730 - 735 ) CTTTCC (nucleotides 901 - 906 ) TTCCCC (nucleotides 988 - 993 ) TTCCCC (nucleotides 1051 - 1056 ).
42. The nucleotide sequence of Claim 41, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
43. The nucleotide sequence of Claim 41, in which at least 3 of the following codon pair replacements have been made:
CTTTCT (nucleotides 19 - 24 ) replaced with TTGTCT TTTGTC (nucleotides 25 - 30 ) replaced with TTCGTT TTCCCC (nucleotides 301 - 306 ) replaced with TTCCCT GACCGT (nucleotides 547 - 552 ) replaced with GATAGA TTCCCC (nucleotides 730 - 735 ) replaced with TTCCCT CTTTCC (nucleotides 901 - 906 ) replaced with TTGTCT TTCCCC (nucleotides 988 - 993 ) replaced with TTTCCT TTCCCC (nucleotides 1051 - 1056 ) replaced with TTTCCA.
44. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCT (nucleotides 19 - 24 )
ACGGCT (nucleotides 184 - 189 )
CTGACC (nucleotides 211 - 216 )
GCCCGT (nucleotides 376 - 381 )
ATCGGT (nucleotides 424 - 429 )
CTGACC (nucleotides 604 - 609 )
AAGGCT (nucleotides 865 - 870 )
CTTTCC (nucleotides 901 - 906 )
CCCGGA (nucleotides 1063 - 1068 ).
45. The nucleotide sequence of Claim 44, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
46. The nucleotide sequence of Claim 44, in which at least 3 of the following codon pair replacements have been made:
CTTTCT (nucleotides 19 - 24 ) replaced with TTGTCT ACGGCT (nucleotides 184 - 189 ) replaced with ACCGCT CTGACC (nucleotides 211 - 216 ) replaced with TTGACC GCCCGT (nucleotides 376 - 381 ) replaced with GCTCGT ATCGGT (nucleotides 424 - 429 ) replaced with ATTGGA CTGACC (nucleotides 604 - 609 ) replaced with TTGACA AAGGCT (nucleotides 865 - 870 ) replaced with AAAGCC CTTTCC (nucleotides 901 - 906 ) replaced with TTGTCT CCCGGA (nucleotides 1063 - 1068 ) replaced with CCTGGT.
47. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
48. The nucleotide sequence of Claim 47, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
49. The nucleotide sequence of Claim 47, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
50. A lignin peroxidase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
51. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 46-287 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
52. The lignin peroxidase-encoding nucleotide sequence of Claim 51, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
53. A lignin peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-372 of wild-type lignin peroxidase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 49 and which encode amino acids 1-46 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
54. The lignin peroxidase-encoding nucleotide sequence of Claim 53, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
55. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCCCC (nucleotides 130 - 135) TTCCCC (nucleotides 721 - 726) TTCCCC (nucleotides 979 - 984) TTCCCC (nucleotides 1033 - 1038) GCCAAG (nucleotides 247 - 252).
56. The nucleotide sequence of Claim 55, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
57. The nucleotide sequence of Claim 55, in which at least 3 of the following codon pair replacements have been made:
TTCCCC (nucleotides 130 - 135) replaced with TTTCCG
TTCCCC (nucleotides 721 - 726) replaced with TTCCCA
TTCCCC (nucleotides 979 - 984) replaced with TTTCCG
TTCCCC (nucleotides 1033 - 1038) replaced with TTCCCA
GCCAAG (nucleotides 247 - 252) replaced with GCGAAG.
58. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ATTGCC (nucleotides 289 - 294) CAGGCG (nucleotides 358 - 363) CAGGCG (nucleotides 850 - 855) CAGGCG (nucleotides 1012 - 1017) CTCTCC (nucleotides 991 - 996) ATCGCC (nucleotides 244 - 249) ATCGCC (nucleotides 370 - 375) ATCGCC (nucleotides 610 - 615).
59. The nucleotide sequence of Claim 58, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
60. The nucleotide sequence of Claim 58, in which at least 3 of the following codon pair replacements have been made:
ATTGCC (nucleotides 289 - 294) replaced with ATCGCT CAGGCG (nucleotides 358 - 363) replaced with CAGGCT CAGGCG (nucleotides 850 - 855) replaced with CAGGCT CAGGCG (nucleotides 1012 - 1017) replaced with CAGGCT CTCTCC (nucleotides 991 - 996) replaced with CTGTCT ATCGCC (nucleotides 244 - 249) replaced with ATTGCG ATCGCC (nucleotides 370 - 375) replaced with ATCGCT ATCGCC (nucleotides 610 - 615) replaced with ATTGCT.
61. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least one of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCAAG (nucleotides 7 - 12 )
GACGAG (nucleotides 340 - 345 )
ACCAAG (nucleotides 532 - 537 )
GAGCTG (nucleotides 670 - 675 )
TCTCCC (nucleotides 757 - 762 )
GTCAAC (nucleotides 841 - 846 )
TTCAAG (nucleotides 871 - 876 ).
62. The nucleotide sequence of Claim 61, in which at least five of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
63. The nucleotide sequence of Claim 61, in which at least the following codon pair replacements have been made:
TTCAAG (nucleotides 7 - 12 ) replaced with TTTAAA GACGAG (nucleotides 340 - 345 ) replaced with GATGAA ACCAAG (nucleotides 532 - 537 ) replaced with ACTAAA GAGCTG (nucleotides 670 - 675 ) replaced with GAATTG TCTCCC (nucleotides 757 - 762 ) replaced with TCACCA GTCAAC (nucleotides 841 - 846 ) replaced with GTTAAT TTCAAG (nucleotides 871 - 876 ) replaced with TTTAAA.
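The position-specific replacements recited above use 1-based, inclusive nucleotide numbering. A minimal sketch of applying such replacements, with a check that the expected wild-type codon pair is actually present at each stated position, might look as follows. The function name and the toy sequence in the usage note are hypothetical; the toy sequence is not SEQ ID NO: 73.

```python
def apply_replacements(sequence, replacements):
    """Apply (original_pair, first_nucleotide, replacement_pair) edits.

    Positions are 1-based, as in the claims. Each original pair is checked
    against the unmodified input sequence before anything is written.
    If two listed pairs overlap, the later entry overwrites the shared codon.
    """
    source = sequence.upper()
    edited = list(source)
    for original, first_nt, replacement in replacements:
        if len(original) != 6 or len(replacement) != 6:
            raise ValueError("codon pairs must be six nucleotides long")
        begin = first_nt - 1  # convert to 0-based index
        found = source[begin:begin + 6]
        if found != original.upper():
            raise ValueError(
                f"expected {original} at nucleotides {first_nt}-{first_nt + 5}, found {found!r}"
            )
        edited[begin:begin + 6] = replacement.upper()
    return "".join(edited)
```

For example, with the hypothetical 15-nucleotide sequence `ATGGCTTTCAAGTAA`, the call `apply_replacements("ATGGCTTTCAAGTAA", [("TTCAAG", 7, "TTTAAA")])` rewrites nucleotides 7 - 12 and leaves the rest unchanged.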
64. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCCCC (nucleotides 130 - 135 ) GCCAAG (nucleotides 247 - 252 ) TTCCCC (nucleotides 721 - 726 ) TTCCCC (nucleotides 979 - 984 ) TTCCCC (nucleotides 1033 - 1038 ).
65. The nucleotide sequence of Claim 64, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
66. The nucleotide sequence of Claim 64, in which at least 3 of the following codon pair replacements have been made:
TTCCCC (nucleotides 130 - 135 ) replaced with TTTCCA GCCAAG (nucleotides 247 - 252 ) replaced with GCTAAA TTCCCC (nucleotides 721 - 726 ) replaced with TTTCCA TTCCCC (nucleotides 979 - 984 ) replaced with TTTCCA TTCCCC (nucleotides 1033 - 1038 ) replaced with TTCCCT.
67. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74, wherein at least one of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCAAG (nucleotides 247 - 252 ) GCCGGT (nucleotides 412 - 417 ) ATCGGT (nucleotides 421 - 426 ) GATGCC (nucleotides 556 - 561 ) GGAACG (nucleotides 646 - 651 ) CCCGGA (nucleotides 1054 - 1059 ).
68. The nucleotide sequence of Claim 67, in which at least two of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
69. The nucleotide sequence of Claim 67, in which at least the following codon pair replacements have been made:
GCCAAG (nucleotides 247 - 252 ) replaced with GCGAAA GCCGGT (nucleotides 412 - 417 ) replaced with GCTGGT ATCGGT (nucleotides 421 - 426 ) replaced with ATAGGT GATGCC (nucleotides 556 - 561 ) replaced with GATGCT GGAACG (nucleotides 646 - 651 ) replaced with GGCACA CCCGGA (nucleotides 1054 - 1059 ) replaced with CCTGGT.
70. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
71. The nucleotide sequence of Claim 70, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
72. The nucleotide sequence of Claim 70, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
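Claim 72 states its threshold as a multiple of the standard deviation of translational kinetics values for the host organism. The sketch below follows a literal reading of that wording: it takes a table of per-codon-pair values as given and selects the pairs whose value exceeds 1.5 times the standard deviation of all values in the table. How the values themselves are derived is not restated in this claim and is not modeled here. The function name, the use of the population standard deviation, and the numbers in the example table are all assumptions made for illustration.

```python
from statistics import pstdev


def pairs_above_threshold(kinetics_values, multiple=1.5):
    """Return codon pairs whose value exceeds `multiple` * standard deviation.

    `kinetics_values` maps a six-nucleotide codon pair to a host-specific
    value supplied by the caller. Population standard deviation is an
    assumption; a sample standard deviation would shift the cutoff slightly.
    """
    if len(kinetics_values) < 2:
        raise ValueError("need at least two values to compute a spread")
    cutoff = multiple * pstdev(kinetics_values.values())
    return sorted(pair for pair, value in kinetics_values.items() if value > cutoff)


# Hypothetical values for four codon pairs; not taken from the specification.
EXAMPLE_VALUES = {"AAAAAA": 0.0, "CCCCCC": 2.0, "GGGGGG": -2.0, "TTTTTT": 4.0}

if __name__ == "__main__":
    print(pairs_above_threshold(EXAMPLE_VALUES))
```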
73. A Mn-dependent peroxidase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
74. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 45-284 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pairs when expressed in the heterologous host organism.
75. The Mn-dependent peroxidase-encoding nucleotide sequence of Claim 74, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
76. A Mn-dependent peroxidase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 364 of wild-type Mn-dependent peroxidase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 73 and which encode amino acids 1-45 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
77. The Mn-dependent peroxidase-encoding nucleotide sequence of Claim 76, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
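Claims 74 and 76 define regions by amino acid position, whereas the earlier claims identify codon pairs by nucleotide position. A minimal sketch of the conversion is given below. It assumes the coding sequence starts at nucleotide 1 and is read in frame, which is consistent with every listed start position being one more than a multiple of three. The function names are illustrative, and treating a pair as inside a region only when both of its residues fall within the range is also an assumption.

```python
def codon_pair_residues(first_nt: int) -> tuple:
    """Map the first nucleotide of a codon pair to its two residue numbers.

    Assumes 1-based numbering with the reading frame starting at nucleotide 1.
    """
    if first_nt < 1 or (first_nt - 1) % 3 != 0:
        raise ValueError("start position is not on a codon boundary")
    first_residue = (first_nt - 1) // 3 + 1
    return first_residue, first_residue + 1


def pair_in_region(first_nt: int, region_start: int, region_end: int) -> bool:
    """True when both residues of the codon pair lie within the region."""
    first_residue, second_residue = codon_pair_residues(first_nt)
    return region_start <= first_residue and second_residue <= region_end


if __name__ == "__main__":
    # ATTGCC at nucleotides 289 - 294 encodes residues 97 and 98.
    print(codon_pair_residues(289), pair_in_region(289, 45, 284))
```

Under these assumptions, the pair at nucleotides 130 - 135 encodes residues 44 and 45 and therefore straddles the boundary between the two regions; how such a pair should be counted is not settled by this sketch.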
78. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGGTTC (nucleotides 1246 - 1251) GCAAGA (nucleotides 1834 - 1839) TTGAAC (nucleotides 1540 - 1545) TCTCCA (nucleotides 193 - 198) GACCGT (nucleotides 694 - 699) TTCCCC (nucleotides 1795 - 1800) GCCAAG (nucleotides 763 - 768) GCCAAG (nucleotides 1585 - 1590).
79. The nucleotide sequence of Claim 78, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
80. The nucleotide sequence of Claim 78, in which at least 3 of the following codon pair replacements have been made:
GGGTTC (nucleotides 1246 - 1251) replaced with GGTTTT GCAAGA (nucleotides 1834 - 1839) replaced with GCTAGA TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT TCTCCA (nucleotides 193 - 198) replaced with TCACCA GACCGT (nucleotides 694 - 699) replaced with GATAGA TTCCCC (nucleotides 1795 - 1800) replaced with TTTCCA GCCAAG (nucleotides 763 - 768) replaced with GCTAAA GCCAAG (nucleotides 1585 - 1590) replaced with GCTAAA.
81. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGGTG (nucleotides 877 - 882) CTCGAC (nucleotides 1240 - 1245) ATCCTC (nucleotides 1462 - 1467) CTCGGC (nucleotides 652 - 657) CTCGGC (nucleotides 952 - 957) GTCTGG (nucleotides 1252 - 1257) GACAGC (nucleotides 940 - 945) AGCCAG (nucleotides 1495 - 1500) TTCCCG (nucleotides 661 - 666) ATTGCC (nucleotides 16 - 21) ATTGCC (nucleotides 1651 - 1656) CTCGGT (nucleotides 58 - 63) CTCGGT (nucleotides 1465 - 1470) GCCTGG (nucleotides 1654 - 1659) TCGCTG (nucleotides 874 - 879) GTGATG (nucleotides 1312 - 1317) TTCCGC (nucleotides 1609 - 1614).
82. The nucleotide sequence of Claim 81, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
83. The nucleotide sequence of Claim 81, in which at least 3 of the following codon pair replacements have been made:
CTGGTG (nucleotides 877 - 882) replaced with CTGGTT CTCGAC (nucleotides 1240 - 1245) replaced with CTGGAC ATCCTC (nucleotides 1462 - 1467) replaced with ATCCTG CTCGGC (nucleotides 652 - 657) replaced with CTGGGT CTCGGC (nucleotides 952 - 957) replaced with CTGGGT GTCTGG (nucleotides 1252 - 1257) replaced with GTTTGG GACAGC (nucleotides 940 - 945) replaced with GACTCT AGCCAG (nucleotides 1495 - 1500) replaced with TCTCAG TTCCCG (nucleotides 661 - 666) replaced with TTCCCA ATTGCC (nucleotides 16 - 21) replaced with ATCGCG ATTGCC (nucleotides 1651 - 1656) replaced with ATCGCG CTCGGT (nucleotides 58 - 63) replaced with CTGGGT CTCGGT (nucleotides 1465 - 1470) replaced with CTGGGT GCCTGG (nucleotides 1654 - 1659) replaced with GCGTGG TCGCTG (nucleotides 874 - 879) replaced with AGCCTG GTGATG (nucleotides 1312 - 1317) replaced with GTTATG TTCCGC (nucleotides 1609 - 1614) replaced with TTTCGT.
84. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAACTG (nucleotides 403 - 408) TTCAAC (nucleotides 202 - 207) TTCAAC (nucleotides 751 - 756) ATCAAC (nucleotides 208 - 213) ATCAAC (nucleotides 397 - 402) ATCAAC (nucleotides 616 - 621) ATCAAC (nucleotides 841 - 846) ATCAAC (nucleotides 1276 - 1281) ATCAAC (nucleotides 1282 - 1287) GTCAAG (nucleotides 1828 - 1833) GGGTTC (nucleotides 1246 - 1251) TTGAAC (nucleotides 1540 - 1545) TTTGAC (nucleotides 1513 - 1518).
85. The nucleotide sequence of Claim 84, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
86. The nucleotide sequence of Claim 84, in which at least 3 of the following codon pair replacements have been made:
AAACTG (nucleotides 403 - 408) replaced with AAATTA TTCAAC (nucleotides 202 - 207) replaced with TTTAAC TTCAAC (nucleotides 751 - 756) replaced with TTTAAT ATCAAC (nucleotides 208 - 213) replaced with ATTAAT ATCAAC (nucleotides 397 - 402) replaced with ATTAAT ATCAAC (nucleotides 616 - 621) replaced with ATTAAC ATCAAC (nucleotides 841 - 846) replaced with ATTAAT ATCAAC (nucleotides 1276 - 1281) replaced with ATTAAC ATCAAC (nucleotides 1282 - 1287) replaced with ATTAAT GTCAAG (nucleotides 1828 - 1833) replaced with GTTAAA GGGTTC (nucleotides 1246 - 1251) replaced with GGATTT TTGAAC (nucleotides 1540 - 1545) replaced with TTAAAT TTTGAC (nucleotides 1513 - 1518) replaced with TTTGAT.
87. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GACCGT (nucleotides 694 - 699 )
GCCAAG (nucleotides 763 - 768 )
AAGAAG (nucleotides 820 - 825 )
TTCCAA (nucleotides 865 - 870 )
GGTACC (nucleotides 1048 - 1053 )
GGGTTC (nucleotides 1246 - 1251 )
GTGTTT (nucleotides 1510 - 1515 )
TTGAAC (nucleotides 1540 - 1545 )
GCCAAG (nucleotides 1585 - 1590 )
AAGAAG (nucleotides 1735 - 1740 )
TTCCCC (nucleotides 1795 - 1800 ).
88. The nucleotide sequence of Claim 87, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
89. The nucleotide sequence of Claim 87, in which at least 3 of the following codon pair replacements have been made:
GACCGT (nucleotides 694 - 699 ) replaced with GACAGA GCCAAG (nucleotides 763 - 768 ) replaced with GCTAAA AAGAAG (nucleotides 820 - 825 ) replaced with AAAAAG TTCCAA (nucleotides 865 - 870 ) replaced with TTTCAG GGTACC (nucleotides 1048 - 1053 ) replaced with GGAACT GGGTTC (nucleotides 1246 - 1251 ) replaced with GGTTTT GTGTTT (nucleotides 1510 - 1515 ) replaced with GTTTTC TTGAAC (nucleotides 1540 - 1545 ) replaced with TTAAAT GCCAAG (nucleotides 1585 - 1590 ) replaced with GCTAAA AAGAAG (nucleotides 1735 - 1740 ) replaced with AAAAAG TTCCCC (nucleotides 1795 - 1800 ) replaced with TTTCCA.
90. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCAAG (nucleotides 763 - 768 ) GACAGC (nucleotides 940 - 945 ) AACAGC (nucleotides 1198 - 1203 ) GCCTTT (nucleotides 1414 - 1419 ) GCCAAG (nucleotides 1585 - 1590 ) GCCTTT (nucleotides 1741 - 1746 ).
91. The nucleotide sequence of Claim 90, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
92. The nucleotide sequence of Claim 90, in which at least 3 of the following codon pair replacements have been made:
GCCAAG (nucleotides 763 - 768 ) replaced with GCCAAA GACAGC (nucleotides 940 - 945 ) replaced with GATTCT AACAGC (nucleotides 1198 - 1203 ) replaced with AACTCT GCCTTT (nucleotides 1414 - 1419 ) replaced with GCTTTC GCCAAG (nucleotides 1585 - 1590 ) replaced with GCGAAA GCCTTT (nucleotides 1741 - 1746 ) replaced with GCCTTC.
93. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
94. The nucleotide sequence of Claim 93, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
95. The nucleotide sequence of Claim 93, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
96. A laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
97. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 90-212 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pairs when expressed in the heterologous host organism.
98. The laccase-encoding nucleotide sequence of Claim 97, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
99. The laccase-encoding nucleotide sequence of any of Claims 97-98, wherein no replacement codon encoding amino acids 90-212 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GTCAAC when expressed in the native organism.
100. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 216-367 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pairs when expressed in the heterologous host organism.
101. The laccase-encoding nucleotide sequence of Claim 100, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
102. The laccase-encoding nucleotide sequence of any of Claims 100-101, wherein no replacement codon encoding amino acids 216-367 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GCCGAC when expressed in the native organism.
103. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 426-570 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pairs when expressed in the heterologous host organism.
104. The laccase-encoding nucleotide sequence of Claim 103, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
105. The laccase-encoding nucleotide sequence of any of Claims 103-104, wherein no replacement codon encoding amino acids 426-570 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair TTCCGC when expressed in the native organism.
106. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 1-90 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
107. The laccase-encoding nucleotide sequence of Claim 106, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
108. The laccase-encoding nucleotide sequence of any of Claims 106-107, wherein at least one replacement codon encoding amino acids 1-90 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GGTGGT when expressed in the native organism.
109. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 212-216 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
110. The laccase-encoding nucleotide sequence of Claim 109, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
111. The laccase-encoding nucleotide sequence of any of Claims 109-110, wherein at least one replacement codon encoding amino acids 212-216 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GCCAAC when expressed in the native organism.
112. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-619 of wild-type laccase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 97 and which encode amino acids 367-426 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
113. The laccase-encoding nucleotide sequence of Claim 112, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
114. The laccase-encoding nucleotide sequence of any of Claims 112-113, wherein at least one replacement codon encoding amino acids 367-426 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair CTCGAC when expressed in the native organism.
115. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAA (nucleotides 235 - 240) CTTTCT (nucleotides 670 - 675) TTTGCC (nucleotides 778 - 783) TTCCCC (nucleotides 1240 - 1245) ATCAAG (nucleotides 625 - 630) GCCAAG (nucleotides 529 - 534).
116. The nucleotide sequence of Claim 115, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
117. The nucleotide sequence of Claim 115, in which at least 3 of the following codon pair replacements have been made:
TTGAAA (nucleotides 235 - 240) replaced with TTAAAA CTTTCT (nucleotides 670 - 675) replaced with TTGTCT TTTGCC (nucleotides 778 - 783) replaced with TTTGCT TTCCCC (nucleotides 1240 - 1245) replaced with TTTCCA ATCAAG (nucleotides 625 - 630) replaced with ATTAAA GCCAAG (nucleotides 529 - 534) replaced with GCTAAA.
118. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCCTC (nucleotides 1405 - 1410) CTCGAC (nucleotides 1432 - 1437) CTTTCT (nucleotides 670 - 675) TTTGCC (nucleotides 778 - 783) ATCCTC (nucleotides 1126 - 1131) ACGCTG (nucleotides 502 - 507) TTCCAG (nucleotides 10 - 15) TTCCAG (nucleotides 193 - 198) TTCCAG (nucleotides 268 - 273) GTGGTG (nucleotides 139 - 144) GTCAGC (nucleotides 106 - 111) GTCAGC (nucleotides 1339 - 1344) AGCCAG (nucleotides 814 - 819) GCCGGG (nucleotides 1291 - 1296) CAGGCG (nucleotides 1141 - 1146) CAGGCG (nucleotides 1501 - 1506) GGCGCA (nucleotides 910 - 915) TTCCGC (nucleotides 655 - 660) TTCCGC (nucleotides 1327 - 1332) TTCTGG (nucleotides 379 - 384) CTCTCC (nucleotides 397 - 402).
119. The nucleotide sequence of Claim 118, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
120. The nucleotide sequence of Claim 118, in which at least 3 of the following codon pair replacements have been made:
TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG CTCGAC (nucleotides 1432 - 1437) replaced with CTGGAT CTTTCT (nucleotides 670 - 675) replaced with CTGTCT TTTGCC (nucleotides 778 - 783) replaced with TTCGCT ATCCTC (nucleotides 1126 - 1131) replaced with ATTCTG ACGCTG (nucleotides 502 - 507) replaced with ACCCTC TTCCAG (nucleotides 10 - 15) replaced with TTTCAG TTCCAG (nucleotides 193 - 198) replaced with TTCCAA TTCCAG (nucleotides 268 - 273) replaced with TTCCAA GTGGTG (nucleotides 139 - 144) replaced with GTTGTT GTCAGC (nucleotides 106 - 111) replaced with GTTAGC GTCAGC (nucleotides 1339 - 1344) replaced with GTGTCT AGCCAG (nucleotides 814 - 819) replaced with TCTCAG GCCGGG (nucleotides 1291 - 1296) replaced with GCTGGT CAGGCG (nucleotides 1141 - 1146) replaced with CAAGCT CAGGCG (nucleotides 1501 - 1506) replaced with CAGGCT GGCGCA (nucleotides 910 - 915) replaced with GGTGCT TTCCGC (nucleotides 655 - 660) replaced with TTTCGT TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT TTCTGG (nucleotides 379 - 384) replaced with TTTTGG CTCTCC (nucleotides 397 - 402) replaced with CTGTCT.
121. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ATCAAG (nucleotides 625 - 630) TTTGCC (nucleotides 778 - 783) TTGAAA (nucleotides 235 - 240) TTCAAC (nucleotides 1051 - 1056) TTCAAC (nucleotides 1057 - 1062) ATCAAC (nucleotides 739 - 744) ATCAAC (nucleotides 1078 - 1083) GGTATC (nucleotides 148 - 153).
122. The nucleotide sequence of Claim 121, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
123. The nucleotide sequence of Claim 121, in which at least 3 of the following codon pair replacements have been made:
ATCAAG (nucleotides 625 - 630) replaced with ATTAAA TTTGCC (nucleotides 778 - 783) replaced with TTTGCA TTGAAA (nucleotides 235 - 240) replaced with TTAAAA TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT TTCAAC (nucleotides 1057 - 1062) replaced with TTTAAC ATCAAC (nucleotides 739 - 744) replaced with ATTAAT ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT GGTATC (nucleotides 148 - 153) replaced with GGAATT.
124. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATC (nucleotides 148 - 153 ) TTGAAA (nucleotides 235 - 240 ) GCCAAG (nucleotides 529 - 534 ) TTCCCA (nucleotides 547 - 552 ) CTTTCT (nucleotides 670 - 675 ) TTTGCC (nucleotides 778 - 783 ) TTTGCT (nucleotides 871 - 876 ) TTTGTC (nucleotides 1093 - 1098 ) TTCCCC (nucleotides 1240 - 1245 ) TTTGCT (nucleotides 1444 - 1449 ).
125. The nucleotide sequence of Claim 124, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
126. The nucleotide sequence of Claim 124, in which at least 3 of the following codon pair replacements have been made:
GGTATC (nucleotides 148 - 153 ) replaced with GGAATT TTGAAA (nucleotides 235 - 240 ) replaced with TTAAAA GCCAAG (nucleotides 529 - 534 ) replaced with GCTAAA TTCCCA (nucleotides 547 - 552 ) replaced with TTCCCG CTTTCT (nucleotides 670 - 675 ) replaced with CTTAGT TTTGCC (nucleotides 778 - 783 ) replaced with TTCGCT TTTGCT (nucleotides 871 - 876 ) replaced with TTCGCT TTTGTC (nucleotides 1093 - 1098 ) replaced with TTCGTT TTCCCC (nucleotides 1240 - 1245 ) replaced with TTTCCA TTTGCT (nucleotides 1444 - 1449 ) replaced with TTCGCA.
127. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATC (nucleotides 148 - 153 ) GCAGGG (nucleotides 370 - 375 ) GCCAAG (nucleotides 529 - 534 ) ATCAAT (nucleotides 574 - 579 ) GCACCG (nucleotides 604 - 609 ) TTGGCA (nucleotides 616 - 621 ) ATCAAT (nucleotides 883 - 888 ) GTGCCT (nucleotides 1000 - 1005 ) GCGGCT (nucleotides 1144 - 1149 ) GCCAAT (nucleotides 1225 - 1230 ).
128. The nucleotide sequence of Claim 127, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
129. The nucleotide sequence of Claim 127, in which at least 3 of the following codon pair replacements have been made:
GGTATC (nucleotides 148 - 153 ) replaced with GGCATT GCAGGG (nucleotides 370 - 375 ) replaced with GCTGGA GCCAAG (nucleotides 529 - 534 ) replaced with GCTAAA ATCAAT (nucleotides 574 - 579 ) replaced with ATTAAT GCACCG (nucleotides 604 - 609 ) replaced with GCCCCA TTGGCA (nucleotides 616 - 621 ) replaced with TTGGCT ATCAAT (nucleotides 883 - 888 ) replaced with ATAAAT GTGCCT (nucleotides 1000 - 1005 ) replaced with GTACCA GCGGCT (nucleotides 1144 - 1149 ) replaced with GCTGCC GCCAAT (nucleotides 1225 - 1230 ) replaced with GCCAAC.
130. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
131. The nucleotide sequence of Claim 130, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
132. The nucleotide sequence of Claim 130, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
133. A laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
134. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 29-153 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
135. The laccase-encoding nucleotide sequence of Claim 134, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
136. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 162-306 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
137. The laccase-encoding nucleotide sequence of Claim 136, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
138. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 364-493 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
139. The laccase-encoding nucleotide sequence of Claim 138, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
140. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 1-30 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
141. The laccase-encoding nucleotide sequence of Claim 140, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
142. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 153-162 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
143. The laccase-encoding nucleotide sequence of Claim 142, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
144. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 121 and which encode amino acids 306-364 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
145. The laccase-encoding nucleotide sequence of Claim 144, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
146. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 397 - 402) TTGAAG (nucleotides 235 - 240) GGGTTC (nucleotides 868 - 873) ATCAAA (nucleotides 625 - 630) ACTTTG (nucleotides 502 - 507) GACCGT (nucleotides 187 - 192) GGCCAA (nucleotides 148 - 153) AGCGAT (nucleotides 1546 - 1551).
147. The nucleotide sequence of Claim 146, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
148. The nucleotide sequence of Claim 146, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 397 - 402) replaced with CTGTCT TTGAAG (nucleotides 235 - 240) replaced with CTGAAA GGGTTC (nucleotides 868 - 873) replaced with GGTTTC ATCAAA (nucleotides 625 - 630) replaced with ATCAAA ACTTTG (nucleotides 502 - 507) replaced with ACCCTG GACCGT (nucleotides 187 - 192) replaced with GACCGT GGCCAA (nucleotides 148 - 153) replaced with GGTCAA AGCGAT (nucleotides 1546 - 1551) replaced with TCTGAC.
149. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCAGC (nucleotides 811 - 816) CTTTCC (nucleotides 397 - 402) TTCCTC (nucleotides 1405 - 1410) ATCCTC (nucleotides 895 - 900) TTCCAG (nucleotides 10 - 15) TTCCAG (nucleotides 193 - 198) TTCCAG (nucleotides 268 - 273) TTCCAG (nucleotides 1378 - 1383) CTCTCT (nucleotides 670 - 675) GTCAGC (nucleotides 106 - 111) GTCAGC (nucleotides 1339 - 1344) AGCCAG (nucleotides 814 - 819) TTCCCG (nucleotides 547 - 552) ATTGCC (nucleotides 169 - 174) GATCTC (nucleotides 1549 - 1554) CTCGGT (nucleotides 583 - 588) TTCCGC (nucleotides 655 - 660) TTCCGC (nucleotides 1327 - 1332) TTCTGG (nucleotides 379 - 384) CTCTCC (nucleotides 22 - 27).
150. The nucleotide sequence of Claim 149, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
151. The nucleotide sequence of Claim 149, in which at least 3 of the following codon pair replacements have been made:
GCCAGC (nucleotides 811 - 816) replaced with GCTTCT CTTTCC (nucleotides 397 - 402) replaced with CTGTCT TTCCTC (nucleotides 1405 - 1410) replaced with TTCCTG ATCCTC (nucleotides 895 - 900) replaced with ATTCTG TTCCAG (nucleotides 10 - 15) replaced with TTCCAA TTCCAG (nucleotides 193 - 198) replaced with TTTCAG TTCCAG (nucleotides 268 - 273) replaced with TTTCAG TTCCAG (nucleotides 1378 - 1383) replaced with TTCCAA CTCTCT (nucleotides 670 - 675) replaced with CTGTCT GTCAGC (nucleotides 106 - 111) replaced with GTTAGC GTCAGC (nucleotides 1339 - 1344) replaced with GTTTCG AGCCAG (nucleotides 814 - 819) replaced with TCTCAG TTCCCG (nucleotides 547 - 552) replaced with TTTCCG ATTGCC (nucleotides 169 - 174) replaced with ATCGCG GATCTC (nucleotides 1549 - 1554) replaced with GACCTG CTCGGT (nucleotides 583 - 588) replaced with CTGGGT TTCCGC (nucleotides 655 - 660) replaced with TTTCGT TTCCGC (nucleotides 1327 - 1332) replaced with TTTCGT TTCTGG (nucleotides 379 - 384) replaced with TTTTGG CTCTCC (nucleotides 22 - 27) replaced with CTGTCT.
152. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAACTG (nucleotides 532 - 537) TTCAAC (nucleotides 1051 - 1056) ATCAAC (nucleotides 307 - 312) ATCAAC (nucleotides 1078 - 1083) ATCAAA (nucleotides 625 - 630) GGCCGT (nucleotides 1006 - 1011) GGGTTC (nucleotides 868 - 873) GGCCAA (nucleotides 148 - 153) CTTTCC (nucleotides 397 - 402).
153. The nucleotide sequence of Claim 152, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
154. The nucleotide sequence of Claim 152, in which at least 3 of the following codon pair replacements have been made:
AAACTG (nucleotides 532 - 537) replaced with AAATTG TTCAAC (nucleotides 1051 - 1056) replaced with TTTAAT ATCAAC (nucleotides 307 - 312) replaced with ATTAAT ATCAAC (nucleotides 1078 - 1083) replaced with ATTAAT ATCAAA (nucleotides 625 - 630) replaced with ATTAAA GGCCGT (nucleotides 1006 - 1011) replaced with GGTAGA GGGTTC (nucleotides 868 - 873) replaced with GGATTC GGCCAA (nucleotides 148 - 153) replaced with GGACAA CTTTCC (nucleotides 397 - 402) replaced with TTGTCT.
155. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGCCAA (nucleotides 148 - 153 ) GACCGT (nucleotides 187 - 192 ) TTGAAG (nucleotides 235 - 240 ) CTTTCC (nucleotides 397 - 402 ) ATCAAA (nucleotides 625 - 630 ) GGGTTC (nucleotides 868 - 873 ) GGCCGT (nucleotides 1006 - 1011 ) TTTGCT (nucleotides 1444 - 1449 ).
156. The nucleotide sequence of Claim 155, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
157. The nucleotide sequence of Claim 155, in which at least 3 of the following codon pair replacements have been made: GGCCAA (nucleotides 148 - 153 ) replaced with GGTCAA GACCGT (nucleotides 187 - 192 ) replaced with GATAGA TTGAAG (nucleotides 235 - 240 ) replaced with TTAAAA CTTTCC (nucleotides 397 - 402 ) replaced with TTGTCT ATCAAA (nucleotides 625 - 630 ) replaced with ATTAAA GGGTTC (nucleotides 868 - 873 ) replaced with GGTTTC GGCCGT (nucleotides 1006 - 1011 ) replaced with GGTAGA TTTGCT (nucleotides 1444 - 1449 ) replaced with TTTGCG.
158. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AGCCGT (nucleotides 124 - 129 ) GCCGGT (nucleotides 172 - 177 ) GGCCCC (nucleotides 295 - 300 ) TCCGGT (nucleotides 328 - 333 ) GCAGGG (nucleotides 370 - 375 ) CACAGC (nucleotides 388 - 393 ) CTCTAT (nucleotides 469 - 474 ) ACTTTG (nucleotides 502 - 507 ) ATCAAT (nucleotides 574 - 579 ) GCGGCT (nucleotides 607 - 612 ) GATGCC (nucleotides 808 - 813 ) GCCAAT (nucleotides 844 - 849 ) GCCGGT (nucleotides 874 - 879 ) GTGCCT (nucleotides 1000 - 1005 ) GCCAAT (nucleotides 1225 - 1230 ) GATGCC (nucleotides 1435 - 1440 ).
159. The nucleotide sequence of Claim 158, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
160. The nucleotide sequence of Claim 158, in which at least 3 of the following codon pair replacements have been made: AGCCGT (nucleotides 124 - 129 ) replaced with TCTCGT GCCGGT (nucleotides 172 - 177 ) replaced with GCTGGT GGCCCC (nucleotides 295 - 300 ) replaced with GGACCT TCCGGT (nucleotides 328 - 333 ) replaced with TCTGGT GCAGGG (nucleotides 370 - 375 ) replaced with GCTGGT CACAGC (nucleotides 388 - 393 ) replaced with CATTCT CTCTAT (nucleotides 469 - 474 ) replaced with TTGTAT ACTTTG (nucleotides 502 - 507 ) replaced with ACCTTG ATCAAT (nucleotides 574 - 579 ) replaced with ATTAAT GCGGCT (nucleotides 607 - 612 ) replaced with GCTGCT GATGCC (nucleotides 808 - 813 ) replaced with GACGCC GCCAAT (nucleotides 844 - 849 ) replaced with GCTAAT GCCGGT (nucleotides 874 - 879 ) replaced with GCTGGT GTGCCT (nucleotides 1000 - 1005 ) replaced with GTTCCT GCCAAT (nucleotides 1225 - 1230 ) replaced with GCTAAC GATGCC (nucleotides 1435 - 1440 ) replaced with GATGCT.
161. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
162. The nucleotide sequence of Claim 161, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
163. The nucleotide sequence of Claim 161, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
164. A laccase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
165. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 29-153 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
166. The laccase-encoding nucleotide sequence of Claim 165, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
167. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 162-306 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
168. The laccase-encoding nucleotide sequence of Claim 167, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
169. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 364-493 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
170. The laccase-encoding nucleotide sequence of Claim 169, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
171. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 1-29 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
172. The laccase-encoding nucleotide sequence of Claim 171, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
173. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 153-162 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
174. The laccase-encoding nucleotide sequence of Claim 173, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
175. A laccase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-518 of wild-type laccase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 145 and which encode amino acids 306-364 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
176. The laccase-encoding nucleotide sequence of Claim 175, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
177. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 421 - 426 ) GCCAAG (nucleotides 496 - 501 ) GATATC (nucleotides 643 - 648 ) AAGAAA (nucleotides 859 - 864 ) GCCAAG (nucleotides 1243 - 1248 ) ATCAAG (nucleotides 1264 - 1269 ) GGTATT (nucleotides 1411 - 1416 ).
178. The nucleotide sequence of Claim 177, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
179. The nucleotide sequence of Claim 177, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAA GATATC (nucleotides 643 - 648 ) replaced with GACATT AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG GCCAAG (nucleotides 1243 - 1248 ) replaced with GCTAAG ATCAAG (nucleotides 1264 - 1269 ) replaced with ATTAAA GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATA.
180. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTCTCC (nucleotides 274 - 279 ) GACAGC (nucleotides 520 - 525 ) AGCCAG (nucleotides 523 - 528 ) GACTGG (nucleotides 787 - 792 ) TTCCAG (nucleotides 934 - 939 ) GCCAGC (nucleotides 1441 - 1446 ).
181. The nucleotide sequence of Claim 180, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
182. The nucleotide sequence of Claim 180, in which at least 3 of the following codon pair replacements have been made:
CTCTCC (nucleotides 274 - 279 ) replaced with TTATCT GACAGC (nucleotides 520 - 525 ) replaced with GATTCT AGCCAG (nucleotides 523 - 528 ) replaced with TCTCAA GACTGG (nucleotides 787 - 792 ) replaced with GATTGG TTCCAG (nucleotides 934 - 939 ) replaced with TTCCAG GCCAGC (nucleotides 1441 - 1446 ) replaced with GCTTCG.
183. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 421 - 426 ) GATATC (nucleotides 643 - 648 ) AAGAAA (nucleotides 859 - 864 ) ATCAAC (nucleotides 901 - 906 ) TTCAAG (nucleotides 1057 - 1062 ) ATCAAG (nucleotides 1264 - 1269 ) GGTATT (nucleotides 1411 - 1416 ).
184. The nucleotide sequence of Claim 183, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
185. The nucleotide sequence of Claim 183, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT GATATC (nucleotides 643 - 648 ) replaced with GACATT AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG ATCAAC (nucleotides 901 - 906 ) replaced with ATTAAT TTCAAG (nucleotides 1057 - 1062 ) replaced with TTTAAA ATCAAG (nucleotides 1264 - 1269 ) replaced with ATTAAA GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATT.
186. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTTGTC (nucleotides 286 - 291 ) TTGAAC (nucleotides 421 - 426 ) GCCAAG (nucleotides 496 - 501 ) GATATC (nucleotides 643 - 648 ) AAGAAA (nucleotides 859 - 864 ) AAGAAG (nucleotides 1060 - 1065 ) GCCAAG (nucleotides 1243 - 1248 ).
187. The nucleotide sequence of Claim 186, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
188. The nucleotide sequence of Claim 186, in which at least 3 of the following codon pair replacements have been made:
TTTGTC (nucleotides 286 - 291 ) replaced with TTCGTT
TTGAAC (nucleotides 421 - 426 ) replaced with TTAAAT
GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAA
GATATC (nucleotides 643 - 648 ) replaced with GACATT
AAGAAA (nucleotides 859 - 864 ) replaced with AAAAAG
AAGAAG (nucleotides 1060 - 1065 ) replaced with AAAAAG
GCCAAG (nucleotides 1243 - 1248 ) replaced with GCTAAA.
189. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ACATGG (nucleotides 46 - 51 )
AACAGC (nucleotides 136 - 141 )
AACAGC (nucleotides 268 - 273 )
CTTTAC (nucleotides 325 - 330 )
GCCAAG (nucleotides 496 - 501 )
GACAGC (nucleotides 520 - 525 )
ATCAAT (nucleotides 550 - 555 )
CTCGAT (nucleotides 847 - 852 )
TCCGGT (nucleotides 1204 - 1209 )
GCCAAG (nucleotides 1243 - 1248 )
GGTATT (nucleotides 1411 - 1416 )
GGCCCC (nucleotides 1426 - 1431 ).
190. The nucleotide sequence of Claim 189, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
191. The nucleotide sequence of Claim 189, in which at least 3 of the following codon pair replacements have been made:
ACATGG (nucleotides 46 - 51 ) replaced with ACCTGG
AACAGC (nucleotides 136 - 141 ) replaced with AATAGT
AACAGC (nucleotides 268 - 273 ) replaced with AACTCC
CTTTAC (nucleotides 325 - 330 ) replaced with TTATAT
GCCAAG (nucleotides 496 - 501 ) replaced with GCTAAG
GACAGC (nucleotides 520 - 525 ) replaced with GATAGC
ATCAAT (nucleotides 550 - 555 ) replaced with ATCAAC
CTCGAT (nucleotides 847 - 852 ) replaced with TTAGAT
TCCGGT (nucleotides 1204 - 1209 ) replaced with AGCGGT
GCCAAG (nucleotides 1243 - 1248 ) replaced with GCAAAG
GGTATT (nucleotides 1411 - 1416 ) replaced with GGAATT
GGCCCC (nucleotides 1426 - 1431 ) replaced with GGTCCG.
192. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
193. The nucleotide sequence of Claim 192, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
194. The nucleotide sequence of Claim 192, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
195. A cellobiohydrolase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
196. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 1-434 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
197. The cellobiohydrolase-encoding nucleotide sequence of Claim 196, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
198. The cellobiohydrolase-encoding nucleotide sequence of any of Claims 196-197, wherein no replacement codon encoding amino acids 1-434 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair CTCAAC when expressed in the native organism.
199. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 465-493 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
200. The cellobiohydrolase-encoding nucleotide sequence of Claim 199, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
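Claims 197 and 200 each recite a ceiling of "no more than 150%" on the translational kinetics value of a replacement codon pair relative to the wild type codon pair. The sketch below only illustrates that arithmetic comparison. It assumes both values are positive and expressed on the same scale, which the claims do not state; the function name `within_percent_ceiling` and the numbers used are hypothetical, and the computation of translational kinetics values themselves is not reproduced here.

```python
def within_percent_ceiling(replacement_value, wild_type_value, percent=150.0):
    """Return True if replacement_value is no more than `percent` percent
    of wild_type_value.

    ASSUMPTION: both values are positive and on the same scale. With
    negative values a percentage comparison would need a convention that
    the claims do not spell out, so such inputs are rejected.
    """
    if replacement_value <= 0 or wild_type_value <= 0:
        raise ValueError("this sketch only handles positive values")
    return replacement_value <= wild_type_value * percent / 100.0


# Hypothetical values, chosen only to exercise the comparison.
print(within_percent_ceiling(2.5, 2.0))  # 2.5 is 125% of 2.0
print(within_percent_ceiling(4.0, 2.0))  # 4.0 is 200% of 2.0
```

The first call returns True and the second returns False, since 150% of 2.0 is 3.0.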
201. The cellobiohydrolase-encoding nucleotide sequence of any of Claims 199-200, wherein no replacement codon encoding amino acids 465-493 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair ATTGGC when expressed in the native organism.
202. A cellobiohydrolase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-497 of wild-type cellobiohydrolase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 169 and which encode amino acids 435-464 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
203. The cellobiohydrolase-encoding nucleotide sequence of Claim 202, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
204. The cellobiohydrolase-encoding nucleotide sequence of any of Claims 202-203, wherein at least one replacement codon encoding amino acids 435-464 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair CCTACC when expressed in the native organism.
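Claims 196, 199 and 202 restrict replacements to codon pairs encoding amino acids 1-434, 465-493 and 435-464, respectively, while the earlier claims identify codon pairs by nucleotide position. The following sketch, offered as an illustration only, maps a 1-based nucleotide range onto those amino acid regions. It assumes that nucleotide 1 of SEQ ID NO: 169 is the first base of the codon for amino acid 1; `REGIONS`, `codon_span` and `region_of` are hypothetical names, and the range 1300 - 1305 is a made-up example of a pair that straddles two regions.

```python
# Amino acid regions recited in Claims 196, 202 and 199 (1-based, inclusive).
REGIONS = {
    "amino acids 1-434": (1, 434),
    "amino acids 435-464": (435, 464),
    "amino acids 465-493": (465, 493),
}


def codon_span(nt_start, nt_end):
    """Return (first, last) 1-based codon numbers covered by a nucleotide range.

    ASSUMPTION: nucleotide 1 is the first base of codon 1.
    """
    return (nt_start - 1) // 3 + 1, (nt_end - 1) // 3 + 1


def region_of(nt_start, nt_end):
    """Name the region that fully contains the range, or None if it straddles."""
    first, last = codon_span(nt_start, nt_end)
    for name, (low, high) in REGIONS.items():
        if low <= first and last <= high:
            return name
    return None


print(region_of(1243, 1248))  # GCCAAG, codons 415-416
print(region_of(1411, 1416))  # GGTATT, codons 471-472
print(region_of(1300, 1305))  # hypothetical pair spanning codons 434-435
```

Under the stated numbering assumption, GCCAAG (nucleotides 1243 - 1248) falls within amino acids 1-434 and GGTATT (nucleotides 1411 - 1416) within amino acids 465-493.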
205. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CAGTTT (nucleotides 445 - 450 )
CAGTAC (nucleotides 571 - 576 )
CAGTAC (nucleotides 685 - 690 )
AAGGGC (nucleotides 793 - 798 )
GAGTTT (nucleotides 808 - 813 ).
206. The nucleotide sequence of Claim 205, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
207. The nucleotide sequence of Claim 205, in which at least 3 of the following codon pair replacements have been made:
CAGTTT (nucleotides 445 - 450 ) replaced with CAATTT
CAGTAC (nucleotides 571 - 576 ) replaced with CAATAT
CAGTAC (nucleotides 685 - 690 ) replaced with CAATAT
AAGGGC (nucleotides 793 - 798 ) replaced with AAGGGA
GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT.
208. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTCGGC (nucleotides 7 - 12 )
AGCCAG (nucleotides 142 - 147 )
CTGGCA (nucleotides 301 - 306 )
GATCTC (nucleotides 307 - 312 )
TTCCAG (nucleotides 415 - 420 )
TTCTGG (nucleotides 424 - 429 )
GCCGGA (nucleotides 556 - 561 )
GTCTGG (nucleotides 886 - 891 )
GCCGGG (nucleotides 913 - 918 ).
209. The nucleotide sequence of Claim 208, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
210. The nucleotide sequence of Claim 208, in which at least 3 of the following codon pair replacements have been made:
CTCGGC (nucleotides 7 - 12 ) replaced with CTGGGT
AGCCAG (nucleotides 142 - 147 ) replaced with AGCCAA
CTGGCA (nucleotides 301 - 306 ) replaced with CTCGCG
GATCTC (nucleotides 307 - 312 ) replaced with GACCTG
TTCCAG (nucleotides 415 - 420 ) replaced with TTCCAA
TTCTGG (nucleotides 424 - 429 ) replaced with TTTTGG
GCCGGA (nucleotides 556 - 561 ) replaced with GCGGGT
GTCTGG (nucleotides 886 - 891 ) replaced with GTTTGG
GCCGGG (nucleotides 913 - 918 ) replaced with GCAGGT.
211. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGCTCT (nucleotides 10 - 15 )
ACCAAG (nucleotides 82 - 87 )
CTTCCA (nucleotides 151 - 156 )
GGCTCT (nucleotides 280 - 285 )
CAGTTT (nucleotides 445 - 450 )
CACGAT (nucleotides 493 - 498 )
AAGAAG (nucleotides 790 - 795 )
GAGTTT (nucleotides 808 - 813 )
CTTCCT (nucleotides 982 - 987 ).
212. The nucleotide sequence of Claim 211, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
213. The nucleotide sequence of Claim 211, in which at least 3 of the following codon pair replacements have been made:
GGCTCT (nucleotides 10 - 15 ) replaced with GGATCT
ACCAAG (nucleotides 82 - 87 ) replaced with ACTAAA
CTTCCA (nucleotides 151 - 156 ) replaced with TTGCCA
GGCTCT (nucleotides 280 - 285 ) replaced with GGATCA
CAGTTT (nucleotides 445 - 450 ) replaced with CAATTC
CACGAT (nucleotides 493 - 498 ) replaced with CATGAT
AAGAAG (nucleotides 790 - 795 ) replaced with AAAAAG
GAGTTT (nucleotides 808 - 813 ) replaced with GAATTT
CTTCCT (nucleotides 982 - 987 ) replaced with TTGCCA.
214. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCCAA (nucleotides 100 - 105 )
CAGTTT (nucleotides 445 - 450 )
TTTGCT (nucleotides 448 - 453 )
TTTGTC (nucleotides 580 - 585 )
AAGAAG (nucleotides 790 - 795 )
GAGTTT (nucleotides 808 - 813 ).
215. The nucleotide sequence of Claim 214, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
216. The nucleotide sequence of Claim 214, in which at least 3 of the following codon pair replacements have been made:
TTCCAA (nucleotides 100 - 105 ) replaced with TTCCAG
CAGTTT (nucleotides 445 - 450 ) replaced with CAATTC
TTTGCT (nucleotides 448 - 453 ) replaced with TTCGCT
TTTGTC (nucleotides 580 - 585 ) replaced with TTCGTT
AAGAAG (nucleotides 790 - 795 ) replaced with AAAAAG
GAGTTT (nucleotides 808 - 813 ) replaced with GAGTTC.
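Some recited codon pairs overlap by one codon, for example CAGTTT (nucleotides 445 - 450) and TTTGCT (nucleotides 448 - 453) in Claim 216. The sketch below, given only as an illustration, converts the 1-based nucleotide positions into codon numbers and checks that overlapping replacements agree on the codon they share. It assumes that nucleotide 1 of SEQ ID NO: 181 is the first base of codon 1; `codon_index`, `codon_edits` and `conflicts` are hypothetical helper names.

```python
def codon_index(nt_start):
    """1-based codon number for a 1-based nucleotide position.

    ASSUMPTION: nucleotide 1 is the first base of codon 1, so a codon
    must start at position 1, 4, 7, ...
    """
    if (nt_start - 1) % 3 != 0:
        raise ValueError("position is not at a codon boundary")
    return (nt_start - 1) // 3 + 1


def codon_edits(replacements):
    """Map each codon number to the set of replacement codons proposed for it."""
    edits = {}
    for wild, start, end, repl in replacements:
        if end - start + 1 != 6 or len(wild) != 6 or len(repl) != 6:
            raise ValueError("each entry must describe a six-nucleotide pair")
        first = codon_index(start)
        for offset in (0, 1):
            codon = repl[3 * offset:3 * offset + 3]
            edits.setdefault(first + offset, set()).add(codon)
    return edits


def conflicts(replacements):
    """Codon numbers for which overlapping entries propose different codons."""
    return {n: c for n, c in codon_edits(replacements).items() if len(c) > 1}


# (wild-type pair, first nucleotide, last nucleotide, replacement) per Claim 216.
CLAIM_216 = [
    ("TTCCAA", 100, 105, "TTCCAG"),
    ("CAGTTT", 445, 450, "CAATTC"),
    ("TTTGCT", 448, 453, "TTCGCT"),
    ("TTTGTC", 580, 585, "TTCGTT"),
    ("AAGAAG", 790, 795, "AAAAAG"),
    ("GAGTTT", 808, 813, "GAGTTC"),
]

print(conflicts(CLAIM_216))  # an empty dict means no conflicting edits
```

For the Claim 216 list, both overlapping entries change the shared codon (codon 150 under the stated assumption) from TTT to TTC, so no conflict is reported.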
217. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182, wherein at least 3 of the following codon pairs of SEQ ID NO: 181 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCCGGT (nucleotides 124 - 129 )
GTCGAT (nucleotides 358 - 363 )
GCCGGA (nucleotides 556 - 561 )
GGGGCA (nucleotides 604 - 609 )
GCATGG (nucleotides 607 - 612 ).
218. The nucleotide sequence of Claim 217, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
219. The nucleotide sequence of Claim 217, in which at least 3 of the following codon pair replacements have been made:
TCCGGT (nucleotides 124 - 129 ) replaced with TCTGGT
GTCGAT (nucleotides 358 - 363 ) replaced with GTTGAT
GCCGGA (nucleotides 556 - 561 ) replaced with GCTGGT
GGGGCA (nucleotides 604 - 609 ) replaced with GGCGCG
GCATGG (nucleotides 607 - 612 ) replaced with GCGTGG.
220. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
221. The nucleotide sequence of Claim 220, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
222. The nucleotide sequence of Claim 220, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
223. An endoglucanase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
224. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 32-276 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
225. The endoglucanase-encoding nucleotide sequence of Claim 224, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
226. An endoglucanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type endoglucanase as set forth in SEQ ID NO: 182 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 181 and which encode amino acids 1-32 of SEQ ID NO: 182 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
227. The endoglucanase-encoding nucleotide sequence of Claim 226, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
228. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AGTGAC (nucleotides 58 - 63 )
AAGGGC (nucleotides 148 - 153 )
GCAAGA (nucleotides 172 - 177 )
GACCAA (nucleotides 406 - 411 )
AGCGGT (nucleotides 442 - 447 )
TTGAAT (nucleotides 493 - 498 ).
229. The nucleotide sequence of Claim 228, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
230. The nucleotide sequence of Claim 228, in which at least 3 of the following codon pair replacements have been made:
AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT
AAGGGC (nucleotides 148 - 153 ) replaced with AAAGGT
GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA
GACCAA (nucleotides 406 - 411 ) replaced with GATCAA
AGCGGT (nucleotides 442 - 447 ) replaced with TCTGGA
TTGAAT (nucleotides 493 - 498 ) replaced with TTAAAC.
231. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGCTGG (nucleotides 25 - 30 )
CTGGAA (nucleotides 91 - 96 )
GGCGGT (nucleotides 127 - 132 )
GGCTGG (nucleotides 151 - 156 )
CTCGGC (nucleotides 352 - 357 )
TACTGG (nucleotides 412 - 417 )
CGCCAG (nucleotides 424 - 429 )
ACCAGC (nucleotides 439 - 444 )
GCCTGG (nucleotides 475 - 480 ).
232. The nucleotide sequence of Claim 231, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
233. The nucleotide sequence of Claim 231, in which at least 3 of the following codon pair replacements have been made:
GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG
CTGGAA (nucleotides 91 - 96 ) replaced with CTGGAG
GGCGGT (nucleotides 127 - 132 ) replaced with GGCGGC
GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG
CTCGGC (nucleotides 352 - 357 ) replaced with CTGGGT
TACTGG (nucleotides 412 - 417 ) replaced with TATTGG
CGCCAG (nucleotides 424 - 429 ) replaced with CGTCAG
ACCAGC (nucleotides 439 - 444 ) replaced with ACCTCT
GCCTGG (nucleotides 475 - 480 ) replaced with GCGTGG.
234. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CACGAT (nucleotides 31 - 36 )
AGTGAC (nucleotides 58 - 63 )
GAGTAT (nucleotides 259 - 264 )
AACTTT (nucleotides 277 - 282 )
GTCAAC (nucleotides 370 - 375 )
GTCAAC (nucleotides 499 - 504 ).
235. The nucleotide sequence of Claim 234, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
236. The nucleotide sequence of Claim 234, in which at least 3 of the following codon pair replacements have been made:
CACGAT (nucleotides 31 - 36 ) replaced with CATGAT
AGTGAC (nucleotides 58 - 63 ) replaced with TCTGAT
GAGTAT (nucleotides 259 - 264 ) replaced with GAATAT
AACTTT (nucleotides 277 - 282 ) replaced with AATTTC
GTCAAC (nucleotides 370 - 375 ) replaced with GTTAAT
GTCAAC (nucleotides 499 - 504 ) replaced with GTGAAT.
237. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGCTGG (nucleotides 25 - 30 )
GGCTGG (nucleotides 151 - 156 )
GCAAGA (nucleotides 172 - 177 )
GGTGTT (nucleotides 193 - 198 )
AACTTT (nucleotides 277 - 282 )
GACCAA (nucleotides 406 - 411 )
GGTACC (nucleotides 445 - 450 )
TTGAAT (nucleotides 493 - 498 )
ACCGTT (nucleotides 568 - 573 ).
238. The nucleotide sequence of Claim 237, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
239. The nucleotide sequence of Claim 237, in which at least 3 of the following codon pair replacements have been made:
GGCTGG (nucleotides 25 - 30 ) replaced with GGTTGG
GGCTGG (nucleotides 151 - 156 ) replaced with GGTTGG
GCAAGA (nucleotides 172 - 177 ) replaced with GCTAGA
GGTGTT (nucleotides 193 - 198 ) replaced with GGTGTT
AACTTT (nucleotides 277 - 282 ) replaced with AATTTC
GACCAA (nucleotides 406 - 411 ) replaced with GATCAA
GGTACC (nucleotides 445 - 450 ) replaced with GGTACA
TTGAAT (nucleotides 493 - 498 ) replaced with TTAAAT
ACCGTT (nucleotides 568 - 573 ) replaced with ACTGTT.
240. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAAGGC (nucleotides 94 - 99 )
GCAAGA (nucleotides 172 - 177 )
AACAGC (nucleotides 214 - 219 )
ACCTAT (nucleotides 286 - 291 )
TCCGGT (nucleotides 301 - 306 )
GCAACG (nucleotides 529 - 534 )
GGCTAT (nucleotides 553 - 558 ).
241. The nucleotide sequence of Claim 240, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
242. The nucleotide sequence of Claim 240, in which at least 3 of the following codon pair replacements have been made:
GAAGGC (nucleotides 94 - 99 ) replaced with GAAGGA
GCAAGA (nucleotides 172 - 177 ) replaced with GCTCGT
AACAGC (nucleotides 214 - 219 ) replaced with AATTCT
ACCTAT (nucleotides 286 - 291 ) replaced with ACGTAT
TCCGGT (nucleotides 301 - 306 ) replaced with TCTGGT
GCAACG (nucleotides 529 - 534 ) replaced with GCCACC
GGCTAT (nucleotides 553 - 558 ) replaced with GGTTAT.
243. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
244. The nucleotide sequence of Claim 243, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
245. The nucleotide sequence of Claim 243, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
246. A xylanase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Kluyveromyces lactis
Zymomonas mobilis
Schizosaccharomyces pombe.
247. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 31-221 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
248. The xylanase-encoding nucleotide sequence of Claim 247, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
249. A xylanase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-225 of wild-type xylanase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 193 and which encode amino acids 1-31 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
250. The xylanase-encoding nucleotide sequence of Claim 249, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
251. An isolated polynucleotide comprising the nucleotide sequence of any of Claims 1-250.
252. An isolated polynucleotide comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 183, 185, 187, 189, 191, 195, 197, 199, 201 or 203.
253. An isolated polypeptide encoded by the nucleotide sequence of any of Claims 1-250, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194.
254. An expression system, comprising: an expression vector in a host organism, wherein the expression vector includes the polynucleotide of Claim 251 or Claim 252 operably linked to an expression control sequence.
255. An expression system, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides in accordance with Claim 251 or Claim 252, each polynucleotide being operably linked to the same or different expression control sequences.
256. A system for degrading cellulose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: endo-1,4-β-glucanase, exo-1,4-β-D-glucanase, and β-D-glucosidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
257. A system for metabolizing lignin, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: laccase, Mn-dependent peroxidase, and lignin peroxidase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
258. The system of any of Claims 254 or 256, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191.
259. The system of any of Claims 254-256, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 171, 173, 175, 177, 179, 183, 185, 187, 189 or 191.
260. The system of any of Claims 254 or 257, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
261. The system of any of Claims 254, 255 or 257, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
262. The system of any one of Claims 254-261, wherein said one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
263. The system of any of Claims 254-262, wherein each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme.
264. The system of any of Claims 254-263, wherein each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 182 or 194) under normal physiological conditions.
265. A cell comprising the polynucleotide of Claim 251 or Claim 252.
266. The cell of Claim 265, wherein said cell expresses the polypeptide encoded by said polynucleotide.
267. A method of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with the polynucleotide of Claim 251 or Claim 252 under conditions that permit the polynucleotide to be introduced into the host cell.
268. A method of expressing a polypeptide comprising: providing a cell comprising the polynucleotide of Claim 251 or Claim 252; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
269. A method of hydrolyzing a carbohydrate comprising: providing a carbohydrate comprising at least one glycosidic bond; providing a polypeptide encoded by the polynucleotide of Claim 251 or Claim 252; and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one covalent bond of said carbohydrate, whereby at least one covalent bond of said carbohydrate is hydrolyzed.
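Claims 256 and 257 recite replacing codon pairs that are predicted to cause a translational pause, such that the modification is a silent permutation. The following Python sketch illustrates only the "silent" part of that idea and is not the method of the application: the set `pause_pairs`, the helper names, and the example sequence are all hypothetical, and a real workflow would derive the pause predictions from host-specific codon pair statistics that are not reproduced here. It assumes the standard genetic code and an in-frame coding sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonyms(codon):
    """Return every codon (including the input) that encodes the same amino acid."""
    aa = CODON_TABLE[codon]
    return [c for c, a in CODON_TABLE.items() if a == aa]


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def replace_pause_pairs(dna, pause_pairs):
    """Swap each flagged codon pair for a synonymous pair that is not flagged.

    pause_pairs is a hypothetical set of (codon, codon) tuples assumed to be
    predicted pause sites. A pair is left unchanged if no synonymous
    alternative avoids the flagged set for the pair and its neighbours.
    """
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    for i in range(len(codons) - 1):
        if (codons[i], codons[i + 1]) not in pause_pairs:
            continue
        for first, second in product(synonyms(codons[i]), synonyms(codons[i + 1])):
            # Do not create a new flagged pair with the codon on either side.
            left_ok = i == 0 or (codons[i - 1], first) not in pause_pairs
            right_ok = i + 2 >= len(codons) or (second, codons[i + 2]) not in pause_pairs
            if (first, second) not in pause_pairs and left_ok and right_ok:
                codons[i], codons[i + 1] = first, second
                break
    return "".join(codons)


# Hypothetical example: flag the Ala-Lys pair GCT-AAA as a pause site.
original = "ATGGCTAAAGGTTAA"
pause_pairs = {("GCT", "AAA")}
refined = replace_pause_pairs(original, pause_pairs)
```

Comparing `translate(original)` with `translate(refined)` confirms that the substitution is silent. The claims also allow conservative amino acid substitutions, which this sketch does not attempt.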
Applications Claiming Priority (12)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US94732907P | 2007-06-29 | 2007-06-29 | |
US94727707P | 2007-06-29 | 2007-06-29 | |
US94708607P | 2007-06-29 | 2007-06-29 | |
US60/947,329 | 2007-06-29 | ||
US60/947,277 | 2007-06-29 | ||
US60/947,086 | 2007-06-29 | ||
US94761707P | 2007-07-02 | 2007-07-02 | |
US94749607P | 2007-07-02 | 2007-07-02 | |
US60/947,617 | 2007-07-02 | ||
US60/947,496 | 2007-07-02 | ||
US94778407P | 2007-07-03 | 2007-07-03 | |
US60/947,784 | 2007-07-03 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2009005564A2 (en) | 2009-01-08 |
WO2009005564A3 (en) | 2009-03-05 |
Family
ID=39734194
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2008/006379 WO2009005564A2 (en) | 2007-06-29 | 2008-05-14 | Cellulose- and hemicellulose-degradation enzyme -encoding nucleotide sequences with refined translational kinetics and methods of making same |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2009005564A2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120149065A1 (en) * | 2010-11-15 | 2012-06-14 | Edeniq, Inc. | Use of manganese peroxidase for enzymatic hydrolysis of lignocellulosic material |
WO2014028773A3 (en) * | 2012-08-16 | 2014-04-17 | Bangladesh Jute Research Institute | Lignin degrading enzymes from macrophomina phaseolina and uses thereof |
CN104357414A (en) * | 2014-11-28 | 2015-02-18 | 上海市农业科学院 | Laccase gene derived from laccaria bicolor, and applications thereof |
CN104388441A (en) * | 2014-10-30 | 2015-03-04 | 上海市农业科学院 | Laccase gene from melon anthracnose pathogens and application of laccase gene |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5082767A (en) * | 1989-02-27 | 1992-01-21 | Hatfield G Wesley | Codon pair utilization |
WO2003070957A2 (en) * | 2002-02-20 | 2003-08-28 | Novozymes A/S | Plant polypeptide production |
WO2007130606A2 (en) * | 2006-05-04 | 2007-11-15 | The Regents Of The University Of California | Analyzing translational kinetics using graphical displays of translational kinetics values of codon pairs |
WO2008000632A1 (en) * | 2006-06-29 | 2008-01-03 | Dsm Ip Assets B.V. | A method for achieving improved polypeptide expression |
2008-05-14 WO PCT/US2008/006379 patent/WO2009005564A2/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5082767A (en) * | 1989-02-27 | 1992-01-21 | Hatfield G Wesley | Codon pair utilization |
WO2003070957A2 (en) * | 2002-02-20 | 2003-08-28 | Novozymes A/S | Plant polypeptide production |
WO2007130606A2 (en) * | 2006-05-04 | 2007-11-15 | The Regents Of The University Of California | Analyzing translational kinetics using graphical displays of translational kinetics values of codon pairs |
WO2007130650A2 (en) * | 2006-05-04 | 2007-11-15 | The Regents Of The University Of California | Methods for calculating codon pair-based translational kinetics values, and methods for generating polypeptide-encoding nucleotide sequences from such values |
WO2008000632A1 (en) * | 2006-06-29 | 2008-01-03 | Dsm Ip Assets B.V. | A method for achieving improved polypeptide expression |
Non-Patent Citations (8)
Title |
---|
DATABASE EMBL [Online] 25 November 2003 (2003-11-25), "Pycnoporus sanguineus laccase mRNA, complete cds." XP002496141 retrieved from EBI accession no. EMBL:AY458017 Database accession no. AY458017 * |
GONZALEZ ET AL: "Identification of a new laccase gene and confirmation of genomic predictions by cDNA sequences of Trametes sp. I-62 laccase family" MYCOLOGICAL RESEARCH, ELSEVIER, GB, vol. 107, no. 6, 1 June 2003 (2003-06-01), pages 727-735, XP022443249 ISSN: 0953-7562 * |
GUSTAFSSON C ET AL: "Codon bias and heterologous protein expression" TRENDS IN BIOTECHNOLOGY, ELSEVIER PUBLICATIONS, CAMBRIDGE, GB, vol. 22, no. 7, 1 July 2004 (2004-07-01), pages 346-353, XP004520507 ISSN: 0167-7799 * |
HATFIELD G WESLEY ET AL: "Optimizing scaleup yield for protein production: Computationally Optimized DNA Assembly (CODA) and Translation Engineering(TM)" BIOTECHNOLOGY ANNUAL REVIEW, XX, XX, vol. 13, 1 January 2007 (2007-01-01), pages 27-42, XP009092735 * |
HATFIELD G.W. ET AL.: "Codon pair utilization bias in bacteria, yeast and mammals" [Online] 1993, CRC PRESS , BOCA RATON, LA , XP002495824 Retrieved from the Internet: URL:http://www.codagenomics.com/technology/pubs/Synthesis_Book%20Chapter%207.pdf> [retrieved on 2008-09-15] cited in the application the whole document * |
IRWIN B ET AL: "Codon pair utilization biases influence translational elongation step times" JOURNAL OF BIOLOGICAL CHEMISTRY, AMERICAN SOCIETY OF BIOLOCHEMICAL BIOLOGISTS, BIRMINGHAM, US, vol. 270, no. 39, 29 September 1995 (1995-09-29), pages 22801-22806, XP002406003 ISSN: 0021-9258 * |
KITTLE J.D., JR: "Radical Changes in the Engineering of Synthetic Genes for Protein Expression" BIOPHARM INTERNATIONAL, [Online] February 2006 (2006-02), XP002495822 Retrieved from the Internet: URL:http://www.codagenomics.com/technology/pubs/Biopharm_pdf_BP3-61-06e.pdf> [retrieved on 2008-09-15] * |
TRINH RYAN ET AL: "Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression." MOLECULAR IMMUNOLOGY, vol. 40, no. 10, January 2004 (2004-01), pages 717-722, XP002495823 ISSN: 0161-5890 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120149065A1 (en) * | 2010-11-15 | 2012-06-14 | Edeniq, Inc. | Use of manganese peroxidase for enzymatic hydrolysis of lignocellulosic material |
US8686123B2 (en) | 2010-11-15 | 2014-04-01 | Edeniq, Inc. | Use of manganese peroxidase for enzymatic hydrolysis of lignocellulosic material |
WO2014028773A3 (en) * | 2012-08-16 | 2014-04-17 | Bangladesh Jute Research Institute | Lignin degrading enzymes from macrophomina phaseolina and uses thereof |
CN104388441A (en) * | 2014-10-30 | 2015-03-04 | 上海市农业科学院 | Laccase gene from melon anthracnose pathogens and application of laccase gene |
CN104357414A (en) * | 2014-11-28 | 2015-02-18 | 上海市农业科学院 | Laccase gene derived from laccaria bicolor, and applications thereof |
Also Published As
Publication number | Publication date |
---|---|
WO2009005564A3 (en) | 2009-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10179904B2 (en) | Polypeptides having cellobiohydrolase activity and polynucleotides encoding same | |
US20200157517A1 (en) | Methods for enhancing the degradation or conversion of cellulosic material | |
AU2019206043B2 (en) | Enzyme compositions and uses thereof | |
US9677102B2 (en) | Methods and compositions for degrading cellulosic material | |
US20220279818A1 (en) | Enzyme blends and processes for producing a high protein feed ingredient from a whole stillage byproduct | |
US20240110169A1 (en) | Gh61 variants and polynucleotides encoding same | |
DK2553093T3 (en) | Cellobiohydrolase variants and polynucleotides encoding them | |
CA2905033C (en) | Expression of beta-glucosidases for hydrolysis of lignocellulose and associated oligomers | |
WO2010096562A2 (en) | Yeast cells expressing an exogenous cellulosome and methods of using the same | |
WO2010005553A1 (en) | Isolation and characterization of schizochytrium aggregatum cellobiohydrolase i (cbh 1) | |
CN111094562A (en) | Polypeptides having trehalase activity and their use in methods of producing fermentation products | |
CA2778998A1 (en) | Heterologous expression of fungal cellobiohydrolase 2 genes in yeast | |
WO2009005564A2 (en) | Cellulose- and hemicellulose-degradation enzyme -encoding nucleotide sequences with refined translational kinetics and methods of making same | |
WO2008137958A1 (en) | Cellobiohydrolase-encoding nucleotide sequences with refined translational kinetics and methods of making same | |
WO2008144012A2 (en) | Xylose- and arabinose- metabolizing enzyme -encoding nucleotide sequences with refined translational kinetics and methods of making same | |
US11053489B2 (en) | Cellobiohydrolase variants and polynucleotides encoding same | |
US11203746B2 (en) | Polypeptides having cellobiohydrolase activity and polynucleotides encoding same | |
WO2024184727A1 (en) | Alpha-amylase variants | |
WO2023170628A1 (en) | Bacterial and archaeal alpha-amylases | |
US20180216089A1 (en) | Polypeptides Having Beta-Xylosidase Activity And Polynucleotides Encoding Same | |
WO2008153676A2 (en) | Pentose phosphate pathway and fermentation enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08767804; Country of ref document: EP; Kind code of ref document: A2 |
| NENP | Non-entry into the national phase in: | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 08767804; Country of ref document: EP; Kind code of ref document: A2 |