WO2001061344A1 - Computationally targeted evolutionary design - Google Patents
Computationally targeted evolutionary design
- Publication number
- WO2001061344A1 (application PCT/US2001/005043)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- polymer
- residues
- residue
- sequences
- 238000013461 design Methods 0.000 title description 22
- 229920000642 polymer Polymers 0.000 claims abstract description 223
- 238000000034 method Methods 0.000 claims abstract description 196
- 230000035772 mutation Effects 0.000 claims abstract description 125
- 230000003993 interaction Effects 0.000 claims description 69
- 150000001413 amino acids Chemical class 0.000 claims description 68
- 125000000539 amino acid group Chemical group 0.000 claims description 61
- 230000008878 coupling Effects 0.000 claims description 44
- 238000010168 coupling process Methods 0.000 claims description 44
- 238000005859 coupling reaction Methods 0.000 claims description 44
- 238000004422 calculation algorithm Methods 0.000 claims description 36
- 239000002904 solvent Substances 0.000 claims description 31
- 125000003729 nucleotide group Chemical group 0.000 claims description 28
- 230000027455 binding Effects 0.000 claims description 23
- 230000008030 elimination Effects 0.000 claims description 18
- 238000003379 elimination reaction Methods 0.000 claims description 18
- 230000003197 catalytic effect Effects 0.000 claims description 16
- 238000012216 screening Methods 0.000 claims description 12
- 239000003446 ligand Substances 0.000 claims description 10
- 238000004590 computer program Methods 0.000 claims description 7
- 239000000758 substrate Substances 0.000 claims description 4
- 102000004169 proteins and genes Human genes 0.000 abstract description 190
- 150000007523 nucleic acids Chemical class 0.000 abstract description 83
- 102000039446 nucleic acids Human genes 0.000 abstract description 74
- 108020004707 nucleic acids Proteins 0.000 abstract description 74
- 238000004458 analytical method Methods 0.000 abstract description 20
- 230000002411 adverse Effects 0.000 abstract description 3
- 108091005461 Nucleic proteins Proteins 0.000 abstract description 2
- 108090000623 proteins and genes Proteins 0.000 description 231
- 235000018102 proteins Nutrition 0.000 description 185
- 235000001014 amino acid Nutrition 0.000 description 80
- 229940024606 amino acid Drugs 0.000 description 67
- 210000004027 cell Anatomy 0.000 description 57
- 108020004414 DNA Proteins 0.000 description 42
- 108090000765 processed proteins & peptides Proteins 0.000 description 41
- 102000004196 processed proteins & peptides Human genes 0.000 description 41
- 229920001184 polypeptide Polymers 0.000 description 40
- 229920001222 biopolymer Polymers 0.000 description 34
- 238000006467 substitution reaction Methods 0.000 description 32
- 238000002474 experimental method Methods 0.000 description 31
- 238000004364 calculation method Methods 0.000 description 28
- 230000000694 effects Effects 0.000 description 28
- 125000003275 alpha amino acid group Chemical group 0.000 description 27
- 239000013598 vector Substances 0.000 description 27
- 108010056079 Subtilisins Proteins 0.000 description 26
- 102000005158 Subtilisins Human genes 0.000 description 26
- 230000014509 gene expression Effects 0.000 description 26
- 108091028043 Nucleic acid sequence Proteins 0.000 description 25
- 108091032973 (ribonucleotides)n+m Proteins 0.000 description 24
- 239000002773 nucleotide Substances 0.000 description 24
- 101000925646 Enterobacteria phage T4 Endolysin Proteins 0.000 description 23
- 238000009826 distribution Methods 0.000 description 22
- 102000004190 Enzymes Human genes 0.000 description 21
- 108090000790 Enzymes Proteins 0.000 description 21
- 229940088598 enzyme Drugs 0.000 description 21
- 108091033319 polynucleotide Proteins 0.000 description 21
- 102000040430 polynucleotide Human genes 0.000 description 21
- 239000002157 polynucleotide Substances 0.000 description 21
- 230000009286 beneficial effect Effects 0.000 description 17
- 108091034117 Oligonucleotide Proteins 0.000 description 16
- 238000009396 hybridization Methods 0.000 description 15
- 230000007423 decrease Effects 0.000 description 14
- 230000006872 improvement Effects 0.000 description 13
- 230000008569 process Effects 0.000 description 13
- 230000000875 corresponding effect Effects 0.000 description 12
- 238000005290 field theory Methods 0.000 description 12
- 239000012634 fragment Substances 0.000 description 12
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 11
- 108020004999 messenger RNA Proteins 0.000 description 11
- 230000004048 modification Effects 0.000 description 11
- 238000012986 modification Methods 0.000 description 11
- 238000003752 polymerase chain reaction Methods 0.000 description 11
- 230000006798 recombination Effects 0.000 description 11
- ZHNUHDYFZUAESO-UHFFFAOYSA-N Formamide Chemical compound NC=O ZHNUHDYFZUAESO-UHFFFAOYSA-N 0.000 description 10
- 230000006870 function Effects 0.000 description 10
- 239000000463 material Substances 0.000 description 10
- 238000005215 recombination Methods 0.000 description 10
- 238000003860 storage Methods 0.000 description 10
- 238000013459 approach Methods 0.000 description 9
- 238000005481 NMR spectroscopy Methods 0.000 description 8
- 238000000338 in vitro Methods 0.000 description 8
- 230000001105 regulatory effect Effects 0.000 description 8
- 239000000126 substance Substances 0.000 description 8
- 239000002299 complementary DNA Substances 0.000 description 7
- 229910052739 hydrogen Inorganic materials 0.000 description 7
- 239000001257 hydrogen Substances 0.000 description 7
- 238000002703 mutagenesis Methods 0.000 description 7
- 231100000350 mutagenesis Toxicity 0.000 description 7
- phosphotriesters Chemical class 0.000 description 7
- 108091026890 Coding region Proteins 0.000 description 6
- ZMXDDKWLCZADIW-UHFFFAOYSA-N N,N-Dimethylformamide Chemical compound CN(C)C=O ZMXDDKWLCZADIW-UHFFFAOYSA-N 0.000 description 6
- MTCFGRXMJLQNBG-UHFFFAOYSA-N Serine Natural products OCC(N)C(O)=O MTCFGRXMJLQNBG-UHFFFAOYSA-N 0.000 description 6
- KZSNJWFQEVHDMF-UHFFFAOYSA-N Valine Chemical compound CC(C)C(N)C(O)=O KZSNJWFQEVHDMF-UHFFFAOYSA-N 0.000 description 6
- 238000004587 chromatography analysis Methods 0.000 description 6
- 238000012217 deletion Methods 0.000 description 6
- 230000037430 deletion Effects 0.000 description 6
- 229920002521 macromolecule Polymers 0.000 description 6
- 238000005259 measurement Methods 0.000 description 6
- 238000012360 testing method Methods 0.000 description 6
- 230000008859 change Effects 0.000 description 5
- 210000000349 chromosome Anatomy 0.000 description 5
- 230000003247 decreasing effect Effects 0.000 description 5
- 230000009881 electrostatic interaction Effects 0.000 description 5
- 238000002744 homologous recombination Methods 0.000 description 5
- 230000006801 homologous recombination Effects 0.000 description 5
- 239000000203 mixture Substances 0.000 description 5
- 239000003960 organic solvent Substances 0.000 description 5
- 238000000746 purification Methods 0.000 description 5
- 108020003175 receptors Proteins 0.000 description 5
- 102000005962 receptors Human genes 0.000 description 5
- YBJHBAHKTGYVGT-ZKWXMUAHSA-N (+)-Biotin Chemical compound N1C(=O)N[C@@H]2[C@H](CCCCC(=O)O)SC[C@@H]21 YBJHBAHKTGYVGT-ZKWXMUAHSA-N 0.000 description 4
- 108091092195 Intron Proteins 0.000 description 4
- XUIMIQQOPSSXEZ-UHFFFAOYSA-N Silicon Chemical compound [Si] XUIMIQQOPSSXEZ-UHFFFAOYSA-N 0.000 description 4
- 108090000787 Subtilisin Proteins 0.000 description 4
- 235000004279 alanine Nutrition 0.000 description 4
- 238000003556 assay Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 4
- 230000015572 biosynthetic process Effects 0.000 description 4
- 230000001413 cellular effect Effects 0.000 description 4
- 238000006243 chemical reaction Methods 0.000 description 4
- 150000001875 compounds Chemical class 0.000 description 4
- 238000000205 computational method Methods 0.000 description 4
- 238000003780 insertion Methods 0.000 description 4
- 230000037431 insertion Effects 0.000 description 4
- 239000013612 plasmid Substances 0.000 description 4
- 238000007423 screening assay Methods 0.000 description 4
- 229910052710 silicon Inorganic materials 0.000 description 4
- 239000010703 silicon Substances 0.000 description 4
- 238000007614 solvation Methods 0.000 description 4
- 241000894007 species Species 0.000 description 4
- 230000006641 stabilisation Effects 0.000 description 4
- 238000011105 stabilization Methods 0.000 description 4
- 238000013518 transcription Methods 0.000 description 4
- 230000035897 transcription Effects 0.000 description 4
- 102000053602 DNA Human genes 0.000 description 3
- 102000004163 DNA-directed RNA polymerases Human genes 0.000 description 3
- 108090000626 DNA-directed RNA polymerases Proteins 0.000 description 3
- 108010076504 Protein Sorting Signals Proteins 0.000 description 3
- AYFVYJQAPQTCCC-UHFFFAOYSA-N Threonine Natural products CC(O)C(N)C(O)=O AYFVYJQAPQTCCC-UHFFFAOYSA-N 0.000 description 3
- 239000004473 Threonine Substances 0.000 description 3
- 239000000654 additive Substances 0.000 description 3
- 230000000996 additive effect Effects 0.000 description 3
- 125000004429 atom Chemical group 0.000 description 3
- 230000000295 complement effect Effects 0.000 description 3
- 239000013078 crystal Substances 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- GNBHRKFJIUUOQI-UHFFFAOYSA-N fluorescein Chemical compound O1C(=O)C2=CC=CC=C2C21C1=CC=C(O)C=C1OC1=CC(O)=CC=C21 GNBHRKFJIUUOQI-UHFFFAOYSA-N 0.000 description 3
- 230000005714 functional activity Effects 0.000 description 3
- 230000002068 genetic effect Effects 0.000 description 3
- 230000000670 limiting effect Effects 0.000 description 3
- 238000004519 manufacturing process Methods 0.000 description 3
- 230000007246 mechanism Effects 0.000 description 3
- 238000002844 melting Methods 0.000 description 3
- 230000008018 melting Effects 0.000 description 3
- 229910052751 metal Inorganic materials 0.000 description 3
- 239000002184 metal Substances 0.000 description 3
- 150000002739 metals Chemical class 0.000 description 3
- 238000000329 molecular dynamics simulation Methods 0.000 description 3
- 230000007935 neutral effect Effects 0.000 description 3
- 238000010606 normalization Methods 0.000 description 3
- 239000002245 particle Substances 0.000 description 3
- 239000013615 primer Substances 0.000 description 3
- 230000012846 protein folding Effects 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 238000012552 review Methods 0.000 description 3
- 230000008685 targeting Effects 0.000 description 3
- 238000002424 x-ray crystallography Methods 0.000 description 3
- 241000819038 Chichester Species 0.000 description 2
- 241000588724 Escherichia coli Species 0.000 description 2
- WHUUTDBJXJRKMK-UHFFFAOYSA-N Glutamic acid Natural products OC(=O)C(N)CCC(O)=O WHUUTDBJXJRKMK-UHFFFAOYSA-N 0.000 description 2
- NYHBQMYGNKIUIF-UUOKFMHZSA-N Guanosine Chemical compound C1=NC=2C(=O)NC(N)=NC=2N1[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O NYHBQMYGNKIUIF-UUOKFMHZSA-N 0.000 description 2
- XEEYBQQBJWHFJM-UHFFFAOYSA-N Iron Chemical compound [Fe] XEEYBQQBJWHFJM-UHFFFAOYSA-N 0.000 description 2
- QNAYBMKLOCPYGJ-REOHCLBHSA-N L-alanine Chemical compound C[C@H](N)C(O)=O QNAYBMKLOCPYGJ-REOHCLBHSA-N 0.000 description 2
- CKLJMWTZIZZHCS-REOHCLBHSA-N L-aspartic acid Chemical compound OC(=O)[C@@H](N)CC(O)=O CKLJMWTZIZZHCS-REOHCLBHSA-N 0.000 description 2
- AGPKZVBTJJNPAG-WHFBIAKZSA-N L-isoleucine Chemical compound CC[C@H](C)[C@H](N)C(O)=O AGPKZVBTJJNPAG-WHFBIAKZSA-N 0.000 description 2
- FFEARJCKVFRZRR-BYPYZUCNSA-N L-methionine Chemical compound CSCC[C@H](N)C(O)=O FFEARJCKVFRZRR-BYPYZUCNSA-N 0.000 description 2
- 102000018697 Membrane Proteins Human genes 0.000 description 2
- 108010052285 Membrane Proteins Proteins 0.000 description 2
- 241001465754 Metazoa Species 0.000 description 2
- 101710163270 Nuclease Proteins 0.000 description 2
- IQFYYKKMVGJFEH-XLPZGREQSA-N Thymidine Chemical compound O=C1NC(=O)C(C)=CN1[C@@H]1O[C@H](CO)[C@@H](O)C1 IQFYYKKMVGJFEH-XLPZGREQSA-N 0.000 description 2
- 108700009124 Transcription Initiation Site Proteins 0.000 description 2
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 2
- 239000002253 acid Substances 0.000 description 2
- 150000007513 acids Chemical class 0.000 description 2
- DZBUGLKDJFMEHC-UHFFFAOYSA-N acridine Chemical compound C1=CC=CC2=CC3=CC=CC=C3N=C21 DZBUGLKDJFMEHC-UHFFFAOYSA-N 0.000 description 2
- 238000007792 addition Methods 0.000 description 2
- OIRDTQYFTABQOQ-KQYNXXCUSA-N adenosine Chemical compound C1=NC=2C(N)=NC=NC=2N1[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O OIRDTQYFTABQOQ-KQYNXXCUSA-N 0.000 description 2
- 125000003295 alanine group Chemical group N[C@@H](C)C(=O)* 0.000 description 2
- 239000000427 antigen Substances 0.000 description 2
- 108091007433 antigens Proteins 0.000 description 2
- 102000036639 antigens Human genes 0.000 description 2
- 235000003704 aspartic acid Nutrition 0.000 description 2
- 230000001580 bacterial effect Effects 0.000 description 2
- OQFSQFPPLPISGP-UHFFFAOYSA-N beta-carboxyaspartic acid Natural products OC(=O)C(N)C(C(O)=O)C(O)=O OQFSQFPPLPISGP-UHFFFAOYSA-N 0.000 description 2
- 229960002685 biotin Drugs 0.000 description 2
- 235000020958 biotin Nutrition 0.000 description 2
- 239000011616 biotin Substances 0.000 description 2
- 229910052799 carbon Inorganic materials 0.000 description 2
- 238000004113 cell culture Methods 0.000 description 2
- 230000002759 chromosomal effect Effects 0.000 description 2
- 239000000356 contaminant Substances 0.000 description 2
- 230000002596 correlated effect Effects 0.000 description 2
- 235000018417 cysteine Nutrition 0.000 description 2
- XUJNEKJLAYXESH-UHFFFAOYSA-N cysteine Natural products SCC(N)C(O)=O XUJNEKJLAYXESH-UHFFFAOYSA-N 0.000 description 2
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 2
- 238000009795 derivation Methods 0.000 description 2
- 238000002050 diffraction method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 125000001495 ethyl group Chemical group [H]C([H])([H])C([H])([H])* 0.000 description 2
- 239000013604 expression vector Substances 0.000 description 2
- 238000001943 fluorescence-activated cell sorting Methods 0.000 description 2
- 238000001502 gel electrophoresis Methods 0.000 description 2
- 125000001475 halogen functional group Chemical group 0.000 description 2
- 230000002209 hydrophobic effect Effects 0.000 description 2
- 230000005661 hydrophobic surface Effects 0.000 description 2
- 229960000310 isoleucine Drugs 0.000 description 2
- AGPKZVBTJJNPAG-UHFFFAOYSA-N isoleucine Natural products CCC(C)C(N)C(O)=O AGPKZVBTJJNPAG-UHFFFAOYSA-N 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 229930182817 methionine Natural products 0.000 description 2
- 125000002496 methyl group Chemical group [H]C([H])([H])* 0.000 description 2
- 238000010369 molecular cloning Methods 0.000 description 2
- 238000000302 molecular modelling Methods 0.000 description 2
- 238000012900 molecular simulation Methods 0.000 description 2
- 238000007899 nucleic acid hybridization Methods 0.000 description 2
- 238000012856 packing Methods 0.000 description 2
- 230000036961 partial effect Effects 0.000 description 2
- 238000001556 precipitation Methods 0.000 description 2
- 230000004952 protein activity Effects 0.000 description 2
- 238000000455 protein structure prediction Methods 0.000 description 2
- ZCCUUQDIBDJBTK-UHFFFAOYSA-N psoralen Chemical compound C1=C2OC(=O)C=CC2=CC2=C1OC=C2 ZCCUUQDIBDJBTK-UHFFFAOYSA-N 0.000 description 2
- 239000013014 purified material Substances 0.000 description 2
- 238000002708 random mutagenesis Methods 0.000 description 2
- 108020004418 ribosomal RNA Proteins 0.000 description 2
- 238000010845 search algorithm Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 238000002864 sequence alignment Methods 0.000 description 2
- 238000002741 site-directed mutagenesis Methods 0.000 description 2
- 239000007790 solid phase Substances 0.000 description 2
- 230000000087 stabilizing effect Effects 0.000 description 2
- 238000013179 statistical model Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 238000011144 upstream manufacturing Methods 0.000 description 2
- 238000005406 washing Methods 0.000 description 2
- 108020005345 3' Untranslated Regions Proteins 0.000 description 1
- VXGRJERITKFWPL-UHFFFAOYSA-N 4',5'-Dihydropsoralen Natural products C1=C2OC(=O)C=CC2=CC2=C1OCC2 VXGRJERITKFWPL-UHFFFAOYSA-N 0.000 description 1
- 108020003589 5' Untranslated Regions Proteins 0.000 description 1
- WYWHKKSPHMUBEB-UHFFFAOYSA-N 6-Mercaptoguanine Natural products N1C(N)=NC(=S)C2=C1N=CN2 WYWHKKSPHMUBEB-UHFFFAOYSA-N 0.000 description 1
- DWRXFEITVBNRMK-UHFFFAOYSA-N Beta-D-1-Arabinofuranosylthymine Natural products O=C1NC(=O)C(C)=CN1C1C(O)C(O)C(CO)O1 DWRXFEITVBNRMK-UHFFFAOYSA-N 0.000 description 1
- 239000002126 C01EB10 - Adenosine Substances 0.000 description 1
- QCMYYKRYFNMIEC-UHFFFAOYSA-N COP(O)=O Chemical class COP(O)=O QCMYYKRYFNMIEC-UHFFFAOYSA-N 0.000 description 1
- 108091092236 Chimeric RNA Proteins 0.000 description 1
- 108020004705 Codon Proteins 0.000 description 1
- 108091035707 Consensus sequence Proteins 0.000 description 1
- MIKUYHXYGGJMLM-GIMIYPNGSA-N Crotonoside Natural products C1=NC2=C(N)NC(=O)N=C2N1[C@H]1O[C@@H](CO)[C@H](O)[C@@H]1O MIKUYHXYGGJMLM-GIMIYPNGSA-N 0.000 description 1
- NYHBQMYGNKIUIF-UHFFFAOYSA-N D-guanosine Natural products C1=2NC(N)=NC(=O)C=2N=CN1C1OC(CO)C(O)C1O NYHBQMYGNKIUIF-UHFFFAOYSA-N 0.000 description 1
- 239000003155 DNA primer Substances 0.000 description 1
- 102000016928 DNA-directed DNA polymerase Human genes 0.000 description 1
- 108010014303 DNA-directed DNA polymerase Proteins 0.000 description 1
- 241000255581 Drosophila <fruit fly, genus> Species 0.000 description 1
- 101100001670 Emericella variicolor andE gene Proteins 0.000 description 1
- 230000010665 Enzyme Interactions Effects 0.000 description 1
- GHASVSINZRGABV-UHFFFAOYSA-N Fluorouracil Chemical compound FC1=CNC(=O)NC1=O GHASVSINZRGABV-UHFFFAOYSA-N 0.000 description 1
- 241000238631 Hexapoda Species 0.000 description 1
- 108060003951 Immunoglobulin Proteins 0.000 description 1
- 102000008394 Immunoglobulin Fragments Human genes 0.000 description 1
- 108010021625 Immunoglobulin Fragments Proteins 0.000 description 1
- 229930010555 Inosine Natural products 0.000 description 1
- UGQMRVRMYYASKQ-KQYNXXCUSA-N Inosine Chemical compound O[C@@H]1[C@H](O)[C@@H](CO)O[C@H]1N1C2=NC=NC(O)=C2N=C1 UGQMRVRMYYASKQ-KQYNXXCUSA-N 0.000 description 1
- ROHFNLRQFUQHCH-YFKPBYRVSA-N L-leucine Chemical compound CC(C)C[C@H](N)C(O)=O ROHFNLRQFUQHCH-YFKPBYRVSA-N 0.000 description 1
- KZSNJWFQEVHDMF-BYPYZUCNSA-N L-valine Chemical compound CC(C)[C@H](N)C(O)=O KZSNJWFQEVHDMF-BYPYZUCNSA-N 0.000 description 1
- 108091026898 Leader sequence (mRNA) Proteins 0.000 description 1
- ROHFNLRQFUQHCH-UHFFFAOYSA-N Leucine Natural products CC(C)CC(N)C(O)=O ROHFNLRQFUQHCH-UHFFFAOYSA-N 0.000 description 1
- 102000016943 Muramidase Human genes 0.000 description 1
- 108010014251 Muramidase Proteins 0.000 description 1
- 108010021466 Mutant Proteins Proteins 0.000 description 1
- 102000008300 Mutant Proteins Human genes 0.000 description 1
- 102000016349 Myosin Light Chains Human genes 0.000 description 1
- 108010067385 Myosin Light Chains Proteins 0.000 description 1
- 108010062010 N-Acetylmuramoyl-L-alanine Amidase Proteins 0.000 description 1
- 239000004677 Nylon Substances 0.000 description 1
- 108020005187 Oligonucleotide Probes Proteins 0.000 description 1
- 108700026244 Open Reading Frames Proteins 0.000 description 1
- 230000004570 RNA-binding Effects 0.000 description 1
- 230000010799 Receptor Interactions Effects 0.000 description 1
- 108020004511 Recombinant DNA Proteins 0.000 description 1
- 108091027981 Response element Proteins 0.000 description 1
- 108020004682 Single-Stranded DNA Proteins 0.000 description 1
- RYYWUUFWQRZTIU-UHFFFAOYSA-N Thiophosphoric acid Chemical class OP(O)(S)=O RYYWUUFWQRZTIU-UHFFFAOYSA-N 0.000 description 1
- GXDLGHLJTHMDII-WISUUJSJSA-N Thr-Ser Chemical compound C[C@@H](O)[C@H](N)C(=O)N[C@@H](CO)C(O)=O GXDLGHLJTHMDII-WISUUJSJSA-N 0.000 description 1
- 108091036066 Three prime untranslated region Proteins 0.000 description 1
- 108090000848 Ubiquitin Proteins 0.000 description 1
- 102000044159 Ubiquitin Human genes 0.000 description 1
- 108091023045 Untranslated Region Proteins 0.000 description 1
- 241000700605 Viruses Species 0.000 description 1
- 238000002441 X-ray diffraction Methods 0.000 description 1
- 101710185494 Zinc finger protein Proteins 0.000 description 1
- 102100023597 Zinc finger protein 816 Human genes 0.000 description 1
- 230000003213 activating effect Effects 0.000 description 1
- 229960005305 adenosine Drugs 0.000 description 1
- 239000002168 alkylating agent Substances 0.000 description 1
- 230000004075 alteration Effects 0.000 description 1
- 230000003321 amplification Effects 0.000 description 1
- 238000012863 analytical testing Methods 0.000 description 1
- 210000004102 animal cell Anatomy 0.000 description 1
- 230000002547 anomalous effect Effects 0.000 description 1
- 230000000692 anti-sense effect Effects 0.000 description 1
- 239000003125 aqueous solvent Substances 0.000 description 1
- 210000004507 artificial chromosome Anatomy 0.000 description 1
- IQFYYKKMVGJFEH-UHFFFAOYSA-N beta-L-thymidine Natural products O=C1NC(=O)C(C)=CN1C1OC(CO)C(O)C1 IQFYYKKMVGJFEH-UHFFFAOYSA-N 0.000 description 1
- 238000004166 bioassay Methods 0.000 description 1
- 238000005842 biochemical reaction Methods 0.000 description 1
- 230000004071 biological effect Effects 0.000 description 1
- 239000012620 biological material Substances 0.000 description 1
- 238000009395 breeding Methods 0.000 description 1
- 230000001488 breeding effect Effects 0.000 description 1
- 150000004657 carbamic acid derivatives Chemical class 0.000 description 1
- 239000003054 catalyst Substances 0.000 description 1
- 238000006555 catalytic reaction Methods 0.000 description 1
- 230000003915 cell function Effects 0.000 description 1
- 238000005119 centrifugation Methods 0.000 description 1
- 239000002738 chelating agent Substances 0.000 description 1
- 239000003153 chemical reaction reagent Substances 0.000 description 1
- 239000003795 chemical substances by application Substances 0.000 description 1
- 108010081370 chymotrypsin inhibitor 2 Proteins 0.000 description 1
- 238000010367 cloning Methods 0.000 description 1
- 239000013599 cloning vector Substances 0.000 description 1
- 238000002742 combinatorial mutagenesis Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000001010 compromised effect Effects 0.000 description 1
- 239000012141 concentrate Substances 0.000 description 1
- 230000001268 conjugating effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 229940104302 cytosine Drugs 0.000 description 1
- 238000004925 denaturation Methods 0.000 description 1
- 230000036425 denaturation Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 239000003599 detergent Substances 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 239000000539 dimer Substances 0.000 description 1
- NAGJZTKCGNOGPW-UHFFFAOYSA-N dithiophosphoric acid Chemical class OP(O)(S)=S NAGJZTKCGNOGPW-UHFFFAOYSA-N 0.000 description 1
- 229940079593 drug Drugs 0.000 description 1
- 239000003814 drug Substances 0.000 description 1
- 239000003623 enhancer Substances 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 229940011871 estrogen Drugs 0.000 description 1
- 239000000262 estrogen Substances 0.000 description 1
- 102000015694 estrogen receptors Human genes 0.000 description 1
- 108010038795 estrogen receptors Proteins 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000000855 fermentation Methods 0.000 description 1
- 230000004151 fermentation Effects 0.000 description 1
- 239000007850 fluorescent dye Substances 0.000 description 1
- 229960002949 fluorouracil Drugs 0.000 description 1
- 125000000524 functional group Chemical group 0.000 description 1
- 230000002538 fungal effect Effects 0.000 description 1
- 238000001641 gel filtration chromatography Methods 0.000 description 1
- 238000007429 general method Methods 0.000 description 1
- 238000010353 genetic engineering Methods 0.000 description 1
- 230000007614 genetic variation Effects 0.000 description 1
- 235000013922 glutamic acid Nutrition 0.000 description 1
- 239000004220 glutamic acid Substances 0.000 description 1
- ZDXPYRJPNDTMRX-UHFFFAOYSA-N glutamine Natural products OC(=O)C(N)CCC(N)=O ZDXPYRJPNDTMRX-UHFFFAOYSA-N 0.000 description 1
- 150000004676 glycans Chemical class 0.000 description 1
- 229940029575 guanosine Drugs 0.000 description 1
- 238000004128 high performance liquid chromatography Methods 0.000 description 1
- 239000005556 hormone Substances 0.000 description 1
- 229940088597 hormone Drugs 0.000 description 1
- 108091008039 hormone receptors Proteins 0.000 description 1
- 125000002887 hydroxy group Chemical group [H]O* 0.000 description 1
- 210000001822 immobilized cell Anatomy 0.000 description 1
- 238000003018 immunoassay Methods 0.000 description 1
- 102000018358 immunoglobulin Human genes 0.000 description 1
- 238000000126 in silico method Methods 0.000 description 1
- 238000001727 in vivo Methods 0.000 description 1
- 238000010348 incorporation Methods 0.000 description 1
- 239000004615 ingredient Substances 0.000 description 1
- 230000000977 initiatory effect Effects 0.000 description 1
- 229960003786 inosine Drugs 0.000 description 1
- 238000004255 ion exchange chromatography Methods 0.000 description 1
- 229910052742 iron Inorganic materials 0.000 description 1
- 238000001155 isoelectric focusing Methods 0.000 description 1
- 239000006166 lysate Substances 0.000 description 1
- 229960000274 lysozyme Drugs 0.000 description 1
- 235000010335 lysozyme Nutrition 0.000 description 1
- 239000004325 lysozyme Substances 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000010534 mechanism of action Effects 0.000 description 1
- 239000012528 membrane Substances 0.000 description 1
- 230000011987 methylation Effects 0.000 description 1
- 238000007069 methylation reaction Methods 0.000 description 1
- YACKEPLHDIMKIO-UHFFFAOYSA-N methylphosphonic acid Chemical compound CP(O)(O)=O YACKEPLHDIMKIO-UHFFFAOYSA-N 0.000 description 1
- 235000013336 milk Nutrition 0.000 description 1
- 239000008267 milk Substances 0.000 description 1
- 210000004080 milk Anatomy 0.000 description 1
- 238000010995 multi-dimensional NMR spectroscopy Methods 0.000 description 1
- 239000006225 natural substrate Substances 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 239000002858 neurotransmitter agent Substances 0.000 description 1
- 238000003199 nucleic acid amplification method Methods 0.000 description 1
- 229920001778 nylon Polymers 0.000 description 1
- 239000002751 oligonucleotide probe Substances 0.000 description 1
- 238000002515 oligonucleotide synthesis Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 210000003463 organelle Anatomy 0.000 description 1
- 230000001590 oxidative effect Effects 0.000 description 1
- 238000004091 panning Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000004810 partition chromatography Methods 0.000 description 1
- 239000012071 phase Substances 0.000 description 1
- PTMHPRAIXMAOOB-UHFFFAOYSA-N phosphoramidic acid Chemical class NP(O)(O)=O PTMHPRAIXMAOOB-UHFFFAOYSA-N 0.000 description 1
- 239000013600 plasmid vector Substances 0.000 description 1
- 229920000729 poly(L-lysine) polymer Polymers 0.000 description 1
- 230000008488 polyadenylation Effects 0.000 description 1
- 229920002704 polyhistidine Polymers 0.000 description 1
- 229920001282 polysaccharide Polymers 0.000 description 1
- 239000005017 polysaccharide Substances 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000002818 protein evolution Methods 0.000 description 1
- 230000009145 protein modification Effects 0.000 description 1
- 150000003254 radicals Chemical class 0.000 description 1
- 230000002285 radioactive effect Effects 0.000 description 1
- 230000009257 reactivity Effects 0.000 description 1
- 230000008707 rearrangement Effects 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 230000010076 replication Effects 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 238000004007 reversed phase HPLC Methods 0.000 description 1
- 238000005185 salting out Methods 0.000 description 1
- 150000003839 salts Chemical class 0.000 description 1
- 239000000523 sample Substances 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000028327 secretion Effects 0.000 description 1
- 238000012772 sequence design Methods 0.000 description 1
- 230000001568 sexual effect Effects 0.000 description 1
- 230000014639 sexual reproduction Effects 0.000 description 1
- 230000019491 signal transduction Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000012868 site-directed mutagenesis technique Methods 0.000 description 1
- FQENQNTWSFEDLI-UHFFFAOYSA-J sodium diphosphate Chemical compound [Na+].[Na+].[Na+].[Na+].[O-]P([O-])(=O)OP([O-])([O-])=O FQENQNTWSFEDLI-UHFFFAOYSA-J 0.000 description 1
- 229940048086 sodium pyrophosphate Drugs 0.000 description 1
- 230000005328 spin glass Effects 0.000 description 1
- 230000004083 survival effect Effects 0.000 description 1
- 230000002459 sustained effect Effects 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 235000019818 tetrasodium diphosphate Nutrition 0.000 description 1
- 239000001577 tetrasodium phosphonato phosphate Substances 0.000 description 1
- ZEMGGZBWXRYJHK-UHFFFAOYSA-N thiouracil Chemical compound O=C1C=CNC(=S)N1 ZEMGGZBWXRYJHK-UHFFFAOYSA-N 0.000 description 1
- 229950000329 thiouracil Drugs 0.000 description 1
- 229940104230 thymidine Drugs 0.000 description 1
- MNRILEROXIRVNJ-UHFFFAOYSA-N tioguanine Chemical compound N1C(N)=NC(=S)C2=NC=N[C]21 MNRILEROXIRVNJ-UHFFFAOYSA-N 0.000 description 1
- 229960003087 tioguanine Drugs 0.000 description 1
- 210000001519 tissue Anatomy 0.000 description 1
- 239000003053 toxin Substances 0.000 description 1
- 231100000765 toxin Toxicity 0.000 description 1
- 108700012359 toxins Proteins 0.000 description 1
- 230000002103 transcriptional effect Effects 0.000 description 1
- 238000001890 transfection Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000009261 transgenic effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 239000013638 trimer Substances 0.000 description 1
- WFKWXMTUELFFGS-UHFFFAOYSA-N tungsten Chemical compound [W] WFKWXMTUELFFGS-UHFFFAOYSA-N 0.000 description 1
- 238000005199 ultracentrifugation Methods 0.000 description 1
- 229940035893 uracil Drugs 0.000 description 1
- 239000004474 valine Substances 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 108700026220 vif Genes Proteins 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
Definitions
- the invention relates to biomolecular engineering and design, including methods for the engineering and design of biopolymers such as proteins and nucleic acids.
- the invention relates to methods for directed evolution, including in vitro directed evolution, of biopolymers such as proteins and nucleic acids.
- the invention also relates to computational methods for identifying residues of a biopolymer (e.g., nucleotide residues of a nucleic acid or amino acid residues of a polypeptide) where mutations may produce beneficial results, such as one or more improved properties.
- improvements are obtained while minimally disrupting a desired biopolymer property, such as stability or functionality.
- Disruption is less likely when biopolymers are mutated at structurally tolerant mutation sites determined according to the invention. This provides a targeted approach for obtaining mutant or hybrid biopolymers with improved properties using directed evolution techniques. More particularly, the invention is useful in the design of hybrid polypeptides having new or improved properties.
- the invention is concerned primarily with biopolymers such as polynucleotides (chains of nucleic acids) and polypeptides (proteins).
- Proteins are polypeptides that are useful to living organisms. For example, they provide structures in the body, do physical or chemical work, or act as catalysts for chemical reactions (i.e. as enzymes).
- Proteins are made by cells according to genetic information encoded, translated and transcribed by polynucleotides (DNA and RNA). It is often desirable to modify proteins so that they have new or improved properties. For example, a protein may be altered to increase its biological activity (e.g. its potency as an enzyme), or to improve its stability under different environmental conditions (e.g. temperature, organic solvent), or to change its function (e.g. to catalyze a different chemical reaction).
- Directed evolution techniques attempt to alter the properties of a biopolymer (e.g., a protein or a nucleic acid) by accumulating stepwise improvements through iterations of random mutagenesis, recombination and screening (see, e.g., Moore & Arnold, Nature Biotechnology 1996, 14:458; Miyazaki et al., J. Mol. Biol. 2000, 297:1015-1026; Arnold, Adv. Protein Chem. 2000, 55:ix-xi). Broadly speaking, these methods work by speeding up the natural processes of evolution. Changes in genetic material (e.g. mutations) are rapidly and artificially induced, typically in cells that can be easily and quickly grown in cell culture (e.g. outside the body).
- the resulting mutants are rapidly evaluated to identify new or improved properties or changes of interest.
- Genetic recombination methods have been widely applied to accelerate in vitro protein evolution (see, e.g., Stemmer, Proc. Natl. Acad. Sci. 1994, 91:10747; Stemmer, Nature 1994, 370:389; Zhao & Arnold, Nucleic Acids Res. 1997, 25:1307; Zhao et al., Nature Biotechnology 1998, 49:290).
- in vitro recombination methods include DNA shuffling, random-priming recombination, and the staggered extension process (StEP) (see Arnold & Wintrode, Enzymes, Directed Evolution, in Encyclopedia of bioprocess technology: fermentation, biocatalysis, and bioseparation 1999, 2:971).
- Computational design, by contrast, has developed separately from directed evolution and is a fundamentally different approach (Street & Mayo, Structure 1999, 7:R105). Unlike the essentially random approach of directed evolution, computational design attempts to predict and then make the changes or mutations that will be beneficial or useful. Thus, the general objective of computational design is to identify particular interactions in a protein (or other biopolymer) that lead to desirable properties, and then modify the biopolymer sequence to optimize those interactions. For example, a force field model is typically used to quantitatively describe interactions between amino acid residues in a protein.
- an amino acid sequence may then be computed, at least in theory, to globally optimize these interactions (see, e.g., Malakauskas & Mayo, Nature Structural Biology 1998, 5:470; Dahiyat & Mayo, Science 1997, 278:82).
- Computational design can effectively search a large sequence space, that is, a large number of sequences (e.g., > 10^26) (see Dahiyat & Mayo, Science 1997, 278:82).
- the technique is currently limited by the size of the biopolymer.
- the largest full sequence design accomplished to date is a 28-mer zinc finger protein (id.).
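To give a sense of the scale involved, the short sketch below counts the sequences available to a polypeptide of a given length. It assumes, for illustration only, that every position may hold any of the 20 standard amino acids; the function names are hypothetical and are not taken from the source.

```python
import math

# Assumption: each position can be any of the 20 standard amino acids.
AMINO_ACID_ALPHABET_SIZE = 20


def sequence_space_size(num_residues: int,
                        alphabet_size: int = AMINO_ACID_ALPHABET_SIZE) -> int:
    """Return the number of distinct sequences of the given length."""
    if num_residues < 0:
        raise ValueError("num_residues must be non-negative")
    return alphabet_size ** num_residues


def log10_sequence_space(num_residues: int,
                         alphabet_size: int = AMINO_ACID_ALPHABET_SIZE) -> float:
    """Return log10 of the sequence-space size (avoids huge integers)."""
    return num_residues * math.log10(alphabet_size)


if __name__ == "__main__":
    for n in (20, 28, 100):
        print(f"{n} residues: about 10^{log10_sequence_space(n):.1f} sequences")
```

Under this assumption a chain of only 20 residues already exceeds 10^26 possible sequences, which illustrates why neither exhaustive computation nor any practical screen can cover the space for a full-length protein.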
- Partial designs e.g. limiting the number of residues calculated
- the technique currently is based on calculating the molecule's conformational energy, i.e. the relative energy of the molecule's folded and unfolded states.
- current computational methods have only been used to improve a molecule's stability.
- the technique has not been used to improve other properties of biopolymers, such as activity, selectivity, efficiency, or other characteristics of biological fitness.
- Directed evolution methods have the benefit of improving any property in a molecule that can be detected and/or captured by a screen, for example catalytic activity of an enzyme.
- one effective and widely used directed evolution method involves production of a library of mutants from a parent sequence, e.g., by using error-prone PCR to produce random point mutations (see Moore & Arnold, Nature Biotechnology 1996, 14:458; Miyazaki et al., J. Mol. Biol. 2000, 297:1015-1026).
- the technique is limited by several factors, one of which is the practical size of the screen (Zhao & Arnold, Protein Engineering 1999, 12:47).
- any real screening or selection assay can only search a very small fraction of the possible sequences.
- Exemplary directed evolution techniques include DNA shuffling, random-primer extension, and StEP recombination.
- the parental DNA is enzymatically digested into fragments which can be reassembled into offspring genes (Stemmer, Proc. Natl. Acad. Sci. 1994, 91:10747; Stemmer, Nature 1994, 370:389; Zhao & Arnold, Nucleic Acids Res. 1997, 25:1307).
- template DNA sequences are primed with random-sequence primers and then extended by DNA polymerase to create fragments.
- the template is removed and the fragments are reassembled into full length genes, as in the final step of DNA shuffling (Shao et al., Nuc. Acids Res. 1998, 26:681).
- the number of cut points can be increased by starting with smaller fragments or by limiting the extension reaction.
- StEP recombination differs from the first two methods because it does not use gene fragments (Zhao et al, Nat. Biotechnology 1998, 49:290).
- the template genes are primed and extended before denaturation and reannealing. As the fragments grow, they reanneal to new templates and thus combine information from multiple parents. This process is cycled hundreds of times until a full length offspring gene is formed.
- Mutations to a polymer are less likely to have an adverse effect on the "fitness" of the polymer when only "structurally tolerant" residues are mutated.
- the structurally tolerant residues are preferably ones that have few and/or weak (if any) coupling interactions with other residues in the polymer.
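A minimal sketch of this selection criterion follows. It assumes that pairwise coupling strengths are already available as a symmetric matrix; the matrix values, the threshold, and the function names are all hypothetical and serve only to show how residues with few or weak couplings might be picked out.

```python
def coupling_counts(coupling, threshold):
    """Count, for each residue, the partners whose coupling magnitude
    is at least `threshold` (self-coupling on the diagonal is ignored)."""
    n = len(coupling)
    return [
        sum(1 for j in range(n) if j != i and abs(coupling[i][j]) >= threshold)
        for i in range(n)
    ]


def tolerant_residues(coupling, threshold, max_couplings):
    """Return indices of residues with at most `max_couplings` strong couplings."""
    counts = coupling_counts(coupling, threshold)
    return [i for i, c in enumerate(counts) if c <= max_couplings]


if __name__ == "__main__":
    # Hypothetical symmetric coupling matrix for a four-residue toy polymer.
    toy_coupling = [
        [0.0, 1.2, 0.1, 0.9],
        [1.2, 0.0, 0.05, 0.0],
        [0.1, 0.05, 0.0, 0.2],
        [0.9, 0.0, 0.2, 0.0],
    ]
    print(coupling_counts(toy_coupling, threshold=0.5))
    print(tolerant_residues(toy_coupling, threshold=0.5, max_couplings=0))
```

In this toy case residue 2 has no coupling above the threshold and would be the first candidate for mutation, while residue 0, which is strongly coupled to two partners, would be avoided.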
- Applicants have discovered novel techniques for identifying structurally tolerant residues in a polymer sequence. These methods are straightforward and are computationally tractable. Accordingly, a skilled artisan can readily use these methods to identify the residues of a particular polymer sequence that are structurally tolerant, and may selectively mutate those residues to generate compatible mutants that do not adversely affect that particular polymer's properties of interest.
- mutants are more likely to have one or more properties of interest that are improved over the properties of the parent polymer.
- There is significant overlap between tolerant mutations and beneficial mutations.
- a skilled artisan may more readily and efficiently identify novel sequences with improved properties than if the artisan randomly mutated the polymer.
- the invention therefore provides methods for selecting residues of a polymer sequence for mutation by obtaining or determining the structural tolerance for residues of the polymer sequence, and selecting structurally tolerant residues for mutation.
- the polymers may be any type of polymer, including biopolymers such as, but not limited to, nucleic acids (comprising a sequence of nucleotide residues) and proteins or polypeptides (comprising a sequence of amino acid residues).
- the invention also provides numerous methods for determining the structural tolerance of residues in a polymer including, in preferred embodiments, the site entropy of the residues.
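As an illustration of how a site entropy could be evaluated, the sketch below assumes the usual Shannon form, s_i = -Σ_a p_i(a) ln p_i(a), where p_i(a) is the probability of finding residue type a at site i. Both the formula and the way the probabilities are obtained are assumptions here; this passage does not specify them.

```python
import math


def site_entropy(probabilities):
    """Shannon entropy (natural log) of one site's residue-type distribution.

    `probabilities` holds non-negative weights for each residue type at the
    site; they are normalised here so they need not sum to exactly one.
    """
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    return -sum((p / total) * math.log(p / total)
                for p in probabilities if p > 0)


if __name__ == "__main__":
    uniform = [1.0] * 20           # every amino acid equally acceptable
    fixed = [1.0] + [0.0] * 19     # only one amino acid tolerated
    print(f"uniform site: {site_entropy(uniform):.3f}")
    print(f"fixed site:   {site_entropy(fixed):.3f}")
```

With 20 residue types the entropy ranges from 0 for a site that accepts a single amino acid to ln 20, roughly 3.00, for a site that accepts all of them equally, which agrees with the value 3.00 that appears as the upper bound of the color scale described for FIG. 6.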
- the invention also provides methods for directed evolution of polymers.
- a parent sequence may be provided that has one or more properties of interest and one or more structurally tolerant residues selected for mutation.
- One or more mutant polymers may then be generated from the parent polymer sequence in which one or more of the selected structurally tolerant residues are mutated, and these mutants are then preferably screened for the one or more properties of interest. Mutants are then selected where one or more of the properties of interest is modified and, preferably, is improved.
- the directed evolution methods of the invention are iteratively repeated, and selected mutants are used as parent polymer sequences in subsequent iterations of the method.
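The iterative procedure described above can be sketched as a simple loop. This is a toy simulation under stated assumptions, not the method of the invention: the screen is a hypothetical scoring function standing in for a laboratory assay, each mutant carries a single substitution, and only the best improved mutant of each round becomes the next parent.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate(parent, tolerant_sites, rng):
    """Substitute one randomly chosen tolerant site with a different residue."""
    site = rng.choice(tolerant_sites)
    choices = [a for a in ALPHABET if a != parent[site]]
    return parent[:site] + rng.choice(choices) + parent[site + 1:]


def directed_evolution(parent, tolerant_sites, screen,
                       library_size=50, rounds=5, seed=0):
    """Repeat: build a mutant library, screen it, keep the best improved mutant."""
    rng = random.Random(seed)
    best, best_score = parent, screen(parent)
    for _ in range(rounds):
        library = [mutate(best, tolerant_sites, rng) for _ in range(library_size)]
        top = max(library, key=screen)
        if screen(top) > best_score:   # select only improved mutants
            best, best_score = top, screen(top)
    return best, best_score


def toy_screen(sequence):
    """Hypothetical assay: counts tryptophan residues (illustration only)."""
    return sequence.count("W")


if __name__ == "__main__":
    parent = "MKTAYIAKQR"
    print(directed_evolution(parent, [2, 5, 7], toy_screen))
```

Because mutations are confined to the selected sites, every other position of the parent is preserved across rounds, which is the property the targeted approach relies on.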
- the invention can also be used to identify parent molecules or families of parent molecules (e.g. preferred parent genes or gene families) for mutation. For example, a particular biochemical reaction may be facilitated by more than one enzyme or enzyme family, encoded by more than one gene or gene family. These genes or gene families can be evaluated to determine which are more likely, when altered (e.g. by directed evolution), to produce desirable improvements.
- Computer systems are also provided that may be used to implement the analytical methods of the invention, including methods of identifying structurally tolerant residues in a polymer sequence and/or selecting such residues for mutation (e.g., as part of a directed evolution method).
- These computer systems comprise a processor interconnected with a memory that contains one or more software components.
- the one or more software components include programs that cause the processor to implement steps of the analytical methods described herein.
- the software components may comprise additional programs and/or files including, for example, sequence or structural databases of polymers.
- Computer program products are further provided, which comprise a computer readable medium, such as one or more floppy disks, compact discs (e.g., CD-ROMS or RW-CDS), DVDs, data tapes, etc., that have one or more software components encoded thereon in computer readable form.
- the software components may be loaded into the memory of a computer system and may then cause a processor of the computer system to execute steps of the analytical methods described herein.
- the software components may include additional programs and/or files including databases, e.g., of polymer sequences and/or structures.
- FIG. 1 is a flow diagram illustrating an exemplary embodiment of the methods of the invention.
- FIG. 2 shows an exemplary computer system that may be used to implement analytical methods of the invention.
- FIG. 3 is a plot of the probability distribution P(c) that a positive mutation (i.e., a mutation increasing the protein fitness, F) occurs at a residue having c coupled interactions.
- FIGS. 4A-B are plots showing the site entropy profile s_i (FIG. 4A) and %-solvent exposure (FIG. 4B) for amino acid residues 5-268 of subtilisin E protein.
- FIGS. 5A-B show the probability distribution P(s_i) of site entropy values s_i in subtilisin E protein (FIG. 5A) and T4 lysozyme protein (FIG. 5B).
- FIG. 6 shows the three-dimensional crystal structure of subtilisin E protein, with the site entropy s_i of each amino acid residue indicated by its color: yellow, 2.16 < s_i < 3.00; red, 1.31 < s_i < 2.16; gray, s_i < 1.31.
- FIG. 7 shows the off-rate k_off of the 4-4-20 antibody mutants plotted against the entropy of the sites where the beneficial mutations occurred.
- FIG. 8 is a representative plot of percent functional improvement versus entropy for T4 lysozyme in a model according to the invention.
- FIG. 9 shows a representative comparison of T4 lysozyme entropy calculated according to the mean-field and dead-end elimination algorithms.
- the invention overcomes problems in the prior art and provides novel methods which can be used for directed evolution of biopolymers such as proteins and nucleic acids.
- the invention provides methods which can be used to identify residues of a polymer where mutations are most likely to produce one or more improved properties. By preferentially mutating these residues, the sequence space for a given polymer may be more efficiently searched. Mutant or variant polymers having one or more improved properties may be more readily identified while simultaneously reducing the number(s) of mutants screened.
- the inventors have discovered, in particular, that the probability of a beneficial mutation occurring at a highly coupled residue decreases significantly as the "fitness" of the parent polymer increases.
- Highly coupled residues in a polymer will generally require several simultaneous mutations at other residues to demonstrate improvement, e.g., in a directed evolution experiment.
- the probability that this occurs decreases rapidly due to the limited mutation rate and library size.
- fewer simultaneous mutations are generally required for mutations at uncoupled or weakly coupled residues to improve a particular property of the polymer.
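The effect of library size on the chance of finding several simultaneous mutations can be illustrated with a deliberately simple model. It assumes that each residue mutates independently with a fixed per-residue probability and that only one of the 19 alternative amino acids is the required one at each position; these assumptions, the parameter values and the function names are illustrative and do not come from the source.

```python
import math


def prob_specific_mutations(k, per_residue_rate, alternatives=19):
    """Probability that one library member carries k specific substitutions."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return (per_residue_rate / alternatives) ** k


def prob_found_in_library(k, per_residue_rate, library_size, alternatives=19):
    """Probability that at least one of `library_size` members carries them."""
    p = prob_specific_mutations(k, per_residue_rate, alternatives)
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)^N, evaluated in a numerically stable way for tiny p.
    return -math.expm1(library_size * math.log1p(-p))


if __name__ == "__main__":
    for k in (1, 2, 3):
        found = prob_found_in_library(k, per_residue_rate=0.01, library_size=10_000)
        print(f"{k} simultaneous mutation(s): {found:.2e}")
```

Under these assumptions each additional required mutation lowers the chance of success by several orders of magnitude for a fixed library size, which is the trend the passage describes for highly coupled residues.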
- molecule means any distinct or distinguishable structural unit of matter comprising one or more atoms, and includes, for example, polypeptides and polynucleotides.
- polymer means any substance or compound that is composed of two or more building blocks ('mers') that are repetitively linked together. For example, a "dimer" is a compound in which two building blocks have been joined together; a “trimer” is a compound in which three building blocks have been joined together; etc.
- the individual building blocks of a polymer are also referred to herein as "residues”.
- biopolymer is any polymer having an organic or biochemical utility or that is produced by a cell.
- Preferred biopolymers include, but are not limited to, polynucleotides, polypeptides and polysaccharides.
- polynucleotide or “nucleic acid molecule” refers to a polymeric molecule having a backbone that supports bases capable of hydrogen bonding to typical polynucleotides, wherein the polymer backbone presents the bases in a manner to permit such hydrogen bonding in a specific fashion between the polymeric molecule and a typical polynucleotide (e.g., single-stranded DNA).
- bases are typically inosine, adenosine, guanosine, cytosine, uracil and thymidine.
- Polymeric molecules include "double stranded” and “single stranded” DNA and RNA, as well as backbone modifications thereof (for example, methylphosphonate linkages).
- a "polynucleotide” or “nucleic acid” sequence is a series of nucleotide bases (also called "nucleotides"), generally in DNA and RNA, and means any chain of two or more nucleotides.
- a nucleotide sequence frequently carries genetic information, including the information used by cellular machinery to make proteins and enzymes.
- the terms include genomic DNA, cDNA, RNA, any synthetic and genetically manipulated polynucleotide, and both sense and antisense polynucleotides. This includes single- and double-stranded molecules; i.e., DNA-DNA, DNA-RNA, and RNA-RNA hybrids as well as “protein nucleic acids” (PNA) formed by conjugating bases to an amino acid backbone.
- nucleic acids containing modified bases, for example, thio-uracil, thio-guanine and fluoro-uracil.
- the polynucleotides herein may be flanked by natural regulatory sequences, or may be associated with heterologous sequences, including promoters, enhancers, response elements, signal sequences, polyadenylation sequences, introns, 5'- and 3'-non-coding regions and the like.
- the nucleic acids may also be modified by many means known in the art.
- Non-limiting examples of such modifications include methylation, "caps”, substitution of one or more of the naturally occurring nucleotides with an analog, and internucleotide modifications such as, for example, those with uncharged linkages (e.g., methyl phosphonates, phosphotriesters, phosphoroamidates, carbamates, etc.) and with charged linkages (e.g., phosphorothioates, phosphorodithioates, etc.).
- Polynucleotides may contain one or more additional covalently linked moieties, such as proteins (e.g., nucleases, toxins, antibodies, signal peptides, poly-L-lysine, etc.), intercalators (e.g., acridine, psoralen, etc.), chelators (e.g., metals, radioactive metals, iron, oxidative metals, etc.) and alkylators to name a few.
- the polynucleotides may be derivatized by formation of a methyl or ethyl phosphotriester or an alkyl phosphoramidite linkage.
- polynucleotides herein may also be modified with a label capable of providing a detectable signal, either directly or indirectly.
- exemplary labels include radioisotopes, fluorescent molecules, biotin and the like.
- Other non-limiting examples of modification which may be made are provided, below, in the description of the invention.
- the term "oligonucleotide” refers to a nucleic acid, generally of at least 10, preferably at least 15, and more preferably at least 20 nucleotides, preferably no more than 100 nucleotides, that is hybridizable to a genomic DNA molecule, a cDNA molecule, or an mRNA molecule encoding a gene, mRNA, cDNA, or other nucleic acid of interest.
- Oligonucleotides can be labeled, e.g., with ³²P-nucleotides or nucleotides to which a label, such as biotin or a fluorescent dye (for example, Cy3 or Cy5), has been covalently conjugated.
- oligonucleotides are prepared synthetically, preferably on a nucleic acid synthesizer. Accordingly, oligonucleotides can be prepared with non-naturally occurring phosphoester analog bonds, such as thioester bonds, etc.
- a “polypeptide” is a chain of chemical building blocks called amino acids that are linked together by chemical bonds called “peptide bonds”.
- the term “protein” refers to polypeptides that contain the amino acid residues encoded by a gene or by a nucleic acid molecule (e.g., an mRNA or a cDNA) transcribed from that gene either directly or indirectly.
- a protein may lack certain amino acid residues that are encoded by a gene or by an mRNA.
- a gene or mRNA molecule may encode a sequence of amino acid residues on the N-terminus of a protein (i.e., a signal sequence) that is cleaved from, and therefore may not be part of, the final protein.
- a protein or polypeptide, including an enzyme, may be "native” or “wild-type”, meaning that it occurs in nature; or it may be a “mutant”, “variant” or “modified”, meaning that it has been made, altered, derived, or is in some way different or changed from a native protein or from another mutant.
- Amplification of a polynucleotide denotes the use of polymerase chain reaction (PCR) to increase the concentration of a particular DNA sequence within a mixture of DNA sequences.
- a “ligand” is, broadly speaking, any molecule that binds to another molecule.
- the ligand is either a soluble molecule or the smaller of the two molecules or both.
- the other molecule is referred to as a "receptor".
- both a ligand and its receptor are molecules (preferably proteins or polypeptides) produced by cells.
- a ligand is a soluble molecule and the receptor is an integral membrane protein (i.e., a protein expressed on the surface of a cell). The binding of a ligand to its receptor is frequently a step of signal transduction within a cell.
- exemplary ligand-receptor interactions include, but are not limited to, binding of a hormone to a hormone receptor (for example, the binding of estrogen to the estrogen receptor) and the binding of a neurotransmitter to a receptor on the surface of a neuron.
- a "gene” is a sequence of nucleotides which code for a functional "gene product”.
- a gene product is a functional protein.
- a gene product can also be another type of molecule in a cell, such as an RNA (e.g., a tRNA or a rRNA).
- a gene product also refers to an mRNA sequence which may be found in a cell.
- measuring gene expression levels according to the invention may correspond to measuring mRNA levels.
- a gene may also comprise regulatory (i.e., non-coding) sequences as well as coding sequences.
- Exemplary regulatory sequences include promoter sequences, which determine, for example, the conditions under which the gene is expressed.
- the transcribed region of the gene may also include untranslated regions including introns, a 5'-untranslated region (5'-UTR) and a 3'-untranslated region (3'-UTR).
- a "coding sequence” or a sequence “encoding” an expression product such as an RNA, polypeptide, protein or enzyme is a nucleotide sequence that, when expressed, results in the production of that RNA, polypeptide, protein or enzyme; i.e., the nucleotide sequence "encodes" that RNA or it encodes the amino acid sequence for that polypeptide, protein or enzyme.
- a "promoter sequence” is a DNA regulatory region capable of binding RNA polymerase in a cell and initiating transcription of a downstream (3' direction) coding sequence.
- a promoter sequence is typically bounded at its 3' terminus by the transcription initiation site and extends upstream (5' direction) to include the minimum number of bases or elements necessary to initiate transcription at levels detectable above background.
- a promoter sequence contains a transcription initiation site (conveniently found, for example, by mapping with nuclease S1), as well as protein binding domains (consensus sequences) responsible for the binding of RNA polymerase.
- a coding sequence is "under the control of" or is “operatively associated with” transcriptional and translational control sequences in a cell when RNA polymerase transcribes the coding sequence into RNA, which is then trans-RNA spliced (if it contains introns) and, if the sequence encodes a protein, is translated into that protein.
- RNA such as rRNA or mRNA
- a DNA sequence is expressed by a cell to form an "expression product” such as an RNA (e.g., a mRNA or a rRNA) or a protein.
- the expression product itself, e.g., the resulting RNA or protein, may also be said to be “expressed” by the cell.
- transfection means the introduction of a foreign nucleic acid into a cell.
- transformation means the introduction of a "foreign” (i.e., extrinsic or extracellular) gene, DNA or RNA sequence into a host cell so that the host cell will express the introduced gene or sequence to produce a desired substance, in this invention typically an RNA coded by the introduced gene or sequence, but also a protein or an enzyme coded by the introduced gene or sequence.
- the introduced gene or sequence, which may also be called a “cloned” or “foreign” gene or sequence, may include regulatory or control sequences (e.g., start, stop, promoter, signal, secretion or other sequences used by a cell's genetic machinery).
- the gene or sequence may include nonfunctional sequences or sequences with no known function.
- a host cell that receives and expresses introduced DNA or RNA has been "transformed” and is a "transformant" or a “clone”.
- the DNA or RNA introduced to a host cell can come from any source, including cells of the same genus or species as the host cell or cells of a different genus or species.
- vector means the vehicle by which a DNA or RNA sequence (e.g., a foreign gene) can be introduced into a host cell so as to transform the host and promote expression (e.g., transcription and translation) of the introduced sequence.
- Vectors may include plasmids, phages, viruses, etc. and are discussed in greater detail below.
- a “cassette” refers to a DNA coding sequence or segment of DNA that codes for an expression product that can be inserted into a vector at defined restriction sites.
- the cassette restriction sites are designed to ensure insertion of the cassette in the proper reading frame.
- foreign DNA is inserted at one or more restriction sites of the vector DNA, and then is carried by the vector into a host cell along with the transmissible vector DNA.
- a segment or sequence of DNA having inserted or added DNA, such as an expression vector can also be called a "DNA construct.”
- a common type of vector is a "plasmid", which generally is a self-contained molecule of double-stranded DNA, usually of bacterial origin, that can readily accept additional (foreign) DNA and which can readily be introduced into a suitable host cell.
- host cell means any cell of any organism that is selected, modified, transformed, grown or used or manipulated in any way for the production of a substance by the cell.
- a host cell may be one that is manipulated to express a particular gene, a DNA or RNA sequence, a protein or an enzyme.
- Host cells may be cultured in vitro or may be one or more cells in a non-human animal (e.g., a transgenic animal or a transiently transfected animal).
- expression system means a host cell and compatible vector under suitable conditions, e.g. for the expression of a protein coded for by foreign DNA carried by the vector and introduced to the host cell.
- Common expression systems include E. coli host cells and plasmid vectors, insect host cells such as Sf9, Hi5 or S2 cells and Baculovirus vectors, Drosophila cells (Schneider cells) and expression systems, and mammalian host cells and vectors.
- mutations may include, but are not limited to, changes in the nucleotide sequence of a nucleic acid (including changes in the sequence of a gene), and also changes in the amino acid sequence of a protein or polypeptide.
- mutations are limited to substitutions of one or more polymer residues (e.g., nucleotide and/or amino acid substitutions).
- mutations of the invention may also include deletions or insertions of one or more residues, such as amino acid and/or nucleotide insertions or deletions.
- the methods of the invention may include steps of comparing parent sequences to each other or a parent sequence to one or more mutants.
- Such comparisons typically comprise alignments of polymer sequences, e.g., using sequence alignment programs and/or algorithms that are well known in the art (for example, BLAST, FASTA and MEGALIGN, to name a few).
- amino acid residue i in the mutant sequence is preferably said to be a "gap" or "deletion".
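Once a parent and a mutant have been aligned, the differences between them can be listed position by position. The sketch below assumes the alignment has already been produced by an external program and is supplied as two equal-length strings in which "-" marks a gap; that input convention and the function name are assumptions made for illustration.

```python
def compare_aligned(parent, mutant, gap="-"):
    """List substitutions, deletions and insertions between two aligned sequences.

    Positions are 1-based and refer to the ungapped parent sequence. An
    insertion is reported with the parent position it follows.
    """
    if len(parent) != len(mutant):
        raise ValueError("aligned sequences must have equal length")
    substitutions, deletions, insertions = [], [], []
    pos = 0  # current position in the ungapped parent
    for p, m in zip(parent, mutant):
        if p != gap:
            pos += 1
        if p == gap and m == gap:
            continue                          # column empty in both sequences
        if p == gap:
            insertions.append((pos, m))       # residue present only in mutant
        elif m == gap:
            deletions.append((pos, p))        # parent residue is a gap in mutant
        elif p != m:
            substitutions.append((pos, p, m))
    return substitutions, deletions, insertions


if __name__ == "__main__":
    subs, dels, ins = compare_aligned("MKT-AYI", "MRTGA-I")
    print("substitutions:", subs)
    print("deletions:    ", dels)
    print("insertions:   ", ins)
```

In this toy alignment the mutant has a K-to-R substitution at parent position 2, an inserted G after position 3, and a gap (deletion) at parent position 5.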
- heterologous refers to a combination of elements not naturally occurring.
- chimeric RNA molecules may comprise an rRNA sequence and a heterologous RNA sequence which is not part of the rRNA sequence.
- the heterologous RNA sequence refers to an RNA sequence that is not naturally located within the ribosomal RNA sequence.
- heterologous RNA sequence may be naturally located within the ribosomal RNA sequence, but is found at a location in the rRNA sequence where it does not naturally occur.
- heterologous DNA refers to DNA that is not naturally located in the cell, or in a chromosomal site of the cell.
- heterologous DNA includes a gene foreign to the cell.
- a heterologous expression regulatory element is a regulatory element operatively associated with a different gene than the one it is operatively associated with in nature.
- homologous, in all its grammatical forms and spelling variations, refers to the relationship between two proteins that possess a "common evolutionary origin", including proteins from superfamilies (e.g., the immunoglobulin superfamily) in the same species of organism, as well as homologous proteins from different species of organism (for example, myosin light chain polypeptide, etc.; see Reeck et al., Cell 1987, 50:667).
- proteins and their encoding nucleic acids
- sequence homology as reflected by their sequence similarity, whether in terms of percent identity or by the presence of specific residues or motifs and conserved positions.
- sequence similarity, in all its grammatical forms, refers to the degree of identity or correspondence between nucleic acid or amino acid sequences that may or may not share a common evolutionary origin (see Reeck et al., supra).
- sequence similarity, when modified with an adverb such as "highly", may refer to sequence similarity and may or may not relate to a common evolutionary origin.
- recombination, and variant spellings thereof, encompasses both homologous and non-homologous recombination, each being the exchange of biopolymer fragments between two biopolymer sequences.
- sequences may be recombined at the amino acid or nucleic acid level.
- homologous recombination refers to the exchange of biopolymer fragments between two biopolymer sequences at locations where the sequences exhibit regions of sequence homology.
- homologous recombination refers to the insertion of a modified or foreign DNA sequence contained by a first vector into another DNA sequence contained in a second vector, or a chromosome of a cell.
- the first vector targets a specific chromosomal site for homologous recombination.
- the first vector will contain a sufficiently long region of homology to sequences of the second vector or chromosome to allow complementary binding and incorporation of DNA from the first vector into the DNA of the second vector, or the chromosome.
- non-homologous recombination refers to the exchange of biopolymer fragments between two biopolymer sequences at locations where the sequences are not homologous.
- a nucleic acid molecule is "hybridizable" to another nucleic acid molecule, such as a cDNA, genomic DNA, or RNA, when a single stranded form of the nucleic acid molecule can anneal to the other nucleic acid molecule under the appropriate conditions of temperature and solution ionic strength (see Sambrook et al. , supra).
- the conditions of temperature and ionic strength determine the "stringency" of the hybridization.
- low stringency hybridization conditions, corresponding to a Tm (melting temperature) of 55°C, can be used, e.g.
- Moderate stringency hybridization conditions correspond to a higher Tm, e.g., 40% formamide, with 5x or 6x SSC.
- High stringency hybridization conditions correspond to the highest Tm, e.g., 50% formamide, 5x or 6x SSC.
- SSC is 0.15 M NaCl, 0.015 M Na-citrate.
- the appropriate stringency for hybridizing nucleic acids depends on the length of the nucleic acids and the degree of complementation, variables well known in the art. The greater the degree of similarity or homology between two nucleotide sequences, the greater the value of Tm for hybrids of nucleic acids having those sequences.
- the relative stability (corresponding to higher Tm) of nucleic acid hybridizations decreases in the following order: RNA:RNA, DNA:RNA, DNA:DNA.
- equations for calculating Tm have been derived (see Sambrook et al., supra, 9.50-9.51). For hybridization with shorter nucleic acids, i.e.
- a minimum length for a hybridizable nucleic acid is at least about 10 nucleotides; preferably at least about 15 nucleotides; and more preferably the length is at least about 20 nucleotides.
- standard hybridization conditions refers to a Tm of about 55°C, and utilizes conditions as set forth above. In a preferred embodiment, the Tm is 60°C; in a more preferred embodiment, the Tm is 65°C.
- high stringency refers to hybridization and/or washing conditions at 68°C in 0.2x SSC, at 42°C in 50% formamide, 4x SSC, or under conditions that afford levels of hybridization equivalent to those observed under either of these two conditions.
- Suitable hybridization conditions for oligonucleotides are typically somewhat different than for full-length nucleic acids (e.g., full-length cDNA), because of the oligonucleotides' lower melting temperature. Because the melting temperature of oligonucleotides will depend on the length of the oligonucleotide sequences involved, suitable hybridization temperatures will vary depending upon the oligonucleotide molecules used.
- Exemplary temperatures may be 37°C (for 14-base oligonucleotides), 48°C (for 17-base oligonucleotides), 55°C (for 20-base oligonucleotides) and 60°C (for 23-base oligonucleotides).
- Exemplary suitable hybridization conditions for oligonucleotides include washing in 6x SSC/0.05% sodium pyrophosphate, or other conditions that afford equivalent levels of hybridization.
- isolated means that the referenced material is removed from the environment in which it is normally found. Thus, an isolated biological material can be free of cellular components, i.e., components of the cells in which the material is found or produced.
- an isolated nucleic acid includes a PCR product, an isolated mRNA, a cDNA, or a restriction fragment.
- an isolated nucleic acid is preferably excised from the chromosome in which it may be found, and more preferably is no longer joined to non-regulatory, non-coding regions, or to other genes, located upstream or downstream of the gene contained by the isolated nucleic acid molecule when found in the chromosome.
- the isolated nucleic acid lacks one or more introns. Isolated nucleic acid molecules include sequences inserted into plasmids, cosmids, artificial chromosomes, and the like.
- a recombinant nucleic acid is an isolated nucleic acid.
- An isolated protein may be associated with other proteins or nucleic acids, or both, with which it associates in the cell, or with cellular membranes if it is a membrane-associated protein.
- An isolated organelle, cell, or tissue is removed from the anatomical site in which it is found in an organism.
- An isolated material may be, but need not be, purified.
- purified refers to material that has been isolated under conditions that reduce or eliminate the presence of unrelated materials, i.e., contaminants, including native materials from which the material is obtained.
- a purified protein is preferably substantially free of other proteins or nucleic acids with which it is associated in a cell; a purified nucleic acid molecule is preferably substantially free of proteins or other unrelated nucleic acid molecules with which it can be found within a cell.
- substantially free is used operationally, in the context of analytical testing of the material.
- purified material substantially free of contaminants is at least 50% pure; more preferably, at least 90% pure, and more preferably still at least 99% pure. Purity can be evaluated by chromatography, gel electrophoresis, immunoassay, composition analysis, biological assay, and other methods known in the art.
- nucleic acids can be purified by precipitation, chromatography (including preparative solid phase chromatography, oligonucleotide hybridization, and triple helix chromatography), ultracentrifugation, and other means.
- Polypeptides and proteins can be purified by various methods including, without limitation, preparative disc-gel electrophoresis, isoelectric focusing, HPLC, reversed-phase HPLC, gel filtration, ion exchange and partition chromatography, precipitation and salting-out chromatography, extraction, and countercurrent distribution.
- the polypeptide may be produced in a recombinant system in which the protein contains an additional sequence tag that facilitates purification, such as, but not limited to, a polyhistidine sequence, or a sequence that specifically binds to an antibody, such as FLAG and GST.
- the polypeptide can then be purified from a crude lysate of the host cell by chromatography on an appropriate solid-phase matrix.
- antibodies produced against the protein or against peptides derived therefrom can be used as purification reagents.
- Cells can be purified by various techniques, including centrifugation, matrix separation (e.g., nylon wool separation), panning and other immunoselection techniques, depletion (e.g., complement depletion of contaminating cells), and cell sorting (e.g., fluorescence activated cell sorting or "FACS"). Other purification methods are possible.
- a purified material may contain less than about 50%, preferably less than about 15%, and most preferably less than about 90%, of the cellular components with which it was originally associated. The term "substantially pure" indicates the highest degree of purity which can be achieved using conventional purification techniques known in the art.
- the terms “about” and “approximately” shall generally mean an acceptable degree of error for the quantity measured given the nature or precision of the measurements. Typical, exemplary degrees of error are within 20 percent (%), preferably within 10%, and more preferably within 5% of a given value or range of values. Alternatively, and particularly in biological systems, the terms “about” and “approximately” may mean values that are within an order of magnitude, preferably within 5-fold and more preferably within 2-fold of a given value. Numerical quantities given herein are approximate unless stated otherwise, meaning that the term “about” or “approximately” can be inferred when not expressly stated.
- sequence space refers to the set of all possible sequences of residues for a polymer having a specified length.
- sequence space for a protein or polypeptide 300 amino acid residues in length is the group consisting of all sequences of 300 amino acid residues.
- sequence space of a nucleic acid 300 nucleotides in length is the group consisting of all sequences of 300 nucleotides, etc.
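- The size of a sequence space follows directly from this definition: a polymer of length L built from an alphabet of A residue types has A^L possible sequences. The short sketch below illustrates the count for the 300-residue examples; the function name `sequence_space_size` is hypothetical and the alphabet sizes (20 amino acids, 4 nucleotides) are the usual standard sets, assumed here.

```python
import math


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of `length` residues drawn from
    `alphabet_size` residue types (hypothetical helper)."""
    return alphabet_size ** length


# 20 standard amino acids, 4 nucleotides (assumed alphabets).
protein_space = sequence_space_size(20, 300)
nucleic_space = sequence_space_size(4, 300)

# Report orders of magnitude rather than printing the full integers.
print(f"protein, 300 residues: about 10^{300 * math.log10(20):.0f} sequences")
print(f"nucleic acid, 300 nt:  about 10^{300 * math.log10(4):.0f} sequences")
```

- Numbers of this size are why exhaustive enumeration of a sequence space is not practical for polymers of realistic length.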
- Conformational energy refers generally to the energy associated with a particular "conformation", or three-dimensional structure, of a polymer, such as the energy associated with the conformation of a particular protein or nucleic acid. Interactions that tend to stabilize a macromolecule, such as a polymer (e.g.
- the conformational energy for any stable polymer is quantitatively represented by a negative conformational energy value.
- the conformational energy for a particular polymer will be related to that polymer's stability.
- polymers and other macromolecules that have a lower (i.e., more negative) conformational energy are typically more stable, e.g., at higher temperatures (i.e., they have greater "thermal stability").
- the conformational energy of a polymer may also be referred to as the polymer's "stabilization energy".
- the conformational energy is calculated using an energy "force-field" that calculates or estimates the energy contribution from various interactions which depend upon the conformation of a polymer.
- the force-field is comprised of terms that include the conformational energy of the alpha-carbon backbone, side chain - backbone interactions, and side chain - side chain interactions.
- interactions with the backbone or side chain include terms for bond rotation, bond torsion, and bond length.
- the backbone-side chain and side chain-side chain interactions include van der Waals interactions, hydrogen-bonding, electrostatics and solvation terms.
- Electrostatic interactions may include coulombic interactions, dipole interactions and quadrupole interactions. Other similar terms may also be included.
- Force-fields that may be used to determine the conformational energy for a polymer are well known in the art and include the CHARMM (see Brooks et al., J. Comp. Chem. 1983, 4:187-217; MacKerell et al., in The Encyclopedia of Computational Chemistry, Vol. 1:271-277, John Wiley & Sons, Chichester, 1998), AMBER (see Cornell et al., J. Amer. Chem. Soc. 1995, 117:5179; Woods et al., J. Phys. Chem. 1995, 99:3832-3846; Weiner et al., J. Comp. Chem. 1986, 7:230; and Weiner et al., J. Amer. Chem.
- Coupled residues are residues in a polymer that interact, through any mechanism. The interaction between the two residues is therefore referred to as a "coupling interaction". Coupled residues generally contribute to polymer fitness through the coupling interaction. Typically, the coupling interaction is a physical or chemical interaction, such as an electrostatic interaction, a van der Waals interaction, a hydrogen bonding interaction, or a combination thereof. As a result of the coupling interaction, changing the identity of either residue will affect the fitness of the polymer, particularly if the change disrupts the coupling interaction between the two residues. Coupling interaction may also be described by a distance parameter between residues in a polymer. If the residues are within a certain cutoff distance, they are considered interacting.
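- The distance-based description of coupling can be sketched as follows. This is only an illustration: the choice of one representative point per residue, the function name `coupled_pairs`, the coordinates, and the 5.0 cutoff are all assumptions and are not taken from the specification.

```python
import math
from itertools import combinations


def coupled_pairs(coords, cutoff):
    """Return index pairs (i, j), i < j, whose representative points
    lie within `cutoff` of each other (hypothetical helper)."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            pairs.append((i, j))
    return pairs


# Made-up coordinates, one point per residue, arbitrary length units.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 4.0, 0.0)]
print(coupled_pairs(coords, cutoff=5.0))
```

- In practice the representative point (for example a backbone or side-chain atom) and the cutoff would be chosen to suit the interaction being modeled.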
- the term "fitness" is used to denote the level or degree to which a particular property or a particular combination of properties for a polymer (e.g. , a biopolymer such as a protein or a nucleic acid) are optimized.
- the fitness of a polymer is preferably determined by properties which a user wishes to improve.
- the fitness of a protein may refer to the protein's thermal stability, catalytic activity, binding affinity, solubility (e.g. , in aqueous or organic solvent), and the like.
- Other examples of fitness properties include enantioselectivity, activity towards non-natural substrates, and alternative catalytic mechanisms. Coupling interactions can be modeled as a way of evaluating or predicting fitness (stability). Fitness can be determined or evaluated experimentally or theoretically, e.g. computationally.
- the fitness is quantitated so that each polymer (e.g., each amino acid or nucleotide sequence) will have a particular "fitness value".
- the fitness of a protein may be the rate at which the polymer catalyzes a particular chemical reaction, or the protein's binding affinity for a ligand.
- the fitness of a polymer refers to the conformational energy of the polymer and is calculated, e.g. , using any method known in the art. See, e.g.
- the fitness of a polymer is quantitated so that the fitness value increases as the property or combination of properties is optimized.
- fitness landscape is used to describe the set of all fitness values belonging to all polymer sequences in a sequence space.
- each polypeptide in the sequence space will have a particular fitness value that may (at least in theory) be calculated or measured (e.g. , by screening each polypeptide to determine its fitness).
- the set of these fitness values is therefore the fitness landscape of the sequence space for proteins 300 amino acid residues in length.
- fitness values may vary considerably among individual sequences in a given sequence space. The fitness value for a given sequence may be higher or lower than other, similar sequences in the sequence space.
- the "fitness contribution" of a polymer residue refers to the level or extent f(i_a) to which the residue i_a, having an identity a, contributes to the total fitness of the polymer.
- the term "structural tolerance" is used to indicate the number of sequences or sequence states in a particular sequence space that are compatible with a particular stabilization or conformational energy (generally referred to here as the "energy").
- the particular energy is the energy of a particular "parent" sequence (e.g., the sequence of a particular polypeptide or nucleic acid).
- a sequence state may be compatible with a particular energy if the sequence state's energy is equal to (or approximately equal to) the particular energy. In other embodiments, however, a sequence state may be compatible with a particular energy if the sequence state's energy is less than or equal to (or is less than or approximately equal to) the particular energy.
- structurally tolerant mutants are those sequences in the sequence space (i. e. , those mutants) having a stabilization or conformational energy that is compatible with the parent sequence's stabilization or conformational energy.
- the structural tolerance is determined or otherwise obtained or provided for each residue in a particular "parent" sequence (e.g. , for each amino acid residue of a protein sequence, or for each nucleotide of a nucleic acid sequence).
- the structural tolerance of a particular residue is indicative of the number of compatible sequences having a mutation at that residue (i.e., where the identity of the corresponding residue in a compatible sequence is different from the residue's identity in the parent sequence).
- the structural tolerance of a polymer is measured by its "sequence entropy".
- the sequence entropy S(E) for a particular polymer sequence (referred to herein as the "parent sequence") is preferably obtained or provided by a relation in which Ω is the number of polymer sequences in the sequence space containing the parent sequence which are compatible with a particular conformational energy E (preferably, the conformational energy of the parent sequence), and s_i is the "site entropy" (defined below) of residue i in the polymer sequence.
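- One relation consistent with these definitions is the usual Boltzmann form of an entropy, decomposed into per-residue contributions. It is offered here as an assumption rather than as the specification's own equation; N (the number of residues) and the constant k_B are introduced for illustration, and the decomposition into a sum holds only to the extent that residues are treated independently.

```latex
S(E) \;=\; k_B \ln \Omega(E) \;\approx\; \sum_{i=1}^{N} s_i
```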
- the "site entropy" of a particular residue i in a particular polymer sequence is an indication or measurement of the number of compatible sequence states (i.e., sequence states that are compatible with a given conformational energy, preferably the conformational energy of the parent sequence), Ω_i ⊂ Ω, that have a residue mutation or substitution at the residue corresponding to residue i in the parent sequence.
- the site entropy of a residue i is a measurement or indication of the extent or likelihood that a mutation at residue i will disrupt the three-dimensional structure and/or the "fitness" (defined supra) of the parent sequence.
- the site entropy s_i of residue i in a given polymer sequence is expressed as:
$$s_i = -k_B \sum_{a=1}^{A} p_i(a) \ln p_i(a)$$
- A is the total number of substitutable groups (20 for amino acids).
- k_B is the Boltzmann constant (about 1.38 x 10^-23 J/K) or a dimensionless proportionality constant, and is preferably chosen to be unity (i.e., k_B = 1).
- the probability p_i(a) is the probability "p" that amino acid residue "a" exists at residue "i".
- probabilities p_i(a) can be determined by a Boltzmann weighting of the energies associated with mutating a residue to each state, while the remaining residues j (j ≠ i) retain their wild-type amino acid identities, i.e.:
$$p_i(a) = \frac{e^{-\beta E_i(a)}}{\sum_{a'=1}^{A} e^{-\beta E_i(a')}} \qquad \text{(Equation 3)}$$
- E_i(a) is the energy of state a at residue i.
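- The Boltzmann weighting and the site entropy described above can be sketched in a few lines. This is a minimal illustration, assuming k_B = 1 and an inverse sequence temperature `beta`; the function names and the example energies are hypothetical.

```python
import math


def boltzmann_probabilities(energies, beta=1.0):
    """p(a) = exp(-beta * E(a)) / sum over a' of exp(-beta * E(a'))."""
    e_min = min(energies.values())  # shift for numerical stability; cancels in the ratio
    weights = {a: math.exp(-beta * (e - e_min)) for a, e in energies.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


def site_entropy(probabilities, k_b=1.0):
    """s = -k_B * sum over a of p(a) ln p(a), skipping zero-probability states."""
    return -k_b * sum(p * math.log(p) for p in probabilities.values() if p > 0.0)


# Made-up energies (arbitrary units) for three candidate residues at one site.
energies = {"A": 0.0, "S": 0.0, "W": 50.0}
p = boltzmann_probabilities(energies)
print(p)
print(site_entropy(p))
```

- With two equally favorable states and one strongly disfavored state, the entropy is close to ln 2; a site where all 20 amino acids were equally acceptable would reach the maximum, ln 20.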
- the mean field theory may be used to deconvolute the probabilities while effectively allowing all amino acids at all positions to vary.
- site entropy distribution or “entropy distribution” refers to the distribution of site entropy values for a particular polymer sequence and may be represented, e.g. , as a histogram of the polymer's site entropy values.
- the site entropy distribution may provide, in certain embodiments, the probability P(s_i) that a residue of the polymer will have a particular site entropy value s_i. Equivalently, the site entropy distribution P(s_i) gives the fraction of residues in the polymer sequence that has a site entropy value s_i.
- mean field theory is a set of mathematical techniques for the theoretical treatment of systems undergoing phase transitions, using approximations.
- the idea of mean-field theory is to focus on one particular "tagged" particle in the system (in the case of proteins, one residue) and assume that the role of neighboring particles (residues) is to form an average energetic field which acts on the tagged particle (the specific amino acid at that residue). This is a useful technique to deconvolute the probability of a sequence into the product of the individual amino acid probabilities at each residue.
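- A minimal sketch of a self-consistent mean-field iteration is given below. It assumes a pairwise energy model with one-body terms `single[i][a]` and two-body terms `pair[(i, j)][a][b]` (i < j), an inverse sequence temperature `beta`, and a damping factor `mix`; all of these names and the update scheme are illustrative assumptions, not the specification's implementation.

```python
import math


def mean_field(single, pair, beta=1.0, iters=200, tol=1e-10, mix=0.5):
    """Iterate p_i(a) proportional to exp(-beta * (E_i(a) + average field))
    until the probabilities stop changing (hypothetical sketch)."""
    n = len(single)
    p = [[1.0 / len(site)] * len(site) for site in single]  # start uniform
    for _ in range(iters):
        delta = 0.0
        for i in range(n):
            field = []
            for a in range(len(single[i])):
                e = single[i][a]
                for j in range(n):
                    if j == i:
                        continue
                    for b in range(len(single[j])):
                        # Look up the pair energy in whichever orientation it is stored.
                        e_ab = pair[(i, j)][a][b] if i < j else pair[(j, i)][b][a]
                        e += p[j][b] * e_ab  # neighbors act through their average
                field.append(e)
            e_min = min(field)
            w = [math.exp(-beta * (e - e_min)) for e in field]
            z = sum(w)
            # Damped update: blend the new estimate with the previous one.
            new = [mix * wi / z + (1.0 - mix) * old for wi, old in zip(w, p[i])]
            delta = max(delta, max(abs(x - y) for x, y in zip(new, p[i])))
            p[i] = new
        if delta < tol:
            break
    return p


# Two sites, two states each; made-up energies with no coupling.
single = [[0.0, 1.0], [0.0, 0.0]]
pair = {(0, 1): [[0.0, 0.0], [0.0, 0.0]]}
probs = mean_field(single, pair)
print(probs)
```

- With all coupling terms set to zero, the result reduces to an independent Boltzmann weighting at each site, which is a convenient sanity check.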
- DEE Dead-end elimination
- GMEC global minimum energy conformation
- Dead end elimination is based on the following concept.
- Consider two rotamers, i_r and i_t, at residue i and the set of all other rotamer configurations {S} at all residues excluding i (of which rotamer j_s is a member). If the pairwise energy contributed between i_r and j_s is higher than the pairwise energy between i_t and j_s for all {S}, then rotamer i_r cannot exist in the global minimum energy conformation, and can be eliminated. This notion is expressed mathematically by the inequality:
$$E(i_r) + \sum_{j \neq i} E(i_r, j_s) > E(i_t) + \sum_{j \neq i} E(i_t, j_s) \quad \forall \{S\} \qquad \text{(Equation 5)}$$
- Equation 5 is not computationally tractable because, to make an elimination, it is required that the entire sequence (rotamer) space be enumerated. To simplify the problem, bounds implied by Equation 5 can be utilized:
$$E(i_r) + \sum_{j \neq i} \min_{s} E(i_r, j_s) > E(i_t) + \sum_{j \neq i} \max_{s} E(i_t, j_s) \qquad \text{(Equation 6)}$$
- Equation 6 can be extended to the elimination of pairs of rotamers inconsistent with the GMEC. This is done by determining that a pair of rotamers, i_r at residue i and j_s at residue j, always contribute higher energies than rotamers i_u and j_v with all possible rotamer combinations {L}. Similar to Equation 6, the strict bound of this statement is given by:
$$E(i_r) + E(j_s) + E(i_r, j_s) + \sum_{k \neq i,j} \min_{t} \left[ E(i_r, k_t) + E(j_s, k_t) \right] > E(i_u) + E(j_v) + E(i_u, j_v) + \sum_{k \neq i,j} \max_{t} \left[ E(i_u, k_t) + E(j_v, k_t) \right]$$
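- The single-rotamer bound of Equation 6 can be sketched as a simple elimination loop. The data layout (`single[i][r]` for self energies, `pair[i][j][r][s]` for pair energies), the function name `dee_singles`, and the toy energies are assumptions made for illustration only.

```python
def dee_singles(single, pair):
    """Repeatedly remove rotamer r at residue i when its best-case energy
    exceeds the worst-case energy of some other rotamer t at i.
    Returns the surviving rotamer indices per residue (hypothetical sketch)."""
    n = len(single)
    alive = [set(range(len(site))) for site in single]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                for t in list(alive[i]):
                    if t == r:
                        continue
                    # Lower bound on the energy of r: most favorable partners.
                    best_r = single[i][r] + sum(
                        min(pair[i][j][r][s] for s in alive[j])
                        for j in range(n) if j != i)
                    # Upper bound on the energy of t: least favorable partners.
                    worst_t = single[i][t] + sum(
                        max(pair[i][j][t][s] for s in alive[j])
                        for j in range(n) if j != i)
                    if best_r > worst_t:
                        alive[i].discard(r)  # r cannot be part of the GMEC
                        changed = True
                        break
    return alive


# Toy system: two residues with two rotamers each (made-up energies).
single = [[0.0, 5.0], [0.0, 0.0]]
pair = {
    0: {1: [[0.0, 1.0], [0.5, 0.0]]},
    1: {0: [[0.0, 0.5], [1.0, 0.0]]},  # transpose of pair[0][1]
}
print(dee_singles(single, pair))
```

- A rotamer is only ever eliminated relative to another surviving rotamer at the same residue, so each residue always keeps at least one candidate.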
- FIG. 1 provides a flow diagram illustrating a general, exemplary embodiment of the methods used in this invention.
- the flow diagram in FIG. 1 as well as the other examples presented in Section 6, infra, describe preferred embodiments where the methods are used in directed evolution of a protein or other polypeptide.
- the methods illustrated by these examples and throughout this specification may be used to modify any polymer. Indeed, any molecule composed of a sequence or series of discrete residues can be optimized according to these methods.
- the methods of the invention may also be used, e.g., for directed evolution of nucleic acids (including directed evolution of DNA or RNA).
- a person skilled in the relevant art(s) may readily modify the methods for use with such other polymers using only what is taught in this application coupled with routine methods, already known in the art, for synthesizing, modifying and/or screening other polymers.
- the method shown in FIG. 1 begins with the selection of a "parent" amino acid (or other polymer) sequence (101).
- the parent sequence may be any amino acid sequence and may or may not correspond to a naturally occurring polypeptide. However, in preferred embodiments the parent sequence will be the sequence for a protein, or other polypeptide, that is expressed by a cell. Preferably, the parent sequence is also the sequence for a protein that has some level or degree of activity or function (e.g. , catalytic activity, binding affinity, solubility, thermal stability, etc.) to be optimized. The methods of the invention may then be used, e.g. , to optimize the activity or function of the parent sequence and/or to optimize the activity in altered conditions.
- the parent sequence may be a protein having a particular catalytic or other activity, and the invention may be used to identify sequences having the same activity but under different (generally more extreme) conditions such as conditions of temperature or of solvent (including, for example, solvent polarity, salt conditions, acidity, alkalinity, etc. ).
- the parent sequence may have a particular level or amount of activity (e.g., catalytic activity, binding affinity, etc.), and the directed evolution methods of the invention may be used to identify sequences having improved levels or amounts of that same activity (e.g., higher binding affinity or increased catalytic rate).
- the fitness value of the parent sequence is determined (e.g., calculated) or otherwise obtained or provided from an expression that comprises a first term f(i_a) for the uncoupled contribution of each residue i_a to the fitness, and a second term f(i_a, j_a) for the contribution made by coupling interactions between residues i_a and j_a; i.e., the general form is a sum of single-residue terms plus a sum of pairwise coupling terms.
- the conformational energy may be obtained or provided from an expression of the same general form, with single-residue terms e(i_r) and pairwise terms e(i_r, j_s).
- e(i_r) denotes interactions between components (typically atoms or functional groups, such as methyl-groups, ethyl-groups, hydroxyl-groups, etc.) of the same residue i (i.e., intra-residue interactions) that contribute to the conformational energy.
- e(i_r, j_s) denotes interactions between components of different residues i and j (i.e., inter-residue interactions) that contribute to the conformational energy, such as (but not limited to) van der Waals, electrostatic, and hydrogen bonding interactions between different residues.
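- Written out, a pairwise decomposition of this kind might take the following form. The symbols F and E for the totals, the number of residues N, and the summation limits (each pair counted once) are assumptions made for illustration; only the single-residue and pairwise terms themselves come from the text.

```latex
F \;=\; \sum_{i=1}^{N} f(i_a) \;+\; \sum_{i=1}^{N} \sum_{j>i} f(i_a, j_a)
\qquad\qquad
E \;=\; \sum_{i=1}^{N} e(i_r) \;+\; \sum_{i=1}^{N} \sum_{j>i} e(i_r, j_s)
```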
- the structure or conformation of the parent sequence is obtained or otherwise provided. (See FIG. 1, step 102).
- the parent sequence is the sequence for a known protein or nucleic acid
- the structure or conformation of the parent sequence will be known and can be obtained from any of a variety of resources (for a review, see Hogue et al. , Methods Biochem. Anal. 1998, 39:46-73).
- the Protein Data Bank (PDB) (Berman et al., Nucl. Acids Res. 2000, 28:235-242) is a public repository of three-dimensional structures for a large number of macromolecules, including the structures of many proteins, nucleic acids and other biopolymers.
- the structure of a polymer (e.g, protein) sequence that is similar or homologous to the parent sequence will be known.
- the known structure may, therefore, be used as the structure for the parent sequence or, more preferably, may be used to predict the structure of the parent sequence (i.e., in "homology modeling").
- MMDB Molecular Modeling Database
- 1999, 27:240-243 provides search engines that may be used to identify proteins and/or nucleic acids that are similar or homologous to a parent sequence (referred to as "neighboring" sequences in the MMDB), including neighboring sequences whose three-dimensional structures are known.
- the database further provides links to the known structures along with alignment and visualization tools whereby the homologous and parent sequences may be compared and a structure may be obtained for the parent sequence based on such sequence alignments and known structures.
- the three-dimensional structure of a parent sequence may be calculated from the sequence itself and using ab initio molecular modeling techniques already known in the art. See e.g., Smith TF, LoConte L, Bienkowska J, et al., “Current limitations to protein threading approaches,” J. Comput.
- a fitness value for the parent may be optionally obtained by calculating or determining the "conformational energy" or "energy" E of the parent structure (103).
- sequences that have a lower (i.e., more negative) conformational energy are typically expected to be more stable and therefore more "fit" than are sequences having higher (i.e., less negative) conformational energy.
- the conformational energy is calculated ab initio from the conformation determined in step 102, discussed above, and using an empirical or semi-empirical force field such as CHARMM (Brooks et al., J. Comp. Chem. 1983, 4:187-217; MacKerell et al., in The Encyclopedia of Computational Chemistry, Vol. 1:271-277, John Wiley & Sons, Chichester, 1998), AMBER (see Cornell et al., J. Amer. Chem. Soc. 1995, 117:5179; Woods et al., J. Phys. Chem. 1995, 99:3832-3846; Weiner et al., J. Comp. Chem.
- Equation 4 may be expanded to include an additional term f(i_a, j_a, k_a) for contributions made by coupling interactions between triplets of residues i_a, j_a and k_a.
- expressions for the fitness of a sequence may comprise terms for coupling interactions between any multiple of residues.
- the difficulty of calculating the fitness value of a sequence increases exponentially as additional coupling terms are included.
- the energy force field may be one which is expanded beyond the pairwise form to include multi-body energy terms such as, but not limited to, buried hydrophobic surface area and more complicated electrostatic interactions (e.g., electrostatic dipole and/or electrostatic quadrupole interactions).
- each amino acid has a set of rotamers, each of which has an associated energy (and therefore a probability).
- the amino acid probability, which is used to calculate the entropy, is the sum of the rotamer probabilities.
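- The summation of rotamer probabilities into amino acid probabilities can be sketched as below. The Boltzmann weighting of rotamer energies, the function name, and the example energies are assumptions made for illustration.

```python
import math
from collections import defaultdict


def amino_acid_probabilities(rotamer_energies, beta=1.0):
    """rotamer_energies: list of (amino_acid, energy) pairs, one per rotamer
    at a single site. Returns a dict mapping amino acid to probability."""
    e_min = min(e for _, e in rotamer_energies)
    weights = [(aa, math.exp(-beta * (e - e_min))) for aa, e in rotamer_energies]
    z = sum(w for _, w in weights)
    probs = defaultdict(float)
    for aa, w in weights:
        probs[aa] += w / z  # an amino acid's probability is the sum over its rotamers
    return dict(probs)


# Made-up energies: two serine rotamers and one alanine rotamer, all equal.
site = [("SER", 0.0), ("SER", 0.0), ("ALA", 0.0)]
print(amino_acid_probabilities(site))
```

- An amino acid with many low-energy rotamers therefore receives more weight than one with a single rotamer of the same energy.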
- the energies presented in the simplified example above are for single residues. This does not account for coupling between amino acids.
- the structural tolerance (104) can be readily determined for each residue (e.g., for each amino acid or nucleic acid residue) of that parent sequence using the methods provided herein.
- mutations that improve the fitness of a particular polymer are most likely to occur at uncoupled residues or at residues that are only weakly coupled. This is particularly true in preferred embodiments of the invention where the parent sequence is the sequence of a polymer, such as a protein or nucleic acid, that has a relatively high level of fitness.
- the structural tolerance of a residue is determined by evaluating the level or degree of coupling interactions that residue has with other residues of the polymer. Counting the number of coupling interactions between each residue is a simple means of estimating the structural tolerance of a residue. For example, a residue that has many coupling interactions between itself and the remaining structure is intolerant and a residue that has few coupling interactions between itself and the remaining structure is tolerant.
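- The counting estimate described above can be sketched as follows, assuming a list of coupled residue pairs is already available (for example from a distance cutoff). The function names and the example pairs are hypothetical.

```python
from collections import Counter


def coupling_counts(n_residues, pairs):
    """Number of coupling interactions each residue takes part in."""
    counts = Counter()
    for i, j in pairs:
        counts[i] += 1
        counts[j] += 1
    return [counts[i] for i in range(n_residues)]


def rank_by_tolerance(n_residues, pairs):
    """Residue indices ordered from fewest couplings (most tolerant)
    to most couplings (least tolerant)."""
    counts = coupling_counts(n_residues, pairs)
    return sorted(range(n_residues), key=lambda i: counts[i])


# Made-up coupling list for a five-residue polymer.
pairs = [(0, 1), (1, 2), (1, 3)]
print(coupling_counts(5, pairs))
print(rank_by_tolerance(5, pairs))
```

- Residues at the front of the ranking are the candidates this crude estimate would treat as most tolerant of mutation.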
- the structural tolerance of a residue i is provided or determined by obtaining or determining the "site entropy" s_i of that residue.
- the site entropy s_i is related to the number of sequences Ω_i in the sequence space that are compatible with conformational energy E of the parent sequence and have a mutation (i.e., substitution) at residue i.
- Residues that have higher site entropy values are residues where more mutations may be made which are compatible with the conformational energy of a parent sequence.
- the site entropy is particularly useful as a measurement of a residue's structural tolerance for mutations.
- the site entropy may be provided by any relationship where s_i increases with Ω_i.
- the site entropy s_i of a residue may be calculated or obtained by identifying all compatible sequences Ω in the sequence space, and identifying, among these compatible sequences, those that have a mutation or substitution at residue i (i.e., those compatible sequences where the residue at position i has a different identity than in the parent sequence).
- the sequences will include compatible sequences in the sequence space which have mutations or substitutions at one or more other residues j ≠ i in addition to a mutation or substitution at residue i. Practically speaking, however, it will be computationally intractable to identify all possible sequences Ω in a sequence space that are compatible with the conformational energy of a particular parent sequence.
- the invention also provides embodiments where the site entropy s_i of a residue may be calculated or obtained by identifying compatible sequences that are identical to the parent except for a single mutation or substitution at residue i.
- the methods of the invention may involve identifying and/or determining the number of compatible structures having the same sequence as the parent and having multiple residues where amino acid substitutions are allowed simultaneously.
- a skilled artisan may readily determine the conformational energy of a mutant sequence that differs from the parent by a single mutation or substitution at residue i, using the three-dimensional structure provided for the parent sequence (102).
- the conformational energy for all possible mutations and/or substitutions of the residue i is modelled (e.g., for all possible amino acid residue substitutions or for all possible nucleotide substitutions).
- Those single residue mutations and/or substitutions that have a conformational energy which is compatible with the conformational energy provided for the parent (103) are identified, and are used to calculate or determine the site entropy value s_i.
- the invention also provides embodiments where the site entropy s_i for each residue in a polymer sequence is iteratively calculated.
- Example 6.2 demonstrates exemplary calculations of site entropy values using mean-field theory to identify sequences that are compatible with the conformational energy of certain proteins that are used (in that example) as parent sequences. Conformational energies in these particularly preferred embodiments may be calculated, for example, using an energy force-field that models interactions of amino acid residues as rotamers interacting with a fixed backbone.
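- The single-mutation scan can be sketched as a filter over modeled energies at one residue. The compatibility test used here (energy less than or equal to the parent energy plus a tolerance) and the entropy-like score ln(1 + count) are illustrative assumptions; the second is just one possible relationship that increases with the number of compatible substitutions.

```python
import math


def compatible_substitutions(mutant_energies, parent_identity, parent_energy,
                             tolerance=0.0):
    """Substitutions at one residue whose modeled conformational energy is
    less than or equal to parent_energy + tolerance (hypothetical criterion)."""
    return sorted(a for a, e in mutant_energies.items()
                  if a != parent_identity and e <= parent_energy + tolerance)


def counting_site_entropy(n_compatible):
    """One monotonic choice: ln(1 + count), so a site with no compatible
    substitutions scores 0 (an assumption, not the specification's formula)."""
    return math.log(1 + n_compatible)


# Made-up energies for the parent residue "A" and three substitutions.
mutant_energies = {"A": -10.0, "S": -12.0, "G": -9.5, "W": 20.0}
hits = compatible_substitutions(mutant_energies, "A", parent_energy=-10.0,
                                tolerance=0.5)
print(hits, counting_site_entropy(len(hits)))
```

- Repeating the scan at every residue gives a per-residue tolerance profile without enumerating the full sequence space.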
- the mean field calculation effectively varies all amino acids of a parent protein simultaneously, across a fitness range that is modeled according to relative conformation energies across a corresponding "sequence temperature" range.
- the sequence temperature is a convenient term for representing relative energies; it is not a physical or measured temperature in the sense of degrees Kelvin or Centigrade.
- low fitness corresponds to high sequence temperatures
- high fitness corresponds to low sequence temperatures.
- no mutations e.g. amino acid deletions, additions or substitutions
- the mean field equations are iterated for self-consistency across the range of temperatures used.
- a hypothetical polypeptide having amino acid "X" at position 25 may be substituted at that position by any other amino acid.
- the probability of finding alanine at position 25 (according to the conformational energies at a first, relatively high, temperature) may be 0.1, whereas the probability associated with serine may be 0.9. According to the invention, this means that serine has a lower energy than alanine at position 25, and serine is expected to be "better" for the fitness of the polypeptide than alanine.
- if amino acid "Y" at position 30 of this polypeptide interacts with the amino acid at position 25, then amino acids that interact more favorably with serine will be preferred as amino acid Y at position 30.
- the embodiments demonstrated in Section 6.2 are particularly preferred since these embodiments effectively vary all residues in a given polymer sequence simultaneously.
- particular residues (e.g., one or more, two or more, three or more, four or more, five or more, ten or more, etc.)
- the calculations demonstrated in Section 6.2 may be performed varying only the selected residues.
- only the probability distribution of the selected residue(s) is (are) determined.
- first one residue is picked and its probability distribution is calculated (e.g., according to the mean field theory described in Example 6.2), and a second residue is then picked and its probability distribution is calculated, etc.
- site entropies are calculated for a plurality of residues in the polymer sequence (preferably for all residues in the polymer sequence).
- one residue is picked to vary (in composition and conformation) while holding all of the other residues constant (e.g. in their wild-type state). Calculations like those above are done, but only the probability distribution for the picked residue is determined when the temperature is decreased (the site entropy is calculated only for that residue). Then, a new residue is picked, and the process is repeated as desired, typically until the site entropies for all of the positions have been determined.
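The single-residue procedure can be illustrated with a small sketch. It assumes, as the "sequence temperature" language suggests but this passage does not spell out, that the probability of each amino acid at the picked site is Boltzmann-weighted by its conformational energy. The function names and the energy values are hypothetical and chosen only for illustration.

```python
import math

def site_distribution(energies, temperature):
    """Boltzmann-style probabilities for one site (assumed form: p ~ exp(-E/T))."""
    e_min = min(energies.values())  # shift energies for numerical stability
    weights = {aa: math.exp(-(e - e_min) / temperature) for aa, e in energies.items()}
    z = sum(weights.values())
    return {aa: w / z for aa, w in weights.items()}

def entropy(probs):
    """Site entropy of a probability distribution, -sum(p ln p)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

# Hypothetical conformational energies for three candidate identities at one site,
# with all other residues held in their wild-type state.
energies = {"Ala": 1.0, "Ser": 0.0, "Gly": 2.0}
hot = site_distribution(energies, temperature=10.0)   # high "temperature", low fitness
cold = site_distribution(energies, temperature=0.2)   # low "temperature", high fitness
```

Lowering the temperature concentrates the distribution on the lowest-energy identity, so the site entropy computed for the picked residue falls as the temperature decreases.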
- stochastic algorithms such as Monte Carlo algorithms may be used to determine conformational energies for a large number of sequences folded into a fixed backbone (preferably, one that corresponds to the conformation of the backbone in a parent sequence), and to identify those sequences which are compatible with (e.g., less than or approximately equal to) the conformational energy of the parent sequence, thus giving an estimate of s_i. See, e.g., Desjarlais JR & Clarke ND, "Computer search algorithms in protein modification and design," Curr. Opin. Struct.
- a parent sequence may be compared to one or more (preferably to a plurality of) polymer sequences to identify homologous sequences.
- the parent sequence for example, a particular protein sequence
- the parent sequence may be aligned with a plurality of other sequences (e.g., from a database of naturally occurring protein or other polymer sequences, such as the GenBank, SWISSPROT or EMBL database) to identify homologous sequences.
- Sequences that have a certain level of sequence similarity may also be compared.
- the level of sequence similarity is a threshold level or percentage of sequence homology (or sequence identity) that may be selected by a user.
- preferred levels of sequence homology are at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95% or at least 99%.
- a variety of methods and algorithms are known in the art for aligning polymer sequences and/or determining their levels of sequence similarity. Any of these methods and algorithms may be used in connection with this invention. Exemplary algorithms include, but are not limited to, the BLAST family of algorithms, FASTA, MEGALIGN and CLUSTAL.
- the site entropy of one or more particular residues in the parent sequence may be determined or estimated from the number of homologous or aligned sequences in which the particular residue is mutated.
- a homologous or aligned polymer sequence is said to have a mutation at a particular residue if, in an alignment of the particular polymer sequence (e.g., from the alignment algorithm used to identify a homologous sequence or sequences), the residue in the homologous sequence that aligns with the particular residue in the parent sequence is different (i.e., has a different identity) from the particular residue.
- the probabilities required to determine the site entropy (e.g., in Equation 2) can be calculated from the relative number of times each amino acid appears at each residue position in the alignment.
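A minimal sketch of this alignment-based estimate follows. Equation 2 is not reproduced in this passage, so the sketch assumes the usual entropy form s_i = -Σ p(i_a) ln p(i_a); the function names, the gap handling and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def site_entropy_from_column(column):
    """Entropy of one alignment column from relative amino acid frequencies.

    Gap characters ('-') are ignored; this is an illustrative choice.
    """
    residues = [c for c in column if c != "-"]
    total = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        p = count / total          # relative frequency of this amino acid
        entropy -= p * math.log(p)
    return entropy

def alignment_site_entropies(aligned_seqs):
    """One entropy value per alignment position (sequences must be equal length)."""
    return [site_entropy_from_column(col) for col in zip(*aligned_seqs)]

# Toy alignment: the parent sequence followed by three hypothetical homologs.
alignment = ["MKTA", "MRTA", "MKSA", "MQTA"]
entropies = alignment_site_entropies(alignment)
```

Positions that never vary in the alignment receive an entropy of zero, while positions where several identities appear receive larger values.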
- the structural tolerance of residues in a polymer may be determined indirectly through other parameters that are related to or that correlate with structural tolerance.
- B-values of individual residues may correlate with the individual residues' structural tolerance and can be used as indicators thereof.
- where an ensemble of structures is available or may be readily determined (e.g., by NMR) for the parent sequence, per-residue root mean squared (rms) deviation values from the ensemble may also be used as indicators of structural tolerance.
- Section 6.3 demonstrates that site entropy values for residues (e.g., as determined according to the method demonstrated in Section 6.2) correlate, at least to some degree, with the solvent accessibility of each residue in the parent sequence (e.g., in a protein). See, Tables 4A and 4B, and FIGS. 4A and 4B. Solvent accessibility for each residue was calculated using the Lee and Richards definition of solvent accessible surface area (Lee, B. & Richards, F.M. (1971) J. Mol. Biol. 55, 379) where 1.4 Å was used as the radius for water. Accordingly, the level or extent to which a residue in a polymer is accessible to solvent may also be used to indicate structural tolerance.
- the methods described in Section 5.2, supra, are particularly useful for directed evolution experiments, e.g., to obtain proteins, nucleic acids or other polymers having one or more desirable properties. Accordingly, the invention also provides methods, including methods of directed evolution, for obtaining polymers that have one or more improved properties.
- the improved properties include any property or combination of properties that can be detected by a user and include, for example, properties of catalytic activity (for example, increased rates of catalysis), properties of stability (for example, increased thermal stability), properties of binding affinity (for example, increased affinity for a particular ligand or substrate) and properties of binding specificity (including stereo- or enantio-selectivity).
- directed evolution methods comprise selecting at least one polymer sequence (i.e., a "parent" sequence).
- the polymer sequence is the sequence for a polymer (e.g., a nucleic acid or a polypeptide) that has a particular property or properties of interest.
- the particular property of the parent may be a particular catalytic activity, binding to a particular substrate or ligand, thermal stability or a combination thereof.
- the property is one that can be readily determined or evaluated by a screening assay, e.g. a high throughput screen.
- One or more residues of the parent polymer sequence is selected or targeted for mutation.
- point mutagenesis is applied across an entire gene. This is a random process, and mutations appear at random sites.
- specific residues in the parent sequence which are structurally tolerant are selected.
- the structurally tolerant residues may be identified, for example, according to the analytical methods described supra (see, Section 5.2). This eliminates or reduces the random mutagenesis of known methods, and provides a more targeted approach with improved efficiency.
- One or more, and preferably a plurality of mutant polymer sequences may then be generated based on the parent sequence.
- the directed evolution methods of the invention preferably generate a plurality of mutants which are identical to the parent sequence except that one or more structurally tolerant residues are mutated.
- Polymers having the mutant sequences may then be generated using polymer synthesis and/or recombinant technologies well known in the art, and the polymers having these mutant sequences are then preferably screened for the one or more properties of interest.
- methods of directed evolution may be iteratively repeated to generate and identify polymers where one or more properties of interest progressively improve with each iteration. Accordingly, in a preferred embodiment, one or more of the selected polymers may be selected as a new parent sequence, for use in a next round of iteration in the directed evolution method.
- Structurally tolerant residues of the new parent sequence may then be selected, and a second generation of mutants can be generated and screened as described above.
- Improved mutants may also be recombined if desired, using conventional genetic engineering techniques or by DNA shuffling to obtain further variations and improvements (see, for example, the Stemmer references, supra). These processes may be repeated as desired, to obtain successive generations of mutants.
- mutants are then tested, preferably in a screening assay, to identify mutants that actually have an improved property detected in the assay (for example, increased catalytic activity, or stronger binding to a ligand or substrate). These mutants are selected and again mutated, and the second generation of mutants is again tested to identify new mutants where the property is further improved.
- traditional directed evolution methods randomly search through the sequence space of a polymer one residue at a time to identify mutants with an increased fitness.
- screening assays may observe and/or select from between about 10^3 and 10^12 mutants, depending on the particular method. However, for a typical protein of 300 amino acid residues the number of possible amino acid combinations is about 10^390. Thus, screening assays can only observe a small fraction of sequences in the sequence space of a given parent. Using the analytical methods described in Section 5.2, therefore, a user can improve upon such existing methods by identifying those polymer residues having the highest level of structural tolerance and specifically selecting those residues for mutation in a directed evolution experiment.
- a user may select residues that have a structural tolerance (or a parameter such as site entropy which is indicative thereof) above a threshold value, which may also be selected by a user (for example, based on the number of residues that can be reasonably targeted in a particular experiment). For example, a user may select residues having a structural tolerance (or site entropy, etc.) that is above the average value of structural tolerance values in the parent sequence's site entropy distribution. Alternatively, a user may select residues whose structural tolerance exceeds the average by more than, e.g., one standard deviation of the site entropy distribution.
- mutations of residues that have an increased structural tolerance value and/or fewer coupling interactions with other residues are less likely to destabilize the structure of the polymer and, conversely, are more likely to increase the polymer's fitness. Accordingly, by focusing the mutations in a directed evolution experiment to residues having higher structural tolerance values, the number of sequences that must be tested or screened is considerably reduced and the sequence space may be searched more efficiently using existing screening techniques.
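The size estimate quoted here can be checked with a few lines of arithmetic. The sketch assumes 20 possible amino acids at each of the 300 positions and works in base-10 logarithms to avoid overflow; the variable names are illustrative only.

```python
import math

n_residues = 300
n_amino_acids = 20

# log10 of the number of possible sequences, 20**300 (about 390.3)
log10_space = n_residues * math.log10(n_amino_acids)

# Largest library size mentioned for screening assays: about 10**12 mutants
log10_screen_max = 12

# log10 of the fraction of sequence space that such a screen can cover
log10_fraction = log10_screen_max - log10_space
```

Even the largest screen therefore samples a fraction of the sequence space on the order of one part in 10^378.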
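The thresholding described here could be sketched as follows. The function name, the `n_std` parameter, the use of the population standard deviation and the example values are all assumptions made for illustration; residue positions are reported as 0-based indices.

```python
import statistics

def select_tolerant_residues(site_entropies, n_std=0.0):
    """Return positions whose site entropy exceeds mean + n_std * standard deviation.

    n_std=0.0 selects residues above the average value;
    n_std=1.0 selects residues more than one standard deviation above it.
    """
    mean = statistics.mean(site_entropies)
    std = statistics.pstdev(site_entropies)
    threshold = mean + n_std * std
    return [i for i, s in enumerate(site_entropies) if s > threshold]

# Hypothetical site entropy values for a six-residue parent sequence.
example = [0.1, 0.2, 1.0, 0.3, 2.0, 0.0]
above_mean = select_tolerant_residues(example)
above_one_sd = select_tolerant_residues(example, n_std=1.0)
```

Raising `n_std` shrinks the set of targeted residues, which is one way a user might match the selection to the number of positions that can reasonably be mutated in a given experiment.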
- the parent sequence may be expressed in facile gene expression systems to obtain libraries of mutant proteins.
- Any source of nucleic acid in purified form can be utilized as the starting nucleic acid.
- the process may employ DNA or RNA, including messenger RNA.
- the DNA or RNA may be either single or double stranded.
- DNA-RNA hybrids which contain one strand of each may be utilized.
- the nucleic acid sequence may also be of various lengths depending on the size of the sequence to be mutated.
- the specific nucleic acid sequence is from 50 to 50,000 base pairs. It is contemplated that entire vectors containing the nucleic acid encoding the protein of interest may be used in these methods.
- mutants of a parent sequence are generated by saturation or site directed mutagenesis techniques. These methods are targeted to specifically mutate selected residues, which, e.g., have higher structural tolerance or site entropy values.
- Oligonucleotide-directed mutagenesis which replaces a short sequence with a synthetically mutagenized oligonucleotide may also be employed to generate evolved polynucleotides having improved expression.
- An initial population of mutants of a specific (i.e., parent) sequence may be created by methods known in the art. These methods include oligonucleotide-directed mutagenesis, error-prone PCR, DNA shuffling, parallel PCR, chemical mutagenesis and sexual PCR.
- Nucleic acid or DNA shuffling which uses a method of in vitro or in vivo, generally homologous, recombination of pools of nucleic acid fragments or polynucleotides, can be employed to generate polynucleotide molecules having variant sequences of the invention.
- the evolved polynucleotide molecules can be cloned into a suitable vector selected by the skilled artisan according to methods well known in the art. If a mixed population of the specific nucleic acid sequence is cloned into a vector it can be clonally amplified by inserting each vector into a host cell and allowing the host cell to amplify the vector. The mixed population may be tested to identify the desired recombinant nucleic acid fragment. The method of selection will depend on the DNA fragment desired. For example, in this invention a DNA fragment which encodes for a protein with improved properties can be determined by tests for functional activity and/or stability of the protein. Such tests are well known in the art.
- the invention provides a novel means for producing functional proteins with improved properties.
- the mutants can be expressed in conventional or facile expression systems such as E. coli.
- Conventional tests can be used to determine whether a protein of interest produced from an expression system has improved expression, folding and/or functional properties.
- a polynucleotide subjected to directed evolution and expressed in a foreign host cell produces a protein with improved activity
- one skilled in the art can perform experiments designed to test the functional activity of the protein. Briefly, the evolved protein can be rapidly screened, and is readily isolated and purified from the expression system or media if secreted. It can then be subjected to assays designed to test functional activity of the particular protein in native form.
- FIG. 2 schematically illustrates an exemplary computer system suitable for implementation of the analytical methods of this invention.
- Computer 201 is illustrated here as comprising internal components linked to external components.
- internal components may, in alternative embodiments, be external.
- external components may also be internal.
- the internal components of this computer system include processor element 202 interconnected with a main memory 203.
- computer system 201 may be a Silicon Graphics R10000 Processor running at 195 MHz or greater and with 2 gigabytes or more of physical memory.
- computer system 201 may be an Intel Pentium based processor of 150 MHz or greater clock rate and with 32 megabytes or more of main memory.
- the external components may include a mass storage 204.
- This mass storage may be one or more hard disks which are typically packaged together with the processor and memory. Such hard disks are typically of at least 1 gigabyte storage capacity, and more preferably have at least 5 gigabytes or at least 10 gigabytes of storage capacity.
- the mass storage may also comprise, for example, a removable medium such as a CD-ROM drive, a DVD drive, a floppy disk drive (including a Zip™ drive), or a DAT drive or other
- a user interface device 205 which can be, for example, a monitor and a keyboard.
- the user interface is also coupled with a pointing device 206 which may be, for example, a "mouse" or other graphical input device (not illustrated).
- computer system 201 is also linked to a network link 207, which can be part of an Ethernet or other link to one or more other, local computer systems (e.g., as part of a local area network or LAN), or the network link may be a link to a wide area communication network (WAN) such as the Internet.
- one or more software components are loaded into main memory 203 during operation of computer system 201.
- These software components may include both components that are standard in the art and special to the invention, and the components collectively cause the computer system to function according to the analytical methods of the invention.
- the software components are stored on mass storage 204 (e.g., on a hard drive or on removable storage media such as on one or more CD-ROMs, RW-CDs, DVDs, floppy disks or DATs).
- Software component 210 represents an operating system, which is responsible for managing computer system 201 and its network interconnections.
- This operating system is typically an operating system routinely used in the art and may be, for example, a UNIX operating system or, less preferably, a member of the Microsoft Windows™ family of operating systems (for example, Windows 2000, Windows Me, Windows 98, Windows 95 or Windows NT) or a Macintosh operating system.
- Software component 211 represents common languages and functions conveniently present in the system to assist programs implementing the methods specific to the invention. Languages that may be used include, for example, FORTRAN, C, C++ and less preferably JAVA.
- the analytical methods of the invention may also be programmed in mathematical software packages which allow symbolic entry of equations and high-level specification of processing, including algorithms to be used, thereby freeing a user of the need to procedurally program individual equations and algorithms.
- software component 212 represents the analytic methods of the invention as programmed in a procedural language or symbolic package.
- the memory 203 may, optionally, further comprise software components 213 which cause the processor to calculate or determine a three-dimensional structure for a macromolecule and, in particular, for a given polymer sequence such as a protein or nucleic acid sequence.
- Such programs are well known in the art, and numerous software packages are available.
- the memory may also comprise one or more other software components, such as one or more other files representing, e.g. , one or more sequences of polymer residues including, for example, a parent sequence and/or other sequences (for example, mutant sequences) in a sequence space.
- the memory 203 may also comprise one or more files representing the three-dimensional structures of one or more sequences, including a file representing the three-dimensional structure of a parent sequence, such as a parent protein or nucleic acid.
- a computer program product of the invention comprises a computer readable medium such as one or more compact disks (i.e., one or more "CDs", which may be CD-ROMs or RW-CDs), one or more DVDs, one or more floppy disks (including, for example, one or more ZIP™ disks) or one or more DATs, to name a few.
- the computer readable medium has encoded thereon, in computer readable form, one or more of the software components 212 that, when loaded into memory 203 of a computer system 201, cause the computer system to implement analytic methods of the invention.
- the computer readable medium may also have other software components encoded thereon in computer readable form.
- Such other software components may include, for example, functional languages 211 or an operating system 210.
- the other software components may also include one or more files or databases including, for example, files or databases representing one or more polymer sequences (e.g., protein or nucleic acid sequences) and/or files or databases representing one or more three-dimensional structures for particular polymer sequences (e.g., three-dimensional structures for proteins and nucleic acids).
- a parent sequence may first be loaded into the computer system 201.
- the parent sequence may be directly entered by a user from monitor and keyboard 205 by directly typing a sequence of code symbols representing different residues (e.g., different amino acid or nucleotide residues).
- a user may specify parent sequences, e.g., by selecting a sequence from a menu of candidate sequences presented on the monitor or by entering an accession number for a sequence in a database (for example, the GenBank or SWISSPROT database) and the computer system may access the selected parent sequence from the database, e.g.
- the programs may then cause the computer system to obtain a three-dimensional structure of the parent sequence.
- the three-dimensional structure for the parent sequence may also be accessed from a file (for example, a database of structures) in the memory 203 or mass storage 204.
- the three-dimensional structure may also be retrieved through the computer network (e.g., over the network) from a database of structures such as the PDB database.
- the software components may, themselves, calculate a three-dimensional structure using the molecular modeling software components.
- Such software components may calculate or determine a three-dimensional structure, e.g., ab initio or may use empirical or experimental data such as X-ray crystallography or NMR data that may also be entered by a user or loaded into the memory 203 (e.g., from one or more files on the mass storage 204 or over the computer network 207).
- the software components may further cause the computer system to calculate a conformational energy for the parent sequence using the three-dimensional structure.
- the software components of the computer system when loaded into memory 203, preferably also cause the computer system to determine a structural tolerance or, in the alternative, a parameter related to or correlating with structural tolerance according to the methods described herein.
- the software components may cause the computer system to generate one or more mutant sequences of the parent and, using the conformation determined or obtained for the parent sequence, determine the conformational energy of each mutant and identify mutants that are compatible with the parent sequence's conformational energy.
- the structural tolerance of the residues may then be determined, e.g., by determining the site entropy s_i.
- the software components may cause the computer system to determine or evaluate the solvent accessibility of each residue in the parent sequence using its three-dimensional conformation.
- the computer system preferably then outputs, e.g., the structural tolerance or structural tolerance values of residues of the parent sequence.
- the structural tolerance values and/or values of one or more other parameters relating to or correlating with structural tolerance may be output to the monitor, printed on a printer (not shown) and/or written on mass storage 204.
- the software components may also cause the computer system to select and identify one or more particular residues in the parent sequence for mutation, e.g., in a directed evolution experiment.
- the computer system may identify residues of the parent sequence having a structural tolerance value that is above a certain threshold, such as values above the average structural tolerance value for residues of the polymer or, alternatively, values above one or more standard deviations of the average structural tolerance value. These residues could be identified, for a user, as ones which, if mutated, are most likely to improve properties of the polymer in a directed evolution experiment.
- N is the number of residues
- i_a is the identity of residue i (i.e., the amino acid at position i in a protein)
- f(i_a) is the uncoupled fitness contribution of i_a to the total fitness (F)
- f(i_a, j_b) is the fitness contribution of coupling interactions between amino acid residues i_a and j_b.
- the hypothetical biopolymer was "mutated" by selecting a residue i at random and changing its amino acid identity. 3000 mutants were thus identified and "screened" by evaluating each mutant's fitness F according to Equation 12 above.
- the directed evolution algorithm gradually progressed through different "fitness heights" (i.e., different levels of fitness) on the fitness landscape. At each round of evolution, the coupling of the residues where beneficial mutations occurred was recorded.
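The simulation described above can be sketched as follows. Equation 12 is not reproduced in this passage, so the sketch assumes a fitness of the form F = Σ f(i_a) + Σ f(i_a, j_b), with the second sum taken over coupled residue pairs. The Gaussian fitness tables, the coupling pattern, the function names and the reduced library size are hypothetical choices made for illustration.

```python
import random

def make_landscape(n_res, n_aa, pairs, rng):
    """Random uncoupled terms f(i_a) and coupled terms f(i_a, j_b) (illustrative)."""
    single = [[rng.gauss(0, 1) for _ in range(n_aa)] for _ in range(n_res)]
    pair = {(i, j): [[rng.gauss(0, 1) for _ in range(n_aa)] for _ in range(n_aa)]
            for (i, j) in pairs}
    return single, pair

def fitness(seq, single, pair):
    """Assumed form of the total fitness: uncoupled terms plus pairwise couplings."""
    f = sum(single[i][a] for i, a in enumerate(seq))
    f += sum(table[seq[i]][seq[j]] for (i, j), table in pair.items())
    return f

def evolve(seq, single, pair, rng, n_mutants=3000, rounds=10, n_aa=20):
    """Each round: screen random single mutants, keep the best improving one."""
    coupling = [0] * len(seq)              # number of coupled partners per residue
    for i, j in pair:
        coupling[i] += 1
        coupling[j] += 1
    accepted = []                          # coupling count of each beneficial site
    current = fitness(seq, single, pair)
    for _ in range(rounds):
        best, best_f, best_pos = None, current, None
        for _ in range(n_mutants):
            pos = rng.randrange(len(seq))  # pick a residue at random
            mutant = list(seq)
            mutant[pos] = rng.randrange(n_aa)
            f = fitness(mutant, single, pair)
            if f > best_f:
                best, best_f, best_pos = mutant, f, pos
        if best is None:                   # no improving mutant found this round
            break
        seq, current = best, best_f
        accepted.append(coupling[best_pos])
    return seq, current, accepted

rng = random.Random(0)
pairs = [(i, i + 1) for i in range(10)]    # residues 0..10 coupled in a chain
single, pair = make_landscape(30, 20, pairs, rng)
start = [rng.randrange(20) for _ in range(30)]
f0 = fitness(start, single, pair)
final, f_final, accepted = evolve(start, single, pair, rng, n_mutants=300, rounds=5)
```

The `accepted` list records, for each round, how many coupling interactions the residue carrying the beneficial mutation had, which is the quantity that would be tallied to estimate a distribution such as P(c).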
- FIG. 3 provides a plot showing the probability P(c) that a positive mutation (i.e., a mutation that increases the protein's fitness, F) occurs at a residue having c coupled interactions.
- results described here are independent of the specific form used to describe or model the fitness landscape F for a polymer (e.g., for a biopolymer such as DNA, RNA or a polypeptide).
- one skilled in the art can readily obtain the same results using any model that incorporates a variable degree of coupling between residues.
- examples of other models which may be used include Kauffman's NK-model (Kauffman, The Origins of Order 1993, Oxford University Press, Oxford), a lattice protein model (Li et al., Science 1996, 273:666; Shakhnovich, Phys. Rev. Lett.
- the fitness landscape is not limited to forms having only a two-body coupling term. Expressions may also be used which include terms for interactions or couplings between multiple (e.g., three or more) residues. Indeed, such terms will be useful and desirable in embodiments where a user wishes to also consider more complicated coupling interactions (for example, buried hydrophobic surface areas and/or complicated electrostatic interactions).
6.2. Calculating Structural Tolerance of Proteins
- conformational energy and fitness are independently evaluated according to the invention. To improve fitness, conformational energy is retained.
- the energy term used in this example consisted of two contributions: a rotamer/backbone contribution e(i_r), and a rotamer/rotamer contribution e(i_r, j_s):
- N is the number of residues (e.g., amino acid residues in a protein) and i_r is rotamer r at position i. Because the backbone remains fixed in this model, its energy contribution is not relevant to evaluating fitness, and therefore was not included in Equation 13.
- Potential functions and parameters for van der Waals interactions, hydrogen bonding, and electrostatics were used as previously described (see, Dahiyat & Mayo, Proc. Natl. Acad. Sci. U.S.A. 1997, 94:10172; and Dahiyat & Mayo, Protein Science 1996, 5:895). For atomic radii and internal coordinate parameters, the DREIDING force field was used (Mayo et al., J. Phys.
- a backbone-dependent rotamer library was used to model protein structures, as described by Dunbrack and Karplus (Dunbrack & Karplus, J. Mol. Biol. 1993, 230:543; Dunbrack & Karplus, Nature Structural Biology 1994, 1:334). However, modifications were made as previously described (Dahiyat et al., Protein Science 1997, 6:1333). Specifically, the χ angles that were undetermined from database statistics were assigned the following values: Arg, -60°, 60° and 180°; Gln, -120°, -60°, 0°, 60°, 120° and 180°;
- Rotamers interacting with the backbone with energies greater than either 5 kcal/mol (subtilisin E) or 20 kcal/mol (T4 lysozyme) were eliminated from the calculation.
- Amino acid residues 1-4 and 269-274 of subtilisin E were fixed in the wild-type amino acid side chain conformations.
- for subtilisin E, an average of 121 rotamers per residue were considered, corresponding to about 3.2 x 10^4 one-body energies, 5.1 x 10^8 two-body energies, and a rotamer space of 10^497 combinations.
- a rotamer model was used to study antibody 4-4-20, a known anti-fluorescyl antibody. Rotamers that interact with the backbone at energies greater than 5 kcal/mol were eliminated from the calculation.
- V L light
- V H heavy chains
- An average of 140 rotamers per residue were considered, corresponding to 3.3 x 10^4 one-body energies, 5.3 x 10^8 two-body energies, and a rotamer space of 10^439 combinations. The high-resolution crystal structure was obtained.
- the probability that a beneficial mutation will occur at a highly coupled residue decreases dramatically.
- the evolutionary potential of an experiment can also be assessed.
- the number of generations before the optimum is reached can be estimated.
- the term k_off is a kinetic constant for the antibody, which is an expression of its affinity for the fluorescein antigen. As the antibody is mutated, the value for k_off may change, as shown in FIG. 7.
- the best mutants bind with femtomolar affinities (the left of the graph). As the fitness of the parent sequence increases, there is a trend towards mutations occurring at high entropy residues.
- "Structural tolerance" is preferably quantitated by counting the number of sequences (also referred to as "states") Ω that are compatible with a conformational energy.
- the "site entropy" s_i is a measure of the variability of the amino acid residue identity at position i among the different sequences consistent with a given energy E.
- amino acid probabilities p(i_a) may be calculated as the sum of the amino acid residue's rotamer probabilities, as determined by the mean-field theory methods described in Section 6.2.1, supra, or by keeping residues at other sites in their wild-type amino acid identities.
- e_mf(i_r) is the mean-field energy felt by rotamer r at position i
- K_i and K_j are the total number of rotamers at residues i and j, respectively
- p(j_s) is the probability that rotamer s exists at residue j, and is calculated at a "temperature" T, iterating between:
- the mean-field solution for subtilisin E required 8900 minutes on a single Silicon Graphics R10000 Processor running at 195 MHz and 2.1 gigabytes of physical memory.
- the mean-field solution for T4 lysozyme required 6402 minutes on the same computer system.
- Entropy in a mean-field model of the invention can be calculated from the probability distribution of allowed amino acid substitutions.
- the entropy s_i for a given site i is calculated from Equation 14, also discussed above:
- A is the total number of amino acids
- p(i_a) is the probability that amino acid a exists at position i
- the total sequence entropy is simply the sum of the site entropies
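The entropy bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration that assumes Equation 14 has the usual Shannon form, s_i = -Σ_a p(i_a) ln p(i_a); the function names and the example distributions are hypothetical and are not taken from the source.

```python
import math

def site_entropy(probs):
    """Site entropy s_i = -sum_a p(i_a) ln p(i_a) (assumed Shannon form)."""
    # Terms with p = 0 contribute nothing and are skipped to avoid log(0).
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def sequence_entropy(prob_table):
    """Total sequence entropy as the sum of the site entropies."""
    return sum(site_entropy(probs) for probs in prob_table)

# Hypothetical distributions over A = 20 amino acids at two positions.
uniform = [1.0 / 20] * 20    # every amino acid equally tolerated
fixed = [1.0] + [0.0] * 19   # only one amino acid tolerated

print(site_entropy(uniform))             # ln(20), roughly 3.0
print(site_entropy(fixed))               # zero for a fully constrained site
print(sequence_entropy([uniform, fixed]))
```

Under this assumed form, a fully tolerant position reaches the maximum value ln(20), and a position that accepts a single amino acid has zero entropy.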
- the mean-field theory is applied to calculate the amino acid probabilities required by Equation 14, as a function of the fitness. It is difficult, however, to do this with a fixed fitness. Instead, the thermodynamic equivalence of groups or ensembles can be used to work with a fixed average fitness ⟨F⟩, where the average is taken over all sequences corresponding to a "temperature" T.
- the "temperature" acts to generate a "variational" free energy. This means that the energy of the system can be changed or mathematically controlled via the "temperature" variable, as shown for example by these derivations.
- the variational free energy is:
- $$G(T) = \sum_{i} \sum_{a} p(i_a)\, f(i_a) + \sum_{i} \sum_{j<i} \sum_{a} \sum_{b} p(i_a)\, p(j_b)\, f(i_a, j_b) + T \sum_{i} \sum_{a} p(i_a) \ln p(i_a)$$
- the average energy is obtained from:
- Equations 26 and 28 constitute a set of self-consistent equations for p(i_a). Self-consistency is computed by iterating between these equations until the probabilities converge, according to a convergence criterion.
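The self-consistent iteration can be illustrated with a short Python sketch. The update equations themselves are not reproduced in this text, so the sketch assumes the standard mean-field form: the mean-field energy of a state is its one-body term plus the probability-weighted two-body terms from all other sites, and probabilities are Boltzmann weights of those energies at "temperature" T. The function name, the damping factor, and the convergence tolerance are illustrative assumptions.

```python
import math

def mean_field(one_body, two_body, T, tol=1e-8, max_iter=10000, damping=0.5):
    """Iterate site probabilities p[i][a] to self-consistency at "temperature" T.

    one_body[i][a]       : energy-like score of state a at site i
    two_body[i][j][a][b] : pair score of (i, a) with (j, b); ignored when i == j
    """
    n = len(one_body)
    p = [[1.0 / len(row)] * len(row) for row in one_body]  # uniform start
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            # Mean-field energy felt by each state at site i.
            eps = []
            for a in range(len(one_body[i])):
                e = one_body[i][a]
                for j in range(n):
                    if j != i:
                        e += sum(p[j][b] * two_body[i][j][a][b]
                                 for b in range(len(p[j])))
                eps.append(e)
            # Boltzmann weights, shifted by the minimum for numerical stability.
            e_min = min(eps)
            w = [math.exp(-(e - e_min) / T) for e in eps]
            z = sum(w)
            for a, wa in enumerate(w):
                new = damping * p[i][a] + (1.0 - damping) * wa / z
                delta = max(delta, abs(new - p[i][a]))
                p[i][a] = new
        if delta < tol:  # convergence criterion: largest change in any probability
            break
    return p

# Hypothetical two-site, two-state system with no coupling between the sites.
one_body = [[0.0, 1.0], [0.0, 0.0]]
two_body = [[[[0.0] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
warm = mean_field(one_body, two_body, T=1.0)
cold = mean_field(one_body, two_body, T=0.1)
print(warm[0], cold[0])
```

In this toy case, lowering the "temperature" concentrates the probability on the lower-energy state. Damping is a common stabilizer for such fixed-point iterations and is not something the source specifies.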
- decreasing the temperature is analogous to increasing the fitness.
- the probabilities calculated as the "temperature” decreases are used to calculate the average fitness and entropy. See, Equations 20 and 24. See also, FIG. 5.
- the list of sequences consistent with a "fitness” demonstrates the tolerance of each position to amino acid substitutions, as measured by the site entropies.
- the site entropy s_i is calculated for all possible amino acid residue substitutions at position i, as well as for the amino acid residue at position i of the parent. Site entropy values were calculated for both subtilisin E and T4 lysozyme according to Equation 28. A tabulation of the site entropy at each amino acid residue position in subtilisin E is shown in FIG. 4A. A corresponding plot for percent solvent exposed is shown in FIG. 4B. The distributions of site entropy values P(s_i) are provided in FIGS.
- the site entropy s_i is a measurement or indication of the number of amino acid residue substitutions that can be made at each residue i without disrupting the protein's structure. Specifically, those residue positions that are intolerant of mutations will have a low site entropy, whereas a residue position that is tolerant of mutations will have a relatively high value for its site entropy.
- FIG. 5A provides a plot of the distribution profile P(s_i) of site entropy values calculated for subtilisin E (see Section 6.2, supra).
- the top row of horizontal bars indicates mutations found from the in vitro evolution of subtilisin E in a screen for improved thermostability while retaining protein activity (Zhao & Arnold, Protein Engineering 1999, 12:47). These amino acid residue positions are listed in Table 1, below, along with their calculated site entropy values s_i.
- FIG. 5B provides a plot of the distribution profile of site entropy values calculated for T4 lysozyme. The row of horizontal bars on this plot indicates in vitro mutations to the enzyme that improved stability (Pjura et al., Protein Science 1993, 2:2217). The amino acid positions are listed in Table 2, below, along with their calculated site entropy values s_i.
- thermostability E may be used as an indicator of fitness F, and structure-based entropy predictions as described in Section 6.2, supra, may be used.
- proteins having enhanced activity at high temperatures may be identified by improving thermostability (see, e.g., Zhao & Arnold, Protein Engineering 1999, 12:47; Giver et al, Proc. Natl. Acad. Sci. U.S.A. 1998, 95:12809-12813).
- “high entropy” positions of a polypeptide. As an example, Table 3 below lists amino acid residue positions of subtilisin E where mutations improved that enzyme's reactivity in the organic solvent dimethyl formamide (Chen & Arnold, Proc. Natl. Acad. Sci. U.S.A. 1993, 90:5618; You & Arnold, Protein Engineering 1996, 9:77-83), along with the site entropy value s_i calculated for each residue. These mutations and their site entropy values are also indicated by the lower row of vertical bars in the site entropy distribution profile P(s_i) for subtilisin E shown in FIG. 5A. As with thermal stability, the mutations are strongly biased towards amino acid residues that have high site entropy values. Indeed, mutations at amino acid residues 181 and 218 produce mutants that have both enhanced thermal stability and enhanced activity in organic solvent.
- FIG. 6 illustrates the three dimensional structure of subtilisin E, with the site entropy profile of each amino acid residue indicated by its color.
- the yellow amino acid residues have the highest site entropy values (2.16 < s_i < 3.00; i.e., greater than about one standard deviation above the mean) and are therefore the most variable.
- the red residues have intermediate site entropy values (1.31 < s_i < 2.16; i.e., between the mean value and about one standard deviation above the mean).
- the gray residues are ones having below average entropy values (s_i < 1.31).
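The three-way binning used for the coloring (above one standard deviation over the mean, between the mean and that cutoff, and below the mean) can be sketched as follows. The entropy values in the example are invented for illustration, and the use of the population standard deviation is an assumption, since the text only says "about one standard deviation".

```python
import statistics

def classify_sites(entropies):
    """Bin positions by site entropy relative to the mean and one standard deviation."""
    values = list(entropies.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    bins = {}
    for pos, s in entropies.items():
        if s > mean + sd:
            bins[pos] = "high"            # most variable positions
        elif s > mean:
            bins[pos] = "intermediate"
        else:
            bins[pos] = "below average"
    return bins

# Hypothetical site entropies keyed by residue position.
example = {1: 0.2, 2: 1.0, 3: 1.5, 4: 2.9}
bins = classify_sites(example)
print(bins)
```

Positions landing in the "high" bin would be the natural first candidates for mutagenesis under the targeting strategy described in this section.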
- the positions with high site entropy values (e.g., greater than about one standard deviation above the mean site entropy value for residues of a particular biopolymer)
- below average solvent accessibility (e.g., less than about 24%)
- both the site entropy and solvent accessibility may be used to identify residues for mutation in directed evolution experiments.
- solvent accessibility is a less preferable parameter than site entropy.
- energies and site entropy values may be calculated based on all amino acid substitutions, and not merely on the wild-type amino acid identity as in solvent accessibility calculations.
- solvent accessibility is a useful but less preferred measurement for identifying residues that may be tolerant to mutations, e.g., in a directed evolution experiment.
- Table 4 compares site entropy values and solvent accessibilities at positions in the subtilisin E and T4 lysozyme proteins where positive mutations are found.
- Solvent accessibility was determined as described above, e.g., according to Lee and Richards (1971) J. Mol. Biol. 55, 379. Although most positive mutations are found at sites exposed to the solvent, some positive mutations are located at sites with poor solvent accessibility. The site entropy does indicate, however, that these sites will be tolerant to mutations.
- This residue, which is an isoleucine in the wild-type protein, is located on an α-helix with its side chain oriented towards and completely buried within the protein core.
- the packing of the side chains of the surrounding residues is such that several other amino acid residues may be sustained without affecting conformational energy.
- the T4 lysozyme calculations can also be represented in graphic form, as shown for example in FIG. 8. The percent functional improvement is plotted versus the entropy, as determined from the T4 lysozyme calculation.
- the functional improvement is measured as halo formation on a bacterial lawn and is compared against the wild-type halo formation (0%). For this system, the largest improvements in catalytic activity occurred at the high-entropy positions. Thus, targeting the high entropy positions increases the probability of finding the largest gains in catalytic activity.
- Mean-field theory is an approximate method whose accuracy is generally expected to worsen as the coupling in the system increases. See, Fischer, K. H., and Hertz, J. A., Spin Glasses. Cambridge University Press (1991).
- an algorithm can be used to calculate the entropy based on a series of minimizations performed by dead-end elimination. This dead-end elimination or "DEE" entropy algorithm calculates the substitution energy of all amino acids at all positions in the wild-type amino acid background. First, a residue is chosen (residue i) and the remaining residues in the structure are held in their wild-type amino acid identity. Then, residue i is assigned an amino acid identity a.
- the probability of each amino acid at each residue is calculated from the energies by taking the Boltzmann weight of the energies at each position:
- A = 20 is the total number of amino acids
- p(i_a) is the probability of amino acid a existing at residue i
- E(i_a) is the energy of amino acid a at residue i.
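The conversion from substitution energies to probabilities can be sketched as below. The equation itself is not reproduced in this text, so the sketch assumes the standard Boltzmann form, p(i_a) = exp(-E(i_a)/kT) / Σ_b exp(-E(i_b)/kT). The energy units (kcal/mol), the temperature of 300 K, and the example energies are all illustrative assumptions.

```python
import math

K_B = 0.0019872  # gas constant in kcal/(mol*K), used here as the Boltzmann factor

def boltzmann_probabilities(energies, temperature=300.0):
    """Convert substitution energies (kcal/mol) at one residue into probabilities."""
    kt = K_B * temperature
    e_min = min(energies.values())  # shift by the minimum for numerical stability
    weights = {aa: math.exp(-(e - e_min) / kt) for aa, e in energies.items()}
    z = sum(weights.values())
    return {aa: w / z for aa, w in weights.items()}

# Hypothetical substitution energies at one residue (three amino acids shown).
energies = {"A": 0.0, "G": 0.0, "W": 50.0}
probs = boltzmann_probabilities(energies)
print(probs)
```

Two equally favorable substitutions share the probability evenly, while a strongly destabilizing one receives essentially none. The resulting distribution is what the site entropy calculation would consume.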
- FIG. 9 shows a comparison of the entropy calculated by the mean-field algorithm and the DEE algorithm for T4 lysozyme. Both algorithms identify the same high-entropy positions, but differ in their rank ordering of low-entropy positions. This is a demonstration of the fact that the mean-field approximation is less accurate at highly coupled positions.
- the computational challenge is to identify clusters of residues that are collectively coupled, but remain uncoupled from the remainder of the residues in the enzyme.
- the invention employs an algorithm that exhaustively determines the optimal set of m residues to be targeted using multi-site combinatorial mutagenesis. First, all of the clusters of m-coupled residues are determined. Two residues are considered to be coupled if their side-chains are coupled (they have at least one rotamer that interacts above a threshold of 5 kcal/mol).
- Second, at each m-cluster of residues all combinations of amino acids are substituted, and the minimum-energy conformation of each combination is determined using the Dead-End Elimination algorithm. During the energy calculation, the remaining residues in the enzyme are held in their wild-type amino acid identity, but the conformations of the side-chains are allowed to adjust to the mutations at the clustered residues. For each of the m-residue clusters, the algorithm generates a list of 20^m energies corresponding to all of the possible amino acid substitutions. This gives the advantage of being able to study several approaches to targeting the optimal residues. For example, it is possible to target the cluster that has the highest entropy, the cluster with the most amino acid substitutions that lead to energies more stable than wild-type, or the cluster containing the lowest-energy amino acid combination.
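The cluster search can be illustrated with a small Python sketch. It makes two assumptions that the text does not settle: a "cluster" is taken to be a set of m residues in which every pair is coupled, and "interacts above a threshold" is read as the magnitude of a rotamer pair energy exceeding 5 kcal/mol. The sketch does not check that a cluster is uncoupled from the rest of the protein, and all names and data are hypothetical.

```python
from itertools import combinations

def coupled(pair_energies, threshold=5.0):
    """True if any rotamer pair of two residues interacts above the threshold (kcal/mol)."""
    return any(abs(e) > threshold for e in pair_energies)

def find_clusters(residues, rotamer_pair_energies, m, threshold=5.0):
    """Enumerate m-residue sets in which every pair of residues is coupled."""
    edges = {frozenset(pair)
             for pair, energies in rotamer_pair_energies.items()
             if coupled(energies, threshold)}
    return [c for c in combinations(sorted(residues), m)
            if all(frozenset(p) in edges for p in combinations(c, 2))]

def substitutions_per_cluster(m, alphabet=20):
    """Number of amino acid combinations to evaluate for one m-residue cluster."""
    return alphabet ** m

# Hypothetical rotamer pair energies between residue positions.
pair_energies = {
    (10, 11): [0.1, 7.2],
    (11, 12): [6.0],
    (10, 12): [5.5, 1.0],
    (12, 40): [0.3],
}
clusters = find_clusters([10, 11, 12, 40], pair_energies, m=3)
print(clusters)
print(substitutions_per_cluster(3))
```

Exhaustive enumeration grows quickly with m, both in the number of candidate clusters and in the 20^m substitutions per cluster, which is why each combination's minimum-energy conformation is delegated to an efficient search such as dead-end elimination.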
Abstract
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2001238397A AU2001238397A1 (en) | 2000-02-17 | 2001-02-16 | Computationally targeted evolutionary design |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18317100P | 2000-02-17 | 2000-02-17 | |
US60/183,171 | 2000-02-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001061344A1 true WO2001061344A1 (fr) | 2001-08-23 |
Family
ID=22671732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/005043 WO2001061344A1 (fr) | 2000-02-17 | 2001-02-16 | Conception evolutive a ciblage computationnel |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2001238397A1 (fr) |
WO (1) | WO2001061344A1 (fr) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001090346A3 (fr) * | 2000-05-23 | 2002-10-10 | California Inst Of Techn | Recombinaison de genes et mise au point de proteines hybrides |
WO2003075129A2 (fr) | 2002-03-01 | 2003-09-12 | Maxygen, Inc. | Procedes, systemes et logiciel pour identifier des biomolecules fonctionnelles |
WO2004022747A1 (fr) * | 2002-09-09 | 2004-03-18 | Nautilus Biotech | Evolution de proteines dirigee, rationnelle faisant appel au balayage rationnel bidimensionnel de la mutagenese |
EP1432980A2 (fr) * | 2001-08-10 | 2004-06-30 | Xencor | Automatisation de la conception des proteines pour l'elaboration de bibliotheques de proteines |
WO2004022593A3 (fr) * | 2002-09-09 | 2004-07-15 | Nautilus Biotech | Evolution rationnelle de cytokines pour une plus grande stabilite, les cytokines et molecules d'acide nucleique codant |
US6917882B2 (en) | 1999-01-19 | 2005-07-12 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US6961664B2 (en) | 1999-01-19 | 2005-11-01 | Maxygen | Methods of populating data structures for use in evolutionary simulations |
US7024312B1 (en) | 1999-01-19 | 2006-04-04 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US7058515B1 (en) | 1999-01-19 | 2006-06-06 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
WO2007030594A3 (fr) * | 2005-09-07 | 2007-05-24 | Univ Texas | Procedes d'utilisation et d'analyse de donnees de sequences biologiques |
US7315786B2 (en) | 1998-10-16 | 2008-01-01 | Xencor | Protein design automation for protein libraries |
US7379822B2 (en) | 2000-02-10 | 2008-05-27 | Xencor | Protein design automation for protein libraries |
US7430477B2 (en) | 1999-10-12 | 2008-09-30 | Maxygen, Inc. | Methods of populating data structures for use in evolutionary simulations |
US7647184B2 (en) | 2001-08-27 | 2010-01-12 | Hanall Pharmaceuticals, Co. Ltd | High throughput directed evolution by rational mutagenesis |
US7747391B2 (en) | 2002-03-01 | 2010-06-29 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
US7747393B2 (en) | 2002-03-01 | 2010-06-29 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
US7863030B2 (en) | 2003-06-17 | 2011-01-04 | The California Institute Of Technology | Regio- and enantioselective alkane hydroxylation with modified cytochrome P450 |
US8026085B2 (en) | 2006-08-04 | 2011-09-27 | California Institute Of Technology | Methods and systems for selective fluorination of organic molecules |
US8252559B2 (en) | 2006-08-04 | 2012-08-28 | The California Institute Of Technology | Methods and systems for selective fluorination of organic molecules |
US8802401B2 (en) | 2007-06-18 | 2014-08-12 | The California Institute Of Technology | Methods and compositions for preparation of selectively protected carbohydrates |
US9322007B2 (en) | 2011-07-22 | 2016-04-26 | The California Institute Of Technology | Stable fungal Cel6 enzyme variants |
2001
- 2001-02-16 AU AU2001238397A patent/AU2001238397A1/en not_active Abandoned
- 2001-02-16 WO PCT/US2001/005043 patent/WO2001061344A1/fr active Application Filing
Non-Patent Citations (1)
Title |
---|
DAHIYAT B.I. ET AL.: "De novo protein design: fully automated sequence selection", SCIENCE, vol. 278, 3 October 1997 (1997-10-03), pages 82 - 87 * |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7315786B2 (en) | 1998-10-16 | 2008-01-01 | Xencor | Protein design automation for protein libraries |
US7024312B1 (en) | 1999-01-19 | 2006-04-04 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US7620502B2 (en) | 1999-01-19 | 2009-11-17 | Maxygen, Inc. | Methods for identifying sets of oligonucleotides for use in an in vitro recombination procedure |
US6917882B2 (en) | 1999-01-19 | 2005-07-12 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US6961664B2 (en) | 1999-01-19 | 2005-11-01 | Maxygen | Methods of populating data structures for use in evolutionary simulations |
US7421347B2 (en) | 1999-01-19 | 2008-09-02 | Maxygen, Inc. | Identifying oligonucleotides for in vitro recombination |
US7058515B1 (en) | 1999-01-19 | 2006-06-06 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US7430477B2 (en) | 1999-10-12 | 2008-09-30 | Maxygen, Inc. | Methods of populating data structures for use in evolutionary simulations |
US7379822B2 (en) | 2000-02-10 | 2008-05-27 | Xencor | Protein design automation for protein libraries |
WO2001090346A3 (fr) * | 2000-05-23 | 2002-10-10 | California Inst Of Techn | Recombinaison de genes et mise au point de proteines hybrides |
EP1432980A2 (fr) * | 2001-08-10 | 2004-06-30 | Xencor | Automatisation de la conception des proteines pour l'elaboration de bibliotheques de proteines |
EP1432980A4 (fr) * | 2001-08-10 | 2006-04-12 | Xencor Inc | Automatisation de la conception des proteines pour l'elaboration de bibliotheques de proteines |
US7647184B2 (en) | 2001-08-27 | 2010-01-12 | Hanall Pharmaceuticals, Co. Ltd | High throughput directed evolution by rational mutagenesis |
US7747391B2 (en) | 2002-03-01 | 2010-06-29 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
US8762066B2 (en) | 2002-03-01 | 2014-06-24 | Codexis Mayflower Holdings, Llc | Methods, systems, and software for identifying functional biomolecules |
US10453554B2 (en) | 2002-03-01 | 2019-10-22 | Codexis Mayflower Holdings, Inc. | Methods, systems, and software for identifying functional bio-molecules |
EP1493027A4 (fr) * | 2002-03-01 | 2006-03-29 | Maxygen Inc | Procedes, systemes et logiciel pour identifier des biomolecules fonctionnelles |
EP1493027A2 (fr) * | 2002-03-01 | 2005-01-05 | Maxygen, Inc. | Procedes, systemes et logiciel pour identifier des biomolecules fonctionnelles |
US9996661B2 (en) | 2002-03-01 | 2018-06-12 | Codexis Mayflower Holdings, Llc | Methods, systems, and software for identifying functional bio-molecules |
US9864833B2 (en) | 2002-03-01 | 2018-01-09 | Codexis Mayflower Holdings, Llc | Methods, systems, and software for identifying functional bio-molecules |
US8849575B2 (en) | 2002-03-01 | 2014-09-30 | Codexis Mayflower Holdings, Llc | Methods, systems, and software for identifying functional biomolecules |
EP2390803A1 (fr) * | 2002-03-01 | 2011-11-30 | Codexis Mayflower Holdings, LLC | Procédés, systèmes et logiciel pour identifier des biomolécules fonctionnelles |
WO2003075129A2 (fr) | 2002-03-01 | 2003-09-12 | Maxygen, Inc. | Procedes, systemes et logiciel pour identifier des biomolecules fonctionnelles |
US7747393B2 (en) | 2002-03-01 | 2010-06-29 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
US7751986B2 (en) | 2002-03-01 | 2010-07-06 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
US7783428B2 (en) | 2002-03-01 | 2010-08-24 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
EP2315145A1 (fr) * | 2002-03-01 | 2011-04-27 | Codexis Mayflower Holdings, LLC | Procédés, systèmes et logiciel pour identifier des biomolécules fonctionnelles |
EP2278509A1 (fr) * | 2002-03-01 | 2011-01-26 | Maxygen Inc. | Procédés, systèmes et logiciel pour identifier des biomolécules fonctionnelles |
WO2004022747A1 (fr) * | 2002-09-09 | 2004-03-18 | Nautilus Biotech | Evolution de proteines dirigee, rationnelle faisant appel au balayage rationnel bidimensionnel de la mutagenese |
US7998469B2 (en) | 2002-09-09 | 2011-08-16 | Hanall Biopharma Co., Ltd. | Protease resistant interferon beta mutants |
US7611700B2 (en) | 2002-09-09 | 2009-11-03 | Hanall Pharmaceuticals, Co., Ltd. | Protease resistant modified interferon alpha polypeptides |
US8052964B2 (en) | 2002-09-09 | 2011-11-08 | Hanall Biopharma Co., Ltd. | Interferon-β mutants with increased anti-proliferative activity |
US8057787B2 (en) | 2002-09-09 | 2011-11-15 | Hanall Biopharma Co., Ltd. | Protease resistant modified interferon-beta polypeptides |
US7650243B2 (en) | 2002-09-09 | 2010-01-19 | Hanall Pharmaceutical Co., Ltd. | Rational evolution of cytokines for higher stability, the cytokines and encoding nucleic acid molecules |
US8105573B2 (en) | 2002-09-09 | 2012-01-31 | Hanall Biopharma Co., Ltd. | Protease resistant modified IFN beta polypeptides and their use in treating diseases |
WO2004022593A3 (fr) * | 2002-09-09 | 2004-07-15 | Nautilus Biotech | Evolution rationnelle de cytokines pour une plus grande stabilite, les cytokines et molecules d'acide nucleique codant |
US9145549B2 (en) | 2003-06-17 | 2015-09-29 | The California Institute Of Technology | Regio- and enantioselective alkane hydroxylation with modified cytochrome P450 |
US8741616B2 (en) | 2003-06-17 | 2014-06-03 | California Institute Of Technology | Regio- and enantioselective alkane hydroxylation with modified cytochrome P450 |
US7863030B2 (en) | 2003-06-17 | 2011-01-04 | The California Institute Of Technology | Regio- and enantioselective alkane hydroxylation with modified cytochrome P450 |
US8343744B2 (en) | 2003-06-17 | 2013-01-01 | The California Institute Of Technology | Regio- and enantioselective alkane hydroxylation with modified cytochrome P450 |
WO2007030426A3 (fr) * | 2005-09-07 | 2007-07-26 | Univ Texas | Procedes d'utilisation et d'analyse de donnees de sequences biologiques |
WO2007030594A3 (fr) * | 2005-09-07 | 2007-05-24 | Univ Texas | Procedes d'utilisation et d'analyse de donnees de sequences biologiques |
US8252559B2 (en) | 2006-08-04 | 2012-08-28 | The California Institute Of Technology | Methods and systems for selective fluorination of organic molecules |
US8026085B2 (en) | 2006-08-04 | 2011-09-27 | California Institute Of Technology | Methods and systems for selective fluorination of organic molecules |
US8802401B2 (en) | 2007-06-18 | 2014-08-12 | The California Institute Of Technology | Methods and compositions for preparation of selectively protected carbohydrates |
US9322007B2 (en) | 2011-07-22 | 2016-04-26 | The California Institute Of Technology | Stable fungal Cel6 enzyme variants |
Also Published As
Publication number | Publication date |
---|---|
AU2001238397A1 (en) | 2001-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20020045175A1 (en) | Gene recombination and hybrid protein development | |
WO2001061344A1 (fr) | Conception evolutive a ciblage computationnel | |
US20010051855A1 (en) | Computationally targeted evolutionary design | |
Tang et al. | Tools for predicting the functional impact of nonsynonymous genetic variation | |
JP4851687B2 (ja) | 定向進化のための交叉点の最適化 | |
Pierri et al. | Computational approaches for protein function prediction: a combined strategy from multiple sequence alignment to molecular docking-based virtual screening | |
Cesari et al. | Fitting corrections to an RNA force field using experimental data | |
EP3049979A1 (fr) | Modélisation prédictive à base de structure | |
Strokach et al. | Predicting changes in protein stability caused by mutation using sequence‐and structure‐based methods in a CAGI5 blind challenge | |
Caldararu et al. | Three simple properties explain protein stability change upon mutation | |
Beerens et al. | Evolutionary analysis as a powerful complement to energy calculations for protein stabilization | |
Linial et al. | Methodologies for target selection in structural genomics | |
Ferla et al. | Venus: elucidating the impact of amino acid variants on protein function beyond structure destabilisation | |
US20050003389A1 (en) | Computationally targeted evolutionary design | |
US20100049689A1 (en) | Phenotype prediction method | |
Yadegari et al. | In silico analysis for determining the deleterious nonsynonymous single nucleotide polymorphisms of BRCA genes | |
US20030032059A1 (en) | Gene recombination and hybrid protein development | |
Ramachandran et al. | Homology modeling: generating structural models to understand protein function and mechanism | |
Ivanov et al. | Bioinformatics platform development: from gene to lead compound | |
Wang et al. | Recent advances in predicting functional impact of single amino acid polymorphisms: A review of useful features, computational methods and available tools | |
Mlynsky et al. | Can We Ever Develop an Ideal RNA Force Field? Lessons Learned from Simulations of the UUCG RNA Tetraloop and Other Systems | |
Jani et al. | Protein analysis: from sequence to structure | |
Mohammadpour et al. | A comprehensive in silico analysis of the functional and structural consequences of the deleterious missense nonsynonymous SNPs in human GABRA6 gene | |
Mukhopadhyay et al. | Implication of TITIN Variations in Dilated Cardiomyopathy: Integrating Whole Exome Sequencing With Molecular Dynamics Simulation Study | |
Opuu | Computational design of proteins and enzymes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
122 | Ep: pct application non-entry in european phase | |
NENP | Non-entry into the national phase | Ref country code: JP |