WO2002014875A2 - Method and system for predicting amino acid sequences compatible with a specified three-dimensional structure
- Publication number
- WO2002014875A2 (application PCT/IL2001/000769)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- amino acid
- rotamer
- sequence
- acid sequence
- computer
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
- G01N33/6803—General methods of protein analysis not limited to specific proteins or families of proteins
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
Definitions
- This invention relates to the field of protein design and more particularly to the field of inverse-protein folding for de novo protein design.
- Proteins fold into a three-dimensional (3D) structure containing recurring motifs which pack together to form the 3D structure, the most common motifs observed being the α-helix, the β-turn, and parallel and anti-parallel β-sheets.
- The 3D structure of a protein may be characterized as having internal surfaces, which are the areas buried within the structure and thus directed away from the aqueous environment in which the protein is normally found; external surfaces, which are the areas exposed to the aqueous environment; and intermediate, or boundary, surfaces.
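The three surface classes described above can be illustrated with a minimal sketch that bins a position by its relative solvent accessibility. The function name `classify_position` and both cut-off values are assumptions made for illustration only; the text does not state how the classes are delimited.

```python
def classify_position(relative_sa, core_max=0.2, surface_min=0.5):
    """Bin a position into 'core', 'boundary' or 'surface'.

    relative_sa is the fraction of the residue exposed to solvent (0.0 to 1.0).
    core_max and surface_min are hypothetical thresholds, not values from the source.
    """
    if not 0.0 <= relative_sa <= 1.0:
        raise ValueError("relative_sa must lie between 0 and 1")
    if relative_sa < core_max:
        return "core"        # buried, directed away from the solvent
    if relative_sa >= surface_min:
        return "surface"     # exposed to the aqueous environment
    return "boundary"        # intermediate exposure


print([classify_position(x) for x in (0.05, 0.35, 0.80)])
```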
- Dahiyat and Mayo further adapted the algorithm by Desmet for the explicit exploration of sequence space using semi-empirical potential functions and stereochemical constraints, which were intended to capture most of the known contributions to protein stability.
- Desmet failed in expanding the range of computational protein design to residues of all parts of the protein: the buried core, the solvent-exposed surface, and the boundary between core and surface.
- the present invention relates to a computer-implemented method for predicting at least one amino acid sequence compatible with a specified three-dimensional (3D) structure of a protein or peptide, which method comprises the steps of:-
- step (c) randomly selecting one or more positions along the sequence provided in step (b) and applying on said position/s a simulation comprising one or more of the following scoring function calculating steps :- i) randomly selecting one or more amino acid residues of the same solvent accessibility as that defined for said position to obtain a mutation; ii) calculating an energy scoring function for each possible rotamer of the amino acid residue selected in step (i); iii) selecting a lowest energy scoring rotamer, or when more than one amino acid is manipulated simultaneously, selecting a lowest energy scoring rotamer combination; iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii); and v) assigning the selected amino acid residue or residues and their respective rotamer or rotamer combinations to said position/s and moving to another position along the sequence; said simulation steps are repeated until for each position along said sequence, the residue and residue's rotamer with the lowest energy score is selected, to obtain a virtual representation of
- the computer-implemented method of the invention comprises the steps of:-
- step (b) constructing a reduced virtual representation for the 3D structure provided in step (a);
- step (c) determining for each position along the virtual structure representation provided in step (b) its solvent accessibility; (d) constructing an initial amino acid sequence by assigning each position along the sequence an amino acid residue selected randomly from a predefined group of amino acids having a solvent accessibility (SA) compatible with the solvent accessibility determined for each position;
- step (e) randomly selecting one or more positions along the sequence provided in step (d) and applying on each position a Monte-Carlo simulation in sequence space and rotamer space, said simulation comprising one or more scoring function calculating steps which include:- i) randomly selecting one or more amino acid residues of the same solvent accessibility as that defined for said position to provide a mutation;
- step (ii) calculating an energy scoring function for each possible rotamer of each amino acid residue provided in step (i) based on their said reduced virtual representation;
- step (iii) selecting the lowest scoring rotamer or when more than one amino acid is manipulated simultaneously, selecting the lowest scoring rotamer combination; iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii), by applying, for example, the Metropolis algorithm; and
- step (v) assigning the amino acid residue or residues and their respective rotamer or rotamer combinations selected in step (iii) to said position/s and moving to another position along the sequence; said simulation steps are repeated until, for each position along said sequence, the residue and residue's rotamer with the lowest score is selected, to obtain a virtually represented amino acid sequence with the lowest total score;
- step (f) expanding the reduced representation of the amino acid sequence obtained in step (e) to its corresponding all-atom sequence representation thereby obtaining an amino acid sequence compatible with said predefined 3D structure; and
- the method may also provide the step of creating a computer output of the expanded all-atom representation of the amino acid sequence obtained in step (f).
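The simulation steps listed above can be sketched as a small Monte Carlo loop over sequence space and rotamer space. This is an illustrative sketch under stated assumptions, not the patented implementation: the residue groups in `GROUPS`, the rotamer count `N_ROTAMERS`, the placeholder `toy_energy` score (single-site only, with no residue-residue interactions) and the use of the physical Boltzmann constant in kcal/(mol·K) are all hypothetical. The reduced representation of step (b) and the all-atom expansion of step (f) are not modelled.

```python
import math
import random

# Hypothetical residue groups per solvent-accessibility (SA) class; not from the source.
GROUPS = {
    "core": "AVLIFMW",
    "boundary": "ASTVLIFYHQ",
    "surface": "DEKRNQSTG",
}
N_ROTAMERS = 3      # hypothetical number of rotamers per residue
K_B = 0.0019872     # Boltzmann constant in kcal/(mol*K); the unit choice is an assumption


def toy_energy(pos, res, rot):
    """Placeholder single-site score, deterministic per (position, residue, rotamer)."""
    return random.Random(f"{pos}:{res}:{rot}").uniform(-5.0, 5.0)


def best_rotamer(pos, res):
    """Steps (ii)-(iii): score every rotamer and return (rotamer, energy) of the lowest."""
    scored = ((rot, toy_energy(pos, res, rot)) for rot in range(N_ROTAMERS))
    return min(scored, key=lambda pair: pair[1])


def metropolis_accept(delta_e, temperature, rng):
    """Step (iv): accept downhill moves; accept uphill moves with Boltzmann probability."""
    if delta_e <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


def design(sa_classes, n_steps, temperature, seed=0):
    """Return (sequence, rotamers, total_energy) after n_steps Monte Carlo moves."""
    rng = random.Random(seed)
    # Step (d): random initial sequence drawn from the SA-compatible group of each position.
    seq = [rng.choice(GROUPS[c]) for c in sa_classes]
    best = [best_rotamer(i, r) for i, r in enumerate(seq)]
    rots = [b[0] for b in best]
    energies = [b[1] for b in best]
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))                # step (e): pick a random position
        res = rng.choice(GROUPS[sa_classes[pos]])    # step (i): mutate within the same SA class
        rot, e_new = best_rotamer(pos, res)
        if metropolis_accept(e_new - energies[pos], temperature, rng):
            seq[pos], rots[pos], energies[pos] = res, rot, e_new   # step (v)
    return "".join(seq), rots, sum(energies)


sequence, rotamers, total = design(["core", "surface", "boundary", "core"], 200, 300.0)
print(sequence, rotamers, round(total, 2))
```

Because the placeholder score has no pairwise terms, each position is optimized independently here; a scoring function with residue-residue interactions would require re-evaluating the neighbours of the mutated position as well.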
- the invention provides amino acid sequences which fold into predefined 3D structures, the amino acid sequences being obtained by the method of the present invention.
- the invention provides, in accordance with another of its aspects, a computer-based system for predicting an amino acid sequence compatible with a predefined 3D structure
- a computer device equipped with:- (a) input apparatus, such as a keyboard, for specifying said 3D structure; (b) a first memory for storing data indicative of the specified 3D structure; (c) a second memory having stored thereon an application program which, when running, provides at least one amino acid sequence compatible with the specified 3D structure; (d) a third memory for storing data indicative of said at least one amino acid sequence obtained; (e) a processor coupled to said input apparatus, and to said first, second and third memories for controlling said input apparatus, said first memory, second memory and third memory and for processing said computer program to obtain said amino acid sequence; and (f) optionally, a display unit coupled to said processor for displaying the amino acid sequence.
- the specified 3D structure may be obtained from a data bank accessible through the network or available on diskette, CD or tape which is then downloaded onto the first memory module.
- input apparatus signifies also any suitable means for connecting to a network and retrieving from available databanks accessible thereby the desired 3D structure.
- input apparatus also refers to any apparatus enabling retrieving such sequences from computer readable mediums, e.g. diskettes, CDs, tapes etc.
- the processor may be any computer device stored with an application utility which, when running on the computer device, enables the processing of the stored data so as to provide an amino acid sequence which substantially folds into a desired 3D structure, i.e. that specified in step (a) of the method of the invention. Such computer devices include, inter alia, a personal computer (PC, running either Windows or Linux), workstation computers (UNIX), a computer cluster or supercomputers.
- the invention provides a computer program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for predicting at least one amino acid sequence compatible with a specified three-dimensional (3D) structure of a protein or peptide, which method comprises the steps of:-
- step (a) providing a coordinate set representing backbone of said 3D structure and determining for each position along said sequence its solvent accessibility; (b) constructing an initial amino acid sequence by randomly assigning for each position along the structure an amino acid residue, the amino acid residue being selected randomly from groups of amino acids having solvent accessibility compatible with the solvent accessibility of said position; (c) randomly selecting one or more positions along the sequence provided in step (b) and applying on said position/s a simulation comprising one or more of the following scoring function calculating steps :- i) randomly selecting one or more amino acid residues of the same solvent accessibility as that defined for said position to obtain a mutation; ii) calculating an energy scoring function for each possible rotamer of the amino acid residue selected in step (i); iii) selecting a lowest energy scoring rotamer, or when more than one amino acid is manipulated simultaneously, selecting a lowest energy scoring rotamer combination; iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii); and
- the computer program storage device of the invention may, more particularly, perform the following method steps:-
- step (b) constructing a reduced virtual representation for the 3D structure provided in step (a);
- step (c) determining for each position along the virtual structure representation provided in step (b) its solvent accessibility
- step (e) randomly selecting one or more positions along the sequence provided in step (d) and applying on each position a Monte-Carlo simulation in sequence space and rotamer space, said simulation comprising one or more scoring function calculating steps which include :- i) randomly selecting one or more amino acid residues of the same solvent accessibility as that defined for said position to obtain a mutation; ii) calculating an energy scoring function for each possible rotamer of each amino acid residue provided in step (i) based on their said reduced virtual representation; iii) selecting the lowest scoring rotamer or when more than one amino acid is manipulated simultaneously, selecting the lowest scoring rotamer combination; iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii); and v) assigning the amino acid residue or residues and their respective selected rotamer or rotamer combinations selected in step (iii) to said position/s and moving to another position along the sequence;
- the simulation steps are repeated until for each position along said sequence, the residue and residue's rotamer with the lowest energy score is selected, to obtain a virtually represented amino acid sequence with the lowest total energy score;
- step (f) expanding the reduced representation of the virtually represented amino acid sequence obtained in step (e) to its corresponding all-atom sequence representation thereby obtaining an amino acid sequence compatible with the predefined 3D structure.
- the present invention provides a computer program product comprising a computer useable medium having computer readable program code embodied therein for predicting at least one amino acid sequence compatible with a specified three-dimensional (3D) structure of a protein or peptide, which computer program product comprising:
- computer readable program code for causing the computer to provide a coordinate set representing the backbone of said 3D structure and to determine for each position along said sequence its solvent accessibility; computer readable program code for causing the computer to construct an initial amino acid sequence by randomly assigning for each position along the structure an amino acid residue, the amino acid residue being selected randomly from groups of amino acids having solvent accessibility compatible with the solvent accessibility of said position; computer readable program code for causing the computer to randomly select one or more positions along the sequence and apply on said position/s a simulation comprising one or more of the following scoring function calculating steps:-
- step (ii) calculating an energy scoring function for each possible rotamer of the amino acid residue selected in step (i); (iii) selecting a lowest energy scoring rotamer, or when more than one amino acid is manipulated simultaneously, selecting a lowest energy scoring rotamer combination;
- step (iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii); and (v) assigning a selected amino acid residue or residues and their respective rotamer or rotamer combinations to said position/s and moving to another position along the sequence; computer readable program code for causing the computer to repeat the simulation steps until for each position along said sequence, the residue and residue's rotamer with the lowest energy score is selected, to obtain a virtual representation of an amino acid sequence with the lowest total energy score compatible with the specified 3D structure.
- the computer program product of the invention may, more particularly, comprise: computer readable program code for causing the computer to provide a coordinate set representing the backbone of said 3D structure;
- computer readable program code for causing the computer to construct a reduced virtual representation for the 3D structure provided in step (a);
- computer readable program code for causing the computer to determine for each position along the virtual structure representation provided in step (b) its solvent accessibility; computer readable program code for causing the computer to construct an initial amino acid sequence by randomly assigning for each position along the structure an amino acid residue selected randomly from a predefined group of amino acids having a solvent accessibility compatible with the solvent accessibility of said position; computer readable program code for causing the computer to randomly select one or more positions along the sequence provided in step (d) and applying on each position a Monte-Carlo simulation in sequence space and rotamer space, said simulation comprising one or more scoring function calculating steps which include:-
- step (ii) calculating an energy scoring function for each possible rotamer of each amino acid residue provided in step (i) based on their said reduced virtual representation;
- step (iv) determining whether to accept or reject the mutation with the rotamer or rotamer combination selected in step (iii);
- step (v) assigning the amino acid residue or residues and their respective selected rotamer or rotamer combinations selected in step (iii) to said position/s and moving to another position along the sequence;
- the computer program product may also comprise, in accordance with the present invention, computer readable program code for causing the computer to create a computer output of the expanded all-atom representation of the primary structure/s obtained in step (f).
- Fig. 1 shows the energy profile obtained for the Zif268 backbone by the system of the invention, during 10^5 iterations, at three temperatures: 1000K (continuous line), 500K (broken line) and 100K (dotted line). The temperature remained constant during the simulation. The energy of the initial random sequence, before the simulation was initiated, was +204 kcal/mol.
- Fig. 2 shows the energy profile obtained for the Zif268 backbone by the system of the invention during 10 iterations, at three maximal temperatures: 1000K (continuous line), 500K (broken line) and 100K (dotted line), using an annealing temperature profile with a periodicity of 500 Monte Carlo steps, during which the temperature is gradually decreased from its initial value to zero and then reset to its initial value for another cycle.
- The energy of the initial random sequence, before the simulation started, was +204 kcal/mol.
- Fig 3 shows the energies of the 20 lowest sequences generated by the algorithm at different simulation lengths and different temperatures, using an annealing temperature profile with periodicity of 500 Monte Carlo steps.
- Fig 4A-4C shows the three-dimensional crystallographic structure of Zif268 (Fig. 4A) compared with the 3D structure of the designed proteins A and B (Figs. 4B and 4C, respectively).
- Fig 5A-5C shows the 3D crystallographic structure of Zif268 (Fig 5A) compared with the three-dimensional structure of the designed proteins A and B (Figs 5B and 5C, respectively), after minimization of their side chains, displayed by spheres sized to the van der Waals radii of the atoms (not including hydrogens).
- Fig 6 shows a diagram of Gβ1 solvent accessibility, according to the present invention's methodology (black columns) and according to D&M (gray columns).
- Fig 7A-7B shows the 3D structure of Gβ1 overlaid on that of the designed sequence C from two different angles (Fig. 7A and 7B).
- the present invention relates in general to a method of predicting one or more amino acids compatible with a predefined 3D structure.
- the predefined 3D structure may be that of a native protein, polypeptide, a biologically functional derivative or fraction of the native protein or polypeptide, or any other biologically functional polymer for which the determination of its lowest free-energy structure is desirable.
- amino acid sequence refers to an amino acid sequence of a protein or polypeptide.
- the primary structure of a protein or polypeptide is the amino acid sequence wherein the location of disulfide bridges, if any exist, are indicated. The primary structure is thus a complete description of the covalent connections within the polymer.
- amino acid as used herein above and below means any organic compound possessing one or more amino groups and one or more carboxyl groups. Such amino acids may be naturally occurring L-amino acids, their corresponding D-isomers, synthetic amino acids, or any other variations of the same. Within this context, the term variant should be understood as including all possible modifications of the naturally occurring or synthetic amino acids, including deletions, insertions and substitutions of group/s therein.
- by amino acid residue it should be understood an amino acid, as defined above, which forms part of a chain, the chain consisting of two or more amino acid units.
- the coordinate set including the dihedral angles and specific bonds within the predefined 3D structure may be obtained from any suitable databank known to those versed in the art, such as the Protein Data Bank (PDB, supported by the RCSB consortium) and is preferably provided in a computer readable form to enable its easy input into the system of the invention.
- the 3D structure may be defined at will, without relying on any known 3D structure of any specific protein.
- Such novel 3D structures will agree with the general structure constraints of polypeptides, such as backbone geometries, as known to those versed in the art.
- a reduced virtual representation is first constructed for the predefined 3D structure.
- the reduced representation may be obtained by the methodology originally developed by Herzyk and Hubbard for use with dynamic simulated annealing [Herzyk P. and Hubbard R.E. Proteins 17:310-324 (1993)].
- the amino acids are represented by virtual spherical atoms, wherein the main chain of the protein, polypeptide or any other suitable polymer is represented by one virtual atom per residue located at the Cα position and the side chains are represented by one or more additional virtual atoms.
- the number of additional virtual atoms depends on the size and chemical composition of the specific side chain.
- one additional virtual atom will represent amino acid residues having only a β side-chain heavy atom or β and γ side-chain heavy atoms, e.g. serine (Ser, S), threonine (Thr, T), alanine (Ala, A), valine (Val, V), cysteine (Cys, C). Proline also forms part of this group, as its Cδ heavy atom is very close to its Cβ and Cγ atoms.
- Two additional virtual atoms will represent amino acid residues having β, γ and δ side-chain heavy atoms, β being represented by one virtual atom and γ and δ together by another virtual atom.
- the representation with two additional side-chain virtual atoms exhibits rotational flexibility around the Cα-Cβ bond, e.g. histidine (His, H), aspartic acid (Asp, D), asparagine (Asn, N), tyrosine (Tyr, Y), leucine (Leu, L), isoleucine (Ile, I), phenylalanine (Phe, F) and methionine (Met, M).
- Three additional virtual atoms will represent amino acids such as lysine (Lys, K), arginine (Arg, R), glutamic acid (Glu, E), glutamine (Gln, Q) and tryptophan (Trp, W).
- amino acids, other than the naturally occurring amino acids may be presented by virtual atoms in a similar manner.
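The grouping of residues by number of side-chain virtual atoms can be written down as a small lookup table. The following is a minimal sketch, not the patented implementation: the names `SIDE_CHAIN_VIRTUAL_ATOMS` and `virtual_atom_count` are hypothetical, and treating glycine (which the text does not list) as having no side-chain virtual atom is an assumption.

```python
# Side-chain virtual atoms per residue, following the three groups in the text.
# Glycine is not listed in the text; giving it 0 side-chain atoms is an assumption.
SIDE_CHAIN_VIRTUAL_ATOMS = {
    **dict.fromkeys(["Ser", "Thr", "Ala", "Val", "Cys", "Pro"], 1),
    **dict.fromkeys(["His", "Asp", "Asn", "Tyr", "Leu", "Ile", "Phe", "Met"], 2),
    **dict.fromkeys(["Lys", "Arg", "Glu", "Gln", "Trp"], 3),
}


def virtual_atom_count(sequence):
    """Total virtual atoms: one main-chain (C-alpha) atom per residue
    plus the side-chain virtual atoms of that residue."""
    return sum(1 + SIDE_CHAIN_VIRTUAL_ATOMS.get(res, 0) for res in sequence)
```

For example, a tripeptide Ala-Lys-Gly would be represented by 2 + 4 + 1 virtual atoms under these assumptions.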
- solvent accessibility is a feature assigned for each position along the chain of the folded protein or polypeptide. Each position is categorized as being either buried within the 3D structure (in an internal surface), exposed (part of an external surface) or within a boundary surface (intermediate position).
- the SA is determined by surrounding the reduced representation of the protein with a grid and calculating the number of grid points that fall into the intersection of the volume of every virtual atom with the volume of its neighbor virtual atoms, the volumes being determined according to the appropriate van der Waals radii [Bernstein F. C. et al. J. Mol. Biol. 112:535-542 (1977)].
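The grid-counting idea can be illustrated as follows. This is a sketch under stated assumptions rather than the method as implemented: the function name `buried_fraction`, the 0.5 grid spacing, and the choice to report the overlapped points as a fraction of all grid points inside the atom are hypothetical. The text does not give the cutoffs that separate buried, intermediate and exposed positions, so none are shown.

```python
import math


def buried_fraction(atoms, index, spacing=0.5):
    """Fraction of grid points inside virtual atom `index` that also fall
    inside at least one other virtual atom.

    atoms: list of (x, y, z, radius) tuples for the virtual atoms.
    The grid is a global lattice with the given spacing (assumed value).
    """
    x0, y0, z0, r0 = atoms[index]
    others = [a for i, a in enumerate(atoms) if i != index]
    # Lattice index range covering the bounding box of the atom.
    lo = [math.floor((c - r0) / spacing) for c in (x0, y0, z0)]
    hi = [math.ceil((c + r0) / spacing) for c in (x0, y0, z0)]
    inside = overlapped = 0
    for i in range(lo[0], hi[0] + 1):
        for j in range(lo[1], hi[1] + 1):
            for k in range(lo[2], hi[2] + 1):
                p = (i * spacing, j * spacing, k * spacing)
                if math.dist(p, (x0, y0, z0)) > r0:
                    continue  # grid point lies outside this atom
                inside += 1
                if any(math.dist(p, (x, y, z)) <= r for x, y, z, r in others):
                    overlapped += 1
    return overlapped / inside if inside else 0.0
```

An isolated atom gives 0.0 and an atom fully enclosed by a neighbor gives 1.0; a per-position value would then be compared against whatever thresholds the classification uses.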
- Each type of position i.e. buried, exposed or intermediate, may be occupied by several amino acid residues.
- Hydrophobic amino acids being able to form a hydrophobic core, are assigned to the buried positions of the 3D structure, while hydrophilic amino acids are assigned for the solvent-exposed positions. Boundary positions, between those two environments can be occupied by both types of amino acids.
- the buried positions may be occupied by amino acids selected from the group consisting of Ala, Tyr, Trp, Val, Leu, Ile, Phe, Met, Cys, Pro, Gly and variants thereof, all of which are hydrophobic in nature.
- the exposed positions may be occupied by amino acid residues selected from the group consisting of Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr and variants thereof, all of which are hydrophilic in nature.
- the positions assigned an intermediate level of SA may be occupied by all types of amino acids, particularly those which serve in nature as building blocks for proteins, i.e. Pro, Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr, Gly, Ala, Tyr, Trp, Val, Leu, Ile, Phe, Met, Cys and variants thereof.
- Special assignment patterns that deviate from the general SA-based assignment may be set for particular positions in the protein's 3D structure. Such an assignment may be introduced, for example, to preserve buried salt bridges.
- (1) an amino acid for every Cα position is selected randomly, taking into account the solvent accessibility of that position (buried, exposed or intermediate). Alternatively, this selection may be applied only to a sub-set of the polypeptide's amino acid residues, leaving the other positions fixed throughout the design process; (2) the appropriate bonds and angles of the protein's reduced representation are assigned based on the coordinate set provided; and (3) one of the physically permissible rotamers that characterizes each of the amino acids is assigned.
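Step (1) above, the construction of a random initial sequence constrained by solvent accessibility, might be sketched as below. The residue groups are taken from the text; the names `ALLOWED`, `initial_sequence` and the `fixed` argument (positions left unmutated) are hypothetical, and bond, angle and rotamer assignment (steps 2 and 3) are not covered.

```python
import random

# Residue groups per solvent-accessibility class, as listed in the text.
BURIED = ["Ala", "Tyr", "Trp", "Val", "Leu", "Ile", "Phe", "Met", "Cys", "Pro", "Gly"]
EXPOSED = ["Lys", "Arg", "His", "Glu", "Asp", "Gln", "Asn", "Ser", "Thr"]
ALLOWED = {"buried": BURIED, "exposed": EXPOSED, "intermediate": EXPOSED + BURIED}


def initial_sequence(sa_classes, fixed=None, rng=random):
    """Pick one residue per position at random from the group allowed for
    that position's class; positions listed in `fixed` keep their residue."""
    fixed = fixed or {}
    return [fixed.get(i) or rng.choice(ALLOWED[c]) for i, c in enumerate(sa_classes)]
```

A call such as `initial_sequence(["buried", "exposed", "intermediate"], fixed={2: "Gly"})` returns a three-residue sequence whose last position stays glycine.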
- a scoring function is applied to evaluate the effect of changing the amino acid sequence and residue rotamer for a given 3D backbone structure.
- the scoring function used according to the present invention has two main contributions: a residue-residue interaction term and a residue's secondary-structure propensity term.
- the interaction part of the scoring function is based in part on a Lennard-Jones-like potential function, multiplied by the effective attractive inter-residue contact energies (εij).
- εij: effective attractive inter-residue contact energies
- each side chain is represented by one or more virtual atoms
- the energy contribution of each residue-residue pair is divided among the virtual atoms such that the sum over energy contributions of the virtual atom pairs (one from each residue) equals the effective residue-residue interaction.
- the Lennard-Jones potential function is modified to make the effect of repulsion smaller (because the virtual atoms are 'softer' than real atoms).
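The text does not give the exact modified functional form, so the following is only an illustrative sketch of a softened Lennard-Jones-like term. Assumptions: a generic n-m potential with a hypothetical repulsive exponent of 8 instead of 12, a single contact distance `r0` for all virtual-atom pairs, equal division of the residue-residue contact energy among the virtual-atom pairs, and `e_ij` passed as a positive well depth. The names `soft_lj` and `residue_pair_energy` are hypothetical.

```python
import math


def soft_lj(r, r0, well_depth, rep_exp=8, att_exp=6):
    """n-m Lennard-Jones-like potential with its minimum, -well_depth, at r = r0.
    With rep_exp=12 and att_exp=6 this reduces to the standard 12-6 form."""
    x = r0 / r
    return well_depth / (rep_exp - att_exp) * (att_exp * x**rep_exp - rep_exp * x**att_exp)


def residue_pair_energy(atoms_a, atoms_b, e_ij, r0):
    """Sum over virtual-atom pairs (one atom from each residue).
    Each pair receives an equal share of the residue-residue contact energy,
    so that ideal contacts add up to -e_ij (equal sharing is an assumption)."""
    share = e_ij / (len(atoms_a) * len(atoms_b))
    return sum(soft_lj(math.dist(a, b), r0, share) for a in atoms_a for b in atoms_b)
```

At a separation of 0.8 `r0`, the 8-6 form above gives a noticeably smaller repulsive energy than the 12-6 form, which is the qualitative behaviour the text describes for 'softer' virtual atoms.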
- the effective contact energies between two amino acids may be, for example, those calculated by Miyazawa and Jernigan [Miyazawa S. and Jernigan R.L. J. Mol. Biol. 256:623-644 (1996)].
- the basic assumption on which the contact energies are calculated according to this model is that the average characteristics of residue-residue contacts, observed in a large number of crystal structures of globular proteins, represents the actual intrinsic inter-residue close contacts of protein structures.
- the total energy score of the protein is calculated by adding a residue-specific "potential" for α-helical and β-sheet states. These terms may be, for example, those calculated by Bahar et al. [Bahar I. et al Proteins 29:292-308 (1997)]. These so-called potentials are added only if the residue is situated in an α-helical or a β-sheet region of the 3D backbone template, according to the secondary structure of the designed protein or polypeptide.
- the scoring function is preferably applied as part of a Monte Carlo simulation which combines a search in the sequence space for amino acid residues and in the specific rotamer space of each residue. This process provides the system with the optimal sequence for a given backbone.
- optimal sequence refers to an amino acid sequence compatible with the predefined 3D structure and having the lowest total score.
- sequence space refers to the total number of possible different sequences for a given number of different residues and a given number of residues in the protein, polypeptide or any other appropriate polymer, e.g. for a protein of 100 residues, composed of 20 different amino acids, the sequence space will contain 20^100 possible sequences.
- Rotamer space refers to the total number of physically permissible conformations for a residue in a given amino acid sequence.
- the advantage of the combined reduced representation of the side chains and the grouping of amino acids and structure sites according to the solvent accessibility relies in the high efficiency of searching through both sequence space and rotamer space.
- the combined simplifications dramatically reduce the search space, while retaining a physically reasonable representation that can accurately account for rotamer flexibility.
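The effect of the solvent-accessibility grouping on the size of sequence space can be checked with a few lines of arithmetic. This sketch counts sequences only (rotamer space is not included); the group sizes 11, 9 and 20 are the numbers of residues listed in the text for buried, exposed and intermediate positions, and the function names are hypothetical.

```python
import math

# Number of residue types allowed per solvent-accessibility class (from the text).
ALLOWED_PER_CLASS = {"buried": 11, "exposed": 9, "intermediate": 20}


def full_sequence_space(length, alphabet=20):
    """Unrestricted sequence space: alphabet ** length."""
    return alphabet ** length


def grouped_sequence_space(sa_classes):
    """Sequence space when each position may only take residues of its class."""
    return math.prod(ALLOWED_PER_CLASS[c] for c in sa_classes)
```

For a 100-residue protein, `full_sequence_space(100)` is 20^100, a 131-digit number, whereas every buried or exposed position in the grouped scheme contributes a factor of 11 or 9 instead of 20.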
- the search in sequence space begins when up to three positions along the protein are simultaneously randomly selected and replaced with different amino acids (each replacement referred to as a 'mutant' in that specific 'trial configuration').
- the replacing amino acids are selected randomly from the group of amino acids having the same characteristics (buried, exposed or intermediate) as defined for the specific replaced positions to form the 'mutant'.
- a search in rotamer space begins by calculating the total energy score of the new sequence for every allowed rotamer or rotamer combination of the mutated amino acids (not all rotamers are allowed, as described by Ponder and Richards [Ponder J.W and Richards F. M. J. Mol. Biol.
- the energy score difference ΔE between the energy score of the trial configuration (the lowest energy score among all allowed rotamers of the new mutant, or mutants if more than one amino acid is replaced) and the energy of the last accepted configuration is calculated.
- the Metropolis algorithm [Metropolis N. & Ulam S. J. Am. Stat. Ass. 44:335-341 (1949)] is used to determine whether the new trial sequence is accepted or rejected. If ΔE is negative, the mutation is accepted with the best rotamers; otherwise, the trial configuration is accepted with a probability determined according to the Boltzmann factor $e^{-\Delta E / (k_B T)}$ (T being either a fixed or a varying annealing temperature, as will be described hereinbelow).
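The acceptance rule can be sketched in a few lines. This is a generic Metropolis criterion under stated assumptions, not the patented code: the names `trial_delta` and `metropolis_accept` are hypothetical, energies are assumed to be in kcal/mol so that the Boltzmann constant is used per mole (the gas constant), and a temperature parameter of zero is treated as rejecting every uphill move.

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant per mole (gas constant), kcal/(mol*K)


def trial_delta(rotamer_scores, last_accepted_energy):
    """Delta E of a trial mutation: best (lowest) score over the allowed
    rotamers or rotamer combinations, minus the last accepted energy."""
    return min(rotamer_scores) - last_accepted_energy


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always; accept uphill moves with
    probability exp(-delta_e / (K_B * T))."""
    if delta_e <= 0:
        return True  # delta_e == 0 would be accepted with probability 1 anyway
    if temperature <= 0:
        return False  # assumed behaviour at T = 0: no uphill moves
    return rng.random() < math.exp(-delta_e / (K_B * temperature))
```

At 1000K, `K_B * T` is about 2 kcal/mol, so a trial that raises the score by 2 kcal/mol would be accepted roughly 37% of the time under these assumptions.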
- the search continues through a large number of trials (steps) in order to allow the score to decrease and converge. This number depends on the size of the protein and on the number of residues that are allowed to be mutated (in case the design is just of a certain part of the protein).
- the lowest "scored" sequences are selected.
- the final optimal sequence is the one associated with the lowest total energy score found during the optimization process.
- the resulting sequence is then expanded to its corresponding 3D all-atom representation (as opposed to the virtual representation).
- This all atom representation may then be either saved in a computer readable form or extracted in the form of a computer output. Also collected are additional low energy score sequences, in order to enable analyzing relative consistency patterns of residues in a given position.
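Collecting a fixed number of the lowest-scoring sequences during the search can be done with a bounded heap. The sketch below is one possible bookkeeping scheme, not the one used by the inventors; the class name `LowestScores` is hypothetical, and the default of 20 retained sequences simply mirrors the 20 lowest sequences reported for Fig 3.

```python
import heapq


class LowestScores:
    """Keep the `keep` distinct sequences with the lowest energy scores."""

    def __init__(self, keep=20):
        self.keep = keep
        self._heap = []     # max-heap on score, stored as (-score, sequence)
        self._seen = set()  # sequences currently retained

    def offer(self, score, sequence):
        key = tuple(sequence)
        if key in self._seen:
            return
        if len(self._heap) < self.keep:
            heapq.heappush(self._heap, (-score, key))
            self._seen.add(key)
        elif score < -self._heap[0][0]:
            # Replace the worst retained sequence with the new, lower one.
            _, dropped = heapq.heapreplace(self._heap, (-score, key))
            self._seen.discard(dropped)
            self._seen.add(key)

    def best(self):
        """Retained (score, sequence) pairs, lowest score first."""
        return sorted((-s, list(k)) for s, k in self._heap)
```

Calling `offer` after every accepted Monte Carlo step keeps memory use constant, and the retained set can afterwards be inspected position by position for recurring residues.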
- the all atom 3D model that is constructed from the novel sequence can be analyzed.
- One way of performing such an evaluation is to compare the structure of the designed protein (after standard all-atom minimization of its side chains) with the structure of the model after molecular dynamics simulation in water, or to compare the molecular mechanics energy of the wild-type protein from which the 3D structure used was taken with that of the designed protein, after molecular dynamics of the latter.
- Molecular dynamics programs, such as CHARMM [Brooks, B.R. et al. J. Comp. Chem. 4:187-219 (1983)] may be utilized for this purpose, as illustrated in the following Examples.
- the amino acid sequence designed by the method of the invention is a de novo sequence and preferably a sequence, which under physiological conditions folds substantially into the desired 3D structure. More preferably, the amino acid sequences obtained are biologically functional.
- the designed amino acid sequence is chemically synthesized by procedures known in the art.
- the novel amino acid sequence is used to create a nucleic acid sequence, such as DNA, which encodes the optimal sequence.
- a man versed in the art would know based on the existing technologies how to deduce at least one nucleic acid which will encode the amino acid sequence designed.
- the nucleic acid sequence obtained may then be cloned into a host cell and expressed.
- the choice of codons, suitable expression vectors and suitable host cells may vary depending on a number of factors, and can be easily optimized as needed.
- the novel amino acid sequence may be experimentally evaluated and tested for structure, function and stability, as required. This will be performed as is known in the art and will depend in part on the original protein from which the sequence's backbone structure was taken.
- the designed protein will be more stable than the known protein used as the starting point, although, at times, if some constraints are placed on the method disclosed herein, the designed sequence may be less stable. For example, it is possible to fix certain residues for altered biological activity and find the most stable sequence, but it may still be less stable than the wild type protein.
- Stable in this context includes, but is not limited thereto, thermal stability, i.e. an increase in the temperature at which reversible or irreversible denaturing starts to occur; proteolytic stability, i.e.
- the proteins of the invention may be used in a variety of applications, ranging from industrial to pharmacological uses, depending on the protein.
- Examples of the different uses are in biotechnology manufacturing of therapeutic peptides and proteins, in gene therapy, design of modified therapeutic peptides and proteins as pharmaceuticals, etc.
- Another application of the invention disclosed herein may be the generation of a library of small stable protein elements that can be later assembled in various ways to design a sequence for a novel larger protein with a desired 3D structure. Yet further, the method of the present invention may be applicable for optimizing the novel larger protein obtained thus ensuring that the peptides from which it was constructed indeed fit the structure.
- the invention further provides amino acid sequences substantially compatible with a specified 3D structure, the amino acid sequences being obtained by the method of the present invention.
- a computer-based system for predicting an amino acid sequence compatible with a specified 3D structure comprising the constituents as defined hereinbefore and after.
- the input apparatus, such as a keyboard, employed by the system of the invention is used for entering a selected set of coordinates representing the predefined 3D structure and other data such as scoring function and optimization process parameters.
- the first and third memory means being preferably a RAM (random access memory) are used for storing the initial and final data while the second memory means, being preferably a ROM (read-only memory) are used to store the program of the method of present invention.
- the system comprises a microprocessor for performing, under control of the stored program, the steps of processing the entered data and displaying via a display unit or printer the novel amino acid sequence.
- a user enters the coordinate set for the predefined 3D structure from an optional, auxiliary storage unit.
- the system inputs the data for processing, stores the data in memory then processes it as described.
- the comparison between the all-atom 3D structure of the designed protein (after minimization of its side chains) with its structure after molecular dynamics simulation is carried out in the following specific Examples using the CHARMM molecular dynamics program [version 29, Brooks B.R. et al. (1983) ibid.]. Further, the comparison between the averaged energy of the designed protein after dynamics with the energy of the native protein is carried out using CHARMM forcefield [Mackerell A.D. et al. J. Phys. Chem. 102:3586-3616 (1998)]. The minimization and the molecular dynamics are performed when the protein is embedded in a water sphere. For native proteins, the coordinates are based on the information provided from PDB.
- the conformation of the designed protein is composed of the backbone conformation of the native protein and the side chains conformation of the new residues, according to the best rotamers chosen by the method of the invention.
- CHARMM applies two minimization algorithms to the protein's side chains, Steepest Descent (SD) and Adopted Basis Newton-Raphson (ABNR). After the minimization of the side chains and of the water surrounding the protein, CHARMM performs molecular dynamics of the protein.
- Zif268 is a well recognized protein. This protein is small enough to be both computationally and experimentally tractable, yet large enough to form an independently folded structure in the absence of disulfide bonds or metal binding. Although this motif consists of fewer than 30 residues, it does contain sheet, helix and turn structures.
- the entire amino acid sequence (the buried core, the solvent-exposed surface and the boundary between core and surface), except for Gly27, which was not mutated during the simulation, was computed.
- the input coordinates are those of residues 33-60 of the native protein, obtained from the X-ray structure coordinates of the Zif268 immediate early gene (krox-24) complex with an 11 base pair DNA fragment (Protein Data Bank (PDB) code: 1ZAA), as determined at 2.1 Å resolution by Pavletich and Pabo [Pavletich N.P. and Pabo C.O. Science 252:809-817 (1991)]. Recently, this protein was also analyzed by Dahiyat & Mayo (13).
- SA Solvent accessibility
- FIG. 1 presents the energy profile at three constant temperature parameters, 100K, 500K and 1000K.
- Figure 2 presents the energy profile using an annealing profile of the temperature parameters.
- the maximal temperatures were also 100K, 500K and 1000K and the periodicity was 500 Monte Carlo steps. Namely, during each cycle of 500 MC steps the temperature parameter is gradually reduced until it reaches zero, at which point the temperature parameter is set again to its initial value for a new 500 step annealing cycle to begin.
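The periodic annealing profile described above can be expressed as a function of the Monte Carlo step number. The text says only that the temperature parameter is "gradually" reduced to zero within each cycle, so the linear ramp below is an assumption, as is the function name `annealing_temperature`.

```python
def annealing_temperature(step, t_max, period=500):
    """Temperature parameter at a given Monte Carlo step.

    Within each cycle of `period` steps the value falls linearly (assumed
    shape) from t_max to zero, then resets to t_max for the next cycle.
    """
    phase = step % period
    return t_max * (1.0 - phase / (period - 1))
```

With `t_max=1000` and the default period, steps 0 and 500 return 1000.0 and step 499 returns 0.0, which reproduces the reset behaviour of the 500-step cycle.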
- the total size of the search space was 3.41x10 but in all cases within less than 2000 iterations the algorithm reached the range of stable sequence between -270 kcal/mol and -300 kcal/mol, according to the scoring function.
- Figs. 1 and 2 show that the optimization reaches lower scores when the periodic annealing temperature profile is used. This profile enables the program to escape local minima by accepting high-energy sequences that would not be accepted, but at the same time, to optimize locally when the temperature is reduced.
- the search reached the lowest values for the scoring function when the initial temperature parameter was 1000K and an annealing profile was used.
- Figure 3 shows the energies of the 20 lowest sequences generated by the algorithm with different simulation lengths and different temperatures, using an annealing temperature profile with a periodicity of 500 Monte Carlo steps. The results show that at 500K the algorithm converges after 10^6 iterations, at 1000K after 10^5 and at 100K after 10 iterations and reached different energies each time.
- the length of 10^6 iterations of the zinc finger protein simulation required one CPU hour on a single alpha processor workstation, and about 1.5 hours on a Pentium III PC.
- Each simulation began with a different random seed but, with the same 3D backbone template.
- the set of 50 simulations was repeated twice, each set with different solvent accessibility (SA) assignment for the protein residues.
- the first set used the present invention's automated solvent accessibility algorithm and the second set used Dahiyat and Mayo's (D&M) fitted assignments (see Table 1).
- Tables 2A and 2B present the lowest energy sequences obtained in the first and second sets (A and B respectively), aligned with the second zinc finger module of the DNA binding protein Zif268 and with the D&M designed sequence, FSD-1.
- the coordinates used for the FSD-1 ββα motif score evaluations are the experimental NMR coordinates (PDB code 1FSD), which were found by D&M.
- All the energy scores in Table 2B were calculated according to the method of the present invention's reduced representation of amino acids and its scoring function.
- the scores of A and B were found to be lower than both the Zif268 score (without considering the His and Cys Zn-binding interactions, which are not included in the scoring function) and the FSD-1 score.
- the energy score of the most stable sequence, A is -351.8kcal/mol.
- This score is lower than the Zif268 score by 111.3 kcal/mol, which is a significant difference (not taking into account the Zn interactions).
- the relative stability of both A and B sequences in comparison to the FSD-1 sequence may be in part due to the fact the FSD-1 sequence was designed with a different scoring function.
- Positions 21 and 25 of the optimal sequences were selected to be Phe or Met (position 21) and Leu (position 25) side chains. In the original Zif268, these positions were occupied by the zinc-binding His residues. These positions are more than 80 percent buried. Position 5, which is 100 percent buried, was predominantly selected to be Val. The other boundary positions demonstrate the steric constraints on buried residues by packing side chains similar to those of the original Zif268 sequence.
- position 5 on the exposed sheet surface was selected by the algorithm to be Val, which is a very good β-sheet forming residue, and positions 4 and 10 (and 11 only in sequence B) were selected to be Thr, which is also a good β-sheet forming residue.
- Sequences A and B were further examined by secondary structure prediction using the SSPAL predictor at the Sanger Centre [Salamov A. A. and Solovyev N. V. J. Mol. Biol. 247:11-15 (1995)], which predicts the secondary structure of a protein according to its primary structure (amino acid sequence).
- Table 3 presents the secondary structure of the native protein (Zif268) according to the Protein Data Bank (PDB), and A and B secondary structure prediction, according to SSPAL algorithm at Sanger Centre.
- A was predicted to have one α-helix (designated H) and two β-strands (designated E) (the ββα motif), while the predicted secondary structure of B contained only one α-helix and one β-strand.
- Table 3 Secondary structure of predicted primary structures A and B.
- the reduced representation of the lowest energy designed sequences A and B was expanded to an all-atom representation, using the molecular mechanics package CHARMM.
- the input for this experiment was the backbone coordinates of the native protein, the new designed residues and the dihedral angles of each position along the designed sequence derived from the rotamer with the lowest energy score.
- the number of atoms of A and B after expansion to all atoms, were 459 and 446, respectively.
- Energy minimization was performed for the side chains of A and B as well as for the Zif268 side chains using the CHARMM forcefield, the SHAKE algorithm [Van Gunsteren W.F. & Berendsen H.J.C. Mol. Phys.
- Figures 4A-4C show the 3D structure of Zif268 as compared to that of the designed proteins A and B after minimization of their side chains, focusing on the core, which includes hydrophobic side chains in A and B, instead of the zinc ion chelated by two cysteines and two histidines in the native protein.
- the same structures are presented in Fig. 5 but with the core side chains displayed by spheres sized to the van der Waals radii of the atoms, which indicate the good packing of the core in the designed sequences.
- Gβ1 Streptococcal protein G
- Gβ1 is derived from a larger multi-domain cell surface protein that functions with high affinity binding to the Fc region of IgG. It comprises six β-strands and one α-helix.
- An extremely hyperthermophilic variant of the β1 domain of Streptococcal protein G was examined.
- Streptococcal protein G was already reported by Mayo and collaborators.
- Solvent accessibility was evaluated as described in Example 1.
- a comparison of the results obtained by the method of the present invention and that of Malakauskas and Mayo (M&M, using mainly the Connolly algorithm referred to hereinbefore) is presented in Figure 6.
- 22 positions were found to be exposed in both cases, 11 were found to be buried and 10 in an intermediate level of solvent accessibility.
- the 13 remaining residues were classified differently in the two methods. In 10 of these cases the discrepancy was between the subtle definition of a site as being "buried” or "intermediate”.
- the solvent accessibility of the eighth position (position 8) selected for optimization was identical in two classification schemes.
- the energy score profile for Gβ1 was obtained in the same manner as described for Zif268. Further, the energy score profile obtained for Gβ1 was similar to that obtained for Zif268 (Figs. 1 and 2), using the same temperature conditions. However, since only 8 out of 56 residues (14.3%) were mutated, the initial energy was already negative, while the final energies obtained were approximately -770 and -790 kcal/mol when an annealing temperature profile was used and the maximal value for the temperature parameter was 1000K.
- Two sets of 50 simulations were conducted.
- in the first set of simulations, the non-mutated positions were kept in their native rotameric conformation, while in the second set of simulations, the rotameric states of the side chains of the non-mutating residues were allowed to change, thus providing a larger number of possible mutations and rotamer combinations.
- the total size of the search space in the first set was 1.06x10^13 and in the second set was 2.52x10^31.
- Each simulation in the first set was terminated after 10 iterations with a maximal temperature of 1000K and an annealing periodicity of 100 MC steps.
- Each simulation in the second set was terminated after 10^5 iterations, with a maximal temperature of 1000K and an annealing periodicity of 1000 MC steps.
- Table 4 presents the mutated residues in the lowest energy sequences among the 50 simulations conducted for each set (C and D respectively).
- the buried position 3 changed from Tyr mainly to Leu, which is more hydrophobic and is predicted to improve side chain packing in the interior of the protein.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2001280077A AU2001280077A1 (en) | 2000-08-16 | 2001-08-16 | Method and system for predicting amino acid sequence |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IL137886 | 2000-08-16 | ||
IL13788600A IL137886A0 (en) | 2000-08-16 | 2000-08-16 | A novel algorithm for de novo protein design |
US09/718,425 | 2000-11-24 | ||
US09/718,425 US7751987B1 (en) | 2000-08-16 | 2000-11-24 | Method and system for predicting amino acid sequences compatible with a specified three dimensional structure |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2002014875A2 true WO2002014875A2 (fr) | 2002-02-21 |
WO2002014875A3 WO2002014875A3 (fr) | 2003-02-27 |
Family
ID=26323969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IL2001/000769 WO2002014875A2 (fr) | 2000-08-16 | 2001-08-16 | Procede et systeme pour la prediction de sequences d'acides amines compatibles avec une structure tridimensionnelle specifiee |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2001280077A1 (fr) |
WO (1) | WO2002014875A2 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112585684A (zh) * | 2018-09-21 | 2021-03-30 | 渊慧科技有限公司 | 确定蛋白结构的机器学习 |
CN114155912A (zh) * | 2022-02-09 | 2022-03-08 | 北京晶泰科技有限公司 | 蛋白质的序列设计方法、蛋白质的结构设计方法、装置及电子设备 |
CN118571306A (zh) * | 2024-08-01 | 2024-08-30 | 温州大学 | 一种取代基工程下基态结构预测方法 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5600571A (en) * | 1994-01-18 | 1997-02-04 | The Trustees Of Columbia University In The City Of New York | Method for determining protein tertiary structure |
WO1998047089A1 (fr) * | 1997-04-11 | 1998-10-22 | California Institute Of Technology | Apparatus and method for automated protein design |
WO2001037147A2 (fr) * | 1999-11-03 | 2001-05-25 | Algonomics Nv | Apparatus and method for structure-based prediction of amino acid sequences |
2001
- 2001-08-16 AU application AU2001280077A (patent AU2001280077A1, en): not active, Abandoned
- 2001-08-16 WO application PCT/IL2001/000769 (patent WO2002014875A2, fr): active, Application Filing
Non-Patent Citations (4)
Title |
---|
DAHIYAT B I ET AL: "De novo protein design: fully automated sequence selection" SCIENCE, AAAS. LANCASTER, PA, US, vol. 278, no. 5335, 3 October 1997 (1997-10-03), pages 82-87, XP002188179 ISSN: 0036-8075 * |
DAHIYAT B I ET AL: "PROTEIN DESIGN AUTOMATION" PROTEIN SCIENCE, CAMBRIDGE UNIVERSITY PRESS, CAMBRIDGE, GB, vol. 5, no. 5, 1 May 1996 (1996-05-01), pages 895-903, XP002073372 ISSN: 0961-8368 cited in the application * |
HERZYK PAWEL ET AL: "A reduced representation of proteins for use in restraint satisfaction calculations." PROTEINS STRUCTURE FUNCTION AND GENETICS, vol. 17, no. 3, 1993, pages 310-324, XP008008288 ISSN: 0887-3585 * |
RAHA KAUSHIK ET AL: "Prediction of amino acid sequence from structure." PROTEIN SCIENCE, vol. 9, no. 6, June 2000 (2000-06), pages 1106-1119, XP008008198 ISSN: 0961-8368 * |
Also Published As
Publication number | Publication date |
---|---|
WO2002014875A3 (fr) | 2003-02-27 |
AU2001280077A1 (en) | 2002-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Smith et al. | The relationship between the flexibility of proteins and their conformational states on forming protein–protein complexes with an application to protein–protein docking | |
Kono et al. | Statistical theory for protein combinatorial libraries. Packing interactions, backbone flexibility, and the sequence variability of a main-chain structure | |
Floudas et al. | Advances in protein structure prediction and de novo protein design: A review | |
Bordoli et al. | Protein structure homology modeling using SWISS-MODEL workspace | |
De Bakker et al. | Ab initio construction of polypeptide fragments: Accuracy of loop decoy discrimination by an all‐atom statistical potential and the AMBER force field with the Generalized Born solvation model | |
Koehl et al. | De novo protein design. I. In search of stability and specificity | |
US20070016380A1 (en) | Protein engineering | |
US20110245463A1 (en) | Apparatus and method for structure-based prediction of amino acid sequences | |
JP2002536301A (ja) | Protein modeling tool | |
Dumontier et al. | Armadillo: domain boundary prediction by amino acid composition | |
Saven | Designing protein energy landscapes | |
Nnyigide et al. | Protein repair and analysis server: A web server to repair PDB structures, add missing heavy atoms and hydrogen atoms, and assign secondary structures by amide interactions | |
Takada | Protein folding simulation with solvent‐induced force field: folding pathway ensemble of three‐helix‐bundle proteins | |
Zhao et al. | Protein–ligand docking with multiple flexible side chains | |
US8452542B2 (en) | Structure-sequence based analysis for identification of conserved regions in proteins | |
US7751987B1 (en) | Method and system for predicting amino acid sequences compatible with a specified three dimensional structure | |
Jaramillo et al. | Automatic procedures for protein design | |
WO2002014875A2 (fr) | Method and system for predicting amino acid sequences compatible with a specified three dimensional structure | |
Steipe | Protein design concepts | |
Mayewski | A multibody, whole‐residue potential for protein structures, with testing by Monte Carlo simulated annealing | |
Liou et al. | A hydrophobic spine stabilizes a surface-exposed α-helix according to analysis of the solvent-accessible surface area | |
Fernandez-Ballester et al. | Prediction of protein–protein interaction based on structure | |
US20070244651A1 (en) | Structure-Based Analysis For Identification Of Protein Signatures: CUSCORE | |
CA2537872A1 (fr) | Methods for establishing and analyzing the conformation of amino acid sequences | |
Schenk et al. | Protein sequence and structure alignments within one framework |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
| | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
| | 122 | EP: PCT application non-entry in European phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |