Genome Research -- Web Site Links

WEB SITE LINKS FOR DATA SUBMISSION, APPROPRIATE NOMENCLATURE, AND ADDITIONAL RESOURCES

Genome Research requires that all data generated in a study described in a manuscript considered for publication be made available to the broader community in publicly held databases when available, and at the Genome Research Web site when not. The following list of public databases and resources serves as an introductory guide to data submission and appropriate nomenclature for authors contributing to Genome Research. However, this list should not be considered comprehensive. If there is an additional database or resource not listed here that would be of use to other authors, please contact denli{at}cshl.edu.

SEQUENCE DATA

All new sequence data should be submitted to and assigned an accession number(s) by an internationally recognized public database prior to publication. Reviewer links or tokens should be provided in the manuscript for data submissions that are private.

The NCBI Gene Expression Omnibus (GEO) is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Instructions for data submission.

The NCBI Sequence Read Archive (SRA) is a public high-throughput sequencing data repository. Instructions for data submission.

The DNA Data Bank of Japan (DDBJ) collects nucleotide sequence data as a member of the International Nucleotide Sequence Database Collaboration (INSDC) and provides freely available nucleotide sequence data and a supercomputer system to support research activities in the life sciences. Instructions for sequence data submission.

The EMBL-EBI ArrayExpress stores data from high-throughput functional genomics experiments. Instructions for data submission.

The EMBL-EBI European Nucleotide Archive (ENA) stores nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. Instructions for sequence data submission.

GenBank, the NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. Instructions for sequence data submission.

The CNGB Sequence Archive (CNSA) stores and manages global omics data for sharing. Instructions for sequence data submission.

miRBase collects microRNA (miRNA) data, containing all published miRNA sequences, genomic locations, and associated annotations. The miRBase Registry section provides a confidential service assigning official names for novel miRNA genes prior to publication of their discovery.

GENOTYPE/PHENOTYPE AND GENOMIC VARIATION DATA

Authors should review the structural variation data guidelines recommended by Mills et al. (2011) Nature 470(7332):59-65 and Alkan et al. (2011) Nature Reviews Genetics 12, 363-376. Please describe human gene variants using the nomenclature recommended by the Human Genome Variation Society.

The NCBI Database of Single Nucleotide Polymorphisms (dbSNP) includes data on small genetic variation such as single nucleotide polymorphisms (SNPs), small-scale insertion/deletions, polymorphic repetitive elements, and microsatellite variation in humans. Instructions for SNP data submission.

The EMBL-EBI European Variation Archive (EVA) is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and contains open-access genetic variation data from all species. Instructions for data submission.

The NCBI Database of Genotype and Phenotype (dbGaP) archives and distributes the results of studies investigating the interaction of genotype and phenotype, including genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. Instructions for data submission.

The EMBL-EBI European Genome-phenome Archive (EGA) is a repository for genotype experiments, including case-control, population, and family studies. SNP and CNV genotypes from array-based methods and genotyping done with re-sequencing methods are accepted. Data may be either publicly available or require authorized access, depending on the study design. Instructions for data submission.

The NCBI Database of Genomic Structural Variation (dbVar) is a widely available public archive of large variants (>50 bp) such as CNVs, insertions, deletions, duplications, deletion-insertions, inversions, mobile element events, translocations, and complex variants in humans. Instructions for data submission.

The Database of Genomic Variants (DGV) provides a comprehensive summary of structural variation in the human genome and serves as a catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The DGV presents detailed information on a few selected studies, while databases such as DGVa and dbVar provide a comprehensive archive of publicly available structural variation data.

PROTEOMICS AND MOLECULAR INTERACTIONS

The International Molecular Exchange Consortium (IMEx), a group of major public interaction data providers, has established standards for the collection and curation of molecular interaction data. The IMEx site provides instructions for submitting interaction data to any of the partner databases (DIP, IntAct, MINT, MatrixDB, I2D, InnateDB).

The PRIDE Archive is a centralized, standards-compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications, and supporting spectral evidence.

The Database of Interacting Proteins (DIP) catalogs experimentally determined protein interactions from a variety of sources to create a single set of protein-protein interactions.

IntAct is an open-source database system and analysis tool for freely available protein interaction data derived from literature curation or direct user submissions.

The Protein Data Bank (PDB) contains information about experimentally-determined structures of proteins, nucleic acids, and complex assemblies, curating and annotating data according to community standards.

GENE AND GENE PRODUCT NOMENCLATURE

Nomenclature for genes and proteins should be in the appropriate format (including appropriate italics and/or capitalization as it applies for each organism's standard nomenclature format) in text and figures, and where available, submitted and approved by the appropriate nomenclature committees. Specific nomenclature guidelines for commonly studied organisms are listed below.

Human nomenclature guidelines from the Human Genome Organisation (HUGO) Gene Nomenclature Committee. Search for current and approved gene names/symbols.

Mouse nomenclature guidelines from the Mouse Genomic Nomenclature Committee (MGNC). Search for current and approved gene names/symbols.

Rat nomenclature guidelines from the Rat Genome Nomenclature Committee (RGNC). Search for current and approved gene names/symbols.

Avian nomenclature guidelines from the Chicken Gene Nomenclature Consortium (CGNC). Search for current and approved gene names/symbols.

Zebrafish nomenclature guidelines from the Zebrafish Nomenclature Committee (ZNC). Search for current and approved gene names/symbols.

Vertebrate nomenclature guidelines from the Vertebrate Gene Nomenclature Committee (VGNC). Search for current and approved gene names/symbols.

Drosophila nomenclature guidelines adopted by FlyBase. Search for current and approved gene names/symbols.

Arabidopsis nomenclature guidelines adopted by The Arabidopsis Information Resource (TAIR). Search for current and approved gene names/symbols.

C. elegans nomenclature guidelines from WormBase and the Caenorhabditis Genetics Center (CGC). Search for current and approved gene names/symbols.

Xenopus nomenclature guidelines from Xenbase. Search for current and approved gene names/symbols.

S. cerevisiae nomenclature guidelines adopted by the Saccharomyces Genome Database (SGD). Search for current and approved gene names/symbols.

Bacterial nomenclature should follow the guidelines established by Demerec et al. (1966) Genetics 54:61-76.

ADDITIONAL RESOURCES

The ENCyclopedia Of DNA Elements (ENCODE) project aims to identify all functional elements in the sequence of the human genome. The recently completed pilot project phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence.

The modENCODE Project will attempt to identify all of the sequence-based functional elements in the Caenorhabditis elegans and Drosophila melanogaster genomes. modENCODE is operated as a Research Network and data is publicly available, with some restrictions on its use for nine months following publication.

The 1000 Genomes Project aims to find most genetic variants that have frequencies of at least 1% in the populations studied. Data from the 1000 Genomes Project will be made available rapidly to the scientific community through freely accessible public databases.

The International HapMap Project developed a haplotype map resource to describe the common patterns of human DNA sequence variation to help researchers find genes associated with human disease and response to pharmaceuticals. The HapMap site has been taken down, but archived data will continue to be available via the HapMap FTP site.

The Genotype-Tissue Expression (GTEx) Program of NIH Common Fund provides a data resource and tissue bank to study the relationship between genetic variation and gene expression in multiple human tissues.

The Gene Ontology (GO) project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

The Cancer Genome Atlas (TCGA) is a comprehensive effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of biological systems, consisting of genes and proteins, endogenous and exogenous chemicals, interaction and reaction networks, and hierarchies and relationships of various biological objects.

The Human Protein Reference Database (HPRD) is a centralized platform to depict and integrate information manually extracted from the literature regarding domain architecture, post-translational modifications, interaction networks, and disease association for each protein in the human proteome.

The Reactome project is a curated resource of core pathways and reactions in human biology, as well as electronically inferred orthologous events in 22 non-human species including mouse, rat, chicken, puffer fish, C. elegans, Drosophila, yeast, two plants, and E. coli.

The Clusters of Orthologous Groups (COGs) resource was constructed by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Online Mendelian Inheritance in Man (OMIM), a phenotypic companion to the human genome project, is a catalog of human genes and genetic disorders, focusing primarily on heritable genetic diseases.

Repbase Update (RU) is a database of prototypic sequences representing repetitive DNA from a number of eukaryotic species, with instructions for the submission of sequence data.

Pseudogene.org provides a comprehensive database of identified pseudogenes, utilities to identify pseudogenes, various publication data sets, and a pseudogene knowledgebase.

The H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts, containing curated annotations of human genes and transcripts that include gene structures, alternative splicing isoforms, non-coding functional RNAs, genetic polymorphisms (SNPs, indels and microsatellite repeats), relation with diseases, gene expression profiling, molecular evolutionary features, protein-protein interactions (PPIs) and gene families/groups.


