International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration (INSDC) is a joint effort to collect and disseminate databases containing DNA and RNA sequences. [1] It involves the following computerized databases: NIG's DNA Data Bank of Japan (Japan), NCBI's GenBank (USA) and the EMBL-EBI's European Nucleotide Archive (UK). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each of the collaborating organizations.

All of the data in INSDC is available for free and unrestricted access, for any purpose, with no restrictions on analysis, redistribution, or re-publication of the data. This policy has been a foundational principle of the INSDC since its inception. [2] The official policy statement can be found at http://www.insdc.org/. [3] Since the 1990s, most of the world's major scientific journals have required that sequence data be deposited in an INSDC database as a pre-condition for publication.

The DDBJ/EMBL/GenBank synchronization is maintained according to a number of guidelines which are produced and published by an International Advisory Board. [4] The guidelines consist of a common definition of the feature tables [5] for the databases, which regulate the content and syntax of the database entries, [6] in the form of a common DTD (Document Type Definition).

The syntax is called INSDSeq. Its core consists of the letter sequence of the gene product (the amino acid sequence) and the letter sequence of the nucleotide bases in the gene or decoded segment. A DBFetch operation shows a typical INSD entry at the EBI database; [7] the same entry can also be viewed at NCBI. [8]
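As a rough illustration of how an INSDSeq entry pairs a nucleotide sequence with its amino acid translation, the following Python sketch parses a small XML record with the standard library. The record is hypothetical and heavily abbreviated, the element names are written as they are understood to appear in the INSDSeq DTD (an assumption to be checked against the published DTD), and the function name `summarize` is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated record for illustration only.
# Element names are assumed to match the INSDSeq DTD; real entries
# carry many more fields (accession, references, source feature, ...).
RECORD = """<INSDSet>
  <INSDSeq>
    <INSDSeq_locus>EXAMPLE01</INSDSeq_locus>
    <INSDSeq_length>12</INSDSeq_length>
    <INSDSeq_moltype>mRNA</INSDSeq_moltype>
    <INSDSeq_feature-table>
      <INSDFeature>
        <INSDFeature_key>CDS</INSDFeature_key>
        <INSDFeature_location>1..12</INSDFeature_location>
        <INSDFeature_quals>
          <INSDQualifier>
            <INSDQualifier_name>translation</INSDQualifier_name>
            <INSDQualifier_value>MKL</INSDQualifier_value>
          </INSDQualifier>
        </INSDFeature_quals>
      </INSDFeature>
    </INSDSeq_feature-table>
    <INSDSeq_sequence>atgaaactgtaa</INSDSeq_sequence>
  </INSDSeq>
</INSDSet>"""


def summarize(xml_text):
    """Return locus, nucleotide sequence and translations for each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for seq in root.iter("INSDSeq"):
        # Collect every "translation" qualifier found in the feature table.
        translations = [
            qual.findtext("INSDQualifier_value")
            for qual in seq.iter("INSDQualifier")
            if qual.findtext("INSDQualifier_name") == "translation"
        ]
        entries.append({
            "locus": seq.findtext("INSDSeq_locus"),
            "nucleotides": seq.findtext("INSDSeq_sequence"),
            "translations": translations,
        })
    return entries


if __name__ == "__main__":
    for entry in summarize(RECORD):
        print(entry["locus"], entry["nucleotides"], entry["translations"])
```

In this toy record the 12 bases form four codons (three amino acids followed by a stop codon), which is why the translation is three letters long.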

Related Research Articles

National Center for Biotechnology Information: Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of digitally stored nucleic acid sequences, protein sequences, or other polymer sequences. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and was growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and were instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases. EST approaches have largely been superseded by whole genome and transcriptome sequencing and metagenome sequencing.

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information as part of the International Nucleotide Sequence Database Collaboration (INSDC).

UniProt: Database of protein sequences and functional information

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration (INSDC). It exchanges its data with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. As a result, the three databanks are kept in step with one another.

UniGene was an NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus. Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.

Takashi Gojobori: Japanese molecular biologist

Takashi Gojobori is a Japanese molecular biologist, Vice-Director of the National Institute of Genetics (NIG) and the DNA Data Bank of Japan (DDBJ) at NIG, in Mishima, Japan. Gojobori is a Distinguished Professor at King Abdullah University of Science and Technology (KAUST) in Thuwal, Saudi Arabia. He is a Professor of Bioscience and Acting Director at the Computational Bioscience Research Center at KAUST.

Genomic Standards Consortium

The Genomic Standards Consortium (GSC) is an initiative working towards richer descriptions of our collection of genomes, metagenomes and marker genes. Established in September 2005, this international community includes representatives from a range of major sequencing and bioinformatics centres and research institutions. The goal of the GSC is to promote mechanisms for standardizing the description of (meta)genomes, including the exchange and integration of (meta)genomic data. The number and pace of genomic and metagenomic sequencing projects will only increase as the use of ultra-high-throughput methods becomes commonplace, and standards are vital to scientific progress and data sharing.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was first introduced in 2000. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

The Consensus Coding Sequence (CCDS) Project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies. The CCDS project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier, and ensures that they are consistently represented by the National Center for Biotechnology Information (NCBI), Ensembl, and UCSC Genome Browser. The integrity of the CCDS dataset is maintained through stringent quality assurance testing and on-going manual curation.

KIAA0895: Protein-coding gene in the species Homo sapiens

KIAA0895 is a protein that in Homo sapiens is encoded by the KIAA0895 gene. The gene encodes a protein commonly known as the KIAA0895 protein. Its aliases include hypothetical protein LOC23366, OTTHUMP00000206979, OTTHUMP00000206980, 9530077C05Rik, and 1110003N12Rik. It is located at 7p14.2.

Sequence Read Archive

The Sequence Read Archive is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC), and is run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

European Nucleotide Archive: Online database from the EBI on nucleotides

The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

The genomic epidemiological database for global identification of microorganisms, or global microbial identifier, is a platform for storing whole genome sequencing data of microorganisms, for the identification of relevant genes and for the comparison of genomes to detect, track and trace infectious disease outbreaks and emerging pathogens. The database holds two linked types of information: 1) genomic information of microorganisms, and 2) metadata of those microorganisms, such as epidemiological details. The database includes all genera of microorganisms: bacteria, viruses, parasites and fungi.

SMIM23 or Small Integral Membrane Protein 23 is a protein which in humans is encoded by the SMIM23 or c5orf50 gene. The longer mRNA isoform is 519 nucleotides, which translates to 172 amino acids of a protein. Researchers have suggested that this gene, along with a few others, could play a role in how facial morphology arises in humans.

RING Finger Protein 227, also known as RNF227 and LINC02581, is a protein which in humans is encoded by the RNF227 gene. According to DNA microarray data, it is found in at least 15 tissues.

bMERB domain containing 1 is a gene expressed in humans which has broad expression across the brain. This gene codes for bMERB1 domain-containing protein 1 isoform 1. It is predicted that this gene is involved in actin cytoskeleton regulation, microtubule regulation and glial cell migration.

References

  1. Karsch-Mizrachi, I.; Nakamura, Y.; Cochrane, G.; International Nucleotide Sequence Database Collaboration (2011). "The International Nucleotide Sequence Database Collaboration". Nucleic Acids Research. 40 (Database issue): D33–D37. doi:10.1093/nar/gkr1006. PMC 3244996. PMID 22080546.
  2. Brunak, Soren; Danchin, Antoine; Hattori, Masahira; Nakamura, Haruki; Shinozaki, Kazuo; Matise, Tara; Preuss, Daphne (15 November 2002). "Nucleotide sequence database policies". Science. 298 (5597): 1333. doi:10.1126/science.298.5597.1333b. ISSN 1095-9203. PMID 12436968. S2CID 42740562.
  3. "insdc.org".
  4. "INSDC :: Advisors". Archived from the original on 2007-12-09. Retrieved 2019-06-29.
  5. "The DDBJ/ENA/GenBank Feature Table Definition". Ebi.ac.uk. Archived from the original on 2005-03-24. Retrieved 2019-06-29.
  6. "European Nucleotide Archive < EMBL-EBI". www.ebi.ac.uk.
  7. "Database Browsing". Archived from the original on 2005-02-12. Retrieved 2005-03-02.
  8. USA (2019-05-06). "Trifolium repens mRNA for non-cyanogenic beta-glucosidase - Nucleotide - NCBI". Ncbi.nlm.nih.gov. Retrieved 2019-06-29.