GenBank

Last updated
GenBank
Content
DescriptionNucleotide sequences for more than 300,000 organisms with supporting bibliographic and biological annotation.
Data types
captured
  • Nucleotide sequence
  • Protein sequence
Organisms All
Contact
Research center NCBI
Primary citation PMID   21071399
Release date1982;37 years ago (1982)
Access
Data format
Website NCBI
Download URL ncbi ftp
Web service URL
Tools
Web BLAST
Standalone BLAST
Miscellaneous
License Unclear [1]

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC).

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

Nucleotide biological molecules that form the building blocks of nucleic acids

Nucleotides are organic molecules that serve as the monomer units for forming the nucleic acid polymers deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules within all life-forms on Earth. Nucleotides are the building blocks of nucleic acids; they are composed of three sub unit molecules: a nitrogenous base, a five-carbon sugar, and at least one phosphate group.

Protein Biological molecule consisting of chains of amino acid residues

Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of their genes, and which usually results in protein folding into a specific three-dimensional structure that determines its activity.

Contents

GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate by doubling roughly every 18 months. [2] [3]

Organism Any individual living physical entity

In biology, an organism is any individual entity that propagates the properties of life. It is a synonym for "life form".

Walter Goad was a nuclear physicist at the Los Alamos National Laboratory. During the 1960s, Goad turned his attention from physics to biology and he is best known for his contributions to the founding of GenBank, the most widely used repository for DNA sequence data.

Exponential growth Growth of quantities at rate proportional to the current amount

Exponential growth is exhibited when the rate of change—the change per instant or unit of time—of the value of a mathematical function of time is proportional to the function's current value, resulting in its value at any time being an exponential function of time, i.e., a function in which the time value is the exponent. Exponential decay occurs in the same way when the growth rate is negative. In the case of a discrete domain of definition with equal intervals, it is also called geometric growth or geometric decay, the function values forming a geometric progression. In either exponential growth or exponential decay, the ratio of the rate of change of the quantity to its current size remains constant over time.

Release 194, produced in February 2013, contained over 150 billion nucleotide bases in more than 162 million sequences. [4] GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers.

Submissions

Only original sequences can be submitted to GenBank. Direct submissions are made to GenBank using BankIt, which is a Web-based form, or the stand-alone submission program, Sequin. Upon receipt of a sequence submission, the GenBank staff examines the originality of the data and assigns an accession number to the sequence and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often submitted by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.

An accession number in bioinformatics is a unique identifier given to a DNA or protein sequence record to allow for tracking of different versions of that sequence record and the associated sequence over time in a single data repository. Because of its relative stability, accession numbers can be utilized as foreign keys for referring to a sequence object, but not necessarily to a unique sequence. All sequence information repositories implement the concept of "accession number" but might do so with subtle variations.

Entrez cross-database search engine, or web portal

The Entrez Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" was chosen to reflect the spirit of welcoming the public to search the content available from the NLM.

The File Transfer Protocol (FTP) is a standard network protocol used for the transfer of computer files between a client and server on a computer network.

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank. [5] Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.

National Institutes of Health Medical research organization in the United States

The National Institutes of Health (NIH) is the primary agency of the United States government responsible for biomedical and public health research. It was founded in the late 1870s and is now part of the United States Department of Health and Human Services. The majority of NIH facilities are located in Bethesda, Maryland. The NIH conducts its own scientific research through its Intramural Research Program (IRP) and provides major biomedical research funding to non-NIH research facilities through its Extramural Research Program.

In the mid 1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL. [6] As one of the earliest bioinformatics community projects on the Internet, the GenBank project started BIOSCI/Bionet news groups for promoting open access communications among bioscientists. During 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information. [7]

Stanford University Private research university in Stanford, California

Leland Stanford Junior University is a private research university in Stanford, California. Stanford is known for its academic strength, selectivity, wealth, proximity to Silicon Valley, and ranking as one of the world's top universities. Often ranking first among all universities both domestically and internationally has led Stanford to be known as America's "dream college".

Bioinformatics Software tools for understanding biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

BIOSCI, also known as Bionet, is a set of electronic communication forum used by life scientists around the world. It includes the Bionet Usenet newsgroups and parallel e-mail lists, with public archives since 1992 at www.bio.net. BIOSCI/Bionet provides public, open access biology news and discussion for areas such as molecular biology methods and reagents, bioinformatics software and computational biology, toxicology, and several organism communities including yeast, C.elegans and annelida (worms), the plant arabidopsis, fruitfly, maize (corn), and others.

Genbank and EMBL: NucleotideSequences 1986/1987 Volumes I to VII. NucleotideSequences 86 87.jpeg
Genbank and EMBL: NucleotideSequences 1986/1987 Volumes I to VII.
CDRom of Genbank v100 Genbank100CD.jpg
CDRom of Genbank v100

Growth

Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale Growth of Genbank.svg
Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale

The GenBank release notes for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months". [4] [8] As of 15 June 2019, GenBank release 232.0 has 213,383,758 loci, 329,835,282,370 bases, from 213,383,758 reported sequences. [4]

The GenBank database includes additional data sets that are constructed mechanically from the main sequence data collection, and therefore are excluded from this count.

Top organisms in GenBank (Release 191) [9]
Organism base pairs
Homo sapiens1.6310774187×10^10
Mus musculus9.974977889×10^9
Rattus norvegicus6.521253272×10^9
Bos taurus5.386258455×10^9
Zea mays5.062731057×10^9
Sus scrofa4.88786186×10^9
Danio rerio3.120857462×10^9
Strongylocentrotus purpuratus1.435236534×10^9
Macaca mulatta1.256203101×10^9
Oryza sativa Japonica Group1.255686573×10^9
Nicotiana tabacum1.197357811×10^9
Xenopus (Silurana) tropicalis1.249938611×10^9
Drosophila melanogaster1.11996522×10^9
Pan troglodytes1.008323292×10^9
Arabidopsis thaliana1.144226616×10^9
Canis lupus familiaris951,238,343
Vitis vinifera999,010,073
Gallus gallus899,631,338
Glycine max906,638,854
Triticum aestivum898,689,329

Incomplete identifications

Public databases which may be searched using the National Center for Biotechnology Information Basic Local Alignment Search Tool (NCBI BLAST), lack peer-reviewed sequences of type strains and sequences of non-type strains. On the other hand, while commercial databases potentially contain high-quality filtered sequence data, there are a limited number of reference sequences.

A paper released in the Journal of Clinical Microbiology [10] evaluated the 16S rRNA gene sequencing results analyzed with GenBank in conjunction with other freely available, quality-controlled, web-based public databases, such as the EzTaxon-e (http://eztaxon-e.ezbiocloud.net/) and the BIBI (http://pbil.univ-lyon1.fr/bibi/) databases. The results showed that analyses performed using GenBank combined with EzTaxon-e (kappa = 0.79) were more discriminative than using GenBank (kappa = 0.66) or other databases alone.

See also

Related Research Articles

National Center for Biotechnology Information database arm of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

Biological database database of biological information

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases.

The International Nucleotide Sequence Database Collaboration consists of a joint effort to collect and disseminate databases containing DNA and RNA sequences. It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the European Nucleotide Archive (UK). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each the collaborating organizations.

Amos Bairoch Swiss bioinformatician

Amos Bairoch is a Swiss bioinformatician and Professor of Bioinformatics at the Department of Human Protein Sciences of the University of Geneva where he leads the CALIPHO group at the Swiss Institute of Bioinformatics (SIB) combining bioinformatics, curation, and experimental efforts to functionally characterize human proteins.

David J. Lipman American biologist

David J. Lipman is an American biologist who since 1989 to 2017 had been the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Sequence Database Consortium, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics. In May 2017, it was announced that he would be leaving NCBI and would be taking the position of Chief Science Officer at Impossible Foods.

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration or INSDC. It exchanges its data with European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. Thus these three databanks contain the same data at any given time.

MicrobesOnline

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. This database is built by National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

PDBsum is a database that provides an overview of the contents of each 3D macromolecular structure deposited in the Protein Data Bank. The original version of the database was developed around 1995 by Roman Laskowski and collaborators at University College London. As of 2014, PDBsum is maintained by Laskowski and collaborators in the laboratory of Janet Thornton at the European Bioinformatics Institute (EBI).

Sequence Read Archive

The Sequence Read Archive is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC), and run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

European Nucleotide Archive Online database from the EBI on Nucleotides

The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

The ascidian mitochondrial code is a genetic code found in the mitochondria of Ascidia.

Donna R Maglott is known for her sequencing of the mouse genome by understanding the human genome which is ultimately a great experimental tool for biomedical research.

In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.

VFDB also known as Virulence Factor Database is a database that provides scientist quick access to virulence factors in bacterial pathogens. It can be navigated and browsed using genus or words. A BLAST tool is provided for search against known virulence factors. VFDB contains a collection of 16 important bacterial pathogens. Perl scripts were used to extract positions and sequences of VF from GenBank. Clusters of Orthologous Groups (COG) was used to update incomplete annotations. More information was obtained by NCBI. VFDB was built on Linux operation systems on DELL PowerEdge 1600SC servers. VFDB can be accessed online at http://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi?func=VFanalyzer.

References

  1. The download page at UCSC says "NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank."
  2. Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Wheeler, D. L.; et al. (2008). "GenBank". Nucleic Acids Research. 36 (Database): D25–D30. doi:10.1093/nar/gkm929. PMC   2238942 . PMID   18073190.
  3. Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W.; et al. (2009). "GenBank". Nucleic Acids Research. 37 (Database): D26–D31. doi:10.1093/nar/gkn723. PMC   2686462 . PMID   18940867.
  4. 1 2 3 "GenBank release notes". NCBI.
  5. Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Newsbulletin: obituary. Los Alamos National Laboratory.
  6. LANL GenBank History
  7. Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC   330520 . PMID   2326192.
  8. Benson, D. A.; Cavanaugh, M.; Clark, K.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W. (2012). "GenBank". Nucleic Acids Research. 41 (Database issue): D36–D42. doi:10.1093/nar/gks1195. PMC   3531190 . PMID   23193287.
  9. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (January 2011). "GenBank". Nucleic Acids Res. 39 (Database issue): D32–37. doi:10.1093/nar/gkq1079. PMC   3013681 . PMID   21071399.
  10. Kyung Sun Parka, Chang-Seok Kia, Cheol-In Kangb, Yae-Jean Kimc, Doo Ryeon Chungb, Kyong Ran Peckb, Jae-Hoon Songb and Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC   3347139 . PMID   22403421.CS1 maint: Uses authors parameter (link)