National Center for Biotechnology Information

Abbreviation: NCBI
Founded: 1988
Headquarters: Bethesda, Maryland, U.S.
Coordinates: 38°59′45″N 77°05′56″W (38.995872°N 77.098811°W)
Website: www.ncbi.nlm.nih.gov

The National Center for Biotechnology Information (NCBI) [1] [2] is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

United States National Library of Medicine World's largest medical library

The United States National Library of Medicine (NLM), operated by the United States federal government, is the world's largest medical library.

National Institutes of Health Medical research organization in the United States

The National Institutes of Health (NIH) is the primary agency of the United States government responsible for biomedical and public health research. Its origins date to 1887, and it is now part of the United States Department of Health and Human Services. The majority of NIH facilities are located in Bethesda, Maryland. The NIH conducts its own scientific research through its Intramural Research Program (IRP) and provides major biomedical research funding to non-NIH research facilities through its Extramural Research Program.

Bethesda, Maryland Census-designated place in Maryland, United States

Bethesda is an unincorporated, census-designated place in southern Montgomery County, Maryland, United States, located just northwest of the U.S. capital of Washington, D.C. It takes its name from a local church, the Bethesda Meeting House, which in turn took its name from Jerusalem's Pool of Bethesda. The National Institutes of Health main campus and the Walter Reed National Military Medical Center are in Bethesda, as are a number of corporate and government headquarters.

The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine. NCBI was directed by David Lipman, [2] one of the original authors of the BLAST sequence alignment program [3] and a widely respected figure in bioinformatics. He also led an intramural research program, including groups led by Stephen Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa Przytycka, and Zhiyong Lu. David Lipman stood down from his post in May 2017. [4]

Biotechnology Use of living systems and organisms to develop or make useful products

Biotechnology is the broad area of biology involving living systems and organisms to develop or make products, or "any technological application that uses biological systems, living organisms, or derivatives thereof, to make or modify products or processes for specific use". Depending on the tools and applications, it often overlaps with the (related) fields of molecular biology, bio-engineering, biomedical engineering, biomanufacturing, molecular engineering, etc.

Biomedicine is a branch of medical science that applies biological and physiological principles to clinical practice. It also relates to many other health and biology-related fields, and it has been the dominant health system for more than a century.

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information as part of the International Nucleotide Sequence Database Collaboration (INSDC).

GenBank

NCBI has had responsibility for making available the GenBank DNA sequence database since 1992. [5] GenBank coordinates with individual laboratories and other sequence databases such as those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). [5]

DNA Molecule that encodes the genetic instructions used in the development and functioning of all known organisms and many viruses

Deoxyribonucleic acid is a molecule composed of two chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. DNA and ribonucleic acid (RNA) are nucleic acids; alongside proteins, lipids and complex carbohydrates (polysaccharides), nucleic acids are one of the four major types of macromolecules that are essential for all known forms of life.

European Molecular Biology Laboratory molecular biology research institution

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 25 member states, four prospect member states and two associate member states. EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. The list of independent groups at EMBL or any other research institute in Molecular Biology in Europe can be found at http://www.europe-bio.com. The Laboratory operates from six sites: the main laboratory in Heidelberg (Germany), and outstations in Hinxton (United Kingdom), Grenoble (France), Hamburg (Germany), Monterotondo (Italy) and Barcelona (Spain). EMBL groups and laboratories perform basic research in molecular biology and molecular medicine and provide training for scientists, students and visitors. The organization aids in the development of services, new instruments and methods, and technology in its member states. Israel is the only full member state located outside Europe.

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration or INSDC. It exchanges its data with European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. Thus these three databanks contain the same data at any given time.

Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI assigns a unique identifier (taxonomy ID number) to each species of organism. [6]

Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship. As of 28 June 2019, approximately 9,000 of the over 25,000 entries in OMIM represented phenotypes; the rest represented genes, many of which were related to known phenotypes.

dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

Single-nucleotide polymorphism single nucleotide position in genomic DNA at which different sequence alternatives exist

A single-nucleotide polymorphism is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.
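The definition above can be illustrated with a small sketch that counts the nucleotides observed at one position across a sample of aligned sequences. The sequences are made up, the function names are hypothetical, and the 1% cutoff used for "appreciable degree" is an assumed convention rather than something stated in this article.

```python
from collections import Counter
from typing import Dict, List


def allele_frequencies(sequences: List[str], position: int) -> Dict[str, float]:
    """Return the frequency of each nucleotide at a 0-based position."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def is_polymorphic(sequences: List[str], position: int,
                   min_freq: float = 0.01) -> bool:
    """True if at least two alleles reach min_freq (assumed 1% cutoff)."""
    freqs = allele_frequencies(sequences, position)
    common = [base for base, f in freqs.items() if f >= min_freq]
    return len(common) >= 2


# Hypothetical aligned sample: only position 2 varies (G in three, C in one).
sample = ["ACGTA", "ACGTA", "ACCTA", "ACGTA"]
freqs = allele_frequencies(sample, 2)
print(freqs)
print(is_polymorphic(sample, 2), is_polymorphic(sample, 0))
```

A real analysis would work from aligned reads or genotype calls and far larger samples; this only shows the counting step.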

The NCBI provides software tools that are available through the World Wide Web or by FTP. For example, BLAST is a sequence similarity searching program. BLAST can do sequence comparisons against the GenBank DNA database in less than 15 seconds.

World Wide Web System of interlinked hypertext documents accessed over the Internet

The World Wide Web (WWW), commonly known as the Web, is an information system where documents and other web resources are identified by Uniform Resource Locators, which may be interlinked by hypertext, and are accessible over the Internet. The resources of the WWW may be accessed by users by a software application called a web browser.

NCBI Bookshelf

The NCBI Bookshelf [7] is a collection of freely accessible, downloadable, online versions of selected biomedical books. The Bookshelf covers a wide range of topics including molecular biology, biochemistry, cell biology, genetics, microbiology, disease states from a molecular and cellular point of view, research methods, and virology. Some of the books are online versions of previously published books, while others, such as Coffee Break, are written and edited by NCBI staff. The Bookshelf complements the Entrez PubMed repository of peer-reviewed publication abstracts: its contents provide established perspectives on evolving areas of study and a context in which many disparate pieces of reported research can be organized.

Molecular biology Branch of biology dealing with biological activity's molecular basis

Molecular biology is a branch of biology that concerns the molecular basis of biological activity between biomolecules in the various systems of a cell, including the interactions between DNA, RNA, proteins and their biosynthesis, as well as the regulation of these interactions.

Biochemistry study of chemical processes in living organisms

Biochemistry, sometimes called biological chemistry, is the study of chemical processes within and relating to living organisms. Biochemical processes give rise to the complexity of life.

Cell biology Scientific Discipline that Studies Cells

Cell biology is a branch of biology that studies the structure and function of the cell, which is the basic unit of life. Cell biology is concerned with the physiological properties, metabolic processes, signaling pathways, life cycle, chemical composition, and interactions of the cell with its environment. This is done on both a microscopic and a molecular level, and it encompasses prokaryotic and eukaryotic cells. Knowing the components of cells and how cells work is fundamental to all biological sciences; it is also essential for research in biomedical fields such as cancer and other diseases. Research in cell biology is closely related to genetics, biochemistry, molecular biology, immunology, and cytochemistry.

Basic Local Alignment Search Tool (BLAST)

BLAST is an algorithm used for calculating sequence similarity between biological sequences such as nucleotide sequences of DNA and amino acid sequences of proteins. [8] BLAST is a powerful tool for finding sequences similar to the query sequence within the same organism or in different organisms. It searches for the query sequence in NCBI databases and servers and posts the results back to the user's browser in the chosen format. Input sequences to BLAST are mostly in FASTA or GenBank format, while output can be delivered in a variety of formats such as HTML, XML and plain text. HTML is the default output format for NCBI's web page. Results for NCBI-BLAST are presented in a graphical format showing all the hits found, a table of sequence identifiers for the hits with their scoring data, and the alignments between the sequence of interest and the hits, along with the corresponding BLAST scores. [9]
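Since FASTA is named above as a common input format, the following minimal sketch shows how such input might be read into memory. It assumes the usual layout of a ">" header line followed by one or more sequence lines; the function name and the example records are hypothetical and are not part of any NCBI tool.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; drop the ">" marker from the header.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
        # Any sequence text before the first header is ignored.
    return records


# Made-up example input with one wrapped record and one short record.
example = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
print(records)
```

Production code would more likely use an established parser, for example from a bioinformatics library, which also handles GenBank-format records.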

Entrez

The Entrez Global Query Cross-Database Search System is used at NCBI for all the major databases such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy, Complete Genomes, OMIM, and several others. [10] Entrez is both an indexing and a retrieval system, holding data from various sources for biomedical research. NCBI distributed the first version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank; protein sequences from SWISS-PROT, translated GenBank, PIR, PRF and PDB; and associated abstracts and citations from MEDLINE (now PubMed). Entrez is specially designed to integrate data from several different sources, databases and formats into a uniform information model and retrieval system that can efficiently retrieve the relevant references, sequences and structures. [11]
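Entrez can also be queried programmatically through NCBI's E-utilities web interface, which this article does not otherwise describe. The sketch below only builds a search URL for the ESearch endpoint and sends no request; the helper name `esearch_url` and the example query are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service (not mentioned in the article).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for an Entrez database and a query term."""
    params = {
        "db": db,            # Entrez database name, e.g. "pubmed"
        "term": term,        # query text in Entrez search syntax
        "retmax": retmax,    # maximum number of record IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: PubMed records with "BLAST" in the title.
url = esearch_url("pubmed", "BLAST[Title]", retmax=5)
print(url)
```

Fetching the URL with any HTTP client would return the matching record identifiers; NCBI's usage guidelines for request rates and API keys apply to real use.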

Gene

Gene has been implemented at NCBI to characterize and organize information about genes. It serves as a major node in the nexus of genomic map, expression, sequence, protein function, structure and homology data. A unique GeneID is assigned to each gene record and can be followed through revision cycles. Gene records for known or predicted genes are established here and are demarcated by map positions or nucleotide sequence. Gene has several advantages over its predecessor, LocusLink, including better integration with other NCBI databases, broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez system. [12]

Protein

The Protein database maintains text records for individual protein sequences, derived from many different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank, PDB and UniProtKB/Swiss-Prot. Protein records are available in different formats, including FASTA and XML, and are linked to other NCBI resources. Protein provides users with related data such as genes, DNA/RNA sequences, biological pathways, expression and variation data, and literature. It also provides precomputed sets of similar and identical proteins for each sequence, as computed by BLAST. The Structure database of NCBI contains 3D coordinate sets for experimentally determined structures in PDB that are imported by NCBI. The Conserved Domain Database (CDD) contains sequence profiles that characterize highly conserved domains within protein sequences. It also has records from external resources such as SMART and Pfam. A further protein resource, the Protein Clusters database, contains sets of protein sequences that are clustered according to the maximum alignments between the individual sequences as calculated by BLAST. [13]

PubChem database

The PubChem database of NCBI is a public resource for molecules and their activities against biological assays. PubChem is searchable and accessible through the Entrez information retrieval system. [14]

Implications of low-price DNA sequencing

In 2008 The New York Times wrote "The cost of determining a person’s complete genetic blueprint is about to plummet again — to $5,000." and added that the long-term goal was "the $1,000 genome." [15] Beneficiaries of these falling costs include consumer genetic-testing companies such as AncestryDNA and 23andMe (founded in 2006), along with the people who have used their services.

Related Research Articles

In bioinformatics, BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
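The idea of comparing a query against a library and keeping the sequences that score above a threshold can be sketched with a small exhaustive local alignment. This sketch uses Smith-Waterman dynamic programming, not BLAST's faster heuristic, and the scoring values, threshold, function names and sequences are all arbitrary assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query: str, library: Dict[str, str],
           threshold: int) -> List[Tuple[str, int]]:
    """Return (name, score) hits at or above threshold, best first."""
    scored = [(name, local_alignment_score(query, seq))
              for name, seq in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])


# Made-up library: one full match, one partial match, one unrelated sequence.
library = {
    "exact": "TTACGTACGTTT",
    "partial": "GGACGTGG",
    "unrelated": "CCCCCCCC",
}
hits = search("ACGTACGT", library, threshold=8)
print(hits)
```

The exhaustive approach scales with the product of the sequence lengths, which is why database search tools rely on heuristics to avoid scoring every library sequence in full.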

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

Entrez cross-database search engine, or web portal

The Entrez Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" was chosen to reflect the spirit of welcoming the public to search the content available from the NLM.

A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or ‘keyword’ and search one or more databases for information related to that sequence. Summaries and aggregate results are provided in standardized format describing the information that would otherwise have required visits to many smaller sites or direct literature searches to compile. Many sequence profiling tools are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases. The access to these kinds of tools is either web based or locally downloadable executables.

The International Nucleotide Sequence Database Collaboration (INSDC) is a joint effort to collect and disseminate databases containing DNA and RNA sequences. It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the European Nucleotide Archive (UK). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each of the collaborating organizations.

David J. Lipman American biologist

David J. Lipman is an American biologist who from 1989 to 2017 was the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Nucleotide Sequence Database Collaboration, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics. In May 2017, it was announced that he would be leaving NCBI to take the position of Chief Science Officer at Impossible Foods.

formatdb is a discontinued software tool that was used in molecular bioinformatics to format protein or nucleotide databases for BLAST. It has been replaced by makeblastdb and the NCBI "strongly encourage[s]" users to stop using formatdb.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments.

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.

DGLUCY protein-coding gene in the species Homo sapiens

DGLUCY is a protein that in humans is encoded by the DGLUCY gene.

C16orf42 protein-coding gene in the species Homo sapiens

C16orf42, or chromosome 16 open reading frame 42, is a hypothetical human protein found on chromosome 16. Its protein is 312 amino acids long, and its cDNA has 1214 base pairs.

PATRIC is the Bacterial Bioinformatics Resource Center, an information system designed to support the biomedical research community’s work on bacterial infectious diseases via integration of vital pathogen information with rich data and analysis tools. PATRIC curates and narrows the available bacterial phylogenomic data from numerous sources specifically for the bacterial research community, in order to save biologists time and effort when conducting comparative analyses. The freely available PATRIC platform provides an interface for biologists to discover data and information and to conduct comparative genomics and other analyses in one place. PATRIC, a project of Virginia Tech’s Cyberinfrastructure Division, is funded by the National Institute of Allergy and Infectious Diseases (NIAID), a component of the National Institutes of Health (NIH).

Chromosome 16 open reading frame 13 protein-coding gene in the species Homo sapiens

Chromosome 16 open reading frame 13, also called C16orf13, is a protein-coding gene of unknown function, also known as JFP2. Though the function of this gene is unknown, various data have revealed that it is expressed at high levels in various cancerous tissues. Underexpression of this gene has also been linked to disease consequences in humans.

LOC105377021 is a protein which in humans is encoded by the LOC105377021 gene. LOC105377021 exhibits expressional pathology related to breast cancer, specifically triple negative breast cancer. LOC105377021 contains a serine rich region in addition to predicted alpha helix motifs.

C17orf53 protein-coding gene in the species Homo sapiens

C17orf53 is a gene in humans that encodes a protein known as uncharacterized protein C17orf53. It has been shown to target the nucleus, with minor localization in the cytoplasm. Based on current findings, C17orf53 is predicted to perform transport functions; however, further research into the protein could provide more specific evidence regarding its function.

Transmembrane protein 44 mammalian protein found in Homo sapiens

Transmembrane protein 44 is a protein that in humans is encoded by the TMEM44 gene.

C19orf44 (gene) protein-coding gene in the species Homo sapiens

Chromosome 19 open reading frame 44 is a protein that in humans is encoded by the C19orf44 gene. C19orf44 is an uncharacterized protein with an unknown function in humans. It is not limited to humans: the protein also exists in other species. The protein contains one domain of unknown function (DUF) that is highly conserved throughout its orthologs. This protein is most highly expressed in the testis and ovary, but also has significant expression in the thyroid and parathyroid. Other names for this protein include LOC84167.

References

  1. "The Human Genome Project". The New York Times.
  2. "Research Institute Posts Gene Data on Internet". The New York Times. June 26, 1997.
  3. "Sense from Sequences: Stephen F. Altschul on Bettering BLAST". 2000.
  4. "National Library of Medicine Announces Departure of NCBI Director Dr. David Lipman". www.nlm.nih.gov. Retrieved 2017-05-06.
  5. Mizrachi, Ilene (22 August 2007). "GenBank: The Nucleotide Sequence Database". National Center for Biotechnology Information (US), via www.ncbi.nlm.nih.gov.
  6. "Home - Taxonomy - NCBI". www.ncbi.nlm.nih.gov.
  7. "Home - Books - NCBI". Ncbi.nlm.nih.gov. 2019-05-06. Retrieved 2019-06-12.
  8. Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/s0022-2836(05)80360-2. PMID 2231712.
  9. Madden T. (2002). The NCBI handbook, 2nd edition, Chapter 16, The BLAST Sequence Analysis Tool.
  10. NCBI Resource Coordinators (2012). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research 41 (Database issue): D8–D20.
  11. Ostell J. (2002). The NCBI handbook, 2nd edition, Chapter 15, The Entrez Search and Retrieval System.
  12. Maglott D., Pruitt K. & Tatusova T. (2005). The NCBI handbook, 2nd edition, Chapter 19, Gene: A Directory of Genes.
  13. Sayers E. (2013). The NCBI handbook, 2nd edition, NCBI Protein Resources.
  14. Wang Y. & Bryant S. H. (2014). The NCBI handbook, 2nd edition, NCBI PubChem BioAssay Database.
  15. Catherine Hutchings (October 7, 2008). "Your DNA: What Can You Afford (Not) To Know?". The New York Times.