TIGRFAMs

TIGRFAMs is a database of protein families designed to support manual and automated genome annotation. [1] [2] [3] Each entry includes a multiple sequence alignment and hidden Markov model (HMM) built from the alignment. Sequences that score above the defined cutoffs of a given TIGRFAMs HMM are assigned to that protein family and may be assigned the corresponding annotations. Most models describe protein families found in Bacteria and Archaea.
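
The assignment rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the TIGRFAMs or HMMER implementation: the `Model` and `Hit` types, the accession `MODEL_0001`, the sequence names, and all cutoff values are hypothetical. It assumes that each model carries a trusted cutoff and a noise cutoff expressed in bits, and that bit scores have already been computed by an HMM search tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    accession: str         # hypothetical model identifier
    trusted_cutoff: float  # bit score at or above which a hit is accepted
    noise_cutoff: float    # bit score below which a hit is treated as noise


@dataclass(frozen=True)
class Hit:
    sequence_id: str
    model: Model
    bit_score: float


def classify(hit: Hit) -> str:
    """Return 'member', 'uncertain', or 'noise' for a single hit."""
    if hit.bit_score >= hit.model.trusted_cutoff:
        return "member"
    if hit.bit_score >= hit.model.noise_cutoff:
        return "uncertain"
    return "noise"


def assign_families(hits: list[Hit]) -> dict[str, list[str]]:
    """Map each sequence ID to the models whose trusted cutoff it meets."""
    assignments: dict[str, list[str]] = {}
    for hit in hits:
        if classify(hit) == "member":
            assignments.setdefault(hit.sequence_id, []).append(hit.model.accession)
    return assignments


# Made-up example data, for illustration only.
example_model = Model("MODEL_0001", trusted_cutoff=250.0, noise_cutoff=120.0)
example_hits = [
    Hit("seqA", example_model, 312.4),
    Hit("seqB", example_model, 180.0),
    Hit("seqC", example_model, 45.7),
]
example_assignments = assign_families(example_hits)
```

In this sketch only `seqA` is assigned to the family; `seqB` falls between the two cutoffs and would typically be left for manual review. In practice, filtering of this kind is usually done during the search itself, for example with the `--cut_tc` option of HMMER's `hmmsearch`, which applies the trusted cutoff stored in each model.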

Like Pfam, TIGRFAMs uses the HMMER package written by Sean Eddy. [4]

History

TIGRFAMs was originally produced at The Institute for Genomic Research (TIGR) and its successor, the J. Craig Venter Institute (JCVI). In April 2018 it moved to the National Center for Biotechnology Information (NCBI). TIGRFAMs remains a member database of InterPro. The last version from JCVI, release 15.0, contained 4488 models. TIGRFAMs now continues at NCBI as part of a larger collection of HMMs, called NCBIFAMs, that is used in NCBI's RefSeq and PGAP genome annotation pipelines. [5] Curation and revision of existing TIGRFAMs models continue at NCBI, but no new models are created under the TIGRFAMs name: HMMs newly constructed by the RefSeq group receive different designations when added to NCBIFAMs.

Related Research Articles

In bioinformatics, a sequence database is a type of biological database composed of a large collection of nucleic acid sequences, protein sequences, or other polymer sequences stored in digital form. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and was growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information as part of the International Nucleotide Sequence Database Collaboration (INSDC).

UniProt

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.

The Protein Information Resource (PIR), located at Georgetown University Medical Center, is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies. It contains protein sequence databases.

Ensembl genome database project

The Ensembl genome database project is a scientific project at the European Bioinformatics Institute, launched in 1999 in response to the imminent completion of the Human Genome Project. Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of humans, other vertebrates, and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.

The European Bioinformatics Institute (EMBL-EBI) is an Intergovernmental Organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

Pfam

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 34.0, was released in March 2021 and contains 19,179 families.

InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

David J. Lipman

David J. Lipman is an American biologist who from 1989 to 2017 was the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Nucleotide Sequence Database Collaboration (INSDC), and of PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program and a respected figure in bioinformatics. In 2017, he left NCBI and became Chief Science Officer at Impossible Foods.

MicrobesOnline

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. The original GeneMark, developed in 1993, was used in 1995 as a primary gene prediction tool for annotating the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to predicting genes on both DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type. The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" in each of six possible reading frames or being "non-coding". The original GeneMark is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was first introduced in 2000. The database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

Pathema was one of the eight bioinformatics resource centers funded by the National Institute of Allergy and Infectious Diseases (NIAID), a component of the National Institutes of Health (NIH), which is an agency of the United States Department of Health and Human Services.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins that have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.

Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, plan experiments efficiently, combine their data with existing knowledge, and construct novel hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well-studied species. Where possible, MODs share common approaches to collect and represent biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements, however, mean that MODs are often required to customize the capture, display, and provision of data.

Donna R. Maglott is a staff scientist at the National Center for Biotechnology Information known for her research on large-scale genomics projects, including the mouse genome and development of databases required for genomics research.

References

  1. Haft, DH; Selengut, JD; White, O (2003). "The TIGRFAMs database of protein families". Nucleic Acids Research. 31 (1): 371–3. doi:10.1093/nar/gkg128. PMC 165575. PMID 12520025.
  2. Selengut, JD; Haft, DH; Davidsen, T; Ganapathy, A; Gwinn-Giglio, M; Nelson, WC; Richter, AR; White, O (2007). "TIGRFAMs and Genome Properties: Tools for the assignment of molecular function and biological process in prokaryotic genomes". Nucleic Acids Research. 35 (Database issue): D260–4. doi:10.1093/nar/gkl1043. PMC 1781115. PMID 17151080.
  3. Haft, DH; Selengut, JD; Richter, RA; Harkins, DM; Basu, MK; Beck, E (2012). "TIGRFAMs and Genome Properties in 2013". Nucleic Acids Research. 41 (Database issue): D387–95. doi:10.1093/nar/gks1234. PMC 3531188. PMID 23197656.
  4. Eddy, SR (2009). "A new generation of homology search tools based on probabilistic inference". Genome Informatics. International Conference on Genome Informatics. 23 (1): 205–11. PMID 20180275.
  5. Li, W; O'Neill, KR; Haft, DH; DiCuccio, M; Chetvernin, V; Badretdin, A; et al. (2021). "RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation". Nucleic Acids Research. 49 (D1): D1020–D1028. doi:10.1093/nar/gkaa1105. PMC 7779008. PMID 33270901.