Open protein structure annotation network

TOPSAN
Content
Description	Collaborative annotation environment for structural genomics
Contact
Research center	Sanford-Burnham Medical Research Institute
Laboratory	Joint Center for Structural Genomics
Authors	Dana Weekes
Primary citation	Weekes et al. (2010)
Release date	2010
Access
Website	http://www.topsan.org

Last updated December 04, 2023

The Open Protein Structure Annotation Network (TOPSAN) is a wiki designed to collect, share and distribute information about protein three-dimensional structures.^[1] The site runs on the MindTouch software.

Related Research Articles

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of a large number of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs.

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 36.0, was released in September 2023 and contains 20,795 families.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

In biochemistry, a hypothetical protein is a protein whose existence has been predicted, but for which there is a lack of experimental evidence that it is expressed in vivo. Sequencing of several genomes has resulted in numerous predicted open reading frames to which functions cannot be readily assigned. These proteins, either orphan or conserved hypothetical proteins, make up an estimated 20% to 40% of proteins encoded in each newly sequenced genome. Evidence for a hypothetical protein's role in the metabolism of the organism can be inferred by comparing its sequence or structure with homologs and by conserved domain analysis. Even when there is enough evidence that the product of the gene is expressed, by techniques such as microarray and mass spectrometry, it is difficult to assign a function to it given its lack of identity to protein sequences with annotated biochemical function. Nowadays, most protein sequences are inferred from computational analysis of genomic DNA sequence. Hypothetical proteins are created by gene prediction software during genome analysis. When the bioinformatic tool used for the gene identification finds a large open reading frame without a characterised homologue in the protein database, it returns "hypothetical protein" as an annotation remark.

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.

Olfactory receptor 2B11 is a protein that in humans is encoded by the OR2B11 gene.

The Protein Structure Initiative (PSI) was a US-based project that aimed to accelerate discovery in structural genomics and contribute to the understanding of biological function. Funded by the U.S. National Institute of General Medical Sciences (NIGMS) between 2000 and 2015, its aim was to reduce the cost and time required to determine three-dimensional protein structures and to develop techniques for solving challenging problems in structural biology, including membrane proteins. Over a dozen research centers were supported by the PSI for work in building and maintaining high-throughput structural genomics pipelines, developing computational protein structure prediction methods, organizing and disseminating information generated by the PSI, and applying high-throughput structure determination to study a broad range of important biological and biomedical problems.

DAVID is a free online bioinformatics resource developed by the Laboratory of Human Retrovirology and Immunoinformatics. All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies, e.g. microarray and proteomics studies. DAVID can be found at https://david.ncifcrf.gov/

PDBWiki was a wiki that functioned as a user-contributed database of protein structure annotations, listing all the protein structures available in the Protein Data Bank (PDB). It ran on the MediaWiki wiki application from 2007 to 2013. The website went offline in 2014 and there has not been any way to subsequently access the information that was contributed. PDBWiki contained details of more than 50,000 protein structures and over 50 'user-contributed' annotations, making it a significant resource for the structural biology community.

Rootletin, also known as ciliary rootlet coiled-coil protein (CROCC), is a protein that in humans is encoded by the CROCC gene. Rootletin is a component of the ciliary rootlet, and, together with CEP68 and CEP250, is required for centrosome cohesion.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

The Conserved Domain Database (CDD) is a database of well-annotated multiple sequence alignment models and derived database search models, for ancient domains and full-length proteins.

References

1. Weekes, Dana; Krishna, S. Sri; Bakolitsa, Constantina; Wilson, Ian A.; Godzik, Adam; Wooley, John (2010). "TOPSAN: a collaborative annotation environment for structural genomics". BMC Bioinformatics. 11: 426. doi:10.1186/1471-2105-11-426. PMC 2936398. PMID 20716366.

External links

http://www.topsan.org

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
