Biomolecular Object Network Databank

The Biomolecular Object Network Databank is a bioinformatics databank containing information on small molecule structures and interactions. The databank integrates a number of existing databases to provide a comprehensive overview of the information currently available for a given molecule.

Background

BOND
Developer(s): Christopher Hogue et al., Samuel Lunenfeld Research Institute, Mount Sinai Hospital. Commercial rights: Unleashed Informatics
Stable release: BIND 4.0, SMIDsuite
Type: Bioinformatics tool
License: Open access

The Blueprint Initiative started as a research program in the lab of Dr. Christopher Hogue at the Samuel Lunenfeld Research Institute at Mount Sinai Hospital in Toronto. On December 14, 2005, Unleashed Informatics Limited acquired the commercial rights to The Blueprint Initiative intellectual property. This included rights to the protein interaction database BIND, the small molecule interaction database SMID, as well as the data warehouse SeqHound. Unleashed Informatics is a data management service provider and is overseeing the management and curation of The Blueprint Initiative under the guidance of Dr. Hogue. [1]

Construction

BOND integrates the original Blueprint Initiative databases with other databases, such as GenBank, and with many of the tools required to analyze these data. Annotation links for sequences are also available, including taxon identifiers, redundant sequences, Gene Ontology descriptions, Online Mendelian Inheritance in Man identifiers, conserved domains, database cross-references, LocusLink identifiers and complete genomes. BOND facilitates cross-database queries and is an open access resource that integrates interaction and sequence data. [2]

Small Molecule Interaction Database (SMID)

The Small Molecule Interaction Database is a database of interactions between protein domains and small molecules. It uses a domain-based approach to identify domain families, found in the Conserved Domain Database (CDD), which interact with a query small molecule. The CDD from NCBI amalgamates data from several different sources: Protein FAMilies (PFAM), Simple Modular Architecture Research Tool (SMART), Clusters of Orthologous Groups (COGs), and NCBI's own curated sequences. The data in SMID are derived from the Protein Data Bank (PDB), a database of known protein crystal structures. SMID can be queried by entering a protein GI, domain identifier, PDB ID or SMID ID. The results of a search provide small molecule, protein, and domain information for each interaction identified in the database. Interactions with non-biological contacts are screened out by default.

SMID-BLAST is a tool developed to annotate known small-molecule binding sites as well as to predict binding sites in proteins whose crystal structures have not yet been determined. The prediction is based on extrapolation of known interactions, found in the PDB, to interactions between an uncrystallized protein and a small molecule of interest. SMID-BLAST was validated against a test set of known small molecule interactions from the PDB. It was shown to be an accurate predictor of protein-small molecule interactions: 60% of predicted interactions identically matched the PDB annotated binding site, and of these 73% had greater than 80% of the binding residues of the protein correctly identified. Hogue et al. estimated that 45% of predictions that were not observed in the PDB data do in fact represent true positives. [3]
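The validation figures above combine a per-site measure (the share of binding residues recovered) with two percentages. The sketch below is only an illustration of that arithmetic: the source does not say how "correctly identified" was computed, so the recall definition, the function name `residue_recall`, and the residue positions are all assumptions.

```python
from typing import Set

def residue_recall(predicted: Set[int], annotated: Set[int]) -> float:
    """Fraction of annotated binding residues that the prediction recovered.

    Assumed definition; the source does not specify the exact measure.
    """
    if not annotated:
        raise ValueError("annotated binding site is empty")
    return len(predicted & annotated) / len(annotated)

# Hypothetical residue positions for one binding site.
annotated = {12, 15, 48, 52, 77, 80, 81, 102, 130, 131}
predicted = {12, 15, 48, 52, 77, 80, 81, 102, 130, 140}

recall = residue_recall(predicted, annotated)   # 9 of 10 residues recovered
passes_threshold = recall > 0.80                # the "greater than 80%" criterion

# Percentages quoted in the text: 60% of predictions matched the annotated
# site, and 73% of those passed the 80% residue criterion.
matched_fraction = 0.60
high_recall_given_match = 0.73
joint_fraction = matched_fraction * high_recall_given_match
print(f"recall={recall:.2f}, passes={passes_threshold}, joint={joint_fraction:.3f}")
```

Read this way, roughly 44% of all predictions both matched the annotated site and recovered more than 80% of its residues.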

Biomolecular Interaction Network Database (BIND)

Introduction

The idea of a database to document all known molecular interactions was originally put forth by Tony Pawson in the 1990s and was later developed by scientists at the University of Toronto in collaboration with the University of British Columbia. The development of the Biomolecular Interaction Network Database (BIND) has been supported by grants from the Canadian Institutes of Health Research (CIHR), Genome Canada, [4] the Canadian Foundation for Innovation and the Ontario Research and Development Fund. BIND was originally designed to be a constantly growing depository for information regarding biomolecular interactions, molecular complexes and pathways. As proteomics is a rapidly advancing field, there is a need to have information from scientific journals readily available to researchers. BIND facilitates the understanding of molecular interactions and pathways involved in cellular processes and will eventually give scientists a better understanding of developmental processes and disease pathogenesis.

The major goals of the BIND project are to create a public proteomics resource that is available to all; to create a platform that enables data mining from other sources (PreBIND); and to create a platform capable of presenting visualizations of complex molecular interactions. From the beginning, BIND has been open access, and its software can be freely distributed and modified. Currently, BIND includes a data specification, a database and associated data mining and visualization tools. Eventually, it is hoped that BIND will be a collection of all the interactions occurring in each of the major model organisms.

Database structure

BIND contains information on three types of data: interactions, molecular complexes and pathways.

  1. Interactions are the basic component of BIND and describe how two or more objects (for example, A and B) interact with each other. The objects can be a variety of things: DNA, RNA, genes, proteins, ligands, or photons. The interaction entry contains the most information about a molecule; it provides information on its name and synonyms, where it is found (e.g. where in the cell, what species, when it is active, etc.), and its sequence or where its sequence can be found. The interaction entry also outlines the experimental conditions required to observe binding in vitro and the chemical dynamics of the interaction (including thermodynamics and kinetics).
  2. The second type of BIND entry is the molecular complex. Molecular complexes are defined as aggregates of molecules that are stable and have a function when bound to each other. A molecular complex entry links data from two or more interaction records and may also contain some information on the role of the complex in various interactions.
  3. The third component of BIND is the pathway record section. A pathway consists of a network of interactions that are involved in the regulation of cellular processes. This section may also contain information on phenotypes and diseases related to the pathway.


The minimum amount of information needed to create an entry in BIND is a PubMed publication reference and an entry in another database (e.g. GenBank). Each entry within the database provides references/authors for the data. As BIND is a constantly growing database, all components of BIND track updates and changes. [5]
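The three record types and the minimum-entry rule described above can be pictured with a small data model. This is a minimal sketch rather than the BIND specification itself, which is written in ASN.1: every class name, field name, and identifier value below is hypothetical and chosen only to mirror the prose.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MolecularObject:
    name: str
    kind: str        # e.g. "protein", "DNA", "RNA", "gene", "ligand", "photon"
    source_db: str   # external database holding the entry, e.g. "GenBank"
    accession: str   # identifier within that database

@dataclass
class Interaction:
    bind_id: int
    object_a: MolecularObject
    object_b: MolecularObject
    pubmed_id: int   # publication supporting the interaction

@dataclass
class MolecularComplex:
    bind_id: int
    interaction_ids: List[int]

    def __post_init__(self) -> None:
        # The text states that a complex links two or more interaction records.
        if len(self.interaction_ids) < 2:
            raise ValueError("a complex links two or more interaction records")

@dataclass
class Pathway:
    bind_id: int
    interaction_ids: List[int]
    related_diseases: Optional[List[str]] = None

def meets_minimum(pubmed_id: Optional[int], accession: Optional[str]) -> bool:
    """Minimum for a new entry: a PubMed reference plus an external database entry."""
    return pubmed_id is not None and bool(accession)

# Hypothetical example values.
a = MolecularObject("protein A", "protein", "GenBank", "X00001")
b = MolecularObject("ligand B", "ligand", "GenBank", "X00002")
record = Interaction(bind_id=1, object_a=a, object_b=b, pubmed_id=12345)
print(meets_minimum(record.pubmed_id, record.object_a.accession))
```

Storing interaction identifiers in complexes and pathways, rather than copies of the records, reflects the description of complexes as links between existing interaction records.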

BIND is based on a data specification written in Abstract Syntax Notation One (ASN.1). ASN.1 is also used by NCBI when storing data for its Entrez system, and because of this BIND uses the same standards as NCBI for data representation. ASN.1 is preferred because it can be easily translated into other data specification languages (e.g. XML), can easily handle complex data and can be applied to all biological interactions, not just those involving proteins. [5] Bader and Hogue (2000) have prepared a detailed manuscript on the ASN.1 data specification used by BIND. [6]

Data submission and curation

User submission to the database is encouraged. To contribute to the database, one must submit contact information, a PubMed identifier and the two molecules that interact. The person who submits a record is its owner. All records are validated before being made public, and BIND is curated for quality assurance. BIND curation has two tracks: high-throughput (HTP) and low-throughput (LTP). HTP records come from papers that report more than 40 interaction results from one experimental methodology. HTP curators typically have a bioinformatics background. They are responsible for the collection and storage of experimental data, and they also create scripts to update BIND based on new publications. LTP records are curated by individuals with either an MSc or PhD and laboratory experience in interaction research. LTP curators are given further training through the Canadian Bioinformatics Workshops. Information on small molecule chemistry is curated separately by chemists to ensure the curator is knowledgeable about the subject. The priority for BIND curation is to focus on LTP to collect information as it is published. Although HTP studies provide more information at once, more LTP studies are being reported, and similar numbers of interactions are being reported by both tracks. In 2004, BIND collected data from 110 journals. [7]
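The HTP criterion above (more than 40 interaction results from one experimental methodology) can be expressed as a simple rule. The sketch below is an illustration only: the function name `curation_track`, the input shape (one method label per reported interaction result), and the method names are assumptions, not part of BIND's curation software.

```python
from collections import Counter
from typing import List

HTP_THRESHOLD = 40  # "more than 40 interaction results from one experimental methodology"

def curation_track(methods_per_result: List[str]) -> str:
    """Return "HTP" if any single method accounts for more than 40 results, else "LTP"."""
    counts = Counter(methods_per_result)
    if counts and max(counts.values()) > HTP_THRESHOLD:
        return "HTP"
    return "LTP"

# Hypothetical papers, each described by the method behind every reported result.
large_screen = ["two-hybrid"] * 41
mixed_study = ["two-hybrid"] * 30 + ["co-immunoprecipitation"] * 30

print(curation_track(large_screen))  # HTP
print(curation_track(mixed_study))   # LTP
```

Note that the second paper reports 60 results in total but stays on the LTP track, because no single methodology exceeds the threshold.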

Database growth

BIND has grown significantly since its conception; in fact, the database saw a tenfold increase in entries between 2003 and 2004. By September 2004, there were over 100,000 interaction records (including 58,266 protein-protein, 4,225 genetic, 874 protein-small molecule, 25,857 protein-DNA, and 19,348 biopolymer interactions). The database also contains sequence information for 31,972 proteins, 4,560 DNA samples and 759 RNA samples. These entries have been collected from 11,649 publications; therefore, the database represents an important amalgamation of data. The organisms with entries in the database include Saccharomyces cerevisiae, Drosophila melanogaster, Homo sapiens, Mus musculus, Caenorhabditis elegans, Helicobacter pylori, Bos taurus, HIV-1, Gallus gallus, and Arabidopsis thaliana, as well as others. In total, 901 taxa were included by September 2004, and BIND has been split up into BIND-Metazoa, BIND-Fungi, and BIND-Taxroot. [7]

Not only is the information contained within the database continually updated, but the software itself has also gone through several revisions. Version 1.0 of BIND was released in 1999 and, based on user feedback, was modified to include additional detail on the experimental conditions required for binding and a hierarchical description of the cellular location of the interaction. Version 2.0 was released in 2001 and included the capability to link to information available in other databases. [5] Version 3.0 (2002) expanded the database from physical/biochemical interactions to also include genetic interactions. [8] Version 3.5 (2004) included a refined user interface that aimed to simplify information retrieval. [7] In 2006, BIND was incorporated into the Biomolecular Object Network Databank (BOND), where it continues to be updated and improved.

Special features

BIND was the first database of its kind to contain information on biomolecular interactions, reactions and pathways in one schema. It was also the first to base its ontology on chemistry, which allows 3D representation of molecular interactions. The underlying chemistry allows molecular interactions to be described down to the atomic level of resolution. [7]

PreBIND is an associated data mining system that locates biomolecular interaction information in the scientific literature. The name or accession number of a protein can be entered, and PreBIND will scan the literature and return a list of potentially interacting proteins. BIND BLAST is also available to find interactions with proteins that are similar to the one specified in the query. [7]

BIND offers several features that many other proteomics databases do not include. The authors of the program have created an extension to traditional IUPAC nomenclature to help describe post-translational modifications of amino acids. These modifications include acetylation, formylation, methylation, palmitoylation, etc. The extension of the traditional IUPAC codes allows these modified amino acids to be represented in sequence form as well.

BIND also utilizes a unique visualization tool known as OntoGlyphs. The OntoGlyphs were developed based on Gene Ontology (GO) and provide a link back to the original GO information. A number of GO terms have been grouped into categories, each one representing a specific function, binding specificity, or localization in the cell. There are 83 OntoGlyph characters in total. There are 34 functional OntoGlyphs, which contain information about the role of the molecule (e.g. cell physiology, ion transport, signaling). There are 25 binding OntoGlyphs, which describe what the molecule binds (e.g. ligands, DNA, ions). The other 24 OntoGlyphs provide information about the location of the molecule within a cell (e.g. nucleus, cytoskeleton). The OntoGlyphs can be selected and manipulated to include or exclude certain characteristics from search results. The visual nature of the OntoGlyphs also facilitates pattern recognition when looking at search results. [7]

ProteoGlyphs are graphical representations of the structural and binding properties of proteins at the level of conserved domains. The protein is diagrammed as a straight horizontal line, and glyphs are inserted to represent conserved domains. Each glyph is displayed to represent the relative position and length of its alignment in the protein sequence.
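Using OntoGlyphs to include or exclude characteristics from search results amounts to set-based filtering. The sketch below illustrates that idea under stated assumptions: the `Record` class, the `filter_records` function, and the glyph labels attached to each record are hypothetical and do not reflect BOND's actual interface. The category counts are those given in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Category sizes as stated in the text: 34 + 25 + 24 = 83 OntoGlyphs.
ONTOGLYPH_COUNTS = {"function": 34, "binding": 25, "location": 24}
TOTAL_ONTOGLYPHS = sum(ONTOGLYPH_COUNTS.values())

@dataclass(frozen=True)
class Record:
    bind_id: int
    glyphs: FrozenSet[str]  # hypothetical glyph labels attached to a record

def filter_records(
    records: List[Record],
    include: FrozenSet[str] = frozenset(),
    exclude: FrozenSet[str] = frozenset(),
) -> List[Record]:
    """Keep records carrying every included glyph and none of the excluded ones."""
    return [
        r for r in records
        if include <= r.glyphs and not (exclude & r.glyphs)
    ]

# Hypothetical search results.
records = [
    Record(1, frozenset({"signaling", "DNA", "nucleus"})),
    Record(2, frozenset({"ion transport", "ions", "cytoskeleton"})),
    Record(3, frozenset({"signaling", "ligands", "cytoskeleton"})),
]

hits = filter_records(
    records,
    include=frozenset({"signaling"}),
    exclude=frozenset({"nucleus"}),
)
print([r.bind_id for r in hits])  # [3]
```

Treating each record's glyphs as a set keeps the include and exclude checks to a subset test and an intersection test.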

Accessing the database

Figure 1: Screenshot of sequence results obtained using BOND

The database user interface is web-based and can be queried using text or accession numbers/identifiers. Since its integration with the other components of BOND, sequences have been added to interactions, molecular complexes and pathways in the results. Records include information on: BIND ID, description of the interaction/complex/pathway, publications, update records, organism, OntoGlyphs, ProteoGlyphs, and links to other databases where additional information can be found. BIND records include various viewing formats (e.g. HTML, ASN.1, XML, FASTA), various formats for exporting results (e.g. ASN.1, XML, GI list, PDF), and visualizations (e.g. Cytoscape). The exact viewing and exporting options vary depending on what type of data has been retrieved.

User statistics

The number of Unleashed registrants has increased tenfold since the integration of BIND. As of December 2006, registration fell just short of 10,000. Subscribers to the commercial versions of BOND fall into six general categories: agriculture and food, biotechnology, pharmaceuticals, informatics, materials and other. The biotechnology sector is the largest of these groups, holding 28% of subscriptions. Pharmaceuticals and informatics follow with 22% and 18% respectively. The United States holds the bulk of these subscriptions, at 69%. Other countries with access to the commercial versions of BOND include Canada, the United Kingdom, Japan, China, Korea, Germany, France, India and Australia. Each of these countries holds less than 6% of the user share. [2]

References

  1. Blueprint.org
  2. BOND at Unleashed Informatics. Archived March 14, 2007, at the Wayback Machine.
  3. Snyder, K, et al. Domain-based small molecule binding site annotation. BMC Bioinformatics 7: 152 (2006).
  4. BIND at genomecanada.ca
  5. Bader, GD, et al. BIND- The Biomolecular Interaction Network Database. Nucleic Acids Research 29: 242-245 (2001).
  6. Bader, GD, Hogue, CWV. BIND- a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16(5): 465-477 (2000).
  7. Alfarano, C, et al. The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Research 33: D418-D424 (2005).
  8. Bader, GD, et al. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Research 31: 248-250 (2003).