Integrative bioinformatics

Integrative bioinformatics is a discipline of bioinformatics that focuses on problems of data integration for the life sciences.

Bioinformatics: software tools for understanding biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial and scientific. Data integration appears with increasing frequency as the volume of data and the need to share it grow. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users.

With the rise of high-throughput (HTP) technologies in the life sciences, particularly in molecular biology, the amount of collected data has grown exponentially. Furthermore, the data are scattered over a plethora of both public and private repositories and are stored in a large number of different formats. This situation makes it very difficult to search these data and to perform the analyses needed to extract new knowledge from the complete set of available data. Integrative bioinformatics attempts to tackle this problem by providing unified access to life science data.

High-throughput screening: drug discovery experimental technique

High-throughput screening (HTS) is a method for scientific experimentation especially used in drug discovery and relevant to the fields of biology and chemistry. Using robotics, data processing/control software, liquid handling devices, and sensitive detectors, high-throughput screening allows a researcher to quickly conduct millions of chemical, genetic, or pharmacological tests. Through this process one can rapidly identify active compounds, antibodies, or genes that modulate a particular biomolecular pathway. The results of these experiments provide starting points for drug design and for understanding the interaction or role of a particular biochemical process.

Molecular biology: branch of biology dealing with the molecular basis of biological activity

Molecular biology is a branch of biology that concerns the molecular basis of biological activity between biomolecules in the various systems of a cell, including the interactions between DNA, RNA, proteins and their biosynthesis, as well as the regulation of these interactions. Writing in Nature in 1961, William Astbury described molecular biology as:

...not so much a technique as an approach, an approach from the viewpoint of the so-called basic sciences with the leading idea of searching below the large-scale manifestations of classical biology for the corresponding molecular plan. It is concerned particularly with the forms of biological molecules and [...] is predominantly three-dimensional and structural – which does not mean, however, that it is merely a refinement of morphology. It must at the same time inquire into genesis and function.

An information repository provides an easy way to deploy a secondary tier of data storage that can comprise multiple, networked data storage technologies running on diverse operating systems. Data that no longer needs to be in primary storage is protected, classified according to captured metadata, processed, de-duplicated, and then purged automatically, based on data service-level objectives and requirements. In information repositories, data storage resources are virtualized as composite storage sets and operate as a federated environment.

Approaches

Semantic web approaches

In the Semantic Web approach, data from multiple websites or databases is searched via metadata. Metadata is machine-readable code, which defines the contents of the page for the program so that comparisons between the data and the search terms are more accurate. This serves to decrease the number of results that are irrelevant or unhelpful. Some metadata takes the form of definitions organized into ontologies, which can be tagged by either users or programs; these facilitate searches by using key terms or phrases to find and return the data.[1] Advantages of this approach include the generally higher quality of the data returned in searches and, with proper tagging, the ability of ontologies to find entries that do not explicitly state the search term but are still relevant. One disadvantage is that results are returned in the format of their database of origin, so direct comparisons may be difficult. Another problem is that the terms used in tagging and searching can sometimes be ambiguous and may cause confusion among the results.[2] In addition, the Semantic Web approach is still considered an emerging technology and is not in wide-scale use at this time.[3]
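
As a hedged illustration of this approach, the sketch below loads RDF metadata and retrieves entries by ontology tag with a SPARQL query. The rdflib library is real, but the file name, vocabulary and term (`ex:annotatedWith`, `ex:DnaRepair`) are invented for the example.

```python
# A minimal sketch of metadata-driven search over RDF, assuming a local
# Turtle file "proteins.ttl" and a hypothetical vocabulary under ex: .
from rdflib import Graph

g = Graph()
g.parse("proteins.ttl", format="turtle")  # load RDF metadata from a local file

# SPARQL query: find every resource tagged with the (hypothetical) ontology
# term ex:DnaRepair, regardless of which source database described it.
query = """
PREFIX ex: <http://example.org/bio#>
SELECT ?entity ?label WHERE {
    ?entity ex:annotatedWith ex:DnaRepair .
    ?entity ex:label ?label .
}
"""
for row in g.query(query):
    print(row.entity, row.label)
```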

The Semantic Web is an extension of the World Wide Web through standards by the World Wide Web Consortium (W3C). The standards promote common data formats and exchange protocols on the Web, most fundamentally the Resource Description Framework (RDF). According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries". The Semantic Web is therefore regarded as an integrator across different content, information applications and systems.

Metadata: data about data

Metadata is "data [information] that provides information about other data". Many distinct types of metadata exist, including descriptive, structural, administrative, reference and statistical metadata.

Machine-readable data, or computer-readable data, is data in a format that can be easily processed by a computer.

One of the current applications of ontology-based search in the biomedical sciences is GoPubMed, which searches the PubMed database of scientific literature.[1] Another use of ontologies is within databases such as SwissProt, Ensembl and TrEMBL, which use this technology to search through their stores of human proteome-related data for tags related to the search term.[4]
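
In the same spirit, a controlled-vocabulary search of PubMed can be sketched with Biopython's Entrez module (a real API). This is not GoPubMed's own implementation; the MeSH term and email address below are placeholders.

```python
# A minimal sketch of ontology-term retrieval from PubMed: restrict the
# search to a controlled vocabulary term (a MeSH heading) rather than
# free text, which is the core idea behind ontology-based search.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact email

handle = Entrez.esearch(db="pubmed", term='"DNA Repair"[MeSH Terms]', retmax=5)
record = Entrez.read(handle)
handle.close()

print(record["Count"])   # total number of matching articles
print(record["IdList"])  # PubMed IDs of the first few hits
```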

GoPubMed was a knowledge-based search engine for biomedical texts. The Gene Ontology (GO) and Medical Subject Headings (MeSH) served as a "table of contents" to structure the millions of articles in the MEDLINE database. MeshPubMed was at one point a separate project, but the two were merged.

PubMed: online database of abstracts of medical articles, hosted by the US National Library of Medicine

PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.

Some of the research in this field has focused on creating new and specific ontologies.[5] Other researchers have worked on verifying the results of existing ontologies.[2] In a specific example, the goal of Verschelde et al. was to integrate several different ontology libraries into a larger one that contained more definitions from different subspecialties (medical, molecular biological, etc.) and was able to distinguish between ambiguous tags; the result was a data-warehouse-like effect, with easy access to multiple databases through the use of ontologies.[4] In a separate project, Bertens et al. constructed a latticework of three ontologies (for the anatomy and development of model organisms) on a novel framework ontology of generic organs. For example, a search for 'heart' in this ontology would return the heart plans of each vertebrate species whose ontologies were included. The stated goal of the project is to facilitate comparative and evolutionary studies.[6]
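
The following toy sketch illustrates the kind of problem such integration work addresses, not the actual method of Verschelde et al.: merging term definitions from two subspecialty vocabularies while flagging ambiguous tags that carry conflicting definitions across sources. All terms and definitions are invented.

```python
# Merge two hypothetical term dictionaries; terms with conflicting
# definitions are collected for later disambiguation instead of being
# silently overwritten.
medical = {"cell": "smallest structural unit of an organism",
           "culture": "growth of microorganisms in a prepared medium"}
sociology = {"culture": "shared customs and beliefs of a group"}

merged, ambiguous = {}, {}
for source in (medical, sociology):
    for term, definition in source.items():
        if term in merged and merged[term] != definition:
            ambiguous.setdefault(term, {merged[term]}).add(definition)
        else:
            merged[term] = definition

print(sorted(ambiguous))  # ['culture'] -- needs disambiguation before use
```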

Data warehousing approaches

In the data warehousing strategy, the data from different sources are extracted and integrated into a single database. For example, various 'omics' datasets (from genomics, transcriptomics, proteomics, interactomics and metabolomics) may be integrated to provide insights into biological systems. Ideally, changes in the sources are regularly synchronized to the integrated database. The data are presented to users in a common format. Many programs that aid in the creation of such warehouses are designed to be highly versatile, so that they can be implemented in diverse research projects.[7] One advantage of this approach is that data are available for analysis at a single site, using a uniform schema. Some disadvantages are that the datasets are often huge and difficult to keep up to date. Another problem is that compiling such a warehouse is costly.[8]
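
A minimal sketch of this strategy, assuming two hypothetical CSV exports from different 'omics' sources, loads them into one SQLite table with a uniform schema; production warehouses such as Atlas are of course far more elaborate.

```python
# Extract records from two invented source files with differing column
# names and load them into a single warehouse table with a common schema.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS gene_expression (
                    gene TEXT, value REAL, source TEXT)""")

# Hypothetical sources: map each file's column names onto the common format.
sources = {
    "transcriptomics.csv": ("gene_id", "expression"),
    "proteomics.csv": ("protein_gene", "abundance"),
}
for path, (gene_col, value_col) in sources.items():
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            conn.execute("INSERT INTO gene_expression VALUES (?, ?, ?)",
                         (row[gene_col], float(row[value_col]), path))
conn.commit()

# All sources are now queryable at a single site through one schema.
for row in conn.execute(
        "SELECT gene, AVG(value) FROM gene_expression GROUP BY gene"):
    print(row)
```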

Standardized formats for different types of data (e.g., protein data) are now emerging due to the influence of groups like the Proteomics Standards Initiative (PSI). Some data warehousing projects even require the submission of data in one of these new formats.[9]
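
Submission pipelines can enforce such formats with simple checks. The sketch below, using only the Python standard library, verifies that an XML file declares the PSI mzML namespace; the file name is a placeholder and the namespace URI is stated here as an assumption.

```python
# Check that an XML file's root element lives in the (assumed) PSI mzML
# namespace before accepting it for submission.
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"  # assumed PSI namespace URI

root = ET.parse("experiment.mzML").getroot()  # placeholder file name
tag_ns = root.tag.split("}")[0].lstrip("{")
print("looks like mzML" if tag_ns == MZML_NS
      else f"unexpected namespace: {tag_ns}")
```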

Other approaches

Data mining uses statistical methods to search for patterns in existing data. This method generally returns many patterns, some spurious and some significant, but every pattern the program finds must be evaluated individually. Current research focuses on combining existing data mining techniques with novel pattern analysis methods that reduce the need to review each pattern found by the initial program and instead return a few results with a high likelihood of relevance.[10] One drawback of this approach is that it does not integrate multiple databases, so comparisons across databases are not possible. Its major advantage is that it allows for the generation of new hypotheses to test.
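
The following toy example illustrates the filtering idea: enumerate co-occurrence patterns in an invented dataset and keep only those unlikely to be spurious, using a Bonferroni-corrected binomial test from SciPy. It is a sketch of the general principle, not of any specific published method.

```python
# Mine pairwise co-occurrence patterns and keep only those that co-occur
# significantly more often than independence would predict.
from itertools import combinations
from scipy.stats import binomtest

transactions = [{"geneA", "geneB", "geneC"},  # invented example data
                {"geneA", "geneB"},
                {"geneA", "geneC"},
                {"geneB", "geneC"},
                {"geneA", "geneB", "geneC"}]
items = sorted(set().union(*transactions))
n = len(transactions)
pairs = list(combinations(items, 2))

for a, b in pairs:
    observed = sum(1 for t in transactions if a in t and b in t)
    # expected co-occurrence probability if a and b were independent
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    result = binomtest(observed, n, p_a * p_b, alternative="greater")
    if result.pvalue < 0.05 / len(pairs):  # Bonferroni correction
        print(f"{a}+{b}: p={result.pvalue:.3g}")
```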

Related Research Articles

In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains.

National Center for Biotechnology Information: database arm of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

Biological database: database of biological information

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

Gene ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies (OBO).
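
A minimal sketch of the enrichment analysis mentioned above, with invented counts: the hypergeometric test asks whether a GO term appears in a gene list more often than chance would predict.

```python
# Hypergeometric test for over-representation of a GO term in a gene list.
from scipy.stats import hypergeom

M = 20000   # genes in the background population (invented)
K = 300     # background genes annotated with the GO term (invented)
N = 150     # genes in the experimental list (invented)
k = 12      # list genes annotated with the GO term (invented)

# P(X >= k): probability of seeing at least k annotated genes by chance
p_value = hypergeom.sf(k - 1, M, K, N)
print(f"enrichment p-value: {p_value:.3g}")
```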

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Open Biomedical Ontologies is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO forms part of the resources of the U.S. National Center for Biomedical Ontology, where it is a central element of the NCBO's BioPortal.

BioMOBY is a registry of web services used in bioinformatics. It allows interoperability between biological data hosts and analytical services by annotating services with terms taken from standard ontologies. BioMOBY is released under the Artistic License.

Ontotext is a Bulgarian software company headquartered in Sofia. It is the semantic technology branch of Sirma Group. Its main domain of activity is the development of software based on the Semantic Web languages and standards, in particular RDF, OWL and SPARQL. Ontotext is best known for the Ontotext GraphDB semantic graph database engine. Another major business line is the development of enterprise knowledge management and analytics systems that involve big knowledge graphs. Those systems are developed on top of the Ontotext Platform that builds on top of GraphDB capabilities for text mining using big knowledge graphs.

Robert David Stevens: Professor of Bio-Health Informatics

Robert David Stevens is a Professor of Bio-Health Informatics in the School of Computer Science at the University of Manchester. He has served as head of the School of Computer Science since 2016.

The Biomolecular Object Network Databank is a bioinformatics databank containing information on small molecules, their structures and their interactions. The databank integrates a number of existing databases to provide a comprehensive overview of the information currently available for a given molecule.

The National Centre for Text Mining (NaCTeM) is a publicly funded text mining (TM) centre. It was established to provide support, advice, and information on TM technologies and to disseminate information from the larger TM community, while also providing tailored services and tools in response to the requirements of the United Kingdom academic community.

The concept of the Social Semantic Web subsumes developments in which social interactions on the Web lead to the creation of explicit and semantically rich knowledge representations. The Social Semantic Web can be seen as a Web of collective knowledge systems, which are able to provide useful information based on human contributions and which get better as more people participate. The Social Semantic Web combines technologies, strategies and methodologies from the Semantic Web, social software and the Web 2.0.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of several data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology-based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.
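
A toy sketch of the GAV idea, with invented sources and mappings: each relation of the global schema is defined as a view (here, a Python function) over the underlying sources, so queries against the global schema are answered by unfolding those views.

```python
# Invented source records with heterogeneous, source-specific schemas.
source_a = [{"uniprot": "P04637", "name": "TP53"}]
source_b = [{"ensembl": "ENSG00000141510", "symbol": "TP53", "chrom": "17"}]

def global_gene():
    """Global relation gene(symbol, chromosome), defined over source_b."""
    for rec in source_b:
        yield {"symbol": rec["symbol"], "chromosome": rec["chrom"]}

def global_protein():
    """Global relation protein(accession, gene_symbol), over source_a."""
    for rec in source_a:
        yield {"accession": rec["uniprot"], "gene_symbol": rec["name"]}

print(list(global_gene()))
print(list(global_protein()))
```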

Lawrence Hunter: computational biologist

Lawrence E. Hunter is a Professor and Director of the Center for Computational Pharmacology and of the Computational Bioscience Program at the University of Colorado School of Medicine, and Professor of Computer Science at the University of Colorado Boulder. He is an internationally known scholar focused on computational biology, knowledge-driven extraction of information from the primary biomedical literature, the semantic integration of knowledge resources in molecular biology, and the use of knowledge in the analysis of high-throughput data. He is also known for his foundational work in computational biology, which led to the genesis of the major professional organization in the field and two international conferences.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Translational bioinformatics (TBI) is an emerging field in the study of health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools that can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining and analysis of biomedical data to generate clinical knowledge for application. Such clinical knowledge includes finding similarities in patient populations, interpreting biological information to suggest therapies, and predicting health outcomes.

Semantic Automated Discovery and Integration (SADI) is a lightweight set of fully standards-compliant Semantic Web service design patterns that simplify the publication of services of the type commonly found in bioinformatics and other scientific domains. SADI services utilize Semantic Web technologies at every level of the Web services "stack". Services are described in OWL-DL, where the property restrictions in OWL classes are used to define the properties expected of the input and output data. Invocation of SADI Services is achieved through HTTP POST of RDF data representing OWL Individuals ('instances') of the defined input OWL Class, and the resulting output data will be OWL Individuals of the defined output OWL Class.
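
A hedged sketch of what invoking such a service looks like at the HTTP level, using the real requests library; the endpoint URL, vocabulary and RDF serialization below are placeholders rather than any specific SADI service.

```python
# POST RDF describing an OWL individual of the service's input class and
# read back RDF describing individuals of its output class.
import requests

input_rdf = """@prefix ex: <http://example.org/sadi#> .
<http://example.org/data/1> a ex:InputClass ;
    ex:hasSequence "MKTAYIAKQR" ."""

resp = requests.post(
    "http://example.org/sadi/service",   # placeholder endpoint
    data=input_rdf.encode("utf-8"),
    headers={"Content-Type": "text/turtle", "Accept": "text/turtle"},
)
print(resp.status_code)
print(resp.text)  # RDF for individuals of the declared output class
```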

In bioinformatics, a gene disease database is a systematized collection of data, typically structured to model aspects of reality, in a way that helps comprehend the underlying mechanisms of complex diseases by capturing the multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene disease databases integrate human gene-disease associations from various expert-curated databases and text-mining-derived associations, covering Mendelian, complex and environmental diseases.

References

  1. Doms, A.; Schroeder, M. (2005). "GoPubMed: exploring PubMed with the Gene Ontology". Nucleic Acids Research. 33 (Web Server issue): W783–6. doi:10.1093/nar/gki470. PMC 1160231. PMID 15980585. Retrieved 28 September 2012.
  2. Van Ophuizen, E.A.A.; Leunissen, J.A.M. (2010). "An evaluation of the performance of three semantic background knowledge sources in comparative anatomy". Journal of Integrative Bioinformatics. Retrieved 28 October 2012.
  3. Ruttenberg et al. (2007). "Advancing translational research with the Semantic Web". BMC Bioinformatics. Retrieved 28 September 2012.
  4. Verschelde et al. (2007). "Ontology-Assisted Database Integration to Support Natural Language Processing and Biomedical Data-mining". Journal of Integrative Bioinformatics. Retrieved 28 October 2012.
  5. Castillo et al. (2012). "Construction of coffee transcriptome networks based on gene annotation semantics". Journal of Integrative Bioinformatics. Retrieved 29 October 2012.
  6. Bertens et al. (2011). "A generic organ based ontology system, applied to vertebrate heart anatomy, development and physiology". Journal of Integrative Bioinformatics. Retrieved 30 October 2012.
  7. Shah et al. (2005). "Atlas – a data warehouse for integrative bioinformatics". BMC Bioinformatics. Retrieved 30 September 2012.
  8. Kuenne et al. (2007). "Using Data Warehouse Technology in Crop Plant Bioinformatics". Journal of Integrative Bioinformatics. Retrieved 30 September 2012.
  9. Thiele et al. (2010). "Bioinformatics Strategies in Life Sciences: From Data Processing and Data Warehousing to Biological Knowledge Extraction". Journal of Integrative Bioinformatics. Retrieved 29 October 2012.
  10. Belmamoune et al. (2010). "Mining and Analysing Spatio-Temporal Patterns of Gene Expression in an Integrative Database Framework". Journal of Integrative Bioinformatics. Retrieved 27 October 2012.