Gene Disease Database

Gene Disease Database
Classification: Bioinformatics
Subclassification: Databases
Type of databases: Biological
Subtype of databases: Gene-disease

In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, that is designed to help comprehend the underlying mechanisms of complex diseases by capturing the multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. [1] Gene Disease Databases integrate human gene-disease associations from various expert-curated databases and from text-mining-derived associations, covering Mendelian, complex and environmental diseases. [2] [3]

Introduction

Experts in different areas of biology and bioinformatics have long been trying to comprehend the molecular mechanisms of diseases in order to design preventive and therapeutic strategies. For some illnesses, it has become apparent that it is not enough to obtain an index of the disease-related genes; it is also necessary to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. [4] Moreover, even with the unprecedented wealth of information available, obtaining such catalogues is extremely difficult.

Broadly speaking, genetic diseases are caused by aberrations in genes or chromosomes. Many genetic diseases develop before birth. Genetic disorders account for a significant number of the health care problems in our society. Advances in the understanding of these diseases have increased both the life span and the quality of life of many of those affected by genetic disorders. Recent developments in bioinformatics and laboratory genetics have made possible a better delineation of certain malformation and intellectual disability syndromes, so that their mode of inheritance can be understood. This information enables the genetic counselor to predict the risk of occurrence of a large number of genetic disorders. [2] Most genetic counseling is done, however, only after the birth of at least one affected individual has alerted the family to their predisposition to having children with a genetic disorder. The association of a single gene with a disease is rare, and a genetic disease may or may not be a transmissible disorder. [5] Some genetic diseases are inherited through the parents' genes, but others are caused by new mutations or changes to the DNA. In other cases the same disease, for instance some forms of carcinoma or melanoma, may stem from an inherited condition in some people, from new changes in other people, and from non-genetic causes in still other individuals. [6]

There are more than six thousand known single-gene (monogenic) disorders, which occur in about 1 out of every 200 births. [1] As the term suggests, these diseases are caused by a mutation in one gene. By contrast, polygenic disorders are caused by several genes, often in combination with environmental factors. [7] Examples of genetic phenotypes include Alzheimer's disease, breast cancer, leukemia, Down syndrome, heart defects, and deafness; a catalogue that sorts out all the diseases related to genes is therefore needed.

Challenges with creation

Gene prioritization workflow of human diseases: Typical lists come from linkage regions, chromosomal aberrations, association study loci, differentially expressed gene lists or genes identified by sequencing variants. Alternatively, the complete genome can be prioritized, but substantially more false positives would then be expected.

At different stages of any gene-disease project, molecular biologists need to choose, even after careful statistical data analysis, which genes or proteins to investigate further experimentally and which to leave out because of limited resources. Computational methods that integrate complex, heterogeneous data sets, such as expression data, sequence information, functional annotation and the biomedical literature, allow genes to be prioritized for future study in a more informed way. Such methods can substantially increase the yield of downstream studies and are becoming invaluable to researchers. One of the main concerns in biological and biomedical research is therefore to recognise the underlying mechanisms behind these intricate genetic phenotypes. Great effort has been spent on finding the genes related to diseases. [8]
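As a rough illustration of how heterogeneous evidence might be combined, the sketch below ranks candidate genes by their mean rank across several evidence sources. It is not a method taken from the cited literature: the function name `prioritize`, the mean-rank rule, and the placeholder gene and source names are all assumptions made for this example.

```python
from statistics import mean


def prioritize(candidates, evidence):
    """Rank candidate genes by their mean rank across evidence sources.

    candidates: list of gene identifiers.
    evidence:   dict mapping a source name to a dict of gene -> score,
                where a higher score means stronger support.
    A gene that a source does not score receives the worst possible rank
    (len(candidates)) for that source. Ties keep the input order.
    """
    if not evidence:
        return list(candidates)

    worst_rank = len(candidates)
    ranks = {gene: [] for gene in candidates}

    for scores in evidence.values():
        # Order only the genes this source actually scored, best first.
        scored = sorted(
            (g for g in candidates if g in scores),
            key=lambda g: scores[g],
            reverse=True,
        )
        position = {g: i for i, g in enumerate(scored, start=1)}
        for gene in candidates:
            ranks[gene].append(position.get(gene, worst_rank))

    # A lower mean rank means the gene is a higher priority.
    return sorted(candidates, key=lambda g: mean(ranks[g]))


# Placeholder identifiers and scores, invented purely for illustration.
example_evidence = {
    "expression": {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.5},
    "literature": {"GENE_A": 3, "GENE_C": 10},
    "annotation": {"GENE_A": 0.5, "GENE_C": 1.0},
}
print(prioritize(["GENE_A", "GENE_B", "GENE_C"], example_evidence))
```

Mean rank is used here only because it needs no calibration between sources that score on different scales; real prioritization tools may weight sources or use statistical models instead.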

However, increasing evidence indicates that most human diseases cannot be attributed to a single gene but arise from complex interactions among multiple genetic variants and environmental risk factors. Several databases have been developed to store associations between genes and diseases, such as the Comparative Toxicogenomics Database (CTD), Online Mendelian Inheritance in Man (OMIM), the Genetic Association Database (GAD) and the gene-disease association database DisGeNET. Each of these databases focuses on different aspects of the phenotype-genotype relationship. Because of the nature of the database curation process, none of them is complete, but they complement one another. [9]

Types of databases

Essentially, there are four types of databases: curated databases, predictive databases, literature databases and integrative databases. [1]

Curated databases

The term curated data refers to information, which may comprise sophisticated computational formats for structured data, scientific updates and curated knowledge, that has been compiled and prepared under the supervision of one or more experts considered qualified to engage in such an activity. [10] The implication is that the resulting database is of high quality. The contrast is with data that may have been gathered through an automated process, or from unsupported sources of particularly low or inexpert quality that are possibly untrustworthy. [10] Some of the most common examples include CTD and UniProt.

The Comparative Toxicogenomics Database (CTD)

The Comparative Toxicogenomics Database helps researchers to understand the effects of environmental compounds on human health by integrating data from curated scientific literature to describe biochemical interactions with genes and proteins, and links between diseases and chemicals and between diseases and genes or proteins. [11] CTD contains curated data defining cross-species chemical–gene/protein interactions and chemical–disease and gene–disease associations to illuminate the molecular mechanisms underlying variable susceptibility and environmentally influenced diseases. These data deliver insights into complex chemical–gene and protein interaction networks. One of the main sources of this database is curated information from OMIM. [11]

CTD is a unique resource where bioinformatics specialists read the scientific literature and manually curate four types of core data:

  • Chemical-gene interactions
  • Chemical-disease associations
  • Gene-disease associations
  • Chemical-phenotype associations

The Universal Protein Resource (UNIPROT)

The Universal Protein Resource (UniProt) is a comprehensive, high-quality and freely accessible database of protein sequence and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature, which can point to a direct connection between gene, protein and disease. [12]

UniProt
Content
Description: UniProt is the universal protein resource, a central repository of protein data created by combining the Swiss-Prot, TrEMBL and PIR-PSD databases.
Data types captured: Protein annotation
Organisms: All
Contact
Research center: EMBL-EBI, UK; SIB, Switzerland; PIR, US
Primary citation: Ongoing and future developments at the Universal Protein Resource [13]
Access
Data format: Custom flat file, FASTA, GFF, RDF, XML
Website: www.uniprot.org; www.uniprot.org/news/
Download URL: www.uniprot.org/downloads; complete data sets at ftp.uniprot.org
Web service URL: Yes (Java API and REST)
Tools
Web: Advanced search, BLAST, ClustalO, bulk retrieval/download, ID mapping
Miscellaneous
License: Creative Commons Attribution-NoDerivs
Versioning: Yes
Data release frequency: 4 weeks
Curation policy: Yes (manual and automatic). Rules for automatic annotation are generated by database curators and computational algorithms.
Bookmarkable entities: Yes (both individual protein entries and searches)

The process of database compilation and curation
Curation may span a process running from practical experience and literature review to web publication of the database.

Predictive databases

A predictive database is one based on statistical inference. One particular approach to such inference is known as predictive inference, but prediction can be undertaken within any of the several approaches to statistical inference. Indeed, one description of biostatistics is that it provides a means of transferring knowledge about a sample of a genetic population to the whole population (genomics), and to other related genes or genomes, which is not necessarily the same as prediction over time. [15] When information is transferred across time, often to specific points in time, the process is known as forecasting. The main examples of databases that can be considered in this category include the Mouse Genome Database (MGD), the Rat Genome Database (RGD), OMIM and the SIFT tool from Ensembl. [1]

The Mouse Genome Database (MGD)

The Mouse Genome Database (MGD) is the international community resource for integrated genetic, genomic and biological data about the laboratory mouse. MGD provides full annotation of phenotypes and human disease associations for mouse models (genotypes) using terms from the Mammalian Phenotype Ontology and disease names from OMIM. [16]

The Rat Genome Database (RGD)

RGD
Content
Description: The Rat Genome Database
Organisms: Rattus norvegicus (rat)
Contact
Research center: Medical College of Wisconsin
Laboratory: Human Molecular and Genetics Center
Authors: Mary E. Shimoyama, PhD; Howard J. Jacob, PhD
Primary citation: PMID 25355511
Access
Website: rgd.mcw.edu
Download URL: RGD Data Release

The Rat Genome Database (RGD) began as a collaborative effort between leading research institutions involved in rat genetic and genomic research. The rat continues to be extensively used by researchers as a model organism for investigating the biology and pathophysiology of disease. In the past several years, there has been a rapid increase in rat genetic and genomic data. [17] This explosion of information highlighted the need for a centralized database to efficiently and effectively collect, manage, and distribute a rat-centric view of these data to researchers around the world. The Rat Genome Database was created to serve as a repository of rat genetic and genomic data, as well as mapping, strain, and physiological information. It also facilitates investigators' research efforts by providing tools to search, mine, and make predictions from these data. [17]

Data at RGD that are useful for researchers investigating disease genes include disease annotations for rat, mouse and human genes. Annotations are manually curated from the literature or downloaded via automated pipelines from other disease-related databases. Downloaded annotations are mapped to the same disease vocabulary used for manual annotations to provide consistency across the dataset. RGD also maintains disease-related quantitative phenotype data for the rat (PhenoMiner). [18]

The Online Mendelian Inheritance in Man (OMIM)

The Online Mendelian Inheritance in Man
Content
Description: OMIM is a compendium of human genes and genetic phenotypes.
Organisms: Human (H. sapiens)
Contact
Research center: NCBI
Primary citation: PMID 25398906
Access
Website: www.ncbi.nlm.nih.gov/omim

Supported by the NCBI, Online Mendelian Inheritance in Man (OMIM) is a database that catalogues all the known diseases with a genetic component and predicts their relationship to relevant genes in the human genome. It also provides references for further research and tools for genomic analysis of a catalogued gene. [19] OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The database has been used as a resource for predicting information relevant to inherited conditions. [19]

Pathway homogeneity vs. associated genes. To illustrate the concept that diseases are associated with a variety of genes, the mean pathway homogeneity values of single diseases and random controls are plotted for four networks, binned by the number of associated gene products per disease. The graph shows how difficult it is to relate a larger number of diseases to concordance across four different databases; Gene Disease Databases are used to test these relationships.

Ensembl SIFT tool

The Ensembl genome database project
Content
Description: Ensembl
Contact
Primary citation: Hubbard, et al. (2002) [20]
Access
Website: www.ensembl.org

Ensembl is one of the largest resources available for genomic and genetic studies. It provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species, other vertebrates and model disease organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic-disease information. It imports variation data from a variety of sources and predicts the effects of variants. [21] For each variation that is mapped to the reference genome, every Ensembl transcript that overlaps the variation is identified. A rule-based approach is then used to predict the effects that each allele of the variation may have on the transcript. A set of consequence terms, defined by the Sequence Ontology (SO), can be assigned to each combination of an allele and a transcript; each allele of each variation may have a different effect in different transcripts. Several tools are used to predict the effects of human mutations in the Ensembl database. One of the most widely used is SIFT, which predicts whether an amino acid substitution is likely to affect protein function based on sequence homology and the physicochemical similarity between the alternate amino acids. The data provided for each amino acid substitution are a score and a qualitative prediction (either 'tolerated' or 'deleterious'). The score is the normalized probability that the amino acid change is tolerated, so scores near 0 are more likely to be deleterious. The qualitative prediction is derived from this score: substitutions with a score < 0.05 are called 'deleterious' and all others are called 'tolerated'. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations, which helps to build relationships between phenotype characteristics, proteomics and genomics. [21]
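The mapping from a SIFT score to its qualitative call, as described above, can be sketched in a few lines. This is only an illustration of the stated threshold rule; the function and constant names are assumptions for this example and are not part of Ensembl or SIFT.

```python
# Cutoff quoted in the text: scores below 0.05 are called 'deleterious'.
SIFT_DELETERIOUS_THRESHOLD = 0.05


def sift_prediction(score: float) -> str:
    """Return the qualitative call for a SIFT score in the range [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT score must lie in [0, 1], got {score}")
    if score < SIFT_DELETERIOUS_THRESHOLD:
        return "deleterious"
    return "tolerated"


for example_score in (0.0, 0.04, 0.05, 0.8):
    print(example_score, sift_prediction(example_score))
```

Note that the comparison is strict, so a score of exactly 0.05 falls on the 'tolerated' side of the rule as it is worded in the text.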

Literature databases

Databases of this sort summarize books, articles, book reviews, dissertations and annotations concerning gene-disease associations. Examples of this type include GAD, LHGDN and BeFree Data.

Genetic Association Database (GAD)

The Genetic Association Database is an archive of human genetic association studies of complex diseases. GAD is primarily focused on archiving information on common complex human diseases rather than the rare Mendelian disorders found in OMIM. It includes curated summary data extracted from papers published in peer-reviewed journals on candidate gene studies and genome-wide association studies (GWAS). [22] The GAD was frozen as of 09/01/2014 but is still available for download. [23]

Literature-derived human gene-disease network (LHGDN)

The literature-derived human gene-disease network (LHGDN) is a text-mining-derived database focused on extracting and classifying gene-disease associations with respect to several biomolecular conditions. It uses a machine-learning-based algorithm to extract semantic gene-disease relations from a textual source of interest. It is part of Linked Life Data, from the LMU in Munich, Germany. [1]

BeFree Data

BeFree Data consists of gene-disease associations extracted from MEDLINE abstracts using the BeFree system. BeFree is composed of a biomedical Named Entity Recognition (BioNER) module to detect diseases and genes and a relation extraction module based on morphosyntactic information. [24]

Integrative databases

Databases of this sort include Mendelian, complex and environmental diseases in an integrated gene-disease association archive and show that the concept of modularity applies to all of them. They provide a functional analysis of diseases that can yield important new biological insights, which might not be discovered when considering each of the gene-disease associations independently. Hence, they present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. The best example of this sort of database is DisGeNET. [8] [25]
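The basic idea of integration can be illustrated with a small sketch that merges gene-disease pairs from several sources and records which sources support each pair. This is a simplified illustration, not the pipeline used by DisGeNET or any other database; the function name `integrate`, the source labels and the placeholder identifiers are assumptions invented for the example.

```python
from collections import defaultdict


def integrate(sources):
    """Merge gene-disease pairs reported by several sources.

    sources: dict mapping a source name to an iterable of
             (gene, disease) pairs.
    Returns a dict mapping each (gene, disease) pair to the sorted
    list of source names that report it.
    """
    merged = defaultdict(set)
    for source_name, pairs in sources.items():
        for gene, disease in pairs:
            merged[(gene, disease)].add(source_name)
    return {pair: sorted(names) for pair, names in merged.items()}


# Placeholder data, invented purely for illustration.
example_sources = {
    "curated_source": [("GENE_A", "DISEASE_X"), ("GENE_B", "DISEASE_Y")],
    "text_mined_source": [("GENE_A", "DISEASE_X"), ("GENE_C", "DISEASE_X")],
}
for pair, supporting in integrate(example_sources).items():
    print(pair, supporting)
```

Keeping the list of supporting sources for each pair, rather than a simple union, preserves provenance, so that associations backed by expert curation can be told apart from those found only by text mining.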

The Gene Disease Associations Database DisGeNET

DisGeNET
Content
Description: Integrates human gene-disease associations
Data types captured: Associations Database
Organisms: Human (H. sapiens)
Contact
Research center: Research Programme on Biomedical Informatics (GRIB) IMIM-UPF
Laboratory: Integrative Biomedical Informatics Group
Authors: Ferran Sanz and Laura I. Furlong (Piñero et al., 2015)
Primary citation: PMID 25877637
Access
Website: www.disgenet.org
Miscellaneous
Data release frequency: annual
Version: 3

DisGeNET is a comprehensive gene-disease association database that integrates associations from several sources covering different biomedical aspects of diseases. [25] In particular, it is focused on the current knowledge of human genetic diseases, including Mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, a systematic study of the emergent properties of human gene-disease networks was performed on this database by means of network topology and functional annotation analysis. [1] The results indicate a highly shared genetic origin of human diseases and show that functional modules exist for most diseases, including Mendelian, complex and environmental diseases. Moreover, a core set of biological pathways is found to be associated with most human diseases. Since similar results are obtained when studying clusters of diseases, the findings suggest that related diseases might arise from dysfunction of common biological processes in the cell. The network analysis of this integrated database points out that data integration is needed to obtain a comprehensive view of the genetic landscape of human diseases and that a genetic origin of complex diseases is much more common than expected. [1]
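One simple way to look for a shared genetic origin between diseases is to project the gene-disease associations onto a disease-disease network in which two diseases are linked when they have at least one associated gene in common. The sketch below shows that projection only; it is not the network topology or functional annotation analysis reported for DisGeNET, and the function name and placeholder identifiers are assumptions for the example.

```python
from itertools import combinations


def shared_gene_network(associations):
    """Project (gene, disease) pairs onto a disease-disease network.

    Returns a dict mapping each pair of diseases (as an alphabetically
    ordered tuple) to the set of genes associated with both. Disease
    pairs with no gene in common are omitted.
    """
    genes_by_disease = {}
    for gene, disease in associations:
        genes_by_disease.setdefault(disease, set()).add(gene)

    edges = {}
    for first, second in combinations(sorted(genes_by_disease), 2):
        shared = genes_by_disease[first] & genes_by_disease[second]
        if shared:
            edges[(first, second)] = shared
    return edges


# Placeholder associations, invented purely for illustration.
example_associations = [
    ("GENE_A", "DISEASE_X"),
    ("GENE_A", "DISEASE_Y"),
    ("GENE_B", "DISEASE_Y"),
    ("GENE_B", "DISEASE_Z"),
    ("GENE_C", "DISEASE_Z"),
]
for disease_pair, genes in shared_gene_network(example_associations).items():
    print(disease_pair, sorted(genes))
```

The pairwise loop is quadratic in the number of diseases, which is acceptable for a sketch; a larger data set would more likely be handled with a graph library or a sparse matrix product.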

DisGeNET gene-disease association ontology
The description of each association type in this ontology is:

  • Therapeutic Association: The gene/protein has a therapeutic role in the amelioration of the disease.
  • Biomarker Association: The gene/protein either plays a role in the etiology of the disease (e.g. participates in the molecular mechanism that leads to disease) or is a biomarker for a disease.
  • Genetic Variation Association: Used when a sequence variation (a mutation, a SNP) is associated with the disease phenotype, but there is still no evidence to say that the variation causes the disease. In some cases the presence of the variants increases the susceptibility to the disease. In general, the NCBI SNP identifiers are provided.
  • Altered Expression Association: Alterations in the function of the protein by means of altered expression of the gene are associated with the disease phenotype.
  • Post-translational Modification Association: Alterations in the function of the protein by means of post-translational modifications (methylation or phosphorylation of the protein) are associated with the disease phenotype.

Some use cases

Some of the most interesting cases using Gene-Disease Databases can be found in the following papers: [1] [8]

Remarks about the future in Gene Disease Databases

Relationships in Gene Diseases

The completion of the human genome has changed the way the search for disease genes is performed. In the past, the approach was to focus on one or a few genes at a time. Now, projects like DisGeNET exemplify the efforts to systematically analyze all the gene alterations involved in one or more diseases. [26] The next step is to produce a complete picture of the mechanistic aspects of the diseases and to design drugs against them. For that, a combination of two approaches will be needed: a systematic search and an in-depth study of each gene. The future of the field will be defined by new techniques to integrate large bodies of data from different sources and to incorporate functional information into the analysis of large-scale data generated by bioinformatics studies. [1]

Bioinformatics is both a term for the body of biological gene-disease studies that use computer programming as part of their methodology and a reference to specific analysis pipelines that are repeatedly used, particularly in the fields of genetics and genomics. [1] Common uses of bioinformatics include the identification of candidate genes and single-nucleotide polymorphisms (SNPs). Often, such identification is made with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties, or differences between populations. In a less formal way, bioinformatics also tries to understand the organisational principles within nucleic acid and protein sequences. [1]

The response of bioinformatics to new experimental techniques brings a new perspective to the analysis of experimental data, as demonstrated by the advances in the analysis of information from gene disease databases and other technologies. This trend is expected to continue with novel approaches that respond to new techniques, such as next-generation sequencing technologies. For instance, the availability of large numbers of individual human genomes will promote the development of computational analyses of rare variants, including the statistical mining of their relations to lifestyles, drug interactions and other factors. [1] Biomedical research will also be driven by our ability to efficiently mine the large body of existing and continuously generated biomedical data. Text-mining techniques in particular, when combined with other molecular data, can provide information about gene mutations and interactions and will become crucial for keeping pace with the exponential growth of data generated in biomedical research. Another field that is benefiting from advances in the mining and integration of molecular, clinical and drug analysis is pharmacogenomics. In silico studies of the relationships between human variations and their effect on diseases will be key to the development of personalized medicine. [8] In summary, Gene Disease Databases have already transformed the search for disease genes and have the potential to become a crucial component of other areas of medical research. [1]

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

<span class="mw-page-title-main">Biological database</span>

Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

<span class="mw-page-title-main">Single-nucleotide polymorphism</span> Single nucleotide in genomic DNA at which different sequence alternatives exist

In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently large fraction of the population, many publications do not apply such a frequency threshold.

Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship. As of 28 June 2019, approximately 9,000 of the over 25,000 entries in OMIM represented phenotypes; the rest represented genes, many of which were related to known phenotypes.

The candidate gene approach to conducting genetic association studies focuses on associations between genetic variation within pre-specified genes of interest, and phenotypes or disease states. This is in contrast to genome-wide association studies (GWAS), which is a hypothesis-free approach that scans the entire genome for associations between common genetic variants and traits of interest. Candidate genes are most often selected for study based on a priori knowledge of the gene's biological functional impact on the trait or disease in question. The rationale behind focusing on allelic variation in specific, biologically relevant regions of the genome is that certain alleles within a gene may directly impact the function of the gene in question and lead to variation in the phenotype or disease state being investigated. This approach often uses the case-control study design to try to answer the question, "Is one allele of a candidate gene more frequently seen in subjects with the disease than in subjects without the disease?" Candidate genes hypothesized to be associated with complex traits have generally not been replicated by subsequent GWASs or highly powered replication attempts. The failure of candidate gene studies to shed light on the specific genes underlying such traits has been ascribed to insufficient statistical power, low prior probability that scientists can correctly guess a specific allele within a specific gene that is related to a trait, poor methodological practices, and data dredging.

The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

In academia, computational immunology is a field of science that encompasses high-throughput genomic and bioinformatics approaches to immunology. The field's main aim is to convert immunological data into computational problems, solve these problems using mathematical and computational approaches and then convert these results into immunologically meaningful interpretations.

<span class="mw-page-title-main">BioGRID</span> Biological database

The Biological General Repository for Interaction Datasets (BioGRID) is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications created in 2003 (originally referred to as simply the General Repository for Interaction Datasets by Mike Tyers, Bobby-Joe Breitkreutz, and Chris Stark at the Lunenfeld-Tanenbaum Research Institute at Mount Sinai Hospital. It strives to provide a comprehensive curated resource for all major model organism species while attempting to remove redundancy to create a single mapping of data. Users of The BioGRID can search for their protein, chemical or publication of interest and retrieve annotation, as well as curated data as reported, by the primary literature and compiled by in house large-scale curation efforts. The BioGRID is hosted in Toronto, Ontario, Canada and Dallas, Texas, United States and is partnered with the Saccharomyces Genome Database, FlyBase, WormBase, PomBase, and the Alliance of Genome Resources. The BioGRID is funded by the NIH and CIHR. BioGRID is an observer member of the International Molecular Exchange Consortium.

The Human Protein Reference Database (HPRD) is a protein database accessible through the Internet. It is closely associated with the Institute of Bioinformatics (IOB), a non-profit research organisation in Bangalore, India. The database is a collaborative output of IOB and the Pandey Lab of Johns Hopkins University.

Mouse Genome Informatics (MGI) is a free, online database and bioinformatics resource hosted by The Jackson Laboratory, with funding by the National Human Genome Research Institute (NHGRI), the National Cancer Institute (NCI), and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). MGI provides access to data on the genetics, genomics and biology of the laboratory mouse to facilitate the study of human health and disease. The database integrates multiple projects, with the two largest contributions coming from the Mouse Genome Database and Mouse Gene Expression Database (GXD). As of 2018, MGI contains data curated from over 230,000 publications.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.
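As a rough illustration of what "over-represented" means here, the sketch below computes a one-sided hypergeometric p-value for the overlap between a gene set and a list of hits. This is the simple over-representation test, not the rank-based running-sum statistic used by the GSEA software itself, and the function names and toy gene identifiers are assumptions made for the example.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, hits_size, overlap):
    """Return P(X >= overlap) under a hypergeometric null.

    X is the number of gene-set members seen when hits_size genes are
    drawn without replacement from universe_size genes, of which
    set_size belong to the gene set.
    """
    max_overlap = min(set_size, hits_size)
    total = comb(universe_size, hits_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def enrichment(universe, gene_set, hits):
    """Return (overlap, p_value) for a gene set against a hit list."""
    universe = set(universe)
    gene_set = set(gene_set) & universe  # ignore genes outside the universe
    hits = set(hits) & universe
    overlap = len(gene_set & hits)
    p_value = hypergeom_enrichment_p(
        len(universe), len(gene_set), len(hits), overlap
    )
    return overlap, p_value


if __name__ == "__main__":
    # Toy data: 20 hypothetical genes, a 5-gene set, 5 hits, 4 shared.
    universe = [f"G{i}" for i in range(20)]
    gene_set = ["G0", "G1", "G2", "G3", "G4"]
    hits = ["G0", "G1", "G2", "G3", "G10"]
    overlap, p_value = enrichment(universe, gene_set, hits)
    print(overlap, round(p_value, 4))
```

In practice the test is repeated for many gene sets, so the resulting p-values would normally be corrected for multiple testing before any set is called enriched.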

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
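One common sequence-based annotation for a SNP inside a coding region is whether it changes the encoded amino acid. The following is a minimal sketch of that idea, assuming the standard genetic code, a forward-strand coding sequence that starts in frame, and a 0-based SNP position; the function name `classify_coding_snp` and the example sequence are illustrative, not part of any particular annotation tool.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds, position, alt_base):
    """Classify a single-base substitution in an in-frame coding sequence.

    cds      -- upper-case DNA string whose first base starts a codon
    position -- 0-based index of the substituted base within cds
    alt_base -- the alternative base at that position
    Returns (effect, reference_amino_acid, alternative_amino_acid).
    """
    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"   # substitution creates a stop codon
    elif ref_aa == "*":
        effect = "stop-lost"  # substitution removes a stop codon
    else:
        effect = "missense"
    return effect, ref_aa, alt_aa


if __name__ == "__main__":
    cds = "ATGGAGTGGTAA"  # codons: ATG GAG TGG TAA
    print(classify_coding_snp(cds, 5, "A"))  # GAG -> GAA
    print(classify_coding_snp(cds, 4, "T"))  # GAG -> GTG
    print(classify_coding_snp(cds, 8, "A"))  # TGG -> TGA
```

Real annotation tools also consider transcript structure, strand, splice sites and non-coding regions, none of which this sketch attempts to handle.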

The human interactome is the set of protein–protein interactions that occur in human cells. The sequencing of reference genomes, in particular the Human Genome Project, has revolutionized human genetics, molecular biology, and clinical medicine. Genome-wide association study results have led to the association of genes with most Mendelian disorders, and over 140,000 germline mutations have been associated with at least one genetic disease. However, it became apparent that inherent to these studies is an emphasis on clinical outcome rather than a comprehensive understanding of human disease; indeed, to date the most significant contributions of GWAS have been restricted to the “low-hanging fruit” of direct single mutation disorders, prompting a systems biology approach to genomic analysis. The connection between genotype and phenotype remains elusive, especially in the context of multigenic complex traits and cancer. To assign functional context to genotypic changes, much recent research effort has been devoted to mapping the networks formed by interactions of cellular and genetic components in humans, as well as how these networks are altered by genetic and somatic disease.

PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.

Donna R. Maglott is a staff scientist at the National Center for Biotechnology Information known for her research on large-scale genomics projects, including the mouse genome and development of databases required for genomics research.

The Monarch Initiative is a large-scale bioinformatics web resource focused on leveraging existing biomedical knowledge to connect genotypes with phenotypes in an effort to aid research that combats genetic diseases. Monarch does this by integrating multi-species genotype, phenotype, genetic variant and disease knowledge from various existing biomedical data resources into a centralized and structured database. While this integration has traditionally been done manually by basic researchers and clinicians on a case-by-case basis, the Monarch Initiative provides an aggregated and structured collection of data and tools that make biomedical knowledge exploration more efficient and effective.
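The kind of integration described above can be pictured, in a very simplified form, as merging gene-disease pairs from several sources and recording which sources support each pair. The sketch below is an assumption-laden toy, not Monarch's actual pipeline or data model: the source names `curated_db` and `text_mining` are hypothetical, and real systems map records to shared ontology identifiers rather than normalizing free-text names.

```python
from collections import defaultdict


def merge_associations(sources):
    """Merge gene-disease pairs from several sources.

    sources -- dict mapping a source name to an iterable of
               (gene_symbol, disease_name) pairs
    Returns a dict mapping (GENE, disease) to a sorted list of the
    source names that report that association.
    """
    merged = defaultdict(set)
    for source_name, pairs in sources.items():
        for gene, disease in pairs:
            # Crude normalization so trivially different spellings match.
            key = (gene.strip().upper(), disease.strip().lower())
            merged[key].add(source_name)
    return {key: sorted(names) for key, names in merged.items()}


def supported_by_at_least(merged, min_sources):
    """Return associations reported by at least min_sources sources."""
    return sorted(
        key for key, names in merged.items() if len(names) >= min_sources
    )


if __name__ == "__main__":
    # Hypothetical inputs standing in for a curated and a text-mined source.
    sources = {
        "curated_db": [
            ("CFTR", "Cystic fibrosis"),
            ("HBB", "Sickle cell anemia"),
        ],
        "text_mining": [
            ("cftr", "cystic fibrosis "),
            ("HTT", "Huntington disease"),
        ],
    }
    merged = merge_associations(sources)
    for key in supported_by_at_least(merged, 2):
        print(key, merged[key])
```

Keeping the list of supporting sources alongside each association, rather than only the merged pair, lets a user filter by evidence type or by the number of independent sources.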

References

  1. A. Bauer-Mehren, "Gene-Disease network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental diseases," PLOS ONE, pp. 1-3, 2011.
  2. Botstein, D (2003). "Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease". Nature Genetics. 33 (1): 228–237. doi:10.1038/ng1090. PMID 12610532. S2CID 10599219.
  3. Wren JD, Bateman A (2008). "Databases, data tombs and dust in the wind". Bioinformatics. 24 (19): 2127–8. doi:10.1093/bioinformatics/btn464. PMID 18819940.
  4. "American Medical Informatics Association Strategic Plan". American Medical Informatics Association. Archived from the original on 26 October 2009.
  5. Oti, M (2007). "The modular nature of genetic diseases". Clinical Genetics. 71 (1): 1–11. doi:10.1111/j.1399-0004.2006.00708.x. PMID 17204041. S2CID 24615025.
  6. Davis, A.; King, B. (2011). "The Comparative Toxicogenomics Database: update 2011". Nucleic Acids Res. 39 (1): 1067–1072. doi:10.1093/nar/gkq813. PMC 3013756. PMID 20864448.
  7. Davis, A.; Wiegers, T. (2013). "Text Mining Effectively Scores and Ranks the Literature for Improving Chemical-Gene-Disease Curation at the Comparative Toxicogenomics Database". PLOS ONE. 8 (4): 1–29. Bibcode:2013PLoSO...858201D. doi:10.1371/journal.pone.0058201. PMC 3629079. PMID 23613709.
  8. Bauer-Mehren, A.; Rautscha, M. (2010). "DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene–disease networks". Bioinformatics. 26 (22): 2924–2926. doi:10.1093/bioinformatics/btq538. PMID 20861032.
  9. Vogt, I. (2014). "Systematic analysis of gene properties influencing organ system phenotypes in mammalian perturbations". Bioinformatics. 30 (21): 3093–3100. doi:10.1093/bioinformatics/btu487. PMC 4609011. PMID 25061072.
  10. Buneman, P. (2008). "Curated Databases". Bibliometrics. 978 (1): 152–162.
  11. Murphy, C.; Davis, A. (2009). "Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks". Bioinformatics. 37 (1): 786–792. doi:10.1093/nar/gkn580. PMC 2686584. PMID 18782832.
  12. Uniprot, Consortium (2008). "The Universal Protein Resource (UniProt)". Nucleic Acids Research. 36 (1): 190–195. doi:10.1093/nar/gkm895. PMC 1669721. PMID 18045787.
  13. Uniprot, C. (2010). "Ongoing and future developments at the Universal Protein Resource". Nucleic Acids Research. 39 (Database issue): D214–D219. doi:10.1093/nar/gkq1020. PMC 3013648. PMID 21051339.
  14. K. Brown, "Online Predicted human Interaction Database," Bioinformatics, vol. 21, no. 9, pp. 2076-2082, 2005.
  15. S. Hunter and P. Jones, "InterPro in 2011: new developments in the family and domain prediction database," Nucleic Acids Research, vol. 10, no. 1, pp. 12-22, 2011.
  16. C. Bult and J. Eppig, "The Mouse genome Database (MGD): mouse biology and model systems," Nucleic Acids Research, vol. 36, no. 1, pp. 724-728, 2007.
  17. M. Dwinell, E. Worthey and S. M, "The Rat genome Database 2009: variation, ontologies and pathways," Nucleic Acids Research, vol. 37, no. 1, pp. 744-749, 2009.
  18. Shimoyama M, De Pons J, Hayman GT, et al. (2015). "The Rat Genome Database 2015: genomic, phenotypic and environmental variations and disease". Nucleic Acids Research. 43 (Database issue): D743–50. doi:10.1093/nar/gku1026. PMC 4383884. PMID 25355511.
  19. A. Hamosh, "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders," Nucleic Acids Research, vol. 33, no. 1, pp. 514-517, 2005.
  20. Hubbard T, et al. (January 2002). "The Ensembl genome database project". Nucleic Acids Research. 30 (1): 38–41. doi:10.1093/nar/30.1.38. PMC 99161. PMID 11752248.
  21. P. Flicek and M. Ridwan, "Ensembl 2012," Nucleic Acids Research, vol. 40, no. 1, pp. 84-90, 2012.
  22. Becker, K.; Barnes, K. (2004). "The genetic Association Database". Nature Genetics. 36 (5): 431–432. doi:10.1038/ng0504-431. PMID 15118671.
  23. "Archived copy". Archived from the original on 24 February 2021. Retrieved 18 November 2016.
  24. Bravo, A; et al. (2014). "Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research". BMC Bioinformatics. 16 (1): 55. doi:10.1186/s12859-015-0472-9. PMC 4466840. PMID 25886734.
  25. Piñero; et al. (2015). "DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes". Database. 2015: bav028. doi:10.1093/database/bav028. PMC 4397996. PMID 25877637.
  26. Oti, M (2006). "Predicting disease genes using protein-protein interactions". J. Med. Genet. 43 (8): 691–698. doi:10.1136/jmg.2006.041376. PMC 2564594. PMID 16611749.