GeneNetwork

Developer(s): GeneNetwork Development Team, University of Tennessee
Initial release: 15 January 1994
Stable release: 2.0 / 29 May 2016
Repository: github.com/genenetwork/genenetwork2
Written in: JavaScript, HTML, Python, CSS, CoffeeScript, PHP
License: Affero General Public License
Website: www.genenetwork.org

GeneNetwork is a combined database and open-source bioinformatics data analysis software resource for systems genetics. [1] This resource is used to study gene regulatory networks that link DNA sequence differences to corresponding differences in gene and protein expression and to variation in traits such as health and disease risk. Data sets in GeneNetwork are typically made up of large collections of genotypes (e.g., SNPs) and phenotypes from groups of individuals, including humans, strains of mice and rats, and organisms as diverse as Drosophila melanogaster, Arabidopsis thaliana, and barley. [2] The inclusion of genotypes makes it practical to carry out web-based gene mapping to discover those regions of genomes that contribute to differences among individuals in mRNA, protein, and metabolite levels, as well as differences in cell function, anatomy, physiology, and behavior.

History

Development of GeneNetwork started at the University of Tennessee Health Science Center in 1994 as a web-based version of the Portable Dictionary of the Mouse Genome (1994). [3] GeneNetwork is both the first and the longest continuously operating web service in biomedical research (see https://en.wikipedia.org/wiki/List_of_websites_founded_before_1995). In 1999 the Portable Gene Dictionary was combined with Kenneth F. Manly's Map Manager QT mapping program to produce an online system for real-time genetic analysis. [4] In early 2003, the first large Affymetrix gene expression data sets (whole mouse brain mRNA and hematopoietic stem cells) were incorporated and the system was renamed WebQTL. [5] [6] GeneNetwork is now developed by an international group of developers and has mirror and development sites in Europe, Asia, and Australia. Production services are hosted on systems at the University of Tennessee Health Science Center, with a backup instance in Europe.

The current production version of GeneNetwork (also known as GN2) was released in 2016. [7] It uses the same database as its predecessor, GN1, but has much more modular and maintainable open source code (available on GitHub). GeneNetwork now also has significant new features.

Organization and use

GeneNetwork consists of two major components:

Four levels of data are usually obtained for each family or population:

  1. DNA sequences and genotypes
  2. Molecular expression data often generated using arrays, RNA-seq, epigenomic, proteomic, metabolomic, and metagenomic methods (molecular phenotypes)
  3. Standard quantitative phenotypes that are often parts of a typical medical record (e.g., blood chemistry, body weight)
  4. Annotation files and metadata for traits and data sets

The combined data types are housed together in a relational database and an IPFS file server, and are conceptually organized and grouped by species, cohort, and family. The system is implemented as a LAMP stack. Code and a simplified version of the MariaDB database are available on GitHub.
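
As a rough illustration of grouping data by species and cohort in a relational database, the sketch below builds a tiny in-memory database and averages one trait per cohort. It uses Python's built-in sqlite3 module rather than MariaDB, and every table name, column name, and data value is hypothetical; it is not GeneNetwork's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical species/cohort layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE cohort (
            id INTEGER PRIMARY KEY,
            species_id INTEGER NOT NULL REFERENCES species(id),
            name TEXT NOT NULL
        );
        CREATE TABLE trait_value (
            cohort_id INTEGER NOT NULL REFERENCES cohort(id),
            individual TEXT NOT NULL,
            trait TEXT NOT NULL,
            value REAL NOT NULL
        );
        """
    )
    # Made-up example rows: one species, one cohort, three individuals.
    conn.execute("INSERT INTO species VALUES (1, 'mouse')")
    conn.execute("INSERT INTO cohort VALUES (1, 1, 'demo_family')")
    conn.executemany(
        "INSERT INTO trait_value VALUES (1, ?, 'body_weight', ?)",
        [("ind1", 21.5), ("ind2", 24.0), ("ind3", 26.5)],
    )
    return conn


def mean_trait_by_cohort(conn, trait):
    """Return (species, cohort, mean value) rows for the given trait name."""
    return conn.execute(
        """
        SELECT s.name, c.name, AVG(t.value)
        FROM trait_value AS t
        JOIN cohort AS c ON c.id = t.cohort_id
        JOIN species AS s ON s.id = c.species_id
        WHERE t.trait = ?
        GROUP BY s.name, c.name
        ORDER BY s.name, c.name
        """,
        (trait,),
    ).fetchall()
```

Calling `mean_trait_by_cohort(build_demo_db(), "body_weight")` returns one row per species and cohort pair, which mirrors the way a user would first pick a species and a cohort and then retrieve traits within it.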

GeneNetwork is primarily used by researchers, but has also been adopted successfully for undergraduate and graduate courses in genetics, bioinformatics (see YouTube example), physiology, and psychology. [11] Researchers and students typically retrieve sets of genotypes and phenotypes from one or more families and use built-in statistical and mapping functions to explore relations among variables and to assemble networks of associations. Key steps include the analysis of these factors:

  1. The range of variation of traits
  2. Covariation among traits (scatterplots and correlations, principal component analysis)
  3. Architecture of larger networks of traits
  4. Quantitative trait locus mapping and causal models of the linkage between sequence differences and phenotype differences
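
The covariation step in the list above can be sketched in a few lines of Python. The functions below compute a Pearson correlation and a rank-order (Spearman) correlation between two traits measured on the same individuals. This is a minimal illustration written for this article, not GeneNetwork's own code; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant trait")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Rank-order correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

The rank-order version is less sensitive to outliers and to nonlinear but monotonic relations, which is why both are commonly offered side by side for trait pairs.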

Data sources

Traits and molecular expression data sets are submitted by researchers directly or are extracted from repositories such as the National Center for Biotechnology Information Gene Expression Omnibus. Data cover a variety of cells and tissues, from single cell populations of the immune system and specific tissues (retina, prefrontal cortex) to entire systems (whole brain, lung, muscle, heart, fat, kidney, flower, whole plant embryos). A typical data set covers hundreds of fully genotyped individuals and may also include technical and biological replicates. Genotypes and phenotypes are usually taken from peer-reviewed papers. GeneNetwork includes annotation files for several RNA profiling platforms (Affymetrix, Illumina, and Agilent). RNA-seq and quantitative proteomic, metabolomic, epigenetic, and metagenomic data are also available for several species, including mouse and human.

Tools and features

The site offers tools for a wide range of functions: simple graphical displays of variation in gene expression or other phenotypes, scatter plots of pairs of traits (Pearson or rank order), construction of both simple and complex network graphs, analysis of principal components and synthetic traits, and QTL mapping using marker regression, interval mapping, and pair scans for epistatic interactions. Most functions work with up to 100 traits and several functions work with an entire transcriptome.
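
To make the idea of QTL mapping by marker regression concrete, the sketch below regresses a trait on the genotype at each marker and reports a LOD score, computed as (n/2) · log10(RSS0/RSS1), where RSS0 is the residual sum of squares of a mean-only model and RSS1 that of the single-marker regression. It is a simplified illustration under assumed conditions (numeric genotype codes such as 0 and 1, no missing data, no covariates) and is not GeneNetwork's implementation; all function and marker names are hypothetical.

```python
import math


def marker_regression_lod(genotypes, phenotype):
    """LOD score for a simple linear regression of phenotype on one marker."""
    n = len(phenotype)
    if n != len(genotypes) or n < 3:
        raise ValueError("need matching genotype and phenotype vectors (n >= 3)")
    mean_y = sum(phenotype) / n
    mean_g = sum(genotypes) / n
    # Residual sum of squares under the null model (mean only).
    rss0 = sum((y - mean_y) ** 2 for y in phenotype)
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    if rss0 == 0 or sgg == 0:
        return 0.0  # constant trait or monomorphic marker: no evidence
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    # Residual sum of squares after fitting the marker effect.
    rss1 = rss0 - sgy ** 2 / sgg
    if rss1 <= 0:
        return math.inf  # marker explains the trait perfectly
    return (n / 2) * math.log10(rss0 / rss1)


def genome_scan(markers, phenotype):
    """Return ({marker name: LOD}, name of the marker with the highest LOD)."""
    lods = {
        name: marker_regression_lod(genotypes, phenotype)
        for name, genotypes in markers.items()
    }
    best = max(lods, key=lods.get)
    return lods, best
```

For example, `genome_scan({"markerA": [0, 0, 0, 1, 1, 1], "markerB": [0, 1, 0, 1, 0, 1]}, [10.0, 11.0, 10.5, 14.0, 15.0, 14.5])` scores the first marker far higher than the second, because the trait values split cleanly by the genotype at the first marker. Interval mapping extends the same idea to positions between markers.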

The database can be browsed and searched at the main search page. An online tutorial is available. Users can also download the primary data sets as text files, as Excel files, or, in the case of network graphs, as SBML. As of 2017, GN2 is available as a beta release.

Code

GeneNetwork is an open source project released under the Affero General Public License (AGPLv3). The majority of the code is written in Python, but it includes modules and other code written in C, R, and JavaScript. The GN1 code is mainly Python 2.4. GN2 is mainly written in Python 2.7 in a Flask framework with Jinja2 HTML templates, with conversion to Python 3.X planned over the next few years. GN2 calls many statistical procedures written in the R programming language. The original source code from 2010, along with a compact database, is available on SourceForge. While GN1 was actively maintained through 2019 on GitHub, as of 2020 all work is focused on GN2.

References

  1. Morahan, G; Williams, RW (2007). "Systems Genetics: The Next Generation in Genetics Research?". Decoding the Genomic Control of Immune Reactions. Novartis Foundation Symposia. Vol. 281. pp. 181–8, discussion 188–91, 208–9. doi:10.1002/9780470062128.ch15. ISBN 9780470062128. PMID 17534074.
  2. Druka, A; Druka, I; Centeno, AG; Li, H; Sun, Z; Thomas, WT; Bonar, N; Steffenson, BJ; Ullrich, SE; Kleinhofs, Andris; Wise, Roger P; Close, Timothy J; Potokina, Elena; Luo, Zewei; Wagner, Carola; Schweizer, Günther F; Marshall, David F; Kearsey, Michael J; Williams, Robert W; Waugh, Robbie (2008). "Towards systems genetic analyses in barley: Integration of phenotypic, expression and genotype data into GeneNetwork". BMC Genetics. 9: 73. doi:10.1186/1471-2156-9-73. PMC 2630324. PMID 19017390.
  3. Williams, RW (1994). "The Portable Dictionary of the Mouse Genome: a personal database for gene mapping and molecular biology". Mammalian Genome. 5 (6): 372–5. doi:10.1007/bf00356557. PMID 8043953. S2CID 655396.
  4. Chesler, EJ; Lu, L; Wang, J; Williams, RW; Manly, KF (2004). "WebQTL: rapid exploratory analysis of gene expression and genetic networks for brain and behavior". Nature Neuroscience. 7 (5): 485–6. doi:10.1038/nn0504-485. PMID 15114364. S2CID 20241963.
  5. Chesler, EJ; Lu, L; Shou, S; Qu, Y; Gu, J; Wang, J; Hsu, HC; Mountz, JD; et al. (2005). "Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function". Nature Genetics. 37 (3): 233–42. doi:10.1038/ng1518. PMID 15711545. S2CID 13189340.
  6. Bystrykh, L; Weersing, E; Dontje, B; Sutton, S; Pletcher, MT; Wiltshire, T; Su, AI; Vellenga, E; et al. (2005). "Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'". Nature Genetics. 37 (3): 225–32. doi:10.1038/ng1497. PMID 15711547. S2CID 5622506.
  7. Sloan, Z (2016). "GeneNetwork: framework for web-based genetics". The Journal of Open Source Software. 1 (2): 25. Bibcode:2016JOSS....1...25S. doi:10.21105/joss.00025.
  8. Zhou, X (2014). "Efficient multivariate linear mixed model algorithms for genome-wide association studies". Nature Methods. 11 (2): 407–9. doi:10.1038/nmeth.2848. PMC 4211878. PMID 24531419.
  9. Arends, D (2016). "Correlation Trait Loci (CTL) mapping: phenotype network inference subject to genotype". The Journal of Open Source Software. 1 (6): 87. Bibcode:2016JOSS....1...87A. doi:10.21105/joss.00087.
  10. Ziebarth, JD (2013). "Bayesian Network Webserver: a comprehensive tool for biological network modeling". Bioinformatics. 29 (21): 2801–3. doi:10.1093/bioinformatics/btt472. PMID 23969134.
  11. Grisham, W; Schottler, NA; Valli-Marill, J; Beck, L; Beatty, J (2010). "Teaching bioinformatics and neuroinformatics by using free web-based tools". CBE: Life Sciences Education. 9 (2): 98–107. doi:10.1187/cbe.09-11-0079. PMC 2879386. PMID 20516355.