Computational and Statistical Genetics

Computational and statistical genetics is an interdisciplinary research field that draws on genomics, quantitative genetics, computational science, bioinformatics and statistics to develop and apply computationally efficient and statistically robust methods for analysing increasingly large, genome-wide data sets. Its goals include identifying complex genetic patterns, gene functions and interactions, and associations between genotypes and diseases or other phenotypes across the genomes of various organisms. [1] [2] The field is often also referred to as computational genomics, and it is an important discipline within the umbrella field of computational biology.

Haplotype phasing

Over the last two decades, there has been great interest in understanding the genetic and genomic makeup of various species, including humans, aided primarily by rapidly developing genome sequencing technologies. However, these technologies are still limited, and computational and statistical methods are needed to detect and correct errors and to assemble the partial information produced by sequencing and genotyping technologies.

A haplotype is defined as the sequence of nucleotides (A, G, T, C) along a single chromosome. Humans have 23 pairs of chromosomes; maize, another diploid, has 10 pairs. With current technology, however, it is difficult to separate the two chromosomes within a pair, so assays report the combined information from both haplotypes at each nucleotide, called the genotype. The objective of haplotype phasing is to recover the two haplotypes from this combined genotype information. Knowledge of the haplotypes is extremely important: it gives a complete picture of an individual's genome and also supports other computational genomic processes such as imputation, among many other biological applications.

In diploid organisms such as humans and maize, each individual carries two copies of each chromosome, one from each parent. The two copies are highly similar, so the haplotype phasing problem focuses on the nucleotide sites where the two homologous chromosomes differ. Computationally, for a genomic region with K differing (heterozygous) sites there are 2^K possible haplotypes and 2^(K-1) distinct haplotype pairs consistent with the observed genotype, so phasing methods aim to find the most probable haplotypes efficiently rather than by exhaustive enumeration. For more information, see Haplotype.
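The size of the search space can be made concrete with a short sketch. The Python example below assumes biallelic sites coded as alternate-allele counts (0, 1 or 2); that coding and the function name `enumerate_phasings` are illustrative choices rather than anything prescribed by a particular phasing tool. It lists every unordered haplotype pair that is compatible with a genotype, which is feasible only for very small K.

```python
from itertools import product


def enumerate_phasings(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: list of alternate-allele counts per site
              (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites carry the same allele on both haplotypes
    # (heterozygous sites are filled in below).
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]

    pairs = []
    # Fix allele 0 on the first haplotype at the first heterozygous site so
    # that each unordered pair is generated exactly once.
    for rest in product((0, 1), repeat=len(het_sites) - 1):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, (0,) + rest):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# Toy genotype with K = 3 heterozygous sites.
for h1, h2 in enumerate_phasings([1, 2, 1, 0, 1]):
    print(h1, h2)
```

With K heterozygous sites the function returns 2^(K-1) pairs, so the list doubles with every additional heterozygous site. This is why practical phasing methods score candidate haplotypes statistically instead of listing them.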

Prediction of SNP genotypes by imputation

Although the genome of a higher organism (a eukaryote) contains millions of single nucleotide polymorphisms (SNPs), genotyping arrays are designed to assay only a predetermined subset of these markers. The missing markers are predicted using imputation analysis, which has become an essential part of genetic and genomic studies. Imputation uses knowledge of linkage disequilibrium (LD) from haplotypes in a known reference panel (for example, HapMap and the 1000 Genomes Project) to predict genotypes at the missing or un-genotyped markers. This allows scientists to analyse both the directly genotyped polymorphic markers and the un-genotyped markers that are predicted computationally. Downstream studies have been shown [3] to benefit from imputation through improved power to detect disease-associated loci. Imputation also makes it possible to combine genetic and genomic studies that used different genotyping platforms. For example, although 415 million common and rare genetic variants exist in the human genome, current genotyping arrays such as Affymetrix and Illumina microarrays can assay only up to 2.5 million SNPs. Imputation analysis is therefore an important research direction, and it is important to identify methods and platforms that impute high-quality genotype data from existing genotypes and publicly available reference panels such as the International HapMap Project and the 1000 Genomes Project. In humans, imputation has successfully generated predicted genotypes in many populations, including Europeans [4] and African Americans. [5] For other species, such as plants, imputation is an area of ongoing work using reference panels, for example in maize. [6]

A number of different methods exist for genotype imputation. The three most widely used are Mach, [7] Impute [8] and Beagle. [9] All three use hidden Markov models as the basis for estimating the distribution of haplotype frequencies. Both Impute and Mach are based on different implementations of the product of approximate conditionals (PAC) model. Beagle instead groups the reference panel haplotypes into clusters at each SNP to form a localized haplotype-cluster model; because the number of clusters can vary dynamically from SNP to SNP, Beagle is computationally faster than Mach and Impute2, which are more computationally intensive.
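The hidden Markov model idea behind these tools can be illustrated with a deliberately small sketch. The Python example below is not the implementation used by Mach, Impute or Beagle. It assumes a haploid target, biallelic markers coded 0/1, a uniform chance of switching between reference haplotypes, and a fixed mismatch rate. The function name `impute_haploid` and the parameters `switch` and `error` are hypothetical, and their default values are illustrative only. The hidden state is the reference haplotype being "copied" at each marker, and a forward-backward pass gives the posterior probability of allele 1 at every marker, including untyped ones.

```python
def impute_haploid(reference, target, switch=0.05, error=0.01):
    """Posterior probability of allele 1 at each marker of a haploid target.

    reference: list of reference haplotypes (equal-length lists of 0/1).
    target:    list of 0/1 alleles, with None at untyped markers.
    switch:    assumed probability of re-drawing the copied haplotype
               between adjacent markers (illustrative value).
    error:     assumed probability that a typed allele mismatches the
               copied haplotype (illustrative value).
    """
    n_hap, n_mark = len(reference), len(target)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    def emit(h, m):
        # An untyped marker carries no information about the hidden state.
        if target[m] is None:
            return 1.0
        return 1.0 - error if reference[h][m] == target[m] else error

    # Forward pass: probability of each hidden state given markers 0..m.
    fwd = [normalise([emit(h, 0) / n_hap for h in range(n_hap)])]
    for m in range(1, n_mark):
        total = sum(fwd[-1])
        fwd.append(normalise([
            emit(h, m) * ((1 - switch) * fwd[-1][h] + switch * total / n_hap)
            for h in range(n_hap)
        ]))

    # Backward pass: likelihood of markers m+1.. given each hidden state.
    bwd = [[1.0] * n_hap for _ in range(n_mark)]
    for m in range(n_mark - 2, -1, -1):
        weighted = [emit(h, m + 1) * bwd[m + 1][h] for h in range(n_hap)]
        total = sum(weighted)
        bwd[m] = normalise([
            (1 - switch) * weighted[h] + switch * total / n_hap
            for h in range(n_hap)
        ])

    # Posterior over copied haplotypes, summed over those carrying allele 1.
    probs = []
    for m in range(n_mark):
        post = normalise([fwd[m][h] * bwd[m][h] for h in range(n_hap)])
        probs.append(sum(p for p, hap in zip(post, reference) if hap[m] == 1))
    return probs


# Toy reference panel of three haplotypes; the middle marker is untyped.
panel = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]
print(impute_haploid(panel, [1, None, 1]))
```

In the toy panel the target matches the third reference haplotype at both typed markers, so the posterior for allele 1 at the untyped middle marker is high. Real tools extend this idea to diploid genotypes, recombination maps and very large panels.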

For more information, see imputation (genetics).

Genome-wide association analysis

Over the past few years, genome-wide association studies (GWAS) have become a powerful tool for investigating the genetic basis of common diseases and have improved our understanding of the genetic basis of many complex traits. [10] Traditional single-SNP (single-nucleotide polymorphism) GWAS is the most commonly used method for finding trait-associated DNA sequence variants: associations between variants and one or more phenotypes of interest are investigated by studying individuals with different phenotypes and examining their genotypes at each SNP individually. SNPs for which one variant is statistically more common in individuals belonging to one phenotypic group are then reported as associated with the phenotype. However, most complex common diseases involve small population-level contributions from multiple genomic loci. To detect such small effects at genome-wide significance, traditional GWAS rely on increased sample size; for example, to detect an effect that accounts for 0.1% of total variance, a traditional GWAS needs to sample almost 30,000 individuals. Although the development of high-throughput SNP genotyping technologies has lowered the cost and improved the efficiency of genotyping, performing such a large-scale study still costs considerable money and time. Recently, association analysis methods based on gene-based tests have been proposed [11], motivated by the fact that variations in protein-coding and adjacent regulatory regions are more likely to have functional relevance. These methods can account for multiple independent functional variants within a gene, with the potential to greatly increase the power to identify disease- or trait-associated genes.
Also, imputation of ungenotyped markers using known reference panels (e.g., HapMap and the 1000 Genomes Project) predicts genotypes at the missing or untyped markers. This allows the evidence for association to be evaluated at genetic markers that are not directly genotyped, in addition to the typed markers, and has been shown to improve the power of GWAS to detect disease-associated loci.
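Two of the ideas above, the single-SNP test and its dependence on sample size, can be sketched in a few lines of Python. The first function is a standard 1-degree-of-freedom chi-square test on a 2x2 table of allele counts in cases and controls. The second approximates the power of a 1-degree-of-freedom test for a quantitative trait. It assumes unrelated individuals, a non-centrality parameter of N·q/(1−q) for a SNP explaining a fraction q of trait variance, and a genome-wide significance threshold of 5e-8. Neither the threshold nor the function names `allelic_chi2` and `gwas_power` come from the text above; they are illustrative assumptions.

```python
from math import erfc, sqrt
from statistics import NormalDist


def allelic_chi2(case_counts, control_counts):
    """1-df chi-square test on a 2x2 table of allele counts.

    case_counts, control_counts: (reference_allele_count, alternate_allele_count).
    Assumes every row and column total is non-zero.
    Returns (chi-square statistic, p-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def gwas_power(n, var_explained, alpha=5e-8):
    """Approximate power of a 1-df test for a SNP explaining a given
    fraction of trait variance in n unrelated individuals.

    alpha defaults to an assumed genome-wide significance threshold.
    """
    ncp = n * var_explained / (1 - var_explained)  # non-centrality parameter
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    # A non-central chi-square with 1 df is the square of a shifted normal.
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


# Toy allele counts: 60/40 in cases versus 40/60 in controls.
print(allelic_chi2((60, 40), (40, 60)))
# Power for a variant explaining 0.1% of variance at several sample sizes.
for n in (10_000, 30_000, 50_000):
    print(n, round(gwas_power(n, 0.001), 3))
```

Under these assumptions, a sample of about 30,000 gives power on the order of one half for a variant explaining 0.1% of trait variance, and power rises steeply as the sample grows. The exact figures depend on the significance threshold and the trait model chosen.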

For more information, see Genome-wide association study.

In this era of large amounts of genetic and genomic data, accurately representing and identifying statistical interactions in biological, genetic and genomic data is a vital basis for designing interventions and curative solutions for many complex diseases. Variations in the human genome have long been known to make us susceptible to many diseases. We are moving rapidly towards an era of personal genomics and personalized medicine that requires accurate prediction of the disease risk posed by predisposing genetic factors. Computational and statistical methods for identifying these genetic variations, and for building them into models for genome-wide disease association and interaction analysis, are needed across many disease areas. The principal challenges are: (1) most complex diseases involve small or weak contributions from multiple genetic factors, each of which explains only a minuscule fraction of the population variation attributed to genetic factors; and (2) biological data are inherently very noisy, so the underlying complexities of biological systems (such as linkage disequilibrium and genetic heterogeneity) need to be incorporated into the statistical models used for disease association studies. The chances of developing many common diseases, such as cancer, autoimmune diseases and cardiovascular diseases, involve complex interactions between multiple genes and several endogenous and exogenous environmental agents or covariates. Many previous disease association studies could not produce significant results because their mathematical models of disease outcome did not incorporate statistical interactions. Consequently, much of the genetic risk underlying several diseases and disorders remains unknown.
Computational methods such as [12] [13] [14] [15] [16] [17] that model and identify the genetic and genomic variations underlying disease risk have great potential to improve the prediction of disease outcomes, to clarify the interactions involved and to support the design of better therapeutic methods.
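A statistical interaction of this kind can be illustrated with a small information-theoretic sketch. The Python example below is written in the general spirit of entropy-based interaction measures and is not the algorithm of any tool cited above. The function names, the sign convention (positive values indicate that two markers are jointly more informative about the phenotype than the sum of their individual contributions) and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more equal-length
    columns of discrete values."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def interaction_gain(snp_a, snp_b, phenotype):
    """I(A,B;P) - I(A;P) - I(B;P).

    Positive when the pair of markers tells us more about the phenotype
    than the two markers considered separately.
    """
    joint = (entropy(snp_a, snp_b) + entropy(phenotype)
             - entropy(snp_a, snp_b, phenotype))
    return (joint
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


# Toy data: the phenotype depends on the two markers only in combination,
# so neither marker shows any marginal association on its own.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
print(mutual_information(snp_a, phenotype), interaction_gain(snp_a, snp_b, phenotype))
```

In this toy data set a single-marker scan finds nothing, because each marker has zero mutual information with the phenotype, while the pair carries a full bit. This is the situation in which models without interaction terms miss a real effect. Estimates from real data would need correction for finite sample size and multiple testing.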

References

  1. Peltz, Gary, ed. (2005). Computational Genetics and Genomics. Springer. doi:10.1007/978-1-59259-930-1. ISBN 978-1-58829-187-5.
  2. "Nature Reviews Genetics - Focus on Computational Genetics". Nature.com. Retrieved 2013-10-20.
  3. Hao, Ke; Chudin, Eugene; McElwee, Joshua; Schadt, Eric E (2009). "Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies". BMC Genetics. 10: 27. doi:10.1186/1471-2156-10-27. PMC 2709633. PMID 19531258.
  4. Nothnagel, M; Ellinghaus, D; Schreiber, S; Krawczak, M; Franke, A (2009). "A comprehensive evaluation of SNP genotype imputation". Human Genetics. 125 (2): 163–71. doi:10.1007/s00439-008-0606-5. PMID 19089453. S2CID 6678626.
  5. Chanda, P; Yuhki, N; Li, M; Bader, JS; Hartz, A; Boerwinkle, E; Kao, WH; Arking, DE (2012). "Comprehensive evaluation of imputation performance in African Americans". Journal of Human Genetics. 57 (7): 411–21. doi:10.1038/jhg.2012.43. PMC 3477509. PMID 22648186.
  6. Hickey, John M.; Crossa, Jose; Babu, Raman; De Los Campos, Gustavo (2012). "Factors Affecting the Accuracy of Genotype Imputation in Populations from Several Maize Breeding Programs". Crop Science. 52 (2): 654. doi:10.2135/cropsci2011.07.0358.
  7. "Mach".
  8. "Impute2".
  9. "Beagle".
  10. McCarthy, MI; Abecasis, GR; Cardon, LR; Goldstein, DB; Little, J; Ioannidis, JP; Hirschhorn, JN (2008). "Genome-wide association studies for complex traits: Consensus, uncertainty and challenges". Nature Reviews Genetics. 9 (5): 356–69. doi:10.1038/nrg2344. PMID 18398418. S2CID 15032294.
  11. Chanda, Pritam; Huang, Hailiang; Arking, Dan E.; Bader, Joel S. (2013). Veitia, Reiner Albert (ed.). "Fast Association Tests for Genes with FAST". PLOS ONE. 8 (7): e68585. Bibcode:2013PLoSO...868585C. doi:10.1371/journal.pone.0068585. PMC 3720833. PMID 23935874.
  12. Chanda, P; Zhang, A; Brazeau, D; Sucheston, L; Freudenheim, JL; Ambrosone, C; Ramanathan, M (2007). "Information-theoretic metrics for visualizing gene-environment interactions". American Journal of Human Genetics. 81 (5): 939–63. doi:10.1086/521878. PMC 2265645. PMID 17924337.
  13. Chanda, Pritam; Sucheston, Lara; Liu, Song; Zhang, Aidong; Ramanathan, Murali (2009). "Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits". BMC Genomics. 10: 509. doi:10.1186/1471-2164-10-509. PMC 2779196. PMID 19889230.
  14. Chanda, P.; Sucheston, L.; Zhang, A.; Brazeau, D.; Freudenheim, J. L.; Ambrosone, C.; Ramanathan, M. (2008). "AMBIENCE: A Novel Approach and Efficient Algorithm for Identifying Informative Genetic and Environmental Associations with Complex Phenotypes". Genetics. 180 (2): 1191–210. doi:10.1534/genetics.108.088542. PMC 2567367. PMID 18780753.
  15. "MDR".
  16. Shang, Junliang; Zhang, Junying; Sun, Yan; Zhang, Yuanke (2013). "EpiMiner: A three-stage co-information based method for detecting and visualizing epistatic interactions". Digital Signal Processing. 24: 1–13. doi:10.1016/j.dsp.2013.08.007.
  17. "BOOST".