| Ziheng Yang | |
|---|---|
| Born | 1 November 1964 (age 59) Gansu, China |
| Citizenship | United Kingdom |
| Alma mater | Beijing Agricultural University |
| Known for | Models of DNA sequence evolution and methods of statistical inference in molecular evolution and phylogenetics |
| Awards | Darwin–Wallace Medal (2023) Frink Medal (2010) Royal Society Wolfson Research Merit Award (2009) SSB Presidents' Award for Lifetime Achievement (2008) Young Investigator’s Prize, American Society of Naturalists (1995) |
| Scientific career | |
| Fields | molecular evolution molecular phylogenetics population genetics computational biology computational statistics Markov chain Monte Carlo |
| Institutions | University College London Beijing Agricultural University |
| Website | abacus |
Ziheng Yang FRS (Chinese: 杨子恒; born 1 November 1964) is a Chinese biologist. He holds the R.A. Fisher Chair of Statistical Genetics [1] at University College London, [2] and is the Director of the R.A. Fisher Centre for Computational Biology at UCL. He was elected a Fellow of the Royal Society in 2006. [2]
Yang graduated from Gansu Agricultural University with a BSc in 1984, and from Beijing Agricultural University with an MSc in 1987 and a PhD in 1992. [3]
After his PhD, he worked as a postdoctoral researcher in the Department of Zoology, University of Cambridge (1992–3), The Natural History Museum (London) (1993–4), Pennsylvania State University (1994–5), and the University of California at Berkeley (1995–7), before taking up a faculty position in the Department of Biology, University College London. He was a Lecturer (1997), Reader (2000), and then Professor (2001) in the same department. He was appointed to the R.A. Fisher Chair in Statistical Genetics at UCL in 2010.
Yang has held a number of visiting appointments. He was a Visiting Associate Professor at the Institute of Statistical Mathematics (Tokyo, 1997–8), and a visiting professor at the University of Tokyo (2007–8), the Institute of Zoology in Beijing (2010–1), Peking University (2010), the National Institute of Genetics, Mishima, Japan (2011), and the Swiss Federal Institute of Technology (ETH), Zurich (2011). From 2008 to 2011, he was the Changjiang Chair Professor at Sun Yat-sen University, with an award from the Ministry of Education of China. From 2016 to 2019, he was a visiting professor at the National Institute of Genetics, Japan. In 2017–8, he was a Radcliffe Fellow at Harvard University's Radcliffe Institute for Advanced Study. [4]
Yang developed a number of statistical models and methods in the 1990s, which have been implemented in maximum likelihood and Bayesian software programs for phylogenetic analysis of DNA and protein sequence data. Felsenstein had earlier described the pruning algorithm for calculating the likelihood on a phylogeny. [5] [6] However, the assumed model of character change was simple and, for example, did not account for variable rates among sites in the sequence. By illustrating the power of statistical models to accommodate major features of the evolutionary process and to address important evolutionary questions using molecular sequence data, the models and methods Yang developed had a major impact on the cladistic-statistical controversy at the time and played a major role in the transformation of molecular phylogenetics.
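The pruning algorithm mentioned above can be illustrated with a minimal sketch for a single alignment site. It assumes the simple Jukes–Cantor (JC69) substitution model with equal base frequencies and one rate for all sites, which is exactly the kind of simple model the later work extended. The tree encoding, the function names, and the example tree are hypothetical choices made for illustration and are not taken from any of the programs discussed here.

```python
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """JC69 probability of base x changing to base y over a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def conditional_likelihoods(node):
    """Return {x: P(tip data below node | node is in state x)}.

    Assumed encoding: a tip is a one-character string such as "A";
    an internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = conditional_likelihoods(child)  # recurse towards the tips
        for x in BASES:
            # Sum over the unknown state y at the child end of the branch.
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in BASES)
    return result

def site_likelihood(root):
    """Likelihood of one site: average the root's conditional
    likelihoods over the equal JC69 base frequencies."""
    cond = conditional_likelihoods(root)
    return sum(0.25 * cond[x] for x in BASES)

# Hypothetical three-species tree ((A:0.1, A:0.2):0.05, G:0.3).
tree = [([("A", 0.1), ("A", 0.2)], 0.05), ("G", 0.3)]
print(site_likelihood(tree))
```

Each node is visited once, so the cost grows linearly with the number of species rather than with the number of possible ancestral state combinations. The likelihood of a whole alignment under this sketch would be the product of the site likelihoods over all sites.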
Yang developed a maximum likelihood model of gamma-distributed evolutionary rate variation among sites in the sequence in 1993–4. [7] [8] The models he developed for combined analysis of heterogeneous data [9] [10] later became known as partition models and mixture models.
Together with Nick Goldman, Yang developed the codon model of nucleotide substitution in 1994. [11] This formed the basis for phylogenetic analysis of protein-coding genes to detect molecular adaptation or Darwinian evolution at the molecular level. A stream of papers followed, extending the original model to accommodate variable selection pressures (measured by the dN/dS ratio) among evolutionary lineages or among sites in the protein sequence. The branch models allow different branches on the tree to have different dN/dS ratios and can be used to test for positive selection affecting particular lineages. [12] The site models allow different selective pressures on different amino acids in the protein and can be used to test for positive selection affecting only a few amino acid sites. [13] [14] [15] The branch-site models attempt to detect positive selection that affects only a few amino acid sites along pre-specified lineages. [16] [15] A book reviews recent developments in this area. [17]
Yang developed the statistical (empirical Bayes) method for reconstructing ancestral sequences in 1995. [18] Compared with the parsimony method of ancestral sequence reconstruction (that is, the Fitch–Hartigan algorithm), [19] [20] this has the advantages of using branch-length information and of providing a probabilistic assessment of the reconstruction uncertainties.
Together with Bruce Rannala, Yang introduced Bayesian statistics into molecular phylogenetics in 1996. [21] [22] The Bayesian approach is now one of the most popular statistical methodologies used in modeling and inference in molecular phylogenetics. Recent developments in Bayesian phylogenetics are summarized in an edited book [23] and in chapter 8 of Yang's book. [24]
Yang and Rannala also developed the multispecies coalescent model, [25] which has emerged as the natural framework for comparative analysis of genomic sequence data from multiple species, incorporating the coalescent process in both modern species and extinct ancestors. The model has been used to estimate the species tree despite gene tree heterogeneity among genomic regions, [26] [27] [28] and to delimit/identify species. [29] Yang champions the Bayesian full-likelihood method of inference, using Markov chain Monte Carlo to average over gene trees (gene genealogies), accommodating phylogenetic uncertainties. [28]
Yang maintains the program package PAML (for Phylogenetic Analysis by Maximum Likelihood) [30] and the Bayesian Markov chain Monte Carlo program BPP (for Bayesian Phylogenetics and Phylogeography). [31]
Yang studied the star tree paradox, in which Bayesian model selection produces spuriously high posterior probabilities for the binary trees when the data are simulated under the star tree. [32] [33] A simpler case showing similar behaviour is the fair-coin paradox. [33] The work suggests that Bayesian model selection may produce unpleasantly polarized behaviour, supporting one model with full force while rejecting the others, when the competing models are all misspecified and equally wrong. [34]
Yang has worked extensively on Markov chain Monte Carlo algorithms, deriving many Metropolis-Hastings algorithms in Bayesian phylogenetics. [35] A study examining the efficiency of simple MCMC proposals revealed that the well-studied Gaussian random-walk move is less efficient than the simple uniform random-walk move, which is in turn less efficient than the Bactrian moves, bimodal moves that suppress values very close to the current state. [36]
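A Bactrian move can be sketched as follows. The sketch assumes one common construction, an equal mixture of two normal distributions centred at +m and -m standard deviations from the current state, rescaled so that the step has overall standard deviation sigma. The function names, the value m = 0.95, the step size, and the standard normal target are illustrative assumptions, and the sketch does not attempt to reproduce the efficiency comparison reported in the study.

```python
import math
import random

def bactrian_step(rng, sigma=1.0, m=0.95):
    """Draw a step from a two-humped normal mixture with mean 0 and
    standard deviation sigma. Larger m (0 <= m < 1) suppresses steps
    near zero more strongly; m = 0 gives an ordinary Gaussian step."""
    centre = m if rng.random() < 0.5 else -m
    return sigma * rng.gauss(centre, math.sqrt(1.0 - m * m))

def metropolis(log_target, x0, n_iter, step, seed=1):
    """Random-walk Metropolis sampler. The proposal must be symmetric,
    so the acceptance ratio is just target(y) / target(x)."""
    rng = random.Random(seed)
    x, samples, accepted = x0, [], 0
    for _ in range(n_iter):
        y = x + step(rng)
        log_ratio = log_target(y) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_iter

def log_std_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x

samples, acceptance = metropolis(
    log_std_normal, 0.0, 20000, lambda rng: bactrian_step(rng, sigma=2.0)
)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance, acceptance)
```

Because the mixture is symmetric about zero, the move needs no Hastings correction. Replacing `bactrian_step` with a plain Gaussian or uniform step in the same sampler gives a simple way to compare the proposals on a chosen target.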
Yang has taught in the Woods Hole Workshop on Molecular Evolution.
He was a co-organizer of the Royal Society Discussion Meeting on "Statistical and computational challenges in molecular phylogenetics and evolution" on 28–29 April 2008, [37] and the Royal Society Discussion Meeting on "Dating species divergence using rocks and clocks", on 9–10 November 2015. [38]
Since 2009, he has been a co-organizer of an annual workshop on Computational Molecular Evolution (CoME), which has been running in Sanger/Hinxton in odd years and in Heraklion, Crete in even years.
He also organized and taught in a number of workshops in Beijing, China.
2023–2025, President, Society for Molecular Biology and Evolution
2023, Darwin–Wallace Medal, Linnean Society of London [39]
2010, Frink Medal for British Zoologists, Zoological Society of London [40]
2009, Royal Society Wolfson Research Merit Award
2008, President's Award for Lifetime Achievement, Society for Systematic Biology [41]
2006, Fellow of the Royal Society, The Royal Society of London
1995, Young Investigator’s Prize, American Society of Naturalists
In biology, phylogenetics is the study of the evolutionary history and relationships among or within groups of organisms. These relationships are determined by phylogenetic inference, methods that focus on observed heritable traits, such as DNA sequences, protein amino acid sequences, or morphology. The result of such an analysis is a phylogenetic tree—a diagram containing a hypothesis of relationships that reflects the evolutionary history of a group of organisms.
Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to determine the processes by which diversity among species has been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree. Molecular phylogenetics is one aspect of molecular systematics, a broader term that also includes the use of molecular data in taxonomy and biogeography.
The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleotide sequences for DNA, RNA, or amino acid sequences for proteins.
In biology, substitution models, also called models of sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules, such as DNA sequences or protein sequences, that can be represented as sequences of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.
Coalescent theory is a model of how alleles sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy according to a random process in coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time. Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.
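The lengthening of waiting times back in time can be made concrete with a short sketch. It assumes the standard single-population coalescent, in which the time during which k lineages remain is exponentially distributed with rate k(k - 1)/2, with time measured in coalescent units (2N generations for a diploid population of constant size N). The function names and the sample size of 10 are illustrative assumptions.

```python
import random

def expected_waiting_times(n):
    """Expected time during which k lineages remain, for k = n down
    to 2, in coalescent time units."""
    return {k: 2.0 / (k * (k - 1)) for k in range(n, 1, -1)}

def simulate_tmrca(n, rng):
    """Simulate the time to the most recent common ancestor of n
    sampled lineages by summing exponential waiting times."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that can merge
        t += rng.expovariate(rate)
    return t

rng = random.Random(42)
n = 10
times = expected_waiting_times(n)
for k in sorted(times, reverse=True):
    print(k, times[k])

replicates = [simulate_tmrca(n, rng) for _ in range(20000)]
print(sum(replicates) / len(replicates), sum(times.values()))
```

Under these assumptions the expected waiting times sum to 2(1 - 1/n), so the final interval, when only two lineages remain, accounts for more than half of the expected tree height.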
Phylogenomics is the intersection of the fields of evolution and genomics. The term has been used in multiple ways to refer to analysis that involves genome data and evolutionary reconstructions. It is a group of techniques within the larger fields of phylogenetics and genomics. Phylogenomics draws information by comparing entire genomes, or at least large portions of genomes. Phylogenetics compares and analyzes the sequences of single genes, or a small number of genes, as well as many other types of data. Four major areas fall under phylogenomics:
Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms used to search for the optimal or best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.
In genetics, the Ka/Ks ratio, also known as ω or dN/dS ratio, is used to estimate the balance between neutral mutations, purifying selection and beneficial mutations acting on a set of homologous protein-coding genes. It is calculated as the ratio of the number of nonsynonymous substitutions per non-synonymous site (Ka), in a given period of time, to the number of synonymous substitutions per synonymous site (Ks), in the same period. The latter are assumed to be neutral, so that the ratio indicates the net balance between deleterious and beneficial mutations. Values of Ka/Ks significantly above 1 are unlikely to occur without at least some of the mutations being advantageous. If beneficial mutations are assumed to make little contribution, then Ka/Ks estimates the degree of evolutionary constraint.
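The ratio can be illustrated with a minimal counting-style sketch. It assumes that the numbers of synonymous and nonsynonymous sites and differences between two sequences have already been counted, and it applies the Jukes–Cantor correction for multiple substitutions at the same site, in the spirit of simple counting methods rather than the codon-model likelihood approach. The function names and the example counts are hypothetical.

```python
import math

def jukes_cantor(p):
    """Convert a proportion of differing sites p into an estimated
    number of substitutions per site, correcting for multiple hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from pre-counted sites and differences.

    Raises ZeroDivisionError when no synonymous differences are
    observed, because the ratio is then undefined.
    """
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    return ka, ks, ka / ks

# Hypothetical counts for a pair of aligned protein-coding sequences.
ka, ks, omega = ka_ks(nonsyn_diffs=10, nonsyn_sites=700,
                      syn_diffs=20, syn_sites=300)
print(ka, ks, omega)
```

With these made-up counts the ratio is below 1, the pattern expected under purifying selection. A ratio near 1 would be consistent with neutral evolution, and a ratio above 1 with positive selection.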
Masatoshi Nei was a Japanese-born American evolutionary biologist.
Ancestral reconstruction is the extrapolation back in time from measured characteristics of individuals, populations, or species to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships among individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago. These states include the genetic sequence, the amino acid sequence of a protein, the composition of a genome, a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species. This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences. In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.
Human evolutionary genetics studies how one human genome differs from another human genome, the evolutionary past that gave rise to the human genome, and its current effects. Differences between genomes have anthropological, medical, historical and forensic implications and applications. Genetic data can provide important insights into human evolution.
Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood model. Bayesian inference was introduced into molecular phylogenetics in the 1990s by three independent groups: Bruce Rannala and Ziheng Yang in Berkeley, Bob Mau in Madison, and Shuying Li at the University of Iowa, the last two being PhD students at the time. The approach has become very popular since the release of the MrBayes software in 2001, and is now one of the most popular methods in molecular phylogenetics.
T-REX is a freely available web server, developed at the department of Computer Science of the Université du Québec à Montréal, dedicated to the inference, validation and visualization of phylogenetic trees and phylogenetic networks. The T-REX web server allows the users to perform several popular methods of phylogenetic analysis as well as some new phylogenetic applications for inferring, drawing and validating phylogenetic trees and networks.
Cross-species transmission (CST), also called interspecies transmission, host jump, or spillover, is the transmission of an infectious pathogen, such as a virus, between hosts belonging to different species. Once introduced into an individual of a new host species, the pathogen may cause disease for the new host and/or acquire the ability to infect other individuals of the same species, allowing it to spread through the new host population. The phenomenon is most commonly studied in virology, but cross-species transmission may also occur with bacterial pathogens or other types of microorganisms.
The multispecies coalescent process is a stochastic process model that describes the genealogical relationships for a sample of DNA sequences taken from several species. It represents the application of coalescent theory to the case of multiple species. The multispecies coalescent results in cases where the relationships among species for an individual gene can differ from the broader history of the species. It has important implications for the theory and practice of phylogenetics and for understanding genome evolution.
Minimum evolution is a distance method employed in phylogenetics modeling. It shares with maximum parsimony the aspect of searching for the phylogeny that has the shortest total sum of branch lengths.
In phylogenetics, reconciliation is an approach to connect the history of two or more coevolving biological entities. The general idea of reconciliation is that a phylogenetic tree representing the evolution of an entity can be drawn within another phylogenetic tree representing an encompassing entity to reveal their interdependence and the evolutionary events that have marked their shared history. The development of reconciliation approaches started in the 1980s, mainly to depict the coevolution of a gene and a genome, and of a host and a symbiont, which can be mutualist, commensalist or parasitic. It has also been used for example to detect horizontal gene transfer, or understand the dynamics of genome evolution.
In the field of epidemiology, source attribution refers to a category of methods with the objective of reconstructing the transmission of an infectious disease from a specific source, such as a population, individual, or location. For example, source attribution methods may be used to trace the origin of a new pathogen that recently crossed from another host species into humans, or from one geographic region to another. It may be used to determine the common source of an outbreak of a foodborne infectious disease, such as a contaminated water supply. Finally, source attribution may be used to estimate the probability that an infection was transmitted from one specific individual to another, i.e., "who infected whom".