UniFrac is a distance metric used for comparing biological communities. Unlike dissimilarity measures such as Bray-Curtis dissimilarity, it accounts for the relatedness of community members by incorporating the phylogenetic distances between observed organisms into the computation.
Both weighted (quantitative) and unweighted (qualitative) variants of UniFrac [1] are widely used in microbial ecology, where the former accounts for abundance of observed organisms, while the latter only considers their presence or absence. The method was devised by Catherine Lozupone, when she was working with Rob Knight [2] of the University of Colorado at Boulder in 2005. [3] [4]
The distance is calculated between pairs of samples (each sample represents an organismal community). All taxa found in one or both samples are placed on a phylogenetic tree. A branch leading to taxa from both samples is marked as "shared", and a branch leading to taxa that appear in only one sample is marked as "unshared". The distance between the two samples is then calculated as the sum of the unshared branch lengths divided by the sum of all branch lengths in the tree (shared plus unshared).
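The calculation described here, unshared branch length divided by total branch length, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the tree encoding (one `(length, leaves)` pair per branch), the function name `unweighted_unifrac`, and the toy tree are all assumptions made for the example.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Unshared branch length divided by total branch length for two samples.

    branches: iterable of (length, leaves) pairs, where leaves is the set of
    tip taxa descending from that branch (a hypothetical tree encoding).
    sample_a, sample_b: sets of taxa observed in each sample.
    """
    unshared = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if not (in_a or in_b):
            continue  # branch leads to neither sample, so it is not on the pair's tree
        total += length
        if in_a != in_b:  # branch leads to taxa from exactly one sample
            unshared += length
    return unshared / total if total > 0 else 0.0


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), written as one entry per branch.
branches = [
    (1.0, frozenset({"t1"})),
    (1.0, frozenset({"t2"})),
    (1.0, frozenset({"t1", "t2"})),
    (1.0, frozenset({"t3"})),
    (1.0, frozenset({"t4"})),
    (1.0, frozenset({"t3", "t4"})),
]
sample_a = {"t1", "t2"}
sample_b = {"t2", "t3"}
distance = unweighted_unifrac(branches, sample_a, sample_b)
print(distance)
```

In this toy case, five branches lead to taxa in at least one sample and three of them lead to taxa from only one sample, so the distance is three fifths.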
This definition satisfies the requirements of a distance metric: it is non-negative, zero only when the two entities are identical, symmetric, and conforms to the triangle inequality.
If there are several different samples, a distance matrix can be created by making a tree for each pair of samples and calculating their UniFrac measure. Subsequently, standard multivariate statistical methods such as data clustering and principal co-ordinates analysis can be used.
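As a rough sketch of how such a distance matrix might be assembled, the following Python computes the unweighted distance for every pair of samples. The tree encoding, the function names, and the sample names `S1` to `S3` are assumptions made for illustration. Skipping branches that lead to neither sample plays the role of building a tree for each pair.

```python
from itertools import combinations


def unweighted_unifrac(branches, sample_a, sample_b):
    """Unshared branch length over total branch length (hypothetical encoding)."""
    unshared = total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a or in_b:  # only branches on this pair's tree are counted
            total += length
            if in_a != in_b:
                unshared += length
    return unshared / total if total > 0 else 0.0


def distance_matrix(branches, samples):
    """Return sorted sample names and a symmetric nested dict of distances."""
    names = sorted(samples)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = unweighted_unifrac(branches, samples[a], samples[b])
        matrix[a][b] = d
        matrix[b][a] = d  # the distance is symmetric
    return names, matrix


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), one entry per branch.
branches = [
    (1.0, frozenset({"t1"})),
    (1.0, frozenset({"t2"})),
    (1.0, frozenset({"t1", "t2"})),
    (1.0, frozenset({"t3"})),
    (1.0, frozenset({"t4"})),
    (1.0, frozenset({"t3", "t4"})),
]
samples = {
    "S1": {"t1", "t2"},
    "S2": {"t2", "t3"},
    "S3": {"t1", "t2"},
}
names, matrix = distance_matrix(branches, samples)
for name in names:
    print(name, [round(matrix[name][other], 3) for other in names])
```

The resulting matrix can then be passed to whatever clustering or ordination routine is in use.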
One can assess the statistical significance of the UniFrac distance between two samples using Monte Carlo simulations. Randomizing the sample classification of each taxon on the tree (leaving the branch structure unchanged) and recomputing the distance many times yields a null distribution of UniFrac values. From this distribution, a p-value can be assigned to the observed distance between the samples.
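The randomization procedure can be sketched as a simple permutation test. This is a minimal illustration under stated assumptions: the tree encoding, the function names, the default of 999 permutations, and the "add one" p-value estimate are choices made for the example, not details taken from the original method description.

```python
import random


def unweighted_unifrac(branches, sample_a, sample_b):
    """Unshared branch length over total branch length (hypothetical encoding)."""
    unshared = total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unshared += length
    return unshared / total if total > 0 else 0.0


def permutation_p_value(branches, sample_a, sample_b, n_permutations=999, seed=0):
    """Estimate how often shuffled sample labels give a distance >= the observed one."""
    rng = random.Random(seed)
    taxa = sorted(set().union(*(leaves for _, leaves in branches)))
    # Each taxon carries a pair of flags: (present in A, present in B).
    membership = [(t in sample_a, t in sample_b) for t in taxa]
    observed = unweighted_unifrac(branches, sample_a, sample_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = membership[:]
        rng.shuffle(shuffled)  # reassign flags to taxa; the tree itself is unchanged
        perm_a = {t for t, (a, _) in zip(taxa, shuffled) if a}
        perm_b = {t for t, (_, b) in zip(taxa, shuffled) if b}
        if unweighted_unifrac(branches, perm_a, perm_b) >= observed - 1e-12:
            at_least_as_large += 1
    # Counting the observed value itself keeps the estimate above zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), one entry per branch.
branches = [
    (1.0, frozenset({"t1"})),
    (1.0, frozenset({"t2"})),
    (1.0, frozenset({"t1", "t2"})),
    (1.0, frozenset({"t3"})),
    (1.0, frozenset({"t4"})),
    (1.0, frozenset({"t3", "t4"})),
]
p_value = permutation_p_value(branches, {"t1", "t2"}, {"t3", "t4"})
print(p_value)
```

A tree with only four tips admits very few distinct labelings, so the toy p-value carries little meaning; the sketch is meant only to show the mechanics.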
Additionally, there is a weighted version of the UniFrac metric which accounts for the relative abundance of each of the taxa within the communities. This is commonly used in metagenomic studies, where the number of reads can be in the tens of thousands, and it is appropriate to 'bin' these reads into operational taxonomic units, or OTUs, which can then be dealt with as taxa within the UniFrac framework.
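One common way to write the weighted variant is to weight each branch length by the absolute difference between the fractions of each sample's reads that descend from that branch. The Python sketch below assumes that formulation. The tree encoding, the function name, the `normalized` flag, and the read counts are assumptions made for illustration, and the normalization shown presumes a rooted tree in which every branch from the root to the tips is listed.

```python
def weighted_unifrac(branches, counts_a, counts_b, normalized=False):
    """Branch lengths weighted by differences in relative abundance.

    branches: iterable of (length, leaves) pairs (hypothetical tree encoding).
    counts_a, counts_b: dicts mapping each taxon (e.g. an OTU) to its read count.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one read")
    raw = 0.0
    scale = 0.0
    for length, leaves in branches:
        # Fraction of each sample's reads that descend from this branch.
        p_a = sum(counts_a.get(t, 0) for t in leaves) / total_a
        p_b = sum(counts_b.get(t, 0) for t in leaves) / total_b
        raw += length * abs(p_a - p_b)
        scale += length * (p_a + p_b)
    if not normalized:
        return raw
    # Assumed normalization: divide by the abundance-weighted branch length,
    # which bounds the result between 0 and 1.
    return raw / scale if scale > 0 else 0.0


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), one entry per branch.
branches = [
    (1.0, frozenset({"t1"})),
    (1.0, frozenset({"t2"})),
    (1.0, frozenset({"t1", "t2"})),
    (1.0, frozenset({"t3"})),
    (1.0, frozenset({"t4"})),
    (1.0, frozenset({"t3", "t4"})),
]
counts_a = {"t1": 3, "t2": 1}
counts_b = {"t2": 1, "t3": 3}
raw_distance = weighted_unifrac(branches, counts_a, counts_b)
normalized_distance = weighted_unifrac(branches, counts_a, counts_b, normalized=True)
print(raw_distance, normalized_distance)
```

Unlike the unweighted form, a taxon present in both samples still contributes here whenever its relative abundance differs between them.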
In 2012, a generalized UniFrac version, [5] which unifies the weighted and unweighted UniFrac distance in a single framework, was proposed. The authors argued that the weighted and unweighted UniFrac distances place too much emphasis on either abundant lineages or rare lineages, respectively, leading to “loss of power when the important composition change occurs in moderately abundant lineages”. The generalized UniFrac distance aims to address this limitation by down-weighting the emphasis on abundant or rare lineages.
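A hedged sketch of the generalized distance follows, assuming the commonly cited form in which each branch is weighted by the summed branch proportions raised to a power `alpha` between 0 and 1. The tree encoding, the function name, and the default `alpha=0.5` are assumptions made for this example and should be checked against the cited paper before use.

```python
def generalized_unifrac(branches, counts_a, counts_b, alpha=0.5):
    """Generalized UniFrac sketch with tuning parameter alpha in [0, 1].

    branches: iterable of (length, leaves) pairs (hypothetical tree encoding).
    counts_a, counts_b: dicts mapping each taxon to its read count.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one read")
    numerator = 0.0
    denominator = 0.0
    for length, leaves in branches:
        p_a = sum(counts_a.get(t, 0) for t in leaves) / total_a
        p_b = sum(counts_b.get(t, 0) for t in leaves) / total_b
        s = p_a + p_b
        if s == 0:
            continue  # branch leads to neither sample
        weight = length * s ** alpha
        numerator += weight * abs(p_a - p_b) / s
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), one entry per branch.
branches = [
    (1.0, frozenset({"t1"})),
    (1.0, frozenset({"t2"})),
    (1.0, frozenset({"t1", "t2"})),
    (1.0, frozenset({"t3"})),
    (1.0, frozenset({"t4"})),
    (1.0, frozenset({"t3", "t4"})),
]
counts_a = {"t1": 3, "t2": 1}
counts_b = {"t2": 1, "t3": 3}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, generalized_unifrac(branches, counts_a, counts_b, alpha))
```

Under this form, `alpha=1` reduces to a normalized weighted distance, while smaller values of `alpha` reduce the influence of highly abundant lineages.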
A phylogenetic tree is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. All life on Earth is part of a single phylogenetic tree, indicating common ancestry.
A weight function is a mathematical device used when performing a sum, integral, or average to give some elements more "weight" or influence on the result than other elements in the same set. The result of this application of a weight function is a weighted sum or weighted average. Weight functions occur frequently in statistics and analysis, and are closely related to the concept of a measure. Weight functions can be employed in both discrete and continuous settings. They can be used to construct systems of calculus called "weighted calculus" and "meta-calculus".
In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.
UPGMA is a simple agglomerative (bottom-up) hierarchical clustering method. The method is generally attributed to Sokal and Michener.
Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.
A dendrogram is a diagram representing a tree. This diagrammatic representation is frequently used in different contexts:
Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species and the relationships between specific genes shared by many types of organisms.
Distance matrices are used in phylogeny as non-parametric distance methods and were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree. The distance matrix can come from a number of different sources, including measured distance or morphometric analysis, various pairwise distance formulae applied to discrete morphological characters, or genetic distance from sequence, restriction fragment, or allozyme data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states.
16S ribosomal RNA is the RNA component of the 30S subunit of a prokaryotic ribosome. It binds to the Shine-Dalgarno sequence and provides most of the SSU structure.
Microbiota are the range of microorganisms that may be commensal, symbiotic, or pathogenic found in and on all multicellular organisms, including plants. Microbiota include bacteria, archaea, protists, fungi, and viruses, and have been found to be crucial for immunologic, hormonal, and metabolic homeostasis of their host.
The class Zetaproteobacteria is the sixth and most recently described class of the Pseudomonadota. Zetaproteobacteria can also refer to the group of organisms assigned to this class. The Zetaproteobacteria were originally represented by a single described species, Mariprofundus ferrooxydans, which is an iron-oxidizing neutrophilic chemolithoautotroph originally isolated from Kamaʻehuakanaloa Seamount in 1996 (post-eruption). Molecular cloning techniques focusing on the small subunit ribosomal RNA gene have also been used to identify a more diverse majority of the Zetaproteobacteria that have as yet been unculturable.
Nanohaloarchaea is a clade of diminutive archaea with small genomes and limited metabolic capabilities, belonging to the DPANN archaea. They are ubiquitous in hypersaline habitats, which they share with the extremely halophilic haloarchaea.
The Earth Microbiome Project (EMP) is an initiative founded by Janet Jansson, Jack Gilbert and Rob Knight in 2010 to collect natural samples and to analyze the microbial community around the globe.
Community fingerprinting is a set of molecular biology techniques that can be used to quickly profile the diversity of a microbial community. Rather than directly identifying or counting individual cells in an environmental sample, these techniques show how many variants of a gene are present. In general, it is assumed that each different gene variant represents a different type of microbe. Community fingerprinting is used by microbiologists studying a variety of microbial systems to measure biodiversity or track changes in community structure over time. The method analyzes environmental samples by assaying genomic DNA. This approach offers an alternative to microbial culturing, which is important because most microbes cannot be cultured in the laboratory. Community fingerprinting does not result in identification of individual microbe species; instead, it presents an overall picture of a microbial community. These methods are now largely being replaced by high throughput sequencing, such as targeted microbiome analysis and metagenomics.
Biological dark matter is an informal term for unclassified or poorly understood genetic material. The term may refer to genetic material produced by unclassified microorganisms. By extension, it may also refer to the un-isolated microorganisms whose existence can only be inferred from the genetic material that they produce. Some of the genetic material may not fall under the three existing domains of life: Bacteria, Archaea and Eukaryota; thus, it has been suggested that a possible fourth domain of life may yet be discovered, although other explanations are also probable. Alternatively, the term may refer to non-coding DNA and non-coding RNA produced by known organisms.
In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genomes. Binning methods can be based on compositional features, alignment (similarity), or both.
Viral phylodynamics is defined as the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies. Since the coining of the term in 2004, research on viral phylodynamics has focused on transmission dynamics in an effort to shed light on how these dynamics impact viral genetic variation. Transmission dynamics can be considered at the level of cells within an infected host, individual hosts within a population, or entire populations of hosts.
A microbiome is the community of microorganisms that can usually be found living together in any given habitat. It was defined more precisely in 1988 by Whipps et al. as "a characteristic microbial community occupying a reasonably well-defined habitat which has distinct physio-chemical properties. The term thus not only refers to the microorganisms involved but also encompasses their theatre of activity". In 2020, an international panel of experts published the outcome of their discussions on the definition of the microbiome. They proposed a definition of the microbiome based on a revival of the "compact, clear, and comprehensive description of the term" as originally provided by Whipps et al., but supplemented with two explanatory paragraphs. The first explanatory paragraph pronounces the dynamic character of the microbiome, and the second explanatory paragraph clearly separates the term microbiota from the term microbiome.
PICRUSt is a bioinformatics software package. The name is an abbreviation for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States.
Catherine Anne Lozupone is an American microbiologist who specializes in bacteria and how they impact human health. Her noted work in trying to determine what constitutes "normal" gut bacteria led to her creation of the UniFrac algorithm, used by researchers to plot the relationships between microbial communities in the human body.