PICRUSt

Original author(s) Morgan Langille, Jesse Zaneveld, Dan Knights, Joshua A Reyes, Jose C Clemente, Deron E Burkepile, Rebecca L Vega Thurber, Rob Knight, Robert G Beiko, Curtis Huttenhower
Developer(s) Morgan Langille, Jesse Zaneveld, Daniel McDonald, Greg Caporaso, Gavin Douglas
Initial release 29 July 2013
Written in Python, R
Website picrust.github.io/picrust/

PICRUSt [1] is a bioinformatics software package. The name is an abbreviation for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States.

PICRUSt is used in metagenomic analysis, where it infers the functional profile of a microbial community from a marker gene survey of one or more samples. In essence, PICRUSt takes a user-supplied operational taxonomic unit table (typically referred to as an OTU table) representing clusters of marker gene sequences (most commonly 16S rRNA gene sequences), each accompanied by its relative abundance in each of the samples. The output of PICRUSt is a sample-by-functional-gene count matrix, giving the predicted count of each functional gene in each of the samples surveyed. The ability of PICRUSt to estimate the functional gene profile for a given sample relies on a set of known sequenced genomes. The approach can be thought of as an automated alternative to manually researching the gene families likely to be present in organisms whose sequences are found in a 16S ribosomal RNA amplicon library. The description below corresponds to the original version of PICRUSt; a major update to the tool, PICRUSt2, has since been described. [2]

Genome prediction algorithm

In an initial preprocessing phase, PICRUSt constructs point predictions and confidence intervals for the number of copies of each gene family in each bacterial and archaeal strain in a reference tree, using organisms with sequenced genomes as a reference. More specifically, for each gene family, PICRUSt maps known gene copy numbers (from completely sequenced genomes) onto a reference tree of life. These gene family copy numbers are treated as continuous traits, and an evolutionary model is constructed under the assumption of Brownian motion. The model can be fitted with maximum likelihood, relaxed maximum likelihood, or Wagner parsimony. It is then used to predict both a point estimate and a confidence interval for the copy number in microorganisms without sequenced genomes. This 'genome prediction' step produces a large table of bacterial types (specifically operational taxonomic units, or OTUs) versus gene family copy numbers. This table is distributed to end users. This prediction method is not the same as a nearest neighbor approach (i.e. simply looking up the nearest sequenced genome), and it was shown to give a small but significant improvement in accuracy over that strategy. Nearest neighbor prediction is nevertheless available as an option in PICRUSt.

Notably, while this functionality is typically used for prediction of gene copy numbers in bacteria, it could, in principle, be used for prediction of any other continuous trait given trait data for diverse organisms and a reference phylogeny.
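As a rough illustration of how a continuous trait can be predicted under Brownian motion, the sketch below handles the smallest possible case: two sequenced tips that share a parent node, plus one unsequenced tip attached to that same node. It is not PICRUSt's implementation. The function name `predict_copy_number`, its parameters, the example values, and the assumption of a known rate parameter `sigma2` are all hypothetical and chosen only for this example.

```python
import math


def predict_copy_number(x1, v1, x2, v2, t, sigma2=1.0):
    """Predict a trait for an unsequenced tip under Brownian motion.

    x1, x2 : observed trait values (e.g. gene copy numbers) at two sequenced tips
    v1, v2 : branch lengths from the shared parent node to each sequenced tip
    t      : branch length from the parent node to the unsequenced tip
    sigma2 : Brownian motion rate (assumed known here, purely for illustration)
    """
    # Under Brownian motion, the maximum-likelihood value at the parent node
    # is the average of the tip values weighted by inverse branch length.
    w1, w2 = 1.0 / v1, 1.0 / v2
    ancestor = (w1 * x1 + w2 * x2) / (w1 + w2)

    # Uncertainty combines the error of the ancestral estimate with the
    # extra evolution along the branch leading to the unsequenced tip.
    variance = sigma2 * (1.0 / (w1 + w2) + t)
    half_width = 1.96 * math.sqrt(variance)  # approximate 95% interval
    return ancestor, (ancestor - half_width, ancestor + half_width)


# Hypothetical example: close relative with 2 copies, distant relative with 4.
estimate, (low, high) = predict_copy_number(x1=2.0, v1=0.1, x2=4.0, v2=0.3, t=0.05)
print(estimate, low, high)
```

The closer relative dominates the point estimate, and the interval widens as the unsequenced tip sits further from the sequenced genomes. Because the function only sees numeric trait values, the same sketch applies to any continuous trait.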

Langille et al. [1] tested the accuracy of this genome prediction step using leave-one-out cross validation on the input set of sequenced genomes. Additional tests examined sensitivity to errors in phylogenetic inference, lack of genomic data, and the accuracy of the confidence intervals on gene content.

A similar step predicts the copy number of 16S rRNA genes.

Metagenome prediction algorithm

When applying PICRUSt to a 16S rRNA gene library, PICRUSt matches reference operational taxonomic units against the tables, and retrieves a predicted 16S rRNA copy number and gene copy number for each gene family. The abundance of each OTU is divided by its predicted copy number (if a bacterium has multiple 16S copies, its apparent abundance in 16S rRNA data will be inflated), and then multiplied by the copy number of the gene family. This gives a prediction for the contribution of each OTU to the overall gene content of the sample (the metagenome). Finally, these individual contributions are summed together to produce an estimate of the genes present in the metagenome.
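The arithmetic described above (divide each OTU's abundance by its predicted 16S rRNA copy number, multiply by each gene family's predicted copy number, then sum over OTUs) can be sketched as follows. This is a minimal illustration using plain dictionaries, not PICRUSt's own code or file formats; the function name `predict_metagenome` and all sample, OTU, and gene identifiers are hypothetical.

```python
from collections import defaultdict


def predict_metagenome(otu_table, copies_16s, gene_copies):
    """Estimate gene family counts per sample from OTU abundances.

    otu_table   : {sample: {otu: abundance}}
    copies_16s  : {otu: predicted 16S rRNA gene copy number}
    gene_copies : {otu: {gene_family: predicted copy number}}
    """
    metagenome = {}
    for sample, abundances in otu_table.items():
        totals = defaultdict(float)
        for otu, abundance in abundances.items():
            # Correct for organisms that carry several 16S copies.
            normalized = abundance / copies_16s[otu]
            # Add this OTU's contribution to every gene family it carries.
            for gene, copies in gene_copies[otu].items():
                totals[gene] += normalized * copies
        metagenome[sample] = dict(totals)
    return metagenome


# Hypothetical input values, for illustration only.
otu_table = {"sample_A": {"otu1": 10, "otu2": 6}}
copies_16s = {"otu1": 2, "otu2": 3}
gene_copies = {
    "otu1": {"gene_X": 1, "gene_Y": 0},
    "otu2": {"gene_X": 2, "gene_Y": 1},
}

result = predict_metagenome(otu_table, copies_16s, gene_copies)
print(result)
```

In this example otu1 contributes 10 / 2 = 5 normalized organisms and otu2 contributes 6 / 3 = 2, so gene_X is predicted at 5 × 1 + 2 × 2 = 9 and gene_Y at 5 × 0 + 2 × 1 = 2.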

Langille et al., 2013 [1] tested the accuracy of this metagenome prediction step using previously reported datasets in which the same biological sample was subjected to both 16S rRNA gene amplification and shotgun metagenomics. In these cases, the shotgun metagenomic results were taken as a representation of the 'true' community, and the 16S rRNA gene amplicon libraries were fed into PICRUSt to attempt to predict those data. Test datasets included human microbiome samples from the Human Microbiome Project, soil samples, diverse mammalian samples, and samples from the Guerrero Negro microbial mats.

The Nearest Sequenced Taxon Index

Because PICRUSt, and evolutionary comparative genomics in general, depends on sequenced genomes, biological samples from well-studied environments (with many sequenced genomes) will be better predicted than samples from poorly studied environments. To assess how well a sample is covered by sequenced genomes, PICRUSt optionally allows users to calculate a Nearest Sequenced Taxon Index (NSTI) for their samples. This index reflects the average phylogenetic distance between each 16S rRNA gene sequence in the sample and a 16S rRNA gene sequence from a fully sequenced genome. In general, the lower the NSTI score, the more accurate PICRUSt's predictions are expected to be. For example, Langille et al. [1] showed that PICRUSt was much more accurate on diverse soil samples and samples from the Human Microbiome Project than on microbial mat samples from Guerrero Negro, which contained many bacteria without any sequenced relatives.
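One way to read the index described above is as an abundance-weighted mean of each OTU's distance to its nearest sequenced relative. The sketch below follows that reading. The abundance weighting is an assumption of this example, as are the function name `weighted_nsti` and the input values; it is not taken from PICRUSt's source code.

```python
def weighted_nsti(abundances, nearest_distance):
    """Abundance-weighted mean distance to the nearest sequenced genome.

    abundances       : {otu: abundance in the sample}
    nearest_distance : {otu: phylogenetic distance to the closest
                        reference with a fully sequenced genome}
    """
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    weighted_sum = sum(
        count * nearest_distance[otu] for otu, count in abundances.items()
    )
    return weighted_sum / total


# Hypothetical sample: one abundant, well-characterized OTU
# and one rarer OTU that is far from any sequenced genome.
abundances = {"otu1": 30, "otu2": 10}
nearest_distance = {"otu1": 0.02, "otu2": 0.10}

nsti = weighted_nsti(abundances, nearest_distance)
print(nsti)
```

Here the score is (30 × 0.02 + 10 × 0.10) / 40 = 0.04. A sample dominated by organisms with distant or no sequenced relatives would score higher, signalling less reliable predictions.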

Similar tools

Okuda et al., 2012 [3] published a similar method that used a bounded k-Nearest Neighbor approach to predict virtual metagenomes. They validated their approach using 16S rRNA gene sequences extracted from shotgun metagenomes, and compared the predictions of their method against the full metagenome.

CopyRighter [4], like PICRUSt, uses evolutionary modeling and phylogenetic trait prediction to estimate 16S rRNA gene copy numbers for each bacterial and archaeal type in a sample, and then uses these estimates to correct estimates of community composition.

PanFP [5] presented a similar method, but based on genome predictions for each taxonomic group. Benchmarking showed highly similar performance to PICRUSt when compared on the same datasets. One advantage is that all OTUs, not just those in a reference phylogeny table, can be used. One disadvantage is that confidence intervals and evolutionary models are not constructed.

PAPRICA [6] is a metagenome prediction tool based on placing input 16S rRNA gene sequences into a known phylogenetic tree corresponding to reference genomes. The main prediction output corresponds to Enzyme Commission numbers.

Piphillin [7] is a tool produced by the company Second Genome that produces metagenome predictions based on nearest-neighbour clustering of input 16S rRNA gene sequences with 16S rRNA gene sequences from reference genomes. There is a web portal for running this tool on the Second Genome website. This tool is under continual development and undergoing validation as summarized in a 2020 publication. [8]

Tax4Fun [9] is a similar tool based on linking the 16S ribosomal RNA genes from all KEGG organisms with 16S rRNA gene sequences found in the SILVA ribosomal RNA database. Originally this tool was restricted to 16S rRNA gene sequences found within the SILVA database. However, the latest version of this tool, Tax4Fun2, can be used with OTUs or amplicon sequence variants from any clustering pipeline.

References

  1. Langille, Morgan G I; Zaneveld, Jesse; Caporaso, J Gregory; McDonald, Daniel; Knights, Dan; Reyes, Joshua A; Clemente, Jose C; Burkepile, Deron E; Vega Thurber, Rebecca L; Knight, Rob; Beiko, Robert G; Huttenhower, Curtis (2013). "Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences". Nature Biotechnology. 31 (9): 814–821. doi:10.1038/nbt.2676. ISSN 1087-0156. PMC 3819121. PMID 23975157.
  2. Douglas, Gavin; Maffei, Vince; Zaneveld, Jesse; Yurgel, Svetlana; Brown, James; Taylor, Christopher; Huttenhower, Curtis; Langille, Morgan (2020). "PICRUSt2: An improved and customizable approach for metagenome inference". bioRxiv. doi:10.1101/672295.
  3. Okuda, Shujiro; Tsuchiya, Yuki; Kiriyama, Chiho; Itoh, Masumi; Morisaki, Hisao (2012). "Virtual metagenome reconstruction from 16S rRNA gene sequences". Nature Communications. 3: 1203. doi:10.1038/ncomms2203. PMID 23149747.
  4. Angly, Florent E; Dennis, Paul G; Skarshewski, Adam; Vanwonterghem, Inka; Hugenholtz, Philip; Tyson, Gene W (2014). "CopyRighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction". Microbiome. 2: 11. doi:10.1186/2049-2618-2-11. PMC 4021573. PMID 24708850.
  5. Jun, Se-Ran; Robeson, Michael S.; Hauser, Loren J.; Schadt, Christopher W.; Gorin, Andrey A. (2015). "PanFP: pangenome-based functional profiles for microbial communities". BMC Research Notes. 8: 479. doi:10.1186/s13104-015-1462-8. PMC 4584126. PMID 26409790.
  6. Bowman, Jeff; Ducklow, Hugh (2015). "Microbial Communities Can Be Described by Metabolic Structure: A General Framework and Application to a Seasonally Variable, Depth-Stratified Microbial Community from the Coastal West Antarctic Peninsula". PLOS ONE. 10 (8): e0135868. Bibcode:2015PLoSO..1035868B. doi:10.1371/journal.pone.0135868. PMC 4540456. PMID 26285202.
  7. Iwai, Shoko; Weinmaier, Thomas; Schmidt, Brian; Albertson, Donna; Poloso, Neil; Dabbagh, Karim; DeSantis, Todd (2016). "Piphillin: Improved Prediction of Metagenomic Content by Direct Inference from Human Microbiomes". PLOS ONE. 11 (11): e0166104. Bibcode:2016PLoSO..1166104I. doi:10.1371/journal.pone.0166104. PMC 5098786. PMID 27820856.
  8. Narayan, Nicole; Weinmaier, Thomas; Laserna-Mendieta, Emilio; Claesson, Marcus; Shanahan, Fergus; Dabbagh, Karim; Iwai, Shoko; DeSantis, Todd (2020). "Piphillin predicts metagenomic composition and dynamics from DADA2-corrected 16S rDNA sequences". BMC Genomics. 21 (1): 56. doi:10.1186/s12864-019-6427-1. PMC 6967091. PMID 31952477.
  9. Aßhauer, Kathrin; Wemheuer, Bernd; Daniel, Rolf; Meinicke, Peter (2015). "Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data". Bioinformatics. 31 (17): 2882–2884. doi:10.1093/bioinformatics/btv287. PMC 4547618. PMID 25957349.