MG-RAST

Original author(s) Argonne National Laboratory, University of Chicago, San Diego State University
Developer(s) F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E.M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R.A. Edwards
Initial release 2008
Stable release 4.0 / 15 November 2016
Type Bioinformatics
Website http://metagenomics.anl.gov/

MG-RAST is an open-source web application server for automatic phylogenetic and functional analysis of metagenomes. It is also one of the largest repositories of metagenomic data. The name is an acronym for Metagenomic Rapid Annotations using Subsystems Technology. The platform runs a pipeline that automatically assigns functions to metagenomic sequences by comparing them at both the nucleotide and amino acid levels. Users receive phylogenetic and functional summaries of the analyzed metagenomes, along with tools for comparing different datasets. MG-RAST also offers a RESTful API for programmatic access.

The server was created and is maintained by Argonne National Laboratory and the University of Chicago. As of December 29, 2016, MG-RAST had analyzed 60 terabase pairs of data from over 150,000 datasets, more than 23,000 of which are publicly available. Computational resources are currently provided by the DOE Magellan cloud at Argonne National Laboratory, Amazon EC2 Web services, and various traditional clusters.

Background

MG-RAST was developed to serve as a free, public resource dedicated to the analysis and storage of metagenome sequence data. It addresses a key bottleneck in metagenome analysis by eliminating the dependence on high-performance computing for annotating data.

The significance of MG-RAST becomes evident in metagenomic and metatranscriptomic studies, where processing large datasets often requires computationally intensive analyses. Sequencing costs have fallen substantially in recent years, so scientists can generate vast amounts of data, and the limiting factor has shifted to the cost of computing. For example, a recent University of Maryland study estimated a cost exceeding $5 million per terabase using their CLOVR metagenome analysis pipeline. As the size and number of sequence datasets continue to grow, the associated analysis costs are expected to rise.

Beyond analysis, MG-RAST functions as a repository for metagenomic data. Metadata collection and interpretation are crucial for genomic and metagenomic studies, and MG-RAST addresses challenges related to the exchange, curation, and distribution of this information. The system has adopted the minimal checklist standards and biome-specific environmental packages established by the Genomic Standards Consortium. [1] MG-RAST also provides a user-friendly uploader for capturing metadata at the time of data submission.

Pipeline for metagenomic data analysis

The MG-RAST application provides automated quality control, annotation, comparative analysis, and archiving for metagenomic and amplicon sequences, using a combination of bioinformatics tools. Originally designed for metagenomic data analysis, MG-RAST also supports processing of amplicon sequences (16S, 18S, and ITS) and metatranscriptome (RNA-seq) sequences. However, MG-RAST currently cannot predict coding regions from eukaryotes, which limits its utility for eukaryotic metagenome analysis.

The MG-RAST pipeline can be segmented into five distinct stages:

Data hygiene

The MG-RAST pipeline begins with a series of quality control and artifact removal steps for metagenomic and metatranscriptome datasets. The initial stage trims low-quality regions using SolexaQA and discards reads with inappropriate lengths. For metagenome and metatranscriptome datasets, a dereplication step is added to make data processing more efficient.
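
The length filtering and dereplication described above can be illustrated with a minimal Python sketch. It is not the MG-RAST implementation: the function names, the cutoff of two standard deviations around the mean read length, and the 50-base prefix used as a dereplication key are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def filter_by_length(reads, max_sd=2.0):
    """Keep reads whose length is within max_sd standard deviations of the mean.

    The cutoff is an illustrative assumption, not a documented MG-RAST setting.
    """
    if not reads:
        return []
    lengths = [len(r) for r in reads]
    mu, sd = mean(lengths), pstdev(lengths)
    return [r for r in reads if abs(len(r) - mu) <= max_sd * sd]


def dereplicate(reads, prefix_len=50):
    """Keep the first read seen for each distinct leading prefix.

    Reads sharing an identical prefix are treated as duplicates; the prefix
    length is a hypothetical parameter.
    """
    seen = set()
    kept = []
    for read in reads:
        key = read[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept
```

Keying on a prefix rather than the whole read lets duplicates of different lengths collapse into one entry, at the cost of occasionally merging reads that diverge later.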

The subsequent step employs DRISEE (Duplicate Read Inferred Sequencing Error Estimation) to evaluate sample sequencing errors by measuring Artificial Duplicate Reads (ADRs). This assessment contributes to enhancing the accuracy of downstream analyses.
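
The idea of inferring sequencing error from duplicate reads can be sketched as follows. This is a heavily simplified illustration, not the DRISEE algorithm itself: it assumes that reads sharing an identical prefix are artificial duplicates of one template, builds a per-position consensus over the remaining bases, and reports the fraction of bases that disagree with it. The function name and both parameters are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_error_rate(reads, prefix_len=20, min_bin_size=3):
    """Estimate per-base error from bins of reads with identical prefixes.

    Returns mismatches / bases compared, or None if no bin is large enough.
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) > prefix_len:
            bins[read[:prefix_len]].append(read)

    mismatches = 0
    compared = 0
    for members in bins.values():
        if len(members) < min_bin_size:
            continue  # too few duplicates to form a meaningful consensus
        shortest = min(len(m) for m in members)
        for pos in range(prefix_len, shortest):
            column = [m[pos] for m in members]
            # Bases differing from the most common base count as errors.
            consensus_count = Counter(column).most_common(1)[0][1]
            mismatches += len(column) - consensus_count
            compared += len(column)
    return mismatches / compared if compared else None
```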

Finally, the pipeline offers the option to screen reads using the Bowtie aligner. It identifies and removes reads that are near-exact matches to the genomes of model organisms, including fly, mouse, cow, and human. This step filters out reads that stem from potential contaminants or unintended host sequences.
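
The screening step can be illustrated with a toy stand-in. The sketch below does not use Bowtie or any alignment; it swaps in a much cruder shared k-mer test, removing a read when most of its k-mers also occur in a host reference. The function names, the k-mer size, and the 0.9 threshold are assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, host_sequences, k=8, max_host_fraction=0.9):
    """Drop reads whose k-mers are mostly found in the host sequences."""
    host_index = set()
    for host in host_sequences:
        host_index |= kmers(host, k)

    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            kept.append(read)  # shorter than k: cannot be judged, so keep it
            continue
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction < max_host_fraction:
            kept.append(read)
    return kept
```

A real aligner tolerates mismatches and indels and scales to whole genomes through a compressed index, which this sketch does not attempt.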

Feature extraction

In the gene identification process, MG-RAST employs a machine learning approach known as FragGeneScan. This method is utilized to identify gene sequences within the metagenomic or metatranscriptomic data.

For the identification of ribosomal RNA sequences, MG-RAST initiates a BLAT search against a reduced version of the SILVA database. This step allows the system to pinpoint and categorize ribosomal RNA sequences within the dataset, contributing to a more detailed understanding of the biological composition of the analyzed metagenomes or metatranscriptomes.

Feature annotation

To identify the putative functions and annotations of the genes, MG-RAST follows a multi-step process. Initially, it builds clusters of proteins at a 90% identity level using the UCLUST implementation in QIIME. The longest sequence within each cluster is then selected for further analysis.
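
Clustering at an identity threshold and keeping the longest member as the representative can be sketched as a greedy procedure. This is an illustration under stated assumptions, not UCLUST: identity here is a naive ungapped, left-aligned position match over the shorter sequence, whereas real tools compute identity from alignments. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def cluster_proteins(seqs, threshold=0.90):
    """Greedy clustering; returns {representative: [members]}.

    Sequences are visited longest first, so each cluster's representative
    is its longest member.
    """
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no close representative: start a new cluster
    return clusters
```

Only the representatives then need to be searched against the reference database, which is what makes the clustering step worthwhile.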

For the similarity analysis, MG-RAST employs sBLAT, a parallelized version of the BLAT algorithm using OpenMP. The search is conducted against a protein database derived from the M5nr, which integrates nonredundant sequences from various databases such as GenBank, SEED, IMG, UniProt, KEGG, and eggNOGs.

In the case of reads associated with rRNA sequences, a clustering step is performed at a 97% identity level. The longest sequence from each cluster is chosen as the representative and is used for a BLAT search against the M5rna database. This database integrates sequences from SILVA, Greengenes, and RDP, providing a comprehensive reference for the analysis of ribosomal RNA sequences.

Profile generation

The annotated similarity data feed several key products, primarily abundance profiles. These profiles summarize and reorganize the information found in the similarity files in a more easily digestible format.
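
As a rough illustration, an abundance profile can be thought of as a count of hits per annotation. The sketch below is an assumption about the general shape of such a summary, not MG-RAST's file format: the input structures, the function name, and the choice to weight each cluster representative by its cluster size are all hypothetical.

```python
from collections import Counter


def abundance_profile(best_hits, cluster_sizes):
    """Sum cluster sizes per annotation.

    best_hits maps a cluster representative ID to its best-hit annotation;
    cluster_sizes maps the same IDs to the number of sequences in the cluster.
    Representatives without a recorded size count as a single sequence.
    """
    profile = Counter()
    for rep_id, annotation in best_hits.items():
        profile[annotation] += cluster_sizes.get(rep_id, 1)
    return profile
```

The same counting could be keyed on organism or ontology term instead of function to produce the other summary types.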

Data loading

Finally, the obtained abundance profiles are loaded into the respective databases.

Detailed steps of the MG-RAST pipeline

MG-RAST Pipeline              Description
qc_stats                      Generate quality control statistics
preprocess                    Preprocessing, to trim low-quality regions from FASTQ data
dereplication                 Dereplication for shotgun metagenome data by using k-mer approach
screen                        Removing reads that are near-exact matches to the genomes of model organisms (fly, mouse, cow and human)
rna detection                 BLAT search against a reduced RNA database, to identify ribosomal RNA
rna clustering                rRNA-similar reads are then clustered at 97% identity
rna sims blat                 BLAT similarity search for the longest cluster representative against the M5rna database
genecalling                   A machine learning approach, FragGeneScan, to predict coding regions in DNA sequences
aa filtering                  Filter proteins
aa clustering                 Cluster proteins at 90% identity level using uclust
aa sims blat                  BLAT similarity analysis to identify protein
aa sims annotation            Sequence similarity against protein database from the M5nr
rna sims annotation           Sequence similarity against RNA database from the M5rna
index sim seq                 Index sequence similarity to data sources
md5 annotation summary        Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
function annotation summary   Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
organism annotation summary   Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
lca annotation summary        Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
ontology annotation summary   Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
source annotation summary     Generate summary report md5 annotation, function annotation, organism annotation, LCA annotation, ontology annotation and source annotation
md5 summary load              Load summary report to the project
function summary load         Load summary report to the project
organism summary load         Load summary report to the project
lca summary load              Load summary report to the project
ontology summary load         Load summary report to the project
done stage
notify job completion         Send notification to user via email

MG-RAST utilities

In addition to metagenome analysis, MG-RAST supports data exploration. It provides tools for visualizing and comparing metagenome profiles across datasets. Datasets can be filtered by criteria such as composition, quality, functionality, or sample type. Statistical inferences and ecological analyses can also be performed within the web interface.

References

  1. Field, Dawn; Amaral-Zettler, Linda; Cochrane, Guy; Cole, James R.; Dawyndt, Peter; Garrity, George M.; Gilbert, Jack; Glöckner, Frank Oliver; Hirschman, Lynette (2011-06-21). "The Genomic Standards Consortium". PLOS Biology. 9 (6): e1001088. doi:10.1371/journal.pbio.1001088. ISSN 1545-7885. PMC 3119656. PMID 21713030.