MG-RAST is an open-source web application server that provides automated phylogenetic and functional analysis of metagenomes. It is also one of the largest repositories for metagenomic data. The name is an abbreviation of ''Metagenomic Rapid Annotations using Subsystems Technology''. The pipeline automatically produces functional assignments for the sequences that belong to the metagenome by performing sequence comparisons against databases at both the nucleotide and amino-acid levels. The application supplies phylogenetic and functional assignments of the metagenome being analysed, as well as tools for comparing different metagenomes. It also provides a RESTful API for programmatic access. The server was created and is maintained by Argonne National Laboratory and the University of Chicago. As of December 29, 2016, the system had analyzed 60 terabase-pairs of data from more than 150,000 data sets. Among the analyzed data sets, more than 23,000 are available to the public. Currently, the computational resources are provided by the DOE Magellan cloud at Argonne National Laboratory, Amazon EC2 Web services, and a number of traditional clusters.
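As a rough illustration of what programmatic access to a RESTful service looks like, the sketch below composes a request URL from a resource name, an identifier, and query parameters. The base URL, the resource name `metagenome`, the identifier, and the `verbosity` parameter are all hypothetical placeholders chosen for illustration; the real endpoints and parameters should be taken from the official API documentation.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL for illustration only; not the real service address.
BASE_URL = "https://api.example.org"


def build_request_url(resource, identifier, **params):
    """Compose a REST-style URL of the form BASE/resource/identifier?query."""
    url = f"{BASE_URL}/{quote(resource)}/{quote(identifier)}"
    if params:
        # Sort the parameters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


# Hypothetical resource name, identifier, and parameter.
example_url = build_request_url("metagenome", "mgm0000000.0", verbosity="minimal")
print(example_url)
```

The sketch only builds the URL and performs no network request, so it can be adapted to whichever HTTP client is preferred.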

Background

MG-RAST was developed to provide a free, public resource for the analysis and storage of metagenome sequence data. The service removes one of the primary bottlenecks in metagenome analysis: the availability of high-performance computing for annotating data. Metagenomic and metatranscriptomic studies involve processing large datasets and can therefore require computationally expensive analysis. Scientists are now able to generate such volumes of data because sequencing costs have fallen dramatically in recent years. This has shifted the limiting factor to computing costs: for instance, a study from the University of Maryland estimated a cost of more than $5 million per terabase using their CLoVR metagenome analysis pipeline. As the size and number of sequence datasets continue to increase, the costs of analyzing them will continue to rise. MG-RAST also works as a repository for metagenomic data. Metadata collection and interpretation are vital for genomic and metagenomic studies, and challenges in this regard include the exchange, curation, and distribution of this information. The MG-RAST system was an early adopter of the minimal checklist standards and the expanded biome-specific environmental packages devised by the Genomic Standards Consortium, and it provides an easy-to-use uploader for metadata capture at the time of data submission.

Pipeline for metagenomic data analysis

The MG-RAST application offers automated quality control, annotation, comparative analysis, and archiving of metagenomic and amplicon sequences using a combination of several bioinformatics tools. The application was built to analyze metagenomic data, but it also supports the processing of amplicon (16S, 18S, and ITS) and metatranscriptome (RNA-seq) sequences. At present, MG-RAST cannot predict coding regions from eukaryotes and is therefore of limited use for analyzing eukaryotic metagenomes. The MG-RAST pipeline can be divided into five stages:

Data hygiene

This stage includes steps for quality control and artifact removal. First, low-quality regions are trimmed using SolexaQA, and reads of inappropriate length are removed. A dereplication step is included when processing metagenome and metatranscriptome datasets. Next, DRISEE (Duplicate Read Inferred Sequencing Error Estimation) is used to assess the sample's sequencing error by measuring Artificial Duplicate Reads (ADRs). Finally, the pipeline offers the option of screening the reads with the Bowtie aligner and removing reads that closely match the genomes of model organisms (including fly, mouse, cow, and human).
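The length filtering and dereplication steps described above can be sketched in a few lines. This is a toy illustration and not the pipeline's actual code: the cutoff of two standard deviations around the mean length and the use of a shared 50-base prefix to recognise duplicates are assumptions made for the example, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def filter_by_length(reads, max_sd=2.0):
    """Keep reads whose length lies within max_sd standard deviations of the mean.

    The default of 2.0 is an assumption made for this sketch.
    """
    lengths = [len(r) for r in reads]
    mu, sd = mean(lengths), pstdev(lengths)
    return [r for r in reads if abs(len(r) - mu) <= max_sd * sd]


def dereplicate(reads, prefix_len=50):
    """Keep the first read seen for each distinct prefix of prefix_len bases.

    Treating reads with an identical prefix as duplicates is an assumption
    made for this sketch.
    """
    seen = set()
    kept = []
    for read in reads:
        key = read[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Tiny demonstration with a short prefix so the effect is visible.
reads = ["ACGTAA", "ACGTCC", "TTGTAA"]
unique_reads = dereplicate(reads, prefix_len=4)
print(unique_reads)
```

Real pipelines work on FASTQ records with quality scores; plain strings are used here only to keep the sketch self-contained.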

Feature extraction

MG-RAST identifies gene sequences using a machine learning approach, FragGeneScan. Ribosomal RNA sequences are identified through an initial BLAT search against a reduced version of the SILVA database.

Feature annotation

To identify the putative functions and annotations of the genes, MG-RAST builds clusters of proteins at the 90% identity level using the UCLUST implementation in QIIME. The longest sequence of each cluster is selected for a similarity analysis, which is computed with sBLAT (an implementation in which the BLAT algorithm is parallelized using OpenMP). The search is run against a protein database derived from the M5nr, which provides a nonredundant integration of sequences from the GenBank, SEED, IMG, UniProt, KEGG, and eggNOG databases. Reads associated with rRNA sequences are clustered at 97% identity. The longest sequence of each cluster is picked as the representative and used for a BLAT search against the M5rna database, which integrates SILVA, Greengenes, and RDP.
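The idea of clustering at an identity threshold and keeping the longest sequence of each cluster as its representative can be illustrated with a small greedy procedure. This is a toy stand-in and not UCLUST: the similarity measure below is the ratio from Python's `difflib.SequenceMatcher` rather than an alignment-based identity, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_longest_first(seqs, threshold=0.90):
    """Greedy clustering that visits sequences from longest to shortest.

    Each sequence joins the first cluster whose representative it matches at
    or above the threshold; otherwise it founds a new cluster. Because longer
    sequences are visited first, each representative is the longest member of
    its cluster.
    """
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


proteins = ["MKTAYIAKQR", "MKTAYIAKQ", "GGGGGGGGGG"]
clusters = cluster_longest_first(proteins, threshold=0.90)
representatives = [rep for rep, _ in clusters]
print(representatives)
```

Only the representatives would then be passed to the similarity search, which is what makes clustering a cost-saving step: one search result stands in for every member of the cluster.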

Profile generation

The data is integrated into a number of data products. The most important ones are the abundance profiles, which represent a pivoted and aggregated version of the similarity files.
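To make "pivoted and aggregated" concrete, the sketch below turns a list of per-read similarity hits into a table of counts per annotation. The record layout (a read identifier paired with an annotation label) and the function name are assumptions made for illustration; the actual similarity files carry more fields.

```python
from collections import Counter


def abundance_profile(hits):
    """Aggregate (read_id, annotation) hit records into annotation counts.

    Duplicate records for the same read and annotation are counted once.
    """
    unique_pairs = set(hits)
    return Counter(annotation for _, annotation in unique_pairs)


# Hypothetical hit records: (read identifier, assigned annotation).
hits = [
    ("read1", "Bacteria"),
    ("read2", "Bacteria"),
    ("read2", "Bacteria"),  # repeated record, counted once
    ("read3", "Archaea"),
]
profile = abundance_profile(hits)
print(dict(profile))
```

The same aggregation applies whether the annotation is a taxon or a functional category, which is why one profile format can serve both kinds of analysis.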

Data loading

Finally, the obtained abundance profiles are loaded into the respective databases.

Detailed steps of the MG-RAST pipeline

MG-RAST utilities

Besides metagenome analysis, MG-RAST can also be used for data discovery. Metagenome profiles and data sets can be visualized and compared in a wide variety of modes: the web interface allows data to be selected by criteria such as composition, sequence quality, functionality, or sample type, and offers several ways to compute statistical inferences and ecological analyses. Metagenome profiles can be visualized and compared using bar charts, trees, spreadsheet-like tables, heatmaps, PCoA, rarefaction plots, circular recruitment plots, and KEGG maps.
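As an example of the kind of ecological analysis listed above, a rarefaction curve plots the expected number of distinct taxa against the number of reads sampled. The sketch below uses the standard hypergeometric expectation for sampling without replacement; whether MG-RAST computes its rarefaction plots in exactly this way is not stated here, so this is an illustration of the concept only, and the function name is hypothetical.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of distinct taxa in a random subsample of n reads.

    counts holds the number of reads assigned to each taxon. A taxon with c
    reads is missed with probability C(N - c, n) / C(N, n), where N is the
    total number of reads.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of reads")
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)


# Hypothetical abundance counts for three taxa.
counts = [5, 3, 2]
curve = [rarefied_richness(counts, n) for n in range(sum(counts) + 1)]
print(curve)
```

A curve that flattens before reaching the full sample size suggests that additional sequencing would reveal few new taxa.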


External links

* MG-RAST Web Server
* tp://ftp.metagenomics.anl.gov/data/manual/mg-rast-manual.pdf MG-RAST manual
* M5NR