HOME

TheInfoList



OR:

''De novo'' transcriptome assembly is the de novo sequence assembly method of creating a
transcriptome The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The t ...
without the aid of a
reference genome A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assemble ...
.


Introduction

As a result of the development of novel sequencing technologies, the years between 2008 and 2012 saw a large drop in the cost of sequencing. Per megabase and genome, the cost dropped to 1/100,000th and 1/10,000th of the price, respectively. Prior to this, only transcriptomes of organisms that were of broad interest and utility to scientific research were sequenced; however, these developed in 2010s
high-throughput sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. Th ...
(also called next-generation sequencing) technologies are both cost- and labor- effective, and the range of organisms studied via these methods is expanding. Transcriptomes have subsequently been created for
chickpea The chickpea or chick pea (''Cicer arietinum'') is an annual legume of the family Fabaceae, subfamily Faboideae. Its different types are variously known as gram" or Bengal gram, garbanzo or garbanzo bean, or Egyptian pea. Chickpea seeds are h ...
, planarians, '' Parhyale hawaiensis'', as well as the brains of the
Nile crocodile The Nile crocodile (''Crocodylus niloticus'') is a large crocodilian native to freshwater habitats in Africa, where it is present in 26 countries. It is widely distributed throughout sub-Saharan Africa, occurring mostly in the central, eastern, ...
, the corn snake, the
bearded dragon ''Pogona'' is a genus of reptiles containing six lizard species which are often known by the common name bearded dragons. The name "bearded dragon" refers to the underside of the throat (or "beard") of the lizard, which can turn black and gain we ...
, and the red-eared slider, to name just a few. Examining non-model organisms can provide novel insights into the mechanisms underlying the "diversity of fascinating morphological innovations" that have enabled the abundance of life on planet Earth. In animals and plants, the "innovations" that cannot be examined in common model organisms include
mimicry In evolutionary biology, mimicry is an evolved resemblance between an organism and another object, often an organism of another species. Mimicry may evolve between different species, or between individuals of the same species. Often, mimicry f ...
, mutualism,
parasitism Parasitism is a close relationship between species, where one organism, the parasite, lives on or inside another organism, the host, causing it some harm, and is adapted structurally to this way of life. The entomologist E. O. Wilson ha ...
, and
asexual reproduction Asexual reproduction is a type of reproduction that does not involve the fusion of gametes or change in the number of chromosomes. The offspring that arise by asexual reproduction from either unicellular or multicellular organisms inherit the ...
. ''De novo'' transcriptome assembly is often the preferred method to studying non-model organisms, since it is cheaper and easier than building a genome, and reference-based methods are not possible without an existing genome. The transcriptomes of these organisms can thus reveal novel proteins and their isoforms that are implicated in such unique biological phenomena.


''De novo'' vs. reference-based assembly

A set of assembled transcripts allows for initial gene expression studies. Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping on to a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for incidents of structural alterations of mRNA transcripts, such as
alternative splicing Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to code for multiple proteins. In this process, particular exons of a gene may be i ...
. Since a genome contains the sum of all introns and exons that may be present in a transcript, spliced variants that do not align continuously along the genome may be discounted as actual protein isoforms. Even if a reference genome is available, ''de novo'' assembly should be performed, as it can recover transcripts that are transcribed from segments of the genome that are missing from the reference genome assembly.


Transcriptome vs. genome assembly

Unlike genome sequence coverage levels – which can vary randomly as a result of repeat content in non-coding
intron An intron is any Nucleic acid sequence, nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word ''intron'' is derived from the term ''intragenic region'', i.e. a region inside a gene."The notion of ...
regions of DNA – transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced
isoforms A protein isoform, or "protein variant", is a member of a set of highly similar proteins that originate from a single gene or gene family and are the result of genetic differences. While many perform the same or similar biological roles, some iso ...
, or minor variation among members of a gene family. Genome assembler can't be directly used in transcriptome assembly for several reasons. First, genome sequencing depth is usually the same across a genome, but the depth of transcripts can vary. Second, both strands are always sequenced in genome sequencing, but RNA-seq can be strand-specific. Third, transcriptome assembly is more challenging because transcript variants from the same gene can share exons and are difficult to resolve unambiguously.


Method


RNA-seq

Once RNA is extracted and purified from cells, it is sent to a high-throughput sequencing facility, where it is first reverse transcribed to create a cDNA library. This cDNA can then be fragmented into various lengths depending on the platform used for sequencing. Each of the following platforms utilizes a different type of technology to sequence millions of short reads: 454 Sequencing, Illumina, and
SOLiD Solid is one of the four fundamental states of matter (the others being liquid, gas, and plasma). The molecules in a solid are closely packed together and contain the least amount of kinetic energy. A solid is characterized by structur ...
.


Assembly algorithms

The cDNA sequence reads are assembled into transcripts via a short read transcript assembly program. Most likely, some amino acid variations among transcripts that are otherwise similar reflect different protein isoforms. It is also possible that they represent different genes within the same gene family, or even genes that share only a conserved domain, depending on the degree of variation. A number of assembly programs are available (see
Assemblers Assembler may refer to: Arts and media * Nobukazu Takemura, avant-garde electronic musician, stage name Assembler * Assemblers, a fictional race in the ''Star Wars'' universe * Assemblers, an alternative name of the superhero group Champions of A ...
). Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges. Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (and thus be masked), for a transcriptome, they may indicate abundance. In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts. Finally, it can be difficult to reconstruct and tease apart all splicing isoforms. Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs.
Overlap graphs Overlap may refer to: * In set theory, an overlap of elements shared between sets is called an intersection, as in a Venn diagram. * In music theory, overlap is a synonym for reinterpretation of a chord at the boundary of two musical phrases * Ove ...
are utilized for most assemblers designed for Sanger sequenced reads. The overlaps between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and most effective in assembling fewer reads with a high degree of overlap.
De Bruijn graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multipl ...
s align
k-mer In bioinformatics, ''k''-mers are substrings of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which ''k''-mers are composed of nucleotides (''i.e''. A, T, G ...
s (usually 25-50 bp) based on k-1 sequence conservation to create contigs. The k-mers are shorter than the read lengths allowing fast hashing so the operations in de Bruijn graphs are generally less computationally intensive.


Functional annotation

Functional annotation of the assembled transcripts allows for insight into the particular molecular functions, cellular components, and biological processes in which the putative proteins are involved. Blast2GO (B2G) enables Gene Ontology based data mining to annotate sequence data for which no GO annotation is available yet. It is a research tool often employed in functional genomics research on non-model species. It works by blasting assembled contigs against a non-redundant protein database (at NCBI), then annotating them based on sequence similarity. GOanna is another GO annotation program specific for animal and agricultural plant gene products that works in a similar fashion. It is part of the AgBase database of curated, publicly accessible suite of computational tools for GO annotation and analysis. Following annotation, KEGG (Kyoto Encyclopedia of Genes and Genomes) enables visualization of metabolic pathways and molecular interaction networks captured in the transcriptome. In addition to being annotated for GO terms, contigs can also be screened for open reading frames (ORFs) in order to predict the amino acid sequence of proteins derived from these transcripts. Another approach is to annotate protein domains and determine the presence of gene families, rather than specific genes.


Verification and quality control

Since a well-resolved reference genome is rarely available, the quality of computer-assembled contigs may be verified either by comparing the assembled sequences to the reads used to generate them (reference-free), or by aligning the sequences of conserved gene domains found in mRNA transcripts to transcriptomes or genomes of closely related species (reference-based). Tools such as Transrate and DETONATE allow statistical analysis of assembly quality by these methods. Another method is to design PCR primers for predicted transcripts, then attempt to amplify them from the cDNA library. Often, exceptionally short reads are filtered out. Short sequences (< 40 amino acids) are unlikely to represent functional proteins, as they are unable to fold independently and form hydrophobic cores. Complementary to these metrics, a quantitative assessment of the gene content may provide additional insights into the quality of the assembly. To perform this step, tools that model the expected gene space based of conserved genes, such as BUSCO, can be used. For eukaryotes, CEGMA may also be used, although it is officially no longer supported since 2015.


Assemblers

The following is a partial compendium of assembly software that has been used to generate transcriptomes, and has also been cited in scientific literature.


SeqMan NGen

SeqMan NGen, part of
DNASTAR DNASTAR is a global bioinformatics software company incorporated in 1984 that is headquartered in Madison, Wisconsin. DNASTAR develops and sells software for sequence analysis in the fields of genomics, molecular biology, and structural biology. ...
's software pipeline, includes a ''de novo'' transcriptome assembler for small or large transcriptome data sets. SeqMan NGen uses a patented algorithm which utilizes
RefSeq The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences ( DNA, RNA) and their protein products. RefSeq was first introduced in 2000. This database is built by National ...
to identify and merge transcripts, and automatically annotates assembled transcripts using DNASTAR's proprietary transcript annotation tool to identify and highlight known and novel genes.


SOAPdenovo-Trans

SOAPdenovo-Trans is a ''de novo'' transcriptome assembler inherited from the SOAPdenovo2 framework, designed for assembling transcriptome with alternative splicing and different expression level. The assembler provides a more comprehensive way to construct the full-length transcript sets compare to SOAPdenovo2.


Velvet/Oases

The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian
bacterial artificial chromosome A bacterial artificial chromosome (BAC) is a DNA construct, based on a functional fertility plasmid (or F-plasmid), used for transforming and cloning in bacteria, usually '' E. coli''. F-plasmids play a crucial role because they contain partiti ...
s (BACs). These preliminary transcripts are transferred to
Oases In ecology, an oasis (; ) is a fertile area of a desert or semi-desert environment'ksar''with its surrounding feeding source, the palm grove, within a relational and circulatory nomadic system.” The location of oases has been of critical im ...
, which uses paired end read and long read information to build transcript isoforms.


Trans-ABySS

ABySS is a parallel, paired-end sequence assembler. Trans-ABySS (Assembly By Short Sequences) is a software pipeline written in Python and
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offic ...
for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels, identify potential
polyadenylation Polyadenylation is the addition of a poly(A) tail to an RNA transcript, typically a messenger RNA (mRNA). The poly(A) tail consists of multiple adenosine monophosphates; in other words, it is a stretch of RNA that has only adenine bases. In euk ...
sites, as well as candidate gene-fusion events.


Trinity

Trinity first divides the sequence data into a number of
de Bruijn graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multipl ...
s, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from
paralogous genes Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a sp ...
from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts: * Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts. * Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs. * Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.


See also

*
Transcriptome The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The t ...
*
Transcriptomics Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. H ...
* Human-transcriptome database for alternative splicing (H-DBAS) *
UniGene UniGene was a NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus (i.e. gene or expressed pseudogene). Inform ...
* Full-parasites *
Exome sequencing Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome (known as the exome). It consists of two steps: the first step is to select only the sub ...


References

{{reflist, 2 Bioinformatics Systems biology Computational biology Omics