RNA-Seq (an abbreviation of RNA sequencing) is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, capturing the continuously changing cellular transcriptome.

Specifically, RNA-Seq makes it possible to examine alternatively spliced gene transcripts, post-transcriptional modifications, gene fusions, mutations/SNPs, and changes in gene expression over time or between groups or treatments. In addition to mRNA transcripts, RNA-Seq can be applied to other populations of RNA, including total RNA, small RNA (such as miRNA), tRNA, and ribosomal profiling. RNA-Seq can also be used to determine exon/intron boundaries and to verify or amend previously annotated 5' and 3' gene boundaries. Recent advances in RNA-Seq include single cell sequencing, in situ sequencing of fixed tissue, and native RNA molecule sequencing with single-molecule real-time sequencing. Other emerging RNA-Seq applications, enabled by advances in bioinformatics algorithms, include the detection of copy number alteration, microbial contamination, transposable elements, cell type (deconvolution), and the presence of neoantigens.

Prior to RNA-Seq, gene expression studies were done with hybridization-based microarrays. Issues with microarrays include cross-hybridization artifacts, poor quantification of lowly and highly expressed genes, and the need to know the sequence ''a priori''. Because of these technical issues, transcriptomics transitioned to sequencing-based methods. These progressed from Sanger sequencing of expressed sequence tag libraries, to chemical tag-based methods (e.g., serial analysis of gene expression), and finally to the current technology, next-gen sequencing of complementary DNA (cDNA), notably RNA-Seq.

Methods

Library preparation

The general steps to prepare a complementary DNA (cDNA) library for sequencing are described below, but often vary between platforms.

# ''RNA isolation:'' RNA is isolated from tissue and mixed with deoxyribonuclease (DNase), which reduces the amount of genomic DNA. The amount of RNA degradation is checked with gel and capillary electrophoresis and is used to assign an RNA integrity number to the sample. This RNA quality and the total amount of starting RNA are taken into consideration during the subsequent library preparation, sequencing, and analysis steps.
# ''RNA selection/depletion:'' To analyze signals of interest, the isolated RNA can be kept as is, filtered for RNA with 3' polyadenylated (poly(A)) tails to include only mRNA, depleted of ribosomal RNA (rRNA), and/or filtered for RNA that binds specific sequences (RNA selection and depletion methods table, below). RNAs with 3' poly(A) tails are mainly mature, processed, coding sequences. Poly(A) selection is performed by mixing RNA with poly(T) oligomers covalently attached to a substrate, typically magnetic beads. Poly(A) selection has important limitations in RNA biotype detection. Many RNA biotypes are not polyadenylated, including many noncoding RNAs and histone-core protein transcripts, or are regulated via their poly(A) tail length (e.g., cytokines), and thus might not be detected after poly(A) selection. Furthermore, poly(A) selection may display increased 3' bias, especially with lower quality RNA. These limitations can be avoided with ribosomal depletion, which removes the rRNA that typically represents over 90% of the RNA in a cell. Both poly(A) enrichment and ribosomal depletion are labor intensive and could introduce biases, so simpler approaches have been developed that omit these steps. Small RNA targets, such as miRNA, can be further isolated through size selection with exclusion gels, magnetic beads, or commercial kits.
# ''cDNA synthesis:'' RNA is reverse transcribed to cDNA because DNA is more stable, can be amplified (using DNA polymerases), and can be sequenced with more mature DNA sequencing technology. Amplification subsequent to reverse transcription results in loss of strandedness, which can be avoided with chemical labeling or single molecule sequencing. Fragmentation and size selection are performed to purify sequences that are the appropriate length for the sequencing machine. The RNA, cDNA, or both are fragmented with enzymes, sonication, or nebulizers. Fragmentation of the RNA reduces the 5' bias of randomly primed reverse transcription and the influence of primer binding sites, with the downside that the 5' and 3' ends are converted to DNA less efficiently. Fragmentation is followed by size selection, where either small sequences are removed or a tight range of sequence lengths is selected. Because small RNAs like miRNAs are lost, these are analyzed independently. The cDNA for each experiment can be indexed with a hexamer or octamer barcode, so that these experiments can be pooled into a single lane for multiplexed sequencing.
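The barcode indexing described above can be illustrated with a minimal demultiplexing sketch. It assumes, purely for illustration, that each pooled read starts with an in-line hexamer barcode and that barcodes must match exactly; the function name `demultiplex`, the barcodes, and the sample names are hypothetical, and real pipelines typically read the index separately and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len=6):
    """Group pooled reads by sample using a barcode at the start of each read.

    reads: iterable of sequence strings (barcode followed by the insert).
    barcode_to_sample: dict mapping barcode -> sample name (hypothetical names).
    Reads with an unrecognised barcode are placed in the "undetermined" bin.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Toy pooled lane: two known hexamer barcodes and one unreadable barcode.
pooled = ["ACGTACTTGACC", "TGCATGGGATCA", "ACGTACAAAC", "NNNNNNCCCC"]
samples = {"ACGTAC": "control", "TGCATG": "treated"}
by_sample = demultiplex(pooled, samples)
```

Exact matching keeps the sketch short; an error-tolerant version would compare barcodes by edit distance instead.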

Complementary DNA sequencing (cDNA-Seq)

The cDNA library derived from RNA biotypes is then sequenced into a computer-readable format. There are many high-throughput sequencing technologies for cDNA sequencing, including platforms developed by Illumina, Thermo Fisher, BGI/MGI, PacBio, and Oxford Nanopore Technologies. For Illumina short-read sequencing, a common technology for cDNA sequencing, adapters are ligated to the cDNA, DNA is attached to a flow cell, clusters are generated through cycles of bridge amplification and denaturing, and sequencing by synthesis is performed in cycles of complementary strand synthesis and laser excitation of bases with reversible terminators. Sequencing platform choice and parameters are guided by experimental design and cost. Common experimental design considerations include deciding on the sequencing length, sequencing depth, use of single versus paired-end sequencing, number of replicates, multiplexing, randomization, and spike-ins.

Small RNA/non-coding RNA sequencing

When sequencing RNA other than mRNA, the library preparation is modified. The cellular RNA is selected based on the desired size range. For small RNA targets, such as miRNA, the RNA is isolated through size selection. This can be performed with a size exclusion gel, with size selection magnetic beads, or with a commercially developed kit. Once the RNA is isolated, linkers are added to the 3' and 5' ends and the product is purified. The final step is cDNA generation through reverse transcription.

Direct RNA sequencing

Because converting RNA into cDNA, ligation, amplification, and other sample manipulations have been shown to introduce biases and artifacts that may interfere with both the proper characterization and quantification of transcripts, single molecule direct RNA sequencing has been explored by companies including Helicos (bankrupt), Oxford Nanopore Technologies, and others. This technology sequences RNA molecules directly in a massively parallel manner.

Single-molecule real-time RNA sequencing

Massively parallel single molecule direct RNA-Seq has been explored as an alternative to traditional RNA-Seq, in which RNA-to-cDNA conversion, ligation, amplification, and other sample manipulation steps may introduce biases and artifacts. Technology platforms that perform single-molecule real-time RNA-Seq include Oxford Nanopore Technologies (ONT) nanopore sequencing, PacBio IsoSeq, and Helicos (bankrupt). Sequencing RNA in its native form preserves modifications like methylation, allowing them to be investigated directly and simultaneously. Another benefit of single-molecule RNA-Seq is that transcripts can be covered in full length, allowing for higher confidence isoform detection and quantification compared to short-read sequencing. Traditionally, single-molecule RNA-Seq methods have had higher error rates than short-read sequencing, but newer methods like ONT direct RNA-Seq limit errors by avoiding fragmentation and cDNA conversion. Recent uses of ONT direct RNA-Seq for differential expression in human cell populations have demonstrated that this technology can overcome many limitations of short and long cDNA sequencing.

Single-cell RNA sequencing (scRNA-Seq)

Standard methods such as microarrays and bulk RNA-Seq analyze the expression of RNAs from large populations of cells. In mixed cell populations, these measurements may obscure critical differences between individual cells. Single-cell RNA sequencing (scRNA-Seq) provides the expression profiles of individual cells. Although it is not possible to obtain complete information on every RNA expressed by each cell, due to the small amount of material available, patterns of gene expression can be identified through gene clustering analyses. This can uncover rare cell types within a cell population that may never have been seen before. For example, rare specialized cells in the lung called pulmonary ionocytes, which express the cystic fibrosis transmembrane conductance regulator, were identified in 2018 by two groups performing scRNA-Seq on lung airway epithelia.

Experimental procedures

Current scRNA-Seq protocols involve the following steps: isolation of single cells and their RNA, reverse transcription (RT), amplification, library generation, and sequencing. Single cells are either mechanically separated into microwells (e.g., BD Rhapsody, Takara ICELL8, Vycap Puncher Platform, or CellMicrosystems CellRaft) or encapsulated in droplets (e.g., 10x Genomics Chromium, Illumina Bio-Rad ddSEQ, 1CellBio InDrop, Dolomite Bio Nadia). Single cells are labeled by adding beads with barcoded oligonucleotides; both cells and beads are supplied in limited amounts such that co-occupancy with multiple cells and beads is a very rare event. Once reverse transcription is complete, the cDNAs from many cells can be mixed together for sequencing; transcripts from a particular cell are identified by each cell's unique barcode. Unique molecular identifiers (UMIs) can be attached to mRNA/cDNA target sequences to help identify artifacts during library preparation.

Challenges for scRNA-Seq include preserving the initial relative abundance of mRNA in a cell and identifying rare transcripts. The reverse transcription step is critical, as the efficiency of the RT reaction determines how much of the cell's RNA population will eventually be analyzed by the sequencer. The processivity of reverse transcriptases and the priming strategies used may affect full-length cDNA production and the generation of libraries biased toward the 3' or 5' end of genes. In the amplification step, either PCR or ''in vitro'' transcription (IVT) is currently used to amplify cDNA. One of the advantages of PCR-based methods is the ability to generate full-length cDNA. However, differences in PCR efficiency between sequences (for instance, due to GC content and snapback structure) may also be exponentially amplified, producing libraries with uneven coverage. On the other hand, while libraries generated by IVT can avoid PCR-induced sequence bias, specific sequences may be transcribed inefficiently, causing sequence drop-out or generating incomplete sequences.

Several scRNA-Seq protocols have been published: Tang et al., STRT, SMART-seq, CEL-seq, RAGE-seq, Quartz-seq, and C1-CAGE. These protocols differ in their strategies for reverse transcription, cDNA synthesis, and amplification, and in whether they accommodate sequence-specific barcodes (i.e., UMIs) or can process pooled samples. In 2017, two approaches, REAP-seq and CITE-seq, were introduced to simultaneously measure single-cell mRNA and protein expression through oligonucleotide-labeled antibodies.
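The roles of cell barcodes and UMIs can be made concrete with a small sketch. It assumes each read has already been reduced to a hypothetical `(cell_barcode, umi, gene)` tuple and that barcodes and UMIs are error-free; the function name `count_umis` and the toy records are illustrative only. Reads sharing the same cell, gene, and UMI are treated as amplification duplicates of a single original molecule.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per gene in each cell.

    records: iterable of (cell_barcode, umi, gene) tuples, one per read.
    Returns {cell_barcode: {gene: number of distinct UMIs}}.
    """
    seen = defaultdict(set)
    for cell, umi, gene in records:
        # Duplicate reads of the same molecule collapse into one set entry.
        seen[(cell, gene)].add(umi)

    counts = defaultdict(dict)
    for (cell, gene), umis in seen.items():
        counts[cell][gene] = len(umis)
    return dict(counts)


# Toy data: the first two reads are duplicates of the same molecule.
reads = [
    ("AAAC", "UMI1", "CFTR"),
    ("AAAC", "UMI1", "CFTR"),
    ("AAAC", "UMI2", "CFTR"),
    ("TTTG", "UMI1", "CFTR"),
]
umi_counts = count_umis(reads)
```

Counting distinct UMIs rather than reads is what lets uneven amplification be separated from true differences in transcript abundance; production tools additionally correct sequencing errors in the barcodes and UMIs, which this sketch omits.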

Applications

scRNA-Seq is becoming widely used across biological disciplines including development, neurology, oncology, autoimmune disease, and infectious disease. scRNA-Seq has provided considerable insight into the development of embryos and organisms, including the worm ''Caenorhabditis elegans'' and the regenerative planarian ''Schmidtea mediterranea''. The first vertebrate animals to be mapped in this way were zebrafish and ''Xenopus laevis''. In each case multiple stages of the embryo were studied, allowing the entire process of development to be mapped on a cell-by-cell basis. ''Science'' recognized these advances as the 2018 Breakthrough of the Year.

Experimental considerations

A variety of parameters are considered when designing and conducting RNA-Seq experiments:

* ''Tissue specificity:'' Gene expression varies within and between tissues, and RNA-Seq measures this mix of cell types. This may make it difficult to isolate the biological mechanism of interest. Single cell sequencing can be used to study each cell individually, mitigating this issue.
* ''Time dependence:'' Gene expression changes over time, and RNA-Seq only takes a snapshot. Time course experiments can be performed to observe changes in the transcriptome.
* ''Coverage (also known as depth):'' RNA harbors the same mutations observed in DNA, and detection requires deeper coverage. With high enough coverage, RNA-Seq can be used to estimate the expression of each allele. This may provide insight into phenomena such as imprinting or cis-regulatory effects. The depth of sequencing required for specific applications can be extrapolated from a pilot experiment.
* ''Data generation artifacts (also known as technical variance):'' The reagents (e.g., library preparation kit), personnel involved, and type of sequencer (e.g., Illumina, Pacific Biosciences) can result in technical artifacts that might be misinterpreted as meaningful results. As with any scientific experiment, it is prudent to conduct RNA-Seq in a well-controlled setting. If this is not possible or the study is a meta-analysis, another solution is to detect technical artifacts by inferring latent variables (typically with principal component analysis or factor analysis) and subsequently correcting for these variables.
* ''Data management:'' A single RNA-Seq experiment in humans is usually 1-5 Gb (compressed), or more when including intermediate files. This large volume of data can pose storage issues. One solution is compressing the data using multi-purpose computational schemas (e.g., gzip) or genomics-specific schemas. The latter can be based on reference sequences or de novo. Another solution is to perform microarray experiments, which may be sufficient for hypothesis-driven work or replication studies (as opposed to exploratory research).
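The latent-variable correction mentioned under data generation artifacts can be sketched with principal component analysis. This is a minimal illustration, not a recommended pipeline: it assumes a samples-by-genes matrix of log-scale expression values and, importantly, that the leading component reflects technical rather than biological variation, which has to be checked in practice. The function name `remove_top_components` and the toy data are hypothetical.

```python
import numpy as np


def remove_top_components(expr, n_components=1):
    """Infer latent factors by PCA and subtract their contribution.

    expr: samples x genes array of (log-scale) expression values.
    Returns (corrected, factors), where `factors` holds the per-sample
    scores of the leading principal components and `corrected` is the
    gene-centered matrix with those components removed.
    """
    centered = expr - expr.mean(axis=0)            # center each gene
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    factors = u[:, :n_components] * s[:n_components]
    corrected = centered - factors @ vt[:n_components]
    return corrected, factors


# Toy example: four samples, three genes, and a pure "batch" shift that
# affects samples 3 and 4 in proportion to a per-gene loading.
batch = np.array([0.0, 0.0, 1.0, 1.0])
loadings = np.array([1.0, 2.0, 3.0])
expr = np.outer(batch, loadings) + 5.0
corrected, factors = remove_top_components(expr, n_components=1)
```

In this toy case all variation comes from the batch shift, so removing one component leaves essentially nothing. With real data, removing components blindly can also remove biological signal, so the inferred factors are usually compared against known covariates first.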

Analysis

Figure (RNASeqWorkflow2016.png): A standard RNA-Seq analysis workflow. Sequenced reads are aligned to a reference genome and/or transcriptome and subsequently processed for a variety of quality control, discovery, and hypothesis-driven analyses.

Transcriptome assembly

Two methods are used to assign raw sequence reads to genomic features (i.e., assemble the transcriptome):

* ''De novo:'' This approach does not require a reference genome to reconstruct the transcriptome, and is typically used if the genome is unknown, incomplete, or substantially altered compared to the reference. Challenges when using short reads for de novo assembly include 1) determining which reads should be joined together into contiguous sequences (contigs), 2) robustness to sequencing errors and other artifacts, and 3) computational efficiency. The primary algorithm used for de novo assembly transitioned from overlap graphs, which identify all pair-wise overlaps between reads, to de Bruijn graphs, which break reads into sequences of length k and collapse all k-mers into a hash table. Overlap graphs were used with Sanger sequencing, but do not scale well to the millions of reads generated with RNA-Seq. Examples of assemblers that use de Bruijn graphs are Trinity, Oases (derived from the genome assembler Velvet), Bridger, and rnaSPAdes. Paired-end and long-read sequencing of the same sample can mitigate the deficits of short-read sequencing by serving as a template or skeleton. Metrics to assess the quality of a de novo assembly include median contig length, number of contigs, and N50.
* ''Genome guided:'' This approach relies on the same methods used for DNA alignment, with the additional complexity of aligning reads that cover non-continuous portions of the reference genome. These non-continuous reads are the result of sequencing spliced transcripts (see figure). Typically, alignment algorithms have two steps: 1) align short portions of the read (i.e., seed the genome), and 2) use dynamic programming to find an optimal alignment, sometimes in combination with known annotations. Software tools that use genome-guided alignment include Bowtie, TopHat (which builds on Bowtie results to align splice junctions), Subread, STAR, HISAT2, and GMAP. The output of genome-guided alignment (mapping) tools can be further used by tools such as Cufflinks or StringTie to reconstruct contiguous transcript sequences (''i.e.'', a FASTA file). The quality of a genome-guided assembly can be measured with both 1) de novo assembly metrics (e.g., N50) and 2) comparisons to known transcript, splice junction, genome, and protein sequences using precision, recall, or their combination (e.g., F1 score). In addition, ''in silico'' assessment can be performed using simulated reads.

Figure (RNA-Seq-alignment.png): RNA-Seq alignment with intron-split short reads. Alignment of short reads to an mRNA sequence and the reference genome. Alignment software has to account for short reads that overlap exon-exon junctions (in red) and thereby skip intronic sections of the pre-mRNA and reference genome.

''A note on assembly quality:'' The current consensus is that 1) assembly quality can vary depending on which metric is used, 2) assembly tools that score well in one species do not necessarily perform well in other species, and 3) combining different approaches might be the most reliable.
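Two ideas from this section lend themselves to a short sketch: building a de Bruijn graph by collapsing k-mers into a hash table, and computing N50 from contig lengths. The code below is a toy illustration under simplifying assumptions (error-free reads, a single strand, no graph traversal or contig construction); the function names `de_bruijn_edges` and `n50` are hypothetical and not taken from any assembler.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Collapse all k-mers from the reads into a hash table of edges.

    Each k-mer contributes one edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix; identical k-mers from different reads collapse
    into the same edge.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs


# Two overlapping reads share the k-mers TGG and GGC, which collapse.
graph = de_bruijn_edges(["ATGGC", "TGGCA"], k=3)
assembly_n50 = n50([2, 3, 4, 5, 6])
```

Because shared k-mers are stored once, the table grows with the number of distinct k-mers rather than the number of read pairs, which is the scaling advantage over overlap graphs noted above.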

Gene expression quantification

Expression is quantified to study cellular changes in response to external stimuli, differences between healthy and

diseased A disease is a particular abnormal condition that negatively affects the structure or function of all or part of an organism, and that is not immediately due to any external injury. Diseases are often known to be medical conditions that a ...

states, and other research questions. Transcript levels are often used as a proxy for protein abundance, but these are often not equivalent due to post transcriptional events such as

RNA interference RNA interference (RNAi) is a biological process in which RNA molecules are involved in sequence-specific suppression of gene expression by double-stranded RNA, through translational or transcriptional repression. Historically, RNAi was known by ...

and

nonsense-mediated decay Nonsense-mediated mRNA decay (NMD) is a surveillance pathway that exists in all eukaryotes. Its main function is to reduce errors in gene expression by eliminating mRNA transcripts that contain premature stop codons. Translation of these aberran ...

. Expression is quantified by counting the number of reads that mapped to each locus in the transcriptome assembly step. Expression can be quantified for exons or genes using contigs or reference transcript annotations. These observed RNA-Seq read counts have been robustly validated against older technologies, including expression microarrays and

qPCR. Tools that quantify counts are HTSeq, FeatureCounts, Rcount, maxcounts, FIXSEQ, and Cuffquant. These tools determine read counts from aligned RNA-Seq data, but alignment-free counts can also be obtained with Sailfish and Kallisto. The read counts are then converted into appropriate metrics for hypothesis testing, regressions, and other analyses. Parameters for this conversion are:

* ''Sequencing depth/coverage:'' Although depth is pre-specified when conducting multiple RNA-Seq experiments, it will still vary widely between experiments. Therefore, the total number of reads generated in a single experiment is typically normalized by converting counts to fragments, reads, or counts per million mapped reads (FPM, RPM, or CPM). The difference between RPM and FPM was historically derived during the evolution from single-end sequencing of fragments to paired-end sequencing. In single-end sequencing, there is only one read per fragment (''i.e.'', RPM = FPM). In paired-end sequencing, there are two reads per fragment (''i.e.'', RPM = 2 x FPM). Sequencing depth is sometimes referred to as library size, the number of intermediary cDNA molecules in the experiment.
* ''Gene length:'' Longer genes will have more fragments/reads/counts than shorter genes if transcript expression is the same. This is adjusted by dividing the FPM by the length of a feature in kilobases (the feature can be a gene, transcript, or exon), resulting in the metric fragments per kilobase of feature per million mapped reads (FPKM). When looking at groups of features across samples, FPKM is converted to transcripts per million (TPM) by dividing each FPKM by the sum of FPKMs within a sample and multiplying by one million.
* ''Total sample RNA output:'' Because the same amount of RNA is extracted from each sample, samples with more total RNA will have less RNA per gene. These genes appear to have decreased expression, resulting in false positives in downstream analyses. Normalization strategies including quantile, DESeq2, TMM and Median Ratio attempt to account for this difference by comparing a set of non-differentially expressed genes between samples and scaling accordingly.
* ''Variance for each gene's expression:'' Variance is modeled to account for sampling error (important for genes with low read counts), increase power, and decrease false positives. Variance can be modeled with a normal, Poisson, or negative binomial distribution and is frequently decomposed into technical and biological variance.
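The depth and length adjustments in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool: the gene names, counts, and feature lengths are invented for the example, and the helper names `cpm`, `fpkm`, and `tpm` are hypothetical.

```python
def cpm(counts):
    """Counts per million: scale each count by the sample's total mapped counts."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def fpkm(counts, lengths_bp):
    """Per-million value divided by feature length in kilobases."""
    per_million = cpm(counts)
    return {gene: per_million[gene] / (lengths_bp[gene] / 1000) for gene in counts}


def tpm(counts, lengths_bp):
    """Rescale FPKM so that the values within one sample sum to one million."""
    values = fpkm(counts, lengths_bp)
    total = sum(values.values())
    return {gene: v / total * 1e6 for gene, v in values.items()}


# Hypothetical sample: raw fragment counts and feature lengths in base pairs.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 4000}

cpm_values = cpm(counts)
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

In this toy sample geneB has three times the counts of geneA but is also three times as long, so the two end up with the same FPKM and TPM, which is the effect the gene-length adjustment is meant to have.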

Spike-ins for absolute quantification and detection of genome-wide effects

RNA spike-ins are samples of RNA at known concentrations that can be used as gold standards in experimental design and during downstream analyses for absolute quantification and detection of genome-wide effects.

* ''Absolute quantification:'' Absolute quantification of gene expression is not possible with most RNA-Seq experiments, which quantify expression relative to all transcripts. It is possible by performing RNA-Seq with spike-ins, samples of RNA at known concentrations. After sequencing, read counts of spike-in sequences are used to determine the relationship between each gene's read counts and absolute quantities of biological fragments. In one example, this technique was used in ''Xenopus tropicalis'' embryos to determine transcription kinetics.
* ''Detection of genome-wide effects:'' Changes in global regulators including chromatin remodelers, transcription factors (e.g., MYC), acetyltransferase complexes, and nucleosome positioning are not congruent with normalization assumptions and spike-in controls can offer precise interpretation.
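As a rough illustration of absolute quantification with spike-ins, the sketch below fits a straight line between the log read counts of spike-in sequences and their known log input amounts, then uses that line to estimate the amount of an endogenous transcript. The spike-in amounts and counts are invented, the function names are hypothetical, and the sketch assumes a log-linear relationship while ignoring transcript length and other biases that a real analysis would need to handle.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical spike-ins: (known molecules added, observed read count).
spike_ins = [(1e2, 20), (1e3, 200), (1e4, 2000), (1e5, 20000)]

log_counts = [math.log10(count) for _, count in spike_ins]
log_molecules = [math.log10(molecules) for molecules, _ in spike_ins]
slope, intercept = fit_line(log_counts, log_molecules)


def estimate_molecules(read_count):
    """Map a read count to an estimated molecule number via the spike-in fit."""
    return 10 ** (slope * math.log10(read_count) + intercept)


estimated = estimate_molecules(500)
```

Fitting in log space is a design choice made here because spike-in mixes usually span several orders of magnitude; a fit on the raw scale would be dominated by the most abundant spike-ins.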

Differential expression

The simplest but often most powerful use of RNA-Seq is finding differences in gene expression between two or more conditions (''e.g.'', treated vs not treated); this process is called differential expression. The outputs are frequently referred to as differentially expressed genes (DEGs) and these genes can either be up- or down-regulated (''i.e.'', higher or lower in the condition of interest). There are many tools that perform differential expression. Most are run in R, Python, or the Unix command line. Commonly used tools include DESeq, edgeR, and voom+limma, all of which are available through R/Bioconductor. These are the common considerations when performing differential expression:

* ''Inputs:'' Differential expression inputs include (1) an RNA-Seq expression matrix (M genes x N samples) and (2) a design matrix containing experimental conditions for N samples. The simplest design matrix contains one column, corresponding to labels for the condition being tested. Other covariates (also referred to as factors, features, labels, or parameters) can include batch effects, known artifacts, and any metadata that might confound or mediate gene expression. In addition to known covariates, unknown covariates can also be estimated through unsupervised machine learning approaches including principal component, surrogate variable, and PEER analyses. Hidden variable analyses are often employed for human tissue RNA-Seq data, which typically have additional artifacts not captured in the metadata (''e.g.'', ischemic time, sourcing from multiple institutions, underlying clinical traits, collecting data across many years with many personnel).
* ''Methods:'' Most tools use regression or non-parametric statistics to identify differentially expressed genes, and are either based on read counts mapped to a reference genome (DESeq2, limma, edgeR) or based on read counts derived from alignment-free quantification (sleuth, Cuffdiff, Ballgown). Following regression, most tools employ either familywise error rate (FWER) or false discovery rate (FDR) p-value adjustments to account for multiple hypotheses (in human studies, ~20,000 protein-coding genes or ~50,000 biotypes).
* ''Outputs:'' A typical output consists of rows corresponding to the number of genes and at least three columns: each gene's log fold change (log-transform of the ratio in expression between conditions, a measure of effect size), p-value, and p-value adjusted for multiple comparisons. Genes are defined as biologically meaningful if they pass cut-offs for effect size (log fold change) and statistical significance. These cut-offs should ideally be specified ''a priori'', but the nature of RNA-Seq experiments is often exploratory so it is difficult to predict effect sizes and pertinent cut-offs ahead of time.
* ''Pitfalls:'' The raison d'être for these complex methods is to avoid the myriad of pitfalls that can lead to statistical errors and misleading interpretations. Pitfalls include increased false positive rates (due to multiple comparisons), sample preparation artifacts, sample heterogeneity (like mixed genetic backgrounds), highly correlated samples, unaccounted-for multi-level experimental designs, and poor experimental design. One notable pitfall is viewing results in Microsoft Excel without using the import feature to ensure that the gene names remain text. Although convenient, Excel automatically converts some gene names (''SEPT1'', ''DEC1'', ''MARCH2'') into dates or floating point numbers.
* ''Choice of tools and benchmarking:'' There are numerous efforts that compare the results of these tools, with DESeq2 tending to moderately outperform other methods. As with other methods, benchmarking consists of comparing tool outputs to each other and known gold standards.

Downstream analyses for a list of differentially expressed genes come in two flavors, validating observations and making biological inferences. Owing to the pitfalls of differential expression and RNA-Seq, important observations are replicated with (1) an orthogonal method in the same samples (like real-time PCR) or (2) another, sometimes pre-registered, experiment in a new cohort. The latter helps ensure generalizability and can typically be followed up with a meta-analysis of all the pooled cohorts. The most common method for obtaining higher-level biological understanding of the results is gene set enrichment analysis, although sometimes candidate gene approaches are employed. Gene set enrichment determines if the overlap between two gene sets is statistically significant, in this case the overlap between differentially expressed genes and gene sets from known pathways/databases (''e.g.'', Gene Ontology, KEGG, Human Phenotype Ontology) or from complementary analyses in the same data (like co-expression networks). Common tools for gene set enrichment include web interfaces (''e.g.'', ENRICHR, g:profiler, WEBGESTALT) and software packages. When evaluating enrichment results, one heuristic is to first look for enrichment of known biology as a sanity check and then expand the scope to look for novel biology.

file:Alt splicing bestiary2.jpg, Examples of alternative RNA splicing modes. Exons are represented as blue and yellow blocks, spliced introns as horizontal black lines connecting two exons, and exon-exon junctions as thin grey connecting lines between two exons.
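Two of the statistical steps described in this section, the FDR adjustment of p-values and the overlap test behind gene set enrichment, can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: the FDR adjustment shown is the Benjamini-Hochberg procedure, the enrichment test is a one-sided hypergeometric test, and the function names and example numbers are hypothetical rather than taken from any of the tools named above.

```python
from math import comb


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_p_value(n_universe, n_set, n_degs, n_overlap):
    """Probability of an overlap at least this large if DEGs were drawn at random."""
    max_overlap = min(n_set, n_degs)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(n_universe, n_degs)


# Hypothetical per-gene p-values from a differential expression test.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)

# Hypothetical enrichment: 10 genes measured, a pathway of 5 genes,
# 5 DEGs, all of which fall in the pathway.
pathway_p = enrichment_p_value(n_universe=10, n_set=5, n_degs=5, n_overlap=5)
```

The choice of gene universe (here `n_universe`) matters in practice: using only the genes that were actually tested, rather than every annotated gene, is the more conservative assumption.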

Alternative splicing

RNA splicing is integral to eukaryotes and contributes significantly to protein regulation and diversity, occurring in >90% of human genes. There are multiple alternative splicing modes: exon skipping (most common splicing mode in humans and higher eukaryotes), mutually exclusive exons, alternative donor or acceptor sites, intron retention (most common splicing mode in plants, fungi, and protozoa), alternative transcription start site (promoter), and alternative polyadenylation. One goal of RNA-Seq is to identify alternative splicing events and test if they differ between conditions. Long-read sequencing captures the full transcript and thus minimizes many of the issues in estimating isoform abundance, like ambiguous read mapping. For short-read RNA-Seq, there are multiple methods to detect alternative splicing that can be classified into three main groups:

* ''Count-based (also event-based, differential splicing):'' estimate exon retention. Examples are DEXSeq, MATS, and SeqGSEA.
* ''Isoform-based (also multi-read modules, differential isoform expression):'' estimate isoform abundance first, and then relative abundance between conditions. Examples are Cufflinks 2 and DiffSplice.
* ''Intron excision based:'' calculate alternative splicing using split reads. Examples are MAJIQ and Leafcutter.

Differential gene expression tools can also be used for differential isoform expression if isoforms are quantified ahead of time with other tools like RSEM.
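For the count-based group, one common way to express exon retention for an exon-skipping event is a "percent spliced in" (PSI) value computed from junction-spanning reads. The text above does not name this metric, so treat the sketch below as an assumption about how such an estimate could look: the function name, the averaging of the two inclusion junctions, and the read counts are all illustrative.

```python
def percent_spliced_in(upstream_junction, downstream_junction, skipping_junction):
    """Fraction of junction evidence that supports including the exon.

    Inclusion is supported by two junctions (into and out of the exon),
    so their counts are averaged to be comparable with the single
    skipping junction. Returns None when there is no junction evidence.
    """
    inclusion = (upstream_junction + downstream_junction) / 2
    total = inclusion + skipping_junction
    if total == 0:
        return None
    return inclusion / total


# Hypothetical junction read counts for one exon in two conditions.
control_psi = percent_spliced_in(30, 50, 10)
treated_psi = percent_spliced_in(10, 10, 30)

# A negative difference means the exon is skipped more often after treatment.
delta_psi = treated_psi - control_psi
```

A real analysis would add replicate-aware statistics on top of this point estimate, since a PSI computed from a handful of reads is very noisy.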

Coexpression networks

Coexpression networks are data-derived representations of genes behaving in a similar way across tissues and experimental conditions. Their main purpose lies in hypothesis generation and guilt-by-association approaches for inferring functions of previously unknown genes. RNA-Seq data has been used to infer genes involved in specific pathways based on Pearson correlation, both in plants and mammals. The main advantage of RNA-Seq data in this kind of analysis over the microarray platforms is the capability to cover the entire transcriptome, therefore allowing the possibility to unravel more complete representations of the gene regulatory networks. Differential regulation of the splice isoforms of the same gene can be detected and used to predict their biological functions. Weighted gene co-expression network analysis has been successfully used to identify co-expression modules and intramodular hub genes based on RNA-Seq data. Co-expression modules may correspond to cell types or pathways. Highly connected intramodular hubs can be interpreted as representatives of their respective module. An eigengene is a weighted sum of expression of all genes in a module. Eigengenes are useful biomarkers (features) for diagnosis and prognosis. Variance-stabilizing transformation approaches for estimating correlation coefficients based on RNA-Seq data have been proposed.
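A minimal sketch of the correlation step behind a coexpression network is shown below. It computes Pearson correlation between pairs of genes across samples and turns each correlation into a weighted edge by raising its absolute value to a power, which is the general idea of a weighted (soft-thresholded) network. The expression values, the exponent `beta=6`, and the function names are assumptions made for illustration; this is not the implementation of any specific package.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def adjacency(expression, beta=6):
    """Weighted, unsigned edges: |r| ** beta for every pair of genes."""
    genes = sorted(expression)
    return {
        (g, h): abs(pearson(expression[g], expression[h])) ** beta
        for g in genes
        for h in genes
        if g < h
    }


# Hypothetical normalized expression of four genes across five samples.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [3, 1, 4, 1, 5],
}

edges = adjacency(expression)
```

Because the network here is unsigned, perfectly anti-correlated genes (geneA and geneC) receive the same edge weight as perfectly correlated ones; a signed network would treat them differently.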

Variant discovery

RNA-Seq captures DNA variation, including single nucleotide variants, small insertions/deletions, and structural variation. Variant calling in RNA-Seq is similar to DNA variant calling and often employs the same tools (including SAMtools mpileup and GATK HaplotypeCaller) with adjustments to account for splicing. One unique dimension for RNA variants is allele-specific expression (ASE): the variants from only one haplotype might be preferentially expressed due to regulatory effects including imprinting, expression quantitative trait loci, and noncoding rare variants. Limitations of RNA variant identification include that it only reflects expressed regions (in humans, <5% of the genome), could be subject to biases introduced by data processing (e.g., de novo transcriptome assemblies underestimate heterozygosity), and has lower quality when compared to direct DNA sequencing.
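Allele-specific expression at a heterozygous site is often assessed by asking whether the reads carrying each allele depart from an even split. The sketch below uses an exact two-sided binomial test for that purpose. It is a simplified illustration: the read counts are invented, the function name is hypothetical, and the assumption of a 50:50 expectation ignores reference mapping bias, which real analyses usually correct for.

```python
from math import comb


def allelic_imbalance_p_value(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial test for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Intended for modest read depths (up to a few
    hundred reads), where the direct computation is numerically safe.
    """
    n = ref_reads + alt_reads
    p = expected_ref_fraction
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance so ties are not lost to rounding error.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-7)))


# Hypothetical heterozygous sites: (reference-allele reads, alternate-allele reads).
balanced_p = allelic_imbalance_p_value(5, 5)
imbalanced_p = allelic_imbalance_p_value(10, 0)
```

When many sites are tested, the resulting p-values would need the same multiple-comparison adjustment used for differential expression.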

RNA editing (post-transcriptional alterations)

Having the matching genomic and transcriptomic sequences of an individual can help detect post-transcriptional edits (RNA editing). A post-transcriptional modification event is identified if the gene's transcript has an allele/variant not observed in the genomic data.

file:RNA-Seq-fusion-gene.png, A gene fusion event and the behaviour of paired-end reads falling on both sides of the gene union. Gene fusions can occur in ''Trans'', between genes on separate chromosomes, or in ''Cis'', between two genes on the same chromosome.
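The RNA editing check described in this section, flagging transcript alleles that are absent from the matched genomic data, amounts to a per-site set difference. The sketch below shows that logic only; the data layout (dictionaries keyed by chromosome and position), the function name, and the example sites are assumptions for illustration, and real pipelines add filters for sequencing errors, mapping artifacts, and coverage.

```python
def candidate_edit_sites(dna_genotypes, rna_alleles):
    """Return sites where RNA shows an allele not present in the DNA genotype.

    dna_genotypes: {(chrom, pos): set of alleles called from genomic data}
    rna_alleles:   {(chrom, pos): set of alleles observed in RNA-Seq reads}
    Sites without a genomic call are skipped because they cannot be judged.
    """
    candidates = {}
    for site, observed in rna_alleles.items():
        genomic = dna_genotypes.get(site)
        if genomic is None:
            continue
        novel = observed - genomic
        if novel:
            candidates[site] = novel
    return candidates


# Hypothetical calls for one individual.
dna_genotypes = {
    ("chr1", 100): {"A"},       # homozygous A in the genome
    ("chr1", 200): {"C", "T"},  # heterozygous in the genome
}
rna_alleles = {
    ("chr1", 100): {"A", "G"},  # G appears only in RNA
    ("chr1", 200): {"C"},       # consistent with the genome
    ("chr2", 5): {"T"},         # no genomic call available
}

candidates = candidate_edit_sites(dna_genotypes, rna_alleles)
```

In this toy input only the first site is reported, because it is the only one where the transcript carries an allele that the genomic genotype does not explain.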

Fusion gene detection

Caused by different structural modifications in the genome, fusion genes have gained attention because of their relationship with cancer. The ability of RNA-Seq to analyze a sample's whole transcriptome in an unbiased fashion makes it an attractive tool to find these kinds of common events in cancer. The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion event; however, because the reads are short, this approach can be very noisy. An alternative approach is to use paired-end reads, in which a potentially large number of paired reads would map each end to a different exon, giving better coverage of these events (see figure). Nonetheless, the end result consists of multiple and potentially novel combinations of genes providing an ideal starting point for further validation.

Copy number alteration

Copy number alteration (CNA) analyses are commonly used in cancer studies. Gain and loss of genes have signalling pathway implications and are a key biomarker of molecular dysfunction in oncology. Calling CNAs from RNA-Seq data is not straightforward because differences in gene expression cause read depth to vary by different magnitudes across genes. Because of these difficulties, most of these analyses are done using whole-genome sequencing or whole-exome sequencing (WGS/WES), although advanced bioinformatics tools can call CNAs from RNA-Seq.

Other emerging analyses and applications

The applications of RNA-Seq continue to grow. Other new applications of RNA-Seq include detection of microbial contaminants, determining cell type abundance (cell type deconvolution), measuring the expression of transposable elements (TEs), and neoantigen prediction.

History

file:RNAseq over time (Pubmed).png, Pubmed manuscript matches highlight the growing popularity of RNA-Seq. Matches are for RNA-Seq (blue, search terms: "RNA Seq" OR "RNA-Seq" OR "RNA sequencing" OR "RNASeq") and RNA-Seq in medicine (gold, search terms: ("RNA Seq" OR "RNA-Seq" OR "RNA sequencing" OR "RNASeq") AND "Medicine"). The number of manuscripts on PubMed featuring RNA-Seq is still increasing.

RNA-Seq was first developed in the mid-2000s with the advent of next-generation sequencing technology. The first manuscripts that used RNA-Seq, even without using the term, include those on prostate cancer cell lines (dated 2006), ''Medicago truncatula'' (2006), maize (2007), and ''Arabidopsis thaliana'' (2007), while the term "RNA-Seq" itself was first mentioned in 2008. The number of manuscripts referring to RNA-Seq in the title or abstract (Figure, blue line) is continuously increasing, with 6754 manuscripts published in 2018. The intersection of RNA-Seq and medicine (Figure, gold line) has grown at a similar pace.

Applications to medicine

RNA-Seq has the potential to identify new disease biology, profile biomarkers for clinical indications, infer druggable pathways, and make genetic diagnoses. These results could be further personalized for subgroups or even individual patients, potentially highlighting more effective prevention, diagnostics, and therapy. The feasibility of this approach is in part dictated by costs in money and time; a related limitation is the required team of specialists (bioinformaticians, physicians/clinicians, basic researchers, technicians) to fully interpret the huge amount of data generated by this analysis.

Large-scale sequencing efforts

A lot of emphasis has been given to RNA-Seq data after the Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA) projects used this approach to characterize dozens of cell lines and thousands of primary tumor samples, respectively. ENCODE aimed to identify genome-wide regulatory regions in different cohorts of cell lines, and transcriptomic data are paramount to understanding the downstream effect of those epigenetic and genetic regulatory layers. TCGA, instead, aimed to collect and analyze thousands of patients' samples from 30 different tumor types to understand the underlying mechanisms of malignant transformation and progression. In this context RNA-Seq data provide a unique snapshot of the transcriptomic status of the disease and look at an unbiased population of transcripts, which allows the identification of novel transcripts, fusion transcripts and non-coding RNAs that could go undetected with different technologies.


External links

* : a high-level guide to designing and implementing an RNA-Seq experiment.