Digital Transcriptome Subtraction

Digital transcriptome subtraction (DTS) is a bioinformatics method to detect the presence of novel pathogen transcripts through computational removal of the host sequences. DTS is the direct ''in silico'' analogue of the wet-lab approach representational difference analysis (RDA), and is made possible by unbiased high-throughput sequencing and the availability of a high-quality, annotated reference genome of the host. The method specifically examines the etiological agent of infectious diseases and is best known for discovering Merkel cell polyomavirus, the suspected causative agent in Merkel-cell carcinoma.


History

Using computational subtraction to discover novel pathogens was first proposed in 2002 by Meyerson et al. using human expressed sequence tag (EST) datasets. In a proof of principle experiment, Meyerson et al. demonstrated that the approach was feasible using Epstein–Barr virus-infected lymphocytes in post-transplant lymphoproliferative disorder (PTLD). In 2007, the term "Digital Transcriptome Subtraction" was coined by the Chang-Moore group, and the method was used to discover Merkel cell polyomavirus (MCV) in Merkel-cell carcinoma. Around the same time as the MCV discovery, this approach was used to implicate a novel arenavirus as the cause of death in a case where three patients died of similar illnesses shortly after receiving organ transplants from a single donor.


Method


Construction of cDNA library

After treatment with DNase I to eliminate human genomic DNA, total RNA is extracted from primary infected tissue. Messenger RNA is then purified using an oligo-dT column that binds to the poly-A tail, a signal found specifically on transcripts of expressed genes. Using random hexamer priming, reverse transcriptase (RT) converts the mRNA into cDNA, which is cloned into bacterial vectors. Bacteria, usually ''E. coli'', are then transformed with the cDNA vectors and selected using a marker; the collection of transformed clones is the cDNA library. This generates a snapshot of tissue mRNA that is stable and can be sequenced at a later stage.


Sequencing and quality control

The cDNA library must be sequenced to great depth (i.e. a large number of clones sequenced) in order to detect a theoretical rare pathogen sequence (Table 1), especially if the foreign sequence is novel. The Chang-Moore group recommends a sequencing depth of 200,000 transcripts or greater using multiple sequencing platforms. Stringent quality controls are then applied to the raw sequences to minimize false-positive results. The initial quality screen uses several general parameters to exclude ambiguous sequences, leaving behind a dataset of high-fidelity (Hi-Fi) reads.
* Low Phred score - a cutoff is used to remove low-quality end sequences. Typically, a Phred score cutoff of 20 or 30 is used to ensure 99% to 99.9% accuracy in each base call.
* Vector and adaptor removal.
* Low complexity - the complexity score of a sequence reflects the number of identical bases in a series (homopolymers), such as poly-dT or poly-dA.
* Human repetitive DNA.
* Length - this parameter depends on the optimized read length specific to the sequencing technology that was used.
* BLAST against and exclude ''E. coli'' genome sequences.
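
The Phred and low-complexity criteria above can be illustrated with a minimal Python sketch. It relies on the standard definition of a Phred score Q as an error probability of 10^(-Q/10), which is why cutoffs of 20 and 30 correspond to 99% and 99.9% base-call accuracy. The function names, the default `min_length` of 50 and the `max_homopolymer_fraction` of 0.5 are illustrative assumptions, not parameters taken from the published method.

```python
def phred_to_error_probability(q):
    """Error probability implied by a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, cutoff=20):
    """Drop bases from both ends until a base with quality >= cutoff is met."""
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end], quals[start:end]


def longest_homopolymer_fraction(seq):
    """Fraction of the read covered by its longest run of one repeated base."""
    if not seq:
        return 0.0
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest / len(seq)


def passes_quality_screen(seq, quals, cutoff=20, min_length=50,
                          max_homopolymer_fraction=0.5):
    """Hypothetical Hi-Fi screen: trim ends, then check length and complexity."""
    trimmed, _ = trim_low_quality_ends(seq, quals, cutoff)
    return (len(trimmed) >= min_length
            and longest_homopolymer_fraction(trimmed) <= max_homopolymer_fraction)
```

A production pipeline would normally use dedicated trimming and masking tools; the sketch only shows how the individual criteria translate into simple checks on a read and its per-base quality scores.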


BLAST to host genome

Using MEGABLAST, Hi-Fi reads are then matched to sequences in annotated databases, and any positive matches are subtracted from the dataset. The minimum hit length for a positive match to human sequence is typically 30 consecutive identical bases, which equates to a BLAST score of 60. The remaining sequences are generally searched again with less stringent parameters to allow for slight mismatches (1 in 20 nucleotides). The vast majority of sequences (>99%) should be removed from the dataset at this stage. Subtracted sequences typically include:
* Reference human transcriptome - eliminates any known human transcripts from expression library sets.
* Reference human genome - eliminates genes that have been missed by the annotation process and any genomic sequences that contaminated the cDNA library during construction.
* Mitochondrial DNA - mitochondrial sequences are highly abundant and polymorphic due to a rapid mutation rate.
* Immunoglobulin region - the immunoglobulin loci are highly polymorphic and would otherwise yield false positives due to poor alignment to the reference genome.
* Other vertebrate sequences.
* Unannotated sequences.
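
The first, stringent subtraction pass can be pictured with a toy Python sketch. It is not MEGABLAST: it substitutes an exact k-mer lookup for alignment and treats a read as host-derived if it shares any run of 30 consecutive identical bases with a host sequence on either strand. The function names and the in-memory set index are assumptions made for illustration, and the approach would not scale to a full human reference.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all length-k substrings; empty if the sequence is shorter than k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences, k=30):
    """Index every k-mer of the host sequences on both strands."""
    index = set()
    for seq in host_sequences:
        seq = seq.upper()
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq), k)
    return index


def subtract_host_reads(reads, host_index, k=30):
    """Keep only reads that share no exact k-mer with the host index.

    Reads shorter than k have no k-mers and are therefore kept; a length
    filter is assumed to have been applied beforehand.
    """
    return [r for r in reads if kmers(r.upper(), k).isdisjoint(host_index)]
```

For example, `subtract_host_reads(reads, build_host_index(host_transcripts))` returns the candidate "non-host" reads, with `host_transcripts` standing in for whichever reference sets from the list above are being subtracted. The second, less stringent pass that tolerates about 1 mismatch in 20 nucleotides needs a real aligner and is not covered by this sketch.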


Analysis of "non-host" candidates


Alignment to pathogen databases

After stringent rounds of subtraction, the remaining sequences are clustered into non-redundant contigs and aligned to known pathogen sequences using low-stringency parameters. As pathogen genomes mutate quickly, nucleotide-nucleotide alignment (blastn) is usually uninformative, because codon degeneracy allows mutations at certain bases without changing the encoded amino acid residue. The preferred alignment method is blastx, which matches the ''in silico'' translations of all six reading frames to the amino acid sequences of annotated proteins, as it increases the likelihood of identifying a novel pathogen by matching to a related strain or species. Experimental extension of candidate sequences might also be used at this stage to maximize the chances of a positive match.
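
The six-frame translation behind a blastx-style search can be sketched in Python as follows. The sketch uses the standard genetic code and only produces the six translations; it performs no database search. The function names and the frame labels (`+1` to `-3`) are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translations of the three forward and three reverse-strand frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq.upper()))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Codon degeneracy is visible directly in the table: `translate("GCT")` and `translate("GCC")` both give alanine, so two sequences can differ at the nucleotide level while encoding the same protein, which is why a protein-level comparison is more sensitive for divergent pathogens.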


''De novo'' assembly

In cases where alignment to known pathogens is uninformative or ambiguous, contigs of candidate sequence can be used as templates for primer walking in primary infected tissue to generate the complete pathogen genome sequence. As viral transcripts are exceedingly rare relative to tissue mRNA (about 10 transcripts in 1 million), a complete pathogen transcriptome is unlikely to be assembled from the original candidate sequences alone because of low coverage.
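
The effect of such low abundance on coverage can be estimated with a short Python sketch. It assumes that each sequenced transcript is an independent draw and that pathogen transcripts make up a fixed fraction of the library, which is a simplification. The abundance of 10 per million and the depth of 200,000 come from the text; the function names are illustrative.

```python
import math


def expected_pathogen_reads(depth, abundance):
    """Expected number of pathogen reads among `depth` sequenced transcripts."""
    return depth * abundance


def detection_probability(depth, abundance):
    """P(at least one pathogen read) = 1 - (1 - abundance) ** depth."""
    return -math.expm1(depth * math.log1p(-abundance))


def depth_for_probability(target, abundance):
    """Smallest depth at which the detection probability reaches `target`."""
    return math.ceil(math.log1p(-target) / math.log1p(-abundance))


abundance = 10 / 1_000_000   # about 10 pathogen transcripts per million
depth = 200_000              # recommended minimum sequencing depth

print(expected_pathogen_reads(depth, abundance))
print(detection_probability(depth, abundance))
print(depth_for_probability(0.95, abundance))
```

Under these assumptions only about two pathogen reads are expected at the recommended depth, which is enough to flag a candidate but far too few to assemble a transcriptome, consistent with the need for primer walking.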


Validation of pathogen

Once a putative pathogen has been identified in the high-throughput sequencing data, it is imperative to validate the presence of the pathogen in infected patients using more sensitive techniques, such as:
* RT-PCR and derivative methods, including 3'- and 5'-RACE, to confirm the existence of pathogen mRNA.
* Immunohistochemistry using antibodies against a related pathogen to determine whether the pathogen is present in tissues.
* Serological tests to measure the pathogen-specific antibody titer.
* Bacterial culture or viral culture, which is considered the gold standard in laboratory diagnosis.


Applications

The primary application of DTS lies in the identification of pathogenic viruses in cancer. It can also be used to identify viral pathogens in non-cancer-related disease. Future clinical applications could include the routine use of DTS in individual patients. DTS could also be applied to agriculture, identifying pathogens that affect output. Computational subtraction has already been used in a metagenomics study that associated viral infection by Israeli acute paralysis virus (IAPV) with colony collapse disorder in honey bees.


Advantages

* Requires no prior knowledge about pathogen sequence.
* Can identify previously unassociated, potentially treatable pathogens.
* Uses already available molecular methods and resources.


Disadvantages

* Identifies the presence of a pathogen but does not establish a causal link to disease. See Koch's postulates and the Bradford Hill criteria.
* Must have a highly reliable, complete reference transcriptome for the organism being studied.
* Lack of foreign sequence identification cannot entirely exclude a pathogenic foreign body.

