Whole Genome Shotgun

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun. The chain-termination method of DNA sequencing ("Sanger sequencing") can only be used for short DNA strands of 100 to 1000 base pairs. Due to this size limit, longer sequences are subdivided into smaller fragments that can be sequenced separately, and these sequences are assembled to give the overall sequence. In shotgun sequencing, DNA is broken up randomly into numerous small segments, which are sequenced using the chain-termination method to obtain ''reads''. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence. Shotgun sequencing was one of the precursor technologies that enabled whole genome sequencing.


Example

For example, consider two rounds of shotgun reads taken from the same sequence, with each round breaking the sequence at a different position. In this extremely simplified example, none of the reads cover the full length of the original sequence, but the four reads can be assembled into the original sequence using the overlap of their ends to align and order them. In reality, this process uses enormous amounts of information that are rife with ambiguities and sequencing errors. Assembly of complex genomes is additionally complicated by the great abundance of repetitive sequences, meaning similar short reads could come from completely different parts of the sequence. Many overlapping reads for each segment of the original DNA are necessary to overcome these difficulties and accurately assemble the sequence. For example, to complete the Human Genome Project, most of the human genome was sequenced at 12X or greater ''coverage''; that is, each base in the final sequence was present on average in 12 different reads. Even so, as of 2004, the methods then available had failed to isolate or assemble reliable sequence for approximately 1% of the (euchromatic) human genome.
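
The overlap-based reconstruction described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any real assembler: the 26-base sequence and the four reads are made up for the example, the function names `overlap` and `greedy_assemble` are hypothetical, and the sketch assumes error-free reads with no repeats, which real data never guarantees.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest end overlap."""
    # Drop duplicates and reads fully contained in another read.
    unique = list(dict.fromkeys(reads))
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Illustrative data (an assumption, not taken from the text):
original = "AGCATGCTGCAGTCATGCTTAGGCTA"
round_1 = ["AGCATGCTGCAGTCATGCT", "TAGGCTA"]   # first fragmentation
round_2 = ["AGCATG", "CTGCAGTCATGCTTAGGCTA"]   # second fragmentation
assembled = greedy_assemble(round_1 + round_2)
print(assembled)
```

Greedy merging is chosen here only because it is short. Repetitive sequence can mislead it, which is the difficulty the paragraph above describes.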


Whole genome shotgun sequencing


History

Whole genome shotgun sequencing for small (4000- to 7000-base-pair) genomes was first suggested in 1979. The first genome sequenced by shotgun sequencing was that of cauliflower mosaic virus, published in 1981.


Paired-end sequencing

Broader application benefited from pairwise end sequencing, known colloquially as ''double-barrel shotgun sequencing''. As sequencing projects began to take on longer and more complicated DNA sequences, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.


History

The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HGPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991. At the time, there was community consensus that the optimal fragment length for pairwise end sequencing would be three times the sequence read length. In 1995 Roach et al. introduced the innovation of using fragments of varying sizes, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the genome of the bacterium ''Haemophilus influenzae'' in 1995, and then by Celera Genomics to sequence the ''Drosophila melanogaster'' (fruit fly) genome in 2000, and subsequently the human genome.


Approach

To apply the strategy, a high-molecular-weight DNA strand is sheared into random fragments, size-selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector. The clones are then sequenced from both ends using the chain termination method, yielding two short sequences. Each sequence is called an ''end-read'' or ''read 1'' and ''read 2'', and two reads from the same clone are referred to as ''mate pairs''. Since the chain termination method can usually produce reads only between 500 and 1000 bases long, mate pairs will rarely overlap in all but the smallest clones.


Assembly

The original sequence is reconstructed from the reads using sequence assembly software. First, overlapping reads are collected into longer composite sequences known as ''contigs''. Contigs can be linked together into ''scaffolds'' by following connections between mate pairs. The distance between contigs can be inferred from the mate pair positions if the average fragment length of the library is known and has a narrow window of deviation. Depending on the size of the gap between contigs, different techniques can be used to find the sequence in the gaps. If the gap is small (5-20 kb), the region must be amplified by polymerase chain reaction (PCR) and then sequenced. If the gap is large (>20 kb), the large fragment is cloned in special vectors such as bacterial artificial chromosomes (BACs), followed by sequencing of the vector.
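
The gap inference described above amounts to simple arithmetic on mate-pair positions. The Python sketch below illustrates it under stated assumptions: the function names, the contig length, the 50 kb library size, and the link coordinates are all hypothetical. Both contigs are assumed to be in the same orientation, and the library's fragment length is assumed to be known and tightly distributed. The text names no technique for gaps under 5 kb, so the sketch leaves that case open.

```python
from statistics import mean


def estimate_gap(links, len_a, insert_size):
    """Estimate the gap (in bases) between contig A and the following contig B.

    links: one (start_in_a, end_in_b) tuple per mate pair, where the read on A
    starts at offset start_in_a and its mate on B ends at offset end_in_b.
    """
    estimates = []
    for start_in_a, end_in_b in links:
        on_a = len_a - start_in_a  # part of the fragment lying on contig A
        on_b = end_in_b            # part of the fragment lying on contig B
        estimates.append(insert_size - on_a - on_b)
    return mean(estimates)


def closing_strategy(gap_bp):
    """Map a gap size to the approach named in the text (thresholds from the text)."""
    if gap_bp > 20_000:
        return "clone into a BAC, then sequence"
    if gap_bp >= 5_000:
        return "amplify by PCR, then sequence"
    return "not specified in the text"


# Hypothetical mate-pair links from a 50 kb library spanning two contigs.
links = [(170_000, 12_000), (165_000, 7_500), (180_000, 21_500)]
gap = estimate_gap(links, len_a=200_000, insert_size=50_000)
print(gap, closing_strategy(gap))
```

Averaging over several links is a simplification. The estimate is only as good as the spread of fragment lengths in the library, which is why the text requires a narrow window of deviation.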


Pros and cons

Proponents of this approach argue that it is possible to sequence the whole genome at once using large arrays of sequencers, which makes the whole process much more efficient than more traditional approaches. Detractors argue that although the technique quickly sequences large regions of DNA, its ability to correctly link these regions is suspect, particularly for genomes with repeating regions. As sequence assembly programs become more sophisticated and computing power becomes cheaper, it may be possible to overcome this limitation.


Coverage

Coverage (read depth or depth) is the average number of reads representing a given nucleotide in the reconstructed sequence. It can be calculated from the length of the original genome (''G''), the number of reads (''N''), and the average read length (''L'') as ''N'' × ''L'' / ''G''. For example, a hypothetical genome with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 2× redundancy (8 × 500 / 2,000 = 2). This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called coverage). A high coverage in shotgun sequencing is desired because it can overcome errors in base calling and assembly. The subject of DNA sequencing theory addresses the relationships of such quantities. Sometimes a distinction is made between ''sequence coverage'' and ''physical coverage''. Sequence coverage is the average number of times a base is read (as described above). Physical coverage is the average number of times a base is read or spanned by mate paired reads.
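
The coverage formula and the quantities derived from it can be checked with a short Python sketch. The function names are hypothetical. The fraction-covered estimate uses the common Poisson (Lander-Waterman style) approximation 1 − e^(−c), which is an assumption brought in for illustration and is not stated in the text. It treats read start positions as uniformly random and ignores edge effects.

```python
import math


def sequence_coverage(num_reads, read_len, genome_len):
    """Average depth: N * L / G."""
    return num_reads * read_len / genome_len


def physical_coverage(num_pairs, fragment_len, genome_len):
    """Average number of times a base is read or spanned by a mate pair."""
    return num_pairs * fragment_len / genome_len


def expected_fraction_covered(depth):
    """Poisson approximation (assumption): chance a base is hit by >= 1 read."""
    return 1.0 - math.exp(-depth)


depth = sequence_coverage(num_reads=8, read_len=500, genome_len=2_000)
print(depth)                              # 2.0, the 2x example from the text
print(expected_fraction_covered(depth))   # roughly 0.86 under the Poisson model
print(expected_fraction_covered(12))      # very close to 1 at 12x
```

Under this approximation, 2× depth leaves a noticeable fraction of bases unread. That is one way to see why projects aim for much higher depth.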


Hierarchical shotgun sequencing

Although shotgun sequencing can in theory be applied to a genome of any size, its direct application to the sequencing of large genomes (for instance, the human genome) was limited until the late 1990s, when technological advances made practical the handling of the vast quantities of complex data involved in the process (Dunham, I. ''Genome Sequencing''. Encyclopedia of Life Sciences, 2005). Historically, full-genome shotgun sequencing was believed to be limited by both the sheer size of large genomes and the complexity added by the high percentage of repetitive DNA (greater than 50% for the human genome) present in large genomes (Venter, J. C. "Shotgunning the Human Genome: A Personal View." Encyclopedia of Life Sciences, 2006). It was not widely accepted that a full-genome shotgun sequence of a large genome would provide reliable data. For these reasons, other strategies that lowered the computational load of sequence assembly had to be used before shotgun sequencing was performed.

In hierarchical sequencing, also known as top-down sequencing, a low-resolution physical map of the genome is made prior to actual sequencing. From this map, a minimal number of fragments that cover the entire chromosome are selected for sequencing (Gibson, G. and Muse, S. V. ''A Primer of Genome Science''. 3rd ed., p. 84). In this way, the minimum amount of high-throughput sequencing and assembly is required.

The amplified genome is first sheared into larger pieces (50-200 kb) and cloned into a bacterial host using BACs or P1-derived artificial chromosomes (PACs). Because multiple genome copies have been sheared at random, the fragments contained in these clones have different ends, and with enough coverage (see section above) finding a scaffold of BAC contigs that covers the entire genome is theoretically possible. This scaffold is called a tiling path.

Once a tiling path has been found, the BACs that form this path are sheared at random into smaller fragments and can be sequenced using the shotgun method on a smaller scale. Although the full sequences of the BAC contigs are not known, their orientations relative to one another are known. There are several methods for deducing this order and selecting the BACs that make up a tiling path. The general strategy involves identifying the positions of the clones relative to one another and then selecting the fewest clones required to form a contiguous scaffold that covers the entire area of interest. The order of the clones is deduced by determining the way in which they overlap (Dear, P. H. ''Genome Mapping''. Encyclopedia of Life Sciences, 2005).

Overlapping clones can be identified in several ways. A small radioactively or chemically labeled probe containing a sequence-tagged site (STS) can be hybridized onto a microarray upon which the clones are printed. In this way, all the clones that contain a particular sequence in the genome are identified. The end of one of these clones can then be sequenced to yield a new probe and the process repeated in a method called chromosome walking. Alternatively, the BAC library can be restriction-digested. Two clones that have several fragment sizes in common are inferred to overlap because they contain multiple similarly spaced restriction sites in common. This method of genomic mapping is called restriction fingerprinting because it identifies a set of restriction sites contained in each clone. Once the overlap between the clones has been found and their order relative to the genome is known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-sequenced.

Because it involves first creating a low-resolution map of the genome, hierarchical shotgun sequencing is slower than whole-genome shotgun sequencing, but it relies less heavily on computer algorithms. The extensive BAC library creation and tiling path selection, however, make it slow and labor-intensive. Now that the technology is available and the reliability of the data has been demonstrated, the speed and cost efficiency of whole-genome shotgun sequencing have made it the primary method for genome sequencing.
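
Once clone positions on the physical map are known, selecting "the fewest clones required to form a contiguous scaffold" is an interval-covering problem. The Python sketch below shows a standard greedy solution. It is an illustration only: the function name `minimal_tiling_path`, the clone names, and the map coordinates (in kb) are hypothetical. It also assumes exact clone positions, whereas real maps built from fingerprints or STS hits give only approximate ones.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones covering [region_start, region_end).

    clones: iterable of (name, start, end) map positions.
    Greedy rule: among clones starting at or before the current position,
    take the one that reaches furthest.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path, pos, i = [], region_start, 0
    while pos < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            raise ValueError(f"no clone covers map position {pos}")
        path.append(best[0])
        pos = best[2]
    return path


# Hypothetical BAC library with map positions in kb.
library = [
    ("BAC1", 0, 150), ("BAC2", 40, 180), ("BAC3", 120, 300),
    ("BAC4", 160, 280), ("BAC5", 250, 420), ("BAC6", 290, 400),
]
tiling_path = minimal_tiling_path(library, 0, 400)
print(tiling_path)
```

The greedy choice gives a minimum-size cover for intervals on a line. Only the clones on the returned path would then be shotgun-sequenced individually.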


Newer sequencing technologies

Classical shotgun sequencing was based on the Sanger sequencing method, which was the most advanced technique for sequencing genomes from about 1995 to 2005. The shotgun strategy is still applied today, but with other sequencing technologies, such as short-read sequencing and long-read sequencing. Short-read or "next-gen" sequencing produces shorter reads (anywhere from 25 to 500 bp) but many hundreds of thousands or millions of reads in a relatively short time (on the order of a day). This results in high coverage, but the assembly process is much more computationally intensive. These technologies far outperform Sanger sequencing in throughput, owing to the high volume of data and the relatively short time it takes to sequence a whole genome.


Metagenomic shotgun sequencing

Reads of 400-500 base pairs are long enough to determine the species or strain of the organism the DNA comes from, provided its genome is already known, for example by using ''k''-mer based taxonomic classifier software. With millions of reads from next generation sequencing of an environmental sample, it is possible to get a complete overview of any complex microbiome with thousands of species, like the gut flora. Advantages over 16S rRNA amplicon sequencing are: not being limited to bacteria; strain-level classification where amplicon sequencing only gets the genus; and the possibility to extract whole genes and specify their function as part of the metagenome. The sensitivity of metagenomic sequencing makes it an attractive choice for clinical use. That same sensitivity, however, makes contamination of the sample or the sequencing pipeline a more serious problem.
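
The idea behind ''k''-mer based classification can be illustrated with a toy Python sketch. It does not reproduce any particular classifier: the function names, the two short "reference genomes", and the choice of ''k'' = 4 are assumptions made to keep the example readable. Real tools use much longer ''k''-mers, handle reverse complements, and resolve shared ''k''-mers through a taxonomy, all of which is omitted here.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon with the most taxon-specific k-mer hits."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # ignore k-mers shared between taxa
            votes[next(iter(taxa))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy references (hypothetical sequences, far shorter than real genomes).
references = {
    "taxon_A": "ACGTACGGTCAAGT",
    "taxon_B": "TTGACCATGGCTTA",
}
K = 4
index = build_index(references, K)
print(classify("CGGTCAA", index, K))   # read drawn from taxon_A
print(classify("CCATGGC", index, K))   # read drawn from taxon_B
print(classify("GGGGGGG", index, K))   # no known k-mers
```

A read with no matching ''k''-mers is left unclassified, which mirrors the text's condition that the organism's genome must already be known.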


See also

* Clinical metagenomic sequencing
* DNA sequencing theory

