Whole genome sequencing

Whole genome sequencing (WGS), also known as full genome sequencing, complete genome sequencing, or entire genome sequencing, is the process of determining the entirety, or nearly the entirety, of the DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast. Whole genome sequencing has largely been used as a research tool, but began to be introduced into clinics in 2014. In the future of personalized medicine, whole genome sequence data may be an important tool to guide therapeutic intervention. Gene sequencing at the SNP level is also used to pinpoint functional variants from association studies and to improve the knowledge available to researchers interested in evolutionary biology, and hence may lay the foundation for predicting disease susceptibility and drug response.

Whole genome sequencing should not be confused with DNA profiling, which only determines the likelihood that genetic material came from a particular individual or group, and does not contain additional information on genetic relationships, origin or susceptibility to specific diseases. In addition, whole genome sequencing should not be confused with methods that sequence specific subsets of the genome; such methods include whole exome sequencing (1–2% of the genome) and SNP genotyping (< 0.1% of the genome).


History

The DNA sequencing methods used in the 1970s and 1980s were manual; examples include Maxam–Gilbert sequencing and Sanger sequencing. Several whole bacteriophage and animal viral genomes were sequenced by these techniques, but the shift to more rapid, automated sequencing methods in the 1990s facilitated the sequencing of the larger bacterial and eukaryotic genomes.

The first virus to have its complete genome sequenced was bacteriophage MS2, in 1976. In 1992, yeast chromosome III was the first chromosome of any organism to be fully sequenced. The first organism whose entire genome was fully sequenced was ''Haemophilus influenzae'' in 1995. After it, the genomes of other bacteria and some archaea were sequenced, largely because of their small genome size. ''H. influenzae'' has a genome of 1,830,140 base pairs of DNA. In contrast, eukaryotes, both unicellular and multicellular, such as ''Amoeba dubia'' and humans (''Homo sapiens'') respectively, have much larger genomes (see C-value paradox). ''Amoeba dubia'' has a genome of 700 billion nucleotide pairs spread across thousands of chromosomes. Humans have fewer nucleotide pairs than ''A. dubia'' (about 3.2 billion in each germ cell; the exact size of the human genome is still being revised), but their genome is still far larger than that of any individual bacterium.

The first bacterial and archaeal genomes, including that of ''H. influenzae'', were sequenced by shotgun sequencing. In 1996 the first eukaryotic genome (''Saccharomyces cerevisiae'') was sequenced. ''S. cerevisiae'', a model organism in biology, has a genome of only around 12 million nucleotide pairs, and was the first ''unicellular'' eukaryote to have its whole genome sequenced. The first ''multicellular'' eukaryote, and animal, to have its whole genome sequenced was the nematode worm ''Caenorhabditis elegans'' in 1998. Eukaryotic genomes are sequenced by several methods, including shotgun sequencing of short DNA fragments and sequencing of larger DNA clones from DNA libraries such as bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs).

In 1999, the entire DNA sequence of human chromosome 22, the second-shortest human autosome, was published. By 2000, the second animal and second invertebrate (yet first insect) genome had been sequenced: that of the fruit fly ''Drosophila melanogaster'', a popular choice of model organism in experimental research. The first plant genome, that of the model organism ''Arabidopsis thaliana'', was also fully sequenced by 2000. By 2001, a draft of the entire human genome sequence was published. The genome of the laboratory mouse ''Mus musculus'' was completed in 2002. In 2004, the Human Genome Project published an incomplete version of the human genome. In 2008, a group from Leiden, the Netherlands, reported the sequencing of the first female human genome (Marjolein Kriek). Thousands of genomes have since been wholly or partially sequenced.


Experimental details


Cells used for sequencing

Almost any biological sample containing a full copy of the DNA, even a very small amount of DNA or ancient DNA, can provide the genetic material necessary for full genome sequencing. Such samples may include saliva, epithelial cells, bone marrow, hair (as long as the hair contains a hair follicle), seeds, plant leaves, or anything else that has DNA-containing cells.

The genome sequence of a single cell selected from a mixed population of cells can be determined using techniques of ''single cell genome sequencing''. This has important advantages in environmental microbiology in cases where a single cell of a particular microorganism species can be isolated from a mixed population by microscopy on the basis of its morphological or other distinguishing characteristics. In such cases the normally necessary steps of isolation and growth of the organism in culture may be omitted, thus allowing the sequencing of a much greater spectrum of organism genomes.

Single cell genome sequencing is being tested as a method of preimplantation genetic diagnosis, wherein a cell from the embryo created by in vitro fertilization is taken and analyzed before embryo transfer into the uterus. After implantation, cell-free fetal DNA can be obtained by simple venipuncture from the mother and used for whole genome sequencing of the fetus.


Early techniques

Sequencing of nearly an entire human genome was first accomplished in 2000, partly through the use of shotgun sequencing technology. While full genome shotgun sequencing for small (4000–7000 base pair) genomes was already in use in 1979, broader application benefited from pairwise end sequencing, known colloquially as ''double-barrel shotgun sequencing''. As sequencing projects began to take on longer and more complicated genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.

The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991. In 1995 the innovation of using fragments of varying sizes was introduced, demonstrating that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire genome of the bacterium ''Haemophilus influenzae'' in 1995, and then by Celera Genomics to sequence the entire fruit fly genome in 2000, and subsequently the entire human genome. Applied Biosystems, later part of Life Technologies, manufactured the automated capillary sequencers used by both Celera Genomics and the Human Genome Project.
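The geometry of a read pair described above can be made concrete with a small sketch. It is a toy model under stated assumptions: the genome is a random string, the function names `reverse_complement` and `paired_end_reads` are hypothetical, and read errors and variable fragment sizes are ignored.

```python
# Toy illustration of paired-end ("double-barrel") reads. All names here are
# hypothetical and the genome is random; this is not a real sequencing pipeline.
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def paired_end_reads(genome: str, start: int, fragment_len: int, read_len: int):
    """Read both ends of one fragment: one read per strand, pointing inward."""
    fragment = genome[start:start + fragment_len]
    forward_read = fragment[:read_len]                       # left end, top strand
    reverse_read = reverse_complement(fragment[-read_len:])  # right end, bottom strand
    return forward_read, reverse_read


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200))
fwd, rev = paired_end_reads(genome, start=40, fragment_len=60, read_len=15)

# Un-flipping the second read recovers the right-hand end of the fragment,
# so the outer distance between the two reads equals the fragment length.
right_end = reverse_complement(rev)
print(fwd, "matches genome[40:55]:", fwd == genome[40:55])
print(right_end, "matches genome[85:100]:", right_end == genome[85:100])
```

An assembler that knows the approximate fragment length can use this constraint to order and orient contigs even when the middle of the fragment was never read.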


Current techniques

While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still too expensive and takes too long for commercial purposes. Since 2005 capillary sequencing has been progressively displaced by high-throughput (formerly "next-generation") sequencing technologies such as Illumina dye sequencing, pyrosequencing, and SMRT sequencing. All of these technologies continue to employ the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation. Other technologies have emerged, including Nanopore technology. Though the sequencing accuracy of Nanopore technology is lower than that of the technologies above, its read length is on average much longer. The generation of long reads is especially valuable in ''de novo'' whole-genome sequencing applications.


Analysis

In principle, full genome sequencing can provide the raw nucleotide sequence of an individual organism's DNA at a single point in time. However, further analysis must be performed to provide the biological or medical meaning of this sequence, such as how this knowledge can be used to help prevent disease. Methods for analyzing sequencing data are being developed and refined. Because sequencing generates a large volume of data (for example, there are approximately six billion base pairs in each human diploid genome), its output is stored electronically and requires a large amount of computing power and storage capacity. While analysis of WGS data can be slow, it is possible to speed up this step by using dedicated hardware.
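A back-of-the-envelope estimate gives a feel for the storage involved. The sketch below is illustrative only: the six billion base-pair figure comes from the text, while the 2-bits-per-base packing, the one-byte-per-base text encoding, and the model of raw reads as one byte for the base plus one byte for a quality score at 30× depth over a 3.2-billion-base haploid genome are assumptions, and the function names are hypothetical.

```python
# Rough storage estimates for one human genome. Encodings are assumptions,
# not a description of any particular file format.

DIPLOID_BASES = 6_000_000_000   # figure quoted in the text
HAPLOID_BASES = 3_200_000_000   # approximate haploid size quoted earlier


def sequence_bytes(num_bases: int, bits_per_base: int = 2) -> float:
    """Bytes needed to store a finished sequence at a given bit width."""
    return num_bases * bits_per_base / 8


def raw_read_bytes(haploid_bases: int, depth: int, bytes_per_base: int = 2) -> int:
    """Bytes of uncompressed reads: each base is sequenced `depth` times and
    stored with a quality score (assumed: 1 byte base + 1 byte quality)."""
    return haploid_bases * depth * bytes_per_base


packed = sequence_bytes(DIPLOID_BASES)                    # 4 letters fit in 2 bits
as_text = sequence_bytes(DIPLOID_BASES, bits_per_base=8)  # one ASCII character per base
raw = raw_read_bytes(HAPLOID_BASES, depth=30)

print(f"packed sequence : {packed / 1e9:.1f} GB")
print(f"plain text      : {as_text / 1e9:.1f} GB")
print(f"raw 30x reads   : {raw / 1e9:.0f} GB")
```

Under these assumptions the finished sequence is small; it is the redundant raw reads, needed to reach a reliable consensus, that dominate storage and compute.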


Commercialization

A number of public and private companies are competing to develop a full genome sequencing platform that is commercially robust for both research and clinical use, including Illumina, Knome, Sequenom, 454 Life Sciences, Pacific Biosciences, Complete Genomics, Helicos Biosciences, GE Global Research (General Electric), Affymetrix, IBM, Intelligent Bio-Systems, Life Technologies, Oxford Nanopore Technologies, and the Beijing Genomics Institute. These companies are heavily financed and backed by venture capitalists, hedge funds, and investment banks. A commonly referenced commercial target for sequencing cost until the late 2010s was US$1,000; private companies are now working toward a new target of only $100.


Incentive

In October 2006, the X Prize Foundation, working in collaboration with the J. Craig Venter Science Foundation, established the Archon X Prize for Genomics, intending to award $10 million to "the first team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $1,000 per genome". The Archon X Prize for Genomics was cancelled in 2013, before its official start date.


History

In 2007, Applied Biosystems started selling a new type of sequencer called the SOLiD System. The technology allowed users to sequence 60 gigabases per run. In June 2009, Illumina announced that it was launching its own Personal Full Genome Sequencing Service at a depth of 30× for $48,000 per genome. In August 2009, the founder of Helicos Biosciences, Stephen Quake, stated that using the company's Single Molecule Sequencer he had sequenced his own full genome for less than $50,000. In November 2009, Complete Genomics published a peer-reviewed paper in ''Science'' demonstrating its ability to sequence a complete human genome for $1,700. In May 2011, Illumina lowered the price of its Full Genome Sequencing service to $5,000 per human genome, or $4,000 if ordering 50 or more. Helicos Biosciences, Pacific Biosciences, Complete Genomics, Illumina, Sequenom, ION Torrent Systems, Halcyon Molecular, NABsys, IBM, and GE Global all appeared to be competing head to head in the race to commercialize full genome sequencing.

With sequencing costs declining, a number of companies began claiming that their equipment would soon achieve the $1,000 genome: these companies included Life Technologies in January 2012, Oxford Nanopore Technologies in February 2012, and Illumina in February 2014. In 2015, the NHGRI estimated the cost of obtaining a whole-genome sequence at around $1,500. In 2016, Veritas Genetics began selling whole genome sequencing, including a report on some of the information in the sequence, for $999. In 2017, BGI began offering WGS for $600. In summer 2019, Veritas Genetics cut the cost of WGS to $599. However, in 2015 some noted that effective use of whole genome sequencing can cost considerably more than $1,000. Also, parts of the human genome reportedly had still not been fully sequenced as of 2017.


Comparison with other technologies


DNA microarrays

Full genome sequencing provides information on a genome that is orders of magnitude larger than that provided by DNA arrays, the previous leader in genotyping technology. For humans, DNA arrays currently provide genotypic information on up to one million genetic variants, while full genome sequencing will provide information on all six billion bases in the human genome, or roughly 6,000 times more data. Because of this, full genome sequencing is considered a disruptive innovation to the DNA array markets: the accuracy of both ranges from 99.98% to 99.999% (in non-repetitive DNA regions), and its consumables cost of $5,000 per 6 billion base pairs is competitive (for some applications) with DNA arrays ($500 per 1 million base pairs).
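The comparison can be checked with simple arithmetic on the site counts and consumables costs quoted in this section. The snippet below is only a sketch; the constant names are made up for illustration and the costs are the article's figures, not current prices.

```python
# Sites covered and cost per site, using the figures quoted in this section.
WGS_SITES = 6_000_000_000     # bases covered by full genome sequencing
ARRAY_SITES = 1_000_000       # variants covered by a DNA array
WGS_COST_USD = 5_000          # consumables cost quoted for sequencing
ARRAY_COST_USD = 500          # consumables cost quoted for an array

data_ratio = WGS_SITES // ARRAY_SITES                  # how many times more sites
wgs_cost_per_site = WGS_COST_USD / WGS_SITES
array_cost_per_site = ARRAY_COST_USD / ARRAY_SITES
cost_ratio = array_cost_per_site / wgs_cost_per_site   # array cost relative to WGS, per site

print(f"sites covered, WGS vs array : {data_ratio}x")
print(f"WGS cost per base           : ${wgs_cost_per_site:.2e}")
print(f"array cost per variant      : ${array_cost_per_site:.2e}")
print(f"array is ~{cost_ratio:.0f}x more expensive per site")
```

On a per-site basis sequencing is far cheaper, although the absolute cost of one array remains lower, which is why arrays stay competitive for some applications.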


Applications


Mutation frequencies

Whole genome sequencing has established the mutation frequency for whole human genomes. The mutation frequency in the whole genome between generations for humans (parent to child) is about 70 new mutations per generation. An even lower level of variation was found when comparing whole genome sequences from blood cells of a pair of 100-year-old monozygotic (identical) twins. Only 8 somatic differences were found, though somatic variation occurring in less than 20% of blood cells would be undetected. In the specifically protein-coding regions of the human genome, it is estimated that there are about 0.35 mutations that would change the protein sequence between parent/child generations (less than one mutated protein per generation).

In cancer, mutation frequencies are much higher, due to genome instability. This frequency can further depend on patient age, exposure to DNA-damaging agents (such as UV irradiation or components of tobacco smoke) and the activity/inactivity of DNA repair mechanisms. Furthermore, mutation frequency can vary between cancer types: in germline cells, mutations occur at approximately 0.023 mutations per megabase, but this number is much higher in breast cancer (1.18-1.66 somatic mutations per Mb), in lung cancer (17.7) or in melanomas (≈33). Since the haploid human genome consists of approximately 3,200 megabases, this translates into about 74 mutations (mostly in noncoding regions) in germline DNA per generation, but 3,776-5,312 somatic mutations per haploid genome in breast cancer, 56,640 in lung cancer and 105,600 in melanomas.

The distribution of somatic mutations across the human genome is very uneven, such that the gene-rich, early-replicating regions receive fewer mutations than gene-poor, late-replicating heterochromatin, likely due to differential DNA repair activity. In particular, the histone modification H3K9me3 is associated with high, and H3K36me3 with low mutation frequencies.
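The per-genome mutation counts quoted above follow from multiplying each per-megabase rate by the approximate haploid genome size. A minimal Python sketch of that arithmetic is shown here; the names are illustrative assumptions, and the rates and genome size are the ones given in the text.

```python
HAPLOID_GENOME_MB = 3200  # approximate haploid genome size in megabases

# Mutations per megabase as quoted in the text, stored as (low, high);
# only breast cancer is reported as a range.
RATES_PER_MB = {
    "germline": (0.023, 0.023),
    "breast cancer": (1.18, 1.66),
    "lung cancer": (17.7, 17.7),
    "melanoma": (33.0, 33.0),
}


def mutations_per_genome(rate_per_mb: float,
                         genome_mb: int = HAPLOID_GENOME_MB) -> int:
    """Scale a per-megabase rate to a whole haploid genome."""
    return round(rate_per_mb * genome_mb)


counts = {
    name: (mutations_per_genome(low), mutations_per_genome(high))
    for name, (low, high) in RATES_PER_MB.items()
}

for name, (low, high) in counts.items():
    label = f"{low:,}" if low == high else f"{low:,}-{high:,}"
    print(f"{name}: {label} mutations per haploid genome")
```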


Genome-wide association studies

In research, whole-genome sequencing can be used in a Genome-Wide Association Study (GWAS) – a project aiming to determine the genetic variant or variants associated with a disease or some other phenotype.
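To illustrate what testing a variant for association can look like, the sketch below applies a Pearson chi-square test to a 2x2 table of allele counts in cases and controls. This is only one simple single-variant test, not a description of any particular GWAS pipeline; the function name, variant names and allele counts are all made up for illustration.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Returns (chi2, p_value). Assumes every row and column total is
    non-zero, so that all expected counts are positive.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
variants = {
    "variant_A": (60, 140, 30, 170),  # alt allele enriched in cases
    "variant_B": (50, 150, 50, 150),  # identical in cases and controls
}

results = {name: allelic_chi_square(*c) for name, c in variants.items()}
for name, (chi2, p) in results.items():
    print(f"{name}: chi2={chi2:.3f}, p={p:.3g}")
```

In a real study the same test would be repeated across millions of variants, so the significance threshold has to be corrected for multiple testing.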


Diagnostic use

In 2009, Illumina released its first whole genome sequencers that were approved for clinical as opposed to research-only use, and doctors at academic medical centers began quietly using them to try to diagnose what was wrong with people whom standard approaches had failed to help. In 2009, a team from Stanford led by Euan Ashley performed clinical interpretation of a full human genome, that of bioengineer Stephen Quake. In 2010, Ashley's team reported whole genome molecular autopsy and, in 2011, extended the interpretation framework to a fully sequenced family, the West family, who were the first family to be sequenced on the Illumina platform. The price to sequence a genome at that time was US$19,500, which was billed to the patient but usually paid for out of a research grant; one person at that time had applied for reimbursement from their insurance company. In one case, a child had needed around 100 surgeries by the time he was three years old, and his doctor turned to whole genome sequencing to determine the problem; it took a team of around 30 people that included 12 bioinformatics experts, three sequencing technicians, five physicians, two genetic counsellors and two ethicists to identify a rare mutation in the XIAP gene that was causing widespread problems.

Due to recent cost reductions (see above), whole genome sequencing has become a realistic application in DNA diagnostics. In 2013, the 3Gb-TEST consortium obtained funding from the European Union to prepare the health care system for these innovations in DNA diagnostics. Quality assessment schemes, health technology assessment and guidelines have to be in place. The 3Gb-TEST consortium has identified the analysis and interpretation of sequence data as the most complicated step in the diagnostic process. At the Consortium meeting in Athens in September 2014, the Consortium coined the word ''genotranslation'' for this crucial step. This step leads to a so-called ''genoreport''. Guidelines are needed to determine the required content of these reports.

Genomes2People (G2P), an initiative of Brigham and Women's Hospital and Harvard Medical School, was created in 2011 to examine the integration of genomic sequencing into clinical care of adults and children. G2P's director, Robert C. Green, had previously led the REVEAL study (Risk EValuation and Education for Alzheimer's Disease), a series of clinical trials exploring patient reactions to the knowledge of their genetic risk for Alzheimer's. Green and a team of researchers launched the BabySeq Project in 2013 to study the ethical and medical consequences of sequencing an infant's DNA. A second phase, BabySeq2, was funded by NIH in 2021 and is an implementation study that expands this project, planning to enroll 500 infants from diverse families and track the effects of their genomic sequencing on their pediatric care.

In 2018, researchers at Rady Children's Institute for Genomic Medicine in San Diego, CA determined that rapid whole-genome sequencing (rWGS) can diagnose genetic disorders in time to change acute medical or surgical management (clinical utility) and improve outcomes in acutely ill infants. The researchers reported a retrospective cohort study of acutely ill inpatient infants in a regional children's hospital from July 2016 to March 2017. Forty-two families received rWGS for etiologic diagnosis of genetic disorders. The diagnostic sensitivity was 43% (eighteen of 42 infants) for rWGS and 10% (four of 42 infants) for standard genetic tests (P = .0005). The rate of clinical utility of rWGS (31%, thirteen of 42 infants) was significantly greater than for standard genetic tests (2%, one of 42; P = .0015). Eleven (26%) infants with diagnostic rWGS avoided morbidity, one had a 43% reduction in likelihood of mortality, and one started palliative care. In six of the eleven infants, the changes in management reduced inpatient cost by $800,000-$2,000,000. These findings replicate a prior study of the clinical utility of rWGS in acutely ill inpatient infants, and demonstrate improved outcomes and net healthcare savings. rWGS merits consideration as a first tier test in this setting.

A 2018 review of 36 publications found the cost of whole genome sequencing to range from US$1,906 to US$24,810, with diagnostic yield varying widely from 17% to 73% depending on patient groups.
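The percentages reported for the 42-infant rWGS cohort above can be reproduced from the raw counts. The short Python sketch below does only that arithmetic; the dictionary keys and function name are illustrative assumptions, and the P values are not recomputed because the paired per-patient data are not given in the text.

```python
COHORT_SIZE = 42  # infants in the retrospective cohort

# Counts quoted in the text; the keys are illustrative labels.
cohort_counts = {
    "rWGS diagnosis": 18,
    "standard test diagnosis": 4,
    "rWGS clinical utility": 13,
    "standard test clinical utility": 1,
    "avoided morbidity": 11,
}


def percent(count: int, total: int = COHORT_SIZE) -> int:
    """Share of the cohort as a whole-number percentage."""
    return round(100 * count / total)


rates = {label: percent(n) for label, n in cohort_counts.items()}

for label, pct in rates.items():
    print(f"{label}: {cohort_counts[label]}/{COHORT_SIZE} = {pct}%")
```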


Rare variant association study

Whole genome sequencing studies enable the assessment of associations between complex traits and both coding and noncoding rare variants (minor allele frequency (MAF) < 1%) across the genome. Single-variant analyses typically have low power to identify associations with rare variants, and variant set tests have been proposed to jointly test the effects of given sets of multiple rare variants. SNP annotations help to prioritize rare functional variants, and incorporating these annotations can effectively boost the power of rare variant association analysis in whole genome sequencing studies. Some tools have been specifically developed to provide all-in-one rare variant association analysis for whole-genome sequencing data, including integration of genotype data and their functional annotations, association analysis, result summary and visualization.
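One simple form of variant set test is a burden (collapsing) approach: rare variants in a set are filtered by allele frequency and summed into a per-individual score, which is then compared between cases and controls. The Python sketch below shows only the filtering and collapsing steps on made-up data. It is a minimal illustration under stated assumptions, not the method of any specific tool: the variant names, the sparse genotype layout and the case/control split are hypothetical, and the alternate allele is assumed to be the minor allele.

```python
N_INDIVIDUALS = 100   # hypothetical cohort: indices 0-49 cases, 50-99 controls
N_CASES = 50
MAF_THRESHOLD = 0.01  # rare-variant cutoff used in the text (MAF < 1%)

# Sparse genotypes: variant -> {individual index: alternate allele count}.
variants = {
    "rare_1": {3: 1},
    "rare_2": {10: 1},
    "rare_3": {70: 1},
    "common_1": {i: 1 for i in range(0, N_INDIVIDUALS, 2)},
}


def alt_allele_frequency(carriers, n_individuals):
    """Alternate allele frequency in a diploid cohort.

    The alternate allele is assumed to be the minor allele.
    """
    return sum(carriers.values()) / (2 * n_individuals)


def burden_scores(variants, n_individuals, maf_threshold=MAF_THRESHOLD):
    """Sum rare alternate alleles per individual across the variant set."""
    scores = [0] * n_individuals
    for carriers in variants.values():
        if alt_allele_frequency(carriers, n_individuals) < maf_threshold:
            for idx, count in carriers.items():
                scores[idx] += count
    return scores


scores = burden_scores(variants, N_INDIVIDUALS)
case_carriers = sum(1 for s in scores[:N_CASES] if s > 0)
control_carriers = sum(1 for s in scores[N_CASES:] if s > 0)

print(f"rare-variant carriers: {case_carriers} cases, "
      f"{control_carriers} controls")
```

Functional annotations could enter such a scheme as per-variant weights in the sum, so that variants predicted to be damaging contribute more to the score.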


Ethical concerns

The introduction of whole genome sequencing may have ethical implications. On one hand, genetic testing can potentially diagnose preventable diseases, both in the individual undergoing genetic testing and in their relatives. On the other hand, genetic testing has potential downsides such as genetic discrimination, loss of anonymity, and psychological impacts such as discovery of non-paternity. Some ethicists insist that the privacy of individuals undergoing genetic testing must be protected. Indeed, privacy issues can be of particular concern when minors undergo genetic testing. Illumina's CEO, Jay Flatley, claimed in February 2009 that "by 2019 it will have become routine to map infants' genes when they are born". This potential use of genome sequencing is highly controversial, as it runs counter to ethical norms for predictive genetic testing of asymptomatic minors that are well established in the fields of medical genetics and genetic counseling. The traditional guidelines for genetic testing have been developed over the course of several decades since it first became possible to test for genetic markers associated with disease, prior to the advent of cost-effective, comprehensive genetic screening.

When an individual undergoes whole genome sequencing, they reveal information not only about their own DNA sequences, but also about the probable DNA sequences of their close genetic relatives. This information can further reveal useful predictive information about relatives' present and future health risks. Hence, there are important questions about what obligations, if any, are owed to the family members of the individuals who are undergoing genetic testing. In Western/European society, tested individuals are usually encouraged to share important information on any genetic diagnoses with their close relatives, since the importance of the genetic diagnosis for offspring and other close relatives is usually one of the reasons for seeking genetic testing in the first place. Nevertheless, a major ethical dilemma can develop when patients refuse to share information on a diagnosis of a serious genetic disorder that is highly preventable and where there is a high risk to relatives carrying the same disease mutation. Under such circumstances, the clinician may suspect that the relatives would rather know of the diagnosis, and hence the clinician can face a conflict of interest with respect to patient-doctor confidentiality.

Privacy concerns can also arise when whole genome sequencing is used in scientific research studies. Researchers often need to put information on patients' genotypes and phenotypes into public scientific databases, such as locus-specific databases. Although only anonymous patient data are submitted to locus-specific databases, patients might still be identifiable by their relatives in the case of finding a rare disease or a rare missense mutation.

Public discussion around the introduction of advanced forensic techniques (such as advanced familial searching using public DNA ancestry websites and DNA phenotyping approaches) has been limited, disjointed, and unfocused. As forensic genetics and medical genetics converge toward genome sequencing, issues surrounding genetic data become increasingly connected, and additional legal protections may need to be established.


Public human genome sequences


First people with public genome sequences

The first nearly complete human genomes sequenced were those of two Americans of predominantly Northwestern European ancestry in 2007 (J. Craig Venter at 7.5-fold coverage, and James Watson at 7.4-fold). This was followed in 2008 by sequencing of an anonymous Han Chinese man (at 36-fold), a Yoruban man from Nigeria (at 30-fold), a female clinical geneticist (Marjolein Kriek) from the Netherlands (at 7 to 8-fold), and a female leukemia patient in her mid-50s (at 33 and 14-fold coverage for tumor and normal tissues). Steve Jobs was among the first 20 people to have their whole genome sequenced, reportedly for the cost of $100,000. At one point, 69 nearly complete human genomes were publicly available. In November 2013, a Spanish family made their personal genomics data publicly available under a Creative Commons public domain license. The work was led by Manuel Corpas and the data were obtained by direct-to-consumer genetic testing with 23andMe and the Beijing Genomics Institute. This is believed to be the first such public genomics dataset for a whole family.


Databases

According to ''Science'', the major databases of whole genomes are:


See also

* Coverage (genetics)
* DNA microarray
* DNA profiling
* DNA sequencing
* Duplex sequencing
* Exome sequencing
* Single cell sequencing
* Horizontal correlation
* Medical genetics
* Nucleic acid sequence
* Human Genome Project
* Personal Genome Project
* Genomics England
* Medical Research Council (MRC)
* Predictive medicine
* Personalized medicine
* Rare functional variant
* SNP annotation


External links


James Watson's Personal Genome Sequence

AAAS/Science: Genome Sequencing Poster