Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration.
In contrast to genetics, which refers to the study of ''individual'' genes and their roles in inheritance, genomics aims at the collective characterization and quantification of ''all'' of an organism's genes, their interrelations and influence on the organism.
Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through the use of high-throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes.
Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.
The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the genome.
History
Etymology
From the Greek ΓΕΝ ''gen'', "gene" (gamma, epsilon, nu, epsilon) meaning "become, create, creation, birth", and subsequent variants: genealogy, genesis, genetics, genic, genomere, genotype, genus etc. While the word ''genome'' (from the German ''Genom'', attributed to Hans Winkler) was in use in English as early as 1926, the term ''genomics'' was coined by Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beers with Jim Womack, Tom Shows and Stephen O’Brien at a meeting held in Maryland on the mapping of the human genome in 1986. The term was used first as the name for a new journal and then for a whole new scientific discipline.
Early sequencing efforts
Following Rosalind Franklin's confirmation of the helical structure of DNA, James D. Watson and Francis Crick's publication of the structure of DNA in 1953 and Fred Sanger's publication of the amino acid sequence of insulin in 1955, nucleic acid sequencing became a major target of early molecular biologists.
In 1964, Robert W. Holley and colleagues published the first nucleic acid sequence ever determined, the ribonucleotide sequence of alanine transfer RNA.
Extending this work, Marshall Nirenberg and Philip Leder revealed the triplet nature of the genetic code and were able to determine the sequences of 54 out of 64 codons in their experiments.
In 1972, Walter Fiers and his team at the Laboratory of Molecular Biology of the University of Ghent (Ghent, Belgium) were the first to determine the sequence of a gene: the gene for bacteriophage MS2 coat protein.
Fiers' group expanded on their MS2 coat protein work, determining the complete nucleotide sequence of bacteriophage MS2 RNA (whose genome encodes just four genes in 3569 base pairs [bp]) and of Simian virus 40 in 1976 and 1978, respectively.
DNA-sequencing technology developed
In addition to his seminal work on the amino acid sequence of insulin, Frederick Sanger and his colleagues played a key role in the development of DNA sequencing techniques that enabled the establishment of comprehensive genome sequencing projects. In 1975, he and Alan Coulson published a sequencing procedure using DNA polymerase with radiolabelled nucleotides that he called the ''Plus and Minus technique''. This involved two closely related methods that generated short oligonucleotides with defined 3' termini. These could be fractionated by electrophoresis on a polyacrylamide gel (called polyacrylamide gel electrophoresis) and visualised using autoradiography. The procedure could sequence up to 80 nucleotides in one go and was a big improvement, but was still very laborious. Nevertheless, in 1977 his group was able to sequence most of the 5,386 nucleotides of the single-stranded bacteriophage φX174, completing the first fully sequenced DNA-based genome. The refinement of the ''Plus and Minus'' method resulted in the chain-termination, or Sanger method (see below), which formed the basis of the techniques of DNA sequencing, genome mapping, data storage, and bioinformatic analysis most widely used in the following quarter-century of research.
In the same year Walter Gilbert and Allan Maxam of Harvard University independently developed the Maxam-Gilbert method (also known as the ''chemical method'') of DNA sequencing, a less efficient method that involves the preferential cleavage of DNA at known bases. For their groundbreaking work in the sequencing of nucleic acids, Gilbert and Sanger shared half the 1980 Nobel Prize in chemistry with Paul Berg (recombinant DNA).
Complete genomes
The advent of these technologies resulted in a rapid intensification in the scope and speed of completion of genome sequencing projects. The first complete genome sequence of a eukaryotic organelle, the human mitochondrion (16,568 bp, about 16.6 kb [kilobase]), was reported in 1981, and the first chloroplast genomes followed in 1986.
In 1992, the first eukaryotic chromosome, chromosome III of brewer's yeast ''Saccharomyces cerevisiae'' (315 kb), was sequenced.
The first free-living organism to be sequenced was ''Haemophilus influenzae'' (1.8 Mb [megabase]) in 1995.
The following year a consortium of researchers from laboratories across North America, Europe, and Japan announced the completion of the first complete genome sequence of a eukaryote, ''S. cerevisiae'' (12.1 Mb), and since then genomes have continued to be sequenced at an exponentially growing pace.
The complete sequences are available for: 2,719 viruses, 1,115 archaea and bacteria, and 36 eukaryotes, of which about half are fungi.
Most of the microorganisms whose genomes have been completely sequenced are problematic pathogens, such as ''Haemophilus influenzae'', which has resulted in a pronounced bias in their phylogenetic distribution compared to the breadth of microbial diversity.
Of the other sequenced species, most were chosen because they were well-studied model organisms or promised to become good models. Yeast (''Saccharomyces cerevisiae'') has long been an important model organism for the eukaryotic cell, while the fruit fly ''Drosophila melanogaster'' has been a very important tool (notably in early pre-molecular genetics). The worm ''Caenorhabditis elegans'' is an often used simple model for multicellular organisms. The zebrafish ''Brachydanio rerio'' is used for many developmental studies on the molecular level, and the plant ''Arabidopsis thaliana'' is a model organism for flowering plants. The Japanese pufferfish (''Takifugu rubripes'') and the spotted green pufferfish (''Tetraodon nigroviridis'') are interesting because of their small and compact genomes, which contain very little noncoding DNA compared to most species.
The mammals dog (''Canis familiaris''), brown rat (''Rattus norvegicus''), mouse (''Mus musculus''), and chimpanzee (''Pan troglodytes'') are all important model animals in medical research.
A rough draft of the human genome was completed by the Human Genome Project in early 2001, creating much fanfare. This project, completed in 2003, sequenced the entire genome for one specific person, and by 2007 this sequence was declared "finished" (less than one error in 20,000 bases and all chromosomes assembled).
In the years since then, the genomes of many other individuals have been sequenced, partly under the auspices of the 1000 Genomes Project, which announced the sequencing of 1,092 genomes in October 2012. Completion of this project was made possible by the development of dramatically more efficient sequencing technologies and required the commitment of significant bioinformatics resources from a large international collaboration. The continued analysis of human genomic data has profound political and social repercussions for human societies.
The "omics" revolution
The English-language neologism ''omics'' informally refers to a field of study in biology ending in ''-omics'', such as genomics, proteomics or metabolomics. The related suffix ''-ome'' is used to address the objects of study of such fields, such as the genome, proteome or metabolome respectively. The suffix ''-ome'' as used in molecular biology refers to a ''totality'' of some sort; similarly, omics has come to refer generally to the study of large, comprehensive biological data sets. While the growth in the use of the term has led some scientists (Jonathan Eisen, among others) to claim that it has been oversold, it reflects the change in orientation towards the quantitative analysis of a complete or near-complete assortment of all the constituents of a system.
In the study of symbioses, for example, researchers who were once limited to the study of a single gene product can now simultaneously compare the total complement of several types of biological molecules.
Genome analysis
After an organism has been selected, genome projects involve three components: the sequencing of DNA, the assembly of that sequence to create a representation of the original chromosome, and the annotation and analysis of that representation.
Sequencing
Historically, sequencing was done in ''sequencing centers'', centralized facilities (ranging from large independent institutions such as the Joint Genome Institute, which sequences dozens of terabases a year, to local molecular biology core facilities) which contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast-turnaround benchtop sequencers has come within reach of the average academic laboratory.
On the whole, genome sequencing approaches fall into two broad categories, ''shotgun'' and ''high-throughput'' (or ''next-generation'') sequencing.
Shotgun sequencing
Shotgun sequencing is a sequencing method designed for analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes.
It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. Since gel electrophoresis sequencing can only be used for fairly short sequences (100 to 1000 base pairs), longer DNA sequences must be broken into random small segments which are then sequenced to obtain ''reads''. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.
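The overlap-based merging of reads described above can be sketched as a greedy procedure that repeatedly joins the two reads with the longest suffix-to-prefix overlap. This is only an illustration under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical and do not come from any particular assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are treated as noise and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge reads pairwise, always taking the largest remaining overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"])
print(contigs)
```

The greedy choice is easy to follow but can misassemble repetitive sequence, which is one reason practical assemblers use the graph-based strategies discussed under Assembly.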
Shotgun sequencing is a random sampling process, requiring over-sampling to ensure a given nucleotide is represented in the reconstructed sequence; the average number of reads by which a genome is over-sampled is referred to as coverage.
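As a rough worked example of coverage, the mean coverage can be computed as the number of reads times the read length divided by the genome length. The second function adds an assumption that is not stated in the text above: if read start positions are uniformly random, the number of reads covering a base is approximately Poisson distributed, so the expected fraction of bases covered at least once is 1 - e^(-coverage). The function names and the example numbers are illustrative only.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads overlapping each base: N * L / G."""
    return num_reads * read_length / genome_length


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes uniformly random read placement (Poisson approximation).
    """
    return 1.0 - math.exp(-coverage)


# Hypothetical project: 50,000 reads of 500 bp over a 5 Mb genome.
c = mean_coverage(50_000, 500, 5_000_000)
print(c, expected_fraction_covered(c))
```

Under this simplified model, even several-fold coverage leaves a small fraction of bases unsampled, which is why over-sampling is needed.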
For much of its history, the technology underlying shotgun sequencing was the classical chain-termination method or 'Sanger method', which is based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during ''in vitro'' DNA replication.
Recently, shotgun sequencing has been supplanted by high-throughput sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, primarily for smaller-scale projects and for obtaining especially long contiguous DNA sequence reads (>500 nucleotides).
Chain-termination methods require a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleosidetriphosphates (dNTPs), and modified nucleotides (dideoxyNTPs) that terminate DNA strand elongation. These chain-terminating nucleotides lack a 3'-OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in DNA sequencers. Typically, these machines can sequence up to 96 DNA samples in a single batch (run) in up to 48 runs a day.
High-throughput sequencing
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once. High-throughput sequencing is intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods. In ultra-high-throughput sequencing, as many as 500,000 sequencing-by-synthesis operations may be run in parallel.
The Illumina dye sequencing method is based on reversible dye-terminators and was developed in 1996 at the Geneva Biomedical Research Institute, by Pascal Mayer and Laurent Farinelli. In this method, DNA molecules and primers are first attached on a slide and amplified with polymerase so that local clonal colonies, initially coined "DNA colonies", are formed. To determine the sequence, four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera. Decoupling the enzymatic reaction and the image capture allows for optimal throughput and theoretically unlimited sequencing capacity; with an optimal configuration, the ultimate throughput of the instrument depends only on the A/D conversion rate of the camera. The camera takes images of the fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle.
An alternative approach, ion semiconductor sequencing, is based on standard DNA replication chemistry. This technology measures the release of a hydrogen ion each time a base is incorporated. A microwell containing template DNA is flooded with a single nucleotide; if the nucleotide is complementary to the template strand, it will be incorporated and a hydrogen ion will be released. This release triggers an ISFET ion sensor. If a homopolymer is present in the template sequence, multiple nucleotides will be incorporated in a single flood cycle, and the detected electrical signal will be proportionally higher.
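The flood-cycle logic above can be illustrated with an idealized simulation in which the signal for each flood is simply the number of bases incorporated, so a homopolymer run of length three yields a signal of 3. This is a sketch under stated assumptions: the input is the newly synthesized strand (the complement of the template), the signal is noise-free and exactly proportional, and the cyclic flow order `"TACG"` is chosen here for illustration rather than taken from any instrument specification.

```python
def flow_signals(synthesized: str, flow_order: str = "TACG", max_flows: int = 1000):
    """Idealized per-flood signal: how many bases are incorporated in each flood.

    `synthesized` is the strand being built, read 5' to 3'.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated entirely within one flood.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append(run)  # 0 means no incorporation, hence no ion release
        flow += 1
    return signals


def decode(signals, flow_order: str = "TACG") -> str:
    """Reconstruct the synthesized strand from idealized flood signals."""
    return "".join(
        flow_order[i % len(flow_order)] * n for i, n in enumerate(signals)
    )


signals = flow_signals("TTACGGG")
print(signals, decode(signals))
```

In real instruments the signal is analog and noisy, so distinguishing long homopolymer runs (for example 7 versus 8 bases) is harder than this integer model suggests.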
Assembly
Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed as current DNA sequencing technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Third-generation sequencing technologies such as PacBio or Oxford Nanopore routinely generate sequencing reads >10 kb in length; however, they have a high error rate of approximately 15 percent. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcripts (ESTs).
Assembly approaches
Assembly can be broadly categorized into two approaches: ''de novo'' assembly, for genomes which are not similar to any sequenced in the past, and comparative assembly, which uses the existing sequence of a closely related organism as a reference during assembly. Relative to comparative assembly, ''de novo'' assembly is computationally difficult (NP-hard), making it less favourable for short-read NGS technologies. Within the ''de novo'' assembly paradigm there are two primary strategies: Eulerian path strategies and overlap-layout-consensus (OLC) strategies. OLC strategies ultimately try to create a Hamiltonian path through an overlap graph, which is an NP-hard problem. Eulerian path strategies are computationally more tractable because they try to find an Eulerian path through a de Bruijn graph.
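The Eulerian path strategy can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not any particular assembler's method: reads are error-free, each distinct k-mer is used once, and an Eulerian path is assumed to exist. The function names are hypothetical, and Hierholzer's algorithm is used here as one standard way to find the path.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Find a path using every edge once (Hierholzer's algorithm).

    Assumes such a path exists; no validity check is performed.
    """
    if not graph:
        return []
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (node for node, targets in graph.items()
         if len(targets) - in_degree[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Assemble reads by walking an Eulerian path through the de Bruijn graph."""
    kmers = sorted({read[i:i + k] for read in reads
                    for i in range(len(read) - k + 1)})
    path = eulerian_path(de_bruijn_graph(kmers))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["ATGGCGT", "GCGTGCA"], k=4))
```

With `k=4` every node in this toy graph is unique, so the path and the reconstructed sequence are unambiguous. With smaller `k`, or with repeats in the genome, several Eulerian paths can exist and the assembly is no longer uniquely determined.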
Finishing
Finished genomes are defined as having a single contiguous sequence with no ambiguities representing each replicon.
Annotation
The DNA sequence assembly alone is of little value without additional analysis. Genome annotation is the process of attaching biological information to sequences, and consists of three main steps:
# identifying portions of the genome that do not code for proteins
# identifying elements on the genome, a process called gene prediction, and
# attaching biological information to these elements.
Automatic annotation tools try to perform these steps ''in silico'', as opposed to manual annotation (a.k.a. curation), which involves human expertise and potential experimental verification. Ideally, these approaches co-exist and complement each other in the same annotation pipeline (see also Sequencing pipelines and databases, below).
Traditionally, the basic level of annotation uses BLAST to find similarities and then annotates genomes based on homologues. More recently, additional information has been added to annotation platforms. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on both curated data sources and a range of software tools in their automated genome annotation pipeline. ''Structural annotation'' consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. ''Functional annotation'' consists of attaching biological information to genomic elements.
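As a rough illustration of structural annotation, the sketch below scans the three forward reading frames of a sequence for open reading frames. It is deliberately simplified and rests on assumptions not taken from any real annotation tool: it uses the standard start codon `ATG` and stop codons `TAA`, `TAG` and `TGA`, ignores the reverse strand, and uses a hypothetical `min_codons` threshold. Real gene prediction software uses much richer statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    `start` is the index of the ATG, `end` is exclusive and includes the
    stop codon. `min_codons` counts the codons before the stop codon,
    including the start codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

Functional annotation would then attach information to each reported element, for example by comparing its translation against proteins of known function.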
Sequencing pipelines and databases
The need for reproducibility and efficient management of the large amount of data associated with genome projects means that computational pipelines have important applications in genomics.
Research areas
Functional genomics
Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.
A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression under various conditions. The most important tools here are microarrays and bioinformatics.
Structural genomics
Structural genomics seeks to describe the three-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of large numbers of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs. Structural genomics involves taking a large number of approaches to structure determination, including experimental methods using genomic sequences, modeling-based approaches based on sequence or structural homology to a protein of known structure, and approaches based on chemical and physical principles for a protein with no homology to any known structure. As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein function. This raises new challenges in structural bioinformatics, i.e. determining protein function from its 3D structure.
Epigenomics
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence (Russell 2010 p. 475). Two of the most characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as differentiation/development and tumorigenesis. The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.
Metagenomics
Metagenomics is the study of ''metagenomes'', genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods. Recent studies use "shotgun" Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities. Because of its power to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.
Model systems
Viruses and bacteriophages
Bacteriophages have played and continue to play a key role in bacterial genetics and molecular biology. Historically, they were used to define gene structure and gene regulation. The first genome to be sequenced was also that of a bacteriophage. However, bacteriophage research did not lead the genomics revolution, which is clearly dominated by bacterial genomics. Only very recently has the study of bacteriophage genomes become prominent, thereby enabling researchers to understand the mechanisms underlying phage evolution. Bacteriophage genome sequences can be obtained through direct sequencing of isolated bacteriophages, but can also be derived as part of microbial genomes. Analysis of bacterial genomes has shown that a substantial amount of microbial DNA consists of prophage sequences and prophage-like elements. Detailed database mining of these sequences offers insights into the role of prophages in shaping the bacterial genome. Overall, this method verified many known bacteriophage groups, making it a useful tool for predicting the relationships of prophages from bacterial genomes.
Cyanobacteria
At present there are 24 cyanobacteria for which a total genome sequence is available. 15 of these cyanobacteria come from the marine environment. These are six ''Prochlorococcus'' strains, seven marine ''Synechococcus'' strains, ''Trichodesmium erythraeum'' IMS101 and ''Crocosphaera watsonii'' WH8501. Several studies have demonstrated how these sequences could be used very successfully to infer important ecological and physiological characteristics of marine cyanobacteria. However, there are many more genome projects currently in progress, among them further ''Prochlorococcus'' and marine ''Synechococcus'' isolates, ''Acaryochloris'' and ''Prochloron'', the N2-fixing filamentous cyanobacteria ''Nodularia spumigena'', ''Lyngbya aestuarii'' and ''Lyngbya majuscula'', as well as bacteriophages infecting marine cyanobacteria. Thus, the growing body of genome information can also be tapped in a more general way to address global problems by applying a comparative approach. Some new and exciting examples of progress in this field are the identification of genes for regulatory RNAs, insights into the evolutionary origin of photosynthesis, and estimation of the contribution of horizontal gene transfer to the genomes that have been analyzed.
Applications
Genomics has provided applications in many fields, including medicine, biotechnology, anthropology and other social sciences.
Genomic medicine
Next-generation genomic technologies allow clinicians and biomedical researchers to drastically increase the amount of genomic data collected on large study populations. When combined with new informatics approaches that integrate many kinds of data with genomic data in disease research, this allows researchers to better understand the genetic bases of drug response and disease. Early efforts to apply the genome to medicine included those by a Stanford team led by Euan Ashley, who developed the first tools for the medical interpretation of a human genome. The Genomes2People research program at Brigham and Women's Hospital, Broad Institute and Harvard Medical School was established in 2012 to conduct empirical research in translating genomics into health. Brigham and Women's Hospital opened a Preventive Genomics Clinic in August 2019, with Massachusetts General Hospital following a month later. The ''All of Us'' research program aims to collect genome sequence data from 1 million participants to become a critical component of the precision medicine research platform.
Synthetic biology and bioengineering
The growth of genomic knowledge has enabled increasingly sophisticated applications of synthetic biology. In 2010 researchers at the J. Craig Venter Institute announced the creation of a partially synthetic species of bacterium, ''Mycoplasma laboratorium'', derived from the genome of ''Mycoplasma genitalium''.
Population and conservation genomics
Population genomics has developed as a popular field of research, where genomic sequencing methods are used to conduct large-scale comparisons of DNA sequences among populations, beyond the limits of genetic markers such as short-range PCR products or microsatellites traditionally used in population genetics. Population genomics studies genome-wide effects to improve our understanding of microevolution so that we may learn the phylogenetic history and demography of a population. Population genomic methods are used in many different fields including evolutionary biology, ecology, biogeography, conservation biology and fisheries management. Similarly, landscape genomics has developed from landscape genetics to use genomic methods to identify relationships between patterns of environmental and genetic variation.
Conservationists can use the information gathered by genomic sequencing in order to better evaluate genetic factors key to species conservation, such as the genetic diversity of a population or whether an individual is heterozygous for a recessive inherited genetic disorder. By using genomic data to evaluate the effects of evolutionary processes and to detect patterns in variation throughout a given population, conservationists can formulate plans to aid a given species without as many variables left unknown as those unaddressed by standard genetic approaches.
See also
* Cognitive genomics
* Computational genomics
* Epigenomics
* Functional genomics
* GeneCalling, an mRNA profiling technology
* Genomics of domestication
* Genetics in fiction
* Glycomics
* Immunomics
* Metagenomics
* Pathogenomics
* Personal genomics
* Proteomics
* Transcriptomics
* Venomics
* Psychogenomics
* Whole genome sequencing
* Thomas Roderick
External links
* Annual Review of Genomics and Human Genetics
* BMC Genomics: A BMC journal on Genomics
* Genomics journal
* Genomics.org: An openfree genomics portal.
* NHGRI: US government's genome institute
* JCVI Comprehensive Microbial Resource
* KoreaGenome.org: The first Korean Genome published and the sequence is available freely.
* GenomicsNetwork: Looks at the development and use of the science and technologies of genomics.
* Institute for Genome Sciences: Genomics research.
* MIT OpenCourseWare HST.512 Genomic Medicine: A free, self-study course in genomic medicine. Resources include audio lectures and selected lecture notes.
* ENCODE threads explorer: Machine learning approaches to genomics. Nature (journal)
* Global map of genomics laboratories
* Genomics: Scitable by nature education
* Learn All About Genetics Online