The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA genes and noncoding DNA.
Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization creates a zygote), consist of three billion DNA base pairs, while diploid genomes (found in somatic cells) have twice the DNA content. While there are significant differences among the genomes of human individuals (on the order of 0.1% due to single-nucleotide variants and 0.6% when considering indels), these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees (~1.1% fixed single-nucleotide variants and 4% when including indels).
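To give a sense of scale for the percentages above, the following minimal sketch converts them into approximate base-pair counts. It assumes a round haploid genome size of 3 billion base pairs and treats each percentage as a simple fraction of that size, which is a simplification; the function name `differing_bases` is hypothetical and used only for illustration.

```python
# Assumption: a round 3 billion bp haploid genome, as quoted in the text.
HAPLOID_BASE_PAIRS = 3_000_000_000


def differing_bases(fraction: float, genome_size: int = HAPLOID_BASE_PAIRS) -> int:
    """Approximate number of base pairs covered by a given fraction of the genome."""
    return round(fraction * genome_size)


# Percentages taken from the paragraph above.
comparisons = {
    "human vs human, single-nucleotide variants (0.1%)": 0.001,
    "human vs human, including indels (0.6%)": 0.006,
    "human vs chimpanzee/bonobo, fixed SNVs (1.1%)": 0.011,
    "human vs chimpanzee/bonobo, including indels (4%)": 0.04,
}

for label, fraction in comparisons.items():
    print(f"{label}: about {differing_bases(fraction):,} bp")
```

Under these assumptions, a 0.1% difference corresponds to roughly 3 million base pairs, which shows why even a small percentage still represents a large absolute number of differing positions.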
Although the sequence of the human genome has been (almost) completely determined by DNA sequencing, it is not yet fully understood. Most (though probably not all) genes have been identified by a combination of high-throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products. Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance. Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein-coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000. However, a fuller understanding of the role played by sequences that do not encode proteins, but instead express regulatory RNA, has raised the total number of genes to at least 46,831, plus another 2,300 micro-RNA genes. By 2012, functional DNA elements that encode neither RNA nor proteins had been noted. A 2018 population survey found another 300 million bases of human genome sequence that were not in the reference sequence.
Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA genes, regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined.


Sequencing

The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence, leaving just 341 gaps in the sequence, representing highly repetitive and other DNA that could not be sequenced with the technology available at the time. The human genome was the first of all vertebrates to be sequenced to such near-completion, and as of 2018, the diploid genomes of over a million individual humans had been determined using next-generation sequencing. In 2021 it was reported that the T2T consortium had filled in all of the gaps, so that the human genome has now been completely sequenced. These data are used worldwide in biomedical science, anthropology, forensics and other branches of science. Such genomic studies have led to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution. In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome.


Completeness

Although the 'completion' of the human genome project was announced in 2001, there remained hundreds of gaps, with about 5–10% of the total sequence remaining undetermined. The missing genetic information was mostly in repetitive heterochromatic regions and near the centromeres and telomeres, but also in some gene-encoding euchromatic regions. There remained 160 euchromatic gaps in 2015, when the sequences spanning another 50 formerly unsequenced regions were determined. Only in 2020 was the first truly complete telomere-to-telomere sequence of a human chromosome determined, namely that of the X chromosome. A "complete genome" level assembly (without the Y chromosome) was achieved in May 2021.


Molecular organization and gene content

The total length of the human reference genome, which does not represent the sequence of any specific individual, is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, termed autosomes, plus the 23rd pair of sex chromosomes: XX in the female and XY in the male. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in multiple copies in each mitochondrion.
Original analysis published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths are estimated by multiplying the number of base pairs by 0.34 nanometers (the distance between base pairs in the most common structure of the DNA double helix); a recent estimate of human chromosome lengths based on updated data reports 205.00 cm for the diploid male genome and 208.23 cm for the female, corresponding to weights of 6.41 and 6.51 picograms (pg), respectively. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
Variations are unique DNA sequence differences that have been identified in the individual human genome sequences analyzed by Ensembl as of December 2016. The number of identified variations is expected to increase as further personal genomes are sequenced and analyzed. In addition to the gene content shown in this table, a large number of non-expressed functional sequences have been identified throughout the human genome (see below).
Small non-coding RNAs are RNAs of as many as 200 bases that do not have protein-coding potential. These include microRNAs, or miRNAs (post-transcriptional regulators of gene expression); small nuclear RNAs, or snRNAs (the RNA components of spliceosomes); and small nucleolar RNAs, or snoRNAs (involved in guiding chemical modifications to other RNA molecules).
Long non-coding RNAs are RNA molecules longer than 200 bases that do not have protein-coding potential. These include ribosomal RNAs, or rRNAs (the RNA components of ribosomes), and a variety of other long RNAs that are involved in regulation of gene expression, epigenetic modifications of DNA nucleotides and histone proteins, and regulation of the activity of protein-coding genes.
Small discrepancies between total small ncRNA numbers and the numbers of specific types of small ncRNAs result from the former values being sourced from Ensembl release 87 and the latter from Ensembl release 68.
The number of genes in the human genome is not entirely clear because the function of numerous transcripts remains unclear. This is especially true for non-coding RNA. The number of protein-coding genes is better known, but there are still on the order of 1,400 questionable genes which may or may not encode functional proteins, usually encoded by short open reading frames.
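The length estimate described in the notes above (number of base pairs multiplied by 0.34 nanometers) can be sketched as follows. This is a minimal illustration, not the source's own calculation: the function name `dna_length_cm` is hypothetical, and the input is a round 3 billion base pairs rather than an actual assembly size.

```python
# Distance between adjacent base pairs in the most common DNA double-helix
# structure, as stated in the text.
NM_PER_BASE_PAIR = 0.34


def dna_length_cm(base_pairs: int) -> float:
    """Return the stretched-out length, in centimetres, of a DNA molecule."""
    nanometres = base_pairs * NM_PER_BASE_PAIR
    return nanometres * 1e-7  # 1 nm = 1e-7 cm


# Assumption for illustration only: a round 3 billion bp haploid genome.
haploid_bp = 3_000_000_000
print(f"haploid: {dna_length_cm(haploid_bp):.1f} cm")
print(f"diploid: {dna_length_cm(2 * haploid_bp):.1f} cm")
```

With this rounded input the diploid result comes out near 204 cm, which is in the same range as the 205.00 cm and 208.23 cm figures quoted above; the small gap reflects the rounded genome size used here.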


Information content

The haploid human genome (23 chromosomes) is about 3 billion base pairs long and contains around 30,000 genes. Since every base pair can be coded by 2 bits, this is about 750 megabytes of data. An individual somatic (diploid) cell contains twice this amount, that is, about 6 billion base pairs. Men have fewer base pairs than women because the Y chromosome is about 57 million base pairs long whereas the X is about 156 million. Since individual genomes vary in sequence by less than 1% from each other, the variations of a given human's genome from a common reference can be losslessly compressed to roughly 4 megabytes. The entropy rate of the genome differs significantly between coding and non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million base pairs), but less for the non-coding parts. It ranges between 1.5 and 1.9 bits per base pair for individual chromosomes, except for the Y chromosome, which has an entropy rate below 0.9 bits per base pair (estimated using Lempel-Ziv estimators of entropy rate).
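The 750-megabyte figure follows from storing each base in 2 bits. The sketch below works through that arithmetic and shows one possible 2-bit packing of a sequence. The particular bit assignment for A, C, G and T and the helper names `raw_size_megabytes`, `pack` and `unpack` are assumptions made for illustration; real genome file formats differ in detail and also need to represent unknown bases.

```python
BITS_PER_BASE = 2  # four possible bases, and log2(4) = 2
BASES = "ACGT"     # assumed ordering; the bit assignment is arbitrary
CODES = {base: code for code, base in enumerate(BASES)}


def raw_size_megabytes(base_pairs: int) -> float:
    """Size of a sequence stored at 2 bits per base, in megabytes (10**6 bytes)."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000


def pack(sequence: str) -> bytes:
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


print(raw_size_megabytes(3_000_000_000))  # 750.0
packed = pack("ACGTA")
print(len(packed), unpack(packed, 5))     # 2 ACGTA
```

The original length has to be stored alongside the packed bytes because the final byte may be only partly filled. This fixed 2-bit encoding is the uncompressed baseline; the lower entropy rates quoted above for non-coding sequence indicate that such regions can be compressed further.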


Coding vs. noncoding DNA

The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (ca. 98% of the genome) that are not used to encode proteins. Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity. Because non-coding DNA greatly outnumbers coding DNA, the concept of the sequenced genome has become a more focused analytical concept than the classical concept of the DNA-coding gene.


Coding sequences (protein-coding genes)

Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human
protein Proteins are large biomolecule , showing alpha helices, represented by ribbons. This poten was the first to have its suckture solved by X-ray crystallography by Max Perutz and Sir John Cowdery Kendrew in 1958, for which they received a No ...

protein
s, although several biological processes (e.g. DNA rearrangements and ) can lead to the production of many more unique proteins than the number of protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the
exome
, and consists of DNA sequences encoded by
exons that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project.

Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as UniProt. Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early 1970s that the estimated mutational load from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes). The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the
roundworm. This may be explained by the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.

Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000 per chromosome, with an especially high gene density within chromosomes 1, 11, and 19. Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and
GC-content
. The significance of these nonrandom patterns of gene density is not well understood.

Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability. For example, the gene for histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding a 781 nucleotide-long mRNA that produces a 215 amino acid protein from its 648 nucleotide open reading frame.
Dystrophin
(DMD) was the largest protein-coding gene in the 2001 human reference genome, spanning a total of 2.2 million nucleotides, while more recent systematic meta-analysis of updated human genome data identified an even larger protein-coding gene, ''RBFOX1'' (RNA binding protein, fox-1 homolog 1), spanning a total of 2.47 million nucleotides.
Titin
(TTN) has the longest coding sequence (114,414 nucleotides), the largest number of
exons
(363), and the longest single exon (17,106 nucleotides). As estimated based on a curated set of protein-coding genes over the whole genome, the median size is 26,288 nucleotides (mean = 66,577), the median exon size, 133 nucleotides (mean = 309), the median number of exons, 8 (mean = 11), and the median encoded protein is 425 amino acids (mean = 553) in length.
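
The histone H1a figures quoted above can be checked with simple arithmetic. The sketch below assumes the open reading frame is counted from the start codon through the stop codon, so that it spans one codon per amino acid plus one stop codon; the function names are illustrative only.

```python
# Figures quoted in the text for histone H1a.
MRNA_LENGTH_NT = 781
ORF_LENGTH_NT = 648
PROTEIN_LENGTH_AA = 215


def orf_length(protein_length_aa: int) -> int:
    """Nucleotides in an ORF: one codon per residue plus one stop codon.

    Assumption: the ORF is measured from the start codon through the stop codon.
    """
    return 3 * (protein_length_aa + 1)


def untranslated_length(mrna_length_nt: int, orf_length_nt: int) -> int:
    """Nucleotides of the mRNA that lie outside the ORF (5' and 3' UTRs combined)."""
    return mrna_length_nt - orf_length_nt


print(orf_length(PROTEIN_LENGTH_AA))                       # 648
print(untranslated_length(MRNA_LENGTH_NT, ORF_LENGTH_NT))  # 133
```

Under this assumption, the 215 amino acid protein accounts for exactly the 648 nucleotide open reading frame, leaving 133 nucleotides of untranslated sequence in the mRNA.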


Noncoding DNA (ncDNA)

Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA. Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements. Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA). Protein-coding sequences (specifically, coding
exons) constitute less than 1.5% of the human genome. In addition, about 26% of the human genome is
introns
. Aside from genes (exons and introns) and known regulatory sequences (8–20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. Recent analysis by the
ENCODE
project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity. However, it remains controversial whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism. Excluding protein-coding sequences, introns, and regulatory regions, much of the non-coding DNA is composed of:

Many DNA sequences that do not play a role in gene expression have important biological functions.
Comparative genomics
studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong
evolutionary pressure and purifying selection
. Many of these sequences regulate the structure of chromosomes by limiting the regions of
heterochromatin
formation and regulating structural features of the chromosomes, such as the
telomeres and centromeres
. Other noncoding regions serve as . Finally, several regions are transcribed into functional noncoding RNAs that regulate the expression of protein-coding genes, mRNA translation and stability (see miRNA), chromatin structure (including histone modifications), DNA methylation, and DNA recombination, and that cross-regulate other noncoding RNAs. It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.


Pseudogenes

Pseudogenes are inactive copies of protein-coding genes, often generated by
gene duplication
, that have become nonfunctional through the accumulation of inactivating mutations. The number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during
molecular evolution
. For example, the
olfactory receptor
gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.


Genes for noncoding RNA (ncRNA)

Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of
protein synthesis
and
RNA processing
. Noncoding RNAs include tRNA, ribosomal RNA, microRNA, snRNA and other non-coding RNA genes, including about 60,000 long non-coding RNAs (lncRNAs). Although the number of reported lncRNA genes continues to rise and the exact number in the human genome is yet to be defined, many of them are argued to be non-functional. Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.


Introns and untranslated regions of mRNA

In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of protein coding genes usually contain extensive noncoding sequences, in the form of
introns
, 5'-untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding genes of the human genome, the length of intron sequences is 10- to 100-times the length of exon sequences.


Regulatory DNA sequences

The human genome has many different regulatory sequences which are crucial to controlling
gene expression
. Conservative estimates indicate that these sequences make up 8% of the genome, however extrapolations from the
ENCODE
project suggest that 20-40% of the genome is gene regulatory sequence. Some types of non-coding DNA are genetic "switches" that do not encode proteins, but do regulate when and where genes are expressed (called enhancers). Regulatory sequences have been known since the late 1960s. The first identification of regulatory sequences in the human genome relied on recombinant DNA technology. Later, with the advent of genomic sequencing, these sequences could be inferred from evolutionary conservation. The evolutionary branch between the primates and mouse, for example, occurred 70–90 million years ago. Computer comparisons of gene sequences that identify conserved non-coding sequences therefore give an indication of their importance in functions such as gene regulation. Other genomes have been sequenced with the same intention of aiding conservation-guided methods, for example the pufferfish genome. However, regulatory sequences disappear and re-evolve during evolution at a high rate. As of 2012, efforts have shifted toward finding interactions between DNA and regulatory proteins by the technique ChIP-Seq, or gaps where the DNA is not packaged by histones (DNase hypersensitive sites), both of which indicate where there are active regulatory sequences in the investigated cell type.


Repetitive DNA sequences

Repetitive DNA sequences comprise approximately 50% of the human genome. About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies (e.g. "CAGCAGCAG..."). The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for genealogical DNA testing and forensic DNA analysis. Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)n) are termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of particular importance, as they sometimes occur within coding regions of genes for proteins and may lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)n within the ''Huntingtin'' gene on human chromosome 4. Telomeres (the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)n. Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides long) are termed minisatellites.
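
As a rough illustration of what a tandem repeat such as (CAG)n or (TTAGGG)n looks like computationally, the sketch below counts the longest uninterrupted run of a given motif in a sequence. The function name and the toy sequences are hypothetical; real repeat genotyping tools must also handle sequencing errors and interrupted repeats, which this sketch ignores.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # Match one or more adjacent copies of the motif, case-insensitively.
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Toy sequences, not real genomic data.
print(longest_tandem_run("TTCAGCAGCAGGACAGCAG", "CAG"))  # 3
print(longest_tandem_run("TTAGGG" * 4, "TTAGGG"))        # 4
```

Because the copy number n of such a repeat differs between individuals, comparing the value returned for the same locus in different samples is, in simplified form, the idea behind repeat-based DNA profiling.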


Mobile genetic elements (transposons) and their relics

Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, ''Alu'', has about 50,000 active copies, and can be inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active copies per genome (the number varies between people). Together with non-functional relics of old transposons, they account for over half of total human DNA. Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent endogenous retroviruses, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations. Mobile elements within the human genome can be classified into LTR retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu elements, LINEs (20.4% of total genome), SVAs and Class II DNA transposons (2.9% of total genome).


Genomic variation in humans


Human reference genome

With the exception of identical twins, all humans show significant variation in genomic DNA sequences. The human
reference genome
(HRG) is used as a standard sequence reference. There are several important points concerning the human reference genome:
* The HRG is a haploid sequence. Each chromosome is represented once.
* The HRG is a composite sequence, and does not correspond to any actual human individual.
* The HRG is periodically updated to correct errors, ambiguities, and unknown "gaps".
* The HRG in no way represents an "ideal" or "perfect" human individual. It is simply a standardized representation or model that is used for comparative purposes.
The Genome Reference Consortium is responsible for updating the HRG. Version 38 was released in December 2013.


Measuring human genetic variation

Most studies of human genetic variation have focused on single-nucleotide polymorphisms (SNPs), which are substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur 1 in 1000 base pairs, on average, in the euchromatic human genome, although they do not occur at a uniform density. Thus follows the popular statement that "we are all, regardless of race, genetically 99.9% the same", although this would be somewhat qualified by most geneticists. For example, a much larger fraction of the genome is now thought to be involved in copy number variation. A large-scale collaborative effort to catalog SNP variations in the human genome is being undertaken by the International HapMap Project. The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, which is the basis of DNA fingerprinting and DNA paternity testing technologies. The heterochromatic portions of the human genome, which total several hundred million base pairs, are also thought to be quite variable within the human population (they are so repetitive and so long that they cannot be accurately sequenced with current technology). These regions contain few genes, and it is unclear whether any significant phenotypic effect results from typical variation in repeats or heterochromatin. Most gross genomic mutations in gamete germ cells probably result in inviable embryos; however, a number of human diseases are related to large-scale genomic abnormalities. Down syndrome, Turner Syndrome, and a number of other diseases result from nondisjunction of entire chromosomes. Cancer cells frequently have aneuploidy of chromosomes and chromosome arms, although a cause and effect relationship between aneuploidy and cancer has not been established.
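
The "99.9% the same" figure follows directly from the estimate of one SNP per 1000 base pairs. A minimal sketch of that calculation is given below, assuming two already-aligned sequences of equal length and counting only single-base substitutions; the function name and the toy sequences are illustrative, and copy number or structural variation is deliberately left out.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two sequences carry the same base."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100 * matches / len(seq_a)


# Toy example: 1000 aligned bases differing at a single position.
reference = "A" * 1000
sample = "A" * 499 + "G" + "A" * 500
print(percent_identity(reference, sample))  # 99.9
```

As the text notes, this substitution-only view understates the total difference between two genomes once copy number variation is taken into account.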


Mapping human genomic variation

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome. An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence variation." It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or bases. Researchers published the first sequence-based map of large-scale structural variation across the human genome in the journal ''Nature'' in May 2008. Large-scale structural variations are differences in the genome among people that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome sequence and others appear as re-arrangements of stretches of sequence. These variations include copy number variation (differences in the number of copies individuals have of a particular gene), deletions, translocations and inversions.


Structural variation

Structural variation refers to genetic variants that affect larger segments of the human genome, as opposed to point mutations. Often, structural variants (SVs) are defined as variants of 50 base pairs (bp) or greater, such as deletions, duplications, insertions, inversions and other rearrangements. About 90% of structural variants are noncoding deletions, but most individuals have more than a thousand such deletions; the size of deletions ranges from dozens of base pairs to tens of thousands of bp. On average, individuals carry ~3 rare structural variants that alter coding regions, e.g. delete exons. About 2% of individuals carry ultra-rare megabase-scale structural variants, especially rearrangements. That is, millions of base pairs may be inverted within a chromosome; ultra-rare means that they are only found in individuals or their family members and thus have arisen very recently.
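
The 50 bp convention mentioned above can be expressed as a simple size rule. The sketch below is an illustration only: it assumes a variant is described by a reference allele and an alternative allele given as plain strings, and the category labels and function name are hypothetical rather than taken from any particular variant-calling tool.

```python
SV_MIN_LENGTH_BP = 50  # common size threshold for structural variants, as stated in the text


def classify_variant(ref_allele: str, alt_allele: str) -> str:
    """Classify a variant by the length of sequence it affects."""
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "point mutation"
    affected_bp = max(len(ref_allele), len(alt_allele))
    if affected_bp >= SV_MIN_LENGTH_BP:
        return "structural variant"
    return "small variant"


print(classify_variant("A", "G"))       # point mutation
print(classify_variant("AT", "A"))      # small variant
print(classify_variant("A" * 60, "A"))  # structural variant
```

A size rule of this kind says nothing about the type of rearrangement (deletion, duplication, inversion and so on), which has to be determined separately.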


SNP frequency across the human genome

Single-nucleotide polymorphisms (SNPs) do not occur homogeneously across the human genome. In fact, there is enormous diversity in SNP frequency between genes, reflecting different selective pressures on each gene as well as different mutation and recombination rates across the genome. However, because studies on SNPs are biased towards coding regions, the data generated from them are unlikely to reflect the overall distribution of SNPs throughout the genome. Therefore, the SNP Consortium protocol was designed to identify SNPs with no bias towards coding regions, and the Consortium's 100,000 SNPs generally reflect sequence diversity across the human chromosomes. The SNP Consortium aimed to expand the number of SNPs identified across the genome to 300,000 by the end of the first quarter of 2001. Changes in non-coding sequence and synonymous changes in coding sequence are generally more common than non-synonymous changes, reflecting greater selective pressure reducing diversity at positions dictating amino acid identity. Transitional changes are more common than transversions, with CpG dinucleotides showing the highest mutation rate, presumably due to deamination.
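
The distinction between transitions and transversions depends only on the chemical class of the two bases involved: a transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), while a transversion swaps one class for the other. A minimal sketch of that rule follows; the function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref_base: str, alt_base: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    valid = PURINES | PYRIMIDINES
    if ref_base not in valid or alt_base not in valid or ref_base == alt_base:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("C", "T"))  # transition
print(classify_substitution("A", "C"))  # transversion
```

The C to T change in the usage example is the substitution typically produced by deamination at CpG sites, the mechanism the text mentions for their elevated mutation rate.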


Personal genomes

A personal genome sequence is a (nearly) complete sequence of the chemical base pairs that make up the DNA of a single person. Because medical treatments have different effects on different people due to genetic variations such as single-nucleotide polymorphisms (SNPs), the analysis of personal genomes may lead to personalized medical treatment based on individual genotypes. The first personal genome sequence to be determined was that of Craig Venter in 2007. Personal genomes had not been sequenced in the public Human Genome Project to protect the identity of volunteers who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse population. However, early in the Venter-led Celera Genomics genome sequencing effort the decision was made to switch from sequencing a composite sample to using DNA from a single individual, later revealed to have been Venter himself. Thus the Celera human genome sequence released in 2000 was largely that of one man. Subsequent replacement of the early composite-derived data and determination of the diploid sequence, representing both sets of chromosomes, rather than the haploid sequence originally reported, allowed the release of the first personal genome. In April 2008, that of James Watson was also completed. In 2009, Stephen Quake published his own genome sequence derived from a sequencer of his own design, the Heliscope. A Stanford team led by Euan Ashley published a framework for the medical interpretation of human genomes implemented on Quake’s genome and made whole genome-informed medical decisions for the first time. That team further extended the approach to the West family, the first family sequenced as part of Illumina’s Personal Genome Sequencing program. Since then hundreds of personal genome sequences have been released, including those of Desmond Tutu, and of a Paleo-Eskimo. In 2012, the whole genome sequences of two family trios among 1092 genomes were made public. In November 2013, a Spanish family made four personal exome datasets (about 1% of the genome) publicly available under a Creative Commons public domain license. The Personal Genome Project (started in 2005) is among the few to make both genome sequences and corresponding medical phenotypes publicly available. The sequencing of individual genomes further unveiled levels of genetic complexity that had not been appreciated before. Personal genomics helped reveal the significant level of diversity in the human genome attributed not only to SNPs but to structural variations as well. However, the application of such knowledge to the treatment of disease and in the medical field is only in its very early stages. Exome sequencing has become increasingly popular as a tool to aid in diagnosis of genetic disease because the exome contributes only 1% of the genomic sequence but accounts for roughly 85% of mutations that contribute significantly to disease.


Human knockouts

In humans, gene knockouts naturally occur as heterozygous or homozygous loss-of-function gene knockouts. These knockouts are often difficult to distinguish, especially within heterogeneous genetic backgrounds. They are also difficult to find as they occur in low frequencies. Populations with high rates of consanguinity, such as countries with high rates of first-cousin marriages, display the highest frequencies of homozygous gene knockouts. Such populations include Pakistan, Iceland, and Amish populations. These populations with a high level of parental relatedness have been subjects of human knockout research, which has helped to determine the function of specific genes in humans. By distinguishing specific knockouts, researchers are able to use phenotypic analyses of these individuals to help characterize the gene that has been knocked out. Knockouts in specific genes can cause genetic diseases, potentially have beneficial effects, or even result in no phenotypic effect at all. However, determining a knockout's phenotypic effect in humans can be challenging. Challenges to characterizing and clinically interpreting knockouts include difficulty in calling DNA variants, determining disruption of protein function (annotation), and considering the amount of influence mosaicism has on the phenotype. One major study that investigated human knockouts is the Pakistan Risk of Myocardial Infarction study. It was found that individuals possessing a heterozygous loss-of-function gene knockout for the APOC3 gene had lower triglycerides in the blood after consuming a high fat meal as compared to individuals without the mutation. However, individuals possessing homozygous loss-of-function gene knockouts of the APOC3 gene displayed the lowest level of triglycerides in the blood after the fat load test, as they produce no functional APOC3 protein.


Human genetic disorders

Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene and is the most common recessive disorder in Caucasian populations, with over 1,300 different mutations known. Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare; thus genetic disorders are similarly individually rare. However, since there are many genes that can vary to cause genetic disorders, in aggregate they constitute a significant component of known medical conditions, especially in pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has been identified. Currently there are approximately 2,200 such disorders annotated in the Online Mendelian Inheritance in Man (OMIM) database. Studies of genetic disorders are often performed by means of family-based studies. In some instances, population-based approaches are employed, particularly in the case of so-called founder populations such as those in Finland, French Canada, Utah, Sardinia, etc. Diagnosis and treatment of genetic disorders are usually performed by a geneticist-physician trained in clinical/medical genetics. The results of the Human Genome Project are likely to provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. Parents can be screened for hereditary conditions and counselled on the consequences, the probability of inheritance, and how to avoid or ameliorate the condition in their offspring. There are many different kinds of DNA sequence variation, ranging from complete extra or missing chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic variation in human populations is phenotypically neutral, i.e., has little or no detectable effect on the physiology of the individual (although there may be fractional differences in fitness defined over evolutionary time frames). Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the clinical disease under investigation. Such studies constitute the realm of human molecular genetics. With the advent of the Human Genome Project and the International HapMap Project, it has become feasible to explore subtle genetic influences on many common disease conditions such as diabetes, asthma, migraine, schizophrenia, etc. Although some causal links have been made between genomic sequence variants in particular genes and some of these diseases, often with much publicity in the general media, these are usually not considered to be genetic disorders ''per se'' as their causes are complex, involving many different genetic and environmental factors. Thus there may be disagreement in particular cases whether a specific medical condition should be termed a genetic disorder.
Additional genetic disorders of note are Kallmann syndrome and Pfeiffer syndrome (gene FGFR1), Fuchs corneal dystrophy (gene TCF4), Hirschsprung's disease (genes RET and FECH), Bardet-Biedl syndrome 1 (genes CCDC28B and BBS1), Bardet-Biedl syndrome 10 (gene BBS10), and facioscapulohumeral muscular dystrophy type 2 (genes D4Z4 and SMCHD1). Genome sequencing can now narrow the search to specific locations in the genome and so find disease-causing mutations more accurately. Copy number variants (CNVs) and single nucleotide variants (SNVs) can also be detected at the same time as genome sequencing with newer sequencing procedures, called Next Generation Sequencing (NGS). This only analyzes a small portion of the genome, around 1-2%. The results of this sequencing can be used for clinical diagnosis of a genetic condition, including Usher syndrome, retinal disease, hearing impairments, diabetes, epilepsy, Leigh disease, hereditary cancers, neuromuscular diseases, primary immunodeficiencies, severe combined immunodeficiency (SCID), and diseases of the mitochondria. NGS can also be used to identify carriers of diseases before conception. The diseases that can be detected in this way include Tay–Sachs disease, Bloom syndrome, Gaucher disease, Canavan disease, familial dysautonomia, cystic fibrosis, spinal muscular atrophy, and fragile X syndrome. Next Generation Sequencing can also be narrowed down to look specifically for diseases more prevalent in certain ethnic populations.


Evolution

Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been conserved by evolution since the divergence of extant lineages approximately 200 million years ago, containing the vast majority of genes. The published chimpanzee genome differs from the human genome by 1.23% in direct sequence comparisons. Around 20% of this figure is accounted for by variation within each species, leaving only ~1.06% consistent sequence divergence between humans and chimps at shared genes. This nucleotide-by-nucleotide difference is dwarfed, however, by the portion of each genome that is not shared, including around 6% of functional genes that are unique to either humans or chimps. In other words, the considerable observable differences between humans and chimps may be due as much or more to genome-level variation in the number, function and expression of genes as to DNA sequence changes in shared genes. Indeed, even within humans, there has been found to be a previously unappreciated amount of copy number variation (CNV), which can make up as much as 5–15% of the human genome. In other words, between humans, there could be a difference of ±500,000,000 base pairs of DNA, some being active genes, others inactivated, or active at different levels. The full significance of this finding remains to be seen. On average, a typical human protein-coding gene differs from its chimpanzee ortholog by only two amino acid substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee orthologs. A major difference between the two genomes is human chromosome 2, which is equivalent to a fusion product of chimpanzee chromosomes 12 and 13 (later renamed chromosomes 2A and 2B, respectively). Humans have undergone an extraordinary loss of olfactory receptor genes during our recent evolution, which explains our relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that the emergence of color vision in humans and several other primate species has diminished the need for the sense of smell. In September 2016, scientists reported that, based on human DNA genetic studies, all non-Africans in the world today can be traced to a single population that exited Africa between 50,000 and 80,000 years ago.
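The copy number variation figure in this section can be checked with simple arithmetic. The sketch below converts the 5–15% range into base pairs, assuming a haploid genome length of roughly 3.1 billion base pairs; that length is an assumption introduced here for illustration and is not stated in the passage.

```python
# Assumed approximate haploid human genome length in base pairs.
# This value is an illustrative assumption, not a figure from the text above.
GENOME_SIZE_BP = 3_100_000_000


def cnv_span_bp(percent, genome_size_bp=GENOME_SIZE_BP):
    """Number of base pairs covered by `percent` percent of the genome."""
    return percent * genome_size_bp // 100


low = cnv_span_bp(5)    # lower bound of the 5-15% range
high = cnv_span_bp(15)  # upper bound of the 5-15% range

print(f"{low:,} to {high:,} base pairs")
```

Under that assumed genome length, the upper end of the range comes to a little under half a billion base pairs, which is the order of magnitude of the 500,000,000 figure quoted in the text.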


Mitochondrial DNA

The human mitochondrial DNA is of tremendous interest to geneticists, since it undoubtedly plays a role in mitochondrial disease. It also sheds light on human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of a recent common ancestor for all humans on the maternal line of descent (see Mitochondrial Eve). Due to the lack of a system for checking for copying errors, mitochondrial DNA (mtDNA) has a more rapid rate of variation than nuclear DNA. This 20-fold higher mutation rate allows mtDNA to be used for more accurate tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, such as the migration of Native Americans from Siberia or of Polynesians from southeastern Asia. It has also been used to show that there is no trace of Neanderthal DNA in the European gene mixture inherited through purely maternal lineage. Due to the restrictive all-or-none manner of mtDNA inheritance, this result (no trace of Neanderthal mtDNA) would be likely unless there were a large percentage of Neanderthal ancestry, or there was strong positive selection for that mtDNA. For example, going back 5 generations, only 1 of a person's 32 ancestors contributed to that person's mtDNA, so if one of these 32 was pure Neanderthal, an expected ~3% of that person's autosomal DNA would be of Neanderthal origin, yet they would have a ~97% chance of having no trace of Neanderthal mtDNA.
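The five-generation example above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes no overlap among ancestors and that the single Neanderthal ancestor is equally likely to occupy any of the ancestral positions; the function names are illustrative only.

```python
from fractions import Fraction


def ancestors_at(generations):
    """Number of ancestors in the given generation, assuming no shared ancestors."""
    return 2 ** generations


def expected_autosomal_share(generations):
    """Expected fraction of autosomal DNA inherited from one ancestor of that generation."""
    return Fraction(1, ancestors_at(generations))


def chance_mtdna_not_from_ancestor(generations):
    """Chance that a given ancestor of that generation is NOT the mtDNA source.

    Only one ancestor per generation lies on the purely maternal line, and the
    ancestor of interest is assumed equally likely to be any of them.
    """
    return 1 - Fraction(1, ancestors_at(generations))


g = 5
print(ancestors_at(g))                           # 32 ancestors
print(float(expected_autosomal_share(g)))        # 0.03125, about 3%
print(float(chance_mtdna_not_from_ancestor(g)))  # 0.96875, about 97%
```

The two fractions, 1/32 and 31/32, correspond to the ~3% autosomal share and the ~97% chance of carrying no Neanderthal mtDNA quoted in the text.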


Epigenome

Epigenetics describes a variety of features of the human genome that transcend its primary DNA sequence, such as chromatin packaging, histone modifications and DNA methylation, and which are important in regulating gene expression, genome replication and other cellular processes. Epigenetic markers strengthen and weaken transcription of certain genes but do not affect the actual sequence of DNA nucleotides. DNA methylation is a major form of epigenetic control over gene expression and one of the most highly studied topics in epigenetics. During development, the human DNA methylation profile experiences dramatic changes. In early germ line cells, the genome has very low methylation levels. Low methylation is generally associated with active genes. As development progresses, parental imprinting tags lead to increased methylation activity. Epigenetic patterns can be identified between tissues within an individual as well as between individuals themselves. Identical genes that differ only in their epigenetic state are called epialleles. Epialleles can be placed into three categories: those directly determined by an individual's genotype, those influenced by genotype, and those entirely independent of genotype. The epigenome is also influenced significantly by environmental factors. Diet, toxins, and hormones impact the epigenetic state. Studies in dietary manipulation have demonstrated that methyl-deficient diets are associated with hypomethylation of the epigenome. Such studies establish epigenetics as an important interface between the environment and the genome.


See also

* Human Genome Organisation
* Genome Reference Consortium
* Human Genome Project
* Genetics
* Genomics
* Genographic Project
* Genomic organization
* Low copy repeats
* Non-coding DNA
* Whole genome sequencing
* Universal Declaration on the Human Genome and Human Rights




External links


* Ensembl: The Ensembl Genome Browser Project
* National Library of Medicine human genome viewer
* UCSC Genome Browser
* Human Genome Project
* The National Human Genome Research Institute
* Simple Human Genome viewer