The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization), consist of 3,054,815,472 DNA base pairs (if the X chromosome is used), while female diploid genomes (found in somatic cells) have twice the DNA content.

While there are significant differences among the genomes of human individuals (on the order of 0.1% due to single-nucleotide variants and 0.6% when considering indels), these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees (~1.1% fixed single-nucleotide variants and 4% when including indels). Size in base pairs can vary too; telomere length decreases after every round of DNA replication.

Although the sequence of the human genome has been completely determined by DNA sequencing, it is not yet fully understood. Most, but not all, genes have been identified by a combination of high-throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products (in particular, annotation of the complete CHM13v2.0 sequence is still ongoing). Overlapping genes are quite common, in some cases allowing two protein-coding genes, one on each strand, to use the same base pairs (for example, genes DCDC2 and KAAG1). Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance. There are also a significant number of retroviruses in human DNA, at least 3 of which have been proven to possess an important function (i.e., HIV-like HERV-K, HERV-W, and HERV-FRD play a role in placenta formation by inducing cell-cell fusion).

In 2003, scientists reported the sequencing of 85% of the entire human genome, but as of 2020 at least 8% was still missing. In 2021, scientists reported sequencing the complete female genome (i.e., without the Y chromosome). This sequence identified 19,969 protein-coding sequences, accounting for approximately 1.5% of the genome, and 63,494 genes in total, most of them being non-coding RNA genes. The genome consists of regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined. The human Y chromosome, consisting of 62,460,029 base pairs from a different cell line and found in all males, was sequenced completely in January 2022.

Sequencing

The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence, leaving just 341 gaps in the sequence, representing highly repetitive and other DNA that could not be sequenced with the technology available at the time. The human genome was the first of all vertebrates to be sequenced to such near-completion, and as of 2018, the diploid genomes of over a million individual humans had been determined using next-generation sequencing.

These data are used worldwide in biomedical science, anthropology, forensics and other branches of science. Such genomic studies have led to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution.

By 2018, the total number of genes had been raised to at least 46,831, plus another 2,300 micro-RNA genes. A 2018 population survey found another 300 million bases of human genome that were not in the reference sequence. Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000.

In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome.

In 2022 the Telomere-to-Telomere (T2T) consortium reported the complete sequence of a human female genome, filling all the gaps in the X chromosome (2020) and the 22 autosomes (May 2021). The previously unsequenced parts contain immune response genes that help to adapt to and survive infections, as well as genes that are important for predicting drug response. The completed human genome sequence will also provide better understanding of human formation as an individual organism and of how humans vary both among themselves and from other species.

Achieving completeness

Although the 'completion' of the Human Genome Project was announced in 2001, there remained hundreds of gaps, with about 5–10% of the total sequence remaining undetermined. The missing genetic information was mostly in repetitive heterochromatic regions and near the centromeres and telomeres, but also in some gene-encoding euchromatic regions. There remained 160 euchromatic gaps in 2015 when the sequences spanning another 50 formerly unsequenced regions were determined. Only in 2020 was the first truly complete telomere-to-telomere sequence of a human chromosome determined, namely of the X chromosome. The first complete telomere-to-telomere sequence of a human autosomal chromosome, chromosome 8, followed a year later. The complete human genome (without the Y chromosome) was published in 2021, and the genome including the Y chromosome in January 2022.

Molecular organization and gene content

The total lengths given below refer to the human reference genome, which does not represent the sequence of any specific individual. The genome is organized into 22 paired chromosomes, termed autosomes, plus the 23rd pair of sex chromosomes, (XX) in the female and (XY) in the male. The haploid genome is 3,054,815,472 base pairs when the X chromosome is included, and 2,963,015,935 base pairs when the Y chromosome is substituted for the X chromosome. These chromosomes are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in multiple copies in each mitochondrion.
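The haploid totals quoted in this section can be cross-checked with simple arithmetic. The Python sketch below is illustrative only: it assumes that the Y-substituted haploid total uses the 62,460,029 bp Y chromosome length quoted earlier in this article, and the variable names are invented for the example.

```python
# Haploid reference totals as quoted in the text (base pairs).
HAPLOID_WITH_X = 3_054_815_472
HAPLOID_WITH_Y = 2_963_015_935
# Y chromosome length quoted earlier in the article; assumed here to be
# the same Y that is used in the Y-substituted total.
Y_LENGTH = 62_460_029

# Swapping X for Y changes the total by (X - Y), so X can be inferred.
x_minus_y = HAPLOID_WITH_X - HAPLOID_WITH_Y
implied_x_length = x_minus_y + Y_LENGTH

# Diploid nuclear totals: XX carries two X-containing sets,
# XY carries one X-containing set and one Y-containing set.
diploid_xx = 2 * HAPLOID_WITH_X
diploid_xy = HAPLOID_WITH_X + HAPLOID_WITH_Y

print(f"X - Y difference:  {x_minus_y:,} bp")
print(f"implied X length:  {implied_x_length:,} bp")
print(f"diploid XX genome: {diploid_xx:,} bp")
print(f"diploid XY genome: {diploid_xy:,} bp")
```

These totals cover the nuclear chromosomes only; the mitochondrial DNA is not included in them.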

Information content

The haploid human genome (23 chromosomes) is about 3 billion base pairs long and contains around 30,000 genes. Since every base pair can be coded by 2 bits, this is about 750 megabytes of data. An individual somatic (diploid) cell contains twice this amount, that is, about 6 billion base pairs. Males have fewer base pairs than females because the Y chromosome is about 57 million base pairs whereas the X is about 156 million. Since individual genomes vary in sequence by less than 1% from each other, the variations of a given human's genome from a common reference can be losslessly compressed to roughly 4 megabytes.

The entropy rate of the genome differs significantly between coding and non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million base pairs), but less for the non-coding parts. It ranges between 1.5 and 1.9 bits per base pair for the individual chromosomes, except for the Y chromosome, which has an entropy rate below 0.9 bits per base pair, as estimated using Lempel-Ziv estimators of entropy rate.
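The storage estimate follows from packing each base into 2 bits. The Python sketch below reproduces that arithmetic and shows one possible 2-bit packing; the base-to-code mapping and the function names are assumptions made for illustration and do not correspond to any particular file format.

```python
# Assumed mapping of bases to 2-bit codes (any fixed assignment works).
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i is the base whose code is i


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_megabytes(base_pairs: int) -> float:
    """Size in megabytes (10**6 bytes) at 2 bits per base pair."""
    return base_pairs * 2 / 8 / 1_000_000


print(packed_megabytes(3_000_000_000))  # haploid, about 3 billion bp
print(packed_megabytes(6_000_000_000))  # diploid, about 6 billion bp
print(unpack(pack("GATTACA"), 7))       # round trip of a short sequence
```

This fixed-width encoding only handles the four unambiguous bases; real sequence files also have to represent unknown or ambiguous positions, which this sketch deliberately ignores.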

Coding vs. noncoding DNA

The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (approx. 98% of the genome) that are not used to encode proteins.

Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity. It is, however, disputed whether molecular activity (transcription of DNA into RNA) alone implies that the RNA produced has a meaningful biological function, since experiments have shown that random nonfunctional DNA will also reproducibly recruit transcription factors, resulting in transcription into nonfunctional RNA.

There is no consensus on what constitutes a "functional" element in the genome since geneticists, evolutionary biologists, and molecular biologists employ different definitions and methods. In evolutionary definitions, "functional" DNA, whether it is coding or non-coding, contributes to the fitness of the organism, and therefore is maintained by negative evolutionary pressure, whereas "non-functional" DNA has no benefit to the organism and therefore is under neutral selective pressure. This type of DNA has been described as junk DNA. In genetic definitions, "functional" DNA is related to how DNA segments manifest by phenotype and "nonfunctional" is related to loss-of-function effects on the organism. In biochemical definitions, "functional" DNA relates to DNA sequences that specify molecular products (e.g. noncoding RNAs) and biochemical activities with mechanistic roles in gene or genome regulation (i.e. DNA sequences that impact cellular level activity such as cell type, condition, and molecular processes).

There is no consensus in the literature on the amount of functional DNA since, depending on how "function" is understood, estimates range from up to 90% of the human genome being nonfunctional DNA (junk DNA) to up to 80% of the genome being functional. It is also possible that junk DNA may acquire a function in the future and therefore may play a role in evolution, but this is likely to occur only very rarely. Finally, DNA that is deleterious to the organism and is under negative selective pressure is called garbage DNA.

Because non-coding DNA greatly outnumbers coding DNA, the concept of the sequenced genome has become a more focused analytical concept than the classical concept of the DNA-coding gene.

Coding sequences (protein-coding genes)

Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human

s, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can lead to the production of many more unique proteins than the number of protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the

exome The exome is composed of all of the exons within the genome, the sequences which, when transcribed, remain within the mature RNA after introns are removed by RNA splicing. This includes untranslated regions of messenger RNA (mRNA), and coding re ...

, and consists of DNA sequences encoded by

exon An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term ''exon'' refers to both the DNA sequence within a gene and to the corresponding sequen ...

s that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project. Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as

Uniprot UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from ...

. Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early 1970s that the estimated

mutational load Genetic load is the difference between the fitness of an average genotype in a population and the fitness of some reference genotype, which may be either the best present in a population, or may be the theoretically optimal genotype. The average ...

from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes). The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. This difference may result from the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons. Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000, with an especially high

gene density In genetics, the gene density of an organism's genome is the ratio of the number of genes per number of base pairs, usually written in terms of a million base pairs, or ''megabase'' (Mb). The human genome has a gene density of 11-15 genes/Mb, while ...

within chromosomes 1, 11, and 19. Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood. Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability. For example, the gene for

histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding a 781-nucleotide mRNA that produces a 215-amino-acid protein from its 648-nucleotide open reading frame.

Dystrophin

(DMD) was the largest protein-coding gene in the 2001 human reference genome, spanning a total of 2.2 million nucleotides, while more recent systematic meta-analysis of updated human genome data identified an even larger protein-coding gene, ''RBFOX1'' (RNA binding protein, fox-1 homolog 1), spanning a total of 2.47 million nucleotides.

Titin

(TTN) has the longest coding sequence (114,414 nucleotides), the largest number of exons (363), and the longest single exon (17,106 nucleotides). As estimated based on a curated set of protein-coding genes over the whole genome, the median size is 26,288 nucleotides (mean = 66,577), the median exon size, 133 nucleotides (mean = 309), the median number of exons, 8 (mean = 11), and the median encoded protein is 425 amino acids (mean = 553) in length.
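As a quick consistency check on the histone H1a figures above, the following minimal Python sketch converts an open reading frame length into a protein length. It assumes that the quoted ORF length includes the stop codon; the function name `protein_length_from_orf` is illustrative and not taken from any library.

```python
def protein_length_from_orf(orf_nt: int) -> int:
    """Amino acids encoded by an ORF whose length includes the stop codon."""
    if orf_nt <= 0 or orf_nt % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    # Each codon is 3 nucleotides; the final (stop) codon adds no residue.
    return orf_nt // 3 - 1


mrna_nt = 781  # histone H1a mRNA length quoted in the text
orf_nt = 648   # histone H1a open reading frame length quoted in the text

print(protein_length_from_orf(orf_nt))  # 215 amino acids, matching the text
print(mrna_nt - orf_nt)                 # 133 nucleotides of untranslated mRNA
```

The same arithmetic is not applied to titin here, because the text does not say whether its 114,414-nucleotide coding sequence is counted with or without the stop codon.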

Noncoding DNA (ncDNA)

Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA. Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements. Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA). Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human genome. In addition, about 26% of the human genome is introns

. Aside from genes (exons and introns) and known regulatory sequences (8–20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. Recent analysis by the ENCODE project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity. However, it remains controversial whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism. Excluding protein-coding sequences, introns, and regulatory regions, much of the non-coding DNA is composed of: Many DNA sequences that do not play a role in gene expression have important biological functions. Comparative genomics studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong

evolutionary pressure and purifying selection

. Many of these sequences regulate the structure of chromosomes by limiting the regions of heterochromatin formation and regulating structural features of the chromosomes, such as the

telomeres and centromeres

. Other noncoding regions serve as origins of DNA replication. Finally, several regions are transcribed into functional noncoding RNAs that regulate the expression of protein-coding genes, mRNA translation and stability (see miRNA), chromatin structure (including histone modifications), DNA methylation, and DNA recombination, and that cross-regulate other noncoding RNAs. It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.

Pseudogenes

Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. The number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during

molecular evolution

. For example, the

olfactory receptor

gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.

Genes for noncoding RNA (ncRNA)

Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and

RNA processing

. Noncoding RNAs include tRNA, ribosomal RNA, microRNA, snRNA, and other non-coding RNAs, including about 60,000 long non-coding RNAs (lncRNAs). Although the number of reported lncRNA genes continues to rise and the exact number in the human genome is yet to be defined, many of them are argued to be non-functional. Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.

Introns and untranslated regions of mRNA

In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of protein-coding genes usually contain extensive noncoding sequences, in the form of introns, 5'-untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding genes of the human genome, the length of intron sequences is 10 to 100 times the length of exon sequences.

Regulatory DNA sequences

The human genome has many different

regulatory sequences which are crucial to controlling gene expression. Conservative estimates indicate that these sequences make up 8% of the genome; however, extrapolations from the ENCODE project suggest that 20-40% of the genome is gene regulatory sequence. Some types of non-coding DNA are genetic "switches" that do not encode proteins, but do regulate when and where genes are expressed (called enhancers

). Regulatory sequences have been known since the late 1960s. The first identification of regulatory sequences in the human genome relied on recombinant DNA technology. Later with the advent of genomic sequencing, the identification of these sequences could be inferred by evolutionary conservation. The evolutionary branch between the primates and mouse, for example, occurred 70–90 million years ago. So computer comparisons of gene sequences that identify

conserved non-coding sequences will be an indication of their importance in duties such as gene regulation. Other genomes have been sequenced with the same intention of aiding conservation-guided methods, for example the pufferfish genome. However, regulatory sequences disappear and re-evolve during evolution at a high rate. As of 2012, the efforts have shifted toward finding interactions between DNA and regulatory proteins by the technique ChIP-Seq, or gaps where the DNA is not packaged by histones (DNase hypersensitive sites), both of which tell where there are active regulatory sequences in the investigated cell type.

Repetitive DNA sequences

Repetitive DNA sequences

comprise approximately 50% of the human genome. About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies (e.g. "CAGCAGCAG..."). The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for

genealogical DNA testing and forensic DNA analysis. Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)_n) are termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of particular importance, as they sometimes occur within coding regions of genes for proteins and may lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)_n within the ''Huntingtin'' gene on human chromosome 4.

Telomeres

(the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)_n. Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides long) are termed

minisatellites.
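The idea of a microsatellite as a short motif repeated in adjacent copies can be illustrated with a small Python sketch. It is a toy scanner rather than a validated repeat-finding tool: the motif range of 2 to 9 nucleotides follows the "fewer than ten nucleotides" wording above, while the minimum of three adjacent copies, the function name `find_tandem_repeats`, and the example sequence are assumptions made for illustration.

```python
import re

# A motif of 2-9 nucleotides (tried shortest first) followed by at least two
# more adjacent copies of itself, i.e. three or more copies in total.
# The three-copy minimum is an illustrative assumption, not a standard cutoff.
TANDEM = re.compile(r"([ACGT]{2,9}?)\1{2,}")


def find_tandem_repeats(seq: str):
    """Return (start, motif, copies) for each non-overlapping tandem repeat."""
    hits = []
    for match in TANDEM.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Hypothetical sequence: a (CAG)_5 run and a (TTAGGG)_3 run with short spacers.
example = "GGATC" + "CAG" * 5 + "TTC" + "TTAGGG" * 3 + "A"
print(find_tandem_repeats(example))
# [(5, 'CAG', 5), (23, 'TTAGGG', 3)]
```

Because the pattern requires a motif of at least two nucleotides, a long single-base run such as AAAAAA is reported as copies of "AA"; a real analysis would handle homopolymers and imperfect repeats separately.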

Mobile genetic elements (transposons) and their relics

Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, ''Alu'', has about 50,000 active copies, and can be inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active copies per genome (the number varies between people). Together with non-functional relics of old transposons, they account for over half of total human DNA. Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent

endogenous retroviruses

, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations. Mobile elements within the human genome can be classified into

LTR retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu elements, LINEs (20.4% of total genome), SVAs (SINE-VNTR-Alu), and Class II DNA transposons (2.9% of total genome).

Genomic variation in humans

Human reference genome

With the exception of identical twins, all humans show significant variation in genomic DNA sequences. The human reference genome (HRG) is used as a standard sequence reference. There are several important points concerning the human reference genome:

* The HRG is a haploid sequence. Each chromosome is represented once.
* The HRG is a composite sequence, and does not correspond to any actual human individual.
* The HRG is periodically updated to correct errors, ambiguities, and unknown "gaps".
* The HRG in no way represents an "ideal" or "perfect" human individual. It is simply a standardized representation or model that is used for comparative purposes.

The Genome Reference Consortium is responsible for updating the HRG. Version 38 was released in December 2013.

Measuring human genetic variation

Most studies of human genetic variation have focused on

single-nucleotide polymorphisms (SNPs), which are substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur about once every 1,000 base pairs, on average, in the euchromatic human genome, although they do not occur at a uniform density. Thus follows the popular statement that "we are all, regardless of

race

, genetically 99.9% the same", although this would be somewhat qualified by most geneticists. For example, a much larger fraction of the genome is now thought to be involved in

copy number variation

. A large-scale collaborative effort to catalog SNP variations in the human genome is being undertaken by the

International HapMap Project

. The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, which is the basis of

DNA fingerprinting

and DNA paternity testing technologies. The heterochromatic portions of the human genome, which total several hundred million base pairs, are also thought to be quite variable within the human population (they are so repetitive and so long that they cannot be accurately sequenced with current technology). These regions contain few genes, and it is unclear whether any significant

phenotypic

effect results from typical variation in repeats or heterochromatin. Most gross genomic mutations in

germ cells probably result in inviable embryos; however, a number of human diseases are related to large-scale genomic abnormalities.

Down syndrome, Turner syndrome

, and a number of other diseases result from

nondisjunction

of entire chromosomes.

Cancer

cells frequently have

aneuploidy

of chromosomes and chromosome arms, although a cause and effect relationship between aneuploidy and cancer has not been established.
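The SNP density quoted earlier in this section (about one SNP per 1,000 base pairs) is what lies behind the "99.9% the same" statement. The short Python sketch below makes that arithmetic explicit. The genome length used is a round, illustrative figure rather than a value stated in the text, and the function names are hypothetical; the calculation also ignores copy number and structural variation, which is one reason geneticists qualify the 99.9% figure.

```python
def percent_identity(bp_per_snp: int = 1000) -> float:
    """Percent of positions expected to match if one base in bp_per_snp differs."""
    return 100.0 * (1.0 - 1.0 / bp_per_snp)


def expected_snp_count(genome_bp: int, bp_per_snp: int = 1000) -> int:
    """Approximate number of SNP differences over genome_bp base pairs."""
    return genome_bp // bp_per_snp


# Assumed round figure for illustration only (not taken from the text).
ILLUSTRATIVE_GENOME_BP = 3_000_000_000

print(f"{percent_identity():.1f}% identical")      # 99.9% identical
print(expected_snp_count(ILLUSTRATIVE_GENOME_BP))  # 3000000
```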

Mapping human genomic variation

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome. An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence variation." It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or bases. Researchers published the first sequence-based map of large-scale structural variation across the human genome in the journal ''Nature'' in May 2008. Large-scale structural variations are differences in the genome among people that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome sequence and others appear as re-arrangements of stretches of sequence. These variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions.

Structural variation

Structural variation refers to genetic variants that affect larger segments of the human genome, as opposed to point

mutations. Often, structural variants (SVs) are defined as variants of 50 base pairs (bp) or greater, such as deletions, duplications, insertions, inversions and other rearrangements. About 90% of structural variants are noncoding deletions, but most individuals have more than a thousand such deletions; the size of deletions ranges from dozens of base pairs to tens of thousands of bp. On average, individuals carry ~3 rare structural variants that alter coding regions, e.g. delete exons. About 2% of individuals carry ultra-rare megabase-scale structural variants, especially rearrangements. That is, millions of base pairs may be inverted within a chromosome; ultra-rare means that they are only found in individuals or their family members and thus have arisen very recently.

SNP frequency across the human genome

Single-nucleotide polymorphisms (SNPs) do not occur homogeneously across the human genome. In fact, there is enormous diversity in SNP frequency between genes, reflecting different selective pressures on each gene as well as different mutation and recombination rates across the genome. However, because studies on SNPs are biased towards coding regions, the data generated from them are unlikely to reflect the overall distribution of SNPs throughout the genome. Therefore, the SNP Consortium protocol was designed to identify SNPs with no bias towards coding regions, and the Consortium's 100,000 SNPs generally reflect sequence diversity across the human chromosomes. The SNP Consortium aimed to expand the number of SNPs identified across the genome to 300,000 by the end of the first quarter of 2001.

Changes in non-coding sequence and synonymous changes in coding sequence are generally more common than non-synonymous changes, reflecting greater selective pressure reducing diversity at positions dictating amino acid identity. Transitional changes are more common than transversions, with CpG dinucleotides showing the highest mutation rate, presumably due to deamination.
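The transition/transversion distinction used above can be stated compactly in code. The following Python sketch classifies a single-base substitution: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and every other substitution is a transversion. The function name `classify_substitution` is illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base DNA substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("C", "T"))  # transition
print(classify_substitution("A", "C"))  # transversion

# Of the 12 possible substitutions, 4 are transitions and 8 are transversions.
counts = {"transition": 0, "transversion": 0}
for ref in sorted(BASES):
    for alt in sorted(BASES - {ref}):
        counts[classify_substitution(ref, alt)] += 1
print(counts)  # {'transition': 4, 'transversion': 8}
```

Since only a third of the possible substitutions are transitions, their being more common in observed data indicates a real mutational bias and not a counting effect.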

Personal genomes

A personal genome sequence is a (nearly) complete

sequence

of the chemical base pairs that make up the DNA of a single person. Because medical treatments have different effects on different people due to genetic variations such as

single-nucleotide polymorphisms

(SNPs), the analysis of personal genomes may lead to personalized medical treatment based on individual genotypes. The first personal genome sequence to be determined was that of

Craig Venter

in 2007. Personal genomes had not been sequenced in the public Human Genome Project to protect the identity of volunteers who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse population. However, early in the Venter-led

Celera Genomics

genome sequencing effort the decision was made to switch from sequencing a composite sample to using DNA from a single individual, later revealed to have been Venter himself. Thus the Celera human genome sequence released in 2000 was largely that of one man. Subsequent replacement of the early composite-derived data and determination of the diploid sequence, representing both sets of chromosomes, rather than the haploid sequence originally reported, allowed the release of the first personal genome. In April 2008, that of James Watson was also completed. In 2009, Stephen Quake published his own genome sequence derived from a sequencer of his own design, the Heliscope. A Stanford team led by Euan Ashley published a framework for the medical interpretation of human genomes implemented on Quake's genome and made whole genome-informed medical decisions for the first time. That team further extended the approach to the West family, the first family sequenced as part of Illumina's Personal Genome Sequencing program. Since then hundreds of personal genome sequences have been released, including those of Desmond Tutu, and of a Paleo-Eskimo. In 2012, the whole genome sequences of two family trios among 1092 genomes were made public. In November 2013, a Spanish family made four personal exome datasets (about 1% of the genome) publicly available under a Creative Commons public domain license. The

Personal Genome Project

(started in 2005) is among the few to make both genome sequences and corresponding medical phenotypes publicly available. The sequencing of individual genomes further unveiled levels of genetic complexity that had not been appreciated before. Personal genomics helped reveal the significant level of diversity in the human genome attributed not only to SNPs but structural variations as well. However, the application of such knowledge to the treatment of disease and in the medical field is only in its very beginnings.

Exome sequencing

has become increasingly popular as a tool to aid in diagnosis of genetic disease because the exome contributes only 1% of the genomic sequence but accounts for roughly 85% of mutations that contribute significantly to disease.

Human knockouts

In humans, gene knockouts naturally occur as

heterozygous or homozygous

loss-of-function gene knockouts. These knockouts are often difficult to distinguish, especially within heterogeneous genetic backgrounds. They are also difficult to find because they occur at low frequencies.

Populations with high rates of

consanguinity

, such as countries with high rates of first-cousin marriages, display the highest frequencies of homozygous gene knockouts. Such populations include those of Pakistan and Iceland, as well as Amish communities. These populations, with their high level of parental relatedness, have been the subject of human knockout research, which has helped to determine the function of specific genes in humans. By distinguishing specific knockouts, researchers can use phenotypic analyses of these individuals to help characterize the gene that has been knocked out.

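The link between parental relatedness and homozygous knockouts can be illustrated with the standard population-genetics expression for homozygote frequency under inbreeding, q² + F·q·(1 − q), where q is the frequency of the loss-of-function allele and F is the inbreeding coefficient (1/16 for the offspring of first cousins). The sketch below is illustrative only: the allele frequency is a hypothetical value chosen for the example, not a figure from any study mentioned here.

```python
from fractions import Fraction

def homozygote_frequency(q, f):
    """Expected frequency of individuals homozygous for an allele of
    frequency q when the inbreeding coefficient is f."""
    return q * q + f * q * (1 - q)

# Hypothetical frequency of a rare loss-of-function allele (assumption).
q = Fraction(1, 1000)

# Random mating (F = 0) versus offspring of first cousins (F = 1/16).
outbred = homozygote_frequency(q, Fraction(0))
first_cousins = homozygote_frequency(q, Fraction(1, 16))

print(float(outbred))                  # 1e-06
print(float(first_cousins))            # 6.34375e-05
print(float(first_cousins / outbred))  # 63.4375
```

Under these assumptions, homozygous knockouts of a rare allele are roughly 63 times more frequent among the offspring of first cousins than under random mating, and the enrichment grows as the allele becomes rarer.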

Knockouts in specific genes can cause genetic diseases, potentially have beneficial effects, or even result in no phenotypic effect at all. However, determining a knockout's phenotypic effect in humans can be challenging. Challenges to characterizing and clinically interpreting knockouts include difficulty in calling DNA variants, determining whether protein function is disrupted (annotation), and considering the amount of influence

mosaicism

has on the phenotype. One major study that investigated human knockouts is the Pakistan Risk of Myocardial Infarction study. It found that individuals possessing a heterozygous loss-of-function gene knockout for the APOC3 gene had lower triglycerides in the blood after consuming a high-fat meal than individuals without the mutation. Individuals possessing homozygous loss-of-function gene knockouts of the APOC3 gene displayed the lowest level of triglycerides in the blood after the fat load test, as they produce no functional APOC3 protein.

Human genetic disorders

Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene and is the most common recessive disorder in Caucasian populations, with over 1,300 different mutations known. Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare; genetic disorders are therefore individually rare as well. However, since there are many genes that can vary to cause genetic disorders, in aggregate they constitute a significant component of known medical conditions, especially in pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has been identified. Currently there are approximately 2,200 such disorders annotated in the

OMIM

database. Studies of genetic disorders are often performed by means of family-based studies. In some instances, population-based approaches are employed, particularly in the case of so-called founder populations such as those in Finland, French Canada, Utah, and Sardinia. Diagnosis and treatment of genetic disorders are usually performed by a

geneticist

-physician trained in clinical/medical genetics. The results of the Human Genome Project

are likely to provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. Parents can be screened for hereditary conditions and counselled on the consequences, the probability of inheritance, and how to avoid or ameliorate them in their offspring. There are many different kinds of DNA sequence variation, ranging from complete extra or missing chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic variation in human populations is phenotypically neutral, i.e., has little or no detectable effect on the physiology of the individual (although there may be fractional differences in fitness defined over evolutionary time frames). Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the clinical disease under investigation. Such studies constitute the realm of human molecular genetics. With the advent of the Human Genome and International HapMap Project

, it has become feasible to explore subtle genetic influences on many common disease conditions such as diabetes, asthma, migraine, and schizophrenia. Although some causal links have been made between genomic sequence variants in particular genes and some of these diseases, often with much publicity in the general media, these are usually not considered to be genetic disorders ''per se'' as their causes are complex, involving many different genetic and environmental factors. Thus there may be disagreement in particular cases as to whether a specific medical condition should be termed a genetic disorder. Other genetic disorders worth mentioning are

Kallmann syndrome

and

Pfeiffer syndrome

(gene FGFR1), Fuchs corneal dystrophy (gene TCF4),

Hirschsprung's disease

(genes RET and FECH), Bardet-Biedl syndrome 1 (genes CCDC28B and BBS1), Bardet-Biedl syndrome 10 (gene BBS10), and facioscapulohumeral muscular dystrophy type 2 (genes D4Z4 and SMCHD1). Genome sequencing is now able to narrow the genome down to specific locations to more accurately find mutations that will result in a genetic disorder.

Copy number variants

(CNVs) and single nucleotide variants (SNVs) can also be detected at the same time as genome sequencing, using newer sequencing procedures known as Next Generation Sequencing (NGS). This approach analyzes only a small portion of the genome, around 1–2%. The results of this sequencing can be used for clinical diagnosis of a genetic condition, including

Usher syndrome

, retinal disease, hearing impairments, diabetes, epilepsy, Leigh disease, hereditary cancers, neuromuscular diseases, primary immunodeficiencies,

severe combined immunodeficiency

(SCID), and diseases of the mitochondria. NGS can also be used to identify carriers of diseases before conception. The diseases that can be detected in this sequencing include Tay-Sachs disease,

Bloom syndrome, Gaucher disease, Canavan disease, familial dysautonomia

, cystic fibrosis, spinal muscular atrophy, and fragile-X syndrome. Next Generation Sequencing can be narrowed down to look specifically for diseases more prevalent in certain ethnic populations.

Evolution

Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been conserved by evolution since the divergence of extant lineages approximately 200 million years ago, containing the vast majority of genes. The published

chimpanzee genome differs from the human genome by 1.23% in direct sequence comparisons. Around 20% of this figure is accounted for by variation within each species, leaving only ~1.06% consistent sequence divergence between humans and chimps at shared genes. This nucleotide-by-nucleotide difference is dwarfed, however, by the portion of each genome that is not shared, including around 6% of functional genes that are unique to either humans or chimps. In other words, the considerable observable differences between humans and chimps may be due as much or more to genome-level variation in the number, function and expression of genes as to DNA sequence changes in shared genes. Indeed, even within humans, there has been found to be a previously unappreciated amount of copy number variation (CNV), which can make up as much as 5–15% of the human genome. In other words, between humans, there could be ±500,000,000 base pairs of DNA, some being active genes, others inactivated, or active at different levels. The full significance of this finding remains to be seen. On average, a typical human protein-coding gene differs from its chimpanzee

ortholog

by only two

amino acid

substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee orthologs. A major difference between the two genomes is human

chromosome 2

, which is equivalent to a fusion product of chimpanzee chromosomes 12 and 13 (later renamed chromosomes 2A and 2B, respectively). Humans have undergone an extraordinary loss of

olfactory receptor genes during our recent evolution, which explains our relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that the emergence of color vision in humans and several other

primate

species has diminished the need for the sense of smell. In September 2016, scientists reported that, based on human DNA genetic studies, all non-Africans in the world today can be traced to a single population that exited Africa between 50,000 and 80,000 years ago.
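As a rough check on the copy number variation figure quoted above, the span of sequence covered by a 5–15% range can be computed directly. The haploid genome length of about 3.1 billion base pairs used here is an assumed round value, not a number taken from this section.

```python
# Assumed approximate length of the haploid human genome, in base pairs.
GENOME_BP = 3_100_000_000

def cnv_span_bp(percent, genome_bp=GENOME_BP):
    """Base pairs covered when `percent` of the genome is copy-number variable."""
    return genome_bp * percent // 100

low = cnv_span_bp(5)
high = cnv_span_bp(15)

print(f"{low:,} to {high:,} bp")  # 155,000,000 to 465,000,000 bp
```

With this assumed genome length, the upper end of the range comes to about 465 million base pairs, the same order of magnitude as the 500,000,000 figure given in the text.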

Mitochondrial DNA

The human

mitochondrial DNA is of tremendous interest to geneticists, since it undoubtedly plays a role in

mitochondrial disease

. It also sheds light on human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of a recent common ancestor for all humans on the maternal line of descent (see

Mitochondrial Eve

). Due to the lack of a system for checking for copying errors, mitochondrial DNA (mtDNA) has a more rapid rate of variation than nuclear DNA. This 20-fold higher mutation rate allows mtDNA to be used for more accurate tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, such as the migration of Native Americans from

Siberia or Polynesians from southeastern

Asia

. It has also been used to show that there is no trace of

Neanderthal

DNA in the European gene mixture inherited through purely maternal lineage. Due to the restrictive all-or-none manner of mtDNA inheritance, this result (no trace of Neanderthal mtDNA) would be likely unless there were a large percentage of Neanderthal ancestry, or there were strong positive selection for that mtDNA. For example, going back 5 generations, only 1 of a person's 32 ancestors contributed to that person's mtDNA, so if one of these 32 was pure Neanderthal, an expected ~3% of that person's autosomal DNA would be of Neanderthal origin, yet they would have a ~97% chance of having no trace of Neanderthal mtDNA.
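The arithmetic behind this example can be reproduced in a few lines. The sketch assumes, as the text implicitly does, that the single Neanderthal ancestor is equally likely to occupy any of the ancestral positions in that generation; the function names are illustrative only.

```python
from fractions import Fraction

def ancestors(generations):
    """Number of ancestors a person has a given number of generations back."""
    return 2 ** generations

def expected_autosomal_share(generations):
    """Expected fraction of autosomal DNA inherited from one ancestor at that depth."""
    return Fraction(1, ancestors(generations))

def chance_not_mtdna_source(generations):
    """Probability that one ancestor at that depth, placed at random, is not
    the single ancestor on the purely maternal line."""
    return 1 - Fraction(1, ancestors(generations))

g = 5
print(ancestors(g))                         # 32
print(float(expected_autosomal_share(g)))   # 0.03125
print(float(chance_not_mtdna_source(g)))    # 0.96875
```

The two results, 1/32 and 31/32, correspond to the ~3% and ~97% quoted in the text.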

Epigenome

Epigenetics describes a variety of features of the human genome that transcend its primary DNA sequence, such as

chromatin

packaging,

histone modifications and DNA methylation, which are important in regulating gene expression, genome replication and other cellular processes. Epigenetic markers strengthen or weaken transcription of certain genes but do not affect the actual sequence of DNA nucleotides. DNA methylation is a major form of epigenetic control over gene expression and one of the most highly studied topics in epigenetics. During development, the human DNA methylation profile experiences dramatic changes. In early germ line cells, the genome has very low methylation levels. These low levels generally characterize active genes. As development progresses, parental imprinting tags lead to increased methylation activity. Epigenetic patterns can be identified between tissues within an individual as well as between individuals themselves. Identical genes that have differences only in their epigenetic state are called epialleles. Epialleles can be placed into three categories: those directly determined by an individual's genotype, those influenced by genotype, and those entirely independent of genotype. The epigenome is also influenced significantly by environmental factors. Diet, toxins, and hormones impact the epigenetic state. Studies in dietary manipulation have demonstrated that methyl-deficient diets are associated with hypomethylation of the epigenome. Such studies establish epigenetics as an important interface between the environment and the genome.


External links

Annotated (version 110) genome viewer of T2T-CHM13 v2.0

Complete human genome T2T-CHM13 v2.0 (no gaps)

Ensembl: the Ensembl Genome Browser Project

National Library of Medicine Genome Data Viewer (GDV)

UCSC Genome Browser using T2T-CHM13 v2.0

Uniprot: per chromosome gene list

Human Genome Project

The National Human Genome Research Institute

Simple Human Genome viewer