
A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. Because they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, one of the most recent human reference genomes, assembly ''GRCh38/hg38'', is derived from more than 60 genomic clone libraries.
There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or UCSC Genome Browser.
Properties of reference genomes
Measures of length
The length of a genome can be measured in multiple different ways.
A simple way to measure genome length is to count the number of base pairs in the assembly.
The ''golden path'' is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions. It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a 'best estimate' of what the genome will look like and typically includes gaps, making it longer than the typical base pair assembly.
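The effect of gaps on a length measurement can be illustrated with a toy calculation. The following is a minimal sketch, assuming that gaps are encoded as runs of the letter N (a common convention in sequence files); the function name `assembly_lengths` and the toy sequences are hypothetical and not taken from any real assembly.

```python
import re


def assembly_lengths(sequences):
    """Return (total_length, ungapped_length, gap_count) for a toy assembly.

    `sequences` maps a sequence name to its bases. Runs of 'N' are treated
    as gaps, which is an assumption of this sketch.
    """
    total = ungapped = gaps = 0
    for seq in sequences.values():
        seq = seq.upper()
        total += len(seq)                       # every position, gaps included
        ungapped += len(seq) - seq.count("N")   # sequenced bases only
        gaps += len(re.findall(r"N+", seq))     # each run of N counts as one gap
    return total, ungapped, gaps


toy = {"chr1": "ACGTNNNNACGT", "chr2": "GGGNNCCCNA"}
print(assembly_lengths(toy))  # (22, 15, 3)
```

In this toy case the length that includes gaps (22) exceeds the count of sequenced bases (15), which mirrors why a measure that includes gaps comes out longer than a plain count of sequenced base pairs.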
Contigs and scaffolds

Reference genome assembly requires overlapping reads, which are merged into contigs: contiguous DNA regions of consensus sequence. If there are gaps between contigs, these can be filled by scaffolding, either by amplifying contigs with PCR and sequencing them, or by Bacterial Artificial Chromosome (BAC) cloning.
Filling these gaps is not always possible; in that case, multiple scaffolds are created in a reference assembly. Scaffolds are classified into three types: 1) Placed, whose chromosome, genomic coordinates and orientation are known; 2) Unlocalised, when only the chromosome is known but not the coordinates or orientation; 3) Unplaced, whose chromosome is not known.
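The three scaffold types can be expressed as a small decision rule. This is only an illustrative sketch: the function `classify_scaffold` and its parameters are hypothetical names chosen here, not part of any assembly toolkit.

```python
from typing import Optional


def classify_scaffold(chromosome: Optional[str], located: bool) -> str:
    """Classify a scaffold as placed, unlocalised or unplaced.

    `chromosome` is the assigned chromosome name, or None if unknown.
    `located` is True when coordinates and orientation are known.
    Both parameters are assumptions made for this sketch.
    """
    if chromosome is None:
        return "unplaced"      # chromosome not known
    if located:
        return "placed"        # chromosome, coordinates and orientation known
    return "unlocalised"       # chromosome known, position unknown


print(classify_scaffold("chr7", True))    # placed
print(classify_scaffold("chr7", False))   # unlocalised
print(classify_scaffold(None, False))     # unplaced
```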
The number of contigs and scaffolds, as well as their average lengths, are relevant parameters, among many others, for assessing the quality of a reference genome assembly, since they provide information about the continuity of the final mapping from the original genome. The smaller the number of scaffolds per chromosome, until a single scaffold occupies an entire chromosome, the greater the continuity of the genome assembly. Other related parameters are N50 and L50. N50 is the contig/scaffold length such that 50% of the assembly is contained in contigs/scaffolds of this length or greater, while L50 is the smallest number of contigs/scaffolds whose combined length makes up at least 50% of the assembly. A higher N50 generally goes together with a lower L50, and both indicate high continuity in the assembly.
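As a worked illustration, N50 and L50 can be computed by sorting contig (or scaffold) lengths from longest to shortest and accumulating them until half of the total assembly length is reached. The sketch below uses this common convention, taking L50 as the smallest number of contigs that together cover at least half of the assembly; the function name `n50_l50` and the example lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the contig at which the running total, taken from
    the longest contig downwards, first reaches half of the assembly.
    L50 is how many contigs were needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```

In this example two contigs already cover half of the assembly, so L50 is 2 and N50 is the length of the second-longest contig, 70.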
Mammalian genomes
The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC), a group of fewer than 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute, the National Center for Biotechnology Information, the Sanger Institute and McDonnell Genome Institute at Washington University in St. Louis. GRC continues to improve reference genomes by building new alignments that contain fewer gaps, and fixing misrepresentations in the sequence.
Human reference genome
The original human reference genome was derived from thirteen anonymous volunteers from Buffalo, New York. Donors were recruited by advertisement in ''The Buffalo News'', on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's genetic counselors and donate blood from which DNA was extracted. As a result of how the DNA samples were processed, about 80 percent of the reference genome came from eight people, and one male, designated ''RP11'', accounts for 66 percent of the total. The ABO blood group system differs among humans, but the human reference genome contains only an O allele, although the others are annotated.

As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. In several cases, people such as James D. Watson had their genome assembled using massive parallel DNA sequencing. Comparison between the reference (assembly NCBI36/hg18) and Watson's genome revealed 3.3 million single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all. For regions where there is known to be large-scale variation, sets of alternate loci are assembled alongside the reference locus.
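The idea of counting single-base differences against a reference can be shown on a toy scale. The sketch below assumes two sequences that are already aligned and of equal length, and skips positions marked N; real variant calling works from read alignments and statistical models, so this is only an illustration. The function name `single_base_differences` and the sequences are hypothetical.

```python
def single_base_differences(reference, sample):
    """List (position, reference_base, sample_base) where two aligned
    sequences of equal length differ, ignoring positions marked 'N'."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and "N" not in (ref_base, sample_base)
    ]


print(single_base_differences("ACGTACGT", "ACCTACGA"))
# [(2, 'G', 'C'), (7, 'T', 'A')]
```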

The latest human reference genome assembly released by the Genome Reference Consortium is GRCh38, first released in 2013. Several patches were added to update it, the latest patch being GRCh38.p14, published on the 3rd of February 2022. This build has only 349 gaps across the entire assembly, a great improvement over the first version, which had roughly 150,000 gaps. The gaps are mostly in areas such as telomeres, centromeres, and long repetitive sequences, with the biggest gap along the long arm of the Y chromosome, a region of ~30 Mb in length (~52% of the Y chromosome's length). The number of genomic clone libraries contributing to the reference has increased steadily to more than 60 over the years, although individual ''RP11'' still accounts for 70% of the reference genome. Genomic analysis of this anonymous male suggests that he is of African-European ancestry. According to the GRC website, their next assembly release for the human genome (version GRCh39) is currently "indefinitely postponed".
In 2022, the Telomere-to-Telomere (T2T) Consortium, an open, community-based effort, published the first completely assembled reference genome (version T2T-CHM13), without any gaps in the assembly. It did not contain a Y-chromosome until version 2.0. This assembly allows for the examination of centromeric and pericentromeric sequence evolution. The consortium employed rigorous methods to assemble, clean, and validate complex repeat regions which are particularly difficult to sequence. It used ultra-long–read (>100 kb) sequencing to accurately sequence segmental duplications.
T2T-CHM13 is sequenced from CHM13hTERT, a cell line from an essentially haploid hydatidiform mole. "CHM" stands for "Complete Hydatidiform Mole," and "13" is its line number. "hTERT" stands for "human Telomerase Reverse Transcriptase". The cell line has been transfected with the TERT gene, which is responsible for maintaining telomere length and thus contributes to the cell line's immortality. A hydatidiform mole contains two copies of the same parental genome, and thus is essentially haploid. This eliminates allelic variation and allows better sequencing accuracy.
Recent genome assemblies are as follows:
Limitations
For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals.
Because the reference genome is a "single" distinct sequence, which is what makes it useful as an index or locator of genomic features, it is limited in how faithfully it represents the human genome and its variability. Most of the initial samples used for reference genome sequencing came from people of European ancestry. In 2010, ''de novo'' assembly of genomes from African and Asian populations, and their comparison with the NCBI reference genome (version NCBI36), showed that these genomes had ~5 Mb of sequence that did not align against any region of the reference genome.
Projects that followed the Human Genome Project seek a deeper and more diverse characterization of human genetic variability, which the reference genome is not able to represent. The HapMap Project, active from 2002 to 2010, aimed to create a map of haplotypes and their most common variations among different human populations. Up to 11 populations of different ancestry were studied, such as individuals of the Han ethnic group from China, Gujaratis from India, the Yoruba people from Nigeria or Japanese people, among others. The 1000 Genomes Project, carried out between 2008 and 2015, aimed to create a database that includes more than 95% of the variations present in the human genome and whose results can be used in genome-wide association studies (GWAS) of diseases such as diabetes, cardiovascular or autoimmune diseases. A total of 26 ethnic groups were studied in this project, expanding the scope of the HapMap project to new ethnic groups such as the Mende people of Sierra Leone, the Vietnamese people or the Bengali people. The Human Pangenome Project, which started its initial phase in 2019 with the creation of the Human Pangenome Reference Consortium, seeks to create the largest map of human genetic variability taking the results of previous studies as a starting point.
Mouse reference genome
Recent mouse genome assemblies are as follows:
Other genomes
Since the Human Genome Project was finished, multiple international projects have started, focused on assembling reference genomes for many organisms. Model organisms (e.g., zebrafish (''Danio rerio''), chicken (''Gallus gallus''), ''Escherichia coli'' etc.) are of special interest to the scientific community, as are, for example, endangered species (e.g., Asian arowana (''Scleropages formosus'') or the American bison (''Bison bison'')). As of August 2022, the NCBI database supports 71 886 partially or completely sequenced and assembled genomes from different species, such as 676 mammals, 590 birds and 865 fishes. Also noteworthy are the numbers of genomes for insects (1796), fungi (3747), plants (1025), bacteria (33 724), viruses (26 004) and archaea (2040). Many of these species have annotation data associated with their reference genomes that can be publicly accessed and visualized in genome browsers such as Ensembl and UCSC Genome Browser.
Some examples of these international projects are: the Chimpanzee Genome Project, carried out between 2005 and 2013 jointly by the Broad Institute and the McDonnell Genome Institute of Washington University in St. Louis, which generated the first reference genomes for 4 subspecies of ''Pan troglodytes''; the 100K Pathogen Genome Project, which started in 2012 with the main goal of creating a database of reference genomes for 100 000 pathogen microorganisms to use in public health, outbreak detection, agriculture and environment; and the Earth BioGenome Project, which started in 2018 and aims to sequence and catalog the genomes of all the eukaryotic organisms on Earth to promote biodiversity conservation projects. Within this big-science project there are up to 50 smaller-scale affiliated projects such as the Africa BioGenome Project or the 1000 Fungal Genomes Project.
See also
* European Reference Genome Atlas
External links
Genome Reference Consortium