
A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. Because they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, the most recent human reference genome (assembly ''GRCh38/hg38'') is derived from >60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or UCSC Genome Browser.


Properties of reference genomes


Measures of length

The length of a genome can be measured in multiple different ways. A simple way to measure genome length is to count the number of base pairs in the assembly. The ''golden path'' is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions. It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a 'best estimate' of what the genome will look like and typically includes gaps, making it longer than the typical base pair assembly.
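
As a rough illustration of why gaps matter when measuring length, the sketch below counts both the total number of positions and the number of ungapped bases in a FASTA-formatted assembly. It assumes that gaps are written as runs of `N`, which is a common convention but not something stated above; the function name `assembly_lengths` and the toy sequence are made up for the example, and this is not the procedure used to build a golden path.

```python
from io import StringIO


def assembly_lengths(fasta_handle):
    """Return (total_positions, ungapped_bases) for a FASTA text stream.

    Assumption: gap positions are encoded as 'N' (or 'n') characters.
    """
    total = 0
    gaps = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines (they start with '>').
        if not line or line.startswith(">"):
            continue
        total += len(line)
        gaps += line.upper().count("N")
    return total, total - gaps


# Toy example: one sequence of 14 positions, 4 of which are gap placeholders.
example = StringIO(">chr_demo\nACGTNNNNAC\nGTAC\n")
total, ungapped = assembly_lengths(example)
print(f"total positions: {total}, ungapped bases: {ungapped}")
```

The two numbers differ only by the gap placeholders, which is one reason a length that includes gaps comes out larger than a plain count of sequenced bases.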


Contigs and scaffolds

Assembling a reference genome requires overlapping reads to create contigs, which are contiguous DNA regions of consensus sequences. If there are gaps between contigs, these can be filled by scaffolding, either by amplifying contigs with PCR and sequencing them or by Bacterial Artificial Chromosome (BAC) cloning. Filling these gaps is not always possible; in that case, multiple scaffolds are created in a reference assembly. Scaffolds are classified into three types: 1) Placed, whose chromosome, genomic coordinates and orientation are known; 2) Unlocalised, when only the chromosome is known but not the coordinates or orientation; 3) Unplaced, whose chromosome is not known. The number of contigs and scaffolds, as well as their average lengths, are relevant parameters, among many others, for assessing the quality of a reference genome assembly, since they provide information about the continuity of the final mapping from the original genome. The fewer scaffolds per chromosome, down to a single scaffold spanning an entire chromosome, the greater the continuity of the genome assembly. Other related parameters are N50 and L50. N50 is the contig/scaffold length such that 50% of the assembly is contained in fragments of this length or greater, while L50 is the number of contigs/scaffolds whose length is N50 or greater. A higher N50 goes with a lower L50, and vice versa; a high N50 together with a low L50 indicates high continuity in the assembly.
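
The N50 and L50 definitions can be made concrete with a short calculation. The sketch below sorts fragment lengths from longest to shortest and accumulates them until half of the total assembly length is covered; the length of the fragment that crosses that threshold is N50, and the number of fragments used so far is L50. The function name `n50_l50` and the example lengths are hypothetical, chosen only for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one fragment length is required")
    ordered = sorted(lengths, reverse=True)  # longest fragments first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first fragment that brings the running total to half of the
        # assembly defines N50; the number of fragments so far is L50.
        if running >= half:
            return length, count


# A fairly contiguous toy assembly: total length 300, half is 150.
contiguous = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(contiguous))  # (70, 2)

# The same total length split into 30 equal pieces is far more fragmented.
fragmented = [10] * 30
print(n50_l50(fragmented))  # (10, 15)
```

Both toy assemblies have the same total length, but the first reaches the halfway point with two long fragments, while the second needs fifteen short ones, giving a lower N50 and a higher L50.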


Mammalian genomes

The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC), a group of fewer than 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute, the National Center for Biotechnology Information, the Sanger Institute and McDonnell Genome Institute at Washington University in St. Louis. GRC continues to improve reference genomes by building new alignments that contain fewer gaps, and fixing misrepresentations in the sequence.


Human reference genome

The original human reference genome was derived from thirteen anonymous volunteers from Buffalo, New York. Donors were recruited by advertisement in ''The Buffalo News'', on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's genetic counselors and donate blood from which DNA was extracted. As a result of how the DNA samples were processed, about 80 percent of the reference genome came from eight people, and one male, designated ''RP11'', accounts for 66 percent of the total. The ABO blood group system differs among humans, but the human reference genome contains only an O allele, although the others are annotated. As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. In several cases, people such as James D. Watson had their genome assembled using massively parallel DNA sequencing. Comparison between the reference (assembly NCBI36/hg18) and Watson's genome revealed 3.3 million single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all. For regions where there is known to be large-scale variation, sets of alternate loci are assembled alongside the reference locus. The latest human reference genome assembly, released by the
Genome Reference Consortium, was GRCh38 in 2013. Several patches have been added to update it, the latest being patch GRCh38.p14, published in March 2022. This build has only 349 gaps across the whole assembly, a great improvement over the first version, which had roughly 150,000 gaps. The remaining gaps lie mostly in telomeres, centromeres and long repetitive sequences; the biggest gap lies along the long arm of the Y chromosome, a region ~30 Mb long (~52% of the Y chromosome length). The number of genomic clone libraries contributing to the reference has increased steadily to >60 over the years, although individual ''RP11'' still accounts for 70% of the genome. Genomic analysis of this anonymous male suggests that he is of African-European ancestry. In 2022, the Telomere-to-Telomere (T2T) Consortium published the first completely assembled reference genome (version T2T-CHM13), without any gaps in the assembly. On the other hand, according to the GRC website, their next assembly release for the human genome (version GRCh39) is currently "indefinitely postponed". Recent genome assemblies are as follows:


Limitations

For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals. Because the reference genome is a "single" distinct sequence, which is what makes it useful as an index or locator of genomic features, it is limited in how faithfully it represents the human genome and its variability. Moreover, most of the samples used for reference genome sequencing come from people of European ancestry, as these human populations have been the best characterized and studied, at the expense of non-European populations. In 2010, it was found that, when genomes from African and Asian populations were assembled ''de novo'' and compared with the NCBI reference genome (version NCBI36), they contained ~5 Mb of sequence that did not align to any region of the reference genome. Projects following the Human Genome Project seek to provide a deeper and more diverse characterization of human genetic variability, which the reference genome is not able to represent. The
HapMap Project, active from 2002 to 2010, had the purpose of creating a map of haplotypes and their most common variations among different human populations. Up to 11 populations of different ancestry were studied, such as individuals of the Han ethnic group from China, Gujaratis from India, the Yoruba people from Nigeria or Japanese people, among others. The 1000 Genomes Project, carried out between 2008 and 2015, aimed to create a database that includes more than 95% of the variations present in the human genome and whose results can be used in genome-wide association studies (GWAS) of diseases such as diabetes, cardiovascular or autoimmune diseases. A total of 26 ethnic groups were studied in this project, expanding the scope of the HapMap project to new ethnic groups such as the Mende people of Sierra Leone, the Vietnamese people or the Bengali people. The Human Pangenome Project, which started its initial phase in 2019 with the creation of the Human Pangenome Reference Consortium, seeks to create the largest map of human genetic variability taking the results of previous studies as a starting point.


Mouse reference genome

Recent mouse genome assemblies are as follows:


Other genomes

Since the Human Genome Project was finished, multiple international projects have started, focused on assembling reference genomes for many organisms. Model organisms (e.g., zebrafish (''Danio rerio''), chicken (''Gallus gallus''), ''Escherichia coli'' etc.) are of special interest to the scientific community, as are, for example, endangered species (e.g., Asian arowana (''Scleropages formosus'') or the American bison (''Bison bison'')). As of August 2022, the NCBI database supports 71 886 partially or completely sequenced and assembled genomes from different species, such as 676 mammals, 590 birds and 865 fishes. Also noteworthy are the numbers of genomes for insects (1796), fungi (3747), plants (1025), bacteria (33 724), viruses (26 004) and archaea (2040). Many of these species have annotation data associated with their reference genomes that can be publicly accessed and visualized in genome browsers such as
Ensembl and UCSC Genome Browser. Some examples of these international projects are: the Chimpanzee Genome Project, carried out between 2005 and 2013 jointly by the Broad Institute and the McDonnell Genome Institute of Washington University in St. Louis, which generated the first reference genomes for 4 subspecies of ''Pan troglodytes''; the 100K Pathogen Genome Project, which started in 2012 with the main goal of creating a database of reference genomes for 100 000 pathogenic microorganisms for use in public health, outbreak detection, agriculture and environment; and the Earth BioGenome Project, which started in 2018 and aims to sequence and catalog the genomes of all the eukaryotic organisms on Earth to promote biodiversity conservation projects. Within this big-science project there are up to 50 smaller-scale affiliated projects such as the Africa BioGenome Project or the 1000 Fungal Genomes Project.




External links


Genome Reference Consortium