The 1000 Genomes Project (abbreviated as 1KGP), launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal ''Nature''.
In 2012, the sequencing of 1092 genomes was announced in a ''Nature'' publication.
In 2015, two papers in ''Nature'' reported results and the completion of the project and opportunities for future research.
Many rare variations, restricted to closely related groups, were identified, and eight structural-variation classes were analyzed.
The project unites multidisciplinary research teams from institutes around the world, including China, Italy, Japan, Kenya, Nigeria, Peru, the United Kingdom, and the United States. Each will contribute to the enormous sequence dataset and to a refined human genome map, which will be freely accessible through public databases to the scientific community and the general public alike.
By providing an overview of all human genetic variation, the consortium will generate a valuable tool for all fields of biological science, especially in the disciplines of genetics, medicine, pharmacology, biochemistry, and bioinformatics.
[G Spencer, International Consortium Announces the 1000 Genomes Project, EMBARGOED (2008) http://www.nih.gov/news/health/jan2008/nhgri-22.htm]
Background
Since the completion of the Human Genome Project, advances in human population genetics and comparative genomics have made it possible to gain increasing insight into the nature of genetic diversity. However, we are just beginning to understand how processes like the random sampling of gametes, structural variations (insertions/deletions (indels), copy number variations (CNV), retroelements), single-nucleotide polymorphisms (SNPs), and natural selection have shaped the level and pattern of variation within species and also between species.
[JC Long, Human Genetic Variation: The mechanisms and results of microevolution, American Anthropological Association (2004)]
Human genetic variation
The random sampling of gametes during sexual reproduction leads to genetic drift (a random fluctuation in the population frequency of a trait) in subsequent generations and would result in the loss of all variation in the absence of external influence. It is postulated that the rate of genetic drift is inversely proportional to population size, and that it may be accelerated in specific situations such as bottlenecks, where the population size is reduced for a certain period of time, and by the founder effect (individuals in a population tracing back to a small number of founding individuals).
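The inverse relationship between drift and population size can be illustrated with a minimal Wright-Fisher sketch. The model, the function names and the population sizes below are illustrative assumptions rather than anything used by the project: each generation, 2N gametes are drawn at random from the previous generation, and expected heterozygosity shrinks by a factor of (1 - 1/(2N)) per generation.

```python
import random


def expected_heterozygosity(h0, pop_size, generations):
    """Expected heterozygosity after drift in an idealized diploid population.

    Standard Wright-Fisher result: H_t = H_0 * (1 - 1/(2N))**t.
    """
    return h0 * (1.0 - 1.0 / (2.0 * pop_size)) ** generations


def simulate_drift(p0, pop_size, generations, seed=0):
    """Simulate one allele-frequency trajectory under pure drift.

    Each generation, 2N gametes are sampled at random from the previous
    generation; there is no selection, mutation or migration.
    """
    rng = random.Random(seed)
    copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Count how many of the 2N sampled gametes carry the allele.
        carriers = sum(1 for _ in range(copies) if rng.random() < p)
        p = carriers / copies
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    # Hypothetical population sizes, chosen only for illustration.
    for n in (10, 100, 1000):
        h = expected_heterozygosity(0.5, n, 50)
        print(f"N={n:5d}: expected heterozygosity after 50 generations = {h:.3f}")
    # A bottleneck corresponds to running some generations at a small N.
    print(simulate_drift(0.5, 10, 20)[-5:])
```

Smaller values of `pop_size` lose heterozygosity faster, which is the sense in which a bottleneck or a small founding group accelerates drift.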
Anzai et al. demonstrated that indels account for 90.4% of all observed variations in the sequence of the major histocompatibility locus (MHC) between humans and chimpanzees. After taking multiple indels into consideration, the high degree of genomic similarity between the two species (98.6% nucleotide sequence identity) drops to only 86.7%. For example, a large deletion of 95 kilobases (kb) between the loci of the human ''MICA'' and ''MICB'' genes results in a single hybrid chimpanzee ''MIC'' gene, linking this region to a species-specific handling of several retroviral infections and the resultant susceptibility to various autoimmune diseases. The authors conclude that instead of more subtle SNPs, indels were the driving mechanism in primate speciation.
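Why identity drops once indels are counted can be shown with a small sketch. The function name and the toy alignment below are hypothetical and are not taken from Anzai et al.; the sketch only contrasts two common ways of scoring a pairwise alignment, one that skips gapped columns and one that counts them as differences.

```python
def sequence_identity(aligned_a, aligned_b, count_indels):
    """Fraction of identical columns in a pairwise alignment.

    Gaps are written as '-'. When count_indels is False, columns with a gap
    in either sequence are ignored; when True, they count as differences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # column carries no information
        has_gap = a == "-" or b == "-"
        if has_gap and not count_indels:
            continue
        compared += 1
        if not has_gap and a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return matches / compared


if __name__ == "__main__":
    # Toy alignment: one substitution and one 2-bp indel.
    seq_a = "ACGTACGT--AC"
    seq_b = "ACGAACGTTTAC"
    print(sequence_identity(seq_a, seq_b, count_indels=False))
    print(sequence_identity(seq_a, seq_b, count_indels=True))
```

With substitutions only, the toy pair differs at one of ten compared positions; counting the two gapped columns as differences lowers the identity further, which mirrors the direction of the 98.6% to 86.7% drop reported above.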
Besides mutations, SNPs and structural variants such as copy-number variants (CNVs) contribute to the genetic diversity in human populations. Using microarrays, almost 1,500 copy number variable regions, covering around 12% of the genome and containing hundreds of genes, disease loci, functional elements and segmental duplications, have been identified in the HapMap sample collection. Although the specific function of CNVs remains elusive, the fact that CNVs span more nucleotide content per genome than SNPs emphasizes the importance of CNVs in genetic diversity and evolution.
Investigating human genomic variations holds great potential for identifying genes that might underlie differences in disease resistance (e.g. MHC region) or drug metabolism.
Natural selection
Natural selection in the evolution of a trait can be divided into three classes. Directional or positive selection refers to a situation where a certain allele has a greater fitness than other alleles, consequently increasing its population frequency (e.g. antibiotic resistance of bacteria). In contrast, stabilizing or negative selection (also known as purifying selection) lowers the frequency of alleles, or even removes them from a population, due to disadvantages associated with them with respect to other alleles. Finally, a number of forms of balancing selection exist; these increase genetic variation within a species by being overdominant (heterozygous individuals are fitter than homozygous individuals, e.g. ''G6PD'', a gene that is involved in both hemolytic anaemia and malaria resistance) or can vary spatially within a species that inhabits different niches, thus favouring different alleles.
[EE Harris et al., The molecular signature of selection underlying human adaptations, Yearbook of Physical Anthropology 49: 89-130 (2006)] Some genomic differences may not affect fitness. Neutral variation, previously thought to be "junk" DNA, is unaffected by natural selection, resulting in higher genetic variation at such sites than at sites where variation does influence fitness.
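The three classes of selection can be compared with the textbook one-locus, two-allele model of viability selection. This is a sketch under stated assumptions: the fitness values are hypothetical and are not estimates for ''G6PD'' or any other gene, and the function names are chosen for illustration. With genotype fitnesses w_AA, w_Aa and w_aa and allele frequency p (q = 1 - p), the frequency in the next generation is p' = (p² w_AA + p q w_Aa) / (p² w_AA + 2 p q w_Aa + q² w_aa).

```python
def next_allele_frequency(p, w_AA, w_Aa, w_aa):
    """One generation of viability selection at a single biallelic locus."""
    q = 1.0 - p
    mean_fitness = p * p * w_AA + 2.0 * p * q * w_Aa + q * q * w_aa
    return (p * p * w_AA + p * q * w_Aa) / mean_fitness


def iterate_selection(p0, fitnesses, generations):
    """Apply the recursion repeatedly and return the final frequency of A."""
    p = p0
    for _ in range(generations):
        p = next_allele_frequency(p, *fitnesses)
    return p


if __name__ == "__main__":
    # Hypothetical fitness triples (w_AA, w_Aa, w_aa).
    scenarios = {
        "positive (A favoured)": ((1.00, 0.95, 0.90), 0.01),
        "negative (A disfavoured)": ((0.90, 0.95, 1.00), 0.50),
        "balancing (heterozygote fittest)": ((0.90, 1.00, 0.80), 0.10),
    }
    for name, (fitnesses, p0) in scenarios.items():
        p_final = iterate_selection(p0, fitnesses, 2000)
        print(f"{name}: p goes from {p0} to {p_final:.4f}")
```

In the overdominant case the allele settles at an intermediate frequency, (w_Aa - w_aa) / (2 w_Aa - w_AA - w_aa), so both alleles are retained; in the other two cases one allele approaches fixation or loss.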
It is not fully clear how natural selection has shaped population differences; however, genetic candidate regions under selection have been identified recently.
Patterns of DNA polymorphisms can be used to reliably detect signatures of selection and may help to identify genes that might underlie variation in disease resistance or drug metabolism.
Barreiro et al. found evidence that negative selection has reduced population differentiation at the amino acid-altering level (particularly in disease-related genes), whereas positive selection has ensured regional adaptation of human populations by increasing population differentiation in gene regions (mainly nonsynonymous and 5'-untranslated region variants).
It is thought that most complex and Mendelian diseases (except diseases with late onset, assuming that older individuals no longer contribute to the fitness of their offspring) will have an effect on survival and/or reproduction; thus, genetic factors underlying those diseases should be influenced by natural selection. However, diseases that have late onset today could have been childhood diseases in the past, as genes delaying disease progression could have undergone selection. Gaucher disease (mutations in the ''GBA'' gene), Crohn's disease (mutation of ''NOD2'') and familial hypertrophic cardiomyopathy (mutations in ''MYH7'', ''TNNT2'', ''TPM1'' and ''MYBPC3'') are all examples of negative selection. These disease mutations are primarily recessive and segregate as expected at a low frequency, supporting the hypothesized negative selection. There is evidence that the genetic basis of type 1 diabetes may have undergone positive selection.
Few cases have been reported where disease-causing mutations appear at the high frequencies supported by balancing selection. The most prominent example is mutations of the ''G6PD'' locus, which in the homozygous state result in G6PD enzyme deficiency and consequently hemolytic anaemia, but in the heterozygous state are partially protective against malaria. Other possible explanations for segregation of disease alleles at moderate or high frequencies include genetic drift, genetic hitch-hiking, and recent alterations towards positive selection due to environmental changes such as diet.
Genome-wide comparative analyses of different human populations, as well as between species (e.g. human versus chimpanzee), are helping us to understand the relationship between diseases and selection and provide evidence of mutations in constrained genes being disproportionally associated with heritable disease phenotypes. Genes implicated in complex disorders tend to be under less negative selection than Mendelian disease genes or non-disease genes.
Project description
Goals
There are two kinds of genetic variants related to disease. The first are rare genetic variants that have a severe effect predominantly on simple traits (e.g. cystic fibrosis, Huntington disease). The second, more common, genetic variants have a mild effect and are thought to be implicated in complex traits (e.g. cognition, diabetes, heart disease). Between these two types of genetic variants lies a significant gap of knowledge, which the 1000 Genomes Project is designed to address.
The primary goal of this project is to create a complete and detailed catalogue of human genetic variations, which in turn can be used for association studies relating genetic variation to disease. By doing so the consortium aims to discover >95% of the variants (e.g. SNPs, CNVs, indels) with minor allele frequencies as low as 1% across the genome and 0.1-0.5% in gene regions, as well as to estimate the population frequencies, haplotype backgrounds and linkage disequilibrium patterns of variant alleles.
[Meeting Report: A Workshop to Plan a Deep Catalog of Human Genetic Variation, (2007) http://www.1000genomes.org/sites/1000genomes.org/files/docs/1000Genomes-MeetingReport.pdf]
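The link between sample size and the allele-frequency targets above can be sketched with a simple sampling calculation. The function below is an illustrative assumption, not the consortium's power analysis: it gives the chance that an allele at population frequency f appears at least once among the 2n chromosomes of n unrelated diploid individuals, 1 - (1 - f)^(2n), and ignores sequencing error and low coverage, so it is an upper bound on what sequencing could actually detect.

```python
def prob_variant_in_sample(allele_freq, n_individuals):
    """Probability that an allele occurs at least once in a diploid sample.

    Assumes the 2n sampled chromosomes are independent draws from one
    population; detection failures due to coverage or error are ignored.
    """
    chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes


if __name__ == "__main__":
    # Frequencies taken from the stated goals; n = 1000 is the headline sample size.
    for freq in (0.01, 0.005, 0.001):
        p = prob_variant_in_sample(freq, 1000)
        print(f"allele frequency {freq:.3f}: present in sample with probability {p:.4f}")
```

Under these assumptions an allele at 1% frequency is almost certain to be present among 1,000 individuals, whereas an allele at 0.1% is present in the sample only about 86% of the time, before any detection losses.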
Secondary goals will include the support of better SNP and probe selection for genotyping platforms in future studies and the improvement of the human reference sequence. Furthermore, the completed database will be a useful tool for studying regions under selection, variation in multiple populations and understanding the underlying processes of mutation and recombination.
Outline
The human genome consists of approximately 3 billion DNA base pairs and is estimated to carry around 20,000 protein coding genes. In designing the study the consortium needed to address several critical issues regarding the project metrics such as technology challenges, data quality standards and sequence coverage.
Over the course of the next three years, scientists at the Sanger Institute, BGI Shenzhen and the National Human Genome Research Institute's Large-Scale Sequencing Network are planning to sequence a minimum of 1,000 human genomes. Due to the large amount of sequence data that need to be generated and analyzed it is possible that other participants may be recruited over time.
Almost 10 billion bases will be sequenced per day over the two-year production phase. This equates to more than two human genomes every 24 hours, a groundbreaking capacity. Challenging the leading experts of bioinformatics and statistical genetics, the sequence dataset will comprise 6 trillion DNA bases, 60-fold more sequence data than what has been published in DNA databases over the past 25 years.
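The throughput figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the text (roughly 3 billion base pairs per genome, almost 10 billion bases per day, 6 trillion bases in total, a 60-fold increase); the function names are illustrative, and "genome equivalents" here means raw bases divided by genome size, i.e. single-fold coverage.

```python
GENOME_SIZE_BP = 3e9    # approximate human genome size quoted in the text
BASES_PER_DAY = 10e9    # "almost 10 billion bases" per day
TOTAL_BASES = 6e12      # planned size of the sequence dataset
FOLD_INCREASE = 60      # relative to previously published sequence data


def genome_equivalents_per_day(bases_per_day, genome_size):
    """Raw daily output expressed as single-fold genome equivalents."""
    return bases_per_day / genome_size


def production_days(total_bases, bases_per_day):
    """Days needed to generate the full dataset at a constant daily rate."""
    return total_bases / bases_per_day


def implied_prior_data(total_bases, fold_increase):
    """Amount of previously published sequence implied by the fold increase."""
    return total_bases / fold_increase


if __name__ == "__main__":
    print(genome_equivalents_per_day(BASES_PER_DAY, GENOME_SIZE_BP))
    print(production_days(TOTAL_BASES, BASES_PER_DAY))
    print(implied_prior_data(TOTAL_BASES, FOLD_INCREASE))
```

At the quoted rate the daily output is a little over three single-fold genome equivalents, the full dataset takes about 600 days (within the two-year production phase), and the 60-fold comparison implies roughly 100 billion bases of previously published sequence.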
To determine the final design of the full project, three pilot studies were designed and will be carried out within the first year of the project. The first pilot intends to genotype 180 people from three major geographic groups at low coverage (2x). For the second pilot study, the genomes of two nuclear families (both parents and an adult child) are going to be sequenced with deep coverage (20x per genome). The third pilot study involves sequencing the coding regions (exons) of 1,000 genes in 1,000 people with deep coverage (20x).
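The difference between 2x and 20x coverage can be made concrete with a common idealization, stated here as an assumption rather than the consortium's own model: if reads land uniformly at random, the number of reads covering a given base is approximately Poisson-distributed with mean equal to the coverage. The function names and the choice of thresholds are illustrative.

```python
import math


def prob_at_least_k_reads(coverage, k):
    """P(a base is covered by at least k reads) under a Poisson model."""
    below_k = sum(
        math.exp(-coverage) * coverage ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - below_k


def raw_bases(n_individuals, coverage, genome_size=3e9):
    """Raw sequence needed to cover each individual's genome at a given depth."""
    return n_individuals * coverage * genome_size


if __name__ == "__main__":
    for depth in (2, 20):
        once = prob_at_least_k_reads(depth, 1)
        print(f"{depth}x: fraction of bases covered at least once = {once:.6f}")
    # Pilot 1: 180 people at 2x; pilot 2: two trios (6 people) at 20x.
    print(f"pilot 1 raw bases: {raw_bases(180, 2):.3g}")
    print(f"pilot 2 raw bases: {raw_bases(6, 20):.3g}")
```

Under this model about 86% of bases in a single 2x genome are covered by at least one read, while at 20x essentially every base is; low-coverage designs therefore rely on combining information across many individuals.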
It has been estimated that the project would likely cost more than $500 million if standard DNA sequencing technologies were used. Therefore, several new technologies (e.g. Solexa, 454, SOLiD) will be applied, lowering the expected costs to between $30 million and $50 million. The major support will be provided by the Wellcome Trust Sanger Institute in Hinxton, England; the Beijing Genomics Institute, Shenzhen (BGI Shenzhen), China; and the NHGRI, part of the National Institutes of Health (NIH).
In keeping with the Fort Lauderdale principles, all genome sequence data (including variant calls) is freely available as the project progresses and can be downloaded via FTP from the 1000 Genomes Project webpage.
Human genome samples
Based on the overall goals for the project, the samples will be chosen to provide power in populations where association studies for common diseases are being carried out. Furthermore, the samples do not need to have medical or phenotype information since the proposed catalogue will be a basic resource on human variation.
For the pilot studies, human genome samples from the HapMap collection will be sequenced. It will be useful to focus on samples that have additional data available (such as ENCODE sequence, genome-wide genotypes, fosmid-end sequence, structural variation assays, and gene expression) to be able to compare the results with those from other projects.
Complying with extensive ethical procedures, the 1000 Genomes Project will then use samples from volunteer donors. The following populations will be included in the study: Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo (JPT); Chinese in Beijing (CHB); Utah residents with ancestry from northern and western Europe (CEU); Luhya in Webuye, Kenya (LWK); Maasai in Kinyawa, Kenya (MKK); Toscani in Italy (TSI); Peruvians in Lima, Peru (PEL); Gujarati Indians in Houston (GIH); Chinese in metropolitan Denver (CHD); people of Mexican ancestry in Los Angeles (MXL); and people of African ancestry in the southwestern United States (ASW).
* Population that was collected in diaspora
Community meeting
Data generated by the 1000 Genomes Project is widely used by the genetics community, making the first 1000 Genomes Project paper one of the most cited papers in biology.
[C. King (2012) The Hottest Research of 2011. ''Science Watch'' http://archive.sciencewatch.com/newsletter/2012/201203/hottest_research_2012/] To support this user community, the project held a community analysis meeting in July 2012 that included talks highlighting key project discoveries, their impact on population genetics and human disease studies, and summaries of other large-scale sequencing studies.
[1000 Genomes Project Community Analysis Meeting http://1000gconference.sph.umich.edu/]
Project findings
Pilot phase
The pilot phase consisted of three projects:
* low-coverage whole-genome sequencing of 179 individuals from 4 populations
* high-coverage sequencing of 2 trios (mother-father-child)
* exon-targeted sequencing of 697 individuals from 7 populations
It was found that, on average, each person carries around 250–300 loss-of-function variants in annotated genes and 50–100 variants previously implicated in inherited disorders. Based on the two trios, it is estimated that the rate of de novo germline mutation is approximately 10⁻⁸ per base per generation.
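The per-base mutation rate can be turned into an expected number of new mutations per child by multiplying it by the number of bases inherited. The following is a back-of-the-envelope sketch only; the haploid genome size of about 3 billion bases and the function name `expected_de_novo` are assumptions introduced here, not figures or methods from the project.

```python
# Expected number of de novo germline mutations per child.
# Assumptions (not from the source): ~3e9 bases per haploid genome, and a
# diploid genome with one copy inherited from each parent.
MUTATION_RATE = 1e-8  # per base per generation, as estimated from the trios
HAPLOID_GENOME_SIZE = 3_000_000_000


def expected_de_novo(rate: float = MUTATION_RATE,
                     haploid_size: int = HAPLOID_GENOME_SIZE,
                     ploidy: int = 2) -> float:
    """Expected count of new mutations = rate x bases per copy x number of copies."""
    return rate * haploid_size * ploidy


print(round(expected_de_novo()))          # across both inherited copies
print(round(expected_de_novo(ploidy=1)))  # per single inherited copy
```

Under these assumptions a child would be expected to carry on the order of 60 new mutations, about 30 per inherited genome copy.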
See also
* Human Genome Project
* HapMap Project
* Personal genomics
* Population groups in biomedicine
* 1000 Plant Genomes Project
* List of biological databases
External links
* 1000 Genomes: A Deep Catalog of Human Genetic Variation (official web page)
* International HapMap Project (official web page)
* Human Genome Project Information