In genomics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.
When applied to human data, GWA studies compare the DNA of participants having varying phenotypes for a particular trait or disease. These participants may be people with a disease (cases) and similar people without the disease (controls), or they may be people with different phenotypes for a particular trait, for example blood pressure. This approach is known as phenotype-first, in which the participants are classified first by their clinical manifestation(s), as opposed to genotype-first. Each person gives a sample of DNA, from which millions of genetic variants are read using SNP arrays. If one type of the variant (one allele) is more frequent in people with the disease, the variant is said to be ''associated'' with the disease. The associated SNPs are then considered to mark a region of the human genome that may influence the risk of disease.
GWA studies investigate the entire genome, in contrast to methods that specifically test a small number of pre-specified genetic regions. Hence, GWAS is a ''non-candidate-driven'' approach, in contrast to ''gene-specific candidate-driven studies''. GWA studies identify SNPs and other variants in DNA associated with a disease, but they cannot on their own specify which genes are causal.
The first successful GWAS, published in 2002, studied myocardial infarction. This study design was then implemented in the landmark 2005 GWA study investigating patients with age-related macular degeneration, which found two SNPs with significantly altered allele frequency compared to healthy controls.
Over 3,000 human GWA studies have examined over 1,800 diseases and traits, and thousands of SNP associations have been found. Except in the case of rare genetic diseases, these associations are very weak, but while they may not explain much of the risk, they provide insight into genes and pathways that can be important.
Background
Any two human genomes differ in millions of different ways. There are small variations in the individual nucleotides of the genomes (SNPs) as well as many larger variations, such as deletions, insertions and copy number variations. Any of these may cause alterations in an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as height.
Around the year 2000, prior to the introduction of GWA studies, the primary method of investigation was through inheritance studies of genetic linkage in families. This approach had proven highly useful for single gene disorders. However, for common and complex diseases the results of genetic linkage studies proved hard to reproduce.
A suggested alternative to linkage studies was the genetic association study. This study type asks if the allele of a genetic variant is found more often than expected in individuals with the phenotype of interest (e.g. with the disease being studied). Early calculations on statistical power indicated that this approach could be better than linkage studies at detecting weak genetic effects.
In addition to the conceptual framework, several additional factors enabled GWA studies. One was the advent of biobanks, which are repositories of human genetic material that greatly reduced the cost and difficulty of collecting sufficient numbers of biological specimens for study. Another was the International HapMap Project, which, from 2003, identified a majority of the common SNPs interrogated in a GWA study. The haploblock structure identified by the HapMap project also allowed studies to focus on the subset of SNPs that would describe most of the variation. The development of methods to genotype all these SNPs using genotyping arrays was also an important prerequisite.
Methods
The most common approach of GWA studies is the case-control setup, which compares two large groups of individuals, one healthy control group and one case group affected by a disease. All individuals in each group are genotyped for the majority of common known SNPs. The exact number of SNPs depends on the genotyping technology, but is typically one million or more.
For each of these SNPs it is then investigated whether the allele frequency is significantly altered between the case and the control group.
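The comparison described above starts from allele frequencies in each group. The following is a minimal sketch of that counting step, assuming a hypothetical encoding in which each genotype is a two-character string (one character per chromosome copy); the sample data are invented for illustration.

```python
from collections import Counter


def allele_frequency(genotypes, allele):
    """Return the fraction of all chromosomes in `genotypes` carrying `allele`.

    Each genotype is a two-character string such as "TC" (an assumed,
    illustrative encoding: one character per chromosome copy).
    """
    counts = Counter("".join(genotypes))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return counts[allele] / total


# Invented toy data for a single SNP with alleles T and C.
cases = ["TT", "TC", "TC", "CC", "TT"]
controls = ["CC", "TC", "CC", "CC", "TC"]

case_freq = allele_frequency(cases, "T")        # 6 of 10 chromosomes carry T
control_freq = allele_frequency(controls, "T")  # 2 of 10 chromosomes carry T
```

In a real study the same comparison would be repeated for every genotyped SNP, and the difference would be assessed with a statistical test rather than by inspection.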
In such setups, the fundamental unit for reporting effect sizes is the odds ratio. The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of being a case for individuals having a specific allele and the odds of being a case for individuals who do not have that same allele.
Example: suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).
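The formula (A/B)/(X/Y) can be written out directly. This is a small sketch using the same symbols as the example above; the function name and the counts passed to it are hypothetical.

```python
def odds_ratio(a, b, x, y):
    """Odds ratio for allele T, computed as (A/B) / (X/Y).

    a: cases with allele T      b: controls with allele T
    x: cases with allele C      y: controls with allele C
    """
    if b == 0 or x == 0 or y == 0:
        raise ValueError("odds ratio is undefined when B, X or Y is zero")
    return (a / b) / (x / y)


# Invented counts: T is over-represented among cases.
example_or = odds_ratio(a=600, b=400, x=400, y=600)
```

With these invented counts the odds of being a case are 1.5 for allele T and about 0.67 for allele C, so the ratio is 2.25.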
When the allele frequency in the case group is higher than in the control group, the odds ratio is greater than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.
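As an illustration of the chi-squared test mentioned above, the sketch below computes the Pearson statistic for a 2×2 allele-count table and converts it to a P-value with one degree of freedom. It assumes no continuity correction, uses only the standard library, and reuses the A, B, X, Y naming from the odds ratio example; dedicated GWAS software may differ in detail.

```python
import math


def chi_squared_2x2(a, b, x, y):
    """Pearson chi-squared statistic and P-value for a 2x2 allele-count table.

    Rows are alleles (T: a cases, b controls; C: x cases, y controls).
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + x + y
    row_t, row_c = a + b, x + y
    col_cases, col_controls = a + x, b + y
    denominator = row_t * row_c * col_cases * col_controls
    if denominator == 0:
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * y - b * x) ** 2 / denominator
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_squared_2x2(a=600, b=400, x=400, y=600)
```

For a table with identical allele frequencies in cases and controls the statistic is 0 and the P-value is 1, which corresponds to an odds ratio of exactly 1.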
Because so many variants are tested, it is standard practice to require the p-value to be lower than 5 × 10⁻⁸ to consider a variant significant.
There are several variations on the case-control approach. A common alternative to case-control GWA studies is the analysis of quantitative phenotypic data, e.g. height, biomarker concentrations or even gene expression. Likewise, alternative statistics designed for dominance or recessive penetrance patterns can be used.
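For a quantitative phenotype, one simple way to express an effect size is the slope of a linear regression of the trait on the number of copies of an allele (0, 1 or 2). The text does not prescribe this method, so the sketch below is an assumed additive model with invented data, not a description of any specific tool.

```python
def dosage_effect(dosages, phenotypes):
    """Least-squares slope and intercept of phenotype on allele dosage.

    dosages: number of copies of the tested allele per individual (0, 1 or 2).
    phenotypes: the quantitative trait value for the same individuals.
    """
    if len(dosages) != len(phenotypes) or not dosages:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all individuals have the same dosage")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented heights in cm for six individuals.
slope, intercept = dosage_effect([0, 0, 1, 1, 2, 2], [170, 172, 173, 175, 176, 178])
```

Here each additional copy of the allele is associated with 3 cm of extra height in the toy data. A dominant or recessive model could be sketched the same way by recoding the dosages, for example mapping 1 and 2 both to 1.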
Calculations are typically done using bioinformatics software such as SNPTEST and PLINK, which also include support for many of these alternative statistics.
GWAS focuses on the effect of individual SNPs. However, it is also possible that complex interactions among two or more SNPs, known as epistasis, might contribute to complex diseases. Due to the potentially exponential number of interactions, detecting statistically significant interactions in GWAS data is both computationally and statistically challenging. This task has been tackled in existing publications that use algorithms inspired by data mining. Moreover, researchers try to integrate GWA data with other biological data, such as protein-protein interaction networks, to extract more informative results.
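To give a sense of the scale involved, the sketch below counts how many SNP combinations an exhaustive interaction search would have to test. The array size of one million SNPs is an assumption taken from the typical figure mentioned earlier, and the 0.05 family-wise level is likewise illustrative.

```python
import math


def interaction_tests(num_snps, order=2):
    """Number of distinct SNP combinations of the given interaction order."""
    return math.comb(num_snps, order)


num_snps = 1_000_000  # assumed array size
pairs = interaction_tests(num_snps, 2)
triples = interaction_tests(num_snps, 3)

# Bonferroni-style per-test threshold if every pair were tested
# at an assumed family-wise level of 0.05.
pair_threshold = 0.05 / pairs
```

Even restricted to pairs, the search involves roughly 5 × 10¹¹ tests, which illustrates why both the computation and the multiple-testing burden are demanding.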
A key step in the majority of GWA studies is the imputation of genotypes at SNPs not on the genotype chip used in the study. This process greatly increases the number of SNPs that can be tested for association, increases the power of the study, and facilitates meta-analysis of GWAS across distinct cohorts. Genotype imputation is carried out by statistical methods that combine the GWAS data together with a reference panel of haplotypes. These methods take advantage of sharing of haplotypes between individuals over short stretches of sequence to impute alleles. Existing software packages for genotype imputation include IMPUTE2, Minimac, Beagle and MaCH.
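The idea of borrowing alleles from a reference panel can be illustrated with a deliberately simplified toy: pick the reference haplotype that best matches the typed positions and copy its alleles into the untyped ones. This nearest-haplotype matching is an assumption made for illustration only; the packages named above use more sophisticated statistical models, and the panel and observed data here are invented.

```python
def impute(observed, reference_panel):
    """Fill untyped positions (None) from the best-matching reference haplotype.

    observed: list of alleles, with None where the SNP was not on the chip.
    reference_panel: list of haplotype strings of the same length.
    """
    def mismatches(haplotype):
        # Count disagreements at typed positions only.
        return sum(
            1 for obs, ref in zip(observed, haplotype)
            if obs is not None and obs != ref
        )

    best = min(reference_panel, key=mismatches)
    return [ref if obs is None else obs for obs, ref in zip(observed, best)]


reference_panel = ["ACGTA", "ATGCA", "GCGTT"]  # invented haplotypes
observed = ["A", None, "G", "C", None]          # positions 2 and 5 untyped
imputed = impute(observed, reference_panel)
```

In this toy case the second reference haplotype matches all three typed positions, so its alleles fill the two gaps.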
In addition to the calculation of association, it is common to take into account any variables that could potentially confound the results. Sex and age are common examples of confounding variables. Moreover, it is also known that many genetic variations are associated with the geographical and historical populations in which the mutations first arose. Because of this association, studies must take account of the geographic and ethnic background of participants by controlling for what is called population stratification. If they fail to do so, these studies can produce false positive results.
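A small numerical sketch shows how stratification can create a false positive. Two invented subpopulations each show no association (odds ratio 1), yet pooling them naively yields a large odds ratio because allele frequency and case proportion both differ between them. The stratified estimate used here is the Mantel-Haenszel odds ratio, chosen for simplicity as an assumption of this sketch; it is not claimed to be the method GWAS software uses.

```python
def crude_odds_ratio(strata):
    """Odds ratio after naively pooling all strata into one table."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    x = sum(s[2] for s in strata)
    y = sum(s[3] for s in strata)
    return (a * y) / (b * x)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel odds ratio across strata.

    Each stratum is (a, b, x, y): cases with T, controls with T,
    cases with C, controls with C.
    """
    numerator = sum(a * y / (a + b + x + y) for a, b, x, y in strata)
    denominator = sum(b * x / (a + b + x + y) for a, b, x, y in strata)
    return numerator / denominator


# Invented allele counts. Population 1: T frequency 0.8, mostly cases.
# Population 2: T frequency 0.2, mostly controls.
strata = [
    (160, 40, 40, 10),
    (10, 40, 40, 160),
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
```

The pooled estimate is about 4.5 even though neither subpopulation shows any association, while the stratified estimate recovers an odds ratio of 1.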
After odds ratios and P-values have been calculated for all SNPs, a common approach is to create a Manhattan plot. In the context of GWA studies, this plot shows the negative logarithm of the P-value as a function of genomic location. Thus the SNPs with the most significant association stand out on the plot, usually as stacks of points because of haploblock structure. Importantly, the P-value threshold for significance is corrected for multiple testing issues. The exact threshold varies by study, but the conventional genome-wide significance threshold is 5 × 10⁻⁸, chosen to remain significant in the face of hundreds of thousands to millions of tested SNPs.
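The quantities plotted in a Manhattan plot can be prepared as in the sketch below, which computes the negative base-10 logarithm of each P-value and flags the SNPs that pass a threshold. The default of 5e-8 is the conventional genome-wide significance value (roughly 0.05 divided by one million tests); the result tuples are invented, and the plotting step itself is left out.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional value, about 0.05 / 1,000,000 tests


def manhattan_points(results, threshold=GENOME_WIDE_THRESHOLD):
    """Prepare plot-ready points from (chromosome, position, p_value) tuples.

    Points are ordered by chromosome and position, as on the x-axis
    of a Manhattan plot.
    """
    points = []
    for chrom, pos, p_value in sorted(results):
        points.append({
            "chrom": chrom,
            "pos": pos,
            "neg_log10_p": -math.log10(p_value),
            "significant": p_value < threshold,
        })
    return points


# Invented association results for four SNPs.
results = [
    (1, 1_050_000, 0.2),
    (1, 2_300_000, 3e-9),
    (2, 500_000, 1e-4),
    (2, 750_000, 4e-8),
]
points = manhattan_points(results)
hits = [(pt["chrom"], pt["pos"]) for pt in points if pt["significant"]]
```

A study using a different correction would pass its own value through the `threshold` parameter.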
GWA studies typically perform the first analysis in a discovery cohort, followed by validation of the most significant SNPs in an independent validation cohort.
Results
Attempts have been made at creating comprehensive catalogues of SNPs that have been identified from GWA studies.
As of 2009, SNPs associated with diseases numbered in the thousands.
An early landmark GWA study, conducted in 2005, compared 96 patients with age-related macular degeneration (ARMD) with 50 healthy controls. It identified two SNPs with significantly altered allele frequency between the two groups. These SNPs were located in the gene encoding complement factor H, which was an unexpected finding in the research of ARMD. The findings from these first GWA studies have subsequently prompted further functional research towards therapeutic manipulation of the complement system in ARMD.
Another landmark publication in the history of GWA studies was the Wellcome Trust Case Control Consortium (WTCCC) study, the largest GWA study ever conducted at the time of its publication in 2007. The WTCCC included 14,000 cases of seven common diseases (~2,000 individuals for each of coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder, and hypertension) and 3,000 shared controls. This study was successful in uncovering many new disease genes underlying these diseases.
Since these first landmark GWA studies, there have been two general trends. One has been towards larger and larger sample sizes. In 2018, several genome-wide association studies reached a total sample size of over 1 million participants, including 1.1 million in a genome-wide study of educational attainment and a study of insomnia containing 1.3 million individuals. The reason is the drive towards reliably detecting risk-SNPs that have smaller odds ratios and lower allele frequency. Another trend has been towards the use of more narrowly defined phenotypes, such as blood lipids, proinsulin or similar biomarkers. These are called ''intermediate phenotypes'', and their analyses may be of value to functional research into biomarkers.
A variation of GWAS uses participants that are first-degree ''relatives'' of people with a disease. This type of study has been named genome-wide association study by proxy (''GWAX'').
A central point of debate on GWA studies has been that most of the SNP variations found by GWA studies are associated with only a small increased risk of the disease, and have only a small predictive value. The median odds ratio is 1.33 per risk-SNP, with only a few showing odds ratios above 3.0. These magnitudes are considered small because they do not explain much of the heritable variation. This heritable variation is estimated from heritability studies based on monozygotic twins. For example, it is known that 80-90% of variance in height can be explained by hereditary differences, but GWA studies only account for a minority of this variance.
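To make the median odds ratio of 1.33 concrete, the sketch below converts a baseline risk into the risk implied by that odds ratio. The 10% baseline (taken here as the risk for a person without the allele) is a hypothetical figure chosen for illustration, not a value from any study.

```python
def risk_with_allele(baseline_risk, odds_ratio):
    """Disease risk for allele carriers, given non-carrier risk and odds ratio."""
    if not 0 < baseline_risk < 1:
        raise ValueError("baseline_risk must be strictly between 0 and 1")
    baseline_odds = baseline_risk / (1 - baseline_risk)
    carrier_odds = baseline_odds * odds_ratio
    return carrier_odds / (1 + carrier_odds)


# Hypothetical 10% baseline risk combined with the median odds ratio of 1.33.
carrier_risk = risk_with_allele(baseline_risk=0.10, odds_ratio=1.33)
```

Under these assumptions the risk rises from 10% to about 12.9%, which shows why a single risk-SNP of typical effect size separates cases from controls only weakly.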
Clinical applications and examples
A challenge for future successful GWA studies is to apply the findings in a way that accelerates drug and diagnostics development, including better integration of genetic studies into the drug-development process and a focus on the role of genetic variation in maintaining health as a blueprint for designing new drugs and diagnostics.
Several studies have looked into the use of risk-SNP markers as a means of directly improving the accuracy of prognosis. Some have found that the accuracy of prognosis improves, while others report only minor benefits from this use. Generally, a problem with this direct approach is the small magnitudes of the effects observed. A small effect ultimately translates into a poor separation of cases and controls and thus only a small improvement of prognosis accuracy. An alternative application is therefore the potential for GWA studies to elucidate pathophysiology.
Hepatitis C treatment
One such success is related to identifying the genetic variant associated with response to anti-hepatitis C virus treatment. For genotype 1 hepatitis C treated with pegylated interferon-alpha-2a or pegylated interferon-alpha-2b combined with ribavirin, a GWA study has shown that SNPs near the human IL28B gene, encoding interferon lambda 3, are associated with significant differences in response to the treatment. A later report demonstrated that the same genetic variants are also associated with the natural clearance of the genotype 1 hepatitis C virus. These major findings facilitated the development of personalized medicine and allowed physicians to customize medical decisions based on the patient's genotype.
eQTL, LDL and cardiovascular disease
The goal of elucidating pathophysiology has also led to increased interest in the association between risk-SNPs and the gene expression of nearby genes, the so-called expression quantitative trait loci (eQTL) studies. The reason is that GWA studies identify risk-SNPs, but not risk-genes, and specification of genes is one step closer towards actionable drug targets. As a result, major GWA studies by 2011 typically included extensive eQTL analysis.
One of the strongest eQTL effects observed for a GWA-identified risk SNP is the SORT1 locus. Functional follow-up studies of this locus using small interfering RNA and gene knock-out mice have shed light on the metabolism of low-density lipoproteins, which has important clinical implications for cardiovascular disease.
Atrial fibrillation
For example, a meta-analysis completed in 2018 reported the discovery of 70 new loci associated with atrial fibrillation. Different variants were identified in association with genes encoding transcription factors, such as TBX3 and TBX5, NKX2-5 or PITX2, which are involved in the regulation of cardiac conduction, in ion channel modulation and in cardiac development. New genes were also identified that are involved in tachycardia (CASQ2) or associated with altered communication between cardiac muscle cells (PKP2).
Schizophrenia
Although research using a High-Precision Protein Interaction Prediction (HiPPIP) computational model discovered 504 new protein-protein interactions (PPIs) associated with genes linked to schizophrenia, the evidence supporting the genetic basis of schizophrenia remains controversial and may suffer from some of the limitations of this method of study.
Agricultural applications
Plant growth stages and yield components
GWA studies are an important tool in plant breeding. With large genotyping and phenotyping datasets, GWAS are powerful for analyzing the complex inheritance of traits that are important yield components, such as the number of grains per spike, the weight of each grain and plant structure. A GWAS in spring wheat revealed a strong correlation of grain production with booting date, biomass and number of grains per spike. GWA studies have also been successful in studying the genetic architecture of complex traits in rice.
Plant pathogens
The emergence of plant pathogens has posed serious threats to plant health and biodiversity. Identifying wild types that have natural resistance to certain pathogens is therefore of vital importance, as is predicting which alleles are associated with that resistance. GWA studies are a powerful tool for detecting relationships between particular variants and resistance to a plant pathogen, which is beneficial for developing new pathogen-resistant cultivars.
Chicken
The first GWA study in chickens was done by Abasht and Lamont in 2007. This GWA study examined the fatness trait in a previously studied F2 population. Significantly associated SNPs were found on 10 chromosomes (1, 2, 3, 4, 7, 8, 10, 12, 15 and 27).
Limitations
GWA studies have several issues and limitations that can be addressed through proper quality control and study design. Common problems include a lack of well-defined case and control groups, insufficient sample size, and inadequate control for multiple testing and for population stratification.
The statistical issue of multiple testing is of particular concern: it has been noted that "the GWA approach can be problematic because the massive number of statistical tests performed presents an unprecedented potential for false-positive results".
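One common way to control for multiple testing is the Bonferroni correction, which divides the desired family-wise error rate by the number of tests performed. The sketch below is illustrative only: the function names and the p-values are hypothetical, and real studies often use a fixed genome-wide threshold or other correction methods instead.

```python
from __future__ import annotations


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_hits(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return the variant identifiers whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(snp for snp, p in p_values.items() if p < threshold)


# Hypothetical p-values, for illustration only (not taken from any study).
example = {"snp_a": 1e-9, "snp_b": 0.004, "snp_c": 0.03, "snp_d": 0.5}

# With 4 tests the per-test threshold is 0.05 / 4 = 0.0125.
print(significant_hits(example))

# With one million tests the threshold becomes very stringent.
print(bonferroni_threshold(0.05, 1_000_000))
```

With a million independent tests, a nominal error rate of 0.05 corresponds to a per-variant threshold on the order of 5 × 10⁻⁸, which illustrates why so few nominally significant associations survive correction.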
Ignoring these correctable issues has been cited as contributing to a general sense of problems with the GWA methodology. In addition to easily correctable problems such as these, some more subtle but important issues have surfaced. A high-profile GWA study that investigated individuals with very long life spans to identify SNPs associated with longevity is an example of this. The publication came under scrutiny because of a discrepancy between the types of genotyping array used in the case and control groups, which caused several SNPs to be falsely highlighted as associated with longevity. The study was subsequently retracted, but a modified manuscript was later published. This points to a general vulnerability of GWA studies based on genotyping arrays: their high reliance on array design. Another consequence is that such studies are unable to detect the contribution of very rare mutations not included in the array.
In addition to these preventable issues, GWA studies have attracted more fundamental criticism, mainly because of their assumption that common genetic variation plays a large role in explaining the heritable variation of common disease.
Indeed, it has been estimated that for most conditions the SNP heritability attributable to common SNPs is <0.05.
This aspect of GWA studies has attracted the criticism that, although it could not have been known prospectively, GWA studies were ultimately not worth the expenditure.
GWA studies also face the criticism that the broad variation in individual responses or compensatory mechanisms to a disease state cancels out and masks potential genes or causal variants associated with the disease. Additionally, GWA studies identify candidate risk variants for the population in which the analysis is performed, and because most GWA studies draw on European databases, the identified risk variants translate poorly to non-European populations. Alternative strategies suggested involve linkage analysis.
More recently, the rapidly decreasing price of complete genome sequencing has also provided a realistic alternative to genotyping array-based GWA studies. It is debatable whether the use of this new technique should still be referred to as a GWA study, but high-throughput sequencing does have the potential to side-step some of the shortcomings of non-sequencing GWA.
Fine-mapping
Genotyping arrays designed for GWAS rely on linkage disequilibrium to provide coverage of the entire genome by genotyping a subset of variants. Because of this, the reported associated variants are unlikely to be the actual causal variants. Associated regions can contain hundreds of variants spanning large regions and encompassing many different genes, making the biological interpretation of GWAS loci more difficult. Fine-mapping is a process to refine these lists of associated variants to a credible set most likely to include the causal variant.
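Linkage disequilibrium between two biallelic variants is often summarized by the squared correlation r². The sketch below computes it from haplotype and allele frequencies; the function name and the example frequencies are hypothetical and chosen only to show the two extremes.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical frequencies, for illustration only.
perfect = ld_r_squared(p_ab=0.5, p_a=0.5, p_b=0.5)       # A always travels with B
independent = ld_r_squared(p_ab=0.25, p_a=0.5, p_b=0.5)  # haplotype frequency equals the product
print(perfect, independent)
```

When r² between a genotyped variant and an ungenotyped one is close to 1, the genotyped variant acts as a good proxy, which is also why the associated variant and the causal variant can be hard to tell apart.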
Fine-mapping requires all variants in the associated region to have been genotyped or imputed (dense coverage), very stringent quality control resulting in high-quality genotypes, and sample sizes large enough to separate highly correlated signals. There are several different methods to perform fine-mapping, and all of them produce a posterior probability that a variant in the locus is causal. Because these requirements are often difficult to satisfy, there are still limited examples of these methods being applied more generally.
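As a rough illustration of how posterior probabilities lead to a credible set, the sketch below uses approximate Bayes factors computed from per-variant effect estimates and standard errors. It rests on assumptions not stated in the text above: exactly one causal variant in the region, equal prior probability for every variant, and a normal prior on the effect size with variance `prior_var` (0.04 is an arbitrary illustrative value). The variant names and summary statistics are hypothetical, and this is only one of several possible approaches.

```python
from __future__ import annotations

import math


def log_abf(beta: float, se: float, prior_var: float = 0.04) -> float:
    """Log approximate Bayes factor for association versus no association."""
    v = se * se                           # variance of the effect estimate
    z = beta / se
    shrink = prior_var / (v + prior_var)  # W / (V + W)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink


def posterior_probabilities(
    stats: dict[str, tuple[float, float]], prior_var: float = 0.04
) -> dict[str, float]:
    """Posterior probability of being causal, assuming a single causal variant."""
    logs = {snp: log_abf(b, s, prior_var) for snp, (b, s) in stats.items()}
    top = max(logs.values())              # subtract the maximum for numerical stability
    weights = {snp: math.exp(value - top) for snp, value in logs.items()}
    total = sum(weights.values())
    return {snp: w / total for snp, w in weights.items()}


def credible_set(posteriors: dict[str, float], coverage: float = 0.95) -> list[str]:
    """Smallest set of top-ranked variants whose posteriors sum to at least coverage."""
    chosen: list[str] = []
    cumulative = 0.0
    for snp, prob in sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(snp)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical (effect estimate, standard error) pairs, for illustration only.
region = {"rs_lead": (0.30, 0.04), "rs_proxy": (0.28, 0.04), "rs_weak": (0.05, 0.04)}
posteriors = posterior_probabilities(region)
print(credible_set(posteriors, coverage=0.95))
print(credible_set(posteriors, coverage=0.99))
```

Raising the coverage enlarges the credible set, and highly correlated variants with similar statistics tend to share posterior probability, which is why large samples are needed to separate them.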
See also
* Association mapping
* Transcriptome-wide association study
* Epidemiology
* Gene–environment interaction
* Genomics
* Linkage disequilibrium
* Molecular epidemiology
* Polygenic score
* Genetic epidemiology
* Common disease-common variant hypothesis
* Microbiome-wide association study
External links
* Genotype-phenotype interaction software tools and databases on omicX
* Statistical Methods for the Analysis of Genome-Wide Association Studies, video lecture series
* Whole genome association studies, by the National Human Genome Research Institute
* GWAS Central, a central database of summary-level genetic association findings
*
— by Bennett SN, Caporaso, NE, ''et al.''
* PLINK, a whole genome association analysis toolset
* ENCODE threads explorer: Impact of functional information on understanding variation. Nature (journal)