Genome-wide complex trait analysis (GCTA), or genome-based restricted maximum likelihood (GREML), is a statistical method for variance component estimation in genetics. It quantifies the total narrow-sense (additive) contribution to a trait's heritability of a particular subset of genetic variants (typically limited to SNPs with a minor allele frequency (MAF) >1%, hence terms such as "chip heritability"/"SNP heritability"). This is done by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait; if two unrelated individuals are relatively similar genetically and also have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the correlation can to some degree tell how much. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness.

The GCTA framework can be applied in a variety of settings. For example, it can be used to examine changes in heritability over aging and development ("Genetic contributions to stability and change in intelligence from childhood to old age", Deary et al 2012). It can also be extended to analyse bivariate genetic correlations between traits (Lee et al 2012, "Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood").

There is an ongoing debate about whether GCTA generates reliable or stable estimates of heritability when used on current SNP data. Critics argue that the method rests on an outdated and false dichotomy of genes versus the environment, and that it suffers from serious methodological weaknesses, such as susceptibility to population stratification.

GCTA heritability estimates are useful because they provide lower bounds for the genetic contributions to traits such as intelligence without relying on the assumptions used in twin studies and other family and pedigree studies, thereby corroborating them and enabling the design of well-powered genome-wide association studies (GWAS) to find the specific genetic variants involved. For example, a GCTA estimate of 30% SNP heritability is consistent with a larger total genetic heritability of 70%. However, if the GCTA estimate was ~0%, then that would imply one of three things: a) there is no genetic contribution, b) the genetic contribution is entirely in the form of genetic variants not included, or c) the genetic contribution is entirely in the form of non-additive effects such as epistasis/dominance.

Running GCTA on individual chromosomes and regressing the estimated proportion of trait variance explained by each chromosome against that chromosome's length can reveal whether the responsible genetic variants cluster, are distributed evenly across the genome, or are sex-linked. Chromosomes can of course be replaced by more fine-grained or functionally informed subdivisions. Examining genetic correlations can reveal to what extent observed correlations, such as between intelligence and socioeconomic status, are due to the same genetic factors, and in the case of diseases, can indicate shared causal pathways such as can be inferred from the genetic variation jointly associated with schizophrenia and other mental diseases or reduced intelligence.
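The core idea of comparing chance genetic similarity with trait similarity can be sketched in a few lines. The example below is only an illustration under stated assumptions: it simulates unrelated individuals with a purely additive trait, builds a genetic relationship matrix from standardized allele counts, and then uses Haseman-Elston regression (a simple moment-based estimator, swapped in here for the REML fit that GCTA itself performs) to regress phenotype cross-products on pairwise relatedness. All function names, sample sizes, and simulation settings are hypothetical.

```python
import random


def standardize_genotypes(genotypes):
    """Column-standardize an n x m matrix of 0/1/2 allele counts.

    Monomorphic SNPs (zero variance) are dropped because they cannot be scaled.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > 0:
            sd = var ** 0.5
            columns.append([(x - mean) / sd for x in col])
    # Transpose back to one row per individual.
    return [list(row) for row in zip(*columns)]


def genetic_relationship_matrix(z):
    """A[i][k] = average over SNPs of z[i][j] * z[k][j]."""
    n, m = len(z), len(z[0])
    grm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            value = sum(a * b for a, b in zip(z[i], z[k])) / m
            grm[i][k] = grm[k][i] = value
    return grm


def haseman_elston(grm, phenotype):
    """Slope of phenotype cross-products on relatedness over all pairs i < k."""
    n = len(phenotype)
    mean = sum(phenotype) / n
    sd = (sum((y - mean) ** 2 for y in phenotype) / n) ** 0.5
    y = [(v - mean) / sd for v in phenotype]
    xs, ys = [], []
    for i in range(n):
        for k in range(i + 1, n):
            xs.append(grm[i][k])
            ys.append(y[i] * y[k])
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    return sxy / sxx


def simulate(n=200, m=200, h2=0.5, seed=1):
    """Simulate unrelated individuals with a purely additive trait (assumed model)."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.5) for _ in range(m)]
    genotypes = [[(rng.random() < p) + (rng.random() < p) for p in freqs]
                 for _ in range(n)]
    z = standardize_genotypes(genotypes)
    m_used = len(z[0])
    effects = [rng.gauss(0.0, (h2 / m_used) ** 0.5) for _ in range(m_used)]
    noise_sd = (1.0 - h2) ** 0.5
    phenotype = [sum(b * g for b, g in zip(effects, row)) + rng.gauss(0.0, noise_sd)
                 for row in z]
    return z, phenotype


if __name__ == "__main__":
    z, phenotype = simulate()
    grm = genetic_relationship_matrix(z)
    print("estimated SNP heritability:", round(haseman_elston(grm, phenotype), 2))
```

With only a few hundred simulated individuals the estimate is very noisy, which is consistent with the large sample sizes the method is said to need in practice.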


History

Estimating variance components such as heritability, shared environment, and maternal effects in biology and animal breeding with standard ANOVA/REML methods typically requires individuals of known relatedness, such as parent/child pairs. Such data are often unavailable or the pedigree unreliable, which either prevents the methods from being applied or requires strict laboratory control of all breeding (which threatens the external validity of all estimates). Several authors noted that relatedness could be measured directly from genetic markers (and that if individuals were reasonably related, only a small, affordable number of markers would be needed for statistical power), leading Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared to pairwise phenotype measurements (Ritland 1996, "A Marker-based Method for Inferences About Quantitative Inheritance in Natural Populations").

As genome sequencing costs dropped steeply over the 2000s, acquiring enough markers on enough subjects for reliable estimates using very distantly related individuals became possible. An early application of the method to humans came with Visscher et al. 2006/2007, which used SNP markers to estimate the actual relatedness of siblings and estimate heritability from the direct genetics. In humans, unlike the original animal/plant applications, relatedness is usually known with high confidence in the 'wild population', and the benefit of GCTA is connected more to avoiding the assumptions of classic behavioral genetics designs and verifying their results, and to partitioning heritability by SNP class and chromosome. The first use of GCTA proper in humans was published in 2010, finding that 45% of variance in human height can be explained by the included SNPs ("Common SNPs explain a large proportion of heritability for human height", Yang et al 2010). Large GWASes on height have since confirmed the estimate ("Defining the role of common variation in the genomic and biological architecture of adult human height", Wood et al 2014). The GCTA algorithm was then described and a software implementation published in 2011. It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and has inspired many variant approaches.


Benefits


Robust heritability

Twin and family studies have long been used to estimate variance explained by particular categories of genetic and environmental causes. Across a wide variety of human traits studied, there is typically minimal shared-environment influence, considerable non-shared environment influence, and a large genetic component (mostly additive), which is on average ~50% and sometimes much higher for some traits such as height or intelligence. However, the twin and family studies have been criticized for their reliance on a number of assumptions that are difficult or impossible to verify, such as the equal environments assumption (that the environments of monozygotic and dizygotic twins are equally similar), that there is no misclassification of zygosity (mistaking identical for fraternal and vice versa), that twins are representative of the general population, and that there is no assortative mating. Violations of these assumptions can result in both upwards and downwards bias of the parameter estimates. (This debate and criticism have particularly focused on the heritability of IQ.)

The use of SNP or whole-genome data from unrelated participants (with participants who are too related, typically >0.025 or ~fourth-cousin levels of similarity, being removed, and several principal components included in the regression to avoid and control for population stratification) bypasses many heritability criticisms: twins are often entirely uninvolved, there are no questions of equal treatment, relatedness is estimated precisely, and the samples are drawn from a broad variety of subjects. In addition to being more robust to violations of the twin study assumptions, SNP data can be easier to collect since it does not require rare twins, and thus heritability can also be estimated for rare traits (with due correction for ascertainment bias).
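The relatedness filter described above can be illustrated with a small sketch. It assumes a precomputed genetic relationship matrix and applies a greedy heuristic that repeatedly drops the individual involved in the most over-threshold pairs; this heuristic and the function name `prune_related` are assumptions made for illustration and are not claimed to match the procedure used by any particular software package.

```python
def prune_related(grm, cutoff=0.025):
    """Return indices of individuals kept so that no kept pair exceeds `cutoff`.

    Greedy heuristic (illustrative only): repeatedly drop the individual
    involved in the largest number of over-threshold pairs.
    """
    kept = set(range(len(grm)))
    while True:
        # Count, for each remaining person, how many remaining partners are too related.
        counts = {i: sum(1 for k in kept if k != i and grm[i][k] > cutoff)
                  for i in sorted(kept)}
        worst = max(counts, key=counts.get, default=None)
        if worst is None or counts[worst] == 0:
            return sorted(kept)
        kept.remove(worst)


if __name__ == "__main__":
    # Hypothetical 4-person matrix: persons 0 and 1 look like close relatives.
    toy_grm = [
        [1.00, 0.50, 0.01, 0.00],
        [0.50, 1.00, 0.00, 0.02],
        [0.01, 0.00, 1.00, -0.01],
        [0.00, 0.02, -0.01, 1.00],
    ]
    print(prune_related(toy_grm))
```

Dropping the most-connected individual first tends to retain more of the sample than removing one member of each related pair arbitrarily, though it is not guaranteed to be optimal.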


GWAS power

GCTA estimates can be used to resolve the missing heritability problem and to design GWASes which will yield genome-wide statistically significant hits. This is done by comparing the GCTA estimate with the results of smaller GWASes. If a GWAS of n=10k using SNP data fails to turn up any hits, but the GCTA indicates a high heritability accounted for by SNPs, then that implies that a large number of variants are involved (polygenicity) and thus that much larger GWASes will be required to accurately estimate each SNP's effect and directly account for a fraction of the GCTA heritability.
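The reasoning about polygenicity and sample size can be made concrete with a rough power calculation. The sketch below assumes, purely for illustration, that the SNP heritability is spread evenly over a fixed number of causal variants, and it uses the common approximation that a 1-degree-of-freedom association test statistic is noncentral chi-square with noncentrality n·q/(1−q), where q is the share of trait variance explained by one SNP. The function name and the example trait architecture are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gwas_power(n, variance_explained, alpha=5e-8):
    """Approximate power to detect one SNP in a GWAS of n unrelated people.

    A 1-df noncentral chi-square variable equals (Z + sqrt(ncp))**2 for a
    standard normal Z, so its tail probability reduces to two normal tails.
    """
    q = variance_explained
    root_ncp = (n * q / (1.0 - q)) ** 0.5
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)  # two-sided threshold
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - root_ncp)
    lower = _STD_NORMAL.cdf(-z_crit - root_ncp)
    return upper + lower


if __name__ == "__main__":
    h2_snp, n_causal = 0.30, 10_000  # hypothetical trait architecture
    q = h2_snp / n_causal            # variance explained per causal SNP
    for n in (10_000, 100_000, 1_000_000):
        print(n, f"{gwas_power(n, q):.3g}")
```

Under these assumed numbers, per-SNP power is negligible at n=10,000 and reaches roughly one half only at around a million participants, which matches the qualitative argument that a null GWAS combined with a high GCTA estimate points to many small effects.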


Disadvantages

# Limited inference: GCTA estimates are inherently limited in that they cannot estimate broad-sense heritability as twin/family studies can, since they only estimate the heritability due to SNPs. Hence, while they serve as a critical check on the unbiasedness of the twin/family studies, GCTAs cannot replace them for estimating total genetic contributions to a trait.
# Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity (although this is no longer such an issue for current commercial chips, which default to hundreds of thousands or millions of markers); and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least n>1000 and ideally n>10000. In contrast, twin studies can offer precise estimates with a fraction of the sample size.
# Computational inefficiency: The original GCTA implementation scales poorly with increasing data size (\mathcal{O}(\text{SNPs} \cdot n^2)), so even if enough data is available for precise GCTA estimates, the computational burden may be infeasible. GCTA can be meta-analyzed as a standard precision-weighted fixed-effect meta-analysis, so research groups sometimes estimate cohorts or subsets and then pool them meta-analytically (at the cost of additional complexity and some loss of precision). This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using moment matching.
# Need for raw data: GCTA requires the genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and if pooling multiple GCTA estimates, a meta-analysis must be performed. In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data, e.g. "LD score regression" contrasts linkage disequilibrium statistics (available from public datasets like 1000 Genomes) with the public summary effect sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. The Broad Institute runs LD Hub, which provides a public web interface to >=177 traits with LD score regression. Another method using summary data is HESS.
# Confidence intervals may be incorrect, or outside the 0-1 range of heritability, and highly imprecise due to asymptotics.
# Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects as they represent mutation load being purged by negative selection. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
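The precision-weighted fixed-effect pooling mentioned in the list above amounts to an inverse-variance weighted average. A minimal sketch follows, assuming each cohort reports a point estimate and a standard error; the function name and the cohort values are made up for illustration and are not real results.

```python
def fixed_effect_meta(estimates):
    """Pool (estimate, standard_error) pairs with inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical per-cohort SNP-heritability estimates, not real results.
    cohorts = [(0.28, 0.10), (0.35, 0.08), (0.22, 0.15)]
    pooled, pooled_se = fixed_effect_meta(cohorts)
    print(f"pooled h2_SNP = {pooled:.3f} (SE {pooled_se:.3f})")
```

The pooled standard error is always smaller than that of the most precise single cohort, although a fixed-effect model assumes every cohort estimates the same underlying heritability.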


Interpretation

GCTA estimates are often misinterpreted as "the total genetic contribution", and since they are often much less than the twin study estimates, the twin studies are presumed to be biased and the genetic contribution to a particular trait is presumed to be minor. This is incorrect, as GCTA estimates are lower bounds. A more correct interpretation would be that GCTA estimates are the expected amount of variance that could be predicted by an indefinitely large GWAS using a simple additive linear model (without any interactions or higher-order effects) in a particular population at a particular time given the limited selection of SNPs and a trait measured with a particular amount of precision. Hence, there are many ways to exceed GCTA estimates:
# SNP genotyping data is typically limited to 200k-1m of the most common or scientifically interesting SNPs, though 150 million+ have been documented by genome sequencing; as SNP prices drop and arrays become more comprehensive or whole-genome sequencing replaces SNP genotyping entirely, the expected narrow-sense heritability will increase as more genetic variants are included in the analysis. The selection can also be expanded considerably using haplotypes and imputation (SNPs can proxy for unobserved genetic variants which they tend to be inherited with); e.g. Yang et al. 2015 ("Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index") finds that with more aggressive use of imputation to infer unobserved variants, the height GCTA estimate expands to 56% from 45%, and Hill et al. 2017 ("Genomic analysis of family data reveals additional genetic effects on intelligence and personality") finds that expanding GCTA to cover rarer variants raises the intelligence estimates from ~30% to ~53% and explains all the heritability in their sample; for 4 traits in the UK Biobank, imputing raised the SNP heritability estimates (Evans et al 2017, "Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits"). Additional genetic variants include de novo mutations/mutation load and structural variations such as copy-number variations.
# Narrow-sense heritability estimates assume simple additivity of effects, ignoring interactions. As some trait values will be due to these more complicated effects, the total genetic effect will exceed that of the subset measured by GCTA, and as the additive SNPs are found and measured, it will become possible to find interactions as well using more sophisticated statistical models.
# All correlation and heritability estimates are biased downwards to zero by the presence of measurement error; the need to adjust for this leads to techniques such as Spearman's correction for measurement error, as the underestimate can be quite severe for traits where large-scale and accurate measurement is difficult and expensive, such as intelligence. For example, an intelligence GCTA estimate of 0.31, based on an intelligence measurement with test-retest reliability r=0.65, would after correction (\frac{0.31}{0.65}) be a true estimate of ~0.48, indicating that common SNPs alone explain half of variance. Hence, a GWAS with a better measurement of intelligence can expect to find more intelligence hits than indicated by a GCTA based on a noisier measurement.
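The measurement-error correction in the last item is a single division, shown here as a small sketch using the figures quoted in the text. The function name is hypothetical, and the sketch assumes the reliability coefficient can be read as the share of observed trait variance that is not measurement noise.

```python
def correct_for_unreliability(h2_observed, reliability):
    """Divide an observed variance share by the trait's test-retest reliability."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return h2_observed / reliability


if __name__ == "__main__":
    # Figures from the text: observed estimate 0.31, reliability r = 0.65.
    print(round(correct_for_unreliability(0.31, 0.65), 2))  # 0.48
```

Because heritability is a proportion of variance, the observed value is divided by the reliability itself rather than by its square root, which would apply when correcting a correlation between two noisy measures.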


Implementations

The original "GCTA" software package is the most widely used; its primary functionality covers the GREML estimation of SNP heritability, but it includes other functionality as well.

Other implementations and variant algorithms include:
* FAST-LMM
* FAST-LMM-Select: like GCTA in using ridge regression but including feature selection to try to exclude irrelevant SNPs which only add noise to the relatedness estimates
* LMM-Lasso
* GEMMA
* EMMAX
* REACTA (formerly ACTA): claims order-of-magnitude runtime reductions
* BOLT-REML/BOLT-LMM (manual): faster and better scaling ("Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis", Loh et al 2015; see also "Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis", Loh et al 2015), with potentially better efficiency in the meta-analysis scenario
* MEGHA
* PLINK >1.9 (December 2013): supports "the use of genetic relationship matrices in mixed model association analysis and other calculations"
* LDAK: loosens the GCTA assumption that all SNPs, regardless of genotyping quality or frequency, have the same average expected effect, allowing for potentially finding much more SNP heritability
* GREML-IBD (Evans et al 2017, "Narrow-sense heritability estimation of complex traits using identity-by-descent information"): GCTA for identity by descent, attempting to estimate heritability based on shared genome segments in distant, otherwise-unrelated relatives, in order to capture the effect of rarer variants which are not measured by SNP panels or otherwise imputed


Traits

GCTA frequently yields estimates of 0.1-0.5, consistent with broad-sense heritability estimates (with the exception of personality traits, for which theory and current GWAS results suggest non-additive genetics driven by frequency-dependent selection; see "Maintenance of genetic variation in human personality: Testing evolutionary models by estimating heritability due to common causal variants and investigating the effect of distant inbreeding", Verweij et al 2012; "The Evolutionary Genetics of Personality", Penke et al 2007; "The Evolutionary Genetics of Personality Revisited", Penke & Jokela 2016). Traits univariate GCTA has been used on (excluding SNP heritability estimates computed using other algorithms such as LD score regression, and bivariate GCTAs which are listed in genetic correlation) include (point-estimate format: "h^2_{SNP}(standard error)"):


See also

* Animal breeding
* Best linear unbiased prediction
* Mendelian randomization
* Pleiotropy
* "The Correlation between Relatives on the Supposition of Mendelian Inheritance", Fisher 1918


References


Further reading


* "Research review: Polygenic methods and their application to psychiatric traits", Wray et al. 2014
* "Heritability in the genomics era — concepts and misconceptions", Visscher et al. 2008
* "Uncovering the Genetic Architectures of Quantitative Traits", Lee et al. 2016
* "Estimating heritability using genomic data", Stanton-Geddes et al. 2013
* "MultiBLUP: improved SNP-based prediction for complex traits", Speed & Balding 2012
* "Advantages and pitfalls in the application of mixed-model association methods", Yang et al. 2013
* "Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps", Meuwissen et al. 2001
* "Understanding and using quantitative genetic variation", Hill 2009
* "One Hundred Years of Statistical Developments in Animal Breeding", Gianola & Rosa 2015
* "Conditions for the validity of SNP-based heritability estimation", Lee & Chow 2014
* "Measuring missing heritability: Inferring the contribution of common variants", Golan et al. 2014
* "Concepts, estimation and interpretation of SNP-based heritability", Yang et al. 2017
* "Embracing polygenicity: a review of methods and tools for psychiatric genetics research", Maier et al. 2017
* "A systematic review of genome-wide research on psychotic experiences and negative symptom traits: new revelations and implications for psychiatry", Ronald & Pain 2018


External links


* GCTA-GREML Power Calculator ("Statistical Power to Detect Genetic (Co)Variance of Complex Traits Using SNP Data in Unrelated Samples", Visscher et al. 2014)
* "Genomics, Big Data, Medicine, and Complex Traits" (Peter Visscher talk)
* "The Genetic Architectures of Psychological Traits", Lee 2014 slides
* "Heritability-based models for prediction of complex traits", David Balding 2015