Human genetic clustering refers to patterns of relative genetic similarity among human individuals and populations, as well as the wide range of scientific and statistical methods used to study this aspect of human genetic variation.
Clustering studies are thought to be valuable for characterizing the general structure of genetic variation among human populations and for contributing to the study of ancestral origins, evolutionary history, and precision medicine. Since the mapping of the human genome, and with the availability of increasingly powerful analytic tools, cluster analyses have revealed a range of ancestral and migratory trends among human populations and individuals.
Human genetic clusters tend to be organized by geographic ancestry, with divisions between clusters aligning largely with geographic barriers such as oceans or mountain ranges.
Clustering studies have been applied to global populations, as well as to population subsets like post-colonial North America.
Notably, the practice of defining clusters among modern human populations is largely arbitrary and variable due to the continuous nature of human genotypes; although individual genetic markers can be used to produce smaller groups, there are no models that produce completely distinct subgroups when larger numbers of genetic markers are used.
Many studies of human genetic clustering have been implicated in discussions of race, ethnicity, and scientific racism, as some have controversially suggested that genetically derived clusters may be understood as proof of genetically determined races.
Although cluster analyses invariably organize humans (or groups of humans) into subgroups, debate is ongoing on how to interpret these genetic clusters with respect to race and its social and phenotypic features. Moreover, because only a small fraction of overall genetic variation distinguishes one human genotype from another, genetic clustering approaches are highly dependent on the sampled data, genetic markers, and statistical methods applied to their construction.
Genetic clustering algorithms and methods
A wide range of methods have been developed to assess the structure of human populations with the use of genetic data. Early studies of within-group and between-group genetic variation used physical phenotypes and blood groups, whereas modern genetic studies use genetic markers such as Alu sequences, short tandem repeat polymorphisms, and single nucleotide polymorphisms (SNPs), among others. Models for genetic clustering also vary by the algorithms and programs used to process the data. Most sophisticated methods for determining clusters can be categorized as model-based clustering methods (such as the algorithm STRUCTURE) or multidimensional summaries (typically through principal component analysis).
By processing a large number of SNPs (or other genetic marker data) in different ways, both approaches to genetic clustering tend to converge on similar patterns by identifying similarities among SNPs and/or haplotype tracts to reveal ancestral genetic similarities.
Model-based clustering
Common model-based clustering algorithms include STRUCTURE, ADMIXTURE, and HAPMIX. These algorithms operate by finding the best fit for genetic data among an arbitrary or mathematically derived number of clusters, such that differences within clusters are minimized and differences between clusters are maximized. This clustering method is also referred to as "admixture inference," as individual genomes (or individuals within populations) can be characterized by the proportions of alleles linked to each cluster.
In other words, algorithms like STRUCTURE generate results that assume the existence of discrete ancestral populations, operationalized through unique genetic markers, which have combined over time to form the admixed populations of the modern day.
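The idea of fitting genotypes to a fixed number of clusters defined by allele frequencies can be sketched with a small toy model. The code below is not STRUCTURE, ADMIXTURE, or HAPMIX: it is a simplified expectation-maximization fit of a "no admixture" mixture model, in which each simulated individual is assumed to come from exactly one of `k` clusters and each genotype is treated as a binomial draw from that cluster's allele frequency. The function names, the simulated data, and all parameter values are assumptions made for illustration, and the two simulated groups are far more differentiated than real human populations.

```python
import numpy as np


def simulate_two_groups(n_per_group=100, n_markers=100, seed=0):
    """Simulate allele counts (0, 1, or 2) for two hypothetical source groups."""
    rng = np.random.default_rng(seed)
    # Independent allele frequencies per group: an exaggerated contrast,
    # chosen only so the toy example is clear-cut.
    freqs = rng.uniform(0.1, 0.9, size=(2, n_markers))
    genotypes = np.vstack(
        [rng.binomial(2, freqs[pop], size=(n_per_group, n_markers)) for pop in range(2)]
    )
    labels = np.repeat([0, 1], n_per_group)
    return genotypes, labels


def fit_cluster_model(genotypes, k, n_iter=100, seed=1):
    """EM for a k-cluster mixture with binomial genotype likelihoods.

    Returns (resp, freqs): resp[i, c] is the posterior probability that
    individual i belongs to cluster c (a membership probability, not an
    admixture proportion); freqs[c, l] is cluster c's estimated allele
    frequency at marker l.
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    rng = np.random.default_rng(seed)
    resp = rng.dirichlet(np.ones(k), size=n)  # random soft starting assignment
    for _ in range(n_iter):
        # M-step: cluster weights and allele frequencies (with a pseudocount
        # so that frequencies stay strictly between 0 and 1).
        weights = resp.sum(axis=0)
        mix = (weights + 1e-9) / (n + k * 1e-9)
        freqs = (resp.T @ g + 1.0) / (2.0 * weights[:, None] + 2.0)
        # E-step: binomial log-likelihood of each individual under each cluster
        # (the binomial coefficient is constant across clusters and is dropped).
        log_lik = g @ np.log(freqs).T + (2.0 - g) @ np.log1p(-freqs).T
        log_post = log_lik + np.log(mix)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, freqs


genotypes, labels = simulate_two_groups()
resp, freqs = fit_cluster_model(genotypes, k=2)
assigned = resp.argmax(axis=1)
# Cluster labels are arbitrary, so compare up to a swap of the two labels.
agreement = max(np.mean(assigned == labels), np.mean(assigned != labels))
print(f"agreement with simulated groups: {agreement:.2f}")
```

Note that `k` is supplied by the user rather than learned from the data, so the fit always returns `k` clusters whether or not the data contain that many distinct groups.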
Multidimensional summary statistics
Where model-based clustering characterizes populations using proportions of presupposed ancestral clusters, multidimensional summary statistics characterize populations on a continuous spectrum. The most common multidimensional statistical method used for genetic clustering is principal component analysis (PCA), which plots individuals by two or more axes (their "principal components") that represent aggregations of genetic markers that account for the highest variance. Clusters can then be identified by visually assessing the distribution of data; with larger samples of human genotypes, data tends to cluster in distinct groups as well as admixed positions between groups.
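As a rough illustration of the PCA approach described above, the sketch below centers a matrix of allele counts and projects individuals onto its leading principal components using a singular value decomposition. The simulated genotypes, function names, and parameter values are hypothetical; published analyses typically also scale markers by allele frequency and filter for linkage, which this minimal version omits.

```python
import numpy as np


def simulate_two_groups(n_per_group=50, n_markers=1000, seed=0):
    """Simulate allele counts (0, 1, or 2) for two hypothetical groups."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_markers)
    # Second group's frequencies are a perturbed copy of the first
    # (an exaggerated difference, for illustration only).
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_markers), 0.05, 0.95)
    geno_a = rng.binomial(2, freq_a, size=(n_per_group, n_markers))
    geno_b = rng.binomial(2, freq_b, size=(n_per_group, n_markers))
    labels = np.repeat([0, 1], n_per_group)
    return np.vstack([geno_a, geno_b]), labels


def genotype_pca(genotypes, n_components=2):
    """Return PC scores per individual and the variance share of each PC."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # center each marker at its mean count
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    variance = s ** 2
    ratio = variance[:n_components] / variance.sum()
    return scores, ratio


genotypes, labels = simulate_two_groups()
scores, ratio = genotype_pca(genotypes)
for group in (0, 1):
    mean_pc1 = scores[labels == group, 0].mean()
    print(f"group {group}: mean PC1 score = {mean_pc1:.2f}")
print("variance explained by PC1, PC2:", np.round(ratio, 3))
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the familiar scatter plot in which clusters are judged by eye; the method itself assigns no individual to any group.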
Caveats and limitations
There are caveats and limitations to genetic clustering methods of any type, given the degree of admixture and relative similarity within the human population. All genetic cluster findings are biased by the sampling process used to gather data, and by the quality and quantity of that data. For example, many clustering studies use data derived from populations that are geographically distinct and far apart from one another, which may present an illusion of discrete clusters where, in reality, populations are much more blended with one another when intermediary groups are included.
Sample size also plays an important moderating role in cluster findings, as different sample sizes can influence cluster assignment, and more subtle relationships between genotypes may only emerge with larger sample sizes.
In particular, the use of STRUCTURE has been widely criticized as potentially misleading because it requires data to be sorted into a predetermined number of clusters, which may or may not reflect the actual population's distribution.
The creators of STRUCTURE originally described the algorithm as an "exploratory" method to be interpreted with caution and not as a test with statistically significant power.
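The sampling caveat described above can be illustrated with a small simulation. In the sketch below, allele frequencies change smoothly along a hypothetical one-dimensional geographic cline, so the underlying population has no discrete groups by construction. Individuals are then sampled either evenly along the cline or only from its two ends, and the largest gap between neighbouring individuals on the first principal component is compared. The cline model, the function names, and every parameter value are assumptions chosen for illustration rather than estimates from real data.

```python
import numpy as np


def simulate_cline(locations, n_markers=500, seed=0):
    """Simulate allele counts for individuals at positions in [0, 1].

    Each marker's allele frequency rises or falls linearly with location,
    ranging between 0.2 and 0.8 across the whole cline.
    """
    rng = np.random.default_rng(seed)
    direction = rng.choice([-1.0, 1.0], size=n_markers)
    x = np.asarray(locations, dtype=float)[:, None]
    freqs = 0.5 + 0.6 * direction * (x - 0.5)
    return rng.binomial(2, freqs)


def largest_relative_gap(genotypes):
    """Largest gap between neighbouring PC1 scores, as a share of the PC1 range."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = np.sort(u[:, 0] * s[0])
    return np.diff(pc1).max() / (pc1[-1] - pc1[0])


rng = np.random.default_rng(1)
# Scheme 1: individuals sampled evenly along the whole cline.
uniform_x = rng.uniform(0.0, 1.0, size=200)
# Scheme 2: individuals sampled only from the two geographic extremes.
endpoint_x = np.concatenate(
    [rng.uniform(0.0, 0.1, size=100), rng.uniform(0.9, 1.0, size=100)]
)

gap_uniform = largest_relative_gap(simulate_cline(uniform_x))
gap_endpoints = largest_relative_gap(simulate_cline(endpoint_x))
print(f"largest PC1 gap, even sampling:      {gap_uniform:.2f} of range")
print(f"largest PC1 gap, endpoints sampling: {gap_endpoints:.2f} of range")
```

Both samples are drawn from the same continuous model; only the sampling scheme differs, yet the endpoint sample shows a wide empty stretch on PC1 that could be read as a boundary between two discrete clusters.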
Notable applications to human genetic data
Modern applications of genetic clustering methods to global-scale genetic data were first marked by studies associated with the Human Genome Diversity Project (HGDP) data. These early HGDP studies, such as those by Rosenberg et al. (2002), contributed to theories of the serial founder effect and early human migration out of Africa, and clustering methods have been notably applied to describe admixed continental populations. Genetic clustering and HGDP studies have also contributed to methods for, and criticisms of, the genetic ancestry consumer testing industry.
A number of landmark genetic cluster studies have been conducted on global human populations since 2002.
Genetic clustering and race
Clusters of individuals are often geographically structured. For example, when clustering a population of East Asians and Europeans, each group will likely form its own respective cluster based on similar allele frequencies. In this way, clusters can have a correlation with traditional concepts of race and self-identified ancestry; in some cases, such as medical questionnaires, the latter variables can be used as a proxy for genetic ancestry where genetic data is unavailable.
However, genetic variation is distributed in a complex, continuous, and overlapping manner, so this correlation is imperfect and the use of racial categories in medicine can introduce additional hazards.
Some scholars have challenged the idea that race can be inferred by genetic clusters, drawing distinctions between arbitrarily assigned genetic clusters, ancestry, and race. One recurring caution against thinking of human populations in terms of clusters is the notion that genotypic variation and traits are distributed evenly between populations, along gradual clines rather than along discrete population boundaries; so although genetic similarities are usually organized geographically, their underlying populations have never been completely separated from one another. Due to migration, gene flow, and baseline homogeneity, features between groups are extensively overlapping and intermixed.
Moreover, genetic clusters do not typically match socially defined racial groups; many commonly understood races may not be sorted into the same genetic cluster, and many genetic clusters are made up of individuals who would have distinct racial identities.
In general, clusters may most simply be understood as products of the methods used to sample and analyze genetic data. They are not without meaning for understanding ancestry and genetic characteristics, but they are inadequate to fully explain the concept of race, which is more often described in terms of social and cultural forces.
In the related context of personalized medicine, race is currently listed as a risk factor for a wide range of medical conditions with genetic and non-genetic causes. Questions have emerged regarding whether or not genetic clusters support the idea of race as a valid construct to apply to medical research and treatment of disease, because there are many diseases that correspond with specific genetic markers and/or with specific populations, as seen with Tay-Sachs disease or sickle cell disease.
Researchers are careful to emphasize that ancestry—revealed in part through cluster analyses—plays an important role in understanding risk of disease. But racial or ethnic identity does not perfectly align with genetic ancestry, and so race and ethnicity do not reveal enough information to make a medical diagnosis.
Race as a variable in medicine is more likely to reflect social factors, whereas ancestry information is more likely to be meaningful when considering genetic factors.