Selection bias is the

bias Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group ...

introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, thereby failing to ensure that the sample obtained is representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a

statistical analysis Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers propertie ...

, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may be false.

Types

Sampling bias

Sampling bias In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample of a population (or non-human f ...

is systematic error due to a non- random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a

biased sample In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample of a population (or non-human fa ...

, defined as a

statistical sample In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attem ...

of a

population Population typically refers to the number of people in a single area, whether it be a city or town, region, country, continent, or the world. Governments typically quantify the size of the resident population within their jurisdiction usi ...

(or non-human factors) in which all participants are not equally balanced or objectively represented. It is mostly classified as a subtype of selection bias, sometimes specifically termed ''sample selection bias'', but some classify it as a separate type of bias. A distinction of sampling bias (albeit not a universally accepted one) is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses

internal validity Internal validity is the extent to which a piece of evidence supports a claim about cause and effect, within the context of a particular study. It is one of the most important properties of scientific studies and is an important concept in reasoni ...

for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias. Examples of sampling bias include self-selection, pre-screening of trial participants, discounting trial subjects/tests that did not run to completion and migration bias by excluding subjects who have recently moved into or out of the study area, length-time bias, where slowly developing disease with better prognosis is detected, and lead time bias, where disease is diagnosed earlier participants than in comparison populations, although the average course of disease is the same.

Time interval

* Early termination of a trial at a time when its results support the desired conclusion. * A trial may be terminated early at an extreme value (often for

ethical Ethics or moral philosophy is a branch of philosophy that "involves systematizing, defending, and recommending concepts of right and wrong behavior".''Internet Encyclopedia of Philosophy'' The field of ethics, along with aesthetics, concerns ma ...

reasons), but the extreme value is likely to be reached by the variable with the largest

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbe ...

, even if all variables have a similar

mean There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value ( magnitude and sign) of a given data set. For a data set, the '' ar ...

Exposure

* ''Susceptibility bias'' ** ''Clinical susceptibility bias'', when one disease predisposes for a second disease, and the treatment for the first disease erroneously appears to predispose to the second disease. For example, postmenopausal syndrome gives a higher likelihood of also developing

endometrial cancer Endometrial cancer is a cancer that arises from the endometrium (the lining of the uterus or womb). It is the result of the abnormal growth of cells that have the ability to invade or spread to other parts of the body. The first sign is most ...

, so estrogens given for the postmenopausal syndrome may receive a higher than actual blame for causing endometrial cancer. ** ''Protopathic bias'', when a treatment for the first symptoms of a disease or other outcome appear to cause the outcome. It is a potential bias when there is a lag time from the first symptoms and start of treatment before actual diagnosis. It can be mitigated by lagging, that is, exclusion of exposures that occurred in a certain time period before diagnosis. ** ''Indication bias'', a potential mixup between cause and effect when exposure is dependent on indication, e.g. a treatment is given to people in high risk of acquiring a disease, potentially causing a preponderance of treated people among those acquiring the disease. This may cause an erroneous appearance of the treatment being a cause of the disease.

Data

* Partitioning (dividing) data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions. * Post hoc alteration of data inclusion based on arbitrary or subjective reasons, including: **

Cherry picking Cherry picking, suppressing evidence, or the fallacy of incomplete evidence is the act of pointing to individual cases or data that seem to confirm a particular position while ignoring a significant portion of related and similar cases or data th ...

, which actually is not selection bias, but

confirmation bias Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one's prior beliefs or values. People display this bias when they select information that supports their views, ignoring ...

, when specific subsets of data are chosen to support a conclusion (e.g. citing examples of plane crashes as evidence of airline flight being unsafe, while ignoring the far more common example of flights that complete safely. See:

Availability heuristic The availability heuristic, also known as availability bias, is a mental shortcut that relies on immediate examples that come to a given person's mind when evaluating a specific topic, concept, method, or decision. This heuristic, operating on the ...

) **Rejection of bad data on (1) arbitrary grounds, instead of according to previously stated or generally agreed criteria or (2) discarding "

outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...

s" on statistical grounds that fail to take into account important information that could be derived from "wild" observations.

Studies

* Selection of which studies to include in a

meta-analysis A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analyses can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting m ...

(see also

combinatorial meta-analysis Combinatorial meta-analysis (CMA) is the study of the behaviour of statistical properties of combinations of studies from a meta-analytic dataset (typically in social science research). In an article that develops the notion of "gravity" in the co ...

). * Performing repeated experiments and reporting only the most favorable results, perhaps relabelling lab records of other experiments as "calibration tests", "instrumentation errors" or "preliminary surveys". * Presenting the most significant result of a data dredge as if it were a single experiment (which is logically the same as the previous item, but is seen as much less dishonest).

Attrition

''Attrition bias'' is a kind of selection bias caused by attrition (loss of participants), discounting trial subjects/tests that did not run to completion. It is closely related to the

survivorship bias Survivorship bias or survival bias is the logical error of concentrating on entities that passed a selection process while overlooking those that did not. This can lead to incorrect conclusions because of incomplete data. Survivorship bias is ...

, where only the subjects that "survived" a process are included in the analysis or the

failure bias Failure bias is the logical error of concentrating on the people or things that failed to make it past some selection process and overlooking those that did, typically because of their lack of visibility. This can lead to false conclusions in seve ...

, where only the subjects that "failed" a process are included. It includes ''dropout'', ''nonresponse'' (lower response rate), ''withdrawal'' and ''protocol deviators''. It gives biased results where it is unequal in regard to exposure and/or outcome. For example, in a test of a dieting program, the researcher may simply reject everyone who drops out of the trial, but most of those who drop out are those for whom it was not working. Different loss of subjects in intervention and comparison group may change the characteristics of these groups and outcomes irrespective of the studied intervention.

Lost to follow-up In the clinical research trial industry, loss to follow-up refers to patients who at one point in time were actively participating in a clinical research trial, but have become lost (either by error in a computer tracking system or by being unrea ...

, is another form of Attrition bias, mainly occurring in medicinal studies over a lengthy time period. Non-Response or Retention bias can be influenced by a number of both tangible and intangible factors, such as; wealth, education, altruism, initial understanding of the study and its requirements. Researchers may also be incapable of conducting follow-up contact resulting from inadequate identifying information and contact details collected during the initial recruitment and research phase.

Observer selection

Philosopher

Nick Bostrom Nick Bostrom ( ; sv, Niklas Boström ; born 10 March 1973) is a Swedish-born philosopher at the University of Oxford known for his work on existential risk, the anthropic principle, human enhancement ethics, superintelligence risks, and the ...

has argued that data are filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study. In situations where the existence of the observer or the study is correlated with the data, observation selection effects occur, and anthropic reasoning is required. An example is the past

impact event An impact event is a collision between astronomical objects causing measurable effects. Impact events have physical consequences and have been found to regularly occur in planetary systems, though the most frequent involve asteroids, comets or ...

record of Earth: if large impacts cause mass extinctions and ecological disruptions precluding the evolution of intelligent observers for long periods, no one will observe any evidence of large impacts in the recent past (since they would have prevented intelligent observers from evolving). Hence there is a potential bias in the impact record of Earth. Astronomical existential risks might similarly be underestimated due to selection bias, and an anthropic correction has to be introduced.

Volunteer bias

Self-selection bias or a volunteer bias in studies offer further threats to the validity of a study as these participants may have intrinsically different characteristics from the target population of the study. Studies have shown that volunteers tend to come from a higher social standing than from a lower socio-economic background. Furthermore, another study shows that women are more probable to volunteer for studies than males. Volunteer bias is evident throughout the study life-cycle, from recruitment to follow-ups. More generally speaking volunteer response can be put down to individual altruism, a desire for approval, personal relation to the study topic and other reasons. As with most instances mitigation in the case of volunteer bias is an increased sample size.

Mitigation

In the general case, selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases. An assessment of the degree of selection bias can be made by examining correlations between

exogenous In a variety of contexts, exogeny or exogeneity () is the fact of an action or object originating externally. It contrasts with endogeneity or endogeny, the fact of being influenced within a system. Economics In an economic model, an exogen ...

(background) variables and a treatment indicator. However, in regression models, it is correlation between ''unobserved'' determinants of the outcome and ''unobserved'' determinants of selection into the sample which bias estimates, and this correlation between unobservables cannot be directly assessed by the observed determinants of treatment. When data are selected for fitting or forecast purposes, a coalitional game can be set up so that a fitting or forecast accuracy function can be defined on all subsets of the data variables.

Related issues

Selection bias is closely related to: *

publication bias In published academic research, publication bias occurs when the outcome of an experiment or research study biases the decision to publish or otherwise distribute it. Publishing only results that show a significant finding disturbs the balance o ...

reporting bias In epidemiology, reporting bias is defined as "selective revealing or suppression of information" by subjects (for example about past medical history, smoking, sexual experiences). In artificial intelligence research, the term reporting bias is ...

, the distortion produced in community perception or

meta-analyses A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analyses can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting m ...

by not publishing uninteresting (usually negative) results, or results which go against the experimenter's prejudices, a sponsor's interests, or community expectations. *

, the general tendency of humans to give more attention to whatever confirms our pre-existing perspective; or specifically in experimental science, the distortion produced by experiments that are designed to seek confirmatory evidence instead of trying to disprove the hypothesis. * exclusion bias, results from applying different criteria to cases and controls in regards to participation eligibility for a study/different variables serving as basis for exclusion.

References

{{Medical research studies Sampling (statistics) Experimental bias Scientific method Causal inference