Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system. The main difference between causal inference and inference of association is that causal inference analyzes the response of an effect variable when a cause of the effect variable is changed. The science of why things occur is called etiology. Causal inference is said to provide the evidence of causality theorized by causal reasoning.

Causal inference is widely studied across all sciences. Several innovations in the development and implementation of methodology designed to determine causality have proliferated in recent decades. Causal inference remains especially difficult where experimentation is difficult or impossible, which is common throughout most sciences.

The approaches to causal inference are broadly applicable across all types of scientific disciplines, and many methods of causal inference that were designed for certain disciplines have found use in other disciplines. This article outlines the basic process behind causal inference and details some of the more conventional tests used across different disciplines; however, this should not be mistaken as a suggestion that these methods apply only to those disciplines, merely that they are the most commonly used in that discipline.

Causal inference is difficult to perform and there is significant debate amongst scientists about the proper way to determine causality. Despite other innovations, there remain concerns of misattribution by scientists of correlative results as causal, of the usage of incorrect methodologies by scientists, and of deliberate manipulation by scientists of analytical results in order to obtain statistically significant estimates. Particular concern is raised in the use of regression models, especially linear regression models.

Definition

Inferring the cause of something has been described as:
* "...reasoning to the conclusion that something is, or is likely to be, the cause of something else".
* "Identification of the cause or causes of a phenomenon, by establishing covariation of cause and effect, a time-order relationship with the cause preceding the effect, and the elimination of plausible alternative causes."

Methodology

General

Causal inference is conducted via the study of systems where the measure of one variable is suspected to affect the measure of another. Causal inference is conducted with regard to the scientific method. The first step of causal inference is to formulate a falsifiable null hypothesis, which is subsequently tested with statistical methods. Frequentist statistical inference is the use of statistical methods to determine the probability that the data occur under the null hypothesis by chance; Bayesian inference is used to determine the effect of an independent variable. Statistical inference is generally used to distinguish variation in the original data that is random from variation that is the effect of a well-specified causal mechanism. Notably, correlation does not imply causation, so the study of causality is as concerned with the study of potential causal mechanisms as it is with variation amongst the data. A frequently sought-after standard of causal inference is an experiment wherein treatment is randomly assigned but all other confounding factors are held constant. Most of the efforts in causal inference are in the attempt to replicate experimental conditions.

Epidemiological studies employ different epidemiological methods of collecting and measuring evidence of risk factors and effect, and different ways of measuring association between the two. A 2020 review of methods for causal inference found that using existing literature for clinical training programs can be challenging. This is because published articles often assume an advanced technical background, they may be written from multiple statistical, epidemiological, computer science, or philosophical perspectives, methodological approaches continue to expand rapidly, and many aspects of causal inference receive limited coverage.

Common frameworks for causal inference include the causal pie model (component-cause), Pearl's structural causal model (causal diagram + do-calculus), structural equation modeling, and the Rubin causal model (potential-outcome), which are often used in areas such as social sciences and epidemiology.

Experimental

Experimental verification of causal mechanisms is possible using experimental methods. The main motivation behind an experiment is to hold other experimental variables constant while purposefully manipulating the variable of interest. If the experiment produces statistically significant effects as a result of only the treatment variable being manipulated, there are grounds to believe that a causal effect can be assigned to the treatment variable, assuming that other standards for experimental design have been met.
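The logic of a randomized experiment can be illustrated with a small simulation. The following is a minimal sketch, not a prescribed procedure: the sample size, the true effect of 2.0, the outcome distribution, and all function names are assumptions made for illustration. It randomly assigns treatment, estimates the effect as a difference in means, and uses a permutation test to ask how often a difference at least that large would arise under the null hypothesis of no effect.

```python
import random
import statistics


def simulate_experiment(n=200, true_effect=2.0, seed=0):
    """Randomly assign half of n units to treatment; return (treated, control) outcomes.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    units = list(range(n))
    rng.shuffle(units)                    # random assignment of treatment
    treated_ids = set(units[: n // 2])
    treated, control = [], []
    for i in range(n):
        baseline = rng.gauss(10.0, 1.0)   # everything other than the treatment
        if i in treated_ids:
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    return treated, control


def difference_in_means(treated, control):
    return statistics.fmean(treated) - statistics.fmean(control)


def permutation_p_value(treated, control, n_perm=1000, seed=1):
    """Share of random relabelings whose absolute difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(difference_in_means(treated, control))
    pooled = treated + control
    k = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(difference_in_means(pooled[:k], pooled[k:])) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


treated, control = simulate_experiment()
estimate = difference_in_means(treated, control)
p_value = permutation_p_value(treated, control)
print(f"estimated effect: {estimate:.2f}, permutation p-value: {p_value:.4f}")
```

Because assignment is random, the baseline outcomes of the two groups differ only by chance, which is why the difference in means can be read as an estimate of the treatment effect in this simulated setting.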

Quasi-experimental

Quasi-experimental verification of causal mechanisms is conducted when traditional experimental methods are unavailable. This may be the result of prohibitive costs of conducting an experiment, or the inherent infeasibility of conducting an experiment, especially experiments that are concerned with large systems such as economies or electoral systems, or for treatments that are considered to present a danger to the well-being of test subjects. Quasi-experiments may also occur where information is withheld for legal reasons.

Approaches in epidemiology

Epidemiology studies patterns of health and disease in defined populations of living beings in order to infer causes and effects. An association between an exposure to a putative risk factor and a disease may be suggestive of, but is not equivalent to, causality because correlation does not imply causation. Historically, Koch's postulates have been used to decide whether a microorganism is the cause of a disease. The Bradford Hill criteria, described in 1965, have been used to assess causality of variables outside microbiology, although even these criteria are not exclusive ways to determine causality.

In molecular epidemiology the phenomena studied are on a molecular biology level, including genetics, where biomarkers are evidence of cause or effects. A recent trend is to identify evidence for influence of the exposure on molecular pathology within diseased tissue or cells, in the emerging interdisciplinary field of molecular pathological epidemiology (MPE). Linking the exposure to molecular pathologic signatures of the disease can help to assess causality. Considering the inherent heterogeneity of a given disease, the unique disease principle, disease phenotyping and subtyping are trends in biomedical and public health sciences, exemplified by personalized medicine and precision medicine.

Approaches in computer science

Determination of cause and effect from joint observational data for two time-independent variables, say X and Y, has been tackled using asymmetry between evidence for some model in the directions X → Y and Y → X. The primary approaches are based on algorithmic information theory models and noise models.

Noise models

Noise-model approaches incorporate an independent noise term in the model and compare the evidence for the two directions. Some of the noise models for the hypothesis X → Y with the noise E are:
* Additive noise: Y = F(X) + E
* Linear noise: Y = pX + qE
* Post-nonlinear: Y = G(F(X) + E)
* Heteroskedastic noise: Y = F(X) + E · G(X)
* Functional noise: Y = F(X, E) (Mooij, Joris M., et al. "Probabilistic latent variable models for distinguishing between cause and effect." NIPS. 2010.)

The common assumptions in these models are:
* There are no other causes of Y.
* X and E have no common causes.
* The distribution of the cause is independent of the causal mechanism.

On an intuitive level, the idea is that the factorization of the joint distribution P(Cause, Effect) into P(Cause) × P(Effect | Cause) typically yields models of lower total complexity than the factorization into P(Effect) × P(Cause | Effect). Although the notion of "complexity" is intuitively appealing, it is not obvious how it should be precisely defined. A different family of methods attempts to discover causal "footprints" from large amounts of labeled data, and allows the prediction of more flexible causal relations.
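The asymmetry that noise models exploit can be sketched for the linear noise model Y = pX + qE. The example below is an illustration under stated assumptions rather than a published algorithm: X and E are drawn from uniform distributions (the asymmetry disappears when both are Gaussian), p = q = 1, and the dependence score, the absolute correlation between squared residuals and squared regressor values, is a crude stand-in for the dedicated independence tests that published methods typically use. A line is fitted in each direction, and the direction whose residuals look more independent of the regressor is preferred.

```python
import random
import statistics


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def corr(a, b):
    return cov(a, b) / (cov(a, a) ** 0.5 * cov(b, b) ** 0.5)


def residual_dependence(cause, effect):
    """Fit effect on cause by least squares, then score residual dependence.

    The score is |corr(residual^2, centered_cause^2)|: near zero when the
    residuals are independent of the candidate cause. This scoring rule is
    an illustrative assumption, not a standard test.
    """
    m_c, m_e = statistics.fmean(cause), statistics.fmean(effect)
    slope = cov(cause, effect) / cov(cause, cause)
    centered = [c - m_c for c in cause]
    residuals = [(e - m_e) - slope * c for c, e in zip(centered, effect)]
    return abs(corr([r * r for r in residuals], [c * c for c in centered]))


rng = random.Random(0)
n = 5000
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # cause, non-Gaussian
e = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # independent noise
y = [xi + ei for xi, ei in zip(x, e)]            # Y = pX + qE with p = q = 1

score_xy = residual_dependence(x, y)   # hypothesis X -> Y
score_yx = residual_dependence(y, x)   # hypothesis Y -> X
inferred = "X -> Y" if score_xy < score_yx else "Y -> X"
print(f"X->Y score: {score_xy:.3f}, Y->X score: {score_yx:.3f}, inferred: {inferred}")
```

In the true direction the residuals are just the noise E, so the score is close to zero; in the reverse direction the residuals are uncorrelated with the regressor but not independent of it, which the squared terms pick up.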

Approaches in social sciences

Social science

The social sciences in general have moved increasingly toward including quantitative frameworks for assessing causality. Much of this has been described as a means of providing greater rigor to social science methodology. Political science was significantly influenced by the publication of Designing Social Inquiry, by Gary King, Robert Keohane, and Sidney Verba, in 1994. King, Keohane, and Verba recommend that researchers apply both quantitative and qualitative methods and adopt the language of statistical inference to be clearer about their subjects of interest and units of analysis. Proponents of quantitative methods have also increasingly adopted the potential outcomes framework, developed by Donald Rubin, as a standard for inferring causality.

While much of the emphasis remains on statistical inference in the potential outcomes framework, social science methodologists have developed new tools to conduct causal inference with both qualitative and quantitative methods, sometimes called a "mixed methods" approach. Advocates of diverse methodological approaches argue that different methodologies are better suited to different subjects of study. Sociologist Herbert Smith and political scientists James Mahoney and Gary Goertz have cited the observation of Paul Holland, a statistician and author of the 1986 article "Statistics and Causal Inference", that statistical inference is most appropriate for assessing the "effects of causes" rather than the "causes of effects". Qualitative methodologists have argued that formalized models of causation, including process tracing and fuzzy set theory, provide opportunities to infer causation through the identification of critical factors within case studies or through a process of comparison among several case studies. These methodologies are also valuable for subjects in which a limited number of potential observations or the presence of confounding variables would limit the applicability of statistical inference.

Economics and political science

In the economic sciences and political sciences, causal inference is often difficult, owing to the real-world complexity of economic and political realities and the inability to recreate many large-scale phenomena within controlled experiments. Causal inference in the economic and political sciences continues to see improvement in methodology and rigor, due to the increased level of technology available to social scientists, the increase in the number of social scientists and research, and improvements to causal inference methodologies throughout social sciences. Despite the difficulties inherent in determining causality in economic systems, several widely employed methods exist throughout those fields.

Theoretical methods

Economists and political scientists can use theory (often studied in theory-driven econometrics) to estimate the magnitude of supposedly causal relationships in cases where they believe a causal relationship exists. Theorists can presuppose a mechanism believed to be causal and describe the effects using data analysis to justify their proposed theory. For example, theorists can use logic to construct a model, such as theorizing that rain causes fluctuations in economic productivity but that the converse is not true. However, using purely theoretical claims that do not offer any predictive insights has been called "pre-scientific" because there is no ability to predict the impact of the supposed causal properties. It is worth reiterating that regression analysis in the social sciences does not inherently imply causality, as many phenomena may correlate in the short run or in particular datasets but demonstrate no correlation in other time periods or other datasets. Thus, the attribution of causality to correlative properties is premature absent a well-defined and reasoned causal mechanism.

Instrumental variables

The instrumental variables (IV) technique is a method of determining causality that involves the elimination of a correlation between one of a model's explanatory variables and the model's error term. When an explanatory variable is correlated with the error term, for example because an unobserved factor drives both the explanatory variable and the outcome, the estimated effect of that variable is biased. The method introduces an instrumental variable that is correlated with the explanatory variable but not with the error term, so that only the variation in the explanatory variable induced by the instrument is used to estimate the effect. Eliminating the correlation in this way reduces the bias in the model as a whole.
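A simulated example can make the role of the instrument concrete. This is a minimal sketch under assumed conditions: the data-generating process, the true effect of 2.0, and the variable names are all hypothetical, and the instrument is valid by construction (it affects the outcome only through the explanatory variable). With a single instrument and a single explanatory variable, the IV estimate reduces to the ratio cov(Z, Y) / cov(Z, X).

```python
import random
import statistics


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
n = 20000
true_effect = 2.0  # hypothetical causal effect of x on y

z = [rng.gauss(0, 1) for _ in range(n)]                   # instrument
u = [rng.gauss(0, 1) for _ in range(n)]                   # unobserved confounder
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]   # explanatory variable
# The confounder u enters the outcome directly, so it sits in the error term
# of a regression of y on x and is correlated with x.
y = [true_effect * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

ols_slope = cov(x, y) / cov(x, x)   # biased by the confounder
iv_slope = cov(z, y) / cov(z, x)    # uses only instrument-induced variation in x

print(f"true effect: {true_effect}, OLS: {ols_slope:.2f}, IV: {iv_slope:.2f}")
```

In this setup the ordinary least squares slope overstates the effect because it absorbs the confounder's influence, while the IV ratio recovers a value close to the true effect. In real data the validity of an instrument cannot be verified this directly and has to be argued on substantive grounds.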

Model specification

Model specification is the act of selecting a model to be used in data analysis. Social scientists (and, indeed, all scientists) must determine the correct model to use because different models are good at estimating different relationships. Model specification can be useful in determining causality that is slow to emerge, where the effects of an action in one period are only felt in a later period. It is worth remembering that correlations only measure whether two variables vary together, not whether they affect one another in a particular direction; thus, one cannot determine the direction of a causal relation based on correlations only. Because causal acts are believed to precede causal effects, social scientists can use a model that looks specifically for the effect of one variable on another over a period of time. This leads to treating variables that represent earlier phenomena as treatments and using econometric tests to look for later changes in the data that can be attributed to them. A meaningful difference in results following a meaningful difference in treatment may indicate causality between the treatment and the measured effects (e.g., Granger-causality tests). Such studies are examples of time-series analysis.
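The idea that causes precede effects can be illustrated with lagged correlations in a simulated time series. The sketch below is a simplification and not a full Granger-causality test, which would also condition on the outcome's own past values; the series, the one-period lag, and the coefficient 0.8 are assumptions chosen for illustration.

```python
import random
import statistics


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
# y at time t depends on x at time t-1 (hypothetical mechanism), plus noise.
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.5) for t in range(1, n)]

x_leads_y = corr(x[:-1], y[1:])   # does past x line up with later y?
y_leads_x = corr(y[:-1], x[1:])   # does past y line up with later x?

print(f"corr(x[t-1], y[t]) = {x_leads_y:.2f}, corr(y[t-1], x[t]) = {y_leads_x:.2f}")
```

Past values of x are strongly associated with later values of y, while past values of y carry essentially no information about later x, which is the temporal asymmetry that lag-based model specifications look for. Temporal precedence alone does not rule out a common cause acting on both series with different delays.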

Sensitivity analysis

Other variables, or regressors in regression analysis, are either included or not included across various implementations of the same model to ensure that different sources of variation can be studied separately from one another. This is a form of sensitivity analysis: it is the study of how sensitive an implementation of a model is to the addition of one or more new variables. A chief motivating concern in the use of sensitivity analysis is the discovery of confounding variables. Confounding variables are variables that have a large impact on the results of a statistical test but are not the variable that causal inference is trying to study. Confounding variables may cause a regressor to appear to be significant in one implementation, but not in another.
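The effect of including or omitting a confounder can be shown by fitting the same simulated data under two specifications. This is a minimal sketch with assumed ingredients: the data-generating process is hypothetical (x has no effect on y at all, and both are driven by a confounder c), and the two-regressor slopes are computed from the closed-form solution of the normal equations rather than with a statistics library.

```python
import random
import statistics


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def two_regressor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (with intercept), via the normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


rng = random.Random(0)
n = 10000
c = [rng.gauss(0, 1) for _ in range(n)]          # confounder
x = [ci + rng.gauss(0, 1) for ci in c]           # driven by c
y = [2.0 * ci + rng.gauss(0, 1) for ci in c]     # driven by c only, not by x

naive_slope = cov(x, y) / cov(x, x)                           # specification without c
adjusted_x_slope, c_slope = two_regressor_slopes(x, c, y)     # specification with c

print(f"slope of x without c: {naive_slope:.2f}")
print(f"slope of x with c:    {adjusted_x_slope:.2f} (slope of c: {c_slope:.2f})")
```

The regressor x appears to matter when c is omitted and its coefficient collapses toward zero once c is included, which is the kind of instability across specifications that sensitivity analysis is meant to surface.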

Multicollinearity

Another reason for the use of sensitivity analysis is to detect multicollinearity. Multicollinearity is the phenomenon where the correlation between two or more explanatory variables is very high. A high level of correlation between two variables can dramatically affect the outcome of a statistical analysis, where small variations in highly correlated data can flip the effect of a variable from a positive direction to a negative direction, or vice versa. This is an inherent property of variance testing. Determining multicollinearity is useful in sensitivity analysis because the elimination of highly correlated variables in different model implementations can prevent the dramatic changes in results that result from the inclusion of such variables.

However, there are limits to the ability of sensitivity analysis to prevent the deleterious effects of multicollinearity, especially in the social sciences, where systems are complex. Because it is theoretically impossible to include or even measure all of the confounding factors in a sufficiently complex system, econometric models are susceptible to the common-cause fallacy, where causal effects are incorrectly attributed to the wrong variable because the correct variable was not captured in the original data. This is an example of the failure to account for a lurking variable.
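One common way to quantify multicollinearity, not described in the text above and introduced here only as an illustration, is the variance inflation factor (VIF). For a model with exactly two predictors it reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below uses simulated predictors; the noise scale of 0.1 and the variable names are assumptions.

```python
import random
import statistics


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = corr(x1, x2)
    return 1.0 / (1.0 - r * r)


rng = random.Random(0)
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2_collinear = [v + 0.1 * rng.gauss(0, 1) for v in x1]   # nearly a copy of x1
x2_independent = [rng.gauss(0, 1) for _ in range(n)]     # unrelated to x1

vif_collinear = vif_two_predictors(x1, x2_collinear)
vif_independent = vif_two_predictors(x1, x2_independent)

print(f"VIF with near-duplicate predictor: {vif_collinear:.1f}")
print(f"VIF with independent predictor:    {vif_independent:.3f}")
```

A VIF near 1 means the predictor's coefficient variance is not inflated by the other predictor; a very large value signals that the two predictors carry almost the same information, so their individual coefficients are poorly determined.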

Design-based econometrics

Recently, improved methodology in design-based econometrics has popularized the use of both natural experiments and quasi-experimental research designs to study the causal mechanisms that such experiments are believed to identify.

Malpractice in causal inference

Despite the advancements in the development of methodologies used to determine causality, significant weaknesses in determining causality remain. These weaknesses can be attributed both to the inherent difficulty of determining causal relations in complex systems and to cases of scientific malpractice. Separate from the difficulties of causal inference, the perception that large numbers of scholars in the social sciences engage in non-scientific methodology exists among some large groups of social scientists. Criticisms of economists and social scientists for passing off descriptive studies as causal studies are rife within those fields.

Scientific malpractice and flawed methodology

In the sciences, especially in the social sciences, there is concern among scholars that scientific malpractice is widespread. As scientific study is a broad topic, there are theoretically limitless ways to have a causal inference undermined through no fault of a researcher. Nonetheless, there remain concerns among scientists that large numbers of researchers do not perform basic duties or practice sufficiently diverse methods in causal inference.

One prominent example of common non-causal methodology is the erroneous assumption of correlative properties as causal properties. There is no inherent causality in phenomena that correlate. Regression models are designed to measure variance within data relative to a theoretical model: there is nothing to suggest that data that present high levels of covariance have any meaningful relationship (absent a proposed causal mechanism with predictive properties or a random assignment of treatment). The use of flawed methodology has been claimed to be widespread, with common examples of such malpractice being the overuse of correlative models, especially the overuse of regression models and particularly linear regression models. The presupposition that two correlated phenomena are inherently related is a logical fallacy known as spurious correlation. Some social scientists claim that widespread use of methodology that attributes causality to spurious correlations has been detrimental to the integrity of the social sciences, although improvements stemming from better methodologies have been noted.

A potential effect of scientific studies that erroneously conflate correlation with causality is an increase in the number of scientific findings whose results are not reproducible by third parties. Such non-reproducibility is a logical consequence of findings that correlate only temporarily being overgeneralized into mechanisms that have no inherent relationship, where new data do not contain the previous, idiosyncratic correlations of the original data. Debates over the effect of malpractice versus the effect of the inherent difficulties of searching for causality are ongoing. Critics of widely practiced methodologies argue that researchers have engaged in statistical manipulation in order to publish articles that supposedly demonstrate evidence of causality but are actually examples of spurious correlation being touted as evidence of causality: such endeavors may be referred to as p-hacking. To prevent this, some have advocated that researchers preregister their research designs prior to conducting their studies so that they do not inadvertently overemphasize a nonreproducible finding that was not the initial subject of inquiry but was found to be statistically significant during data analysis.
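The arithmetic behind the concern about unplanned analyses can be worked through directly. The following sketch assumes a hypothetical study that runs 20 independent tests at a significance threshold of 0.05 when no real effect exists anywhere; both numbers are illustrative. Under a true null hypothesis a p-value from a continuous test is uniformly distributed on [0, 1], so the chance that at least one test looks significant is 1 - (1 - 0.05)^20.

```python
import random

alpha = 0.05    # significance threshold (assumed)
n_tests = 20    # number of independent tests per study (assumed)

# Probability that at least one of n_tests null results falls below alpha.
analytic = 1 - (1 - alpha) ** n_tests

# Check the formula by simulating many studies in which every null is true.
rng = random.Random(0)
n_studies = 10000
studies_with_false_positive = 0
for _ in range(n_studies):
    p_values = [rng.random() for _ in range(n_tests)]   # uniform under the null
    if min(p_values) < alpha:
        studies_with_false_positive += 1
simulated = studies_with_false_positive / n_studies

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Under these assumptions roughly two in three such studies would contain at least one nominally significant result by chance alone, which is why reporting only the significant test, without stating how many were run, can make a spurious correlation look like evidence. Preregistration fixes the tests in advance so that this selection cannot happen silently.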
