In
regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
, a dummy variable (also known as indicator variable or just dummy) is one that takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. For example, if we were studying the relationship between gender and income, we could use a dummy variable to represent the gender of each individual in the study. The variable would take on a value of 1 for males and 0 for females.
Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
As with any addition of variables to a model, the addition of dummy variables will increases the within-sample model fit (
coefficient of determination), but at a cost of fewer
degrees of freedom
Degrees of freedom (often abbreviated df or DOF) refers to the number of independent variables or parameters of a thermodynamic system. In various scientific fields, the word "freedom" is used to describe the limits to which physical movement or ...
and loss of generality of the model (out of sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.
Dummy variables are useful in various cases. For example, in
econometric time series analysis
In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. ...
, dummy variables may be used to indicate the occurrence of wars, or major
strikes
Strike may refer to:
People
*Strike (surname)
Physical confrontation or removal
*Strike (attack), attack with an inanimate object or a part of the human body intended to cause harm
*Airstrike, military strike by air forces on either a suspected ...
. It could thus be thought of as a
truth value
In logic and mathematics, a truth value, sometimes called a logical value, is a value indicating the relation of a proposition to truth, which in classical logic has only two possible values ('' true'' or ''false'').
Computing
In some prog ...
represented as a numerical value 0 or 1 (as is sometimes done in computer programming).
Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1=1 if the observation is for summer, and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and only if winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the
panel data fixed effects estimator dummies are created for each of the units in
cross-sectional data (e.g. firms or countries) or periods in a
pooled time-series. However in such regressions either the
constant term has to be removed, or one of the dummies removed making this the base category against which the others are assessed, for the following reason:
If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect
multicollinearity,
so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
See also
*
Binary regression
In statistics, specifically regression analysis, a binary regression estimates a relationship between one or more explanatory variables and a single output binary variable. Generally the probability of the two alternatives is modeled, instead of ...
*
Chow test
*
Hypothesis testing
*
Indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x ...
*
Linear discriminant function
*
Multicollinearity
References
Further reading
*
*
External links
*
*
*
{{DEFAULTSORT:Dummy Variable (Statistics)
Regression variable selection