In
statistics, errors-in-variables models or measurement error models are
regression models that account for
measurement errors in the
independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the
dependent variables, or responses.
In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to
inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For
simple linear regression the effect is an underestimate of the coefficient, known as the ''
attenuation bias''. In
non-linear models the direction of the bias is likely to be more complicated.
Motivating example
Consider a simple linear regression model of the form

:<math>y_t = \alpha + \beta x_t^* + \varepsilon_t\,, \qquad t = 1, \ldots, T,</math>

where <math>x_t^*</math> denotes the ''true'' but unobserved regressor. Instead we observe this value with an error:

:<math>x_t = x_t^* + \eta_t\,,</math>

where the measurement error <math>\eta_t</math> is assumed to be independent of the true value <math>x_t^*</math>.
If the <math>y_t</math>′s are simply regressed on the <math>x_t</math>′s (see simple linear regression), then the estimator for the slope coefficient is

:<math>\hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^T (x_t - \bar x)(y_t - \bar y)}{\tfrac{1}{T}\sum_{t=1}^T (x_t - \bar x)^2}\,,</math>

which converges as the sample size <math>T</math> increases without bound:

:<math>\hat\beta\ \xrightarrow{p}\ \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]} = \frac{\beta \sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta / \sigma^2_{x^*}}\,.</math>

Variances are non-negative, so that in the limit the estimate is smaller in magnitude than the true value of <math>\beta</math>, an effect which statisticians call ''attenuation'' or regression dilution.
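The limit can be verified with a short variance and covariance computation; a sketch of the derivation, under the independence assumptions stated above:

```latex
% Attenuation factor, assuming \eta_t is independent of x_t^*
% and \varepsilon_t is uncorrelated with both x_t^* and \eta_t.
\begin{align*}
\operatorname{Cov}[x_t, y_t]
  &= \operatorname{Cov}[\,x_t^* + \eta_t,\ \alpha + \beta x_t^* + \varepsilon_t\,]
   = \beta \operatorname{Var}[x_t^*] = \beta \sigma^2_{x^*}, \\
\operatorname{Var}[x_t]
  &= \operatorname{Var}[x_t^*] + \operatorname{Var}[\eta_t]
   = \sigma^2_{x^*} + \sigma^2_\eta, \\
\operatorname{plim}\,\hat\beta
  &= \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]}
   = \frac{\beta \sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta}
   \;<\; \beta \quad \text{whenever } \sigma^2_\eta > 0.
\end{align*}
```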
Thus the ‘naïve’ least squares estimator is inconsistent in this setting. However, the estimator is a consistent estimator of the parameter required for a best linear predictor of <math>y</math> given the observed <math>x</math>: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient <math>\beta</math>, although that would assume that the variance of the errors in observing <math>x^*</math> remains fixed. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the <math>y_t</math>′s to the actually observed <math>x_t</math>′s, in a simple linear regression, is given by

:<math>\beta_x = \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]}\,.</math>

It is this coefficient, rather than <math>\beta</math>, that would be required for constructing a predictor of <math>y</math> based on an observed <math>x</math> which is subject to noise.
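A small simulation makes the attenuation concrete; a minimal sketch in Python (NumPy only), with all variances chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000
alpha, beta = 1.0, 2.0
sigma_xstar, sigma_eta, sigma_eps = 1.0, 0.5, 0.3

x_star = rng.normal(0.0, sigma_xstar, T)   # true but unobserved regressor
eta = rng.normal(0.0, sigma_eta, T)        # classical measurement error
x = x_star + eta                           # observed regressor
y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, T)

# naive OLS slope of y on the observed x
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# theoretical limit: beta * sigma_x*^2 / (sigma_x*^2 + sigma_eta^2)
attenuated = beta * sigma_xstar**2 / (sigma_xstar**2 + sigma_eta**2)
print(round(beta_hat, 3), attenuated)
```

With these values the naive slope converges to <math>2 \times 1/(1 + 0.25) = 1.6</math> rather than the true <math>\beta = 2</math>, yet it is exactly the coefficient one would want for predicting <math>y</math> from a noisy <math>x</math>.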
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous).
Jerry Hausman sees this as an ''iron law of econometrics'': "The magnitude of the estimate is usually smaller than expected."
Specification
Usually measurement error models are described using the latent variables approach. If <math>y</math> is the response variable and <math>x</math> are observed values of the regressors, then it is assumed there exist some latent variables <math>y^*</math> and <math>x^*</math> which follow the model's "true" functional relationship <math>g(\cdot)</math>, and such that the observed quantities are their noisy observations:

:<math>\begin{cases} y^*_t = g(x^*_t, w_t \,|\, \theta), \\ y_t = y^*_t + \varepsilon_t, \\ x_t = x^*_t + \eta_t, \end{cases}</math>

where <math>\theta</math> is the model's parameter and <math>w_t</math> are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the <math>\eta_t</math>'s are zero.
The variables <math>y_t</math>, <math>x_t</math>, <math>w_t</math> are all ''observed'', meaning that the statistician possesses a data set of <math>n</math> statistical units <math>\{ y_t, x_t, w_t \}_{t=1,\ldots,n}</math> which follow the data generating process described above; the latent variables <math>x^*_t</math>, <math>y^*_t</math>, <math>\varepsilon_t</math>, and <math>\eta_t</math> are not observed, however.
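To make the roles of the observed and latent quantities explicit, a minimal data-generating sketch in Python, assuming a linear <math>g</math>, one error-free regressor (the constant), and classical errors; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
theta = np.array([1.0, 2.0])           # model parameters: intercept and slope

def g(x_star, w, theta):
    # "true" functional relationship; linear here for illustration
    return theta[0] * w + theta[1] * x_star

x_star = rng.normal(size=n)            # latent regressor
w = np.ones(n)                         # error-free regressor (the constant)
y_star = g(x_star, w, theta)           # latent response

eps = rng.normal(scale=0.3, size=n)    # error in the response
eta = rng.normal(scale=0.5, size=n)    # measurement error in the regressor
y = y_star + eps                       # observed response
x = x_star + eta                       # observed regressor
# the statistician observes only (y, x, w);
# (y_star, x_star, eps, eta) remain latent
```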
This specification does not encompass all the existing errors-in-variables models. For example, in some of them the function <math>g(\cdot)</math> may be non-parametric or semi-parametric. Other approaches model the relationship between <math>y^*</math> and <math>x^*</math> as distributional instead of functional; that is, they assume that <math>y^*_t</math> conditionally on <math>x^*_t</math> follows a certain (usually parametric) distribution.
Terminology and assumptions
* The observed variable <math>x</math> may be called the ''manifest'', ''indicator'', or ''proxy'' variable.
* The unobserved variable <math>x^*</math> may be called the ''latent'' or ''true'' variable. It may be regarded either as an unknown constant (in which case the model is called a ''functional model''), or as a random variable (correspondingly a ''structural model'').
* The relationship between the measurement error <math>\eta</math> and the latent variable <math>x^*</math> can be modeled in different ways:
** ''Classical errors'': <math>\eta \perp x^*</math>, the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and that their magnitude does not depend on the value being measured.
** ''Mean-independence'': <math>\operatorname{E}[\eta \mid x^*] = 0</math>, the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
** ''Berkson's errors'': <math>\eta \perp x</math>, the errors are independent of the ''observed'' regressor <math>x</math>. This assumption has very limited applicability. One example is round-off errors: if a person's age is a continuous random variable <math>x^*</math>, whereas the observed age <math>x</math> is truncated to the next smallest integer, then the truncation error is approximately independent of the observed age. Another possibility is the fixed-design experiment: if a scientist decides to make a measurement at a certain predetermined moment of time <math>x</math>, say at <math>x = 10\ \mathrm{s}</math>, then the real measurement may occur at some other value of <math>x</math> (for example, due to her finite reaction time), and such measurement error will be generally independent of the "observed" value of the regressor.
** ''Misclassification errors'': special case used for the dummy regressors. If <math>x^*</math> is an indicator of a certain event or condition (such as person is male/female, some medical treatment given/not, etc.), then the measurement error in such a regressor will correspond to incorrect classification, similar to type I and type II errors in statistical testing. In this case the error <math>\eta</math> may take only 3 possible values, and its distribution conditional on <math>x^*</math> is modeled with two parameters: the misclassification probabilities <math>\Pr[\eta = 1 \mid x^* = 0]</math> and <math>\Pr[\eta = -1 \mid x^* = 1]</math> (see the sketch after this list).
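A minimal sketch of such a misclassified dummy regressor in Python, with the two conditional probabilities set to illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
p_false_neg = 0.10   # Pr[eta = -1 | x* = 1]: a true 1 recorded as 0
p_false_pos = 0.05   # Pr[eta = +1 | x* = 0]: a true 0 recorded as 1

x_star = rng.binomial(1, 0.4, n)               # latent dummy regressor
flip = np.where(x_star == 1,
                rng.random(n) < p_false_neg,   # misclassify some 1s
                rng.random(n) < p_false_pos)   # misclassify some 0s
x = np.where(flip, 1 - x_star, x_star)         # observed, misclassified dummy

eta = x - x_star                               # error takes values in {-1, 0, 1}
print(np.unique(eta), (eta == -1).mean(), (eta == 1).mean())
```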