
In statistics, an errors-in-variables model or a measurement error model is a regression model that accounts for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.
In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the ''attenuation bias''. In non-linear models the direction of the bias is likely to be more complicated.
Motivating example
Consider a simple linear regression model of the form
:<math>y_t = \alpha + \beta x_t^* + \varepsilon_t, \qquad t = 1, \ldots, T,</math>
where <math>x_t^*</math> denotes the ''true'' but unobserved regressor. Instead, we observe this value with an error:
:<math>x_t = x_t^* + \eta_t,</math>
where the measurement error <math>\eta_t</math> is assumed to be independent of the true value <math>x_t^*</math>.
A practical application is the standard school science experiment for Hooke's law, in which one estimates the relationship between the weight added to a spring and the amount by which the spring stretches.
If the <math>y_t</math>'s are simply regressed on the <math>x_t</math>'s (see simple linear regression), then the estimator for the slope coefficient is
:<math>\hat\beta_x = \frac{\tfrac{1}{T} \sum_{t=1}^T (x_t - \bar{x})(y_t - \bar{y})}{\tfrac{1}{T} \sum_{t=1}^T (x_t - \bar{x})^2},</math>
which converges as the sample size <math>T</math> increases without bound:
:<math>\hat\beta_x\ \xrightarrow{p}\ \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]} = \frac{\beta \sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta / \sigma^2_{x^*}}.</math>
This is in contrast to the "true" effect of <math>\beta</math>, estimated using the <math>x_t^*</math>'s:
:<math>\hat\beta\ \xrightarrow{p}\ \frac{\operatorname{Cov}[\,x_t^*, y_t\,]}{\operatorname{Var}[\,x_t^*\,]} = \beta.</math>
Variances are non-negative, so that in the limit the estimated <math>\hat\beta_x</math> is smaller than <math>\hat\beta</math>, an effect which statisticians call ''attenuation'' or regression dilution. Thus the ‘naïve’ least squares estimator <math>\hat\beta_x</math> is an inconsistent estimator for <math>\beta</math>. However, <math>\hat\beta_x</math> is a consistent estimator of the parameter required for a best linear predictor of <math>y</math> given the observed <math>x</math>: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient <math>\beta</math>, although that would assume that the variance of the errors in the estimation and prediction is identical. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the <math>y_t</math>'s to the actually observed <math>x_t</math>'s, in a simple linear regression, is given by
:<math>\beta_x = \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]}.</math>
It is this coefficient, rather than <math>\beta</math>, that would be required for constructing a predictor of <math>y</math> based on an observed <math>x</math> which is subject to noise.
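The attenuation can be checked numerically; the following is a minimal simulation sketch (assuming NumPy, with illustrative values <math>\beta = 2</math> and <math>\sigma_{x^*} = \sigma_\eta = 1</math>, so the probability limit of the naive slope is <math>\beta/2 = 1</math>):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100_000            # large sample, so estimates are close to their limits
alpha, beta = 0.5, 2.0
sigma_x_star = 1.0     # sd of the true regressor x*
sigma_eta = 1.0        # sd of the measurement error eta

x_star = rng.normal(0.0, sigma_x_star, T)       # true but unobserved regressor
eps = rng.normal(0.0, 0.5, T)                   # regression error
y = alpha + beta * x_star + eps                 # "true" model
x = x_star + rng.normal(0.0, sigma_eta, T)      # observed, error-contaminated regressor

# Naive OLS slope of y on the observed x
beta_hat_x = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# Infeasible OLS slope of y on the true x*
beta_hat = np.cov(x_star, y)[0, 1] / np.var(x_star, ddof=1)

limit = beta * sigma_x_star**2 / (sigma_x_star**2 + sigma_eta**2)
print(f"naive slope:      {beta_hat_x:.3f}")    # close to 1.0, not 2.0
print(f"infeasible slope: {beta_hat:.3f}")      # close to beta = 2.0
print(f"predicted limit:  {limit:.3f}")         # 1.0
```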
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous).
Jerry Hausman sees this as an ''iron law of econometrics'': "The magnitude of the estimate is usually smaller than expected."
Specification
Usually, measurement error models are described using the latent variables approach. If <math>y</math> is the response variable and <math>x</math> are observed values of the regressors, then it is assumed there exist some latent variables <math>y^*</math> and <math>x^*</math> which follow the model's "true" functional relationship <math>g(\cdot)</math>, and such that the observed quantities are their noisy observations:
:<math>\begin{cases} y^* = g(x^*\!, w \,|\, \theta), \\ y = y^* + \varepsilon, \\ x = x^* + \eta, \end{cases}</math>
where <math>\theta</math> is the model's parameter and <math>w</math> are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that corresponding entries in the variance matrix of the <math>\eta</math>'s are zero.
The variables <math>y</math>, <math>x</math>, <math>w</math> are all ''observed'', meaning that the statistician possesses a data set of <math>n</math> statistical units which follow the data generating process described above; the latent variables <math>x^*</math>, <math>y^*</math>, <math>\varepsilon</math>, and <math>\eta</math> are not observed, however.
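As a concrete reading of this data generating process, a minimal simulation sketch follows (the linear form of <math>g</math>, the parameter values, and the noise scales are illustrative assumptions, not part of the specification):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

def g(x_star, w, theta):
    """Hypothetical 'true' functional relationship, here linear in x* and w."""
    return theta[0] + theta[1] * x_star + theta[2] * w

theta = (1.0, 2.0, -0.5)                  # model parameter
x_star = rng.normal(size=n)               # latent regressor
w = rng.uniform(0.0, 1.0, n)              # error-free regressor
y_star = g(x_star, w, theta)              # latent response
y = y_star + rng.normal(0.0, 0.3, n)      # observed response:  y = y* + eps
x = x_star + rng.normal(0.0, 0.5, n)      # observed regressor: x = x* + eta

# The statistician's data set consists only of (y, x, w);
# x_star, y_star, and the errors eps and eta remain unobserved.
data = np.column_stack([y, x, w])
```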
This specification does not encompass all the existing errors-in-variables models. For example, in some of them the function <math>g(\cdot)</math> may be non-parametric or semi-parametric. Other approaches model the relationship between <math>y^*</math> and <math>x^*</math> as distributional instead of functional; that is, they assume that <math>y^*</math> conditionally on <math>x^*</math> follows a certain (usually parametric) distribution.
Terminology and assumptions
* The observed variable <math>x</math> may be called the ''manifest'', ''indicator'', or ''proxy'' variable.
* The unobserved variable <math>x^*</math> may be called the ''latent'' or ''true'' variable. It may be regarded either as an unknown constant (in which case the model is called a ''functional model''), or as a random variable (correspondingly a ''structural model'').
* The relationship between the measurement error <math>\eta</math> and the latent variable <math>x^*</math> can be modeled in different ways:
** ''Classical errors'': <math>\eta \perp x^*</math>; the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
** ''Mean-independence'': <math>\operatorname{E}[\eta \mid x^*] = 0</math>; the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
** ''Berkson's errors'': <math>\eta \perp x</math>; the errors are independent of the ''observed'' regressor <math>x</math>. This assumption has very limited applicability. One example is round-off errors: if a person's true age <math>x^*</math> is a continuous random variable, whereas the observed age is truncated to the next smallest integer, then the truncation error is approximately independent of the observed age. Another possibility is with a fixed-design experiment: for example, if a scientist decides to make a measurement at a certain predetermined moment of time <math>x</math>, say at <math>x = 10\,\mathrm{s}</math>, then the real measurement may occur at some other value of <math>x</math> (for example, due to her finite reaction time), and such measurement error will be generally independent of the "observed" value of the regressor.
** ''Misclassification errors'': special case used for the dummy regressors. If <math>x^*</math> is an indicator of a certain event or condition (such as person is male/female, some medical treatment given/not, etc.), then the measurement error in such regressor will correspond to the incorrect classification similar to type I and type II errors in statistical testing. In this case the error <math>\eta</math> may take only 3 possible values, and its distribution conditional on <math>x^*</math> is modeled with two parameters:
:<math>\alpha = \operatorname{Pr}[\eta = -1 \mid x^* = 1], \qquad \beta = \operatorname{Pr}[\eta = 1 \mid x^* = 0].</math>
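A short sketch can make the misclassification structure concrete (a minimal simulation, assuming NumPy; the rates <math>\alpha = 0.10</math> and <math>\beta = 0.05</math> are illustrative values, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# alpha = Pr[eta = -1 | x* = 1]: a true 1 is misrecorded as 0
# beta  = Pr[eta =  1 | x* = 0]: a true 0 is misrecorded as 1
alpha_mc, beta_mc = 0.10, 0.05

x_star = rng.integers(0, 2, n)                       # true binary indicator
u = rng.random(n)
flip = np.where(x_star == 1, u < alpha_mc, u < beta_mc)
x = np.where(flip, 1 - x_star, x_star)               # observed, misclassified dummy

eta = x - x_star                                     # takes only the values -1, 0, 1
print(np.unique(eta))                                # [-1  0  1]
print((eta == -1).mean() / (x_star == 1).mean())     # ~ alpha_mc
print((eta == 1).mean() / (x_star == 0).mean())      # ~ beta_mc
```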