Influential Observations

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation. In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.


Assessment

Various methods have been proposed for measuring influence. Assume an estimated regression \mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, where \mathbf{y} is an ''n''×1 column vector for the response variable, \mathbf{X} is the ''n''×''k'' design matrix of explanatory variables (including a constant), \mathbf{e} is the ''n''×1 residual vector, and \mathbf{b} is a ''k''×1 vector of estimates of some population parameter \boldsymbol{\beta} \in \mathbb{R}^{k}. Also define \mathbf{H} \equiv \mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}, the projection matrix (or hat matrix) of \mathbf{X}. Then we have the following measure of influence:

# \text{DFBETA}_{i} \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \frac{\left(\mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\top} e_{i}}{1 - h_{i}}, where \mathbf{b}_{(-i)} denotes the coefficients estimated with the ''i''-th row \mathbf{x}_{i} of \mathbf{X} deleted, e_{i} is the ''i''-th residual, and h_{i} = \mathbf{x}_{i} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\top} denotes the ''i''-th element of the main diagonal of \mathbf{H}. Thus DFBETA measures the difference in each parameter estimate with and without the potentially influential point. There is a DFBETA for each variable and each observation (if there are ''N'' observations and ''k'' variables there are ''N''·''k'' DFBETAs). DFBETAs computed for the third dataset from Anscombe's quartet (the bottom-left chart in the quartet's figure) illustrate this; a sketch of that computation follows.
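The following is a minimal sketch, not part of the source article, assuming NumPy and the commonly quoted values of Anscombe's third dataset. It computes DFBETA from the closed form above and checks it against the brute-force definition obtained by refitting with the ''i''-th row deleted:

 import numpy as np
 
 # Anscombe's third dataset (commonly quoted values); the outlying point is (13, 12.74).
 x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
 y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
 
 X = np.column_stack([np.ones_like(x), x])        # design matrix with a constant
 XtX_inv = np.linalg.inv(X.T @ X)
 b = XtX_inv @ X.T @ y                            # full-sample OLS estimates
 e = y - X @ b                                    # residuals
 h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages: diagonal of H = X(X'X)^{-1}X'
 
 for i in range(len(y)):
     dfbeta = XtX_inv @ X[i] * e[i] / (1.0 - h[i])          # closed-form DFBETA_i
     b_minus_i = np.linalg.lstsq(np.delete(X, i, axis=0),
                                 np.delete(y, i), rcond=None)[0]
     assert np.allclose(dfbeta, b - b_minus_i)              # agrees with the refit definition
     print(f"obs {i}: DFBETA = {dfbeta}")

The assertion holds because the closed form is an exact identity (via the Sherman–Morrison formula), not an approximation, so the deleted-row refit never has to be carried out explicitly.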


Outliers, leverage and influence

An outlier may be defined as a data point that differs significantly from other observations. A high-leverage point is an observation made at extreme values of the independent variables. Either type of atypical observation can force the regression line to pass close to it. In Anscombe's quartet, the bottom-right image has a point with high leverage and the bottom-left image has an outlying point.
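As a companion sketch (again an addition, not from the source), the leverages h_{i} can be read directly off the diagonal of the hat matrix \mathbf{H}. Using only the x-values of Anscombe's fourth dataset (ten observations at x = 8 and one at x = 19, the bottom-right chart), the single extreme point attains the maximum possible leverage of 1:

 import numpy as np
 
 # x-values of Anscombe's fourth dataset: the bottom-right chart of the quartet.
 x = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
 X = np.column_stack([np.ones_like(x), x])     # design matrix with a constant
 
 # Leverages h_i = x_i (X'X)^{-1} x_i', i.e. the diagonal of H = X(X'X)^{-1}X'.
 H = X @ np.linalg.inv(X.T @ X) @ X.T
 h = np.diag(H)
 
 print(h)   # each repeated point has h = 0.1; the point at x = 19 has h = 1.0

A leverage of 1 means the fitted line is forced to pass exactly through that observation: with all other x-values identical, the slope is determined entirely by the single extreme point.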


See also

* Influence function (statistics)
* Outlier
* Leverage (statistics)
** Partial leverage
* Regression analysis
* Anomaly detection


Further reading

* {{cite book , first=Peter , last=Kennedy , author-link=Peter Kennedy (economist) , chapter=Robust Estimation , title=A Guide to Econometrics , location=Cambridge , publisher=The MIT Press , edition=Fifth , year=2003 , isbn=0-262-61183-X , pages=372–388 , chapter-url=https://books.google.com/books?id=B8I5SP69e4kC&pg=PA372 }}