
In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.
Theil–Sen regression has several advantages over ordinary least squares regression. It is insensitive to outliers. It can be used for significance tests even when residuals are not normally distributed. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power. It has been called "the most popular nonparametric technique for estimating a linear trend". Fast algorithms are known for computing its parameters.
Definition
As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points $(x_i, y_i)$ is the median $m$ of the slopes $(y_j - y_i)/(x_j - x_i)$ determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same $x$-coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct $x$-coordinates.
Once the slope $m$ has been determined, one may determine a line from the sample points by setting the $y$-intercept $b$ to be the median of the values $y_i - m x_i$. The fit line is then the line $y = mx + b$ with coefficients $m$ and $b$ in slope–intercept form. As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient become approximately zero when it is used to compare the values $x_i$ with their associated residuals $y_i - m x_i - b$. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of $b$ does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.
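
The definition translates directly into a brute-force computation. The following is a minimal sketch using only the Python standard library; the function name `theil_sen_fit` and the sample data are illustrative assumptions, and the quadratic-time loop over all pairs is meant for clarity, not speed. Here `m` is the median pairwise slope and `b` is the median of $y_i - m x_i$.

```python
from itertools import combinations
from statistics import median


def theil_sen_fit(points):
    """Return (slope, intercept) of the Theil-Sen line for (x, y) pairs."""
    # Slopes of the lines through every pair of points with distinct x
    # (Sen's refinement: pairs sharing an x-coordinate are skipped).
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x")
    m = median(slopes)
    # Intercept: median of the values y_i - m * x_i.
    b = median(y - m * x for x, y in points)
    return m, b


# Points on y = 2x + 1, with one gross outlier at x = 4.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11), (6, 13)]
slope, intercept = theil_sen_fit(data)
```

On this data the single outlier affects only 6 of the 21 pairwise slopes, so the median slope and intercept are still those of the underlying line.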
A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points, and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.
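
The sampling procedure described above might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function name, the fixed random seed, and the simple index-based percentile rule are all choices made here, and the default of 600 pairs follows the figure quoted in the text.

```python
import random


def sampled_slope_interval(points, n_pairs=600, coverage=0.95, seed=0):
    """Approximate the interval holding the middle `coverage` of pairwise slopes."""
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    slopes = []
    while len(slopes) < n_pairs:
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 != x2:  # skip pairs whose slope is undefined
            slopes.append((y2 - y1) / (x2 - x1))
    slopes.sort()
    tail = (1.0 - coverage) / 2.0
    last = len(slopes) - 1
    # Simple index-based percentiles (no interpolation).
    return slopes[int(tail * last)], slopes[int((1.0 - tail) * last)]


# Noisy samples of y = 2x + 1 (synthetic data, for illustration only).
noise = random.Random(1)
data = [(x, 2 * x + 1 + noise.gauss(0, 1)) for x in range(50)]
low, high = sampled_slope_interval(data)
```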
Variations
A variation of the Theil–Sen estimator, the repeated median regression of Siegel, determines for each sample point $(x_i, y_i)$ the median $m_i$ of the slopes $(y_j - y_i)/(x_j - x_i)$ of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical.
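
A brute-force version of the repeated median slope can be sketched in a few lines. The function name and test data below are assumptions made for illustration; the efficient algorithms mentioned in the text are considerably more involved than this quadratic-time loop.

```python
from statistics import median


def repeated_median_slope(points):
    """Median, over all points, of the median slope through each point."""
    per_point = []
    for i, (xi, yi) in enumerate(points):
        # Slopes from point i to every other point with a different x.
        slopes = [
            (yj - yi) / (xj - xi)
            for j, (xj, yj) in enumerate(points)
            if j != i and xj != xi
        ]
        if slopes:
            per_point.append(median(slopes))
    if not per_point:
        raise ValueError("need at least two points with distinct x")
    return median(per_point)


# Points on y = 2x + 1, with two gross outliers (x = 4 and x = 5).
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, -50), (6, 13)]
slope = repeated_median_slope(data)
```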
A different variant pairs up sample points by the rank of their $x$-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the Theil–Sen estimator.
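
One way to read this pairing rule is sketched below. The handling of an odd number of points (the middle point is left unpaired) is an assumption made here, as is the function name.

```python
from statistics import median


def rank_paired_slope(points):
    """Median slope over pairs formed by matching low-x and high-x ranks."""
    pts = sorted(points, key=lambda p: p[0])  # order by x-coordinate
    n = len(pts)
    half = (n + 1) // 2  # index of the first point above the median x
    slopes = []
    for i in range(n - half):
        (x1, y1), (x2, y2) = pts[i], pts[i + half]
        if x1 != x2:
            slopes.append((y2 - y1) / (x2 - x1))
    if not slopes:
        raise ValueError("no usable pairs")
    return median(slopes)


# Seven points give three pairs: ranks (0, 4), (1, 5), (2, 6).
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11), (6, 13)]
slope = rank_paired_slope(data)
```

Only about n/2 slopes are examined instead of roughly n²/2, which is where the speed gain comes from.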
Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose $x$-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.
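
As a rough illustration of the weighted idea, the sketch below weights each pairwise slope by the horizontal distance between its two points. This particular weighting, the lower-weighted-median rule, and the function names are assumptions chosen for simplicity; the text does not prescribe a specific scheme.

```python
from itertools import combinations


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    raise ValueError("no values given")


def weighted_theil_sen_slope(points):
    """Weighted median of pairwise slopes, weighted by |x_j - x_i|."""
    slopes, weights = [], []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 != x2:
            slopes.append((y2 - y1) / (x2 - x1))
            weights.append(abs(x2 - x1))  # assumed weighting, see lead-in
    return weighted_median(slopes, weights)


data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11), (6, 13)]
slope = weighted_theil_sen_slope(data)
```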
For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.
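
The seasonal restriction amounts to grouping observations before forming pairs. A minimal sketch follows, assuming each observation is a hypothetical `(time, season, value)` triple; the data layout and function name are illustrative only.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median


def seasonal_slope(observations):
    """Median slope over pairs of observations from the same season."""
    by_season = defaultdict(list)
    for t, season, y in observations:
        by_season[season].append((t, y))
    slopes = [
        (y2 - y1) / (t2 - t1)
        for group in by_season.values()
        for (t1, y1), (t2, y2) in combinations(group, 2)
        if t1 != t2
    ]
    if not slopes:
        raise ValueError("no same-season pairs with distinct times")
    return median(slopes)


# Two alternating seasons sharing a trend of 2 per step, offset by 10.
obs = [(t, t % 2, 2 * t + 10 * (t % 2)) for t in range(8)]
trend = seasonal_slope(obs)
```

Because cross-season pairs are never formed, the constant offset between the two seasons does not distort the estimated trend.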
Statistical properties
The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression. For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation. Estimators with low efficiency require more independent observations to attain the same sample variance as efficient unbiased estimators.
The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of
$$1 - \frac{1}{\sqrt{2}} \approx 29.3\%,$$
meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy. However, the breakdown point decreases for higher-dimensional generalizations of the method. A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the repeated median estimator of Siegel.
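
The 29.3% figure can be motivated by a short counting argument, sketched here for large samples. If a fraction $\varepsilon$ of the points is corrupted, then roughly a fraction $(1-\varepsilon)^2$ of the pairs consist of two uncorrupted points. The median of the pairwise slopes remains bracketed by uncorrupted slopes as long as those pairs make up at least half of all pairs:

$$(1-\varepsilon)^2 \ge \frac{1}{2} \iff \varepsilon \le 1 - \frac{1}{\sqrt{2}} \approx 0.293.$$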
The Theil–Sen estimator is equivariant under every linear transformation of its response variable, meaning that transforming the data first and then fitting a line, or fitting a line first and then transforming it in the same way, both produce the same result. However, it is not equivariant under affine transformations of both the predictor and response variables.
Algorithms
The median slope of a set of sample points may be computed exactly by computing all lines through pairs of points, and then applying a linear time median finding algorithm. Alternatively, it may be estimated by sampling pairs of points. This problem is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median $x$-coordinate among all such crossing points.
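
The equivalence can be checked with a short calculation. Under one common convention for point-line duality (an assumption here, since the text does not fix one), the sample point $(x_i, y_i)$ maps to the line $v = x_i u - y_i$ in a dual $(u, v)$ plane. Two such dual lines cross where

$$x_i u - y_i = x_j u - y_j \iff u = \frac{y_j - y_i}{x_j - x_i},$$

so the horizontal coordinate of each crossing equals the slope of the line through the two original points, and the crossing with the median horizontal coordinate gives the median slope.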
The problem of performing slope selection exactly but more efficiently than the brute force quadratic time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in $O(n \log n)$ time, either deterministically or using randomized algorithms. Siegel's repeated median estimator can also be constructed in the same time bound. In models of computation in which the input coordinates are integers and in which bitwise operations on integers take constant time, the Theil–Sen estimator can be constructed even more quickly, in randomized expected time $O(n \sqrt{\log n})$.
An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.
Implementations
In the R statistics package, both the Theil–Sen estimator and Siegel's repeated median estimator are available through the mblm library.
A free standalone Visual Basic application for Theil–Sen estimation, KTRLine, has been made available by the US Geological Survey.
The Theil–Sen estimator has also been implemented in Python as part of the SciPy and scikit-learn libraries.
Applications
Theil–Sen estimation has been applied to astronomy due to its ability to handle censored regression models. In biophysics, its use has been suggested for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors". For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least squares estimation due to its high precision in the presence of skewed data. In computer science, the Theil–Sen method has been used to estimate trends in software aging. In meteorology and climatology, it has been used to estimate the long-term trends of wind occurrence and speed.
See also
* Regression dilution, for another problem affecting estimated trend slopes