Theil–Sen estimator

In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.

This estimator can be computed efficiently, and is insensitive to outliers. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power. It has been called "the most popular nonparametric technique for estimating a linear trend".


Definition

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (x_i, y_i) is the median m of the slopes (y_j - y_i)/(x_j - x_i) determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same x-coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x-coordinates.

Once the slope m has been determined, one may determine a line from the sample points by setting the y-intercept b to be the median of the values y_i - m x_i. The fit line is then the line y = mx + b with coefficients m and b in slope–intercept form. As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient become approximately zero when it is used to compare the values x_i with their associated residuals y_i - m x_i - b. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of b does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.

A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points, and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.
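The definition above translates almost directly into code. The following is a minimal brute-force sketch in plain Python, assuming the input is a list of (x, y) tuples; the function name `theil_sen` is chosen here for illustration and is not taken from any library. It examines every pair of points, so it takes quadratic time and is only meant to make the definition concrete.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the Theil-Sen line through `points`.

    `points` is a sequence of (x, y) pairs. Following Sen's definition,
    pairs of points that share the same x coordinate are skipped.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x coordinates")
    m = median(slopes)
    # The intercept is the median of y_i - m * x_i over all sample points.
    b = median(y - m * x for x, y in points)
    return m, b


# Points on the line y = 2x + 1, with one gross outlier at x = 4.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11), (6, 13)]
slope, intercept = theil_sen(data)
print(slope, intercept)  # 2.0 1.0
```

In this example 15 of the 21 pairwise slopes are exactly 2, so the single outlier does not move the median slope, and the median of the values y_i - m x_i is likewise unaffected.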


Variations

A variation of the Theil–Sen estimator, the repeated median regression of Siegel, determines for each sample point (x_i, y_i) the median m_i of the slopes (y_j - y_i)/(x_j - x_i) of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical.

A different variant pairs up sample points by the rank of their x-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the Theil–Sen estimator.

Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose x-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.

For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.
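Two of the variations described above, the repeated median and the seasonal restriction, can be sketched as follows. This is an illustrative quadratic-time sketch in plain Python, not one of the efficient algorithms from the literature; the function names and the (season, x, y) input layout are assumptions made for this example.

```python
from itertools import combinations
from statistics import median


def repeated_median_slope(points):
    """Median over points i of the median slope of lines through point i."""
    per_point_medians = []
    for i, (xi, yi) in enumerate(points):
        slopes_through_i = [
            (yj - yi) / (xj - xi)
            for j, (xj, yj) in enumerate(points)
            if j != i and xj != xi
        ]
        if slopes_through_i:
            per_point_medians.append(median(slopes_through_i))
    if not per_point_medians:
        raise ValueError("need at least two points with distinct x coordinates")
    return median(per_point_medians)


def seasonal_slope(samples):
    """Median slope over pairs of samples that share the same season label.

    `samples` is a sequence of (season, x, y) triples (hypothetical layout).
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (s1, x1, y1), (s2, x2, y2) in combinations(samples, 2)
        if s1 == s2 and x1 != x2
    ]
    if not slopes:
        raise ValueError("no same-season pairs with distinct x coordinates")
    return median(slopes)


# Five points on y = x plus three gross outliers (37.5% contamination).
contaminated = [(x, x) for x in range(5)] + [(5, 100), (6, 200), (7, 300)]
print(repeated_median_slope(contaminated))  # 1.0

# Two seasons with different levels but the same trend of 1 unit per year.
seasonal = [("winter", 0, 0), ("summer", 0.5, 10), ("winter", 1, 1),
            ("summer", 1.5, 11), ("winter", 2, 2), ("summer", 2.5, 12)]
print(seasonal_slope(seasonal))  # 1.0
```

In the first example more than half of all pairwise slopes involve an outlier, so the plain median of all pairwise slopes would land well above 1, while the repeated median still recovers the slope of the uncontaminated points.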


Statistical properties

The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression. For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation. Estimators with low efficiency require more independent observations to attain the same sample variance as efficient unbiased estimators.

The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of 1-\frac{1}{\sqrt{2}}\approx 29.3\%, meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy. However, the breakdown point decreases for higher-dimensional generalizations of the method. A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the repeated median estimator of Siegel.

The Theil–Sen estimator is equivariant under every linear transformation of its response variable, meaning that transforming the data first and then fitting a line, or fitting a line first and then transforming it in the same way, both produce the same result. However, it is not equivariant under affine transformations of both the predictor and response variables.
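The 29.3% figure can be motivated by a short counting argument. The following is a heuristic sketch rather than a rigorous proof, with \varepsilon and n introduced here as notation: if a fraction \varepsilon of the n sample points is corrupted, a pairwise slope is trustworthy only when both of its endpoints are uncorrupted, and the median slope remains under control as long as such pairs make up more than half of all pairs.

```latex
\[
\frac{\binom{(1-\varepsilon)n}{2}}{\binom{n}{2}}
\;\approx\; (1-\varepsilon)^2 \;>\; \frac{1}{2}
\quad\Longleftrightarrow\quad
\varepsilon \;<\; 1-\frac{1}{\sqrt{2}} \;\approx\; 0.293 .
\]
```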


Algorithms and implementation

The median slope of a set of n sample points may be computed exactly by computing all lines through pairs of points, and then applying a linear-time median finding algorithm. Alternatively, it may be estimated by sampling pairs of points. This problem is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median x-coordinate among all such crossing points.

The problem of performing slope selection exactly but more efficiently than the brute-force quadratic-time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in O(n\log n) time, either deterministically or using randomized algorithms. Siegel's repeated median estimator can also be constructed in the same time bound. In models of computation in which the input coordinates are integers and in which bitwise operations on integers take constant time, the Theil–Sen estimator can be constructed even more quickly, in randomized expected time O(n\sqrt{\log n}).

An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.

In the R statistics package, both the Theil–Sen estimator and Siegel's repeated median estimator are available through the mblm library. A free standalone Visual Basic application for Theil–Sen estimation, KTRLine, has been made available by the US Geological Survey. The Theil–Sen estimator has also been implemented in Python as part of the SciPy and scikit-learn libraries.
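The sampling approach mentioned at the start of this section can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used by any of the libraries named above: the function name `sampled_theil_sen`, its parameters, and the simple index-based percentile rule are assumptions made for the example. The default of 600 pairs follows the simulation figure quoted in the Definition section.

```python
import random
from statistics import median


def sampled_theil_sen(points, n_pairs=600, level=0.95, seed=None):
    """Estimate the median slope and a percentile interval from random pairs.

    `points` must be a list of (x, y) tuples. Returns (slope, (low, high)),
    where (low, high) spans the middle `level` fraction of sampled slopes.
    """
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct x coordinates")
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_pairs:
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 != x2:  # skip vertical pairs, as in Sen's definition
            slopes.append((y2 - y1) / (x2 - x1))
    slopes.sort()
    tail = (1.0 - level) / 2.0
    low = slopes[int(tail * (n_pairs - 1))]
    high = slopes[int((1.0 - tail) * (n_pairs - 1))]
    return median(slopes), (low, high)


# Exactly collinear input: every sampled slope equals 3.0.
line = [(x, 3 * x - 2) for x in range(50)]
estimate, (low, high) = sampled_theil_sen(line, seed=0)
print(estimate, low, high)  # 3.0 3.0 3.0
```

The running time depends on the number of sampled pairs rather than on the square of the number of points, at the cost of returning an estimate of the median slope instead of its exact value.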


Applications

Theil–Sen estimation has been applied to astronomy due to its ability to handle censored regression models. In biophysics, its use has been suggested for remote sensing applications such as the estimation of leaf area from reflectance data, due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors". For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least squares estimation due to its high precision in the presence of skewed data. In computer science, the Theil–Sen method has been used to estimate trends in software aging. In meteorology and climatology, it has been used to estimate the long-term trends of wind occurrence and speed.


See also

* Regression dilution, for another problem affecting estimated trend slopes

