Segmented regression, also known as piecewise regression or broken-stick regression, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. The boundaries between the segments are ''breakpoints''.
Segmented linear regression is segmented regression in which the relations in the intervals are obtained by linear regression.
Segmented linear regression, two segments

Segmented linear regression with two segments separated by a ''breakpoint'' can be useful to quantify an abrupt change of the response function (Yr) to a varying influential factor (x). The breakpoint can be interpreted as a ''critical'', ''safe'', or ''threshold'' value beyond or below which (un)desired effects occur. The breakpoint can be important in decision making.
The figures illustrate some of the results and regression types obtainable.
A segmented regression analysis is based on the presence of a set of (y, x) data, in which y is the dependent variable and x the independent variable.
The least squares method is applied separately to each segment, so that the two regression lines fit the data set as closely as possible by minimizing the ''sum of squares of the differences'' (SSD) between observed (y) and calculated (Yr) values of the dependent variable. This results in the following two equations:
* $\mathrm{Yr} = A_1 x + K_1$ for $x < \mathrm{BP}$ (breakpoint)
* $\mathrm{Yr} = A_2 x + K_2$ for $x > \mathrm{BP}$ (breakpoint)
where:
:Yr is the expected (predicted) value of y for a certain value of x;
:$A_1$ and $A_2$ are regression coefficients (indicating the slope of the line segments);
:$K_1$ and $K_2$ are ''regression constants'' (indicating the intercept at the y-axis).
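As a minimal sketch of the two-segment fit for a breakpoint that is already known, the following pure-Python code fits one least-squares line on each side of BP. The function names `fit_line` and `fit_two_segments`, the made-up data, and the choice to assign points with x equal to BP to the right segment are assumptions made for illustration; they are not taken from any particular software.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + k through the points; returns (a, k)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope (regression coefficient)
    k = mean_y - a * mean_x    # intercept (regression constant)
    return a, k


def fit_two_segments(xs, ys, bp):
    """Fit separate lines left and right of a given breakpoint bp.

    Assumption: points with x == bp are assigned to the right segment.
    """
    left = [(x, y) for x, y in zip(xs, ys) if x < bp]
    right = [(x, y) for x, y in zip(xs, ys) if x >= bp]
    a1, k1 = fit_line([p[0] for p in left], [p[1] for p in left])
    a2, k2 = fit_line([p[0] for p in right], [p[1] for p in right])
    return (a1, k1), (a2, k2)


# Made-up data: flat up to x = 3, then declining
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 2.0, 2.0, 2.0, 1.5, 1.0, 0.5, 0.0]
(a1, k1), (a2, k2) = fit_two_segments(xs, ys, bp=3.5)
print(f"x < BP: Yr = {a1:.3f}*x + {k1:.3f}")
print(f"x > BP: Yr = {a2:.3f}*x + {k2:.3f}")
```

Each segment needs at least two distinct x values, otherwise the slope is undefined.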
The data may show many types of trend; see the figures.
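The method also yields two correlation coefficients (R):
* $R_1^2 = 1 - \dfrac{\sum (y - \mathrm{Yr})^2}{\sum (y - \mathrm{Ya}_1)^2}$ for $x < \mathrm{BP}$ (breakpoint)
and
* $R_2^2 = 1 - \dfrac{\sum (y - \mathrm{Yr})^2}{\sum (y - \mathrm{Ya}_2)^2}$ for $x > \mathrm{BP}$ (breakpoint)
where:
:$\sum (y - \mathrm{Yr})^2$ is the minimized SSD per segment
and
:$\mathrm{Ya}_1$ and $\mathrm{Ya}_2$ are the average values of y in the respective segments.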
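The per-segment coefficient can be computed as in the sketch below, which assumes the usual definition of R² as one minus the ratio of the minimized SSD to the sum of squared deviations of y from the segment average. The function name `segment_r2` and the sample points are hypothetical; the slope and intercept passed in are the least-squares values for those four points.

```python
def segment_r2(xs, ys, a, k):
    """R^2 of the line Yr = a*x + k over one segment.

    Assumes R^2 = 1 - SSD / sum((y - Ya)^2), with Ya the segment mean of y.
    Undefined (division by zero) if all y values in the segment are equal.
    """
    ya = sum(ys) / len(ys)                                       # segment average
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))    # residual sum of squares
    sst = sum((y - ya) ** 2 for y in ys)                         # total sum of squares
    return 1.0 - ssd / sst


# Made-up segment; a = 0.94 and k = 0.15 are its least-squares slope and intercept
xs = [1, 2, 3, 4]
ys = [1.1, 1.9, 3.2, 3.8]
r2 = segment_r2(xs, ys, a=0.94, k=0.15)
print(f"R^2 = {r2:.4f}")
```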
In the determination of the most suitable trend, statistical tests must be performed to ensure that this trend is reliable (significant).
When no significant breakpoint can be detected, one must fall back on a regression without breakpoint.
Example

For the blue figure at the right that gives the relation between yield of mustard (Yr = Ym, t/ha) and soil salinity (x = Ss, expressed as electrical conductivity of the soil solution EC in dS/m) it is found that:
BP = 4.93, $A_1 = 0$, $K_1 = 1.74$, $A_2 = -0.129$, $K_2 = 2.38$, $R_1^2 = 0.0035$ (insignificant), $R_2^2 = 0.395$ (significant) and:
* Ym = 1.74 t/ha for Ss < 4.93 (breakpoint)
* Ym = −0.129 Ss + 2.38 t/ha for Ss > 4.93 (breakpoint)
indicating that soil salinities < 4.93 dS/m are safe and soil salinities > 4.93 dS/m reduce the yield at a rate of 0.129 t/ha per unit increase (1 dS/m) of soil salinity.
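The fitted mustard model can be written as a small function, as sketched below using only the coefficients quoted above. The function name `mustard_yield` is hypothetical, and assigning a salinity exactly equal to the breakpoint to the declining segment is an assumption, since the equations only cover Ss < 4.93 and Ss > 4.93.

```python
BP = 4.93            # breakpoint, dS/m
K1 = 1.74            # constant yield below the breakpoint, t/ha (A1 = 0)
A2, K2 = -0.129, 2.38  # slope and intercept above the breakpoint


def mustard_yield(ss):
    """Predicted mustard yield Ym (t/ha) for soil salinity ss (dS/m)."""
    if ss < BP:
        return K1
    # Assumption: ss == BP is treated as part of the declining segment
    return A2 * ss + K2


for ss in (2.0, 6.0, 8.0):
    print(f"Ss = {ss} dS/m -> Ym = {mustard_yield(ss):.3f} t/ha")
```

At the breakpoint itself the declining segment gives about 1.744 t/ha, so the two segments nearly meet there.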
The figure also shows confidence intervals and uncertainty as elaborated hereunder.
Test procedures

The following ''statistical tests'' are used to determine the type of trend:
# significance of the breakpoint (BP), by expressing BP as a function of the ''regression coefficients'' $A_1$ and $A_2$, the means $Y_1$ and $Y_2$ of the y data and the means $X_1$ and $X_2$ of the x data (left and right of BP), using the laws of propagation of errors in additions and multiplications to compute the standard error (SE) of BP, and applying Student's t-test;
# significance of $A_1$ and $A_2$, applying Student's t-distribution and the ''standard error'' SE of $A_1$ and $A_2$;
# significance of the difference of $A_1$ and $A_2$, applying Student's t-distribution using the SE of their difference;
# significance of the difference of $Y_1$ and $Y_2$, applying Student's t-distribution using the SE of their difference;
# a more formal statistical approach to test for the existence of a breakpoint is the pseudo score test, which does not require estimation of the segmented line.
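One of the tests listed above, the significance of the difference between the two slopes, can be sketched as follows. The text does not give the formulas, so this example assumes the textbook standard error of a least-squares slope and assumes the two segments are independent, so that the SE of the difference is the root of the sum of the squared SEs. The function names and the data are made up for illustration.

```python
import math


def slope_and_se(xs, ys):
    """Least-squares slope and its standard error for one segment."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    k = mean_y - a * mean_x
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ssd / (n - 2) / sxx)   # residual variance uses n - 2 d.o.f.
    return a, se


def slope_difference_t(left, right):
    """t statistic for A1 - A2; left and right are (xs, ys) tuples."""
    a1, se1 = slope_and_se(*left)
    a2, se2 = slope_and_se(*right)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)   # assumes independent segments
    return (a1 - a2) / se_diff


# Made-up data left and right of a breakpoint
left = ([0, 1, 2, 3, 4], [2.0, 2.1, 1.9, 2.0, 2.1])
right = ([5, 6, 7, 8, 9], [1.6, 1.1, 0.7, 0.1, -0.3])
t = slope_difference_t(left, right)

# 2.447 is the two-sided 5% critical value of Student's t for
# n1 + n2 - 4 = 6 degrees of freedom, taken from a t table.
print(f"t = {t:.2f}, significant at 5%: {abs(t) > 2.447}")
```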
In addition, use is made of the correlation coefficient of all data (Ra), the coefficient of determination or coefficient of explanation, confidence intervals of the regression functions, and analysis of variance (ANOVA).
The coefficient of determination for all data (Cd), which is to be maximized under the conditions set by the significance tests, is found from:
* $\mathrm{Cd} = 1 - \dfrac{\sum (y - \mathrm{Yr})^2}{\sum (y - \mathrm{Ya})^2}$
where Yr is the expected (predicted) value of y according to the former regression equations and Ya is the average of all y values.
The Cd coefficient ranges from 0 (no explanation at all) to 1 (full explanation, perfect match).
In a pure, unsegmented, linear regression, the values of Cd and $\mathrm{Ra}^2$ are equal. In a segmented regression, Cd needs to be significantly larger than $\mathrm{Ra}^2$ to justify the segmentation.
The optimal value of the breakpoint may be found such that the Cd coefficient is at its maximum.
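A simple way to look for the breakpoint that maximizes Cd is a grid search, sketched below. It assumes Cd is one minus the ratio of the combined SSD of both segments to the sum of squared deviations of y from the overall average Ya. Candidate breakpoints are taken halfway between consecutive distinct x values, which is an illustrative choice. The significance tests described above are left out of this sketch, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + k; returns (a, k)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def best_breakpoint(xs, ys, min_distinct=2):
    """Return (bp, cd) for the candidate breakpoint with the largest Cd."""
    pairs = sorted(zip(xs, ys))
    ux = sorted(set(xs))                       # distinct x values
    ya = sum(ys) / len(ys)                     # average of all y
    sst = sum((y - ya) ** 2 for y in ys)
    best_bp, best_cd = None, float("-inf")
    # Keep at least min_distinct distinct x values on each side
    for i in range(min_distinct, len(ux) - min_distinct + 1):
        bp = (ux[i - 1] + ux[i]) / 2           # midpoint candidate
        total_ssd = 0.0
        for seg in ([p for p in pairs if p[0] < bp],
                    [p for p in pairs if p[0] > bp]):
            sx = [p[0] for p in seg]
            sy = [p[1] for p in seg]
            a, k = fit_line(sx, sy)
            total_ssd += sum((y - (a * x + k)) ** 2 for x, y in seg)
        cd = 1.0 - total_ssd / sst
        if cd > best_cd:
            best_bp, best_cd = bp, cd
    return best_bp, best_cd


# Made-up data: constant for x <= 3, declining from x = 4 onwards
xs = list(range(10))
ys = [2.0, 2.0, 2.0, 2.0, 1.75, 1.25, 0.75, 0.25, -0.25, -0.75]
bp, cd = best_breakpoint(xs, ys)
print(f"best breakpoint = {bp}, Cd = {cd:.4f}")
```

In practice the breakpoint found this way would still have to pass the significance tests before the segmentation is accepted.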
No-effect range

Segmented regression is often used to detect the range over which an explanatory variable (X) has no effect on the dependent variable (Y), while beyond that range there is a clear response, be it positive or negative.
The range of no effect may be found at the initial part of the X domain or, conversely, at its last part. For the "no effect" analysis, application of the least squares method for the segmented regression analysis may not be the most appropriate technique, because the aim is rather to find the longest stretch over which the Y-X relation can be considered to have zero slope, while beyond this range the slope is significantly different from zero; knowledge of the best value of this slope is not material. The method to find the no-effect range is progressive partial regression [Partial Regression Analysis, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands] over the range, extending the range in small steps until the regression coefficient becomes significantly different from zero.
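The idea of progressive partial regression can be sketched as follows for a no-effect range at the low end of the X domain. This is an illustrative reading of the description above, not the ILRI implementation: the range is extended one data point at a time, the slope is tested with a t statistic against a fixed critical value `t_crit` (strictly, the critical value depends on the number of points), and the function names and data are made up.

```python
import math


def slope_and_se(xs, ys):
    """Least-squares slope and its standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    k = mean_y - a * mean_x
    ssd = sum((y - (a * x + k)) ** 2 for x, y in zip(xs, ys))
    return a, math.sqrt(ssd / (n - 2) / sxx)


def no_effect_range_end(xs, ys, t_crit, min_points=4):
    """Largest x up to which the slope is not significantly non-zero.

    Returns None if even the first min_points points show a significant slope.
    """
    pairs = sorted(zip(xs, ys))
    end = None
    for m in range(min_points, len(pairs) + 1):
        sub_x = [p[0] for p in pairs[:m]]
        sub_y = [p[1] for p in pairs[:m]]
        a, se = slope_and_se(sub_x, sub_y)
        # A perfect fit (se == 0) with non-zero slope counts as significant
        t = abs(a) / se if se > 0 else (math.inf if a != 0 else 0.0)
        if t > t_crit:
            break                  # slope became significant: stop extending
        end = sub_x[-1]
    return end


# Made-up data: roughly constant up to x = 7, declining afterwards
xs = list(range(12))
ys = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 2.0, 1.5, 1.0, 0.5, 0.0]
end = no_effect_range_end(xs, ys, t_crit=2.3)   # 2.3 is an assumed threshold
print(f"no-effect range ends at about x = {end}")
```

Because a single deviating point also inflates the residual variance, the range found this way can extend slightly past the point where the response actually starts.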
In the next figure the breakpoint is found at X = 7.9, whereas for the same data (see the blue figure above for mustard yield) the least squares method yields a breakpoint at X = 4.9. The latter value is lower, but the fit of the data beyond the breakpoint is better. Hence, the purpose of the analysis determines which method should be employed.
Software implementations
* In Python, there is the piecewise-regression package.
* In R, there is the segmented package.
See also
* Chow test
* Simple regression
* Linear regression
* Ordinary least squares
* Multivariate adaptive regression splines
* Local regression
* Regression discontinuity design
* Stepwise regression
* SegReg (software) for segmented regression