Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may be associated with that quantity of time. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one's hazard rate for a stroke occurring, or changing the material from which a manufactured component is constructed may double its hazard rate for failure. Other types of survival models, such as accelerated failure time models, do not exhibit proportional hazards. The accelerated failure time model describes a situation where the biological or mechanical life history of an event is accelerated (or decelerated).

Background

Survival models can be viewed as consisting of two parts: the underlying baseline hazard function, often denoted

\lambda_0(t)

, describing how the risk of event per time unit changes over time at ''baseline'' levels of covariates; and the effect parameters, describing how the hazard varies in response to explanatory covariates. A typical medical example would include covariates such as treatment assignment, as well as patient characteristics such as age at start of study, gender, and the presence of other diseases at start of study, in order to reduce variability and/or control for confounding. The ''proportional hazards condition'' states that covariates are multiplicatively related to the hazard. In the simplest case of stationary coefficients, for example, a treatment with a drug may, say, halve a subject's hazard at any given time

t

, while the baseline hazard may vary. Note however, that this does not double the lifetime of the subject; the precise effect of the covariates on the lifetime depends on the type of

\lambda_0(t)

. The covariate is not restricted to binary predictors; in the case of a continuous covariate

x

, it is typically assumed that the hazard responds exponentially; each unit increase in

x

results in proportional scaling of the hazard.

The Cox model

Introduction

Sir David Cox observed that if the proportional hazards assumption holds (or, is assumed to hold) then it is possible to estimate the effect parameter(s), denoted

\beta_i

below, without any consideration of the full hazard function. This approach to survival data is called application of the ''Cox proportional hazards model'', sometimes abbreviated to ''Cox model'' or to ''proportional hazards model''. However, Cox also noted that biological interpretation of the proportional hazards assumption can be quite tricky. Let ''X''_''i'' = (''X''_''i1'', ..., ''X''_''ip'') be the realized values of the ''p'' covariates for subject ''i''. The hazard function for the Cox proportional hazards model has the form ::

\begin{aligned}
\lambda(t, X_i) &= \lambda_0(t)\exp(\beta_1X_{i1} + \cdots + \beta_pX_{ip}) \\
                &= \lambda_0(t)\exp(X_i \cdot \beta)
\end{aligned}

This expression gives the hazard function at time ''t'' for subject ''i'' with covariate vector (explanatory variables) ''X''_''i''. Note that between subjects, the baseline hazard

\lambda_0(t)

is identical (has no dependency on ''i''). The only difference between subjects' hazards comes from the baseline scaling factor

\exp(X_i \cdot \beta)

Why it's called "proportional"

To start, suppose we only have a single covariate,

x

, and therefore a single coefficient,

\beta_1

. Consider the effect of increasing

x

by 1: ::

\begin{aligned}
\lambda(t, x+1) &= \lambda_0(t)\exp(\beta_1(x+1)) \\
&= \lambda_0(t)\exp(\beta_1x+\beta_1)\\
&= \Bigl( \lambda_0(t)\exp(\beta_1x) \Bigr) \exp(\beta_1) \\
&= \lambda(t, x) \exp(\beta_1)
\end{aligned}

We can see that increasing a covariate by 1 scales the original hazard by the constant

\exp(\beta_1)

. Rearranging things slightly, we see that: ::

\frac{\lambda(t, x+1)}{\lambda(t, x)} = \exp(\beta_1)

The right-hand-side is constant over time (no term has a

t

in it). This relationship,

x/y = \text{constant}

, is called a proportional relationship. More generally, consider two subjects, i and j, with covariates

X_i

and

X_j

respectively. Consider the ratio of their hazards: ::

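\begin{aligned}
\frac{\lambda(t, X_i)}{\lambda(t, X_j)} &= \frac{\lambda_0(t)\exp(X_i \cdot \beta)}{\lambda_0(t)\exp(X_j \cdot \beta)}\\
&= \frac{\exp(X_i \cdot \beta)}{\exp(X_j \cdot \beta)}\\
&= \exp((X_i - X_j) \cdot \beta)
\end{aligned}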

The right-hand-side isn't dependent on time, as the only time-dependent factor,

\lambda_0(t)

, was cancelled out.
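The cancellation above can be checked numerically. The following is a minimal sketch in Python; the baseline hazard, the coefficients and the covariate values are made-up assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

def baseline_hazard(t):
    # Hypothetical baseline hazard lambda_0(t); any positive function of t would do.
    return 0.1 + 0.05 * t

def hazard(t, x, beta):
    # Cox-form hazard: lambda(t, x) = lambda_0(t) * exp(x . beta)
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard(t) * math.exp(linear_predictor)

beta = [0.7, -0.2]   # illustrative coefficients, not estimates from any data
x_i = [1.0, 3.0]     # covariates of subject i
x_j = [0.0, 5.0]     # covariates of subject j

# Ratio of the two hazards at several time points.
ratios = [hazard(t, x_i, beta) / hazard(t, x_j, beta) for t in (0.5, 1.0, 2.0, 10.0)]

# Closed form: exp((X_i - X_j) . beta), which contains no t.
expected = math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))

print(ratios)
print(expected)
```

Each entry of `ratios` agrees with `expected` up to floating-point rounding, even though the hazards themselves change with ''t''.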

Absence of an intercept term

Often there is an intercept term (also called a constant term or bias term) used in regression models. The Cox model lacks one because the baseline hazard,

\lambda_0(t)

, takes its place. Let's see what would happen if we did include an intercept term anyway, denoted

\beta_0

: ::

\begin{aligned}
\lambda(t, X_i) &= \lambda_0(t)\exp(\beta_1X_{i1} + \cdots + \beta_pX_{ip} + \beta_0)\\
                &= \lambda_0(t)\exp(X_i \cdot \beta)\exp(\beta_0) \\
                &= \left ( \exp(\beta_0)\lambda_0(t)\right ) \exp(X_i \cdot \beta) \\
                &= \lambda^*_0(t)\exp(X_i \cdot \beta)
\end{aligned}

where we've redefined

\exp(\beta_0)\lambda_0(t)

to be a new baseline hazard,

\lambda^*_0(t)

. Thus, the baseline hazard incorporates all parts of the hazard that are not dependent on the subjects' covariates, which includes any intercept term (which is constant for all subjects, by definition).

Likelihood for unique times

The Cox partial likelihood, shown below, is obtained by using Breslow's estimate of the baseline hazard function, plugging it into the full likelihood and then observing that the result is a product of two factors. The first factor is the partial likelihood shown below, in which the baseline hazard has "canceled out". The second factor is free of the regression coefficients and depends on the data only through the censoring pattern. The effect of covariates estimated by any proportional hazards model can thus be reported as hazard ratios. The likelihood of the event to be observed occurring for subject ''i'' at time ''Y''_''i'' can be written as: ::

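L_i(\beta) = \frac{\lambda(Y_i, X_i)}{\sum_{j:Y_j\ge Y_i}\lambda(Y_i, X_j)} = \frac{\lambda_0(Y_i)\theta_i}{\sum_{j:Y_j\ge Y_i}\lambda_0(Y_i)\theta_j} = \frac{\theta_i}{\sum_{j:Y_j\ge Y_i}\theta_j},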

where ''θ''_''j'' = exp(''X''_''j'' ⋅ ''β'') and the summation is over the set of subjects ''j'' where the event has not occurred before time ''Y''_''i'' (including subject ''i'' itself). Clearly 0 < ''L''_''i''(''β'') ≤ 1. This is a partial likelihood: the effect of the covariates can be estimated without the need to model the change of the hazard over time. Treating the subjects as if they were statistically independent of each other, the joint probability of all realized events is the following partial likelihood, where the occurrence of the event is indicated by ''C''_''i'' = 1: ::

L(\beta) = \prod_{i:C_i=1} L_i(\beta) .

The corresponding log partial likelihood is ::

\ell(\beta) = \sum_{i:C_i=1} \left(X_i \cdot \beta - \log \sum_{j:Y_j\ge Y_i}\theta_j\right).

This function can be maximized over ''β'' to produce maximum partial likelihood estimates of the model parameters. The partial score function is ::

\ell^\prime(\beta) = \sum_{i:C_i=1} \left(X_i - \frac{\sum_{j:Y_j\ge Y_i}\theta_jX_j}{\sum_{j:Y_j\ge Y_i}\theta_j}\right),

and the Hessian matrix of the partial log likelihood is ::

\ell^{\prime\prime}(\beta) = -\sum_{i:C_i=1} \left(\frac{\sum_{j:Y_j\ge Y_i}\theta_jX_jX_j^\prime}{\sum_{j:Y_j\ge Y_i}\theta_j} - \frac{\left[\sum_{j:Y_j\ge Y_i}\theta_jX_j\right]\left[\sum_{j:Y_j\ge Y_i}\theta_jX_j^\prime\right]}{\left[\sum_{j:Y_j\ge Y_i}\theta_j\right]^2}\right).

Using this score function and Hessian matrix, the partial likelihood can be maximized using the Newton-Raphson algorithm. The inverse of the negative Hessian matrix, evaluated at the estimate of ''β'', can be used as an approximate variance-covariance matrix for the estimate, and used to produce approximate standard errors for the regression coefficients.
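The estimation procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used by any statistical package: it assumes a single covariate (so ''β'', the score and the Hessian are scalars), no tied event times, and a small made-up data set. The function names are hypothetical.

```python
import math

def partial_loglik_terms(beta, times, events, x):
    """Log partial likelihood, score and Hessian for one covariate.

    Assumes no tied event times; theta_j = exp(beta * x_j).
    """
    theta = [math.exp(beta * xj) for xj in x]
    loglik = score = hess = 0.0
    for i, (y_i, c_i) in enumerate(zip(times, events)):
        if c_i != 1:
            continue  # censored subjects enter only through the risk sets
        risk = [j for j, y_j in enumerate(times) if y_j >= y_i]
        s0 = sum(theta[j] for j in risk)
        s1 = sum(theta[j] * x[j] for j in risk)
        s2 = sum(theta[j] * x[j] ** 2 for j in risk)
        loglik += beta * x[i] - math.log(s0)
        score += x[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return loglik, score, hess

def fit_newton_raphson(times, events, x, beta=0.0, tol=1e-10, max_iter=50):
    """Maximize the partial likelihood by Newton-Raphson (scalar case)."""
    for _ in range(max_iter):
        _, score, hess = partial_loglik_terms(beta, times, events, x)
        step = score / hess
        beta -= step
        if abs(step) < tol:
            break
    _, _, hess = partial_loglik_terms(beta, times, events, x)
    std_err = math.sqrt(-1.0 / hess)  # inverse of the negative Hessian
    return beta, std_err

# Made-up data: follow-up time, event indicator (1 = event, 0 = censored), covariate.
times = [2.0, 3.0, 5.0, 7.0, 8.0, 11.0, 13.0, 14.0]
events = [1, 1, 0, 1, 1, 0, 1, 1]
x = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

beta_hat, se = fit_newton_raphson(times, events, x)
print(beta_hat, se, math.exp(beta_hat))
```

With several covariates the division by the Hessian becomes the solution of a linear system, but the structure of the iteration is the same.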

Likelihood when there exist tied times

Several approaches have been proposed to handle situations in which there are ties in the time data. ''Breslow's method'' describes the approach in which the procedure described above is used unmodified, even when ties are present. An alternative approach that is considered to give better results is ''Efron's method''. Let ''t''_''j'' denote the unique times, let ''H''_''j'' denote the set of indices ''i'' such that ''Y''_''i'' = ''t''_''j'' and ''C''_''i'' = 1, and let ''m''_''j'' = |''H''_''j''|. Efron's approach maximizes the following partial likelihood. ::

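L(\beta) = \prod_j \frac{\prod_{i\in H_j}\theta_i}{\prod_{\ell=0}^{m_j-1}\left[\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i\right]}.

The corresponding log partial likelihood is ::

\ell(\beta) = \sum_j \left(\sum_{i\in H_j} X_i \cdot \beta -\sum_{\ell=0}^{m_j-1}\log\left(\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j} \sum_{i\in H_j}\theta_i\right)\right),

the score function is ::

\ell^\prime(\beta) = \sum_j \left(\sum_{i\in H_j} X_i -\sum_{\ell=0}^{m_j-1}\frac{\sum_{i:Y_i\ge t_j}\theta_iX_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_i}{\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i}\right),

and the Hessian matrix is ::

\ell^{\prime\prime}(\beta) = -\sum_j \sum_{\ell=0}^{m_j-1} \left(\frac{\sum_{i:Y_i\ge t_j}\theta_iX_iX_i^\prime - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_iX_i^\prime}{\phi_{j,\ell,m_j}} - \frac{Z_{j,\ell,m_j} Z_{j,\ell,m_j}^\prime}{\phi_{j,\ell,m_j}^2}\right),

where ::

\phi_{j,\ell,m_j} = \sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i

Z_{j,\ell,m_j} = \sum_{i:Y_i\ge t_j}\theta_iX_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_i.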

Note that when ''H''_''j'' is empty (all observations with time ''t''_''j'' are censored), the summands in these expressions are treated as zero.
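To make the difference between the two tie-handling methods concrete, here is a hedged Python sketch of the log partial likelihood under Breslow's and Efron's methods. It assumes a single covariate and made-up data; the function name and the `ties` argument are hypothetical and do not mirror any particular package.

```python
import math

def log_partial_likelihood(beta, times, events, x, ties="efron"):
    """Log partial likelihood for one covariate with tied event times."""
    n = len(times)
    theta = [math.exp(beta * xi) for xi in x]
    total = 0.0
    for t_j in sorted(set(times)):
        # H_j: subjects with an observed event at time t_j
        tied = [i for i in range(n) if times[i] == t_j and events[i] == 1]
        m_j = len(tied)
        if m_j == 0:
            continue  # only censored observations at t_j: contributes zero
        risk_sum = sum(theta[i] for i in range(n) if times[i] >= t_j)
        tied_sum = sum(theta[i] for i in tied)
        total += sum(beta * x[i] for i in tied)
        for ell in range(m_j):
            if ties == "efron":
                # Efron: remove a fraction ell/m_j of the tied subjects' weight
                total -= math.log(risk_sum - (ell / m_j) * tied_sum)
            else:
                # Breslow: every tied event uses the full risk-set sum
                total -= math.log(risk_sum)
    return total

# Made-up data containing ties at t = 1 and t = 3.
tied_times = [1, 1, 2, 3, 3, 3, 4]
tied_events = [1, 1, 0, 1, 1, 0, 1]
tied_x = [0.5, 1.2, 0.3, 2.0, 0.1, 0.7, 1.1]

efron = log_partial_likelihood(0.3, tied_times, tied_events, tied_x, ties="efron")
breslow = log_partial_likelihood(0.3, tied_times, tied_events, tied_x, ties="breslow")
print(efron, breslow)
```

When no event times are tied, every ''m''_''j'' equals 1 and the two expressions coincide, which is a useful sanity check for an implementation.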

Examples

Below are some worked examples of the Cox model in practice.

A single binary covariate

Suppose the endpoint we are interested in is patient survival during a 5-year observation period after a surgery. Patients can die within the 5-year period, and we record when they died, or patients can live past 5 years, and we only record that they lived past 5 years. The surgery was performed at one of two hospitals, A or B, and we'd like to know if the hospital location is associated with 5-year survival. Specifically, we'd like to know the relative increase (or decrease) in hazard from a surgery performed at hospital A compared to hospital B. Provided is some (fake) data, where each row represents a patient: T is how long the patient was observed for before death or 5 years (measured in months), and C denotes if the patient died in the 5-year period. We've encoded the hospital as a binary variable denoted X: 1 if from hospital A, 0 if from hospital B. Our single-covariate Cox proportional hazards model looks like the following, with

\beta_1

representing the hospital's effect, and i indexing each patient: ::

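\overbrace{\lambda(t, X_i)}^{\text{hazard for } i} = \underbrace{\lambda_0(t)}_{\text{baseline hazard}}\cdot\overbrace{\exp(\beta_1 X_i)}^{\text{scaling factor for } i}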

Using statistical software, we can estimate

\beta_1

to be 2.12. The hazard ratio is the exponential of this value,

\exp(\beta_1) = \exp(2.12)

. To see why, consider the ratio of hazards, specifically: ::

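\frac{\lambda(t, X=1)}{\lambda(t, X=0)} = \frac{\lambda_0(t)\exp(\beta_1 \cdot 1)}{\lambda_0(t)\exp(\beta_1 \cdot 0)} = \exp(\beta_1)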

Thus, the hazard ratio of hospital A to hospital B is

\exp(2.12) \approx 8.33

. Putting aside statistical significance for a moment, we can say that patients in hospital A are associated with an 8.3x higher risk of death occurring in any short period of time compared to hospital B. There are important caveats to mention about the interpretation:

# An 8.3x higher risk of death does not mean that 8.3x more patients will die in hospital A: survival analysis examines how quickly events occur, not simply whether they occur.
# More specifically, "risk of death" is a measure of a rate. A rate has units, like meters per second. However, a relative rate doesn't: a bicycle can go 2 times faster than another bicycle (the reference bicycle), without specifying any units. Likewise, the risk of death (rate of death) in hospital A is 8.3 times higher (faster) than the risk of death in hospital B (the reference group).
# The inverse quantity,

1/8.33 = \frac{1}{\exp(2.12)} = \exp(-2.12) \approx 0.12

is the hazard ratio of hospital B relative to hospital A.
# We haven't made any inferences about probabilities of survival between the hospitals. This is because we would need an estimate of the baseline hazard rate,

\lambda_0(t)

, as well as our

\beta_1

estimate. However, standard estimation of the Cox proportional hazards model does not directly estimate the baseline hazard rate.
# Because we have ignored the only time-varying component of the model, the baseline hazard rate, our estimate is timescale-invariant. For example, if we had measured time in years instead of months, we would get the same estimate.
# It's tempting to say that the hospital caused the difference in hazards between the two groups, but since our study is not causal (that is, we don't know how the data was generated), we stick with terminology like "associated".
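The first and fourth caveats can be illustrated with a small Python sketch. Everything about the baseline here is an assumption: the Cox fit does not provide a baseline hazard, so two hypothetical constant baseline hazards (per month) are plugged in purely to show that the ratio of death probabilities over 60 months is not the hazard ratio and depends on the baseline.

```python
import math

beta_1 = 2.12                     # coefficient estimate quoted in the text
hazard_ratio = math.exp(beta_1)   # hospital A relative to hospital B

def death_probability(t, x, h0):
    # Assumes a constant baseline hazard h0, so the cumulative hazard is
    # h0 * t * exp(beta_1 * x) and survival is exp(-cumulative hazard).
    cumulative_hazard = h0 * t * math.exp(beta_1 * x)
    return 1.0 - math.exp(-cumulative_hazard)

risk_ratios = {}
for h0 in (0.001, 0.01):          # hypothetical baseline hazards, not estimates
    p_a = death_probability(60, 1, h0)   # hospital A, 60 months
    p_b = death_probability(60, 0, h0)   # hospital B, 60 months
    risk_ratios[h0] = p_a / p_b
    print(f"h0={h0}: P(death | A)={p_a:.3f}, P(death | B)={p_b:.3f}")

print(hazard_ratio, risk_ratios)
```

Under both assumed baselines the ratio of 5-year death probabilities is smaller than the hazard ratio, and it shrinks as the baseline hazard grows, because probabilities are bounded by 1 while hazards are not.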

A single continuous covariate

To demonstrate a less traditional use case of survival analysis, the next example will be an economics question: what is the relationship between a company's price-to-earnings ratio (P/E) on its 1-year IPO anniversary and its future survival? More specifically, if we consider a company's "birth event" to be its 1-year IPO anniversary, and any bankruptcy, sale, going private, etc. as a "death" event for the company, we'd like to know the influence of the company's P/E ratio at its "birth" (1-year IPO anniversary) on its survival. Provided is a (fake) dataset with survival data from 12 companies: T represents the number of days between the 1-year IPO anniversary and death (or an end date of 2022-01-01, if the company did not die). C represents whether the company died before 2022-01-01 or not. P/E represents the company's price-to-earnings ratio at its 1-year IPO anniversary. Unlike the previous example where there was a binary variable, this dataset has a continuous variable, P/E. However, the model looks similar: ::

\lambda(t, P_i) = \lambda_0(t)\cdot\exp(\beta_1 P_i)

where

P_i

represents a company's P/E ratio. Running this dataset through a Cox model produces an estimate of the value of the unknown

\beta_1

, which is -0.34. Therefore an estimate of the entire hazard is: ::

\lambda(t, P_i) = \lambda_0(t)\cdot\exp(-0.34 P_i)

Since the baseline hazard,

\lambda_0(t)

, was not estimated, the entire hazard is not able to be calculated. However, consider the ratio of the companies i and j's hazards: ::

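\begin{aligned}
\frac{\lambda(t, P_i)}{\lambda(t, P_j)}
 &= \frac{\lambda_0(t)\cdot\exp(-0.34 P_i)}{\lambda_0(t)\cdot\exp(-0.34 P_j)} \\
 &= \exp(-0.34 (P_i - P_j))
\end{aligned}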

All terms on the right are known, so calculating the ratio of hazards between companies is possible. Since there is no time-dependent term on the right (all terms are constant), the hazards are proportional to each other. For example, the hazard ratio of company 5 to company 2 is

\exp(-0.34 (6.3 - 3.0)) = 0.33

. This means that, within the interval of study, company 5's risk of "death" is 0.33 ≈ 1/3 as large as company 2's risk of death. There are important caveats to mention about the interpretation:

# The hazard ratio is the quantity

\exp(\beta_1)

, which is

\exp(-0.34) = 0.71

in the above example. From the last calculation above, an interpretation of this is as the ratio of hazards between two "subjects" whose variables differ by one unit: if

P_i = P_j + 1

, then

\exp(\beta_1 (P_i - P_j)) = \exp(\beta_1 \cdot 1)

. The choice of "differ by one unit" is a convenience, as it communicates precisely the value of

\beta_1

.
# The baseline hazard is the hazard when the scaling factor is 1, i.e. when

P = 0

:

\lambda(t, P_i=0) = \lambda_0(t)\cdot\exp(-0.34 \cdot 0) = \lambda_0(t)

Can we interpret the baseline hazard as the hazard of a "baseline" company whose P/E happens to be 0? This interpretation of the baseline hazard as "hazard of a baseline subject" is imperfect, as it is possible that the covariate being 0 is impossible. In this application, a P/E of 0 is meaningless (it means the company's stock price is 0, i.e., they are "dead"). A more appropriate interpretation would be "the hazard when all variables are nil".
# It's tempting to want to understand and interpret a value like

\exp(\beta_1 P_i)

to represent the hazard of a company. However, consider what this is actually representing:

\exp(\beta_1 P_i) = \exp(\beta_1 (P_i-0)) = \frac{\exp(\beta_1 P_i)}{\exp(\beta_1 \cdot 0)} = \frac{\lambda(t, P_i)}{\lambda(t, 0)}

. There is implicitly a ratio of hazards here, comparing company i's hazard to an imaginary baseline company with 0 P/E. However, as explained above, a P/E of 0 is impossible in this application, so

\exp(\beta_1 P_i)

is meaningless in this example. Ratios between plausible hazards are meaningful, however.
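The hazard-ratio arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the calculation: the coefficient and the two P/E values (6.3 and 3.0) are the ones quoted in the text, while the function name is hypothetical.

```python
import math

beta_1 = -0.34  # coefficient estimate quoted in the text

def hazard_ratio(p_i, p_j, beta=beta_1):
    # Ratio lambda(t, p_i) / lambda(t, p_j); the baseline hazard cancels,
    # so no estimate of lambda_0(t) is needed.
    return math.exp(beta * (p_i - p_j))

hr_5_vs_2 = hazard_ratio(6.3, 3.0)   # company 5 relative to company 2
hr_unit = hazard_ratio(4.0, 3.0)     # any two P/E values one unit apart
hr_2_vs_5 = hazard_ratio(3.0, 6.3)   # reversing the comparison inverts the ratio

print(round(hr_5_vs_2, 2), round(hr_unit, 2), round(hr_2_vs_5, 2))
```

Only differences between P/E values enter the calculation, which is why comparisons between plausible companies are meaningful even though a P/E of 0 is not.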

Time-varying predictors and coefficients

Extensions to time-dependent variables, time-dependent strata, and multiple events per subject can be incorporated by the counting process formulation of Andersen and Gill. One example of the use of hazard models with time-varying regressors is estimating the effect of unemployment insurance on unemployment spells. In addition to allowing time-varying covariates (i.e., predictors), the Cox model may be generalized to time-varying coefficients as well. That is, the proportional effect of a treatment may vary with time; e.g. a drug may be very effective if administered within one month of morbidity, and become less effective as time goes on. The hypothesis of no change with time (stationarity) of the coefficient may then be tested. Details and software (an R package) are available in Martinussen and Scheike (2006). In this context, it could also be mentioned that it is theoretically possible to specify the effect of covariates by using additive hazards, i.e. specifying ::

\lambda(t, X_i) = \lambda_0(t) + \beta_1X_{i1} + \cdots + \beta_pX_{ip} = \lambda_0(t) + X_i \cdot \beta.

If such additive hazards models are used in situations where (log-)likelihood maximization is the objective, care must be taken to restrict

\lambda(t, X_i)

to non-negative values. Perhaps as a result of this complication, such models are seldom seen. If the objective is instead least squares, the non-negativity restriction is not strictly required.
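The non-negativity issue with additive hazards is easy to see numerically. In the Python sketch below, the constant baseline hazard, the coefficient and the covariate value are all made-up assumptions; the point is only that the same inputs give a negative additive hazard but a positive multiplicative one.

```python
import math

def baseline_hazard(t):
    # Hypothetical constant baseline hazard, used only for illustration.
    return 0.2

def additive_hazard(t, x, beta):
    # lambda(t, x) = lambda_0(t) + x . beta
    return baseline_hazard(t) + sum(b * xi for b, xi in zip(beta, x))

def proportional_hazard(t, x, beta):
    # lambda(t, x) = lambda_0(t) * exp(x . beta)
    return baseline_hazard(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

beta = [-0.1]   # illustrative protective effect
x = [3.0]       # illustrative covariate value

print(additive_hazard(1.0, x, beta))      # negative: not a valid hazard
print(proportional_hazard(1.0, x, beta))  # positive for any x and beta
```

Because the exponential is always positive, the proportional form needs no constraint on ''β'', whereas the additive form needs one for every covariate pattern that can occur.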

Specifying the baseline hazard function

The Cox model may be specialized if a reason exists to assume that the baseline hazard follows a particular form. In this case, the baseline hazard

\lambda_0(t)

is replaced by a given function. For example, assuming the hazard function to be the ''Weibull'' hazard function gives the ''Weibull proportional hazards model''. Incidentally, using the Weibull baseline hazard is the only circumstance under which the model satisfies both the proportional hazards and the accelerated failure time assumptions. The generic term ''parametric proportional hazards models'' can be used to describe proportional hazards models in which the hazard function is specified. The Cox proportional hazards model is sometimes called a ''semiparametric model'' by contrast. Some authors use the term ''Cox proportional hazards model'' even when specifying the underlying hazard function, to acknowledge the debt of the entire field to David Cox. The term ''Cox regression model'' (omitting ''proportional hazards'') is sometimes used to describe the extension of the Cox model to include time-dependent factors. However, this usage is potentially ambiguous since the Cox proportional hazards model can itself be described as a regression model.
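The remark that a Weibull baseline satisfies both the proportional hazards and the accelerated failure time forms can be checked numerically. The Python sketch below assumes the parameterization λ_0(t) = (k/s)(t/s)^(k-1) with shape ''k'' and scale ''s'', and writes the accelerated failure time hazard as φ·λ_0(φt) with φ = exp(γx); the parameter values and the symbol γ are illustrative assumptions, not quantities from the text.

```python
import math

def weibull_hazard(t, shape, scale):
    # Weibull hazard: (k / s) * (t / s) ** (k - 1)
    return (shape / scale) * (t / scale) ** (shape - 1)

def ph_hazard(t, x, beta, shape, scale):
    # Proportional hazards form: baseline hazard scaled by exp(beta * x)
    return weibull_hazard(t, shape, scale) * math.exp(beta * x)

def aft_hazard(t, x, gamma, shape, scale):
    # Accelerated failure time form: time is rescaled by phi = exp(gamma * x),
    # giving the hazard phi * lambda_0(phi * t)
    phi = math.exp(gamma * x)
    return phi * weibull_hazard(phi * t, shape, scale)

shape, scale = 1.5, 10.0   # hypothetical Weibull parameters
gamma = 0.4                # hypothetical acceleration coefficient
beta = shape * gamma       # matching proportional hazards coefficient
x = 1.0

for t in (0.5, 2.0, 10.0):
    print(t, ph_hazard(t, x, beta, shape, scale), aft_hazard(t, x, gamma, shape, scale))
```

The two columns agree because, for a Weibull baseline, rescaling time by φ multiplies the hazard by the constant φ^k, so the coefficients are linked by β = kγ under this parameterization.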

Relationship to Poisson models

There is a relationship between proportional hazards models and Poisson regression models which is sometimes used to fit approximate proportional hazards models in software for Poisson regression. The usual reason for doing this is that calculation is much quicker. This was more important in the days of slower computers but can still be useful for particularly large data sets or complex problems. Laird and Olivier (1981) provide the mathematical details. They note, "we do not assume [the Poisson model] is true, but simply use it as a device for deriving the likelihood." McCullagh and Nelder's book on generalized linear models has a chapter on converting proportional hazards models to generalized linear models.
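The mechanism behind this relationship can be sketched in the simplest special case, a constant baseline hazard. This is an illustration under that stated assumption and not the construction given in the cited references: with hazard exp(β_0 + β_1 x), the censored-data log-likelihood and a Poisson log-likelihood for the event indicator with mean ''t''·exp(β_0 + β_1 x) differ only by a term free of the parameters. The data and function names below are made up.

```python
import math

def exponential_loglik(beta0, beta1, times, events, x):
    # Censored survival log-likelihood with constant hazard exp(beta0 + beta1 * x):
    # sum of d * log(rate) - rate * t
    total = 0.0
    for t, d, xi in zip(times, events, x):
        rate = math.exp(beta0 + beta1 * xi)
        total += d * math.log(rate) - rate * t
    return total

def poisson_loglik(beta0, beta1, times, events, x):
    # Poisson log-likelihood treating the event indicator d as a count with
    # mean mu = t * exp(beta0 + beta1 * x), i.e. log(t) enters as an offset
    total = 0.0
    for t, d, xi in zip(times, events, x):
        mu = t * math.exp(beta0 + beta1 * xi)
        total += d * math.log(mu) - mu - math.lgamma(d + 1)
    return total

# Made-up data: follow-up time, event indicator, covariate.
times = [2.0, 3.5, 5.0, 7.0, 8.0, 11.0]
events = [1, 0, 1, 1, 0, 1]
x = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

diff_a = poisson_loglik(-1.0, 0.5, times, events, x) - exponential_loglik(-1.0, 0.5, times, events, x)
diff_b = poisson_loglik(0.3, -0.2, times, events, x) - exponential_loglik(0.3, -0.2, times, events, x)
print(diff_a, diff_b)
```

Since the difference does not change with the parameters, maximizing either function gives the same coefficient estimates, which is why Poisson regression software can be used as a fitting device.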

Under high-dimensional setup

In the high-dimensional setting, when the number of covariates ''p'' is large compared to the sample size ''n'', the Lasso method is one of the classical model-selection strategies. Tibshirani (1997) proposed a Lasso procedure for the proportional hazards regression parameter. The Lasso estimator of the regression parameter ''β'' is defined as the minimizer of the negative Cox partial log-likelihood under an ''L''¹-norm type constraint, written here in penalized form with a tuning parameter ''λ'' ≥ 0 (not to be confused with the hazard): ::

\hat\beta = \arg\min_\beta \left\{ -\sum_j \left(\sum_{i\in H_j} X_i \cdot \beta -\sum_{\ell=0}^{m_j-1}\log\left(\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i\right)\right) + \lambda \|\beta\|_1 \right\}.

There has been theoretical progress on this topic recently.
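The shrinkage behaviour of the penalized criterion can be illustrated with a toy Python sketch. It is deliberately simplistic and rests on several assumptions: a single covariate (whereas the Lasso is of interest when ''p'' is large), no tied event times, made-up data, and a brute-force grid search in place of a real optimizer. The names `lasso_objective` and `lasso_estimate` are hypothetical.

```python
import math

def neg_log_partial_likelihood(beta, times, events, x):
    # Negative Cox log partial likelihood for one covariate, no tied times.
    theta = [math.exp(beta * xi) for xi in x]
    total = 0.0
    for i, (y_i, c_i) in enumerate(zip(times, events)):
        if c_i == 1:
            risk_sum = sum(theta[j] for j, y_j in enumerate(times) if y_j >= y_i)
            total -= beta * x[i] - math.log(risk_sum)
    return total

def lasso_objective(beta, lam, times, events, x):
    # Penalized criterion: negative log partial likelihood + lam * |beta|
    return neg_log_partial_likelihood(beta, times, events, x) + lam * abs(beta)

def lasso_estimate(lam, times, events, x):
    # Crude grid search over [-2, 2] in steps of 0.001 (illustration only).
    grid = [k / 1000 for k in range(-2000, 2001)]
    return min(grid, key=lambda b: lasso_objective(b, lam, times, events, x))

# Made-up data: follow-up time, event indicator, covariate.
times = [2.0, 3.0, 5.0, 7.0, 8.0, 11.0, 13.0, 14.0]
events = [1, 1, 0, 1, 1, 0, 1, 1]
x = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

estimates = {lam: lasso_estimate(lam, times, events, x) for lam in (0.0, 0.5, 1.0)}
print(estimates)
```

As the tuning parameter grows, the estimate moves toward zero and, past a threshold, is set exactly to zero, which is the property that makes the Lasso usable for variable selection.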

Software implementations

* Mathematica: CoxModelFit function.
* R: coxph() function, located in the survival package.
* SAS: phreg procedure.
* Stata: stcox command.
* Python: CoxPHFitter, located in the lifelines library.
* SPSS: Available under Cox Regression.
* Matlab: coxphfit function.
* Julia: Available in the Survival.jl library.
* JMP: Available in the Fit Proportional Hazards platform.
