The Kaplan–Meier estimator, also known as the product limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss, the time-to-failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the ''Journal of the American Statistical Association''. The journal editor, John Tukey, convinced them to combine their work into one paper, which has been cited almost 61,000 times since its publication in 1958.
The estimator of the survival function <math>S(t)</math> (the probability that life is longer than <math>t</math>) is given by:
: <math>\widehat S(t) = \prod_{i:\ t_i \le t} \left(1 - \frac{d_i}{n_i}\right),</math>
with <math>t_i</math> a time when at least one event happened, <math>d_i</math> the ''number of events'' (e.g., deaths) that happened at time <math>t_i</math>, and <math>n_i</math> the number of ''individuals known to have survived'' (have not yet had an event or been censored) up to time <math>t_i</math>.
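The product formula can be illustrated with a minimal Python sketch. The function name `kaplan_meier` and the toy data are hypothetical, and the sketch assumes the usual convention that a subject censored at an event time still counts as at risk at that time.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] for each distinct event time t_i.

    times  -- observed time per subject (event time or censoring time)
    events -- True if the event was observed, False if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)  # d_i per event time
    leaving = Counter(times)   # subjects leaving the risk set at each time
    at_risk = len(times)       # n_i: subjects still under observation
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply in the factor (1 - d_i / n_i) for this event time.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Events and censorings at time t both leave the risk set afterwards.
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: six subjects, two of them right-censored.
times = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]
curve = kaplan_meier(times, events)
```

For this toy data the factors are 5/6, 4/5, 2/3 and 0 at times 1, 2, 3 and 5, so the estimate steps through 5/6, 2/3, 4/9 and 0. The censored subjects at times 2 and 4 shrink the risk set without contributing a factor of their own.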
Basic concepts
A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population. The value of the survival function between successive distinct sampled observations ("clicks") is assumed to be constant.
An important advantage of the Kaplan–Meier curve is that the method can take into account some types of censored data, particularly ''right-censoring'', which occurs if a patient withdraws from a study, is lost to follow-up, or is alive without event occurrence at last follow-up. On the plot, small vertical tick-marks indicate individual patients whose survival times have been right-censored. When no truncation or censoring occurs, the Kaplan–Meier curve is the complement of the empirical distribution function.
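The uncensored special case can be checked numerically. The following Python sketch uses hypothetical helper names (`km_at`, `ecdf`) and made-up lifetimes; it compares the Kaplan–Meier estimate with one minus the empirical distribution function when every event is observed.

```python
def km_at(times, events, t):
    """Kaplan-Meier estimate at time t: product over event times t_i <= t."""
    s = 1.0
    event_times = sorted({u for u, e in zip(times, events) if e and u <= t})
    for ti in event_times:
        d = sum(1 for u, e in zip(times, events) if e and u == ti)  # d_i
        n = sum(1 for u in times if u >= ti)                        # n_i
        s *= 1.0 - d / n
    return s


def ecdf(sample, t):
    """Empirical distribution function: fraction of observations <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)


lifetimes = [2, 3, 3, 5, 8]          # hypothetical, fully observed lifetimes
observed = [True] * len(lifetimes)   # no censoring at all
gaps = [
    abs(km_at(lifetimes, observed, t) - (1.0 - ecdf(lifetimes, t)))
    for t in range(10)
]
```

Every entry of `gaps` is zero up to floating-point rounding, because without censoring the product of the factors telescopes to the fraction of subjects still alive after time <math>t</math>.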
In medical statistics, a typical application might involve grouping patients into categories, for instance, those with a Gene A profile and those with a Gene B profile. In the graph, patients with Gene B die much more quickly than those with Gene A. After two years, about 80% of the Gene A patients survive, but less than half of the patients with Gene B.
To generate a Kaplan–Meier estimator, at least two pieces of data are required for each patient (or each subject): the status at last observation (event occurrence or right-censored), and the time to event (or time to censoring). If the survival functions between two or more groups are to be compared, then a third piece of data is required: the group assignment of each subject.
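One possible way to lay out such data is sketched below in Python. The record format `(time, event_observed, group)` and all values are hypothetical; the sketch only shows how the three pieces of data per subject might be split into per-group inputs for separate curves.

```python
from collections import defaultdict

# Hypothetical records: (time, event_observed, group).
# event_observed is False when the subject was right-censored at `time`.
records = [
    (5, True, "A"), (8, False, "A"), (12, True, "A"),
    (2, True, "B"), (3, True, "B"), (7, False, "B"),
]

# Collect one (times, events) pair of lists per group.
by_group = defaultdict(lambda: ([], []))
for time, event, group in records:
    times, events = by_group[group]
    times.append(time)
    events.append(event)
```

Each entry of `by_group` then holds the time and status columns that a single Kaplan–Meier curve needs.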
Problem definition
Let <math>\tau \ge 0</math> be a random variable, which we think of as the time until an event of interest takes place. As indicated above, the goal is to estimate the survival function <math>S</math> underlying <math>\tau</math>. Recall that this function is defined as
: <math>S(t) = \operatorname{Prob}(\tau > t)</math>, where <math>t = 0, 1, \dots</math> is the time.
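Let <math>\tau_1, \dots, \tau_n \ge 0</math> be independent, identically distributed random variables, whose common distribution is that of <math>\tau</math>: <math>\tau_j</math> is the random time when some event <math>j</math> happened. The data available for estimating <math>S</math> is not <math>(\tau_j)_{j=1,\dots,n}</math>, but the list of pairs <math>((\tilde\tau_j, c_j))_{j=1,\dots,n}</math> where, for <math>j \in [n] := \{1, 2, \dots, n\}</math>, <math>c_j \ge 0</math> is a fixed, deterministic integer, the censoring time of event <math>j</math>, and <math>\tilde\tau_j = \min(\tau_j, c_j)</math>. In particular, the information available about the timing of event <math>j</math> is whether the event happened before the fixed time <math>c_j</math> and, if so, the actual time of the event is also available. The challenge is to estimate <math>S(t)</math> given this data.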
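The observation model can be made concrete with a small simulation. This Python sketch is illustrative only: the uniform distribution of the event times, the censoring times, and all variable names are assumptions, and the observed time is taken to be the minimum of the event time and the censoring time.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable
n = 8

# Hypothetical integer event times tau_j and fixed censoring times c_j.
tau = [random.randint(1, 10) for _ in range(n)]
c = [4, 4, 6, 6, 8, 8, 10, 10]

# Observed time: the event time if it precedes censoring, else the censoring time.
observed = [min(tau_j, c_j) for tau_j, c_j in zip(tau, c)]

# The data actually available to the analyst: pairs (observed time, c_j).
pairs = list(zip(observed, c))

# The event is known to have happened before c_j exactly when observed < c_j.
event_seen = [obs < c_j for obs, c_j in pairs]
```

Whenever `event_seen[j]` is true, `observed[j]` equals the true event time `tau[j]`; otherwise only the lower bound `c[j]` on the event time is known.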
Derivation of the Kaplan–Meier estimator
Here, we show two derivations of the Kaplan–Meier estimator. Both are based on rewriting the survival function in terms of what is sometimes called hazard, or mortality rates. However, before doing this it is worthwhile to consider a naive estimator.
A naive estimator
To understand the power of the Kaplan–Meier estimator, it is worthwhile to first describe a naive estimator of the survival function.
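Fix <math>k \in [n] = \{1, \dots, n\}</math> and let <math>t \ge 0</math>. A basic argument shows that the following proposition holds:
:Proposition 1: If the censoring time <math>c_k</math> of event <math>k</math> exceeds <math>t</math> (<math>c_k > t</math>), then <math>\tilde\tau_k > t</math> if and only if <math>\tau_k > t</math>.
Let <math>k</math> be such that <math>c_k > t</math>. It follows from the above proposition that
: <math>\operatorname{Prob}(\tau_k > t) = \operatorname{Prob}(\tilde\tau_k > t).</math>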
Let <math>X_k = \mathbb{I}(\tilde\tau_k > t)</math> and consider only those <math>k \in C(t) := \{k : c_k > t\}</math>, i.e. the events for which the outcome was not censored at or before time <math>t</math>. Let <math>m(t) = |C(t)|</math> be the number of elements in <math>C(t)</math>. Note that the set <math>C(t)</math> is not random and so neither is <math>m(t)</math>. Furthermore, <math>(X_k)_{k \in C(t)}</math> is a sequence of independent, identically distributed Bernoulli random variables with common parameter <math>S(t) = \operatorname{Prob}(\tau > t)</math>. Assuming that <math>m(t) > 0</math>, this suggests estimating <math>S(t)</math> using
: <math>\hat S_\text{naive}(t) = \frac{1}{m(t)} \sum_{k : c_k > t} X_k = \frac{|\{1 \le k \le n : \tilde\tau_k > t\}|}{|\{1 \le k \le n : c_k > t\}|} = \frac{|\{1 \le k \le n : \tilde\tau_k > t\}|}{m(t)},</math>
where the second equality follows because <math>\tilde\tau_k > t</math> implies <math>c_k > t</math>, while the last equality is simply a change of notation.
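A minimal Python sketch of this naive estimator follows. It assumes the convention that the survival function at <math>t</math> is the probability that the event time exceeds <math>t</math>, so a subject is retained only if its censoring time exceeds <math>t</math>. The function name `naive_survival` and the data are hypothetical.

```python
def naive_survival(observed, censor_times, t):
    """Fraction of subjects with censoring time > t whose observed time exceeds t."""
    # Indices of subjects whose censoring time exceeds t (the retained set).
    retained = [k for k, c_k in enumerate(censor_times) if c_k > t]
    m = len(retained)
    if m == 0:
        raise ValueError("no subject has a censoring time beyond t")
    # Average of the indicators that the observed time exceeds t.
    return sum(1 for k in retained if observed[k] > t) / m


# Hypothetical data: observed[k] = min(event time, censoring time).
observed = [2, 3, 4, 6, 6, 9]
censor_times = [3, 3, 8, 6, 10, 10]
estimate = naive_survival(observed, censor_times, 4)
```

At <math>t = 4</math> the first two subjects are discarded because their censoring time is 3, and three of the remaining four have an observed time above 4, so the estimate is 0.75.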
The quality of this estimate is governed by the size of <math>m(t)</math>. This can be problematic when <math>m(t)</math> is small, which happens, by definition, when many of the events are censored. A particularly unpleasant property of this estimator, which suggests that perhaps it is not the "best" estimator, is that it ignores all the observations whose censoring time does not exceed <math>t</math>. Intuitively, these observations still contain information about <math>S(t)</math>: for example, when for many events with <math>c_k \le t</math>,