In statistics, the one in ten rule is a rule of thumb for how many predictor parameters can be estimated from data when doing regression analysis (in particular proportional hazards models in survival analysis and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events. For logistic regression the number of events is given by the size of the smallest of the outcome categories, and for survival analysis it is given by the number of uncensored events. For example, if a sample of 200 patients is studied and 20 patients die during the study (so that 180 patients survive), the one in ten rule implies that two pre-specified predictors can reliably be fitted to the total data. Similarly, if 100 patients die during the study (so that 100 patients survive), ten pre-specified predictors can be fitted reliably. If more are fitted, the rule implies that overfitting is likely and the results will not predict well outside the training data. It is not uncommon to see the 1:10 rule violated in fields with many variables (e.g. gene expression studies in cancer), decreasing the confidence in reported findings.
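The arithmetic of the rule can be sketched in a few lines of Python. This is only an illustration: the function name `max_predictors` and its parameters are hypothetical, and the limiting count is taken to be the smaller outcome category for a binary outcome, or the number of uncensored events when no second category is supplied.

```python
from typing import Optional


def max_predictors(events: int,
                   non_events: Optional[int] = None,
                   events_per_variable: int = 10) -> int:
    """Number of pre-specified predictors supported by an events-per-variable rule.

    events: count of the outcome of interest (e.g. deaths, uncensored events).
    non_events: count of the other outcome category for a binary outcome;
        leave as None for survival analysis, where only uncensored events count.
    events_per_variable: 10 for the one in ten rule (hypothetical parameter name).
    """
    if events < 0 or (non_events is not None and non_events < 0):
        raise ValueError("counts must be non-negative")
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    # For a binary outcome the limiting count is the smaller category.
    limiting = events if non_events is None else min(events, non_events)
    return limiting // events_per_variable


# The two examples from the text: 200 patients with 20 or 100 deaths.
print(max_predictors(20, 180))
print(max_predictors(100, 100))
```

Passing a different `events_per_variable` (for instance 20 or 50) applies the same calculation to stricter variants of the rule.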


Improvements

A "one in 20 rule" has been suggested, indicating the need for shrinkage of regression coefficients, and a "one in 50 rule" for stepwise selection with the default significance level of 5%. Other studies, however, show that the one in ten rule may be too conservative as a general recommendation and that five to nine events per predictor can be enough, depending on the research question.

More recently, a study has shown that the ratio of events per predictive variable is not a reliable statistic for determining the minimum number of events needed to estimate a logistic prediction model. Instead, the number of predictor variables, the total sample size (events + non-events) and the events fraction (events / total sample size) can be used to calculate the expected prediction error of the model that is to be developed. One can then estimate the sample size required to keep the expected prediction error below a predetermined allowable value. Alternatively, three requirements for prediction model estimation have been suggested: the model should have a global shrinkage factor of ≥ 0.9, an absolute difference of ≤ 0.05 between the model's apparent and adjusted Nagelkerke R², and a precise estimation of the overall risk or rate in the target population. The necessary sample size and number of events for model development are then given by the values that meet these requirements.
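As a rough illustration of the first of the three requirements, the sketch below computes the sample size at which the expected global shrinkage factor reaches a target value. The formula used, n = p / ((S − 1) ln(1 − R²_CS / S)), is an assumption taken from the prediction-model sample-size literature and is not stated in the text above; it needs an anticipated Cox-Snell R² for the model, and the function name and parameters are hypothetical.

```python
import math


def n_for_target_shrinkage(n_parameters: int,
                           r2_cox_snell: float,
                           shrinkage: float = 0.9) -> float:
    """Sample size at which the expected global shrinkage factor equals `shrinkage`.

    Assumed formula (not given in the article text):
        n = p / ((S - 1) * ln(1 - R2_CS / S))
    n_parameters: number of candidate predictor parameters p.
    r2_cox_snell: anticipated Cox-Snell R-squared of the model (an assumption
        the analyst must supply, e.g. from an earlier study).
    shrinkage: target shrinkage factor S, 0.9 in the requirement above.
    """
    if not 0.0 < shrinkage < 1.0:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0.0 < r2_cox_snell < shrinkage:
        raise ValueError("r2_cox_snell must lie strictly between 0 and shrinkage")
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    # Both factors in the denominator are negative, so n comes out positive.
    denominator = (shrinkage - 1.0) * math.log(1.0 - r2_cox_snell / shrinkage)
    return n_parameters / denominator


# Hypothetical inputs: 20 candidate parameters, anticipated R2_CS of 0.1.
n = n_for_target_shrinkage(20, 0.1)
print(math.ceil(n))  # round up to whole participants
```

The other two requirements would each give their own minimum sample size; under this approach the largest of the three would be the one to use.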

