A hierarchy of evidence (or levels of evidence) is a heuristic used to rank the relative strength of results obtained from scientific research. There is broad agreement on the relative strength of large-scale epidemiological studies. More than 80 different hierarchies have been proposed for assessing medical evidence. The design of the study (such as a case report for an individual patient or a blinded randomized controlled trial) and the endpoints measured (such as survival or quality of life) affect the strength of the evidence. In clinical research, the best evidence for treatment efficacy is mainly from meta-analyses of randomized controlled trials (RCTs). Systematic reviews of completed, high-quality randomized controlled trials, such as those published by the Cochrane Collaboration, rank the same as systematic reviews of completed high-quality observational studies in regard to the study of side effects. Evidence hierarchies are often applied in evidence-based practices and are integral to evidence-based medicine (EBM).

Definition

In 2014, Stegenga defined a hierarchy of evidence as "rank-ordering of kinds of methods according to the potential for that method to suffer from systematic bias". At the top of the hierarchy is the method with the most freedom from systematic bias, or the best internal validity, relative to the tested medical intervention's hypothesized efficacy. In 1997, Greenhalgh suggested it was "the relative weight carried by the different types of primary study when making decisions about clinical interventions". The National Cancer Institute defines levels of evidence as "a ranking system used to describe the strength of the results measured in a clinical trial or research study. The design of the study ... and the endpoints measured ... affect the strength of the evidence."

Examples

A large number of hierarchies of evidence have been proposed. Similar protocols for evaluation of research quality are still in development. So far, the available protocols pay relatively little attention to whether outcome research is relevant to efficacy (the outcome of a treatment performed under ideal conditions) or to effectiveness (the outcome of the treatment performed under ordinary, expectable conditions).

GRADE

The GRADE approach (Grading of Recommendations Assessment, Development and Evaluation) is a method of assessing the certainty in evidence (also known as quality of evidence or confidence in effect estimates) and the strength of recommendations. GRADE began in 2000 as a collaboration of methodologists, guideline developers, biostatisticians, clinicians, public health scientists and other interested members. Over 100 organizations (including the World Health Organization, the UK National Institute for Health and Care Excellence (NICE), the Canadian Task Force for Preventive Health Care and the Colombian Ministry of Health) have endorsed or are using GRADE to evaluate the quality of evidence and the strength of health care recommendations. GRADE rates the quality of evidence at one of four levels: high, moderate, low, or very low.

Guyatt and Sackett

In 1995, Guyatt and Sackett published the first such hierarchy. Greenhalgh put the different types of primary study in the following order:
# Systematic reviews and meta-analyses of "RCTs with definitive results"
# RCTs with definitive results (confidence intervals that do not overlap the threshold for a clinically significant effect)
# RCTs with non-definitive results (a point estimate that suggests a clinically significant effect but with confidence intervals overlapping the threshold for this effect)
# Cohort studies
# Case-control studies
# Cross-sectional surveys
# Case reports
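The distinction between definitive and non-definitive RCT results in the list above can be illustrated with a short Python sketch. It is not taken from Guyatt and Sackett or from Greenhalgh: the function name, the argument names, the "unclassified" label and the assumption that larger effect values are better are all hypothetical choices made for illustration.

```python
def classify_rct_result(point_estimate, ci_low, ci_high, threshold):
    """Label an RCT result relative to a clinically significant threshold.

    Assumes (hypothetically) that larger effect values are better and that
    (ci_low, ci_high) is the confidence interval around point_estimate.
    """
    if not ci_low <= point_estimate <= ci_high:
        raise ValueError("point estimate must lie inside the confidence interval")
    # Definitive: the interval does not overlap the threshold at all
    # (it lies entirely above it or entirely below it).
    if threshold < ci_low or threshold > ci_high:
        return "definitive"
    # Non-definitive: the estimate suggests a clinically significant effect,
    # but the interval still overlaps the threshold.
    if point_estimate > threshold:
        return "non-definitive"
    # The interval overlaps the threshold and the estimate does not exceed it;
    # the list above does not name this case, so it is left unclassified.
    return "unclassified"


# Hypothetical effect sizes, with 0.20 as the clinically significant threshold.
print(classify_rct_result(0.30, 0.22, 0.38, threshold=0.20))  # definitive
print(classify_rct_result(0.25, 0.10, 0.40, threshold=0.20))  # non-definitive
```

A real appraisal would also have to weigh study quality and the choice of threshold, which this sketch takes as given.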

Saunders et al.

A protocol suggested by Saunders et al. assigns research reports to six categories, on the basis of research design, theoretical background, evidence of possible harm, and general acceptance. To be classified under this protocol, there must be descriptive publications, including a manual or similar description of the intervention. This protocol does not consider the nature of any comparison group, the effect of confounding variables, the nature of the statistical analysis, or a number of other criteria. The categories are:
* Category 1, well-supported, efficacious treatment: there are two or more randomized controlled outcome studies comparing the target treatment to an appropriate alternative treatment and showing a significant advantage to the target treatment.
* Category 2, supported and probably efficacious treatment: there are positive outcomes of nonrandomized designs with some form of control, which may involve a non-treatment group.
* Category 3, supported and acceptable treatment: the intervention is supported by one controlled or uncontrolled study, or by a series of single-subject studies, or by work with a different population than the one of interest.
* Category 4, promising and acceptable treatment: the intervention has no support except general acceptance and clinical anecdotal literature; however, any evidence of possible harm excludes treatments from this category.
* Category 5, innovative and novel treatment: the intervention is not thought to be harmful, but is not widely used or discussed in the literature.
* Category 6, concerning treatment: the treatment has the possibility of doing harm, as well as having unknown or inappropriate theoretical foundations.
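The six categories described above can be read as a rough decision procedure. The Python sketch below is one possible reading, not the published protocol: the `EvidenceSummary` fields, the order of the checks, and the simplification that any possible harm without research support leads to Category 6 (ignoring the theoretical-foundation criterion and the requirement for a manual) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvidenceSummary:
    """Hypothetical summary of the published evidence for one intervention."""
    rcts_showing_advantage: int = 0                 # randomized controlled outcome studies favouring the treatment
    controlled_nonrandomized_support: bool = False  # positive nonrandomized studies with some form of control
    single_study_or_case_series: bool = False       # includes work with a different population
    generally_accepted: bool = False                # general acceptance and clinical anecdotal literature
    possible_harm: bool = False                     # any evidence of possible harm


def saunders_category(e: EvidenceSummary) -> int:
    """Return a category from 1 (well-supported) to 6 (concerning)."""
    if e.rcts_showing_advantage >= 2:
        return 1
    if e.controlled_nonrandomized_support:
        return 2
    if e.single_study_or_case_series:
        return 3
    # Assumption: with no research support, possible harm is treated as Category 6.
    if e.possible_harm:
        return 6
    if e.generally_accepted:
        return 4
    return 5


print(saunders_category(EvidenceSummary(rcts_showing_advantage=2)))  # 1
print(saunders_category(EvidenceSummary(generally_accepted=True)))   # 4
```

Checking research support before harm is a design choice of this sketch; the description above only states explicitly that evidence of harm excludes a treatment from Category 4.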

Khan et al.

A protocol for evaluation of research quality was suggested by a report from the Centre for Reviews and Dissemination, prepared by Khan et al. and intended as a general method for assessing both medical and psychosocial interventions. While strongly encouraging the use of randomized designs, this protocol noted that such designs were useful only if they met demanding criteria, such as true randomization and concealment of the assigned treatment group from the client and from others, including the individuals assessing the outcome. The Khan et al. protocol emphasized the need to make comparisons on the basis of "intention to treat" in order to avoid problems related to greater attrition in one group. The Khan et al. protocol also presented demanding criteria for nonrandomized studies, including matching of groups on potential confounding variables and adequate descriptions of groups and treatments at every stage, and concealment of treatment choice from persons assessing the outcomes. This protocol did not provide a classification of levels of evidence, but included or excluded treatments from classification as evidence-based depending on whether the research met the stated standards.

U.S. National Registry of Evidence-Based Practices and Programs

An assessment protocol has been developed by the U.S. National Registry of Evidence-Based Practices and Programs (NREPP). Evaluation under this protocol occurs only if an intervention has already had one or more positive outcomes reported with a probability of less than .05, if these have been published in a peer-reviewed journal or an evaluation report, and if documentation such as training materials has been made available. The NREPP evaluation assigns quality ratings from 0 to 4 to certain criteria. It examines the reliability and validity of outcome measures used in the research, evidence for intervention fidelity (predictable use of the treatment in the same way every time), levels of missing data and attrition, potential confounding variables, and the appropriateness of statistical handling, including sample size.
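The entry conditions and the 0 to 4 rating scale described above can be sketched in Python as follows. This is an illustrative sketch only: the function names, argument names and criterion labels are hypothetical, and because the text does not say how NREPP combines the individual ratings, no overall score is computed.

```python
def nrepp_eligible(positive_outcome_p_values, published, documentation_available):
    """Return True if the three entry conditions described above are met."""
    # At least one positive outcome must be reported with p < .05.
    has_significant_outcome = any(p < 0.05 for p in positive_outcome_p_values)
    return has_significant_outcome and published and documentation_available


def validate_ratings(ratings):
    """Check that every criterion rating lies on the 0 to 4 scale."""
    for criterion, score in ratings.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{criterion}: rating {score} is outside the 0 to 4 scale")
    return ratings


# Hypothetical intervention: one significant positive outcome, published, documented.
print(nrepp_eligible([0.03, 0.20], published=True, documentation_available=True))  # True

# Hypothetical ratings for some of the criteria named above.
validate_ratings({"reliability": 3, "validity": 4, "fidelity": 2, "missing data": 3})
```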

History

Canada

The term was first used in a 1979 report by the "Canadian Task Force on the Periodic Health Examination" (CTF) to "grade the effectiveness of an intervention according to the quality of evidence obtained". The task force used three levels, subdividing level II:
* Level I: Evidence from at least one randomized controlled trial.
* Level II1: Evidence from at least one well designed cohort study or case control study, preferably from more than one center or research group.
* Level II2: Comparisons between times and places with or without the intervention.
* Level III: Opinions of respected authorities, based on clinical experience, descriptive studies or reports of expert committees.

The CTF graded its recommendations on a 5-point A–E scale:
* A: Good level of evidence for the recommendation to consider a condition.
* B: Fair level of evidence for the recommendation to consider a condition.
* C: Poor level of evidence for the recommendation to consider a condition.
* D: Fair level of evidence for the recommendation to exclude the condition.
* E: Good level of evidence for the recommendation to exclude the condition from consideration.

The CTF updated its report in 1984, 1986 and 1987.

United States

In 1988, the United States Preventive Services Task Force (USPSTF) published its guidelines, based on those of the CTF, using the same three levels and further subdividing level II:
* Level I: Evidence obtained from at least one properly designed randomized controlled trial.
* Level II-1: Evidence obtained from well-designed controlled trials without randomization.
* Level II-2: Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
* Level II-3: Evidence obtained from multiple time series designs with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
* Level III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.

Over the years many more grading systems have been described.
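Because the USPSTF levels listed above form a strict ordering, they can be represented as an ordered enumeration. The Python sketch below is illustrative only: the class name, the member names and the helper `strongest` are hypothetical, and the numeric values merely encode the order of the list, with smaller values meaning stronger evidence.

```python
from enum import IntEnum


class USPSTFLevel(IntEnum):
    """Levels in the order listed above; a smaller value means stronger evidence."""
    I = 1      # at least one properly designed randomized controlled trial
    II_1 = 2   # well-designed controlled trials without randomization
    II_2 = 3   # well-designed cohort or case-control analytic studies
    II_3 = 4   # multiple time series designs; dramatic uncontrolled results
    III = 5    # opinions of respected authorities, descriptive studies


def strongest(levels):
    """Return the strongest level among the evidence available."""
    return min(levels)


# Hypothetical body of evidence for one intervention.
evidence = [USPSTFLevel.III, USPSTFLevel.II_2, USPSTFLevel.II_1]
print(strongest(evidence).name)  # II_1
```

Taking the single strongest level is a simplification; the hierarchy itself says nothing about how to combine several studies at different levels.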

United Kingdom

In September 2000, the Oxford (UK) Centre for Evidence-Based Medicine (CEBM) published its guidelines for 'Levels' of evidence regarding claims about prognosis, diagnosis, treatment benefits, treatment harms, and screening. They addressed not only therapy and prevention, but also diagnostic tests, prognostic markers, and harm. The original CEBM Levels were first released for Evidence-Based On Call to make the process of finding evidence feasible and its results explicit. As published in 2009 they are:
* 1a: Systematic reviews (with homogeneity) of randomized controlled trials
* 1b: Individual randomized controlled trials (with narrow confidence interval)
* 1c: All or none (when all patients died before the treatment became available, but some now survive on it; or when some patients died before the treatment became available, but none now die on it)
* 2a: Systematic reviews (with homogeneity) of cohort studies
* 2b: Individual cohort study or low quality randomized controlled trials (e.g. <80% follow-up)
* 2c: "Outcomes" research; ecological studies
* 3a: Systematic review (with homogeneity) of case-control studies
* 3b: Individual case-control study
* 4: Case series (and poor quality cohort and case-control studies)
* 5: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"

In 2011, an international team redesigned the Oxford CEBM Levels to make them more understandable and to take into account recent developments in evidence ranking schemes. The Levels have been used by patients and clinicians, and also to develop clinical guidelines, including recommendations for the optimal use of phototherapy and topical therapy in psoriasis and guidelines for the use of the BCLC staging system for diagnosing and monitoring hepatocellular carcinoma in Canada.

Global

In 2007, the World Cancer Research Fund grading system described 4 levels: Convincing, probable, possible and insufficient evidence. All Global Burden of Disease Studies have used it to evaluate epidemiologic evidence supporting causal relationships.

Proponents

Wilson et al. (1995), Hadorn et al. (1996) and Atkins et al. (1996) described and defended various types of grading systems.

Criticism

Use of evidence hierarchies was increasingly criticized in the 21st century, more than a decade after they were established. In 2011, a systematic review of the critical literature found three kinds of criticism: procedural aspects of EBM (especially from Cartwright, Worrall and Howick), greater than expected fallibility of EBM (Ioannidis and others), and EBM being incomplete as a philosophy of science (Ashcroft and others). Many critics have published in journals of philosophy, ignored by the clinician proponents of EBM.

Rawlins and Bluhm note that EBM limits the ability of research results to inform the care of individual patients, and that both population-level and laboratory research are necessary to understand the causes of diseases. The EBM hierarchy of evidence does not take into account research on the safety and efficacy of medical interventions. On this view, RCTs should be designed "to elucidate within-group variability, which can only be done if the hierarchy of evidence is replaced by a network that takes into account the relationship between epidemiological and laboratory research". The hierarchy of evidence produced by a study design has been questioned because guidelines have "failed to properly define key terms, weight the merits of certain non-randomized controlled trials, and employ a comprehensive list of study design limitations". Stegenga has specifically criticized the placement of meta-analyses at the top of such hierarchies. The assumption that RCTs ought necessarily to be near the top of such hierarchies has been criticized by Worrall and Cartwright.

In 2005, Ross Upshur noted that EBM claims to be a normative guide to being a better physician, but is not a philosophical doctrine. He pointed out that EBM supporters displayed "near-evangelical fervor", convinced of its superiority and ignoring critics who seek to expand the borders of EBM from a philosophical point of view. Borgerson wrote in 2009 that the justifications for the hierarchy levels are not absolute and do not epistemically justify them, but that "medical researchers should pay closer attention to social mechanisms for managing pervasive biases". La Caze noted that basic science resides on the lower tiers of EBM though it "plays a role in specifying experiments, but also analysing and interpreting the data." Concato argued in 2004 that the hierarchy allowed RCTs too much authority and that not all research questions could be answered through RCTs, for either practical or ethical reasons. Even when evidence is available from high-quality RCTs, evidence from other study types may still be relevant.

Stegenga opined that evidence assessment schemes are unreasonably constraining and less informative than other schemes now available. In his 2015 PhD thesis on the various hierarchies of evidence in medicine, Christopher J Blunt concludes that modest interpretations such as La Caze's model, conditional hierarchies like GRADE, and heuristic approaches as defended by Howick et al. all survive previous philosophical criticism, but argues that modest interpretations are so weak that they are unhelpful for clinical practice. For example, "GRADE and similar conditional models omit clinically relevant information, such as information about variation in treatments’ effects and the causes of different responses to therapy; and that heuristic approaches lack the necessary empirical support". Blunt further concludes that "hierarchies are a poor basis for the application of evidence in clinical practice", since the core assumption behind hierarchies of evidence, that "information about average treatment effects backed by high-quality evidence can justify strong recommendations", is untenable, and hence the evidence from individual studies should be appraised in isolation.

External links

* Evidence levels with explanations, an entry in the Centre for Evidence-Based Medicine
* Evidence-based medicine resources page, with a diagram showing different levels of evidence forming a pyramid
* Systematic database of 195 hierarchies of evidence in medicine up to 08/10/2020, by Christopher J Blunt for his PhD thesis