History
The start of experimental benchmarking in social science is often attributed to Robert LaLonde, who showed in 1986 that several econometric procedures used to assess the effect of an employment program on trainee earnings failed to recover the program's experimentally estimated effects. Experimental benchmarking is also conducted in medical research; examples include Schnell‐Inderst et al. (2017) and Burden et al. (2017).

Procedural Considerations
The most instructive experimental benchmarking designs are conducted at a large scale and compare experimental and non-experimental studies of the same outcome in the same population.

Observational Designs That Can Be Assessed with Benchmarking
Non-experimental, or observational, research designs compare treated to untreated subjects while controlling for background attributes, called covariates. This estimation approach is also called covariate adjustment. Covariates are attributes that exist prior to treatment and therefore cannot be changed by it; examples include age, gender, weight, and hair color. For instance, researchers interested in the effect of smoking-cessation classes on the number of cigarettes smoked per day might use covariate adjustment to control for ethnicity, income, and the number of years a subject has smoked. Covariate adjustment can be carried out in a variety of ways; Gordon et al. (2018) illustrate many of these methods using online advertising data.

Selected Examples of Experimental Benchmarking
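The covariate adjustment described in the preceding section can be sketched with simulated data (a minimal illustration with invented numbers, not the estimator or data of any study cited here). Because treatment assignment is confounded with the covariates, a naive difference in group means is biased, while regression adjustment approximately recovers the true effect:

```python
import numpy as np

# Synthetic smoking-cessation example (hypothetical numbers):
# estimate a treatment effect by regressing the outcome on a
# treatment indicator plus pre-treatment covariates.
rng = np.random.default_rng(0)
n = 5000

# Pre-treatment covariates: age and years the subject has smoked.
age = rng.normal(45.0, 10.0, n)
years = rng.normal(20.0, 5.0, n)

# Confounded assignment: older, longer-term smokers are more
# likely to enroll in the cessation class.
logit = 0.03 * (age - 45.0) + 0.05 * (years - 20.0)
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Outcome: cigarettes per day; the true effect of the class is -5.
y = 15.0 + 0.10 * age + 0.30 * years - 5.0 * treated + rng.normal(0.0, 2.0, n)

# A naive comparison of group means is biased by the confounding.
naive = y[treated].mean() - y[~treated].mean()

# Covariate adjustment via ordinary least squares: the coefficient
# on the treatment indicator is the adjusted effect estimate.
X = np.column_stack([np.ones(n), treated.astype(float), age, years])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = beta[1]

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

In this sketch the adjusted estimate lands near the true effect of -5, while the naive difference in means is pulled toward zero because treated subjects start with higher baseline consumption. An experimental benchmark plays the role of the known truth here: it tells us which observational estimate to trust.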
Bloom et al. (2002) study the impact of mandatory welfare-to-work programs to ask which non-experimental methods come closest to recovering the experimentally estimated effects of such programs, and whether the most accurate non-experimental methods are accurate enough to take the place of experimental work. They ultimately argue that none of the methods approach the accuracy of experimental methods for recovering the parameter of interest. Dehejia and Wahba (1999) examine LaLonde's (1986) data with additional non-experimental estimators. They argue that when there is enough overlap between subject pools and unobservable covariates do not affect outcomes, non-experimental methods can indeed estimate treatment impact accurately. Glazerman, Levy and Myers (2003) perform experimental benchmarking in the context of employment services, welfare and job training. They find that non-experimental methods may approximate experimental estimates, but that the remaining bias can be large enough to affect policy analysis and implementation. Gordon et al. (2018) use data from large-scale advertising field experiments at Facebook to compare observational estimates of advertising effectiveness against experimental benchmarks.

References
Bloom, H. S., Michalopoulos, C., Hill, C. J., & Lei, Y. (2002). Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? MDRC Working Papers on Research Methodology.

Burden, A., Roche, N., Miglio, C., Hillyer, E. V., Postma, D. S., Herings, R. M., Overbeek, J. A., Khalid, J. M., van Eickels, D., … Price, D. B. (2017). An evaluation of exact matching and propensity score methods as applied in a comparative effectiveness study of inhaled corticosteroids in asthma. Pragmatic and Observational Research, 8, 15-30. doi:10.2147/POR.S122563

Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.

Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589(1), 63-93.

Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2018). A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. papers.ssrn.com/sol3/papers.cfm?abstract_id=3033144

Schnell‐Inderst, P., Iglesias, C. P., Arvandi, M., Ciani, O., Matteucci Gothe, R., Peters, J., ... & Siebert, U. (2017). A bias‐adjusted evidence synthesis of RCT and observational data: the case of total hip replacement. Health Economics, 26, 46-69.

Smith, J., & Todd, P. (2001). Reconciling conflicting evidence on the performance of matching methods? American Economic Review, Papers and Proceedings, 91(2), 112-118.

Further reading
Medicine
Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 159-183.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.

Social Sciences
Smith, J., & Todd, P. (2005). Does matching overcome LaLonde's critique of nonexperimental methods? Journal of Econometrics, 125(1-2), 305-353.