In statistics, circular analysis is the selection of the details of a data analysis using the data that is being analysed. It is often referred to as double dipping, as one uses the same data twice. Circular analysis unjustifiably inflates the apparent statistical strength of any results reported and, in the most extreme case, can lead to an apparently significant result being found in data that consists only of noise. In particular, where an experiment is implemented to study a postulated effect, it is a misuse of statistics to initially reduce the complete dataset by selecting a subset of the data in ways that are aligned with the effects being studied. A second misuse occurs where the performance of a fitted model or classification rule is reported as a raw result, without allowing for the effects of model selection and the tuning of parameters based on the data being analysed.
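The first misuse can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function name `simulate`, the group sizes, and the number of features are all hypothetical choices, and the data are pure Gaussian noise with no true difference between the two groups. Features are selected because they show a large group difference, and that difference is then measured both on the same data (circular) and on independent data.

```python
import random
import statistics


def simulate(n_features=500, n_per_group=20, n_select=10, seed=0):
    """Compare a circular estimate with an independent one on pure noise."""
    rng = random.Random(seed)

    def noise_group():
        # Every feature is N(0, 1) noise, so no true group effect exists.
        return [[rng.gauss(0, 1) for _ in range(n_features)]
                for _ in range(n_per_group)]

    a_sel, b_sel = noise_group(), noise_group()  # data used for selection
    a_new, b_new = noise_group(), noise_group()  # independent data

    def diff(a, b, j):
        # Difference in group means for feature j.
        return (statistics.fmean(row[j] for row in a)
                - statistics.fmean(row[j] for row in b))

    # Select the features with the largest A-minus-B difference.
    ranked = sorted(range(n_features),
                    key=lambda j: diff(a_sel, b_sel, j), reverse=True)
    chosen = ranked[:n_select]

    # Circular: measure the effect on the data that drove the selection.
    circular = statistics.fmean(diff(a_sel, b_sel, j) for j in chosen)
    # Independent: measure the same features on data not used for selection.
    independent = statistics.fmean(diff(a_new, b_new, j) for j in chosen)
    return circular, independent


circular, independent = simulate()
print(f"circular estimate:    {circular:.2f}")
print(f"independent estimate: {independent:.2f}")
```

Under these assumptions the circular estimate should come out clearly positive even though no effect exists, while the independent estimate should stay close to zero.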

Examples

At its simplest, circular analysis can consist of a decision to remove outliers after noticing that doing so might improve the analysis of an experiment. The effect can be more subtle. In functional magnetic resonance imaging (fMRI) data, for example, a considerable amount of pre-processing is often needed, and the pre-processing steps might be applied incrementally until the analysis 'works'. Similarly, the classifiers used in a multivoxel pattern analysis of fMRI data require parameters, which could be tuned to maximise the classification accuracy. In geology, the potential for circular analysis has been noted in the case of maps of geological faults: these may be drawn on the basis of an assumption that faults develop and propagate in a particular way, and the same maps may later be used as evidence that faults do actually develop in that way.
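The outlier example can be sketched as a simulation. The code below is a rough illustration, not a prescribed procedure: the function names, the outlier cutoffs, and the sample sizes are hypothetical, and the critical value of 2.02 is an approximation that is reused unchanged after trimming. Both groups are drawn from the same distribution, so every "significant" result is a false positive.

```python
import math
import random
import statistics

# Approximate two-sided 5% critical value for about 38 degrees of freedom.
# Reused unchanged after trimming, which is itself only an approximation.
T_CRIT = 2.02


def t_stat(a, b):
    # Welch-style t statistic for the difference in means.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def trim(xs, k):
    # Drop values more than k sample standard deviations from the mean.
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return [x for x in xs if abs(x - m) <= k * s]


def false_positive_rates(n_sims=1000, n=20, cutoffs=(2.5, 2.0, 1.5), seed=1):
    rng = random.Random(seed)
    fixed_hits = flexible_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: no true effect.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # Fixed analysis: one test, decided in advance, no outlier removal.
        fixed = abs(t_stat(a, b)) > T_CRIT
        # Flexible analysis: also try each cutoff and keep any that 'works'.
        flexible = fixed or any(
            abs(t_stat(trim(a, k), trim(b, k))) > T_CRIT for k in cutoffs
        )
        fixed_hits += fixed
        flexible_hits += flexible
    return fixed_hits / n_sims, flexible_hits / n_sims


fixed_rate, flexible_rate = false_positive_rates()
print(f"fixed analysis:    {fixed_rate:.3f}")
print(f"flexible analysis: {flexible_rate:.3f}")
```

Because the flexible analysis includes the fixed one, its false-positive rate can never be lower. In this sketch it should exceed the nominal level, because the outlier rule is chosen after looking at the result.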

Solutions

Careful design of the analysis one plans to perform, prior to collecting the data, means that the choice of analysis is not affected by the data collected. Alternatively, one might decide to perfect the classification on one or two participants, and then apply the resulting analysis to the data from the remaining participants. Regarding the selection of classification parameters, a common method is to divide the data into two sets, find the optimum parameter using one set, and then test using this parameter value on the second set. This is a standard technique used, for example, by the Princeton MVPA classification library.
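The two-set approach to parameter selection can be sketched as follows. This is a minimal illustration and not the interface of the Princeton MVPA library or any other toolbox: the classifier (a sign test on a single feature), the function names, and the data sizes are hypothetical. The labels are random and unrelated to the features, so the true accuracy of any classifier is 50%.

```python
import random


def accuracy(rows, labels, feature, sign):
    # Predict class 1 when sign * value > 0, then score against the labels.
    hits = sum((sign * row[feature] > 0) == bool(y)
               for row, y in zip(rows, labels))
    return hits / len(rows)


def tune_then_test(n_samples=200, n_features=100, seed=2):
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n_features)]
            for _ in range(n_samples)]
    # Labels are independent of the features: chance accuracy is 0.5.
    labels = [rng.randint(0, 1) for _ in range(n_samples)]

    # Divide the data into a tuning set and a test set.
    half = n_samples // 2
    tune_x, tune_y = rows[:half], labels[:half]
    test_x, test_y = rows[half:], labels[half:]

    # The 'parameter' is the (feature, sign) pair; tune it on the first set.
    candidates = [(f, s) for f in range(n_features) for s in (1, -1)]
    best = max(candidates, key=lambda p: accuracy(tune_x, tune_y, *p))

    tuned = accuracy(tune_x, tune_y, *best)     # circular: reuses tuning data
    held_out = accuracy(test_x, test_y, *best)  # tested on the second set
    return tuned, held_out


tuned, held_out = tune_then_test()
print(f"accuracy on tuning set: {tuned:.2f}")
print(f"accuracy on test set:   {held_out:.2f}")
```

Reporting the tuning-set accuracy would be a circular analysis, since the parameter was chosen to maximise exactly that number. The test-set accuracy is the figure that can be reported without double dipping.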
