Systematic Reviews: All that glitters is not gold.

This is a short article I submitted to Science-Based Medicine a couple of years ago. It never made it to their blog, and I forgot about it. I recently came across it again and figured I should use it since I had already done the work, and lately there seem to be a lot of systematic reviews being published.

The Oxford Levels of Evidence rates systematic reviews as the highest level of evidence in the Therapeutic category. I’m really not a fan of the Levels of Evidence classification because I think it’s a double-edged sword that cuts the wielder more than the enemy.

Just as all trials are not created equal, despite basically falling into the level I or II category (if a study randomly allocated patients, the level of evidence is typically no lower than II; at the very worst, it would downrank to a III, no matter how poorly conducted), systematic reviews vary. I would argue that systematic reviews have the potential to vary even more because there is an extra level at which quality issues can arise.

Before we get into that extra level though, I think it’s worth talking about what makes a review systematic, as I’ve noticed an increase in the number of journal articles labelled as “systematic reviews” when, in fact, they were just “reviews”.

A systematic review is like any other experiment. It has a fairly set baseline design and a finite number of analysis options. Of the qualities that define what makes a review “systematic”, the a priori (i.e., before data is collected) search strategy and the a priori inclusion and exclusion criteria for eligible studies would, in my opinion, be the most important. Some argue that statistical synthesis is also necessary, but for some research questions, the literature is such that combining studies is not possible. Whether this a posteriori result should be given a different name than “systematic review” has yet to be discussed.

And like any other experiment, a systematic review has the potential of being designed or executed improperly, hence the Oxford caveat (which I have yet to ever see enacted) that studies can be downgraded based on quality. Bumbles in systematic reviews would include issues such as an incomplete search strategy that fails to capture all studies that would meet inclusion criteria (basically, a sampling bias), or statistically combining studies with high heterogeneity or incompatible outcomes or exposures.
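To make the heterogeneity point concrete, here is a minimal sketch of how between-study heterogeneity is commonly quantified before pooling, using Cochran's Q and the I² statistic. The effect sizes and standard errors are made up for illustration, and the function name is my own; this is not drawn from any particular review.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q and I-squared (as a percentage) for a set of trials.

    effects: per-trial effect estimates (e.g. mean differences)
    std_errors: per-trial standard errors of those estimates
    """
    # Inverse-variance weights: more precise trials count for more
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Q is the weighted sum of squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I-squared: share of total variation attributed to heterogeneity, not chance
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i_squared


# Hypothetical trials that disagree strongly with one another
example_effects = [0.1, 0.5, -0.3]
example_std_errors = [0.1, 0.1, 0.1]
q_stat, i2 = heterogeneity(example_effects, example_std_errors)
print(f"Q = {q_stat:.2f}, I^2 = {i2:.1f}%")
```

An I² this high suggests the trials are not estimating the same thing, which is exactly the situation where a single pooled number can mislead.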

However, unlike most randomized trials, the systematic review can also fail on a completely different level: that of the individual trials it includes. The distinguishing strength of a systematic review is the pooling and synthesis of individual trials to find an effect that no single trial can demonstrate, particularly when trials are in conflict with one another. However, the results of the pooling are only as robust as the quality of the individual trials themselves. The pooled estimate of effect is therefore only as good as the weakest study in the group, and this is amplified when the weakest study has the highest number of subjects.

The pooled estimate cannot distinguish good trials from poor trials. Much like one can inappropriately perform t-tests on non-continuous data, a pooled estimate can be calculated on any set of numbers. Pooled estimates that stem from poor trials, therefore, are still poor, and arguably despite the upgrade in level of evidence, shed no additional light on which clinical decisions can be made.
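As a rough illustration of this point, the sketch below computes a standard inverse-variance (fixed-effect) pooled estimate. The data are hypothetical: three small trials that find essentially no effect, plus one large trial (assume, for the sake of argument, that it is poorly conducted) reporting a big effect. Nothing in the arithmetic knows or cares about trial quality.

```python
import math


def pooled_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling.

    Returns the pooled estimate, its standard error, and each
    trial's share of the total weight.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    shares = [w / total for w in weights]
    return estimate, pooled_se, shares


# Hypothetical data: the last trial is large (small standard error)
# and is assumed to be the methodologically weakest of the four.
effects = [0.0, 0.05, -0.05, 0.60]
std_errors = [0.20, 0.20, 0.20, 0.05]

estimate, pooled_se, shares = pooled_fixed_effect(effects, std_errors)
print(f"pooled estimate = {estimate:.3f} (SE {pooled_se:.3f})")
print(f"weight carried by the large trial = {shares[-1]:.0%}")
```

With these made-up numbers the large trial carries over 80% of the weight, so the pooled estimate lands near 0.5 even though three of the four trials found nothing. The calculation is valid arithmetic either way; whether the result means anything depends on the trials that went in.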

The Cochrane Collaboration attempts to get at this issue by introducing the risk of bias table, which systematically evaluates several common sources of bias. Nevertheless, the risk of bias table doesn’t affect the forest plot (and pooled estimate), which is what everyone is reading. In reviews that include many trials, the risk of bias table can be onerous to wade through to determine which pooled estimates can be reliably interpreted.

Teachers of EBM tend to recommend that their students turn to systematic reviews first, since these are supposed to take all of the literature into account and provide not only literary, but also statistical, synthesis of information. The problem is that this is generally where instruction stops, and the Levels of Evidence take over as some sort of validated proxy for quality.

All that glitters is not gold. A systematic review can be a powerful tool, but has a higher chance of being blundered by inherent methodological error or sabotaged by poor protoplasmic data, and therefore requires, arguably, more attention and caution in interpretation.

  • Annemarie Jutel

    I’ve got additional concerns about the systematic review (see my article on this subject). While the quality of method is an important component of intervention-based studies, not everything we study is about interventions (particularly when we are studying social concepts), and when we are theorising particular explanations, it’s important that we can think more broadly than just method. The importance of argument, creative thinking, and linking ideas that have not previously been associated emerge more readily from traditional reviews of the literature. To solve problems in fitness, medicine, or anywhere, we need to have myriad ways of approaching what the field sees as important. I believe we have over-privileged the systematic review over a number of other worthy approaches… What do you think?

    • evidencebasedfitness

      Sorry for the incredibly late reply:

      The dominance of the systematic review of late, I feel (and it’s just my feeling, nothing more), comes from the assumption that evidence-based practice is something that can be didactically taught in a small number of hours. In order to fulfill an EBP mandate, schools reduce the practice of EBP (yeah, it’s looping on itself, I know) to the bare minimum, simplified version of it. This, in turn, results in users of EBP resorting to blind rules, because understanding the underpinning of those rules requires a far more extensive education that just isn’t possible in a normal professional-degree curriculum. The knock-on effect of _that_ is that the blindly-followed hierarchy holds more sway than it should, merely from a numbers perspective. The masses appreciate and give positive feedback on reviews as desirable, which then propels the method to dominance by some warped democracy.

      Any methodologist knows that all tools have their uses and that the art of research is in selecting the right tool for the job; not necessarily tying oneself to any single tool for all questions. Whether we _need_ a myriad of ways depends entirely on the questions that need to be answered; maybe we do, and maybe we don’t. But to loop back again, practicing evidence-based practice means understanding the why of the decision; not just the what, and similarly, that is what should be dictating the predominant methods in any field.
