Systematic Reviews: All that glitters is not gold.

This is a short article I submitted to Science-Based Medicine a couple of years ago. It never made it to their blog, and then I forgot about it. I recently came across it again and figured I should use it since I had already done the work, and lately there seem to be a lot of systematic reviews being published.

The Oxford Levels of Evidence rates systematic reviews as the highest level of evidence in the Therapeutic category. I’m really not a fan of the Levels of Evidence classification because I think it’s a double-edged sword that cuts the wielder more than the enemy.

Just as not all trials are created equal, despite basically falling into the level I or II category (if a study randomly allocated patients, the lowest level of evidence is II; at the very worst, it would be downgraded to a III, no matter how poorly conducted), systematic reviews vary. I would argue that systematic reviews have the potential to vary even more because there is an extra level of quality issue.

Before we get into that extra level of quality issue, though, I think it’s worth talking about what makes a review systematic, as I’ve noticed an increase in the number of journal articles labelled as “systematic reviews” when, in fact, they are just “reviews”.

A systematic review is like any other experiment. It has a fairly set baseline design and a finite number of analysis options. Of the qualities that define what makes a review “systematic”, the a priori (i.e., decided before data is collected) search strategy and the a priori inclusion and exclusion criteria for eligible studies would, in my opinion, be the most important. Some argue that statistical synthesis is also necessary, but for some research questions, the literature is such that combining studies is not possible. Whether this a posteriori result should be given a different name than “systematic review” has yet to be discussed.

And like any other experiment, a systematic review has the potential of being designed or executed improperly, hence the Oxford caveat (which I have yet to ever see enacted) that studies can be downgraded based on quality. Bumbles in systematic reviews would include issues such as an incomplete search strategy that fails to capture all studies that would meet inclusion criteria (basically, a sampling bias), or statistically combining studies with high heterogeneity or incompatible outcomes or exposures.
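
As a rough illustration of the heterogeneity point, here is a minimal Python sketch of Cochran's Q and the I² statistic, two commonly used summaries of how much trial results disagree beyond what chance would explain. The function name and the input numbers are hypothetical, chosen only for illustration, and this is a sketch rather than a replacement for a proper meta-analysis package.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2 in percent) for a set of trial effect estimates.

    effects   -- per-trial effect estimates (e.g. mean differences)
    variances -- per-trial variances of those estimates
    """
    # Inverse-variance weights: more precise trials count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2: share of total variation attributed to between-trial differences.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical example: two equally precise trials that disagree sharply.
q_stat, i2_percent = heterogeneity([0.0, 1.0], [0.01, 0.01])
```

A high I² does not forbid pooling, but it is a warning that a single summary number may be averaging over trials that are not measuring the same thing.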

However, a systematic review can also fail at a completely different level, one that most randomized trials do not face: the level of the individual trials it includes. The distinguishing strength of a systematic review is the pooling and synthesis of individual trials to find an effect that no single trial can demonstrate, particularly when trials are in conflict with one another. However, the results of the pooling are only as robust as the individual trials themselves. The pooled estimate of effect is therefore only as good as the weakest study in the group, and this is amplified when the weakest study has the highest number of subjects.
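
To make the weighting concrete, here is a minimal Python sketch of fixed-effect, inverse-variance pooling, which is one common way of combining trials (not the only one). The function name and every number below are made up for illustration. Larger trials usually have smaller standard errors, so they receive more weight regardless of how well they were conducted.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling.

    Returns (pooled estimate, pooled standard error, weight share per trial).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    shares = [w / total for w in weights]
    return pooled, pooled_se, shares


# Hypothetical data: three small trials showing no effect, plus one
# large trial (small standard error) showing a sizeable effect.
effects = [0.0, 0.0, 0.0, -0.5]
std_errors = [0.2, 0.2, 0.2, 0.05]

pooled, pooled_se, shares = pool_fixed_effect(effects, std_errors)
# shares[-1] is the fraction of total weight carried by the large trial.
```

In this made-up example the large trial carries most of the weight, so the pooled estimate sits close to its result. Nothing in the arithmetic asks whether that trial was well conducted.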

The pooled estimate cannot distinguish good trials from poor trials. Much like one can inappropriately perform t-tests on non-continuous data, a pooled estimate can be calculated on any set of numbers. Pooled estimates that stem from poor trials are therefore still poor and, arguably, despite the upgrade in level of evidence, shed no additional light on which clinical decisions can be made.

The Cochrane Collaboration attempts to get at this issue by introducing the risk of bias table, which systematically evaluates three common sources of bias. Nevertheless, the risk of bias table doesn’t affect the forest plot (and pooled estimate), which is what everyone is reading. In reviews that include many trials, the risk of bias table can be onerous to wade through to determine which pooled estimates can be reliably interpreted.
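
One way a reader can connect the two is a sensitivity analysis: recompute the pooled estimate using only the trials judged to be at low risk of bias and see how far it moves. The Python sketch below assumes each trial's risk of bias has been collapsed into a single yes/no flag, which is a simplification I am making for illustration and not how the Cochrane table is structured. The trial names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    name: str
    effect: float             # effect estimate, e.g. a mean difference
    std_error: float          # standard error of that estimate
    high_risk_of_bias: bool   # simplified single flag (an assumption)


def pooled_effect(trials):
    """Fixed-effect, inverse-variance pooled estimate."""
    weights = [1.0 / t.std_error ** 2 for t in trials]
    return sum(w * t.effect for w, t in zip(weights, trials)) / sum(weights)


# Hypothetical review: the two most precise trials are also the ones
# flagged as high risk of bias.
trials = [
    Trial("A", -0.40, 0.10, True),
    Trial("B", -0.35, 0.12, True),
    Trial("C", -0.05, 0.15, False),
    Trial("D", 0.00, 0.20, False),
]

all_trials = pooled_effect(trials)
low_risk_only = pooled_effect([t for t in trials if not t.high_risk_of_bias])
```

If the two numbers differ substantially, as they do with these made-up inputs, the headline pooled estimate is being driven by the trials you have the least reason to trust.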

Teachers of EBM tend to recommend that their students turn to systematic reviews first, since they’re supposed to take all of the literature into account and provide not only literary, but also statistical synthesis of information. The problem is that this is generally where instruction stops, and the Levels of Evidence take over as some sort of validated proxy for quality.

All that glitters is not gold. A systematic review can be a powerful tool, but it has a higher chance of being undermined by inherent methodological error or sabotaged by poor underlying data, and therefore requires, arguably, more attention and caution in interpretation.
