This entry’s title is most definitely not, "Ice, Ice Baby"

On Nov 12, 2007

But maybe I should call it, “Just because there are mistakes, doesn’t mean it’s all bad.”

Ice–the ubiquitous item in every trainer, coach, therapist, doctor’s arsenal. Whether you use a frozen bag of peas, a “magic bag”, or actual ice attached via several feet of cling wrap, or actual cold-water immersion, there are many reasons to use ice, or more fancily, “cryotherapy”. One of these reasons is delayed-onset muscle soreness, or DOMS. DOMS is the pain that you experience 1-2 days after a workout, usually after a significant change in your routine or program. It goes away on its own, but can, in some cases, hinder training, since athletes who are sore after strength training may not be able to train to the same level in their sport. Some sport teams encourage or even mandate contrast baths post-training, for a variety of reasons. I don’t know of any lay-people that fill their tubs with ice water, but I’m sure someone will tell me that they, or someone they know does. So, the question is, “Does cold-water immersion reduce DOMS?” Is it worth getting into a tub up to your waist in ice-cold water?

Sellwood KL, Brukner PB, Williams D et al. Ice-water immersion and delayed-onset muscle soreness: a randomised controlled trial. British Journal of Sports Medicine, 41: 392-397, 2007.


The mechanisms by which DOMS actually occurs are still relatively poorly understood. We do know that there are structural changes seen on microscopy and biochemical changes in serum levels of things like creatine kinase and prostaglandins. We also know that DOMS tends to manifest more after eccentric exercise. Ice-water immersion is used, particularly by high-level athletes, to minimize DOMS, and it is theorized that it decreases inflammation and also causes blood vessel constriction, so as to prevent some of the swelling (which is also part of the inflammatory process). There have been other studies looking at ice-water immersion on DOMS, but they haven’t been very good. The authors give three reasons why previous trials have tried and failed: most were underpowered (i.e. not enough people), were not blinded, or used resistance-trained people (thus decreasing the likelihood of DOMS). So, these authors decided to put it to the test properly. In Oz, they seem to use a one-minute-on, one-minute-off cycle, for three immersions.


This was a well-reported study. There are a few issues that I have concerns with, but on the whole, there were almost no elements in this paper that I found were missing.

This study recruited volunteers from the University of Melbourne using posters around the schools of physiotherapy and medicine. Subjects had to be older than 18 years of age. They could not have performed any eccentric quadriceps exercise for 3 months prior. They could not have any neurological disease in the lower limbs, could not have any current injury to the lower limbs, could not be diabetic or have a disease for which cold-water immersion would not be allowed (e.g. Raynaud’s phenomenon). They also had to understand English.

All subjects went through a protocol to determine their 1RM for a seated leg extension on their non-dominant leg. They then went through 5 sets of 10 reps of _eccentric_ leg extensions using 120% of their 1RM. The subjects got one minute of rest between sets.

Subjects were randomly allocated to receive either a) an ice-water bath, or b) a warm-water bath. The ice-water bath was “…melting ice water at 5 (plus or minus 1) degree Celsius.” (That’s 41 F, for the backward countries who refuse to join the rest of the civilised world 😛 ) The warm-water bath was 24 degrees Celsius. Subjects had to stand submerged up to the level of the anterior superior iliac spines (basically just below your belly button). Three sets of one-minute-in, one-minute-out were done. [Disappointingly, the authors were a little sparse on their reporting of randomization, stating that the sequence was generated using a random numbers table (which is fine), but they didn’t say whether subjects were allocated using any kind of blocking or whether it was just simple randomization. The fact that they ended up with exactly 20 people in each group is somewhat fortuitous for them.] They did mention that the evaluators of the outcomes were blinded though, which is a plus, and also mentioned that subjects were not told which intervention was considered therapeutic (which is an excellent way, if you can ethically justify it, and you can, to blind subjects from whom you cannot conceal the actual treatment).
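To put a rough number on “fortuitous”: if the allocation really was simple randomization, and I’m assuming that means an independent fair coin flip for each of the 40 subjects (the paper doesn’t say), the chance of an exact 20/20 split can be worked out directly. This is a back-of-the-envelope sketch with a function name of my own, not a reconstruction of what the authors did.

```python
from math import comb


def prob_exact_split(n_subjects: int, n_in_group_a: int) -> float:
    """Probability that simple (fair coin-flip) randomization puts exactly
    n_in_group_a of n_subjects into group A.

    ASSUMPTION: each subject is allocated independently with probability 0.5.
    """
    # Number of ways to choose who lands in group A, over all 2^n allocations.
    return comb(n_subjects, n_in_group_a) / 2 ** n_subjects


p_even = prob_exact_split(40, 20)
print(f"P(exactly 20 vs 20) = {p_even:.3f}")
```

Under that assumption it works out to roughly one in eight, so an even split is possible without blocking, but far from guaranteed.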

The subjects came back at 24, 48 and 72 hours after their eccentric workout, and filled out visual analogue scales rating their quad pain for:
-pain on sit-to-standing
-passive quadriceps stretch
-one-legged hop for distance (and distance was also recorded for this test as a measure of quad function)
-maximal isometric contraction

They were also tested for tenderness on pressure, which was assessed using a pressure algometer, which is basically a device that can measure how much pressure is being delivered through it. They exerted a force of 6 pounds per square cm (what an odd mix of metric and imperial…) on two standardized points of the quads and asked the subjects to rate their pain during pressure.

Subjects’ thigh circumference was also measured and recorded at two standardized reference points. Blood work was drawn to measure creatine kinase (CK) levels.

The authors calculated a sample size of 30 subjects to detect a 25% difference between the two groups (i.e. the cold-water group were expected to have at least 25% less pain than the warm-water group). They based their calculation on the fact that a previous study found that there was an average increase of 69mm on the 100mm VAS for pain at 48 hours after eccentric exercise.

[This will prove to be the Achilles heel later on. They decided to recruit 40 subjects in case people dropped out, which is also good planning. However, they used an alpha level of 0.05 and a power of 0.8 (i.e. a beta of 0.2) to calculate this sample size, which is puzzling because their alpha level when it came to analysing their data was 0.01. What saves them in the end here is that with an alpha level of 0.01, they needed 21 subjects per group, and they ended up recruiting 40. Unfortunately, it doesn’t save them enough. Read on.]


The authors used an intention-to-treat analysis (which means that regardless of whether someone stayed in the study or not, or whether they went off on their own or not to immerse themselves in freezing water, they were part of the analysis and in the group they were randomly allocated to), which is pretty much the accepted standard in randomized controlled trials. They carried the last value forward for any missing values (also the going standard, which many studies don’t do).

[Again, disappointingly, they decided on an alpha level of 0.01 to protect against a type I error (finding a significant difference when one does not truly exist) because of the number of significance tests they were going to perform. I stopped counting at 50. On a conservative Bonferroni adjustment, an alpha level of 0.01 would be the appropriate adjustment for 5 significance tests (0.05 divided by 5). So, even with the more conservative alpha level, 50 tests is just downright inappropriate. This is a case of poor prioritization: they never defined a single primary outcome. However, despite the gross error of judgement, it surprisingly doesn’t really affect the conclusions of the study all that much.]
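For anyone who wants to check the arithmetic, here is a minimal sketch of the Bonferroni logic. The function names are my own, and the family-wise error calculation assumes the tests are independent, which the outcomes in this study (repeated pain ratings on the same people) almost certainly are not, so treat that second number as a rough illustration only.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the overall type I error rate at or
    below family_alpha across n_tests (Bonferroni adjustment)."""
    return family_alpha / n_tests


def familywise_error(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests.

    ASSUMPTION: tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - per_test_alpha) ** n_tests


# Per-test alpha needed for 5 tests versus 50 tests at an overall 0.05.
print(bonferroni_alpha(0.05, 5))
print(bonferroni_alpha(0.05, 50))

# What 50 tests at alpha = 0.01 each would let through under independence.
print(round(familywise_error(0.01, 50), 3))
```

Under those assumptions, 50 tests at 0.01 each leave somewhere around a 40% chance of at least one spurious “significant” result, which is why a few stray p-values in a study like this shouldn’t excite anyone.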


First off, the authors reported a few demographic statistics with respect to age, body-mass index and so on, but then went on to state that, “No significant difference was noted between the participants in the two treatment groups at baseline…”

[It is a well-established caution that significance testing on baseline values in the context of a randomized controlled trial is inappropriate. This is for 2 reasons: 1) You cannot use classic significance tests to positively find “no difference”. You can only find that there is insufficient evidence that a difference exists. Absence of evidence is not the same as evidence of absence. 2) A significance test does not tell you, “No difference exists between the two groups,” as most beginners will tend to tell you (for the reason stated in number 1 of this list). What the p-value gives you is the probability of observing data as or more extreme than the observed data if chance alone were at work. However, in the case of randomization, the group a person ends up in IS up to random chance! So the probability that a baseline difference arose by random chance is…1! So the interpretation of a significant p-value in baseline comparisons is problematic at best, and completely nonsensical at worst.]
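A quick simulation makes the second point concrete. In this sketch (my own toy setup, not the study’s data), 40 people are drawn from a single population, split at random into two groups of 20, and a baseline characteristic is compared with a simple z-test that assumes a known standard deviation of 1. Any “significant” baseline difference here is a false alarm by construction.

```python
import random
from statistics import NormalDist, mean


def baseline_p_value(rng: random.Random, n_per_group: int = 20) -> float:
    """Two-sided z-test p-value for a baseline difference between two
    randomly allocated groups drawn from ONE population (mean 0, SD 1)."""
    values = [rng.gauss(0.0, 1.0) for _ in range(2 * n_per_group)]
    rng.shuffle(values)  # random allocation to the two groups
    group_a, group_b = values[:n_per_group], values[n_per_group:]
    # Standard error of a difference in means with known SD = 1.
    standard_error = (2 / n_per_group) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(n_trials: int = 2000, alpha: float = 0.05, seed: int = 1) -> float:
    """Fraction of simulated trials whose baseline test comes out 'significant'."""
    rng = random.Random(seed)
    hits = sum(baseline_p_value(rng) < alpha for _ in range(n_trials))
    return hits / n_trials


print(false_alarm_rate())
```

The rate should hover around alpha (about 5% here), which is what chance alone delivers no matter how well the randomization was done.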

I’m not going to go through every significance test that the authors did here. The bottom line is that apart from a few significant p-values in tests that weren’t that important, the authors failed to find a significant difference in any of the outcomes that actually mattered. This is why the more conservative alpha level, though an inappropriate way to deal with multiple comparisons when you’re planning more than FIFTY tests, is not that big of a deal in this case. They did find that the ice-water group had “significantly” more pain at 24h than the warm-water group, but with over 50 tests, there are bound to be a few spurious p-values. I certainly would not agree with the statement that ice-water immersion “…may make athletes more sore the following day” on this basis. That’s data fishing.


The highest median pain score in this study was 38mm (interquartile range 13.8-55.0mm), which is FAR below the score we expected to see given the previous study’s mean pain score of 69mm. So, unfortunately, even though they recruited 40 subjects, if they wanted to detect a difference of 25%, they would have needed at least 45 people in each group (with an alpha of 0.05) or 67 in each group (with an alpha of 0.01). The problem with using a percentage as your criterion for practical relevance is that the estimation equation for sample size doesn’t care; it only cares about the absolute difference between the two groups (and the variance within each group). So, while 75% of 69mm is 51.75mm, for an absolute difference of 17.25mm, 75% of 38mm is 28.5mm, for an absolute difference of only 9.5mm. It is invariably tougher to detect a smaller difference than it is to detect a larger one. So, for all its efforts and criticism of previous underpowered studies, this one is, alas, underpowered. But, as with the closing sentence of many paragraphs in this review, this “mistake” is also somewhat moot.
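Here is a rough sketch of that point using the standard normal-approximation formula for comparing two means: n per group = 2 × ((z for alpha + z for power) × SD / difference)². The SD of 16mm is a value I picked for illustration because it gives numbers close to the ones quoted above; it is not taken from the paper, and the authors may well have used a different formula.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta_mm: float, sd_mm: float,
                alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate subjects needed per group to detect an absolute
    difference of delta_mm between two means (two-sided test,
    normal approximation, equal group sizes and equal SDs)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return 2 * ((z_alpha + z_power) * sd_mm / delta_mm) ** 2


SD_MM = 16.0  # HYPOTHETICAL within-group SD, chosen for illustration only

for expected_pain_mm in (69.0, 38.0):
    delta = 0.25 * expected_pain_mm  # a "25% difference" in absolute terms
    for alpha in (0.05, 0.01):
        needed = ceil(n_per_group(delta, SD_MM, alpha))
        print(f"pain {expected_pain_mm}mm, delta {delta}mm, "
              f"alpha {alpha}: {needed} per group")
```

The required sample size scales with the inverse square of the absolute difference, so roughly halving the difference you want to detect roughly quadruples the number of subjects you need, whatever SD you plug in.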


As with any critical review of a study, it’s easy to poke holes in things. This is by far the excuse I have heard the most from people who would prefer not to use studies as evidence for why things work or don’t work. “You can find a study to prove anything,” is the second. But the trained reviewer understands that it’s not enough to poke holes in papers–you have to understand how the hole ultimately affects the study’s conclusions. In this study’s case, it wasn’t really that important that the authors did a bajillion significance tests (I think bajillion is somewhere higher than 50, but less than a gazillion) because they didn’t actually affect the study’s conclusion that they failed to find a difference in pain reduction between ice-water and tepid-water immersion.

The authors acknowledge the limitation of their study in that they failed to elicit as high a pain as other studies (and more importantly the study they based their sample size calculation on), or alternatively, maybe they just had tough-as-nails subjects who didn’t rate their pain very high. They said that the strength deficits were pretty small with respect to the DOMS, and also the CK levels didn’t rise as much as other studies, but honestly…5 sets of 10 reps of eccentric leg extensions at 120% of your 1RM (if it’s a true 1RM, and there is a debate as to whether an untrained individual can generate a true 1RM) seems like more than I would do or would recommend as a strength workout, so how much more would be comparable to a “trained” or “athletic” workout? And do those other studies demand workloads that are far in excess of what is actually done in the “real world” in an effort to create SUPER SOUL-ANNIHILATING DOMS!!!! ™ ?

And, as always, the authors hedged on the fact that maybe ice-baths have a psychological benefit to athletes (quite like taping!), and that even if there might not be any benefit (from both a physiological and a pain perspective), who’s to take that value away from an athlete who might become mentally crippled if he/she were unable to take an ice-bath or tape their ankles?

Looking at the numbers, I come away from this study with 2 thoughts (apart from the ones above): 1) Maybe DOMS isn’t that crippling for most people, and if it is, maybe we should be asking whether that kind of training load is necessary rather than trying to come up with ways of preventing or treating DOMS; and 2) Regardless of the power of the study, if it’s an accurate picture of what DOMS pain looks like, I don’t need a statistical test to tell me getting in a tub of ice-water up to my belly button is an experience I could probably go without, because the difference between warm and ice-water is tiny.

The bottom line:

Getting into a tub of testicle-shrinking ice-cold water is probably unnecessary for most of us in terms of preventing or treating DOMS. It’s probably more fun if there’s vodka and a sauna involved though (but they didn’t study that).
