01 July 2018

This researcher compared two identical numbers. The effect size he obtained will shock you!

Here's an extract from an article that makes some remarkable claims about the health benefits of drinking green tea. The article itself seems to merit scrutiny for a number of reasons, but here I just want to look at a point that illustrates why (as previously noted by James Heathers in his inimitable style) rounding to one decimal place (or significant figure) when reporting your statistics is not a good idea.


The above image is taken from Table 3 of the article, on page 596. This table shows the baseline and post-treatment values of a large number of variables that were measured in the study.  The highlighted row shows the participants' waist–hip ratio in each of two groups and at each of two time points. As you can see, all of the (rounded) means are equal, as are all of the (rounded) SDs.

Does this mean that there was absolutely no difference between the participants? Not quite. You can see that the p value is different for the two conditions. This p value corresponds to the paired t test that will have been performed for the 39 participants in the treatment group across the period of the study, or for the 38 participants in the control group. The p values (corresponding to the respective t statistics) would likely be different even if the means and SDs were identical to many decimal places because the paired t test makes 39 (or 38) comparisons of individual values between baseline and the end of the study.

However, what I'm interested in here is the difference in mean waist–hip ratios between the groups at baseline (i.e., the first and fourth columns of numbers). The participants have been randomized to conditions, so presumably the authors decided not to worry about baseline differences [PDF], but it's interesting to see what those differences could have been (not least because these same numbers could also have been, say, the results obtained by the two groups on a psychological test after they had been assigned randomly to conditions without a baseline measurement).

We can calculate the possible range of differences (*) by noting that the rounded mean of 0.9 could have corresponded to an actual value anywhere between 0.85001 and 0.94999 (let's leave the question of how to round values of exactly 0.85 or 0.95 for now; it's complicated). Meanwhile, each of the rounded SDs of 0.1 could have been as low as 0.05001. (The lower the SD, the larger the standardized effect size.) Let's put those numbers into this online effect size calculator (M1=0.94999, M2=0.85001, SD1=SD2=0.05001) and click "Compute" (**).

Yes, you are reading that right: An effect size of d = (almost) 2 is possible for the baseline difference between the groups even though the reported means are identical. (For what it's worth, the p value here, with 75 degrees of freedom, is .0000000000004.) Again, James has you covered if you want to know what an effect size of 2 means in the real world.
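If you'd rather not rely on the online calculator, the worst-case arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name cohens_d() and the variable names are mine, not part of any library, and I've used the pooled-SD form of d with the group sizes of 39 and 38 (which makes no difference here, because both SDs are set to the same value).

```python
import math


def cohens_d(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups, using the pooled SD.

    Hypothetical helper written for this illustration only.
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


# Most extreme unrounded values that still round to the reported numbers.
MEAN_HI, MEAN_LO = 0.94999, 0.85001  # both would be reported as 0.9
SD_MIN = 0.05001                     # would be reported as 0.1
N_TREATMENT, N_CONTROL = 39, 38      # group sizes given in the post

worst_case_d = cohens_d(MEAN_HI, MEAN_LO, SD_MIN, SD_MIN,
                        N_TREATMENT, N_CONTROL)
print(f"Largest d consistent with the rounded values: {worst_case_d:.4f}")
```

Swapping in other candidate means and SDs is a quick way to see how much room the rounding leaves.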

Now, you might think that this is a bit pathological, and you're probably right. So play around with the means and SDs until they look reasonable to you. For example, if you keep the extreme means but use the rounded SDs as if they were exactly correct, you get d = 0.9998. That's still a whopping effect size for the difference between numbers that are reported as being equal. And even if you bring the means in from the edge of the cliff, the effect size can still be pretty large. Means of 0.93 and 0.87 with SDs of 0.1 will give you d = 0.6 and p = .01, which is good enough for publication in most journals.
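Here is a rough way to do that playing around in Python, assuming an independent-samples t test with 39 and 38 participants and a common SD of 0.1 in both groups (the value that d = 0.6 implies for means of 0.93 and 0.87). All of the function names are made up for this sketch. The p value is obtained by integrating the t density numerically so that only the standard library is needed; that shortcut is adequate for p values like the ones below, but not for the astronomically small p value of the worst case.

```python
import math


def cohens_d(mean1, mean2, sd):
    """d when both groups are assumed to share the same SD."""
    return (mean1 - mean2) / sd


def t_from_d(d, n1, n2):
    """Independent-samples t statistic implied by d and the group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


def two_tailed_p(t, df, steps=20_000):
    """Two-tailed p value via Simpson's rule on the t density.

    Integrates from 0 to |t| and subtracts twice that area from 1, so
    it loses precision once p gets close to floating-point noise.
    """
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * (total * h / 3))


N1, N2 = 39, 38
scenarios = {
    "extreme means, SDs exactly 0.1": (0.94999, 0.85001, 0.1),
    "means 0.93 and 0.87, SDs 0.1": (0.93, 0.87, 0.1),
}

results = {}
for label, (m1, m2, sd) in scenarios.items():
    d = cohens_d(m1, m2, sd)
    t = t_from_d(d, N1, N2)
    p = two_tailed_p(t, N1 + N2 - 2)
    results[label] = (d, t, p)
    print(f"{label}: d = {d:.4f}, t(75) = {t:.2f}, p = {p:.4f}")
```

If you have SciPy installed, its t distribution functions would be a more robust way to get the p values; the hand-rolled integration here is just to keep the sketch self-contained.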

Conclusion: Always report, not just two decimal places, but also at least two significant figures (it's very frustrating to see standardized regression coefficients, in particular, reported as 0.02 with a standard error of 0.01). In fact, since most people read papers on their screens and black pixels use less energy to display than white ones, save the planet and your battery lifetime and report three or four decimals. After all, you aren't afraid of GRIM, are you?



(*) I did this calculation by hand. My f_range() function, described here, doesn't work in this case because the underlying code (from a module that I didn't write, and have no intention of fixing) chokes when trying to calculate the midpoint test statistic when the means and SDs are identical.

(**) This calculator seems to be making the simplifying assumption that the group sizes are identical, which is close enough to make no difference in this case. You can also do the calculation of d by hand: just divide the difference between the means by the standard deviation, assuming you're using the same SD for both groups, or see here.

[Update 2018-07-02 12:13 UTC: Removed link to a Twitter discussion of a different article, following feedback from an alert reader.]