A while back, someone asked me how we (Tim van der Zee, Jordan Anaya, and I) checked the validity of the F statistics when we analyzed the "pizza papers" from the Cornell Food and Brand Lab. I had an idea to write this up here, which has now floated to the top of the pile because I need to cite it in a forthcoming presentation. :-)
A quick note before we start: this technique applies to one- or two-way between-subjects ANOVAs. A one-way, two-condition ANOVA is equivalent to an independent samples t test; the F statistic is the square of the t statistic. I will sometimes mention only F statistics, but everything here applies to independent samples t tests too. On the other hand, you can't use this technique to check mixed (between/within) ANOVAs, or paired-sample t tests, as those require knowledge of every value in the dataset.
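If you want to see that equivalence for yourself, here is a minimal sketch in R with simulated data (the numbers are arbitrary; this is just an illustration):

set.seed(1)
y <- c(rnorm(48, mean=4), rnorm(52, mean=5))   # two groups of simulated scores
g <- factor(rep(c("A", "B"), times=c(48, 52))) # group labels
t.test(y ~ g, var.equal=TRUE)$statistic^2      # t squared...
anova(lm(y ~ g))[["F value"]][1]               # ...equals the one-way ANOVA F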
It turns out that, subject to certain limitations (discussed later), you can derive the F statistic for a between-subjects ANOVA from (only) the per-cell means, SDs, and sample sizes. You don't need the full dataset. There are some online calculators that perform these tests; however, they typically assume that the input means and SDs are exact, which is unrealistic. I can illustrate this point with the t test (!) calculator from GraphPad. Open that up and put these numbers in: Mean=4.05, SD=1.12, N=48 for group A, and Mean=5.06, SD=1.18, N=52 for group B.
(Note that we are not using Welch's t test here, although Daniël Lakens will tell you --- and he is very probably right --- that we should usually do so; indeed, we should use Welch's ANOVA too. Our main reason for not doing that here is that the statistics you are checking will usually not have been computed with the Welch versions of these tests; indeed, the code that I present below depends on an R package that assumes the non-Welch tests are used. You can usually detect whether Welch's tests have been used, as the [denominator] degrees of freedom will not be integers.)
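(A quick way to see that df difference, reusing y and g from the sketch above:

t.test(y ~ g)$parameter                    # Welch's test, R's default: fractional df
t.test(y ~ g, var.equal=TRUE)$parameter    # pooled-variance test: df = 98

)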
Having entered those numbers, click on "Calculate now" (I'm not sure why the "Clear the form" button is so large or prominent!) and you should get these results: t = 4.3816, df = 98. Now, suppose the article you are reading states that "People in condition B (M=5.06, SD=1.18, N=52) outperformed those in condition A (M=4.05, SD=1.12, N=48), t(98)=4.33". Should you reach for your green ballpoint pen and start writing a letter to the editor about the difference in the t value? Probably not. The thing is, the reported means (4.05, 5.06) and SDs (1.12, 1.18) will have been rounded after being used to calculate the test statistic. The mean of 4.05 could have been anywhere from 4.045 to 4.055 (rounding rules are complex, but whether this is 4.0499999 or 4.0500000 doesn't matter too much), the SD of 1.12 could have been in the range 1.115 to 1.125, etc. This can make quite a difference. How much? Well, we can generate the maximum t statistic by making the difference between the means as large as possible, and both of the SDs as small as possible: enter means of 4.045 and 5.065, and SDs of 1.115 and 1.175, keeping the Ns the same.
That gives a t statistic of 4.4443. To get the smallest possible t value, we make the difference between the means as small as possible, and both of the SDs as large as possible, in an analogous way. I'll leave the filling in of the form as an exercise for you, but the result is 4.3195.
So we now know that when we see a statement such as "People in condition B (M=5.06, SD=1.18, N=52) outperformed those in condition A (M=4.05, SD=1.12, N=48), t(98)=<value>", any t value from 4.32 through 4.44 is plausible; values outside that range are, in principle, not possible. If you see multiple such values in an article, or even just one or two with a big discrepancy, it can be worth investigating further.
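If you prefer code to web forms, here is a minimal sketch of the same calculation in R (t_from_summary() is just an illustrative name I am using here; the real checking function, f_range(), appears below):

# Pooled-variance two-sample t from summary statistics.
t_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)   # pooled variance
  (m2 - m1) / sqrt(sp2 * (1/n1 + 1/n2))
}

t_from_summary(4.05, 1.12, 48, 5.06, 1.18, 52)        # nominal: 4.3816
t_from_summary(4.045, 1.115, 48, 5.065, 1.175, 52)    # maximum: 4.4443
t_from_summary(4.055, 1.125, 48, 5.055, 1.185, 52)    # minimum: 4.3195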
The online calculators I have seen that claim to do these tests have a few other limitations besides the problem of rounded input values. First, the interface is a bit clunky (typically involving typing numbers into a web form, which you have to do again tomorrow if you want to re-run the analyses). Second, some of them use Java, and that may not work with your browser. What we needed, at least for the "Statistical Heartburn" analyses, was some code. I wrote mine in R and Jordan independently wrote a version in Python; we compared our results at each step of the way, so we were fairly confident that we had the right answers (or, of course, we could have both made the same mistakes).
My solution uses an existing R library called rpsychi, which does the basic calculations of the test statistics. I wrote a wrapper function called f_range(), which does the work of calculating the upper and lower bounds of the means and SDs, and outputs the minimum and maximum F (or t, if you set the parameter show.t to TRUE) statistics.
[[Update 2024-08-26 15:28 UTC: The rpsychi package is no longer maintained. You can get it from here — and see here if you need help with installing it.]]
Usage is intended to be relatively straightforward. The main parameters of f_range() are vectors or matrices of the per-cell means (m), SDs (s), and sample sizes (n). You can add show.t=TRUE to get t (rather than F) statistics, if appropriate; setting dp.p forces the number of decimal places used to that value (although the default almost always works); and title and labels are cosmetic. Here are a couple of examples from the "pizza papers":
1. Check Table 1 from "Lower Buffet Prices Lead to Less Taste Satisfaction"
n.lbp.t1 <- c(62, 60)
m.lbp.t1.l1 <- c(44.16, 46.08)
sd.lbp.t1.l1 <- c(18.99, 14.46)
f_range(m=m.lbp.t1.l1, s=sd.lbp.t1.l1, n=n.lbp.t1, title="Age")
m.lbp.t1.l3 <- c(68.52, 67.91)
sd.lbp.t1.l3 <- c(3.95, 3.93)
f_range(m=m.lbp.t1.l3, s=sd.lbp.t1.l3, n=n.lbp.t1, title="Height")
m.lbp.t1.l4 <- c(180.84, 182.31)
sd.lbp.t1.l4 <- c(48.37, 48.41)
f_range(m=m.lbp.t1.l4, s=sd.lbp.t1.l4, n=n.lbp.t1, title="Weight")
m.lbp.t1.l5 <- c(3.00, 3.28)
sd.lbp.t1.l5 <- c(1.55, 1.29)
f_range(m=m.lbp.t1.l5, s=sd.lbp.t1.l5, n=n.lbp.t1, title="Group size")
m.lbp.t1.l6 <- c(6.62, 6.64)
sd.lbp.t1.l6 <- c(1.85, 2.06)
# Next line gives an F too small for rpsychi to calculate
# f_range(m=m.lbp.t1.l6, s=sd.lbp.t1.l6, n=n.lbp.t1, title="Hungry then")
m.lbp.t1.l7 <- c(1.88, 1.85)
sd.lbp.t1.l7 <- c(1.34, 1.75)
f_range(m=m.lbp.t1.l7, s=sd.lbp.t1.l7, n=n.lbp.t1, title="Hungry now")
2. Check Table 2 from "Eating Heavily: Men Eat More in the Company of Women"
lab.eh.t2 <- c("gender", "group", "gender x group")
n.eh.t2 <- matrix(c(40, 20, 35, 10), ncol=2)
m.eh.t2.l1 <- matrix(c(5.00, 2.69, 4.83, 5.54), ncol=2)
sd.eh.t2.l1 <- matrix(c(2.99, 2.57, 2.71, 1.84), ncol=2)
f_range(m=m.eh.t2.l1, s=sd.eh.t2.l1, n=n.eh.t2, title="Line 1", labels=lab.eh.t2)
m.eh.t2.l2 <- matrix(c(2.99, 1.55, 1.33, 1.05), ncol=2)
sd.eh.t2.l2 <- matrix(c(1.75, 1.07, 0.83, 1.38), ncol=2)
f_range(m=m.eh.t2.l2, s=sd.eh.t2.l2, n=n.eh.t2, title="Line 2", labels=lab.eh.t2)
m.eh.t2.l3 <- matrix(c(2.67, 2.76, 2.73, 1.00), ncol=2)
sd.eh.t2.l3 <- matrix(c(2.04, 2.18, 2.16, 0.00), ncol=2)
f_range(m=m.eh.t2.l3, s=sd.eh.t2.l3, n=n.eh.t2, title="Line 3", labels=lab.eh.t2)
m.eh.t2.l4 <- matrix(c(1.46, 1.90, 2.29, 1.18), ncol=2)
sd.eh.t2.l4 <- matrix(c(1.07, 1.48, 2.28, 0.40), ncol=2)
f_range(m=m.eh.t2.l4, s=sd.eh.t2.l4, n=n.eh.t2, title="Line 4", labels=lab.eh.t2)
m.eh.t2.l5 <- matrix(c(478.75, 397.5, 463.61, 111.71), ncol=2)
sd.eh.t2.l5 <- matrix(c(290.67, 191.37, 264.25, 109.57), ncol=2)
f_range(m=m.eh.t2.l5, s=sd.eh.t2.l5, n=n.eh.t2, title="Line 5", labels=lab.eh.t2)
m.eh.t2.l6 <- matrix(c(2.11, 2.27, 2.20, 1.91), ncol=2)
sd.eh.t2.l6 <- matrix(c(1.54, 1.75, 1.71, 2.12), ncol=2)
f_range(m=m.eh.t2.l6, s=sd.eh.t2.l6, n=n.eh.t2, title="Line 6", labels=lab.eh.t2)
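One thing to watch with the two-way examples: R's matrix() fills column by column, so

matrix(c(40, 20, 35, 10), ncol=2)
#      [,1] [,2]
# [1,]   40   35
# [2,]   20   10

Make sure that the rows and columns of your m, s, and n matrices all correspond to the same factors, and that they match the order of the labels you pass (columns first, then rows, then the interaction).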
I mentioned earlier that there are some limitations on what this software can do. Basically, once you get beyond a 2x2 design (e.g., 3x2), if the sample sizes are unbalanced across three or more conditions, there can be some (usually minor) discrepancies between the F statistics calculated by rpsychi and the numbers that would have been returned by the ANOVA software used by the authors of the article you are reading; the magnitude of the discrepancies depends on the degree of imbalance. This issue is discussed in a section starting at the bottom of page 3 of our follow-up preprint.
A further limitation is that rpsychi has trouble with very small F statistics (such as 0.02). If you have a script that makes multiple calls to f_range(), it may stop when this occurs. The simplest workaround is to comment out the offending call (as shown in the first example above).
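Alternatively, you could trap the error so that a long script keeps going; something like this should work (a sketch only, as I have not tested it against every way in which rpsychi can fail):

# Hypothetical wrapper: report the failure and carry on, instead of stopping.
safe_f_range <- function(...) {
  tryCatch(f_range(...),
           error = function(e) cat("f_range() failed:", conditionMessage(e), "\n"))
}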
Here is the R code for f_range(). It is released under a CC BY license, so you can do pretty much what you like with it. I decided not to turn it into an R package because I want it to remain "quick and dirty", and packaging it would require an amount of polishing that I don't want to put in at this point. This software comes with no technical support (but I will answer polite questions if you ask them via the comments on this post) and I accept no responsibility for anything you might do with it. Proceed with caution, and make sure you understand what you are doing (for example, by having a colleague check your reasoning) before you do... well, anything in life, really.
library(rpsychi)

# Function to display the possible ranges of the F or t statistic
# from a one- or two-way ANOVA.
f_range <- function (m, s, n, title=FALSE, show.t=FALSE, dp.p=-1, labels=c()) {
  m.ok <- m
  if (is.matrix(m.ok)) {            # two-way: m is a matrix of cell means
    func <- ind.twoway.second
    useF <- c(3, 2, 4)
    default_labels <- c("col F", "row F", "inter F")
  }
  else {                            # one-way: m is a vector of cell means
    m.ok <- matrix(m)
    func <- ind.oneway.second
    useF <- 1
    default_labels <- c("F")
    if (show.t) {
      default_labels <- c("t")
    }
  }

# Determine how many DPs to use from input numbers, if not specified
  dp <- dp.p
  if (dp.p == -1) {
    dp <- 0
    numbers <- c(m, s)
    for (i in numbers) {
      if (i != round(i, 0)) {
        dp <- max(dp, 1)
        j <- i * 10
        if (j != round(j, 0)) {
          dp <- max(dp, 2)
        }
      }
    }
  }

  if (length(labels) == 0) {
    labels <- default_labels
  }

# Calculate the nominal test statistic(s) (i.e., assuming no rounding error)
  f.nom <- func(m=m.ok, sd=s, n=n)$anova.table$F

# We correct for rounding in reported numbers by allowing for the maximum possible rounding error.
# For the maximum F estimate, we subtract .005 from all SDs; for the minimum F estimate, we add .005.
# We then add or subtract .005 to every mean, in all possible permutations.
# (".005" is an example, based on 2 decimal places of precision.)
  delta <- (0.1 ^ dp) / 2           # typically 0.005
  s.hi <- s - delta
  s.lo <- s + delta

# Initialise the maximum and minimum F statistics to the nominal values.
  f.hi <- f.nom
  f.lo <- f.nom

# Generate every possible combination of +/- maximum rounding error to add to each mean.
  l <- length(m.ok)
  rawcomb <- combn(rep(c(-delta, delta), l), l)
  comb <- rawcomb[,!duplicated(t(rawcomb))]

# Generate every possible set of test statistics within the bounds of rounding error,
# and retain the largest and smallest.
  for (i in 1:ncol(comb)) {
    m.adj <- m.ok + comb[,i]
    f.hi <- pmax(f.hi, func(m=m.adj, sd=s.hi, n=n)$anova.table$F)
    f.lo <- pmin(f.lo, func(m=m.adj, sd=s.lo, n=n)$anova.table$F)
  }

  if (show.t) {
    f.nom <- sqrt(f.nom)
    f.hi <- sqrt(f.hi)
    f.lo <- sqrt(f.lo)
  }

  if (title != FALSE) {
    cat(title)
  }

  sp <- " "
  fdp <- 2                          # best to report Fs to 2 DP always, I think
  dpf <- paste("%.", fdp, "f", sep="")
  for (i in 1:length(useF)) {
    j <- useF[i]
    cat(sp, labels[i], ": ", sprintf(dpf, f.nom[j]),
        " (min=", sprintf(dpf, f.lo[j]),
        ", max=", sprintf(dpf, f.hi[j]), ")", sep="")
    sp <- " "
  }

  if ((dp.p == -1) && (dp < 2)) {
    cat(" <<< dp set to", dp, "automatically")
  }
  cat("\n", sep="")
}