14 October 2019

A curious edge case issue with PET-PEESE

(This post has been written in the spirit of "If you want to understand how something works, dig around inside it for a while and write up what you found". It might already be known somewhere, and the chances of it causing a problem in the real world might be small, but I wanted to get my analyses down in writing for my future self, and I thought it might be useful for somebody. Thanks to Daniël Lakens for his very helpful comments on an earlier draft of this post.)

PET-PEESE is a method for detecting and estimating the effects of publication bias in a meta-analysis. I don't do meta-analyses for a living (or indeed as a hobby), so up to now my interest in this topic has been fairly minimal, and doubtless it will go back to being so after this post.  I won't give any introduction to what PET-PEESE does here; you can find that in the blog posts that I link to below.

While trying to understand more about meta-analyses in general and PET-PEESE in particular, I read this blog post from 2015 by Will Gervais. He discusses a number of limitations of PET-PEESE, including possible bias when using it with Cohen's d as the effect size. The problem, in a nutshell, is that PET-PEESE involves regressing the effect size on its standard error, and since the calculation of the standard error for Cohen's d includes the effect size itself, there is by definition going to be at least some correlation there. Here's Will's version of the formula for the SE:
(Note that this page has a slightly different formula that is missing the final term in parentheses, but that doesn't make a lot of difference here.)
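For reference, the formula as it is described in the rest of this post can be written out as below. This is a reconstruction from the description of the terms (two terms in the left pair of parentheses, one correction term in the right pair), so the exact denominators are an assumption rather than a copy of Will's version:

```latex
SE_d = \sqrt{\left(\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2\,(n_1 + n_2 - 2)}\right)
             \left(\frac{n_1 + n_2}{n_1 + n_2 - 2}\right)}
```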

In a reply to Will's post (scroll down to near the end of the comments), Jake Westfall wrote, "I don't think that dependence [on d] ends up mattering much. You can see in the SE formula that the term involving d quickly vanishes to 0 as the sample size grows. In fact, for typical effect sizes (d = .2 to .6), the term involving d is effectively nil once we have reached just 10 participants per group."

That seems reasonable. After all, d is probably less (or not much bigger) than 1, and twice the combined sample size is going to be a fairly large number in comparison. So it seems as if the effect of the term with d-squared is not going to be very great.
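A quick back-of-the-envelope check of that intuition, assuming the two terms in the left pair of parentheses are (n1 + n2)/(n1 n2) and d²/(2(n1 + n2 - 2)), and taking equal group sizes n1 = n2 = n:

```latex
\frac{d^2 / \bigl(2(2n - 2)\bigr)}{2n / n^2}
  = \frac{d^2}{4(n - 1)} \cdot \frac{n}{2}
  = \frac{d^2}{8} \cdot \frac{n}{n - 1}
  \approx \frac{d^2}{8}
```

For d = 0.4 that is about 0.02, and under these assumptions it barely changes with n.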

I wrote some R code to investigate this, which you can find here. (I've reproduced it below as an image, so you can read along, but the definitive version is in that gist.) It builds the standard error from variables t1a and t1b (the two terms in the left pair of parentheses of the SE formula above), which are added together and multiplied by t2, the term in the right pair of parentheses. I used the ratio of t1b to t1a as a measure of the influence of the d-squared term, and indeed it's quite small. As the code stands, this term is only about 0.02 times the left-hand term, and the correlation between d and its SE is about .03, which is only a small bias on the PET test.
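A minimal sketch of that setup follows. It is not the code from the gist: the SE formula, the way n1 and n2 are drawn (the names n1_mean and n1_min, and a jitter of up to 2 participants for n2) and the seed are all assumptions chosen to match the description above.

```r
# Sketch only: assumes SE = sqrt((t1a + t1b) * t2); this is not the gist's code.
set.seed(1)
iter    <- 10000   # number of simulated studies
n1_mean <- 50      # target mean for n1 (hypothetical name)
n1_min  <- 20      # floor for n1 (hypothetical name)

# n1: uniform on 0..2*mean, rounded, then floored at the minimum
n1 <- pmax(n1_min, round(runif(iter) * 2 * n1_mean))
# n2: n1 plus or minus up to 2 participants (assumed jitter)
n2 <- n1 + sample(-2:2, iter, replace = TRUE)
# d: effect sizes spread uniformly between 0.2 and 0.6
d <- runif(iter) * 0.4 + 0.2

t1a <- (n1 + n2) / (n1 * n2)        # first term in the left parentheses
t1b <- d^2 / (2 * (n1 + n2 - 2))    # d-squared term in the left parentheses
t2  <- (n1 + n2) / (n1 + n2 - 2)    # term in the right parentheses
se  <- sqrt((t1a + t1b) * t2)

ratio <- mean(t1b / t1a)            # relative size of the d-squared term
r_dse <- cor(d, se)                 # correlation between d and its SE
print(c(ratio = ratio, cor_d_se = r_dse))
```

Setting n1_mean to 10 while leaving n1_min at 20 forces every n1 to 20 in this sketch, because the uniform draw can never exceed twice the target mean.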

The code generates sample sizes n1 and n2 as random numbers. When generating n1 I had to add a minimum value, to avoid having a sample size of 0 (which would cause things to break) or 1 (which would be a bit silly). As it stands below, the code tries to make a mean n1 of 50 with a minimum of 20. However, while playing with the code, at one point I reduced the mean to 10, without changing the minimum from 20. This meant that n1 was accidentally forced to be 20 for every sample (because the maximum value that can be generated at line 14 is twice the target mean). Suddenly, although the ratio between the two terms inside the left-hand set of parentheses in the SE formula remained at 0.02, the correlation between d and the SE went up to .30.  You can try it yourself; just change 50 to 10 in line 12.

Things get even wilder if you have a bigger range of effect sizes. In line 19, put a # character before the *, so that the line is just

  d = runif(iter)

and hence the range of d is from 0 to 1. (Aside: I can't get Blogger.com's editor to stop eating left angle brackets, so please don't @ me about my use of = for assignment here.) Now the correlation between d and SE is about .65. Want more mayhem? Uncomment line 16, so that n2 is now exactly the same as n1 (instead of being a bit more or less), making all the pooled sample sizes the same. The correlation between d and SE now goes up to about .97 (!).
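The three steps just described can be compared side by side. This is a sketch under the same assumptions as before (an assumed SE formula, n1 fixed at 20, and a hypothetical jitter of up to 2 participants for n2), not the gist itself; se_d is a helper name introduced here for illustration.

```r
# Hypothetical helper: assumed SE formula for Cohen's d, sqrt((t1a + t1b) * t2)
se_d <- function(d, n1, n2) {
  t1a <- (n1 + n2) / (n1 * n2)
  t1b <- d^2 / (2 * (n1 + n2 - 2))
  t2  <- (n1 + n2) / (n1 + n2 - 2)
  sqrt((t1a + t1b) * t2)
}

set.seed(1)
iter   <- 10000
n1     <- rep(20, iter)                          # n1 forced to 20 for every study
jitter <- sample(-2:2, iter, replace = TRUE)     # assumed spread of n2 around n1

d_narrow <- runif(iter) * 0.4 + 0.2              # d from 0.2 to 0.6
d_wide   <- runif(iter)                          # d from 0 to 1

cor_a <- cor(d_narrow, se_d(d_narrow, n1, n1 + jitter))  # fixed n1, narrow d
cor_b <- cor(d_wide,   se_d(d_wide,   n1, n1 + jitter))  # fixed n1, wide d
cor_c <- cor(d_wide,   se_d(d_wide,   n1, n1))           # n2 identical to n1
print(round(c(cor_a, cor_b, cor_c), 2))
```

The exact values depend on how n2 is jittered, which is a guess here; the point is the ordering, with the correlation climbing as the sample sizes become more uniform and the range of d widens.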

The effect also makes a big difference to the intercept, which is intended to be a measure of the true effect size. For example, after making the various changes mentioned above, try examining the wls regression object; the intercept can go below -6. A plot() of this object is interesting; indeed, even a simple plot of d against se is quite spectacular:

Plot of se against d with n1=10 but no other edits to the supplied code (range of d: 0.2 to 0.6; n2 != n1)
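For readers who want to look at the intercept directly, here is a sketch of the PET regression on the most extreme version of the simulated data (every study with n1 = n2 = 20, d from 0 to 1). It assumes the usual PET setup, a regression of d on se weighted by 1/se^2, and the same assumed SE formula as elsewhere; only the object name wls comes from the post.

```r
# Sketch only: PET as a weighted regression of d on its SE (weights 1 / se^2).
set.seed(1)
iter <- 10000
n1 <- rep(20, iter)
n2 <- n1                                   # all sample sizes identical
d  <- runif(iter)                          # effect sizes from 0 to 1

# Assumed SE formula: sqrt((t1a + t1b) * t2)
se <- sqrt(((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2 - 2))) *
           ((n1 + n2) / (n1 + n2 - 2)))

wls <- lm(d ~ se, weights = 1 / se^2)      # PET: the intercept estimates the "true" effect
pet_intercept <- unname(coef(wls)[1])
pet_slope     <- unname(coef(wls)[2])
print(c(intercept = pet_intercept, slope = pet_slope, mean_d = mean(d)))

# plot(se, d) shows the near-deterministic curve behind the regression
```

Every simulated d lies between 0 and 1, so an intercept well outside that range is a sign that the regression line is tracking the algebraic link between d and se rather than anything about publication bias.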

In a 2017 article [PDF], Tom Stanley (the originator of PET-PEESE) noted that there can be some bias when all of the sample sizes are small. However, the issue identified here goes beyond that. If you change the minimum sample size to 1000 (line 11) you will see that the problem remains almost exactly the same.
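A small check of that claim, again as a sketch with an assumed SE formula rather than the gist's code: hold every sample size at 20, then at 1000, and compare the correlation between d and its SE. The helper name se_d is hypothetical.

```r
# Hypothetical helper: assumed SE formula for Cohen's d, sqrt((t1a + t1b) * t2)
se_d <- function(d, n1, n2) {
  t1a <- (n1 + n2) / (n1 * n2)
  t1b <- d^2 / (2 * (n1 + n2 - 2))
  t2  <- (n1 + n2) / (n1 + n2 - 2)
  sqrt((t1a + t1b) * t2)
}

set.seed(1)
iter <- 10000
d <- runif(iter)                             # effect sizes from 0 to 1

cor_small <- cor(d, se_d(d, 20, 20))         # every study has 20 per group
cor_large <- cor(d, se_d(d, 1000, 1000))     # every study has 1000 per group
print(c(n20 = cor_small, n1000 = cor_large))
```

In this sketch the two correlations are nearly indistinguishable, which suggests that the homogeneity of the sample sizes matters here, not how small they are.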

In a normal meta-analysis, the biasing effect will of course be much less drastic than this, but with studies of sufficiently similar size, this problem has the potential to introduce some bias into the unsuspecting user's interpretation of the PET-PEESE regression line. (In the article just mentioned, Stanley recommends including a minimum of 20 effects in a PET-PEESE analysis, for other reasons.) Interested readers might try playing with the variable iter to see how various numbers of studies affect the result.

What's going on? In the formula for the SE, the term on the right that includes d-squared is of negligible magnitude compared to the one on its left, and yet it is driving the entire relationship. The answer appears (I think) to be our good friend, granularity. With homogeneous sample sizes, the numbers in the first term of the formula, (n1 + n2) / (n1 * n2), are always the same, or at least, quite similar. Hence, the variance provided by the term containing d turns out to make a significant contribution. At least, that's what I think is happening; please feel free to play with the simulated data from my code and disagree with me (because I'm totally just winging it here).
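One way to make this concrete, offered as a sketch and not a proof: assume the SE has the form sqrt((t1a + t1b) * t2) with t1b proportional to d-squared, and take every study to have n1 = n2 = n. Then A and B below are constants, the SE depends on d alone, and a first-order expansion gives:

```latex
SE(d) = \sqrt{A + B d^2} \;\approx\; \sqrt{A} + \frac{B}{2\sqrt{A}}\, d^2,
\qquad A = \frac{2}{n - 1}, \quad B = \frac{n}{4\,(n - 1)^2}

\operatorname{cor}(d, SE) \;\approx\; \operatorname{cor}(d, d^2)
  = \frac{\sqrt{15}}{4} \approx 0.97
\quad \text{for } d \sim \mathrm{Uniform}(0, 1)
```

A correlation ignores scale, so the smallness of B relative to A offers no protection once nothing else in the SE varies. Under these assumptions the result does not depend on n at all, which is consistent with both the .97 reported above and the observation that raising the sample size to 1000 changes almost nothing.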

Some time after Will's post, Uri Simonsohn blogged about PET-PEESE at Data Colada. Uri noted: "A more surprisingly consequential assumption involves the symmetry of sample sizes across studies. Whether there are more small than large n studies, or vice versa, PET PEESE’s performance suffers quite a bit."  I wrote the code shown here before I read Uri's post, but when I did read it, it made sense (I presume that the effect that Uri is describing could be the same as the one I observed in my simulated data).

In summary, it seems that when sample sizes are "too" homogeneous, PET-PEESE will be biased, in favour of suggesting that there is excessive publication bias, with this bias being an inverse function (which I am not smart enough to work out) of the variability of the sample sizes (n1 and n2).

How much of a problem is this in practice, for the typical meta-analysis? Probably not very much at all. I just find it curious that (assuming the above analysis is correct) a meta-analysis method could potentially fail if it was used in a field where sample sizes are highly homogeneous, which I suppose could happen if there was a "natural" sample size; say, the number of matches played in a football league with 20 teams over successive seasons. Of course, all analysis methods have limitations on the conditions where they can be used, but typically these arise when the input data are extremely variable. In ANOVA, we like it when all of the groups have similar variance; we don't have to worry about something suddenly heading off towards infinity if this similarity is less than 0.03 or whatever. In the title of this post I've referred to the problem as an "edge case", but it feels more like a sneaky hole lurking in the middle of the playing field.