(Note: this is more or less my first solo foray into unaided statistical and methodological criticism. Normally I hitch a ride on the coat-tails of my more experienced co-authors, hoping that they will spot and stop my misunderstandings. In this case, I haven't asked anybody to do that for me, so if this post turns out to be utter garbage, I will have only myself to blame. But it probably won't kill me, so according to the German guy with the fancy moustache, it will make me stronger.)
Among all the LaCour kerfuffle last week, this article by Hu et al. in Science seems to have slipped by with relatively little comment on social media. That's a shame, because it seems to be a classic example of how fluffy articles in vanity journals can arguably do more damage to the cause of science than outright fraud.
I first noticed Hu et al.'s article in the BBC app on my tablet. It was the third article in the "World News" section. Not the Science section, or the Health section (for some reason, the BBC's write-up was done by their Health correspondent, although what the study has to do with health is not clear); apparently this was the third most important news story in the world on May 29, 2015.
Hu et al.'s study ostensibly shows that certain kinds of training can be reinforced by having sounds played to you while you sleep. This is the kind of thing the media loves. Who cares if it's true, or even plausible, when you can claim that "The more you sleep, the less sexist and racist you become", something that is not even suggested in the study? (That piece of crap comes from the same newspaper that has probably caused several deaths down the line by scaremongering about the HPV vaccine; see here for an excellent rebuttal.) After all, it's in Science (aka "the prestigious journal, Science"), so it must be true, right? Well, let's see.
Here's what Hu et al. did. First, they had their participants take the Implicit Association Test (IAT). The IAT is, very roughly speaking, a measure of the extent to which you unconsciously endorse stereotypically biased attitudes, e.g. (in this case) that women aren't good at science, or Black people are bad. If you've never taken the IAT, I strongly recommend that you try it (here; it's free and anonymous); you may be shocked by the results, especially if (like almost everybody) you think you're a pretty open-minded, unbigoted kind of person. Hu et al.'s participants took the IAT twice, and their baseline degree of what I'll call for convenience "sexism" (i.e., the association of non-sciencey words with women's faces; the authors used the term "gender bias", which may be better, but I want an "ism") and "racism" (association of negative words with Black faces) was measured.
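For readers who want to see the arithmetic behind an IAT score, here is a deliberately simplified sketch in Python. The data and variable names are invented, and the real scoring algorithm (Greenwald et al.'s D measure) adds refinements such as error penalties and trial exclusions; this just shows the core idea of the measure being discussed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented response latencies (ms) for one participant.
# "Compatible" block: pairings that match the stereotype (e.g. male + science);
# "incompatible" block: the counter-stereotypical pairings (e.g. female + science).
compatible = rng.normal(700, 120, size=40)
incompatible = rng.normal(800, 150, size=40)

# Core of the D score: difference in mean latency, divided by the
# standard deviation of all trials from both blocks.
pooled_sd = np.std(np.concatenate([compatible, incompatible]), ddof=1)
d_score = (incompatible.mean() - compatible.mean()) / pooled_sd

print(f"Simplified IAT D score: {d_score:.2f}")  # larger = stronger implicit association
```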
Next, Hu et al. had their participants undergo training designed to counter these undesirable attitudes. This training is described in the supplementary materials, which are linked to from the article (or you can save a couple of seconds by going directly here). The key point was that each form of the training ("anti-sexism" and "anti-racism") was associated with its own sound that was played to the participants when they did something right. You can find these sounds in the supplementary materials section, or play them directly here and here; my first thought is that they are both rather annoying, having seemingly been taken from a pinball machine, but I don't know if that's likely to have made a difference to the outcomes.
After the training session, the participants retook the IAT (for both sexism and racism), and as expected, performed better. Then, they took a 90-minute nap. While they were asleep, one of the sounds associated with their training was selected at random and played repeatedly to each of them; that is, half the participants had the sound from the "anti-sexism" part of the training played to them, and the other half had the sound from the "anti-racism" aspect played to them. The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author).
Now comes the key part of the article. When the participants woke up from their nap, they took the IAT (again, for both sexism and racism) once more. The authors claimed that people who were "cued" with the sound associated with the anti-sexism training during their nap further improved their performance on the "women and science" version of the test, but not the "negative attitudes towards Black people" version (the "uncued" training); similarly, those who were "cued" with the sound associated with the anti-racism training became even more unconsciously tolerant towards Black people, but not more inclined to associate women with science. In other words, the sound that was played to them was somehow reinforcing the specific message that had been associated with that sound during the training period.
Finally, the authors had the participants return to their lab a week later, and take the IAT for both sexism and racism, one more time. They found that performance had slipped --- that is, people did worse on both forms of the IAT, presumably as the effect of the training wore off --- but that this effect was greater for the "cued" than the "uncued" training topic. In other words, playing the sound of one form of the training during their nap not only had a beneficial effect on people's implicit, unconscious attitudes (reinforcing their training), but this effect also persisted a whole week later.
So, what's the problem? Reactions in the media, and from scientists who were invited to comment, concentrated on the potential to save the world from sexism and racism, with a bit of controversy as to whether it would be ethical to brainwash people in their sleep even if it were for such a good cause. However, that assumes that the study shows what it claims to show, and I'm not at all convinced of that.
Let's start with the size of the study. The authors reported a total of 40 participants; the supplementary materials mention that quite a few others were excluded, mostly because they didn't enter the "right" phase of sleep, or they reported hearing the cueing sound. That's just 20 participants in each condition (cued or uncued), which is less than half the number you need to have 80% power to detect that men weigh more than women. In other words, the authors seem to have found a remarkably faint star with their very small telescope [PDF].
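For what it's worth, here is a quick power sketch of my own (not from the paper), using an independent-samples t-test and assuming an effect size of d = 0.59, the figure usually quoted for the male-female difference in body weight:

```python
# A back-of-the-envelope power calculation; the effect size is an assumption
# on my part, not a number taken from Hu et al.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-sample t-test with 20 participants per group, d = 0.59:
power_at_20 = analysis.power(effect_size=0.59, nobs1=20, alpha=0.05)

# Per-group sample size needed for 80% power at the same effect size:
n_for_80 = analysis.solve_power(effect_size=0.59, power=0.80, alpha=0.05)

print(f"Power with 20 per group: {power_at_20:.2f}")  # roughly 0.44
print(f"n per group for 80% power: {n_for_80:.0f}")   # roughly 46
```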
The sample size problem gets worse when you examine the supplemental material and learn that the study was run with two samples; in the first, 21 participants survived the winnowing process, and then eight months later, 19 more were added. This raises all sorts of questions. First, there's a risk that something (even if it was apparently trivial: the arrangement of the computers in the IAT test room, the audio equipment used to play the sounds to the participants, the haircut of the lab assistant) changed between the first and second rounds of testing. More importantly, though, we need to know why the researchers apparently chose to double their sample size. Could it be because they had results that were promising, but didn't attain statistical significance? They didn't tell us, but it's interesting to note that in Figures S2 and S3 of the supplemental material, they pointed out that the patterns of results from both samples were similar(*). That doesn't prove anything, but it suggests to me that they thought they had an interesting trend, and decided to see if it would hold with a fresh batch of participants. The problem is, you can't just peek at your data, see if it's statistically significant, and if not, add a few more participants until it is. That's double-dipping, and it's very bad indeed; at a minimum, your statistical significance needs to be adjusted, because you had more than one try to find a significant result. Of course, we can't prove that the six authors of the article looked at their data; maybe they finished their work in July 2014, packed everything up, got on with their lives until February 2015, tested their new participants, and then opened the envelope with the results from the first sample. Maybe. (Or maybe the reviewers at Science suggested that the authors run some more participants, as a condition for publication. Shame on them, if so; the authors had already peeked at their data, and statistical significance, or its absence, is one of those things that can't be unseen.)
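To see how much a "run 21, peek, then top up with 19 more" procedure can inflate the false-positive rate, here is a small simulation of my own; the batch sizes come from the supplement, but the stopping rule is my assumption about what might have happened:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
n_sims, n_first, n_extra, alpha = 20_000, 21, 19, 0.05

false_positives = 0
for _ in range(n_sims):
    # Null world: the cueing effect is zero for everyone.
    batch1 = rng.normal(0.0, 1.0, n_first)
    if ttest_1samp(batch1, 0.0).pvalue < alpha:
        false_positives += 1            # "significant" after the first batch: stop and publish
    else:
        batch2 = rng.normal(0.0, 1.0, n_extra)
        combined = np.concatenate([batch1, batch2])
        if ttest_1samp(combined, 0.0).pvalue < alpha:
            false_positives += 1        # second bite at the cherry

print(f"False-positive rate with one peek: {false_positives / n_sims:.3f}")
# Comes out noticeably above the nominal 0.05 (around 0.08 in simulations like this).
```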
The gee-whiz bit of the article, which the cynic in me suspects was at least partly intended for rapid consumption by naive science journalists, is Figure 1, a reasonably-sized version of which is available here. There are a few problems with the clarity of this Figure from the start; for example, the blue bars in 1B and 1F look like they're describing the same thing, but they're actually slightly different in height, and it turns out (when you read the labels!) that in 1B, the left and right sides represent gender and race bias, not (as in all the other charts) cued and uncued responses. On the other hand, the green bars in 1E and 1F both represent the same thing (i.e., cued/uncued IAT results a week after the training), as do the red bars in 1D and 1E, but not 1B (i.e., pre-nap cued/uncued IAT results).
Apart from that possible labelling confusion, Figure 1B appears otherwise fairly uncontroversial, but it illustrates that the effect (or at least, the immediate effect) of anti-sexism training is, apparently, greater than that of anti-racism training. If that's true, then it would have been interesting to see results split by training type in the subsequent analyses, but the authors didn't report this. There are some charts in the supplemental material showing some rather ambiguous results, but no statistics are given. (A general deficiency of the article is that the authors did not provide a simple table of descriptive statistics; the only standard deviation reported anywhere is that of the age of the participants, and that's in the supplemental material. Tables of descriptives seem to have fallen out of favour in the age of media-driven science, but --- or "because"? --- they often have a lot to tell us about a study.)
Of all the charts, Figure 1D perhaps looks the most convincing. It shows that, after their nap, participants' IAT performance improved further (compared to their post-training but pre-sleep results) for the cued training, but not for the uncued training (e.g., if the sound associated with anti-sexism training had been played during their nap, they got better at being non-sexist but not at being non-racist). However, if you look at the error bars on the two red (pre-nap) columns in Figure 1D, you will see that they don't overlap. This means that, on average, participants who were exposed to the sound associated with anti-sexism were performing significantly worse on the sexism component of the IAT than the racism component, and vice versa. In other words, there was more room for improvement on the cued task versus the uncued task, and that improvement duly took place. This suggests to me that regression to the mean is one possible explanation here. Also, the significant difference (non-overlapping error bars) between the two red bars means that the authors' random assignment of people to the two different cues (having the "anti-sexism" or "anti-racism" training sound played to them) did not work to eliminate potential bias. That's another consequence of the small sample size.
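To illustrate the regression-to-the-mean point, here is a toy simulation of my own (all numbers invented): the "cue" does nothing at all, but whenever the cued topic happens, by bad luck of randomisation, to start out worse than the uncued topic, it "improves" more after the nap.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n = 10_000, 20
true_bias, noise_sd = 0.5, 0.3   # invented values, purely for illustration

extra_improvement = []
for _ in range(n_sims):
    # Observed IAT score = stable true bias + measurement noise.
    # The cue has NO effect anywhere in this simulated world.
    pre_cued  = true_bias + rng.normal(0, noise_sd, n)
    post_cued = true_bias + rng.normal(0, noise_sd, n)
    pre_unc   = true_bias + rng.normal(0, noise_sd, n)
    post_unc  = true_bias + rng.normal(0, noise_sd, n)

    # Keep only the simulated experiments in which the cued topic happened to
    # start out noticeably worse than the uncued topic (as in Figure 1D).
    if pre_cued.mean() > pre_unc.mean() + 0.1:
        improvement_cued = pre_cued.mean() - post_cued.mean()
        improvement_unc  = pre_unc.mean() - post_unc.mean()
        extra_improvement.append(improvement_cued - improvement_unc)

print(f"Mean extra 'improvement' for the cued topic: {np.mean(extra_improvement):.3f}")
# Positive: the topic that started worse improves more, with no cue effect at all.
```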
Similar considerations apply to Figure 1E, which purports to show that cued "learning" persisted a week afterwards. Most notable about 1E, however, is what it doesn't show. Remember, 1D shows the IAT results before and after the nap. 1E uses data from a week after the training, but it doesn't compare the IAT results from a week later with the ones from just after the nap; instead, it compares them with the results from just before the nap. Since the authors seem to have omitted to display in graphical form the most direct effect of the elapsed week, I've added it here. (Note: the significance stars are my estimate. I'm pretty sure the one star on the right is correct, as the error bars just fail to overlap; on the left, there should be at least two stars, but I'm going to allow myself a moment of hyperbole and show three. In any case, as you'll see in the discussion of Figure 1F, this is all irrelevant anyway.)
So, this extra panel (Figure 1E½?) could have been written up something like this: "Cueing during sleep did not result in sustained counterbias reduction; indeed, the cued bias increased very substantially between postnap and delayed testing [t(37) = something, P = very small], whereas the increase in the uncued bias during the week after postnap testing was considerably smaller [t(37) = something, P = 0.045 or thereabouts]." However, Hu et al. elected not to report this. I'm sure they had a good reason for that. Lack of space, probably.
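For concreteness, the omitted comparison would just be a pair of within-subject t-tests (postnap versus one week later, separately for the cued and uncued topics). Here is a sketch with scipy on made-up scores, since the per-participant data are not available:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)
n = 38   # matching the df of 37 in the hypothetical write-up above

# Made-up IAT bias scores for the cued topic at the two time points.
postnap_cued = rng.normal(0.20, 0.30, n)
delayed_cued = postnap_cued + rng.normal(0.25, 0.30, n)   # bias creeping back after a week

result = ttest_rel(delayed_cued, postnap_cued)
print(f"Cued topic, postnap vs one week later: "
      f"t({n - 1}) = {result.statistic:.2f}, p = {result.pvalue:.4f}")

# The same call on the uncued topic's scores gives the second test; a still
# more direct approach is to test the interaction (cued change minus uncued change).
```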
Combining 1D and 1E, we get this chart (no significance stars this time). My "regression to the mean" hypothesis seems to find some support here.
Figure 1F shows that Hu et al. have committed a common fallacy in comparing two conditions on the basis of one showing a statistically significant effect and the other not (in fact, they committed this fallacy several times in their article, in their explanation of almost every panel of Figure 1). They claimed that 1F shows that the effect of cued (versus uncued) training persisted after a week, because the improvement in IAT scores over baseline for the cued training (first blue column versus first green column) was statistically significant, whereas the corresponding improvement for the uncued training (second blue column versus second green column) was not. Yet, as Andrew Gelman has pointed out in several blog posts with similar titles over the past few years, the difference between "statistically significant" and "not statistically significant" is not in itself necessarily statistically significant. (He even wrote an article [PDF] on this, with Hal Stern.) The question of interest here is whether the IAT performance for the topics (sexism or racism) of cued and uncued training, which were indistinguishable at baseline (the two blue columns), was different at the end of the study (the two green columns). And, as you can see, the error bars on the two green columns overlap substantially; there is no evidence of a difference between them.
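Gelman and Stern's point is easy to demonstrate with made-up summary statistics (the numbers below are mine, chosen only to show the pattern, and I'm using an unpaired comparison for simplicity even though the study's design is within-subjects):

```python
import numpy as np
from scipy.stats import t as t_dist

n = 20
mean_cued, sd_cued     = 0.15, 0.30   # change score that is "significant" on its own
mean_uncued, sd_uncued = 0.06, 0.30   # change score that is not

def one_sample_p(mean, sd, n):
    """Two-sided p-value for a one-sample t-test against zero."""
    t = mean / (sd / np.sqrt(n))
    return 2 * t_dist.sf(abs(t), df=n - 1)

def two_sample_p(m1, s1, m2, s2, n):
    """Two-sided p-value for the difference between the two changes."""
    se = np.sqrt(s1**2 / n + s2**2 / n)
    t = (m1 - m2) / se
    return 2 * t_dist.sf(abs(t), df=2 * n - 2)

print(f"Cued change vs 0:          p = {one_sample_p(mean_cued, sd_cued, n):.3f}")     # ~0.04
print(f"Uncued change vs 0:        p = {one_sample_p(mean_uncued, sd_uncued, n):.3f}") # ~0.38
print(f"Cued vs uncued difference: "
      f"p = {two_sample_p(mean_cued, sd_cued, mean_uncued, sd_uncued, n):.3f}")        # ~0.35
```

One change clears the 0.05 bar and the other doesn't, yet the direct comparison between them is nowhere near significance; that direct comparison is the only test that speaks to the claim in Figure 1F.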
One other point to end this rather long post. Have a look at Figure 2 and the associated description. Maybe I'm missing something, but it looks to me as if the authors are proudly announcing how they went on a fishing expedition:
"Neurophysiological activity during sleep—such as sleep spindles, slow waves, and rapid-eye-movement (REM) duration—can predict later memory performance (17). Accordingly, we explored possible relations between cueing-specific bias reduction and measures of sleep physiology. We found that only SWS × REM sleep duration consistently predicted cueing-specific bias reduction at 1 week relative to baseline (Fig. 2) [r(38) = 0.450, P = 0.005] (25)."

They don't tell us how many combinations of parameters they tried to come up with that lone significant result; nor, in the next couple of paragraphs, do they give us any theoretical justification other than handwaving why the product of SWS and REM sleep duration (whose units, the label on the horizontal axis of Figure 2 notwithstanding, are "square minutes", whatever that might mean) --- as opposed to the sum of these two numbers, or their difference, or their ratio, or any one of a dozen other combinations --- should be physiologically relevant. Indeed, selecting the product has the unfortunate effect of making half of the results zero --- I count 20 dots that aren't on the vertical axis, for 40 participants. I'm going to guess that if you remove those zeroes (which surely cannot have any physiological meaning), the regression line is going to be a lot flatter than it is at present.
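As a rough illustration of why the number of combinations matters, here is a simulation of my own construction: the fake sleep measures have nothing to do with the fake outcome, yet trying the same handful of composite predictors (sum, difference, ratio, product) produces a "significant" correlation far more often than 5% of the time in a sample of 40.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2015)
n, n_sims = 40, 5_000

hits = 0
for _ in range(n_sims):
    sws = rng.exponential(20, n)       # fake slow-wave-sleep minutes
    rem = rng.exponential(15, n)       # fake REM minutes
    outcome = rng.normal(0, 1, n)      # "bias reduction", unrelated to sleep by construction

    candidates = [sws, rem, sws + rem, sws - rem, sws / rem, sws * rem]
    if min(pearsonr(x, outcome)[1] for x in candidates) < 0.05:
        hits += 1

print(f"Chance of at least one 'significant' sleep predictor: {hits / n_sims:.2f}")
# Well above the nominal 0.05, even with only six candidates (and a real
# fishing expedition could easily have tried more than six).
```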
Bottom line: I have difficulty believing that there is anything to see here. We can put off the debate about the ethics of subliminally improving people for a while, or at least rest assured that it's likely to remain an entirely theoretical problem.
(*) Incidentally, each red- or green-coloured column in one of the panes of Figure S3 corresponds to approximately five (5) participants. You can't even detect that men are taller than women with that.
"The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author)."
This is not true. They cite this review article, and in there are several studies demonstrating how this memory reactivation by auditory and olfactory cues works.
http://www.sciencedirect.com/science/article/pii/S136466131300020X
and e.g., http://www.sciencemag.org/content/326/5956/1079.short
Apart from that I think the main problem with the study is that the sample size is way too small to draw any strong conclusions.
Most of the other remarks are probably due to the very hard space restrictions in Science and would not be that bad if it weren't for the small sample size.
I still think it is an interesting study, but has to be treated with caution. Let's see if it replicates.
I am not a statistics expert either, but it does seem a bit sketchy. Specifically: the small sample, the dramatically low number of participants per condition, the high exclusion rate, and more.
Why did they use SEM bars in their graphs? They are very 'convenient' for small samples, as they tend to be very small; is that the reason? An SD bar or 95% CI bar would be more informative. Furthermore, if I remember correctly, you cannot deduce anything from SEM bars that do not overlap. If SEM bars overlap, the difference cannot be significant, but if they don't, the difference may or may not be significant.
Other than the low number of participants, the whole experiment seems to falter on the high initial difference in IAT scores. Like you said, this has serious implications, as the entire study might be explainable by regression to the mean.
It seems to me that a lot of the studies cited in that review article (which I admit I hadn't read until now) involve rats, or refer to *conscious* memories (I don't think that there's any suggestion that people forget that they don't want to be sexist/racist), or use odour instead of sound as cues, etc.
But I'm happy to agree with you that the main problem is that the sample size is way too small. The problem is that Science accepted it, which, as far as the media is concerned, means that it is now The Official Truth As Validated By One Of The World's Most Prestigious Journals. These findings will be added to the pile of Stuff People Put In Business Books, which orbits the periphery of academia for ever, occasionally bumping into a china teapot.
The irony is that it takes four times the number of studies to get into a journal with a quarter of the impact factor (JPSP).