23 June 2015

Mechanical Turk: Amazon's new charges are not the biggest problem

Twitter was buzzing, or something, this morning, with the news that Amazon is going to change the commission rates that it charges researchers who use Mechanical Turk (henceforth: MTurk) participants to take surveys, quizzes, personality tests, etc.

(This blog post contains some MTurk jargon.  My previous post was way too long because I spent too much time summarising what someone else had written, so if you don't know anything about MTurk concepts, read this.)

The changes to Amazon's rates, effective July 21, 2015, are listed here, but since that page will probably change after July, I took a screenshot:

Here's what this means.  Currently, if you hire 100 people to fill in your survey and want to give them $1 each, you pay Amazon $110 for "regular" workers and $130 for "Masters".  Under the new pricing scheme, this will be $140 and $145, respectively.  That's an increase of 27.3% and 11.5%, respectively.  (I'm assuming, first, that the wording about "10 or more assignments" means "10 or more instances of the HIT being executed, not necessarily by the same worker", and second, that any psychological survey will need more than 10 assignments.)

Twitter users were quite upset about this.  Someone portrayed this as a "400% increase", which is either a typo, or a miscalculation (Amazon's commission for "regular" workers is going from 10% to 40%, which even expressed as "$10 to $40 on a $100 survey" is actually a 300% increase), or a misunderstanding (the actual increase in cost for the customer is noted in the previous paragraph).  People are talking of using this incident as a reason to start a new, improved platform, possibly creating an international participant pool.

Frankly, I think there is a lot of heat and not much light being generated here.

First, researchers are going to have to face up to the fact that by using MTurk, they are typically exploiting sub-minimum wage labour.  (There are, of course, honourable exceptions, who try to ensure that online survey takers are fairly remunerated.)  The lowest wage rate I've personally seen in the literature was a study that paid over 100 workers the princely sum of $0.25 each for a task that took 20 minutes to complete.  Either those people are desperately poor, or they are children looking for pocket money, or they are people who just really, really like being involved in research, to an extent that might make some people wonder about selection bias.

I have asked researchers in the past how they felt about this exploitation, and the standard answer has been, "Well, nobody's forcing them to do it".  The irony of social psychologists --- who tend not to like it when someone points out that they overwhelmingly self-identify as liberal and this is not necessarily neutral for science --- invoking essentially the same arguments as exploitative corporations for not paying people adequately for their time, is wondrous to behold.  (It's not unique to academia, though.  I used to work at an international organisation, dedicated to human rights and the rule of law, where some managers who made six-figure tax-free salaries were constantly looking for ways to get interns to do the job of assistants, or have technical specialists agree to work for several months for nothing until funding "maybe" came through for their next contract.)

Second, I have doubts about the validity of the responses from MTurk workers.  Some studies have shown that they can perform as well as college students, although maybe it's best to take on the "Master"-level workers, whose price is only going up 11.5%; and I'm not sure that college students ought to be regarded as the best benchmark [PDF] here.  But there are technical problems, such as issues with non-independence of data [PDF] --- if you put three related surveys out there, there's a good chance that many of the same people may be answering them --- and the population of MTurk workers is a rather strange and unrepresentative bunch of people; the median participant in your survey has already completed 300 academic tasks, including 20 in the past week.  One worker completed 830,000 MTurk HITs in 9 years; if you don't want to work out how many minutes per HIT that represents assuming she worked for 16 hours a day, 365 days a year, here's the answer.  Workers are overwhelmingly likely to come from one of just two countries, the USA and India, presumably because those are the countries where you can get paid in real cash money; MTurk workers in other countries just get credit towards an Amazon gift card (which, when I tried to use it, could only be redeemed on the US site, amazon.com, thus incurring shipping and tax charges when buying goods in Europe).  Maybe this is better than having your participants being all from just one country, but since you don't know what the mix of countries is (unless you specify that the HIT will only be shown in one country), you can't even make claims about the degree of generalisability of your results.

Third, this increase really does not represent all that much money.  If you're only paying $33 to run 120 participants at $0.25, you can probably afford to pay $42.  That $9 increase is less than you'll spend on doughnuts at the office mini-party when your paper gets accepted (but it won't go very far towards building, running, and paying the electricity bill for your alternative, post-Amazon solution).  And let's face it, if these commission rates had been in place from the start, you'd have paid them; the actual increase is irrelevant, just like it doesn't matter when you pay $20 for shipping on a $2 item from eBay if the alternative is to spend $30 with "free" shipping.  All those people tweeting "Goodbye Amazon" aren't really going to switch to another platform.  At bottom, they're just upset because they discovered that a corporation with a monopoly will exploit it, as if they really, really thought that things were going to be different this time (despite everyone knowing that Amazon abuses its warehouse workers and has a history of aggressive tax avoidance).  Indeed, the tone of the protests is remarkable for its lack of direct criticism of Amazon, because that would require an admission that researchers have been complicit with its policies, to an extent that I would argue goes far beyond the average book buyer.  (Disclosure: I'm a hypocrite who orders books or other goods from Amazon about four times a year. I have some good and more bad justifications for that, but basically, I'm not very political, the points made above notwithstanding.)

Bottom line: MTurk is something that researchers can, and possibly (this is not a blog about morals) "should", be able to do without.  Its very existence as a publicly available service seems to be mostly a matter of chance; Amazon doesn't spend much effort on developing it, and it could easily disappear tomorrow.  It introduces new and arguably unquantifiable distortions into research in fields that already have enough problems with validity.  If this increase in prices led to people abandoning it, that might be a good thing.  But my guess is that they won't.

Acknowledgement: Thanks to @thosjleeper for the links to studies of MTurk worker performance.

05 June 2015

Dream on: Playing pinball in your sleep does not make you a better person

(Note: this is more or less my first solo foray into unaided statistical and methodological criticism.  Normally I hitch a ride on the coat-tails of my more experienced co-authors, hoping that they will spot and stop my misunderstandings.  In this case, I haven't asked  anybody to do that for me, so if this post turns out to be utter garbage, I will have only myself to blame.  But it probably won't kill me, so according to the German guy with the fancy moustache, it will make me stronger.)

Among all the LaCour kerfuffle last week, this article by Hu et al. in Science seems to have slipped by with relatively little comment on social media.  That's a shame, because it seems to be a classic example of how fluffy articles in vanity journals can arguably do more damage to the cause of science than outright fraud.

I first noticed Hu et al.'s article in the BBC app on my tablet.  It was the third article in the "World News" section.  Not the Science section, or the Health section (for some reason, the BBC's write-up was done by their Health correspondent, although what the study has to do with health is not clear); apparently this was the third most important news story in the world on May 29, 2015.

Hu et al.'s study ostensibly shows that certain kinds of training can be reinforced by having sounds played to you while you sleep.  This is the kind of thing the media loves.  Who cares if it's true, or even plausible, when you can claim that "The more you sleep, the less sexist and racist you become", something that is not even suggested in the study?  (That piece of crap comes from the same newspaper that has probably caused several deaths down the line by scaremongering about the HPV vaccine; see here for an excellent rebuttal.)  After all, it's in Science (aka "the prestigious journal, Science"), so it must be true, right?  Well, let's see.

Here's what Hu et al. did.  First, they had their participants take the Implicit Association Test (IAT).  The IAT is, very roughly speaking, a measure of the extent to which you unconsciously endorse stereotypically biased attitudes, e.g. (in this case) that women aren't good at science, or Black people are bad.  If you've never taken the IAT, I strongly recommend that you try it (here; it's free and anonymous); you may be shocked by the results, especially if (like almost everybody) you think you're a pretty open-minded, unbigoted kind of person.  Hu et al.'s participants took the IAT twice, and their baseline degree of what I'll call for convenience "sexism" (i.e., the association of non-sciencey words with women's faces; the authors used the term "gender bias", which may be better, but I want an "ism") and "racism" (association of negative words with Black faces) was measured.

Next, Hu et al. had their participants undergo training designed to counter these undesirable attitudes. This training is described in the supplementary materials, which are linked to from the article (or you can save a couple of seconds by going directly here).  The key point was that each form of the training ("anti-sexism" and "anti-racism") was associated with its own sound that was played to the participants when they did something right.  You can find these sounds in the supplementary materials section, or play them directly here and here; my first thought is that they are both rather annoying, having seemingly been taken from a pinball machine, but I don't know if that's likely to have made a difference to the outcomes.

After the training session, the participants retook the IAT (for both sexism and racism), and as expected, performed better.  Then, they took a 90-minute nap.  While they were asleep, one of the sounds associated with their training was selected at random and played repeatedly to each of them; that is, half the participants had the sound from the "anti-sexism" part of the training played to them, and the other half had the sound from the "anti-racism" aspect played to them. The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author).

Now comes the key part of the article.  When the participants woke up from their nap, they took the IAT (again, for both sexism and racism) once more.  The authors claimed that people who were "cued" with the sound associated with the anti-sexism training during their nap further improved their performance on the "women and science" version of the test, but not the "negative attitudes towards Black people" version (the "uncued"training); similarly, those who were "cued" with the sound associated with the anti-racism training became even more unconsciously tolerant towards Black people, but not more inclined to associate women with science.  In other words, the sound that was played to them was somehow reinforcing the specific message that had been associated with that sound during the training period.

Finally, the authors had the participants return to their lab a week later, and take the IAT for both sexism and racism, one more time.  They found that performance had slipped --- that is, people did worse on both forms of the IAT, presumably as the effect of the training wore off --- but that this effect was greater for the "cued" than the "uncued" training topic.  In other words, playing the sound of one form of the training during their nap not only had a beneficial effect on people's implicit, unconscious attitudes (reinforcing their training), but this effect also persisted a whole week later.

So, what's the problem?   Reactions in the media, and from scientists who were invited to comment, concentrated on the potential to save the world from sexism and racism, with a bit of controversy as to whether it would be ethical to brainwash people in their sleep even if it were for such a good cause.  However, that assumes that the study shows what it claims to show, and I'm not at all convinced of that.

Let's start with the size of the study.  The authors reported a total of 40 participants; the supplementary materials mention that quite a few others were excluded, mostly because they didn't enter the "right" phase of sleep, or they reported hearing the cueing sound.  That's just 20 participants in each condition (cued or uncued), which is less than half the number you need to have 80% power to detect that men weigh more than women.  In other words, the authors seem to have found a remarkably faint star with their very small telescope [PDF].

The sample size problem gets worse when you examine the supplemental material and learn that the study was run with two samples; in the first, 21 participants survived the winnowing process, and then eight months later, 19 more were added.  This raises all sorts of questions.  First, there's a risk that something (even it was apparently insignificant: the arrangement of the computers in the IAT test room, the audio equipment used to play the sounds to the participants, the haircut of the lab assistant) changed between the first and second rounds of testing.  More importantly, though, we need to know why the researchers apparently chose to double their sample size.  Could it be because they had results that were promising, but didn't attain statistical significance?  They didn't tell us, but it's interesting to note that in Figures S2 and S3 of the supplemental material, they pointed out that the patterns of results from both samples were similar(*).  That doesn't prove anything, but it suggests to me that they thought they had an interesting trend, and decided to see if it would hold with a fresh batch of participants.  The problem is, you can't just peek at your data, see if it's statistically significant, and if not, add a few more participants until it is.  That's double-dipping, and it's very bad indeed; at a minimum, your statistical significance needs to be adjusted, because you had more than one try to find a significant result. Of course, we can't prove that the six authors of the article looked at their data; maybe they finished their work in July 2014, packed everything up, got on with their lives until February 2015, tested their new participants, and then opened the envelope with the results from the first sample.  Maybe.  (Or maybe the reviewers at Science suggested that the authors run some more participants, as a condition for publication.  Shame on them, if so; the authors had already peeked at their data, and statistical significance, or its absence, is one of those things that can't be unseen.)

The gee-whiz bit of the article, which the cynic in me suspects was at least partly intended for rapid consumption by naive science journalists, is Figure 1, a reasonably-sized version of which is available here.  There are a few problems with the clarity of this Figure from the start; for example, the blue bars in 1B and 1F look like they're describing the same thing, but they're actually slightly different in height, and it turns out (when you read the labels!) that in 1B, the left and right sides represent gender and race bias, not (as in all the other charts) cued and uncued responses.  On the other hand, the green bars in 1E and 1F both represent the same thing (i.e., cued/uncued IAT results a week after the training), as do the red bars in 1D and 1E, but not 1B (i.e., pre-nap cued/uncued IAT results).

Apart from that possible labelling confusion, Figure 1B appears otherwise fairly uncontroversial, but it illustrates that the effect (or at least, the immediate effect) of anti-sexism training is, apparently, greater than that of anti-racism training.  If that's true, then it would have been interesting to see results split by training type in the subsequent analyses, but the authors didn't report this.  There are some charts in the supplemental material showing some rather ambiguous results, but no statistics are given. (A general deficiency of the article is that the authors did not provide a simple table of descriptive statistics; the only standard deviation reported anywhere is that of the age of the participants, and that's in the supplemental material.  Tables of descriptives seem to have fallen out of favour in the age of media-driven science, but --- or "because"? --- they often have a lot to tell us about a study.)

Of all the charts, Figure 1D perhaps looks the most convincing.  It shows that, after their nap, participants' IAT performance improved further (compared to their post-training but pre-sleep results) for the cued training, but not for the uncued training (e.g., if the sound associated with anti-sexism training had been played during their nap, they got better at being non-sexist but not at being non-racist).  However, if you look at the error bars on the two red (pre-nap) columns in Figure 1D, you will see that they don't overlap.  This means that, on average, participants who were exposed to the sound associated with anti-sexism were performing significantly worse on the sexism component of the IAT than the racism component, and vice versa.  In other words, there was more room for improvement on the cued task versus the uncued task, and that improvement duly took place.  This suggests to me that regression to the mean is one possible explanation here.  Also, the significant difference (non-overlapping error bars) between the two red bars means that the authors' random assignment of people to the two different cues (having the "anti-sexism" or "anti-racism" training sound played to them) did not work to eliminate potential bias.  That's another consequence of the small sample size.

Similar considerations apply to Figure 1E, which purports to show that cued "learning" persisted a week afterwards.  Most notable about 1E, however, is what it doesn't show.  Remember, 1D shows the IAT results before and after the nap.  1E uses data from a week after the training, but it doesn't compare the IAT results from a week later with the ones from just after the nap; instead, it compares them with the results from just before the nap.  Since the authors seem to have omitted to display in graphical form the most direct effect of the elapsed week, I've added it here.  (Note: the significance stars are my estimate.  I'm pretty sure the one star on the right is correct, as the error bars just fail to overlap; on the left, there should be at least two stars, but I'm going to allow myself a moment of hyperbole and show three.  In any case, as you'll see in the discussion of Figure 1F, this is all irrelevant anyway.)

So, this extra panel (Figure 1E½?) could have been written up something like this: "Cueing during sleep did not result in sustained counterbias reduction; indeed, the cued bias increased very substantially between postnap and delayed testing [t(37) = something, P = very small], whereas the increase in the uncued bias during the week after postnap testing was considerably smaller [t(37) = something, P = 0.045 or thereabouts]."  However, Hu et al. elected not to report this.  I'm sure they had a good reason for that.  Lack of space, probably.

Combining 1D and 1E, we get this chart (no significance stars this time).  My "regression to the mean" hypothesis seems to find some support here.

Figure 1F shows that Hu et al. have committed a common fallacy in comparing two conditions on the basis of one showing a statistically significant effect and the other not (in fact, they committed this fallacy several times in their article, in their explanation of almost every panel of Figure 1).  They claimed that 1F shows that the effect of cued (versus uncued) training persisted after a week, because the improvement in IAT scores over baseline for the cued training (first blue column versus first green column) was statistically significant, whereas the corresponding improvement for the uncued training (second blue column versus second green column) was not.  Yet, as Andrew Gelman has pointed out in several blog posts with similar titles over the past few years, the difference between “statistically significant” and “not statistically significant” is not in itself necessarily statistically significant.  (He even wrote an article [PDF] on this, with Hal Stern.)  The question of interest here is whether the IAT performance for the topics (sexism or racism) of cued and uncued training, which were indistinguishable at baseline (the two blue columns) was different at the end of the study (the two green columns).  And. as you can see, the error bars on the two green columns overlap substantially; there is no evidence of a difference between them.

One other point to end this rather long post.  Have a look at Figure 2 and the associated description.  Maybe I'm missing something, but it looks to me as if the authors are proudly announcing how they went on a fishing expedition:
Neurophysiological activity during sleep—such as sleep spindles, slow waves, and rapid-eye-movement (REM) duration—can predict later memory performance (17). Accordingly, we explored possible relations between cueing-specific bias reduction and measures of sleep physiology. We found that only SWS × REM sleep duration consistently predicted cueing-specific bias reduction at 1 week relative to baseline (Fig. 2) [r(38) = 0.450, P = 0.005] (25).
They don't tell us how many combinations of parameters they tried to come up with that lone significant result; nor, in the next couple of paragraphs, do they give us any theoretical justification other than handwaving why the product of SWS and REM sleep duration (whose units, the label on the horizontal access of Figure 2 notwithstanding, are "square minutes", whatever that might mean) --- as opposed to the sum of these two numbers, or their difference, or their ratio, or any one of a dozen other combinations --- should be physiologically relevant.  Indeed, selecting the product has the unfortunate effect of making half of the results zero - I count 20 dots that aren't on the vertical axis, for 40 participants.  I'm going to guess that if you remove those zeroes (which surely cannot have any physiological meaning), the regression line is going to be a lot flatter than it is at present.

Bottom line: I have difficulty believing that there is anything to see here.  We can put off the debate about the ethics of subliminally improving people for a while, or at least rest assured that it's likely to remain an entirely theoretical problem.

(*) Incidentally, each red- or green-coloured column in one of the panes of Figure S3 corresponds to approximately five (5) participants.  You can't even detect that men are taller than women with that.