14 December 2015

My (current) position on the PACE trial

I have written this post principally for people who have started following me (formally on Twitter, or in some other way) because of my somewhat peripheral involvement in the PACE trial discussions.

First off, while I try to be reasonably politically correct, I don't always get all the details right.  I've tried to be respectful to all involved here.  In particular, someone told me that "CFS/ME" is not always an appropriate label to use.  I hope anyone who thinks that will allow me a pass on that, from my position of ignorance.

I've learned a lot about CFS/ME over the past few weeks.  Some of what I've been told --- but above all, what I've observed --- about how some of the science has been conducted, has disturbed me.  The people whose opinions I tend to trust on most issues, who usually put science ahead of their personal political position, seem to be pretty much unanimous that the PACE trial data need to be released so that disinterested parties can examine them.

But I want to make it clear that I have no specific interest in CFS/ME.  I don't personally know anyone who suffers from it, and it's not something I've really ever thought about much.  I don't especially want to become an advocate for patients, except to the extent that, having had my own health problems in the last couple of years, I wish every sick person a speedy recovery and access to the finest medical treatment they can get.  So I'm not sure I can even call myself an "ally"; allies have to take a non-trivial position, and I don't think my position here is much more than trivial.  If the PACE trial data emerge tomorrow, I will not personally be reanalysing them.  I don't know enough about this kind of study to do so.

What I do care about is the integrity of science.  You can see this, I hope, if you Google some of the stuff I've been doing in psychology.  Science, imperfect though it is, is about the only rational game in town when it comes to solving the problems facing society, and when scientists put their own interests above those of the wider community, it usually doesn't turn out well.

So, on to the PACE trial... I want to say that I can understand a lot of defensiveness on the part of the PACE researchers.  They have heard stories of others being harassed and even receiving death threats.  Maybe some of them have experienced this themselves.  For the purposes of this post (please bear with me!), I'm going to assume --- because I have no evidence to the contrary, and people generally don't make these accusations lightly --- that the stories of CFS/ME researchers being harassed in the past are true; arguably, for the purposes of this discussion, it doesn't make any difference whether they are true or not.  (Of course, in another context, such claims are very important, but let me try to put that aside for now.)  Apart from anything else, given the size of the CFS/ME community, it would be unreasonable not to expect that some fairly unpleasant people have also developed the condition.  We all know people like that, whatever our and their health status.  CFS/ME strikes people from all walks of life, including some saints and some sinners.

Now, with that said, I am unconvinced --- actually, "bewildered" would be a better word --- by the argument that releasing the data would somehow expose the researchers to (further) harassment.  Indeed, it seems to me that withholding the data plays directly into the hands of those who claim that the PACE researchers have "something to hide", and they are presumably the most likely to escalate their anger into harassment.  I actually don't believe that the researchers have anything to hide, in the sense of feeling guilty because they did something bad in their analyses.  I've seen enough cases like this in my working life to know that incompetence --- generally in the form of a misplaced sense of loyalty to a group rather than to the wider truth and public interest --- is always to be preferred as an alternative explanation to malice, first because malice is harder to prove, and second because it almost always turns out to be the case that incompetence was behind a screw-up.

About the only reason I can sort of imagine for the argument that releasing the data might lead to harassment of the researchers, is if the alternative were for the question to somehow go away.  That's perhaps a reasonable argument with some political issues; for example, there is (I think) a legitimate debate to be had over whether it's helpful to reproduce, say, cartoons that might cause people to get over-excited, when they could just be left to one side.  But that's simply not going to happen here.  People with a chronic, debilitating condition, and no cure in sight, are not going to suddenly forget tomorrow that they have that condition.  So far, none of the replies to people who have asked for the data, and been told it will lead to harassment, have explained the mechanism by which that is supposed to happen.

The researchers' argument also seems to conflate the presence in the CFS/ME activist community of some unpleasant people --- which, again, for the sake of this discussion, I will assume is probably true --- with the idea that "anyone from the CFS/ME activist community who asks about PACE is probably trying to harass us".  This is not good logic.  It's what leads airline passengers to demand that Muslim passengers be thrown off their plane.  It's called the base rate fallacy, and avoiding it is supposed to be what scientists --- particularly, for goodness sake, scientists involved in epidemiology --- are good at.

A further problem with the arguments that a request for the data --- whether it comes from patients with scientific training, or scientists such as Jim Coyne --- is designed to be "vexatious" or to "lack serious purpose" or that its intent is "polemical" (all terms used by King's in their reply to Coyne), is that such arguments are utterly unfalsifiable.  Given the public profile of this matter, essentially anyone who asks for the data is going to have their credentials examined, and unless they meet the unspecified high standards of the researchers, they won't get to see the data.  (Yes, Jim Coyne --- who, full disclosure, is my PhD supervisor --- can be a bit shouty at times.  But this is not kindergarten.  Scientists don't get to withhold data from other scientists just because they don't play nice.  Ask any scientist if science is about robust disagreement and you will get a "Yes", but if that idealism isn't maintained when actual robust disagreement takes place, then we might as well conduct the whole process through everything-is-fine press releases.)

Actually, in their reply to Coyne, King's College did seem to give a hint as to who might be allowed to see the data, in their statement "We would expect any replication of data to be carried out by a trained Health Economist", with a nice piece of innuendo carried over from the preceding sentence that this health economist had better have a lot of free time, because the original analysis took a year to complete.  This suggests that unless you declare your qualifications as an unemployed health economist, you aren't going to be judged worthy to see the data (and if you come up with conclusions after a week, it might well be suggested that you didn't look hard enough). But the idea that it will take a year, or indeed need specialised training in health economics, to determine whether the Fisher's exact tests from the contingency tables were calculated correctly, or whether the results really show that people got better over the course of the study, is absurd.  Apart from anything else, science is about communicating your results in a coherent manner to the rest of the scientific community.  If you submit an article and then claim that its principal conclusions cannot be verified except by a few dozen highly trained specialists with a year's effort, that's an admission right there that your article has failed.  Of course there will be questions of interpretation, over things like what "getting better" means, but nobody should have to accept the researchers' claims that their interpretation is the right one.  There needs to be a debate, so that a consensus, if one is possible, can emerge.  (Who knows?  Maybe the evidence for CBT is overwhelming.  There are plenty of neutral scientists who can reach a fair conclusion about that, but right now, they are being deprived of the opportunity to do so.)

A further point about the failure to share data is that the researchers agreed, when they published in PLoS ONE, to make their data available to anyone who asked for it.  This is a condition of publishing in that journal.  You can't have the cake of "we're transparent, we published in an open access journal" and then eat that cake too with "but you can't see the data".  PLoS ONE must insist that the authors release the data as they agreed to do as a condition of publication, or else retract the article because their conditions of publication have been breached.  See Klaas van Dijk's formal request in this regard.

These data are undoubtedly going to come out at some point anyway.  The UK's Information Commissioner will see to that, even if PLoS ONE doesn't persuade the authors to release the data.  As the risk management specialist Peter Sandman points out, openness and transparency at the earliest possible stage translate into reduced pain and costs further down the line.

I want to end with a small apology.  I wrote a post yesterday on an unrelated topic (OK, it was also critical of some poor science, but the relation with the subject of this post was peripheral).  Two people submitted comments on that post which drew a link with the PACE trial.  After some thought, I decided not to publish those comments, as I wanted to keep discussion on that other post on-topic.  I apologise to the authors of those comments that Blogger.com's moderation system did not let me explain the reasons why they were not published.  I would happily publish those same comments on this post; indeed, I will publish pretty much any reasonable comments on this post.

13 December 2015

Digging further into the Bos and Cuddy study

*** Post updated 2015-12-19 20:00 UTC
*** See end of post for a solution that matches the reported percentages and chi-squares.
A few days ago, I blogged about Professor Amy Cuddy's op-ed piece in the New York Times, in which she cited a non-published, non-peer-reviewed study about "iPosture" by Bos and Cuddy of how people allegedly deferred more to authority when they used smaller (versus larger) computing devices, because using smaller devices caused them to hunch (sorry, "iHunch") more, and then something something assertiveness something something testosterone and cortisol something.  (The authors apparently didn't do anything as radical as to actually measure, or even observe, how much people hunched, if at all; they took it for granted that "smaller device = bigger iHunch", so that the only possible explanation for the behaviours they observed was the one they hypothesized.  As I noted in that other post, things are so much easier if you bypass peer review.)

Just for fun, I thought I'd try and reconstruct the contingency tables for "people staying on until the experimenter came and asked them to leave the room" from the Bos and Cuddy article, mainly because I wanted to make my own estimate of the effect size.  Bos and Cuddy reported this as "[eta] = .374", but I wanted to experiment with other ways of measuring it.

In their Figure 1, which I have taken the liberty of reproducing below (I believe that this is fair use, according to Harvard's Open Access Policy, which is to be found here), Bos and Cuddy reported (using the dark grey bars) the percentage of participants who left the room to go and collect their pay, before the experimenter returned.  Those figures are 50%, 71%, 88%, and 94%.  The authors didn't specify how many participants were in each condition, but they had 75 people and 4 conditions (phone, tablet, laptop, desktop), and they stated that they randomised each participant to one condition.  So you would expect to find three groups of 19 participants and one of 18.

However, it all gets a bit complicated here.  It's not possible to obtain all four of the percentages that were reported (50%, 71%, 88%, and 94%), rounded conventionally, from a whole number of participants out of 18 or 19.  Specifically, you can take 9 out of 18 and get 50%, or you can take 17 out of 18 and get 94% (0.9444, rounded down), but you can't get 71% or 88%, with either 18 or 19 as the cell size.  So that suggests that the groups must have been of uneven size.  I enumerated all the possible combinations of four cell sizes from 13 to 25 which added up to 75 and also allowed for the percentages of participants who left the room, correctly rounded, to be one of the integers we're looking for.  Here are those possible combinations, with the total numbers of participants first and the percentage and number of leavers in parentheses:

14 (50%=7), 21 (71%=15), 24 (88%=21), 16 (94%=15)
18 (50%=9), 24 (71%=17), 16 (88%=14), 17 (94%=16)
20 (50%=10), 21 (71%=15), 16 (88%=14), 18 (94%=17)
20 (50%=10), 14 (71%=10), 24 (88%=21), 17 (94%=16)
22 (50%=11), 21 (71%=15), 16 (88%=14), 16 (94%=15)

Well, I guess that's also "randomised" in a sense.  But if your sample sizes are uneven like this, and you don't report it, you're not helping people to understand your experiment.
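For anyone who wants to check my working, here is a sketch of that enumeration in Python (the function names are my own, and I have implemented "conventional" rounding as round-half-up):

```python
from itertools import product

TARGETS = (50, 71, 88, 94)  # reported percentages, in condition order

def pct(k, n):
    # percentage of n represented by k, rounded conventionally (half-up)
    return int(100 * k / n + 0.5)

def feasible(n, target):
    # whole numbers of leavers out of n whose rounded percentage matches target
    return [k for k in range(n + 1) if pct(k, n) == target]

def search(lo=13, hi=25, total=75):
    # ordered assignments of cell sizes summing to `total` in which every
    # cell can produce its reported percentage from a whole number of leavers
    return [sizes
            for sizes in product(range(lo, hi + 1), repeat=4)
            if sum(sizes) == total
            and all(feasible(n, t) for n, t in zip(sizes, TARGETS))]
```

As a bonus, `feasible(18, 71)`, `feasible(19, 71)`, `feasible(18, 88)`, and `feasible(19, 88)` all come back empty, which is the point made above about (near-)equal groups of 18 and 19; and the search also surfaces the assignment mentioned in the update at the end of this post.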

But maybe they still round their numbers by hand at Harvard for some reason, and sometimes they make mistakes.  So let's see if we can get to within one point of those percentages (49% or 51% instead of 50%, 70% or 72% instead of 71%, etc).  And it turns out that we can, just, as shown in the figure below, in which yellow cells are accurately-reported percentages, and orange cells are "off by one".  We can take 72% for N=18 instead of 71%, and 89% for N=19 instead of 88%.  But then, we only have a sample size of 73.  So we could allow another error, replacing 94% for N=18 with 95% for N=19, and get up to a sample of 74.  Still not right.  So, even allowing for three of their four percentages to be misreported, the per-cell sample sizes must have been unequal.

However, if I was going to succeed in my original aim of reconstructing plausible contingency tables, there would be too many combinations to enumerate if I included these "off-by-one" percentages.  So I went back to the five possible combinations of numbers that didn't involve a reporting error in the percentages, and computed the chi-square values for the contingency tables implied by those numbers, using the online calculator here.  They came out between 10.26 and 12.37, with p values from .016 to .006; this range brackets the numbers reported by Bos and Cuddy (chi-square 11.03, p = .012), but none of them matches those values exactly; the closest is the last set (22, 21, 16, 16) with a chi-square of 11.22 and a p of .011.
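If you don't trust online calculators, the chi-square for one of these 2 x 4 tables is straightforward to compute directly; the sketch below (pure Python, my own function names) also derives the p value from the closed-form survival function of a chi-square distribution with 3 degrees of freedom:

```python
import math

def chi_square_2xc(leavers, sizes):
    # Pearson chi-square for a 2 x c table of leavers vs. stayers,
    # plus the p value for df = c - 1 = 3 (four conditions)
    stayers = [n - k for n, k in zip(sizes, leavers)]
    grand = sum(sizes)
    chi2 = 0.0
    for row in (leavers, stayers):
        row_total = sum(row)
        for observed, n in zip(row, sizes):
            expected = row_total * n / grand
            chi2 += (observed - expected) ** 2 / expected
    # survival function of the chi-square distribution with 3 df
    p = (math.erfc(math.sqrt(chi2 / 2))
         + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))
    return chi2, p
```

For example, a table with 11, 15, 14, and 15 leavers in groups of 22, 21, 16, and 16 gives approximately 11.22 and .011, matching the figures above.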

So, I'm going to tentatively presume that in fact the sample sizes were all equal (give or take one for not having a number of participants divisible by four), and it's in fact the percentages on the dark grey bars in Bos and Cuddy's Figure 1 that are wrong.  For example, if I build this contingency table:

Leavers      9    14    16    18
Stayers      9     5     3     1
% Leavers   50%   74%   84%   95%

then the sample size adds up to 75, the per-condition sample sizes are equal, and the chi-square is 11.086 and the p value is .0113.  That was the closest I could get to the values of 11.03 and .012 in the article, although of course I could have missed something.  These numbers are close enough, I guess, although I'm not sure if I'd want to get on an aircraft built with this degree of attention to detail; we still have inaccuracies in three of the four percentages as well as the approximate chi-square statistic and p value.

Normally in circumstances like this, I'd think about leaving a comment on the article on PubPeer.  But it seems that, in bypassing the normal academic publishing process, Professor Cuddy has found a brilliant way of avoiding, not just regular peer review, but post-publication peer review as well.  In fact, unless the New York Times directs its readers to my blog (or another critical review) for some reason, Bos and Cuddy's study is impregnable by virtue of not existing in the literature.

PS:  This tweet, about the NY Times article, makes an excellent point:
"Presumably we should all adopt the wide, expansive pose of the broadsheet newspaper reader."  Come to think of it, in much of the English-speaking world at least, broadsheets are typically associated with higher status than tabloids.  Psychologists! I've got a study for you...

PPS: The implications of the light grey bars, showing the mean time taken to leave the room by those who didn't stay for the full 10 minutes, are left as an exercise for the reader.  In the absence of standard deviations (unless someone wants to reconstruct possible values for those from the ANOVA), perhaps we can't say very much, but it's interesting to try and construct numbers that match those means.

*** Update 2015-12-19 20:00 UTC: An alert reader has pointed out that there is another possible assignment of subjects to the conditions:
16 (50%=8), 24 (71%=17), 17 (88%=15), 18 (94%=17)
This gives the Chi-square of 11.03 and p of .012 reported in the article.
So I guess my only remaining complaint (apart from the fact that the article is being used to sell a book without having undergone peer review) is that the uneven cell sizes per condition were not reported.  This is actually a surprisingly common problem, even in the published literature.

A cute story to be told, and self-help books to be sold - so who needs fuddy-duddy peer review?

Daniel Kahneman's warning of a looming train wreck in social psychology took another step closer towards realisation today with the publication of this opinion piece in the New York Times.

In the article, entitled "Your iPhone Is Ruining Your Posture — and Your Mood", Professor Amy Cuddy of Harvard Business School reports on "preliminary research" (available here) that she performed with her colleague, Maarten Bos.  Basically, they gave some students some Apple gadgets to play with, ranging in size from an iPhone up to a full-size desktop computer.  The experimenter gave the participants some filler tasks, and then left, telling them that s/he would be back in five minutes to debrief and pay them, but that they could also come and get him/her at the desk outside.  S/he then didn't come back after five minutes as announced, but instead waited ten minutes.  The main outcome variable was whether the participants came to get their money, and if they did how long they waited before doing so, as a function of the size of the device that they had.  This was portrayed as a measure of their assertiveness, or lack thereof.

It turned out that, the smaller the device, the longer they waited, thus showing reduced assertiveness.  The authors' conclusion was that this was caused by the fact that, to use a smaller device, participants had to slouch over more.  The authors even have a cute name for this: the "iHunch".  And — drumroll please, here's the social priming bit — the fact that the participants with smaller devices were hunched over more made them more submissive to authority, which made them more reluctant to go and tell the researcher that they were ready to get paid their $10 participation fee and go home.

It's hard to know where to begin with this.  There are other plausible explanations, starting with the fact that a lot of people don't have an iPhone and might well enjoy playing with one compared to their Android phone, whereas a desktop computer is still just a desktop computer, even if it is a Mac.  And the effect size was pretty large: the partial eta-squared of the headline result is .177, which should be compared to Cohen's (1988) description of a partial eta-squared of .14 as a "large" effect.  Oh, and there were 75 participants in four conditions, making a princely 19 per cell.  In other words, all the usual suspect features of priming studies.

But what I find really annoying here is that we've gone straight from "preliminary research" to the New York Times without any of those awkward little academic niceties such as "peer review".  The article, in "working paper" form (1,000 words) is here; check out the date (May 2013) and ask yourself why this is suddenly front-page news when, after 30 months, the authors don't seem to have had time to write a proper article and send it to a journal, although one of them did have time to write 845 words for an editorial in the New York Times.  But perhaps those 845 words didn't all have to be written from scratch, because — oh my, surprise surprise — Professor Cuddy is "the author of the forthcoming book 'Presence: Bringing Your Boldest Self to Your Biggest Challenges.'"  Anyone care to take a guess as to whether this research will appear in that book, and whether its status as an unreviewed working paper will be prominently flagged up?

If this is the future — writing up your study pro forma and getting it into what is arguably the world's leading newspaper, complete with cute message that will appeal to anyone who thinks that everybody else uses their smartphone too much — then maybe we should just bring on the train wreck now.

*** Update 2015-12-17 09:50 UTC: I added a follow-up post here. ***

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

23 June 2015

Mechanical Turk: Amazon's new charges are not the biggest problem

Twitter was buzzing, or something, this morning, with the news that Amazon is going to change the commission rates that it charges researchers who use Mechanical Turk (henceforth: MTurk) participants to take surveys, quizzes, personality tests, etc.

(This blog post contains some MTurk jargon.  My previous post was way too long because I spent too much time summarising what someone else had written, so if you don't know anything about MTurk concepts, read this.)

The changes to Amazon's rates, effective July 21, 2015, are listed here, but since that page will probably change after July, I took a screenshot:

Here's what this means.  Currently, if you hire 100 people to fill in your survey and want to give them $1 each, you pay Amazon $110 for "regular" workers and $130 for "Masters".  Under the new pricing scheme, this will be $140 and $145, respectively.  That's an increase of 27.3% and 11.5%, respectively.  (I'm assuming, first, that the wording about "10 or more assignments" means "10 or more instances of the HIT being executed, not necessarily by the same worker", and second, that any psychological survey will need more than 10 assignments.)
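In code, using commission rates inferred from those worked figures (10% and 30% before the change, 40% and 45% after; this is my reading of Amazon's table, not its exact wording):

```python
def total_cost(worker_pay, commission):
    # What the requester pays: worker pay plus Amazon's cut.
    return worker_pay * (1 + commission)

OLD = {"regular": 0.10, "masters": 0.30}  # inferred pre-change rates
NEW = {"regular": 0.40, "masters": 0.45}  # inferred post-change rates

PAY = 100  # 100 workers at $1 each

for tier in ("regular", "masters"):
    before = total_cost(PAY, OLD[tier])
    after = total_cost(PAY, NEW[tier])
    rise = 100 * (after - before) / before
    print(f"{tier}: ${before:.0f} -> ${after:.0f} (+{rise:.1f}%)")
```

This prints increases of 27.3% and 11.5%, matching the numbers above.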

Twitter users were quite upset about this.  Someone portrayed this as a "400% increase", which is either a typo, or a miscalculation (Amazon's commission for "regular" workers is going from 10% to 40%, which even expressed as "$10 to $40 on a $100 survey" is actually a 300% increase), or a misunderstanding (the actual increase in cost for the customer is noted in the previous paragraph).  People are talking of using this incident as a reason to start a new, improved platform, possibly creating an international participant pool.

Frankly, I think there is a lot of heat and not much light being generated here.

First, researchers are going to have to face up to the fact that by using MTurk, they are typically exploiting sub-minimum wage labour.  (There are, of course, honourable exceptions, who try to ensure that online survey takers are fairly remunerated.)  The lowest wage rate I've personally seen in the literature was a study that paid over 100 workers the princely sum of $0.25 each for a task that took 20 minutes to complete.  Either those people are desperately poor, or they are children looking for pocket money, or they are people who just really, really like being involved in research, to an extent that might make some people wonder about selection bias.

I have asked researchers in the past how they felt about this exploitation, and the standard answer has been, "Well, nobody's forcing them to do it".  The irony of social psychologists --- who tend not to like it when someone points out that they overwhelmingly self-identify as liberal and this is not necessarily neutral for science --- invoking essentially the same arguments as exploitative corporations for not paying people adequately for their time, is wondrous to behold.  (It's not unique to academia, though.  I used to work at an international organisation, dedicated to human rights and the rule of law, where some managers who made six-figure tax-free salaries were constantly looking for ways to get interns to do the job of assistants, or have technical specialists agree to work for several months for nothing until funding "maybe" came through for their next contract.)

Second, I have doubts about the validity of the responses from MTurk workers.  Some studies have shown that they can perform as well as college students, although maybe it's best to take on the "Master"-level workers, whose price is only going up 11.5%; and I'm not sure that college students ought to be regarded as the best benchmark [PDF] here.  But there are technical problems, such as issues with non-independence of data [PDF] --- if you put three related surveys out there, there's a good chance that many of the same people may be answering them --- and the population of MTurk workers is a rather strange and unrepresentative bunch of people; the median participant in your survey has already completed 300 academic tasks, including 20 in the past week.  One worker completed 830,000 MTurk HITs in 9 years; if you don't want to work out how many minutes per HIT that represents assuming she worked for 16 hours a day, 365 days a year, here's the answer.  Workers are overwhelmingly likely to come from one of just two countries, the USA and India, presumably because those are the countries where you can get paid in real cash money; MTurk workers in other countries just get credit towards an Amazon gift card (which, when I tried to use it, could only be redeemed on the US site, amazon.com, thus incurring shipping and tax charges when buying goods in Europe).  Maybe this is better than having your participants being all from just one country, but since you don't know what the mix of countries is (unless you specify that the HIT will only be shown in one country), you can't even make claims about the degree of generalisability of your results.
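For the record, here is that minutes-per-HIT calculation, under the 16-hours-a-day, 365-days-a-year assumption stated above:

```python
HITS_COMPLETED = 830_000
YEARS, HOURS_PER_DAY = 9, 16

# Total working minutes over 9 years at 16 hours a day, no days off.
minutes_available = YEARS * 365 * HOURS_PER_DAY * 60  # 3,153,600 minutes
minutes_per_hit = minutes_available / HITS_COMPLETED  # about 3.8 minutes
```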

Third, this increase really does not represent all that much money.  If you're only paying $33 to run 120 participants at $0.25, you can probably afford to pay $42.  That $9 increase is less than you'll spend on doughnuts at the office mini-party when your paper gets accepted (but it won't go very far towards building, running, and paying the electricity bill for your alternative, post-Amazon solution).  And let's face it, if these commission rates had been in place from the start, you'd have paid them; the actual increase is irrelevant, just like it doesn't matter when you pay $20 for shipping on a $2 item from eBay if the alternative is to spend $30 with "free" shipping.  All those people tweeting "Goodbye Amazon" aren't really going to switch to another platform.  At bottom, they're just upset because they discovered that a corporation with a monopoly will exploit it, as if they really, really thought that things were going to be different this time (despite everyone knowing that Amazon abuses its warehouse workers and has a history of aggressive tax avoidance).  Indeed, the tone of the protests is remarkable for its lack of direct criticism of Amazon, because that would require an admission that researchers have been complicit with its policies, to an extent that I would argue goes far beyond the average book buyer.  (Disclosure: I'm a hypocrite who orders books or other goods from Amazon about four times a year. I have some good and more bad justifications for that, but basically, I'm not very political, the points made above notwithstanding.)

Bottom line: MTurk is something that researchers can, and possibly (this is not a blog about morals) "should", be able to do without.  Its very existence as a publicly available service seems to be mostly a matter of chance; Amazon doesn't spend much effort on developing it, and it could easily disappear tomorrow.  It introduces new and arguably unquantifiable distortions into research in fields that already have enough problems with validity.  If this increase in prices led to people abandoning it, that might be a good thing.  But my guess is that they won't.

Acknowledgement: Thanks to @thosjleeper for the links to studies of MTurk worker performance.

05 June 2015

Dream on: Playing pinball in your sleep does not make you a better person

(Note: this is more or less my first solo foray into statistical and methodological criticism.  Normally I hitch a ride on the coat-tails of my more experienced co-authors, hoping that they will spot and stop my misunderstandings.  In this case, I haven't asked anybody to do that for me, so if this post turns out to be utter garbage, I will have only myself to blame.  But it probably won't kill me, so according to the German guy with the fancy moustache, it will make me stronger.)

Among all the LaCour kerfuffle last week, this article by Hu et al. in Science seems to have slipped by with relatively little comment on social media.  That's a shame, because it seems to be a classic example of how fluffy articles in vanity journals can arguably do more damage to the cause of science than outright fraud.

I first noticed Hu et al.'s article in the BBC app on my tablet.  It was the third article in the "World News" section.  Not the Science section, or the Health section (for some reason, the BBC's write-up was done by their Health correspondent, although what the study has to do with health is not clear); apparently this was the third most important news story in the world on May 29, 2015.

Hu et al.'s study ostensibly shows that certain kinds of training can be reinforced by having sounds played to you while you sleep.  This is the kind of thing the media loves.  Who cares if it's true, or even plausible, when you can claim that "The more you sleep, the less sexist and racist you become", something that is not even suggested in the study?  (That piece of crap comes from the same newspaper that has probably caused several deaths down the line by scaremongering about the HPV vaccine; see here for an excellent rebuttal.)  After all, it's in Science (aka "the prestigious journal, Science"), so it must be true, right?  Well, let's see.

Here's what Hu et al. did.  First, they had their participants take the Implicit Association Test (IAT).  The IAT is, very roughly speaking, a measure of the extent to which you unconsciously endorse stereotypically biased attitudes, e.g. (in this case) that women aren't good at science, or Black people are bad.  If you've never taken the IAT, I strongly recommend that you try it (here; it's free and anonymous); you may be shocked by the results, especially if (like almost everybody) you think you're a pretty open-minded, unbigoted kind of person.  Hu et al.'s participants took the IAT twice, and their baseline degree of what I'll call for convenience "sexism" (i.e., the association of non-sciencey words with women's faces; the authors used the term "gender bias", which may be better, but I want an "ism") and "racism" (association of negative words with Black faces) was measured.

Next, Hu et al. had their participants undergo training designed to counter these undesirable attitudes. This training is described in the supplementary materials, which are linked to from the article (or you can save a couple of seconds by going directly here).  The key point was that each form of the training ("anti-sexism" and "anti-racism") was associated with its own sound that was played to the participants when they did something right.  You can find these sounds in the supplementary materials section, or play them directly here and here; my first thought is that they are both rather annoying, having seemingly been taken from a pinball machine, but I don't know if that's likely to have made a difference to the outcomes.

After the training session, the participants retook the IAT (for both sexism and racism), and as expected, performed better.  Then, they took a 90-minute nap.  While they were asleep, one of the sounds associated with their training was selected at random and played repeatedly to each of them; that is, half the participants had the sound from the "anti-sexism" part of the training played to them, and the other half had the sound from the "anti-racism" aspect played to them. The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author).

Now comes the key part of the article.  When the participants woke up from their nap, they took the IAT (again, for both sexism and racism) once more.  The authors claimed that people who were "cued" with the sound associated with the anti-sexism training during their nap further improved their performance on the "women and science" version of the test, but not the "negative attitudes towards Black people" version (the "uncued" training); similarly, those who were "cued" with the sound associated with the anti-racism training became even more unconsciously tolerant towards Black people, but not more inclined to associate women with science.  In other words, the sound that was played to them was somehow reinforcing the specific message that had been associated with that sound during the training period.

Finally, the authors had the participants return to their lab a week later, and take the IAT for both sexism and racism, one more time.  They found that performance had slipped --- that is, people did worse on both forms of the IAT, presumably as the effect of the training wore off --- but that this effect was greater for the "cued" than the "uncued" training topic.  In other words, playing the sound of one form of the training during their nap not only had a beneficial effect on people's implicit, unconscious attitudes (reinforcing their training), but this effect also persisted a whole week later.

So, what's the problem?   Reactions in the media, and from scientists who were invited to comment, concentrated on the potential to save the world from sexism and racism, with a bit of controversy as to whether it would be ethical to brainwash people in their sleep even if it were for such a good cause.  However, that assumes that the study shows what it claims to show, and I'm not at all convinced of that.

Let's start with the size of the study.  The authors reported a total of 40 participants; the supplementary materials mention that quite a few others were excluded, mostly because they didn't enter the "right" phase of sleep, or they reported hearing the cueing sound.  That's just 20 participants in each condition (cued or uncued), which is less than half the number you need to have 80% power to detect that men weigh more than women.  In other words, the authors seem to have found a remarkably faint star with their very small telescope [PDF].
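To put a number on that claim: taking Cohen's d of roughly 0.59 for the male/female weight difference (the figure used in the "small telescopes" paper --- an assumption on my part), the standard normal-approximation formula gives the sample size needed per group for 80% power:

```python
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample t-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

# d ~ 0.59 for the male/female weight difference is taken from the
# "small telescopes" paper; treat it as an assumption here.
print(round(n_per_group(0.59)))  # roughly 45 per group -- Hu et al. had 20
```

(The exact t-based calculation comes out a participant or two higher, which only makes the point stronger.)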

The sample size problem gets worse when you examine the supplemental material and learn that the study was run with two samples; in the first, 21 participants survived the winnowing process, and then eight months later, 19 more were added.  This raises all sorts of questions.  First, there's a risk that something (even if it was apparently insignificant: the arrangement of the computers in the IAT test room, the audio equipment used to play the sounds to the participants, the haircut of the lab assistant) changed between the first and second rounds of testing.

More importantly, though, we need to know why the researchers apparently chose to double their sample size.  Could it be because they had results that were promising, but didn't attain statistical significance?  They didn't tell us, but it's interesting to note that in Figures S2 and S3 of the supplemental material, they pointed out that the patterns of results from both samples were similar (*).  That doesn't prove anything, but it suggests to me that they thought they had an interesting trend, and decided to see if it would hold with a fresh batch of participants.  The problem is, you can't just peek at your data, see whether the result is statistically significant, and if not, add a few more participants until it is.  That's double-dipping, and it's very bad indeed; at a minimum, your statistical significance needs to be adjusted, because you had more than one try to find a significant result.

Of course, we can't prove that the six authors of the article looked at their data; maybe they finished their work in July 2014, packed everything up, got on with their lives until February 2015, tested their new participants, and then opened the envelope with the results from the first sample.  Maybe.  (Or maybe the reviewers at Science suggested that the authors run some more participants, as a condition for publication.  Shame on them, if so; the authors had already peeked at their data, and statistical significance, or its absence, is one of those things that can't be unseen.)
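The cost of that kind of peeking is easy to demonstrate by simulation.  The sketch below (my own toy setup, loosely mirroring the 21-then-19 design) draws data where the true effect is exactly zero, tests after the first batch, and tops up the sample only when the first look wasn't significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, hits = 4000, 0
for _ in range(n_sims):
    # The null is true: both "conditions" come from the same distribution
    a, b = rng.normal(size=21), rng.normal(size=21)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1                                 # significant on the first look
        continue
    a = np.concatenate([a, rng.normal(size=19)])  # add more participants...
    b = np.concatenate([b, rng.normal(size=19)])
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1                                 # ...and look again
print(hits / n_sims)  # noticeably above the nominal 0.05
```

With two looks at the data, the false-positive rate comes out somewhere around 8%, not the advertised 5% --- and that's with only one extra peek.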

The gee-whiz bit of the article, which the cynic in me suspects was at least partly intended for rapid consumption by naive science journalists, is Figure 1, a reasonably-sized version of which is available here.  There are a few problems with the clarity of this Figure from the start; for example, the blue bars in 1B and 1F look like they're describing the same thing, but they're actually slightly different in height, and it turns out (when you read the labels!) that in 1B, the left and right sides represent gender and race bias, not (as in all the other charts) cued and uncued responses.  On the other hand, the green bars in 1E and 1F both represent the same thing (i.e., cued/uncued IAT results a week after the training), as do the red bars in 1D and 1E, but not 1B (i.e., pre-nap cued/uncued IAT results).

Apart from that possible labelling confusion, Figure 1B appears otherwise fairly uncontroversial, but it illustrates that the effect (or at least, the immediate effect) of anti-sexism training is, apparently, greater than that of anti-racism training.  If that's true, then it would have been interesting to see results split by training type in the subsequent analyses, but the authors didn't report this.  There are some charts in the supplemental material showing some rather ambiguous results, but no statistics are given. (A general deficiency of the article is that the authors did not provide a simple table of descriptive statistics; the only standard deviation reported anywhere is that of the age of the participants, and that's in the supplemental material.  Tables of descriptives seem to have fallen out of favour in the age of media-driven science, but --- or "because"? --- they often have a lot to tell us about a study.)

Of all the charts, Figure 1D perhaps looks the most convincing.  It shows that, after their nap, participants' IAT performance improved further (compared to their post-training but pre-sleep results) for the cued training, but not for the uncued training (e.g., if the sound associated with anti-sexism training had been played during their nap, they got better at being non-sexist but not at being non-racist).  However, if you look at the error bars on the two red (pre-nap) columns in Figure 1D, you will see that they don't overlap.  This means that, on average, participants who were exposed to the sound associated with anti-sexism were performing significantly worse on the sexism component of the IAT than the racism component, and vice versa.  In other words, there was more room for improvement on the cued task versus the uncued task, and that improvement duly took place.  This suggests to me that regression to the mean is one possible explanation here.  Also, the significant difference (non-overlapping error bars) between the two red bars means that the authors' random assignment of people to the two different cues (having the "anti-sexism" or "anti-racism" training sound played to them) did not work to eliminate potential bias.  That's another consequence of the small sample size.
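Regression to the mean falls straight out of measurement noise, as a quick simulation shows (purely illustrative numbers; any test-retest setting with imperfect reliability behaves this way):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_score = rng.normal(0, 1, size=n)
test1 = true_score + rng.normal(0, 1, size=n)  # noisy first measurement
test2 = true_score + rng.normal(0, 1, size=n)  # noisy retest, fresh noise

# Select the people who scored worst the first time
# (i.e., the group with more "room for improvement")
worst = test1 < -1
m1, m2 = test1[worst].mean(), test2[worst].mean()
print(round(m1, 2), round(m2, 2))  # the retest mean moves back toward zero
```

Nobody's true score changed between the two tests, yet the group selected for being extreme at time 1 "improves" substantially at time 2 --- which is exactly the pattern you'd worry about when the cued condition starts off significantly worse.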

Similar considerations apply to Figure 1E, which purports to show that cued "learning" persisted a week afterwards.  Most notable about 1E, however, is what it doesn't show.  Remember, 1D shows the IAT results before and after the nap.  1E uses data from a week after the training, but it doesn't compare the IAT results from a week later with the ones from just after the nap; instead, it compares them with the results from just before the nap.  Since the authors seem to have omitted to display in graphical form the most direct effect of the elapsed week, I've added it here.  (Note: the significance stars are my estimate.  I'm pretty sure the one star on the right is correct, as the error bars just fail to overlap; on the left, there should be at least two stars, but I'm going to allow myself a moment of hyperbole and show three.  In any case, as you'll see in the discussion of Figure 1F, this is all irrelevant anyway.)

So, this extra panel (Figure 1E½?) could have been written up something like this: "Cueing during sleep did not result in sustained counterbias reduction; indeed, the cued bias increased very substantially between postnap and delayed testing [t(37) = something, P = very small], whereas the increase in the uncued bias during the week after postnap testing was considerably smaller [t(37) = something, P = 0.045 or thereabouts]."  However, Hu et al. elected not to report this.  I'm sure they had a good reason for that.  Lack of space, probably.

Combining 1D and 1E, we get this chart (no significance stars this time).  My "regression to the mean" hypothesis seems to find some support here.

Figure 1F shows that Hu et al. have committed a common fallacy in comparing two conditions on the basis of one showing a statistically significant effect and the other not (in fact, they committed this fallacy several times in their article, in their explanation of almost every panel of Figure 1).  They claimed that 1F shows that the effect of cued (versus uncued) training persisted after a week, because the improvement in IAT scores over baseline for the cued training (first blue column versus first green column) was statistically significant, whereas the corresponding improvement for the uncued training (second blue column versus second green column) was not.  Yet, as Andrew Gelman has pointed out in several blog posts with similar titles over the past few years, the difference between “statistically significant” and “not statistically significant” is not in itself necessarily statistically significant.  (He even wrote an article [PDF] on this, with Hal Stern.)  The question of interest here is whether the IAT performance for the topics (sexism or racism) of cued and uncued training, which were indistinguishable at baseline (the two blue columns), was different at the end of the study (the two green columns).  And, as you can see, the error bars on the two green columns overlap substantially; there is no evidence of a difference between them.
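The Gelman and Stern point is worth making concrete.  Here's a toy calculation (illustrative numbers, not Hu et al.'s): one effect clears the p < .05 bar and the other doesn't, yet the difference between the two effects is nowhere near significant:

```python
import math
from scipy import stats

# Two estimated effects with identical standard errors (made-up numbers)
effect_a, se_a = 2.0, 1.0   # z = 2.0 -> p ~ 0.046: "significant"
effect_b, se_b = 1.0, 1.0   # z = 1.0 -> p ~ 0.317: "not significant"

# The correct question: is the *difference* between the effects significant?
z_diff = (effect_a - effect_b) / math.sqrt(se_a**2 + se_b**2)
p_diff = 2 * stats.norm.sf(abs(z_diff))
print(round(z_diff, 2), round(p_diff, 2))  # 0.71 0.48 -- no evidence of a difference
```

In other words, "A worked and B didn't" is not licensed by "A was significant and B wasn't"; you have to test A against B directly.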

One other point to end this rather long post.  Have a look at Figure 2 and the associated description.  Maybe I'm missing something, but it looks to me as if the authors are proudly announcing how they went on a fishing expedition:
Neurophysiological activity during sleep—such as sleep spindles, slow waves, and rapid-eye-movement (REM) duration—can predict later memory performance (17). Accordingly, we explored possible relations between cueing-specific bias reduction and measures of sleep physiology. We found that only SWS × REM sleep duration consistently predicted cueing-specific bias reduction at 1 week relative to baseline (Fig. 2) [r(38) = 0.450, P = 0.005] (25).
They don't tell us how many combinations of parameters they tried to come up with that lone significant result; nor, in the next couple of paragraphs, do they give us any theoretical justification other than handwaving why the product of SWS and REM sleep duration (whose units, the label on the horizontal axis of Figure 2 notwithstanding, are "square minutes", whatever that might mean) --- as opposed to the sum of these two numbers, or their difference, or their ratio, or any one of a dozen other combinations --- should be physiologically relevant.  Indeed, selecting the product has the unfortunate effect of making half of the results zero - I count 20 dots that aren't on the vertical axis, for 40 participants.  I'm going to guess that if you remove those zeroes (which surely cannot have any physiological meaning), the regression line is going to be a lot flatter than it is at present.

Bottom line: I have difficulty believing that there is anything to see here.  We can put off the debate about the ethics of subliminally improving people for a while, or at least rest assured that it's likely to remain an entirely theoretical problem.

(*) Incidentally, each red- or green-coloured column in one of the panes of Figure S3 corresponds to approximately five (5) participants.  You can't even detect that men are taller than women with that.

21 May 2015

What to do with people who commit scientific fraud?

Another story of apparent scientific fraud has hit the headlines.  I'm sure that most people who are reading this post will have seen that story and formed their own opinions on it.  It certainly doesn't look good.  And the airbrushing of history has already begun, as you can see by comparing the current state of this page on the website of the MidWest Political Science Association with how it looked back in March 2015 (search for "Fett" and look at the next couple of paragraphs).  Meanwhile, Michael LaCour hastily replaced his CV (which was dated 2015-02-09) with an older version (dated 2014-09-01) that omitted his impressive-looking list of funding sources (see here for the main difference between the two versions); at this writing (2015-05-22 10:37 UTC), his CV seems to be missing entirely from his site.

This rapidly- (aka "hastily-") written post is in response to some tweets calling for fraudsters to be banned from academia for life.  I have a few problems with that.

First, I'm not quite sure what banning someone would mean.  Are they to have "Do Not Hire In Any Academic Context" tattooed on their forehead?  In six languages?  Or should we have a central "Do Not Hire" repository, with DNA samples to prevent false identities (and fingerprints to prevent people impersonating their identical twin)?

Second, most fraudsters don't confess, nor are they subjected to any formal legal process (Diederik Stapel is a notable exception, having both confessed in a book [PDF] and been given a community service penalty, as well as what amounts to a 6-figure fine, by a court in the Netherlands).  As far as I can tell, these people tend to deny any involvement, get fired, disappear for a while, and then maybe turn up a few years later teaching mathematics at a private high school or something, once the publicity has died down and they've massaged their CVs sufficiently.  Should that be forbidden too?  How far do we let our dislike of people who have let us down extend to depriving them of any chance of earning a living in future?

After all, we rehabilitate people who kill other people; indeed, in some cases, we rehabilitate them as academics.  And as the case of Frank Abagnale shows, sometimes a fraudster can be very good at detecting fraud in others.  Perhaps we should give the few fraudsters who confess a shot at redemption.  Sure, we should treat their subsequent discoveries with skepticism, and we probably won't allow them to collect data unsupervised, but by simply casting them out, we miss an opportunity to learn, both about what drove (and enabled) them to do what they did, and how to prevent or mitigate future cases.  We study all kinds of unpleasant things, so why impose this blind spot on ourselves?

Let's face it, nobody likes being the victim of wrongdoing.  When I came downstairs a couple of years ago to find that my bicycle had been stolen from my yard overnight, the one time that I didn't lock it because it was raining so hard when I arrived home that I didn't want to stay out in the rain a second longer to do it, I was all in favour of the death penalty, or at the very least lifelong imprisonment with no possibility of parole, for bicycle thieves.  The inner reactionary in me had come out; I had become the conservative that apparently emerges whenever a liberal gets mugged.  Yet, we know from research (that we have to presume wasn't faked --- ha ha, just kidding!) that more severe punishments don't deter crime, and that what really makes a difference [PDF] is the perceived chance of being caught (and/or sentenced).  And here, academia does a really, really terrible job.

First, our publishing system is, to a first approximation, completely broken.  It rewards style over substance in a systematic way (and Open Access publishing, in and of itself, will not fix this).  As outside observers of any given article, we are fundamentally unable to distinguish between reviewers who insist on more rigour because our work needs more rigour, and those who have missed the point completely; anyone who has had an article rejected from a journal that has also recently published some piece of "obvious" garbage will know this feeling (especially if our article was critical of that same garbage, and seems to be being held to a totally different set of standards [PDF]).

Second, we --- society, the media, the general public, but also scientists among ourselves (I include myself in the set of "scientists" here mostly for syntactic convenience) --- lionize "brilliant" scientists when they discover something, even though that something --- if it's a true scientific discovery --- was surely just sitting there waiting to be discovered. (Maybe this confusion between scientists and inventors will get sorted out one day; I think it's a very fundamental problem. Perhaps we would be better off if Einstein hadn't been so photogenic.) And that's assuming that what the scientist has discovered is even, as the saying goes, "a thing", a truth; let's face it, in the social sciences, there are very few truths, only some trends, and very little from which one can make valid predictions about people with any worthwhile degree of reliability. (An otherwise totally irrelevant aside to illustrate this gap: one of the most insanely cool things I know of from "hard" science is that GPS uses both special and general relativity to make corrections to its timing, and those corrections go in opposite directions.) We elevate the people who make these "amazing discoveries" to superstar status. They get to fly business class to conferences and charge substantial fees to deliver a keynote speech in which they present their probably unreplicable findings.  They go on national TV and tell us how their massive effect sizes mean that we can change the world for $29.99.

Thus, we have a system that is almost perfectly set up to reward people who tell the world what it wants to hear.  Given those circumstances, perhaps the surprising thing is that we don't find out about more fraud.  We can't tell with any objectivity how much cheating goes on, but judging by what people are prepared to report about their own and (especially) their colleagues' behaviour, what gets discovered is probably only the tip of a very large and dense iceberg. It turns out that there are an awful lot of very hungry dogs eating a lot of homework.

I'm not going to claim that I have a solution, because I haven't done any research on this (another amusing point about reactions to the LaCour case is how little they have been based on data and how much they have depended on visceral reactions; much of this post also falls into that category, of course).  But I have two ideas.  First, we should work towards 100% publication of datasets, along with the article, first time, every time.  No excuses, and no need to ask the original authors for permission, either to look at the data or to do anything else with them; as the originators of the data, you'll get an acknowledgement in my subsequent article, and that's all.  Second, reviewers and editors should exercise extreme caution when presented with large effect sizes for social or personal phenomena that have not already been predicted by Shakespeare or Plato.  As far as most social science research is concerned, those guys already have the important things pretty well covered.

(Updated 2015-05-22 to incorporate the details of LaCour's CV updates.)

09 May 2015

Real-time emotion tracking by webcam

The European Commission is giving financial backing to a company that claims its technology can read your emotional state by just having you look into a webcam.  There is some sceptical reporting of this story here.

"Realeyes is a London based start-up company that tracks people's facial reactions through webcams and smartphones in order to analyse their emotions. ...
Realeyes has just received a 3,6 million euro funding from the European Commission to further develop emotion measurement technology. ...

The technology is based on six basic emotional states that, according to the research of Dr Paul Ekman, a research psychologist, are universal across cultures, ages and geographic locations. The automated facial coding platform records and then analyses these universal emotions: happiness, surprise, fear, sadness, disgust and confusion. ...
 [T]his technological development could be a very powerful tool not only for advertising agencies, but as well for improving classroom learning, increasing drivers’ safety, or to be used as a type of lie detector test by the police."

Of course, this is utterly stupid.  For one thing, it treats emotions as if they are real tangible things that everyone agrees upon, whereas emotions research is a messy field full of competing theories and models.  I don't know what Ekman's research says, or what predictions it makes, but if it really suggests that one can reduce everything about what a person is feeling at any given moment to one of six (or nine, or twelve) choices on a scale, then I don't think I live in that world (and I certainly don't want to). For another, without some form of baseline record of a person's face, it's going to be close to impossible to tell what distortions are being heaped on top of that by emotions.  Think of people you know whose "neutral" expression is basically a smile, and others who walk round with a permanent scowl on their faces.

Now, I don't really care much if this kind of thing is sold to gullible "brand-led" companies who are told that it will help them sell more upmarket branded crap to people.  If those companies want to waste their marketing and advertising dollars, they're welcome.  (After all, many of them are currently spraying those same dollars more or less uselessly in advertising on Twitter and Facebook.)  But I do care when public money is involved, or public policy is likely to be influenced.

Actually, it seems to me that the major problem here is not, as some seem to think, the "big brother" implications of technology actually telling purveyors of high-end perfumes or watches, or the authorities, how we're really feeling, although of course that would be intensely problematic in its own right.  A far bigger problem is how to deal with all of the false positives, because this stuff just won't work - whatever "work" might even mean in this context.  At least if a "traditional" (i.e., post-2011 or so) camera wrongly claims to have located you in a given place at a given time, it's plausible that you might be able to produce an alibi (for example, another facial recognition camera placing you in another city at exactly the same time, ha ha).  But when an "Emocam" says that you're looking fearful as you, say, enter the airport terminal, and therefore you must be planning to blow yourself up, there is literally nothing you can do to prove the contrary.  Dr. Ekman's "perfect" research, combined with XYZ defence contractor's "infallible" software, has spoken.
  • You are fearful.  What are you about to do?  Maybe we'd better shoot you before you deploy that suicide vest.
  • The computer says you are disgusted.  I am a member of a different ethnic group.  Are you disgusted at me?  Are you some kind of racist?
  • Welcome to this job interview.  Hmm, the computer says you are confused.  We don't want confused people working for us.
So now we're all going to have to learn another new skill: faking our emotions so as to fool the computer.  Not because we want to be deceptive, but because it will be messing with our lives on the basis of mistakes that, almost by definition, nobody is capable of correcting.  ("Well, Mr. Brown, you may be feeling happy now, but seventeen minutes ago, you were definitely surprised. We've had this computer here for three years now, and I've never seen it make a wrong judgement.")  I suspect that this is going to be possible although moderately difficult, which will just give an advantage to the truly determined (such as the kind of people that the police might be hoping to catch with their new "type of lie detector").

In a previous life, but still on this blog, I was a "computer guy".  In a blog post from that previous life, I recommended the remarkable book, "Digital Woes: Why We Should Not Depend on Software" by Lauren Ruth Wiener.  Everything that is wrong with this "emotion tracking" project is covered in that book, despite its publication date of 1993 and the fact that, as far as I have been able to determine, the word "Internet" doesn't appear anywhere in it.  I strongly recommend it to anyone who is concerned about the degree to which not only politicians, but also other decision-makers including those in private-sector organisations, so readily fall prey to the "Shiny infallible machine" narrative of the peddlers of imperfect technology.

01 May 2015

Violence against women: Another correlate of national happiness?

Introductory disclaimer: This blog post is intended to be about the selective interpretation of statistics. Many of the figures under discussion are about reported rates of violence against women, and any criticisms or suggestions regarding research in this field are solely in reference to research methods. Nothing in this commentary is in any way doubting the very real experiences of women facing violence and abuse, nor placing responsibility for the correct reporting of abuse on the women experiencing it. Violence against women and girls (VAWG) is an extremely serious issue, which is exactly why it deserves the most robust research methods in order to bring it to light.

Back in February 2014, I wrote a post in which I noted the seemingly high correlation between “national happiness” ratings for certain countries and per-capita consumption of antidepressants in those countries. Now I’ve found what I think is an even better example of the limitations of ranking countries based on some simplified metric. I’ve asked my friend Clare Elcombe Webber, a commissioner for VAWG services, to help me here. So from this point on, we’re writing in the plural...

A few months ago, this tweet from Joe Hancock (@jahoseph) appeared in Nick’s feed. It shows, for 28 EU countries, the percentage of women who report having been a victim of (sexual or other) violence since the age of 15. Guess which country tops this list? Yep, Denmark. Followed by Finland, Sweden, and the Netherlands. Remember them? The countries that are up there in the top 5 or 10 of almost every happiness survey ever performed? Down near the bottom: miserable old Portugal, ranked #22 out of 23 in happiness in the post linked to above. (The various lists of countries don’t match exactly between this blog post and the one linked to above because there are different membership criteria, with some reports coming from the OECD, EU, or UN. Portugal was kept off the bottom of the happiness list in the post about antidepressants by South Korea.)

This warranted some more investigating, along the lines of Nick’s previous exploration of the link between happiness and antidepressants. The original survey data page is here; click on “EU map” and use the dropdown list to choose the numbers you want. Joe’s tweet is based on the first drop-down option, “Physical and/or sexual violence by a partner or a non-partner since the age of 15”. While performing the tests that we describe later in this post, we also tried the next option, “Physical and/or sexual violence by a partner [i.e., not a non-partner] since the age of 15”, but this didn’t greatly change the results. In what follows, unless otherwise stated, we have used the numbers for VAWG perpetrated by both partners and non-partners.

First, Nick took his existing dataset with 23 countries for which the OECD supplied the antidepressant consumption numbers, and stripped it down to those 17 which are also EU members. Then, he ran the same Spearman correlations as before, looking for the correlations between UN World Happiness Index ranking and: /a/ antidepressant consumption (Nick did this last time, but the numbers will be slightly different with this new subset of 17 countries); /b/ violence reported by women. Here are the results, which at first sight are rather disturbing:
  • Antidepressant consumption correlated (Spearman’s rho) .572 (p = .016) with national happiness.
  • Violence against women correlated (Spearman’s rho) .831 (p < .0001) with national happiness.
Let’s repeat that: Among the 17 EU countries in our sample, the degree of violence since age 15 reported by women is very strongly correlated with national happiness survey outcomes. When things turn out to be correlated at .831, you generally start looking for reasons why you aren’t in fact measuring the same thing twice without knowing it.
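For anyone wanting to reproduce this kind of analysis, the rank correlation is a one-liner with SciPy.  The numbers below are invented placeholder rankings for five hypothetical countries, not the actual OECD/UN/FRA figures:

```python
from scipy import stats

# Invented example: two rankings of five hypothetical countries
happiness_rank = [1, 2, 3, 4, 5]
violence_rank  = [2, 1, 3, 5, 4]

rho, p = stats.spearmanr(happiness_rank, violence_rank)
print(round(rho, 2))  # 0.8 -- though with only 5 points, p is still large
```

Spearman's rho is just Pearson's r computed on the ranks, which is why it is the natural choice when the happiness data arrive as a league table rather than as scores.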

Trying to look for some way of mitigating these figures, Nick tried another approach, this time with parametric statistics. He took the percentage of women reporting being the victims of violence in all 28 EU countries, and compared it with the points score (out of 10) from the UN Happiness Survey. Here is the least pessimistic result obtained from the various combinations:
  • Across all 28 EU countries, violence against women correlated (Pearson’s r) .497 (p=.007) with national happiness.
This is still not very good news. If you’re hoping to show that two phenomena in the social sciences are correlated, and you find a correlation of .497, you’re generally pretty pleased.

Of course, correlation is not the same as causation. Probably nobody would suggest that higher levels of violence against women make for a happier society, or that higher levels of general societal happiness cause people to become more violent towards women.

So what is going on here? Maybe the methods are seriously flawed. We might have difficulty imagining why Austrian women would report rates of interpersonal violence barely half those experienced by Luxembourgers, or that Scandinavians are assaulting women at over twice the rate of Poles, or that the domestic violence problem in the UK is 70% worse than in next-door Ireland.

But perhaps there are some other factors that might help to explain these numbers. Remember, these are answers being given to an interviewer from the EU Fundamental Rights Agency (FRA); they are not extracted from, say, police databases of complaints filed. Thus, while we can perhaps assume that the reports ought not to be affected too much by the perceived level of danger or social shame involved in revealing one’s situation to the authorities (it’s easy to imagine that people in countries with high levels of equality and openness—Denmark, say—might feel more able to file charges about violence than in some other countries that are perceived as being more “macho”), the degree to which these data reflect reality will depend to a large extent on people’s degree of willingness to admit being a victim to a stranger. While one would hope that the FRA had thought about that and done the maximum in terms of study and questionnaire design, training of interviewers, etc., to allow women to be frank about their experiences, this isn’t something we were able to find definitively in their reported methodology (available here).

There are huge issues, which have dogged this type of research for many decades, when it comes to asking women to disclose their experiences of abuse. The conventional wisdom amongst researchers and service providers is that victims of abuse are extremely unlikely to reveal their experiences to anyone, and short of the FRA interviewers spending months building rapport with each respondent (which, obviously, they did not do) there is little to be done to mitigate this. Here are just some possible reasons why experiences of abuse might not have been disclosed to researchers, and how this could impact on the results:

·      The sampling method involved visiting randomly selected addresses. A common tactic used by abusive partners is to isolate their victim, primarily as a way of stopping any disclosure or attempt to seek support; so it is not unlikely that women currently in abusive relationships were “not allowed” to take part in the research at all. (If we wish to make great leaps of logic here, we could theorise that this could lead to a higher apparent incidence of VAWG in countries with better support services, as women in those countries were more likely to have been able to leave an abusive situation, and therefore were more able to take part in the research. But we don’t have data for that…)

·      Many women do not identify their experiences as violent or abusive, even when most external observers would say that they plainly are. This may be a defence mechanism, allowing them to avoid having to face up to the truth about their partner, the fragility of their personal safety, or the frightening nature of the world. Admitting that they are the victims of violence or abuse would also imply that they may have to act to change their situation. Therefore, respondents could simply be lying; and, even if a measure of social desirability might be able to detect this (possibly a tall order for such a serious subject), it’s unlikely that the interviewer would administer such a measure. Alternatively, the degree to which women deny that their experiences are violent or abusive might have a substantial cultural component; perhaps women in more “traditional” countries are more likely to justify some behaviours towards them as “normal”.

·      It is not clear, from the methodological background of the report, how issues of confidentiality were explained to respondents. We can reasonably conjecture that if a respondent disclosed that they were currently at serious risk from someone, the interviewer would have been ethically obliged to do something additional with this information. Many abusers make threats of violence or serious reprisals should their victim make a disclosure (something borne out by the fact that the majority of serious injuries or murders of women by men they know occur at or shortly after the point of separation or disclosure of the abuse to a third party), and this would significantly affect whether or not a woman would answer these questions truthfully. In addition, perceived fear of the authorities may discourage a woman from disclosing; in many countries, the police and social workers do not have a glowing reputation for providing support, and women may feel that involving them would exacerbate their problems, rather than help to resolve them.

·      Finally, victims who have disclosed their abuse often talk of their feelings of guilt, or that they are to blame for abuse. This shame could be an additional barrier to giving a truthful answer.

We can make some—admittedly sweeping—inferences from the fact that the data do not tell us what we would intuitively expect. We could speculate that those countries we might expect to be more socially “advanced” in terms of attitudes to violence against women could have higher rates of disclosures of abuse in this research because women in those countries feel more able to recognise and name their experiences, or feel more confidence in the authorities being supportive, or have greater trust in the confidentiality of the survey; and therefore are more prepared to report having been the victims of violence. A further conjecture could be that in these countries, women are socially “trained” that these experiences are neither normal nor acceptable, and that victims of violence are entitled to be heard, without being stigmatised. (However, a skeptic might respond that, while these assumptions enable us to put a positive spin on this slightly unusual dataset, they are still only assumptions for which we have little evidence, and do little to address the initial observation, namely that the countries in the EU deemed to be happiest also reported the highest levels of violence against women.)

We could add all sorts of social variables into the mix here: availability of relationship education, social stigma towards single mothers, the perception of the state as supportive (or not), and so on. Violence against women and girls is a melting pot of individual, social, and cultural factors, and to date researchers have not been able to neatly set out what it is that makes some men decide to be abusive towards women, nor what makes some communities turn a blind eye to such abuse or even place the blame on the women being abused. Respondents potentially have many more reasons to conceal their experiences of violence and abuse than they might in other research areas, and there is no straightforward way of controlling for these.
(Psychologists have devised various ways of controlling for social desirability biases, but it is not clear to us that these take sufficient account of cross-cultural factors; see Saunders, 1991.)

However, let’s assume for a moment that the numbers in the report are not directly reflective of the underlying problem, but instead represent the actual prevalence multiplied by a “willingness to acknowledge” factor. At a certain point, this could mean that the survey shows higher numbers for countries where there is actually less of a problem. For example, let’s say that the true rate of violence against women in Denmark is 60%, and that 87% of Danish women are prepared to discuss their experiences of violence openly; multiply those together, and there’s the 52% reported rate from the EU survey. Meanwhile, perhaps the true rate in Poland is 76% (note: we have no evidence for this; we are choosing Poland here only because it is the country at the bottom end of the FRA’s list), but only 25% of Polish women are prepared to discuss it; again, multiply those numbers together and you get the reported rate of 19%. In fact, this line of reasoning is commonly used by people working on the front line of VAWG support. For example, in one London borough, reports to the police of domestic abuse in 2014 were over 40% higher than in 2013, and this is considered to be a good thing; it’s assumed that the majority of domestic abuse goes unreported, and thus additional reports are just that: additional reports, rather than additional instances. But without more data from other sources and approaches, we just don’t (and can’t) know.
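The arithmetic of that “prevalence × willingness” argument can be made explicit.  In the sketch below (plain Python; the “true rates” and “willingness” figures are the invented ones from the paragraph above, not real data), the Danish and Polish inputs reproduce the FRA’s reported rates:

```python
def reported_rate(true_rate, willingness):
    """Survey-reported prevalence, modelled as the true prevalence
    scaled by the fraction of women willing to disclose to a stranger."""
    return true_rate * willingness

# Invented inputs from the text: Denmark, true rate 60%, 87% willing to
# disclose; Poland, true rate 76%, 25% willing to disclose.
denmark = reported_rate(0.60, 0.87)
poland = reported_rate(0.76, 0.25)
print(round(denmark, 2), round(poland, 2))  # 0.52 0.19, matching the FRA figures
```

The point of the toy model is simply that the country with the *higher* true rate (Poland, 76%) ends up with the *lower* reported rate (19%), because the willingness factor dominates; nothing here tells us what the actual willingness factors are.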

Here’s the kicker, though: if you choose to take the line that these figures “can’t possibly be right”, and that in fact they may even show the opposite of the real problem, that raises the question of why it’s OK to look for an alternative explanation for the figures on violence (or other social issues, such as, perhaps, antidepressant usage), but not for those on other phenomena, such as (self-reported) happiness. What gives data on happiness the kind of objective quality that legitimises all the column inches, the TV airtime of happiness gurus, and the government policy initiatives to try to boost their country’s rank from 18 to 10 in the UN World Happiness Index, if you’re simultaneously prepared to look very hard for reasons to explain away numbers that appear to show that your favourite “happy” country is a hotbed of violence against women?

And, even more importantly: whatever your position, do you have evidence for it?

You can find the dataset for this post here. (Yes, the filename does give away how long we have been working on this post!) It also includes all the data you need to re-examine the post about antidepressants from February 2014.

Saunders, D. (1991). Procedures for adjusting self-reports of violence for social desirability bias. Journal of Interpersonal Violence, 6(3), 336–344. https://doi.org/10.1177/088626091006003006 (Full text available here.)

29 March 2015

Open Access journals: what's not to like? This, maybe...

When I used to work in an office, my boss used to say that any time he had a good idea, he could come to me and in ten minutes he'd know everything that might go wrong with it.  He often went ahead anyway, and his ideas often worked, but at least he was forewarned.  So in that spirit, here goes.

I'm a little concerned by all the hype around Open Access (OA) journals.

Yes, I know that traditional journal publishers are evil, and make more money and higher gross margins and have bigger car parking spaces than Apple, and I agree that when taxpayers fund research then taxpayers should have access to it.  However, I'm not sure that, just because all of the above may be true, our current model of OA journals is necessarily the solution.  I have a number of concerns about what may happen as the OA model takes hold and, as everybody tells me is going to happen, becomes dominant.  This post is intended to start a discussion on those concerns, if anyone's interested.

1. It's the economy, stupid

One of the strengths of the traditional publishing model is that, to a first approximation and allowing for all kinds of special circumstances, the editor-in-chief of a half-decent journal doesn't have to worry about filling it.  Indeed, many journals proudly promote their rejection rate on their home page, next to their average turnaround time.  "We reject 90% of submissions; don't waste our time unless you've got a good story to tell", is the message (with, of course, all of the predictable effects on publication bias that this implies).  Doubtless the editor-in-chief has some financial targets to meet, perhaps in terms of not blowing the production budget on full-page colour pictures of kittens, but this is not a job whose holder is principally tasked with revenue generation.  The money is coming in pretty steadily from sales of packages of journals to institutions around the world (even if some of these institutions are starting to take exception).  The editor gets to concentrate on, among other things, maintaining the journal's impact factor --- hopefully using methods that are a little less blatant than this.

With OA journals, funded principally by article processing charges paid by authors, things are likely to be a little different.  No matter how dedicated to academic integrity and the highest possible scientific standards the editorial staff want to be, money is right there in the equation every day, especially for an online journal with almost no physical limits to its size.  How many articles can we get through the review process this month?  Can we upsell the author to the full-colour package?  Why do so many people want hardship waivers?  (Oh, and I have yet to see any suggestion that OA journals will be less concerned about their impact factor than traditional journals.)

The idea, of course, is that authors will not be reaching into their own pockets to pay the article processing charges.  The intention is that the fee of $1,000 or so to publish the results should be budgeted for out of the project's funding.  After all, it's only a waffer-thin thousand bucks, the kind of money some projects probably have slopping around at the end anyway if the participants didn't eat all of the M&Ms.  But once funding agencies catch on, will they allow grant proposals to include specific line items for OA publishing, when publication in one of the high-prestige traditional journals --- which you promised them, earlier in the proposal, were definitely going to be interested in your groundbreaking project --- is free?  And what about the independent researcher with no budget, who may have something interesting to say, but no money?  Should such a person have to fund publication from their own pocket?

I'm afraid that the money always, always finds a way to affect things.  Someone, somewhere in the process, will be directly incentivised to increase revenue.  (In France, where I live, gambling is a state monopoly, which means that whatever arms-length construction they have put together, somewhere there is someone who essentially works for the government and yet has a performance target to sell more scratchcards to the urban poor, even though gambling is officially a social problem.)  How does this affect you as the editor-in-chief of an OA journal?  Maybe you ask your action editors to tell reviewers to be less picky about certain things.  Maybe you suggest to an author that splitting these results into two articles will be to everyone's advantage - after all, the publication fee is coming out of the grant money, and as it stands it is a pretty long paper manuscript for someone to have to wade through at one sitting...

The corollary of this is that the PI presenting an article for publication is a paying customer.  Now when I go to make a $1,000 purchase, I'm generally greeted with open arms.  I certainly don't expect to have to pass quality control checks before I'm allowed to spend my $1,000.  The psychology of the OA model is going to be interesting indeed.  (Compare what happened in the UK when public universities started to charge tuition fees; all of a sudden, the idea of a student being given a failing grade became, for many people, a consumer protection issue.  "I paid to come here and get a degree, how dare you tell me I can't have one?", ran the argument.  Too many unhappy punters, and the Vice-Chancellor is touring the stricter departments to ask them to be a little more, um, flexible in their marking criteria.)

I found a pertinent example shortly before putting this post (which has taken a while to draft) online.  Here is a note from Nandita Quaderi, who is "Publishing Director, Open Research" at Scientific Reports, which is part of Nature Publishing Group.  Nandita is pleased to announce that henceforth, "a selection of authors submitting a biology manuscript to Scientific Reports will be able to opt-in to a fast-track peer-review service".  Needless to say, this service comes "at an additional cost", being provided by a for-profit organisation called Research Square.   (An editor of Scientific Reports has resigned over this.)  So now, I'm paying to publish, and I'm paying to have my article reviewed.  What could possibly go wrong with the objectivity and rigour of the scientific process?

2. Access is not the biggest problem science faces right now

Another issue is that most OA journals do not address the ongoing problems of the peer review system.  I would argue that currently, failures of peer review are a bigger threat to science than paywalls.  If reviewers are allowing bad science through --- or erroneously recommending rejection of good articles --- then getting free access to the resulting error-filled literature is the least of our problems; and I have yet to see a coherent argument why the OA review process might be inherently any more rigorous than that at traditional journals.

Some online journals, such as The Winnower, have adopted a radical solution to this: anyone can publish an article, without any prior review process, with the idea that people will come along and review it afterwards.  This seems attractive at first sight, except that people typically have even less incentive to act as a reviewer once the article is "out there", even if it doesn't yet have the status of a citable article with a DOI (a status which, incidentally, the article's own author decides to award it, at a time of his or her own choosing).

It seems to me that OA journals are to some extent hitching a ride on the back of the traditional journals, which have created (and still sustain) the fundamental mode of operation that we know and love/hate: author sends in MS, editor checks it, editor selects reviewers, reviewers approve or request changes, editor finally accepts or rejects.  This system more or less works --- give or take the criticisms of peer review as "broken", which have a lot of merit but which, as I noted above, it seems to me that OA (in and of itself) doesn't do much to address --- because people generally have confidence in it.  Not necessarily absolute confidence, but we know how it's meant to work and how to spot when it isn't working.  We (like to) believe that the editors do not generally accept (too many) articles from themselves and their buddies (or at least, that they risk getting called out for it if they do), that they select reviewers who are competent in the relevant subfields, that the reviewers do an honest and unbiased job, etc.  (Of course, the reviewer who is doing "excellent quality control" with *your* article is an incompetent idiot who has failed to understand even the most basic concepts of *my* article, but that's part of the game.)

So, when something like Collabra, the new OA mega-journal from the University of California, launches, they can put pictures of respected people on the front page where they introduce their editorial board, thus sending a message that the review process will be every bit as rigorous as it is for a traditional journal.  Readers are reassured, and authors know they will need to submit work of a high standard.  But to me this only works because the majority of people who are being held up as examples of the quality of the journal have good reputations, which have been made within the traditional process.  How does this scale?  What does the publication process look like in 10 or 20 years' time, if the traditional journals have mostly gone and we make our reputations with OA (web-)publishing, blogs, and social media presence?  (Yes, impact factor is broken. But where is the dominant, credible alternative that everyone will be prepared to switch to?)

This doesn't mean that Collabra will be full of articles promoting homeopathy after a few months.  But over time, the relationship between authors, reviewers, and journals will change, in ways that we can't necessarily predict.  That doesn't mean the sky will fall, but it does mean that there will be perverse situations that may or may not be worse than what we have to put up with now.

3. Ham, spam, and all points in between

I also worry that the line between "legitimate" and "spam" OA journals will start to blur.  Currently we can all point and laugh at the semi-literate invitations to publish in (or join the Editorial Board of) those pseudo-journals with plausible-sounding names, strange salutation styles in their e-mails, and an editorial address in a Regus suite in San Antonio, from which manuscripts are presumably forwarded to the journal's real staff in Cairo or Mumbai.  But these fraudulent (whatever that means...) journals will improve, and it will become hard to tell the "fake" from the "real".

A few weeks ago, I was asked to review an article by an OA journal that was part of a London-based publishing outfit.  I honestly couldn't decide whether they were spammers or genuine: the journals mentioned on their web site all seem to exist, and about a third of them are indexed in PubMed.  How good or bad is that?  I recommended rejection, as the article would have been of little interest to the readers of the journal, according to its own profile.  I wonder what the lead author did next (assuming that my recommendation to reject was the editor's verdict as well)?  Did he appeal, as a "paying customer", to the editor in chief?  Or did he maybe send the article to another OA journal, on the basis that he will eventually find somebody, somewhere, who wants $1,000? (*)

I think, though, that perhaps the bigger risk in the meeting of "legitimate" and "spam" journals is through the trimming of standards at the "legitimate" end. Look at what happened when the Saudis decided to throw some money at education, and suddenly King Abdulaziz University is ranked #7 in the world in mathematics.  Uh-huh.  Sure.  So what happens when that university, or others with rather more money to burn than academic integrity, starts its own OA mega-journal?  Exactly what will be the conditions of scientific neutrality under which the editor-in-chief reviews articles by, say, the children of minor members of the Saudi ruling family?  Perhaps someone will create an authoritative clearing house to administer a sliding scale of which journals are "real" versus "spam".  But who would run such an organisation?  The AAAS?  ISO?  Standard & Poor's?  Google?  And who would ultimately be responsible for the "legit"/"spam" decisions?

Historically, publisher-led journals seem to have been mostly spam-free; it would be interesting to establish why this was. High barrier to entry in the world of ink and paper?  Old-fashioned academic and intellectual integrity, despite the profits?  Risk of reputational damage if, say, Springer (cough) or Sage (cough) were to acquire a reputation for publishing garbage?  I don't know what the reasons are, but it created the current situation whereby --- whatever the other problems in the system --- a journal that exists in a print edition is generally regarded, at least by default, as having some degree of seriousness.  I worry that we will end up in a situation where we don't have a simple way to tell whether we can take a "journal" (in the widest possible sense) seriously or not.  In such situations, humans tend to apply some simple heuristics, which scammers have many centuries' worth of experience exploiting.

4. A modest (and, as yet, barely sketched out) proposal

Do I have an alternative?  Well, when my boss came to me with his ideas, I usually didn't, but in this case I do have a tentative suggestion.  What if the funding agencies ran a few journals?  After all, these are (generally) the representatives of the taxpayers, who --- as the Open Access movement is right to point out --- pay for the research and ought to have free access to the results.  Yet currently, they rely on "the system" to work, and for researchers to muddle their way through that system.  In the traditional model, the readers pay, and in the OA model, the authors pay.  Both systems have their deficiencies.  Supposing we had a parallel model where nobody paid (except a general fund, set up to guarantee neutrality)?

Those of a libertarian bent might argue that the government shouldn't be involved in academic publishing, but the stable door closed on that when we started to take their money to do the research.  Some might also argue that a funding agency-sponsored journal might be highly politicised, but then, /a/ why should it be more politicised than the handing out of the money, /b/ the Rind/Lilienfeld saga showed that politicians can pressure "independent" journal publishers into submission too, and /c/ there will always be other outlets; I'm just modestly proposing a "third way".  (As a bonus, this would seem to be a good fit with the aims of the pre-registration movement.)


1. I'm aware that this is a rather long and at times rambling post.  It started life in a frenzied evening of writing just after I got out of hospital after a stay that lasted the better part of three weeks, and that still shows.  I should probably have scrapped it and started again, or at least sat down and rearranged the paragraphs, but I wanted to get the ideas out there within a reasonable time frame.  I hope some of them are useful.

2. I want to thank Rolf Zwaan for some helpful discussions on an earlier draft of this post.  Rolf disagreed with much of what I had written, and I've only made a few changes, so he probably still disagrees with a lot of it.  I should point out that my use of the example of Collabra (for whom Rolf is an editor) above is not based on any specific criticism of that journal, but merely as a salient example; Rolf's tweet about his appointment as an editor at Collabra was the spark for my writing of this post.

(*) Update 2016-11-28: I was re-reading this post because reasons, and I noticed this dangling question.  I googled the title of the article... sure enough, it was accepted, despite my recommendation to reject.