28 November 2017

Some problems in a field study of sexual attraction and hitchhiking

This is the first in a series of blog posts by me and James Heathers on the research of Dr. Nicolas Guéguen, of the Université Bretagne-Sud in France. We will be examining one of Dr. Guéguen's studies in each post. Cathleen O'Grady at Ars Technica has written an excellent overview of the situation.

We start with this article:
Guéguen, N. (2012). Color and women hitchhikers’ attractiveness: Gentlemen drivers prefer red. Color Research & Application, 37, 76–78. http://dx.doi.org/10.1002/col.20651

Brief summary of the article
The author's hypothesis was that male (but not female) drivers would be more likely to offer a lift to a female hitchhiker if she was wearing a red (versus any other colour) t-shirt. The independent variables were the colour of the woman's t-shirt and the sex of the driver, and the dependent variable was the driver's decision to stop or not.

Participants were drivers on a road in Brittany (western France). A female confederate wearing one of six colours of t-shirt (black, white, red, blue, green, or yellow) stood by the side of the road posing as a hitchhiker (with a male confederate stationed out of sight nearby for security purposes). She noted whether each driver who stopped to offer her a lift was male or female. In order to establish the number of drivers of each sex who drove along the road (whether or not they stopped), two other confederates were stationed 500 metres away in a car by the side of the road, facing the approaching traffic. As each car passed, they noted whether the driver was a man or a woman. Using the count of male vs female drivers who stopped, and the count of male vs female drivers who passed, it was found that male drivers were considerably more likely to stop when the hitchhiker was wearing a red t-shirt compared to any other colour.

There are some puzzling aspects to this article.

1. Number of volunteers
The article states (p. 77) that the five female confederates were chosen by a group of men who
rated the facial attractiveness of “18 female volunteers with the same height, with the same breast size (95 cm of bust measurement and bra with a ‘B’ size cup), and same hair color”.  It is interesting to think about how many women there must be in the volunteer pool at the Université Bretagne-Sud in order for 18 with the same height, bra size, and hair colour to put themselves forward to stand for hours on end (see point 4, below) to stop passing drivers.  Once the attractiveness of the participants had been established, the five who were rated closest to the middle of the scale were chosen, and "precautions were taken to verify that the rates of attractiveness were not statistically different between the confederates", whatever that means.  Oh, and "All of the women stated that they were heterosexuals"—presumably to ensure that they gave off the right vibes through the windscreens of approaching cars.

2. Two different sample sizes
There is a curious inconsistency between Table 1 and the main text.  In Table 1, the numbers of male and female drivers are listed as 3,474 and 1,776, respectively.  However, these two numbers sum to 5,250, rather than 4,800 (which was the sample size reported elsewhere in the article, with 3,024 male and 1,776 female drivers).  It is not clear how such an error might creep in by accident, since it requires two very different digits (4 instead of 0 and 7 instead of 2) to be mistyped.

3. The colours of the t-shirts
The article reports the colours of the T-shirts worn by the hitchhikers in very precise terms, even going so far as to give their HSL (Hue, Saturation, Luminance) values.  However, in several cases, those values do not correspond to the reported colours.  As the table here shows, the colour described by the HSL values corresponding to “red” is probably best described as a salmon-pink colour, while “yellow” is a very pale pink, and “blue” is pure white.

[Table: stated t-shirt colours alongside the hex equivalents of the reported HSL values]

It would be interesting to learn how these HSL numbers were obtained, since several of them are so badly wrong. Indeed, it is not clear why it was considered necessary to report the colours with such precision; it would surely have been enough to state that bright, unambiguous examples of each colour had been selected. For that matter, given how long it must have taken to test so many drivers (see the next point), and that the author had a clear hypothesis about the effects of the colour red, it is not clear why so many different colours of t-shirt were tested.
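Checks of this kind are easy to automate. Here is a minimal Python sketch for converting an HSL triple to its hex equivalent; the triples below are purely illustrative (a textbook red, and a salmon-pink of the general kind described above), not the values reported in the article:

```python
import colorsys

def hsl_to_hex(h_deg, s_pct, l_pct):
    # Note that colorsys expects hue, lightness, saturation, each in
    # [0, 1], and in the order H, L, S (not H, S, L).
    r, g, b = colorsys.hls_to_rgb(h_deg / 360.0, l_pct / 100.0, s_pct / 100.0)
    return "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))

print(hsl_to_hex(0, 100, 50))  # #FF0000 -- an unambiguous red
print(hsl_to_hex(6, 93, 71))   # a pale salmon-pink, nothing like red
```

Running the reported HSL triples through a converter like this is all it would take to notice that several of them cannot describe the colours named in the article.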

4. How long did all this take?
The article states (p. 77) that “Each hitchhiker was instructed to test 960 drivers. After the passage of 240 drivers, the confederate stopped and was replaced by another confederate.” No indication is given of how long it took for 240 drivers to pass. However, the article also tells us (p. 77) that the research was conducted "at the entry of a famous peninsula of Brittany in France". So perhaps we can get a clue from another Guéguen article:

Guéguen, N. (2007b). Bust size and hitchhiking: A field study. Perceptual and Motor Skills, 105, 1294–1298. http://dx.doi.org/10.2466/pms.105.4.1294-1298

Yes, you read that right. Dr. Guéguen did indeed conduct a study to see whether women with larger breasts get more offers of lifts from men. (I bet you can't guess what the result was.) Anyway, that study (which had a very similar procedure to the one we're discussing here, except that the, er, manipulated (!) independent variable was the apparent size of the female hitchhiker's breasts, rather than the colour of her t-shirt) was conducted "at the entry of a famous peninsula ('Presqu'Île de Rhuys') of Brittany in France". Assuming that the “famous peninsula” mentioned in both studies is the same place (which would make sense, if that location is indeed particularly propitious for lone female hitchhikers), and assuming similar traffic flows to those reported in the "bust size" study, in which the passage of 100 cars took “about 40 to 50 minutes” (p. 1296), we can estimate that it took between 1.5 and 2 hours for 240 cars to pass. In order to test 4,800 drivers, then, a total of 30 to 40 hours of testing would be required. The "t-shirt colour" article also states (p. 77) that the experiment “took place during summer weekends on clear sunny afternoons between 2 and 5 PM.” With three hours being available on Saturday and three more on Sunday, the experiment would thus have taken between five and seven complete weekends, assuming that every hour of testing time was sunny (a contingency that is far from guaranteed in Brittany). Yet none of the confederates who gave up multiple weekends to accomplish this Herculean task on behalf of psychological science are listed as co-authors, or even acknowledged in any way, in the resulting article.

Additionally, the design of the experiment appears to require that only drivers who were alone in their cars should be counted, since the purported effect of the red t-shirt was to increase the sexual attractiveness of the wearer. You might expect that if a male driver's wife is in the car, it could affect any sexually-motivated enthusiasm he might have for offering a lift to the hitchhiker, whatever the colour of her t-shirt; alternatively, a female driver might have been willing to help the hitchhiker had all of the seats in her car not been full of children. We have seen a statement by Dr. Guéguen in which he confirmed that "L’expérience n’incluait effectivement que des personnes seules. Les automobiles avec plusieurs personnes ne sont pas prises en compte dans l’étude" ("The experiment only included people [driving] on their own. Cars containing multiple people were not counted in the study"). So the figure calculated above for the number of hours and days taken to test the required number of drivers needs to be multiplied by some factor to take into account the percentage of cars with multiple occupants. Given that the study was carried out on sunny weekend afternoons in summer in an area with a substantial number of tourists, it seems reasonable to suppose that perhaps half of the cars driving along the road might have had more than one occupant, which would either double the number of weekends required for collection of the data to between 10 and 14, or in any case more than compensate in our calculations for any growth in local traffic since the "variable bust size" study was conducted.

Another way to think about the time involved is to consider the interactions of the hitchhiker with the drivers who stopped. Even if it took an average of only two minutes to catch up to the car where it stopped (probably some distance along the road from her), introduce herself, explain that there was an experiment taking place, "warmly thank" the driver, and return to her starting point, that would require nearly 20 hours (i.e., more than six afternoons) just for the 579 drivers (450 male, 129 female) who were reported as having stopped, even assuming that in every case a new driver then stopped immediately afterwards. If drivers were only stopping at the rate of one every five minutes overall (12 per hour), it would take 48 solid hours to test 579 drivers.
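The back-of-the-envelope arithmetic in this section is simple enough to check mechanically. A quick Python sketch, using only the figures quoted above:

```python
# Figures quoted in the two articles (point 4 above).
drivers_to_test = 4800   # total drivers in the "t-shirt colour" study
stopped = 579            # 450 male + 129 female drivers who stopped

# Traffic flow from the "bust size" study: 100 cars per "about 40 to 50 minutes".
hours_min = drivers_to_test / 100 * 40 / 60   # 32.0 hours at the faster flow
hours_max = drivers_to_test / 100 * 50 / 60   # 40.0 hours at the slower flow

# 2-5 PM on Saturday and Sunday gives 6 usable hours per weekend.
weekends_min = hours_min / 6                  # ~5.3 weekends
weekends_max = hours_max / 6                  # ~6.7 weekends

# Two minutes of interaction per driver who stopped:
interaction_hours = stopped * 2 / 60          # 19.3 hours

# One stop every five minutes (12 per hour):
hours_at_12_per_hour = stopped / 12           # 48.25 hours

print(hours_min, hours_max, interaction_hours, hours_at_12_per_hour)
```

None of these quantities depends on anything beyond the numbers reported in the two articles; anyone can redo the sums in a few lines.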

5. Problems with recording the sex of the drivers
As mentioned in the introduction, there were two observers whose job was to observe every passing car and record its driver's sex. (Per the previous point, it is worth thinking about the challenge of determining whether or not a driver is on their own in the vehicle, which requires, for example, determining whether a car driving past at around 20 metres per second does or does not have a small child in the back seat.) The article states (p. 77) that “[t]he convergence between the two observers’ evaluation was high (r = 0.97)”.

There is a major problem here. In order for a correlation coefficient to be calculated, we need more information than the simple total numbers of male and female drivers. Specifically, the two observers would need to independently record both the sex of each driver and the sequence in which those drivers were observed; for example, with ten drivers and disagreement about the sex of the third, the correlation between MFFFMMMFMM and MFMFMMMFMM would be .80. However, the article reports (p. 77) that each of the observers “used two hand-held counters, one to count the female motorists and the other to count the male motorists”. The term “hand-held counters” suggests simple mechanical devices, such as those used to count attendees at sporting events. But without synchronized timestamps across all four of these counters, or some other form of sequential tracking, it is not possible to establish the order in which the drivers passed each of the observers. More sophisticated methods of collecting and correlating these data can be imagined (for example, using laptop computers), but of course both observers had their hands full with the counters. With just a count of male and female drivers from each observer, stating a correlation coefficient makes no sense. It is therefore entirely unclear how the author could have established the correlation coefficient that he reported.
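To make the ten-driver example above concrete, here is a small Python sketch (coding M as 1 and F as 0); a plain Pearson correlation over the two sequences gives the stated .80:

```python
def pearson(x, y):
    # Plain Pearson correlation coefficient over two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# The two observers' hypothetical records, differing only at the third driver.
obs1 = [1 if c == "M" else 0 for c in "MFFFMMMFMM"]
obs2 = [1 if c == "M" else 0 for c in "MFMFMMMFMM"]

print(round(pearson(obs1, obs2), 2))  # 0.8
```

The point is that this calculation needs the two driver-by-driver sequences as input; four running totals from mechanical counters simply do not contain the information required.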

In view of the above points, it is not clear how the study can actually have taken place as described in the article. As noted above, we have seen a statement by Dr. Guéguen (with whom we have been in indirect contact for almost two years now, via the good offices of the French Psychological Society, about a number of problems in several of his published articles; more on this to come in a subsequent post) concerning the question of whether only drivers who were on their own were tested. That statement did not, however, provide any specific or relevant answers to any of the other issues about this article that I have discussed here.

[[Update 2017-11-29 22:08 UTC: Added link to Cathleen O'Grady's article. ]]


  1. This looks like the beginning of a promising series.

    Just a random thought regarding the time needed to collect the data: could the test have not taken place on both sides of the road at the same time? One hitchhiker for each direction? That would cut the time in half, making it fit into a single summer.

    1. Well, it could. But you might have thought the author would have mentioned that. As you will see as the story unfolds, the method gives lots of irrelevant detail in some places (e.g., the exact colours of the t-shirts) and yet omits some obvious things in others.

  2. The traffic in and out of the Quiberon Peninsula might have offered richer pickings ... sometimes slowing to an absolute crawl.
    (but otherwise I agree; and who knew that Bretagne had so many famous peninsulas)

  3. So do you assume that these studies were actually conducted? And are there any other papers from this author with such irregularities?

    1. >>So do you assume that these studies
      >>were actually conducted?

      We (James and I) are disappointed that despite several opportunities to do so (more to come in a forthcoming post, but see https://arstechnica.com/science/2017/11/researchers-find-oddities-in-high-profile-gender-studies/ in the meantime), Dr Guéguen has not provided us with any evidence that these studies took place as described.

      >>And are there any other paper from this
      >>author with such irregularities?

      Yes. Lots and lots and lots. Watch this space, but also, feel free to pick any of his articles (many are on ResearchGate) and read them with a critical eye.

  4. Thanks for bringing light to this, and for being eminently fair in the process. There is often a tell when someone is cutting corners (in any context), and providing immense detail in non-essential areas while being vague in the more central aspects (a sleight of hand, if you will) is one. Good pick-up!

  5. Thanks Nick (and James). This is great work.

    What are all the journals these questionable articles have been published in? The two cited above—Color Research & Application and Perceptual and Motor Skills—well... let's just say I'm not familiar with these journals, in part because it seems unlikely they are influential in any social science!

    Could this be evidence to suggest the peer review system DID work in this case? Either the articles were submitted to better journals and rejected, or the author chose NOT to submit them to better journals knowing they would be rejected. Instead, he chose journals with lower standards. (As you probably know, many of these small journals can struggle to meet their page requirements for each issue, necessitating reduced standards in one form or another.)

    I'm hoping your work (and coverage on places like Ars) will lead to better media scrutiny and the recognition that not all journals (and articles published) are created equal, even when the findings are "sexy."

    1. @Michael Braun: Someone contacted me via DM to say that he thought he remembered desk-rejecting one or more articles by Guéguen when he was an editor at . But I don't think that gets the peer-review system out of the woods, because (a) as far as the people who recycle press releases are concerned pretty well all journals are equal, and (b) the only measure we have of a "good" journal is the impact factor, which has problems of its own.

      One problem is that there is no way for journals to put out an alert saying "Attention colleagues, author X is hawking a paper on topic Y which looks very dubious because , please be on the lookout for this manuscript and exercise extreme caution". Even without going to so-called "predatory" journals (FWIW, I don't believe in the binary categorisation of journals as predatory/legitimate; I think it's a rather smooth continuum), you can get pretty much anything published if you have enough patience and you already have however many "high-impact" publications you need for your particular career aims.

      We were told, anonymously IIRC, that the Cornell Food and Brand Lab has a person whose job includes submitting articles to a list of journals in rotation; when journal A rejects they just turn round and submit to journal B.

    2. Great points, Nick. Thanks for your reply.

    3. >>We were told, anonymously IIRC, that the Cornell Food and Brand Lab has a person whose job includes submitting articles to a list of journals in rotation; when journal A rejects they just turn round and submit to journal B.

      This is common practice I think, although it isn't usually formalized as an actual list. But a given lab will generally have an informal ranking of "try here" journals, submitting to the highest first and then working down for each paper...

  6. The French Wikipedia page on Dr Guéguen has been updated with a section on these allegations of scientific fraud.

  7. My earlier comment got snaffled so here is a quick summary:

    The correlations could be derived from blocks of counts (e.g., between breaks or sunny spells). This would, however, still be misleading, as it would likely inflate r over what you'd get from the raw data in sequence.
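    [That inflation is easy to demonstrate with a quick simulation; the error rate and traffic mix below are assumptions for illustration, not figures from the article. If the proportion of male drivers drifts between blocks, the block totals correlate far more strongly than the driver-by-driver records do:]

```python
import random

def pearson(x, y):
    # Plain Pearson correlation over two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

random.seed(42)
raw1, raw2, tot1, tot2 = [], [], [], []
for _ in range(20):                      # 20 blocks of 240 drivers each
    p_male = random.uniform(0.5, 0.8)    # traffic mix drifts between blocks
    truth = [int(random.random() < p_male) for _ in range(240)]
    # Each observer independently misrecords 10% of drivers.
    o1 = [t if random.random() > 0.1 else 1 - t for t in truth]
    o2 = [t if random.random() > 0.1 else 1 - t for t in truth]
    raw1 += o1; raw2 += o2
    tot1.append(sum(o1)); tot2.append(sum(o2))

# Driver-by-driver agreement is modest; the block totals look near-perfect.
print(round(pearson(raw1, raw2), 2), round(pearson(tot1, tot2), 2))
```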

    In the UK I'd expect these studies to have evidence of ethical approval (and indeed major ethics concerns to be addressed - including informed consent and risk/safety assessment for collecting these data on a busy road). I'm not familiar with the equivalent French process - but the procedures appear deeply flawed if (as the Ars Technica article suggests) he reviewed ethics within his own lab.

    1. French colleagues have told me that ethical standards are not high in France, unless you are performing a psychological intervention on the participant. A more likely source of concern is insurance; for example, if the undergraduate confederates drove to the beach in their own cars, would this count as a "work-related journey"? You need a specific clause in your insurance policy in France for any work-related usage apart from your daily commute, so this is absolutely something that would normally require checking, which would be recorded somewhere.

  8. I'm loving this investigation.

    My theory is the academic sets experiments as assessment. His students do them and he writes them up as his own work.

    Some students are good. They do the work, collect the data, and report. Thanks to the uncooperative nature of the real world, their findings fail to bother the null hypothesis.

    Some students are bad. The evening before the assignment is due, they invent an eye-catching experiment that they claim to have done. They, having failed to pay attention in class, are unaware that substantial effect sizes are unlikely. So they report strong results.

    The lecturer, knowing how the media works, and insouciant with respect to the file drawer effect, publishes the invented studies most often.

    1. @Jase: Even if that were true, it would be no excuse (the authors, and only the authors, are responsible for the veracity of the contents of an article). But in any case we are fairly convinced that the scenario you mention is not correct. Stay tuned.

  9. Unsolicited advice: avoid the sarcasm and snide comments, however mild. They only undermine your strong case.

    1. It's an interesting question. I often start with a very sarcastic/snide piece and trim it down. In this case I started with a fairly academic piece (more will be revealed shortly) and added some stuff to spice it up a bit. I think that there is a level of absurdity beyond which it is impossible to keep a straight face, but I appreciate that tastes will vary on this.

  10. Great post.

    One major feature of the Guéguen case is that the people who (ostensibly) collected the data were not listed as authors.

    We've seen this before in similar cases, such as Jens Förster, Michael LaCour, and more IIRC.

    Could this be a pattern? Could "silent" data collectors even be a warning sign of misconduct in itself?

    Having the data collectors named as authors creates a check against outright data fabrication, because few people would sign up as authors knowing that they had in fact done nothing. Of course, author-data collectors can still manipulate and fabricate data, but it does seem to rule out outright, whole-cloth invention of studies.

    The Stapel case doesn't fit my theory, in that Stapel did claim to have collected all of the data personally (IIRC?), but this ought perhaps to have raised questions in itself, as professors don't generally do that.

    1. Stapel certainly reported many cases of faking the data and then claiming that he had collected them personally, for example because "I have this contact at the school and they trust me but they wouldn't be comfortable with grad students turning up and testing the kids".

      What Stapel does have in common with Guéguen (and also with Brian Wansink, in the surveys from his UIUC days) is a remarkable absence of funding. Stapel claimed that he paid his high school confederates a few Euros to help him. Wansink claims to have financed one or more 2,000-person mailing-list surveys from his own pocket (that must have cost close to $20,000 by the time someone has been paid to enter the data, given that according to his recent claims about how the veterans survey went wrong the number of variables exceeded the limits of Excel in the year 2000, meaning there were more than 256). Guéguen claims that he never has to pay his students a penny in expenses (more details to come in a week or two). Of course, when the story is "no expenses were paid", there is no reason to ask the accounts department for the paper trail...

  11. Hans van Maanen, the author of https://www.volkskrant.nl/archief/847-eenzame-eters-en-nog-meer-taalfouten~a3261426/ is a science journalist / science writer. See https://vanmaanen.org/hans/ for backgrounds.

  12. As someone who used to work in the paint-mixing business: we used a color picker which had a scanner in it that you would put over a sample (generally a paint chip or piece of clothing), and it would output a series of values meant to replicate that color. It was... of limited value. It was of even MORE limited value in any area with bright light, or with bulbs that weren't 3500K (pure white), or when used incorrectly.

    We would normally test a sample 3 or 4 times (likely showing my hardware store had a higher level of academic rigor than Dr. Guéguen's alleged team) and then pull up the values and eyeball which was closest based on personal opinion (I maintain my statement about academic rigor relative to Dr. Guéguen). I would not be surprised to learn he obtained his values with a similar device which he used once in a well lit area and then recorded which is what gave him such pale and odd looking results. I WOULD be surprised to learn he tested black and white AT ALL since getting pure black or pure white with one of those things doesn't happen.

    None of this is to defend the study. Just provide some insight into how those HSL values most likely came about.

    1. Thanks - learning stuff like this is a good enough reason on its own to write a blog!

  13. I know this is a very late comment and likely no one cares, but I think it's likely the colors are innocuous. I think they are HSV instead of HSL. Using HSV space gives sensible colors, aside from yellow, which can be explained with a simple "29" in place of the reported "19". I have seen *many* art blogs that get the two spaces confused or backwards, so I can easily imagine a scientist doing the same.

    The rest of it is more of an issue, but I do think that one thing is innocuous.