09 August 2022

An interesting lack of randomness in a published dataset: Scott and Dixson (2016)

Martin Enserink has just published the third instalment in an ongoing story of strange results and possible → likely → confirmed misconduct in the field of marine biology, and more specifically the purported effects of climate change on the behaviour of fish. The first two instalments are here (2020) and here (2021).

After Martin's 2021 article, I wrote this blog post describing a few analyses that I had contributed to this investigation. Today I want to look at a recently-corrected article from the same lab, mentioned by Martin in his latest piece (see the section entitled "A corrected paper"), and in particular at the data file that was released as part of the correction, as I think that it illustrates an interesting point about the forensic investigation of data.

Here is the article:

Scott, A., & Dixson, D. L. (2016). Reef fishes can recognize bleached habitat during settlement: Sea anemone bleaching alters anemonefish host selection. Proceedings of the Royal Society B, 283, 20152694. https://doi.org/10.1098/rspb.2015.2694

A correction notice was issued for this article on July 8, 2022, and that correction was accompanied by a data file, which can be downloaded from here.

I suggest that you read Martin's articles to get an idea of the types of experiments being conducted here, as the Scott and Dixson article is typical of many coming from the same lab. Basically, 20 (usually) fish were each tested 24 times to see if they "preferred" (chose to swim in) one of two flumes (streams of water), A and B, and then this set of 24 trials was repeated a second time, so each pair of flumes was tested by 960 trials. In some cases the fish would be expected to have no preference between the flumes, and in others they should have a preference for water type A over B (for example, if B contained the odour of a predator or some other chemical suggesting an unfavourable environment). The fact that in many cases the fish preferred water B, either when they were expected to have no preference or (even worse) when they were expected to prefer water A, was taken by the authors as an indication that something had gone wrong in the fish's ability to make adaptive choices in their environment.

Here are the two main issues that I see with this (claimed) dataset.


This isn't what a dataset looks like

As I noted in my earlier post, this isn't what a dataset looks like. You don't collect data in multiple 2-D panels and lay those out in a further 2-D arrangement of 18 x 5 panels. If for some reason you did that, you would need to write some quite sophisticated software—probably a couple of hundred lines of Python or R code—to read the data in a form that is ready to generate the figures and/or tables that you would need for your article. (The code that I wrote to read the dataset from the Dixson et al. article that was the subject of my earlier blog post is around 300 lines long, including reasonable spacing.) That code would have to be able to cope with the inevitable errors that sneak into files when you are collecting data, such as occasional offsets, or a different number of fish in each chunk (the chunks on lines 78 through 94 only have 17 fish rather than 20; incidentally, the article says that each experiment was run on 18 to 20 fish), or an impossible value such as we see at cell DU46.

So there would seem to be two possibilities. Either the authors have some code that reads this file and reliably extracts the data in a form suitable for running the analyses; or, they have another data file which is more suited to reading into SPSS or R without having to strip away all of the formatting that makes the Excel sheet relatively visually appealing. Either way, they can surely provide one or other of those to us so that we can see how they dealt with the problems that I listed above. (I will leave it up to the reader to decide if there are any other possibilities.)


There is too little variation... in the unremarkable results

In my earlier blog post on this topic I analysed another dataset from the same lab (Dixson et al., 2014) in which there were numerous duplications, whereby the sequence of the numbers of choices of one or other flume for the 20 fish in one experiment was often very similar to the sequence in another experiment, when there was no reason for that to be the case.

In the current dataset there are a few sets of repeated numbers of this kind (see image), but I don't think that they are necessarily a problem by themselves, for a couple of reasons.


Were these lines (in green) copied, or are the similarities caused by the limited range of the data? My hunch is that it's the latter, but it doesn't really matter.


First, these lines only represent sequences of a few identical numbers at a time, whereas in the 2014 dataset there were often entire duplicated groups of 20 fish.

Second, for most of these duplications, the range of the numbers is severely restricted because they (at least ostensibly) correspond to a large experimental effect. The Scott and Dixson article reports that in many cases, the fish chose flume A over flume B almost all of the time. This means that the numbers of observations of each fish in flume A, out of 24 opportunities, must almost always be 22 or 23 or 24, in order for the means to correspond to the figures in the article. There are only so many ways that such a small number of different predicted values can be distributed, and given that the person examining the dataset is free to look for matches across 180 20-fish (or 17-fish) columns of data, a number of duplicates of a certain length will very likely arise by chance.

However, the dataset also contains a number of cases where the fish appeared to have no preference between the two flumes. The mean number of times out of 24 trials that they were recorded as having chosen flume A (or flume B) in these cases was close to 12. And it turns out that in almost all of these cases, there is a different kind of insufficient variation: not in the sequence of the observations (i.e., the numbers observed from top to bottom of the 20 fish across experiments), but in the variability of the numbers within each group of fish.

If the fish genuinely don't have a preference between the two flumes, then each trial is basically a Bernoulli trial with a probability of success of 0.5, which is a fancy way of saying a coin toss, and so the 24 trials for each fish represent 24 coin tosses. Now, when you toss a coin 24 times, the most likely result is 12 heads and 12 tails, corresponding to the fish being in flume A and B 12 times each. However, although this result is the most likely, it's not especially likely; it will occur about 16% of the time, as you can see at this site (put 24 into "Number of Bernoulli trials", click Calculate, and the probability of each result will appear in the table under the figure with the curve). If you repeat those 24 trials 100 times, you would expect to get 8 As and 16 Bs either 4 or 5 times, and 8 Bs and 16 As also either 4 or 5 times.
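For those who prefer code to an online calculator, the same numbers drop straight out of the binomial distribution; here is a minimal check in R:

```r
# Probabilities for a fish with no preference (24 independent "coin toss" trials)
dbinom(12, size = 24, prob = 0.5)   # 12 A / 12 B, the most likely result: about 0.16
dbinom(8,  size = 24, prob = 0.5)   # 8 A / 16 B: about 0.044, i.e. 4 or 5 times per 100 repeats
sum(dbinom(0:24, size = 24, prob = 0.5))   # the 25 possible outcomes sum to 1, as they must
```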

Now let's look at the dataset. I identified 32 columns of data with 20 (or, in a few cases, 17) fish and a mean of around 12. I also included 3 other columns which had one or more values of 12; as I hope will become clear, this inclusion works in the authors' favour. I then calculated the standard deviation (SD) of the 20 (or 17) scores (each of which is the result of 24 trials) in each of these 35 columns of data.

Next, I generated one million random samples of 24 trials for 20 simulated fish and calculated the SD of each sample. For each of the 35 SDs taken from the dataset, I calculated the fraction of those million simulated SDs that were smaller than the dataset value. In other words, I calculated how likely it was that one would observe an SD as small as the one that appears in the dataset if the values in the dataset were indeed taken from 24 trials of 20 fish that had no preference between the flumes. Statistically-minded readers may recognise this as the p value for the null hypothesis that these data arose as the result of a natural process, as described by the authors of the Scott and Dixson paper.
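Readers who want to reproduce this can do so in a few lines of R. Here is a minimal sketch of the simulation described above; the observed SD of 0.7863 is the one highlighted in the image further down:

```r
set.seed(2022)
n_sims <- 1e6   # one million simulated groups of 20 fish

# each simulated fish makes 24 independent choices with no preference (p = 0.5);
# for each group of 20 such fish, record the SD of their 20 scores
# (this may take a minute to run)
sim_sds <- replicate(n_sims, sd(rbinom(20, size = 24, prob = 0.5)))

# empirical p value: the fraction of simulated SDs smaller than the observed one
observed_sd <- 0.7863         # the value highlighted in yellow in the image below
mean(sim_sds < observed_sd)   # zero matches out of a million for this example
```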

The results are not very good for the authors. Only nine of the samples, including the three that contain a small number of scores of 12 but otherwise have a substantially different mean, have p values greater than 0.05. Seven of the p values are zero, meaning that an SD as low as the one corresponding to the data reported by the authors did not occur at all in one million simulated samples (see image below for an example). A further six p values are less than 0.0001, and four more are less than 0.001. The overall chance of obtaining these results from a natural process is hard to calculate accurately (for example, one would need to make a small adjustment for the fact that the results come in pairs of 20-fish samples, as each fish took part in 2 sets of 24 trials and those two sets are not independent), but in any case I think it can safely be described as homeopathic, if only on the basis of the seven cases of zero matches out of one million.


Remarkably consistent results. SD in yellow (0.7863), proportion of simulated data values that have a lower SD in green (0.000000).


Conclusion

Lack of expected variability is a recurring theme in the investigation of bad science. Uri Simonsohn was one of the pioneers of this in his paper "Just Post It", and more recently Kyle Sheldrick came up with a novel method of checking whether the sequence of values in a dataset is "too regular". I hope that my explanation of the issues that I see in the Scott and Dixson dataset is clear.

Martin Enserink's latest piece mentions that the University of Delaware is seeking the retraction of three papers with Danielle Dixson as an author. Apparently the Scott and Dixson (2016) article—which, remember, has already been corrected once—is among those three papers. If nobody identifies a catastrophic error in my analyses then I plan to write to the editors of the journal to bring this issue to their attention.


Data availability

I have made an annotated copy of that file available here, which I think constitutes fair use.




07 March 2022

Some examples of apparent plagiarism and text recycling in the work of Dr Paul McCrory

Dr Paul McCrory of the Florey Institute of Neuroscience and Mental Health has been in the news in the past few days. This started with a single retraction of an apparently plagiarised editorial piece in the British Journal of Sports Medicine from 2005, but after I started digging further and more problems came to light, he has now resigned as chair of the influential Concussion in Sport Group (CISG), as reported by The Guardian and The Athletic, among other outlets.

Since much of this story has only been covered in a series of separate threads on Twitter up to now, I thought I would take some time to document in one place the full extent of what I have found about Dr McCrory's extensive recycling of his own and others' writing.

The first five exhibits are already in the public domain, but I will include them here for completeness. If you have been following the story on Twitter up to now, you can skip straight to Exhibit 6.


Exhibit 1

McCrory, P. (2005). The time lords. British Journal of Sports Medicine, 39(11), 785–786.

About 50% of this article has been copied, verbatim and without appropriate attribution, from this 2000 article in Physics Today by Steve Haake, who was the person who first discovered Dr McCrory's plagiarism and brought it to the attention of the current editor-in-chief of the British Journal of Sports Medicine. The copied text is highlighted in pink here:


The editorial has now been retracted. This was reported by Retraction Watch on February 28, 2022. At that point I started looking into other articles by the same author.


Exhibit 2

McCrory, P. (2005). Definitions for the purist. British Journal of Sports Medicine, 39(11), 786.

About 70% of this article has been copied, verbatim and without appropriate attribution, from this website. A copy of that page, archived on May 22, 2003 (that is, two years before Dr McCrory's article was published) can be found here. The copied text is highlighted in yellow here:

I tweeted about this article on March 1, 2022. Retraction Watch picked up on that and later reported that the author had asked for the article to be retracted, giving an explanation that I found less than impressive.


Exhibit 3

McCrory, P. (2006). Take nothing but pictures, leave nothing but footprints…? British Journal of Sports Medicine, 40(7), 565. https://doi.org/10.1136/bjsm.2006.029231

Nearly 80% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: This website. A copy of that page, archived on December 7, 2003 (that is, more than two years before Dr McCrory's article was published) can be found here.
  • Pink: This article from New Scientist, dated April 16, 2005.
  • Blue: This website, dated March 2006 (several months before Dr McCrory's article was published). An archived copy from May 2, 2006 can be found here.
  • Green: This website, dated November 2005.
  • Grey: This website. An archived copy from September 6, 2003 can be found here.


As with Exhibit 2, I tweeted about this on March 1, 2022. The author came up with a quite remarkable story for Retraction Watch about why this article only merited a correction. I found that even less impressive than his excuses in the previous case.


Exhibit 4

McCrory, P. (2002). Commotio cordis. British Journal of Sports Medicine, 36(4), 236–237.

About 90% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: Curfman, G. D. (1998). Fatal impact — Concussion of the heart. New England Journal of Medicine, 338(25), 1841–1843. https://doi.org/10.1056/NEJM199806183382511
  • Blue: Nesbitt, A. D., Cooper, P. J., & Kohl, P. (2001). Rediscovering commotio cordis. The Lancet, 357(9263), 1195–1197. https://doi.org/10.1016/S0140-6736(00)04338-5

James Heathers discovered a couple of these overlaps on March 3, 2022 and I tweeted the full picture on March 4, 2022.

Exhibit 5

McCrory, P. (2005). A cause for concern? British Journal of Sports Medicine, 39(5), 249.

Almost half of the words in this article have been copied, verbatim and without appropriate attribution, from the following source:

  • Piazza, O., Sirén, A.-L., & Ehrenreich, H. (2004). Soccer, neurotrauma and amyotrophic lateral sclerosis: Is there a connection? Current Medical Research and Opinion, 20(4), 505–508. https://doi.org/10.1185/030079904125003296

 The copied text is highlighted in pink here:

I tweeted about this on March 4, 2022.

Exhibit 6

McCrory, P. (2002). Should we treat concussion pharmacologically? British Journal of Sports Medicine, 36(1), 3–5.

Almost 100% of the text has been copied, verbatim and without appropriate attribution, from:

  • McCrory, P. (2001). New treatments for concussion: The next millennium beckons. Clinical Journal of Sport Medicine, 11(3), 190–193.

That copied text is highlighted blue (light or dark) in the image below. The text in dark blue also overlaps with this MedLink article. Thus, either Dr McCrory plagiarised three paragraphs from MedLink in two separate articles, or MedLink plagiarised him. The MedLink article was initially published in 1997, but it has been updated since, so the direction of copying cannot be established with certainty unless I can find an archived copy from 2001. It may, however, be interesting that the "phase II safety and efficacity trial" mentioned (Dr McCrory's reference 22) has a date of 1997.


James Heathers discovered one of the overlaps in this text on March 3, 2022, but it took another couple of hours' work at my end to uncover the full extent of the text recycling and possible plagiarism in this article.


Exhibit 7

McCrory, P. (2006). How should we teach sports medicine? British Journal of Sports Medicine, 40(5), 377.

About 60% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Pink: Fallon, K. E., & Trevitt, A. C. (2006). Optimising a curriculum for clinical haematology and biochemistry in sports medicine: A Delphi approach. British Journal of Sports Medicine, 40(2), 139–144. https://doi.org/10.1136/bjsm.2005.020602
  • Blue: Long, G., & Gibbon, W. W. (2000). Postgraduate medical education: Methodology. British Journal of Sports Medicine, 34(4), 235–245.
Note that the Fallon & Trevitt article was published in the same journal just three months before it was plagiarised.


Exhibit 8

McCrory, P. (2008). Neurologic problems in sport. In M. Schwellnus (Ed.), Olympic textbook of medicine in sport (pp. 412–428). Wiley.

About 25% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources. Of that 25%, about two-thirds is recycled from other publications by the same author, and the remainder is plagiarised from other authors, as follows:
  • Orange: McCrory, P. (2000). Headaches and exercise. Sports Medicine, 30(3), 221–229. https://doi.org/10.2165/00007256-200030030-00006
  • Green: McCrory, P. (2001). Headache in sport. British Journal of Sports Medicine, 35(5), 286–287.
  • Blue: McCrory, P. (2005). A cause for concern? British Journal of Sports Medicine, 39(5), 249. (See also Exhibit 5.)
  • Yellow: Showalter, W., Esekogwu, V., Newton, K. I., & Henderson, S. O. (1997). Vertebral artery dissection. Academic Emergency Medicine, 4(10), 991–995. https://doi.org/10.1111/j.1553-2712.1997.tb03666.x
  • Pink: This MedLink article, which was initially published in 1996, but has been updated since, so the direction of copying cannot be established with certainty unless I can find an archived copy from 2008. It may, however, be interesting that the citations in the pink text (Kaku & Lowenstein 1990; Brust & Richter 1977) both (a) predate the MedLink article and (b) are not — or no longer — referenced at the equivalent points in the MedLink text. It would seem unlikely that MedLink would (a) plagiarise Dr McCrory's article from 2008 at some point after that date and (b) remove these rather old citations (without replacing them with new ones).
(Don't bother squinting too hard at the page - the annotated PDF is available for you to inspect. See link at the end of this post.)

Exhibit 9

McCrory, P., & Turner, M. (2015). Concussion – Onfield and sideline evaluation. In D. McDonagh & D. Zideman (Eds.), The IOC manual of emergency sports medicine (pp. 93–105). Wiley.

About 50% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources, as follows:

  • Blue: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Head injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 58–94). Wiley.
  • Yellow: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Rehabilitation of acute head and facial injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 95–100). Wiley.
  • Green: Aubry, M., Cantu, R., Dvorak, J., Graf-Baumann, T., Johnston, K., Kelly, J., Lovell, M., McCrory, P., Meeuwisse, W., & Schamasch, P. (2001). Summary and agreement statement of the first International Conference on Concussion in Sport, Vienna 2001. British Journal of Sports Medicine, 36(1), 6–10. https://doi.org/10.1136/bjsm.36.1.6
  • Pink: McCrory, P. (2015). Head injuries in sports. In M. N. Doral & J. Karlsson (Eds.), Sports injuries (pp. 2935–2951). Springer.

The pink text also appears in Exhibit 10, which was published in the same year, so it's not clear which is the original and which is the copy. I tweeted about some of the similarities between Exhibits 9 and 10 here, although I hadn't found everything at that point.

The green text in the final paragraph on page 105 appears to have been copied and pasted twice (it appears in two paragraphs on page 104), which might cause the reader to wonder exactly how much care and attention went into this copy-and-paste job.

Readers who are interested in the activities of the CISG might be interested to note that the 2001 Vienna conference (the "green" text reference above) was where the name of this group was first coined.

(Note that five pages, corresponding to the photographic reproduction of the Sport Concussion Assessment Tool and the Pocket Concussion Recognition Tool, have been omitted from this image.)

Exhibit 10

McCrory, P. (2015). Head injuries in sports. In M. N. Doral & J. Karlsson (Eds.), Sports injuries (pp. 2935–2951). Springer.

About 90% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources, as follows:

  • Blue (light and dark): McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Head injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 58–94). Wiley.
  • Yellow: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Rehabilitation of acute head and facial injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 95–100). Wiley.
  • Pink: McCrory, P., & Turner, M. (2015). Concussion – Onfield and sideline evaluation. In D. McDonagh & D. Zideman (Eds.), The IOC manual of emergency sports medicine (pp. 93–105). Wiley.

The pink text also appears in Exhibit 9, which was published in the same year, so it's not clear which is the original and which is the copy.

The text in dark blue has been copied twice from the same source; again, it seems as if this chapter was not assembled with any great amount of care.

(Note that six pages, corresponding to the photographic reproduction of the Sport Concussion Assessment Tool and the Pocket Concussion Recognition Tool, have been omitted from this image.)


Conclusion

The exhibits above present evidence of extensive plagiarism and self-plagiarism in seven editorial pieces in the British Journal of Sports Medicine from 2002 through 2006, and three book chapters from 2008 through 2015. As well as the violations of publication ethics and other elementary academic norms, most of these cases would also seem to raise questions about copyright violations.

This is not an exhaustive collection; I have evidence of these transgressions on a smaller scale in a number of other articles and book chapters from the same author, but a combination of time, weariness (mine as investigator and, presumably, the reader's too), and lack of access to source materials (for example, I was only able to find one extensively recycled book chapter on Google Books, which is not very practical for marking up) has led me to stop at 10 exhibits here.

I have no background or experience in the field of head trauma or sports medicine, and I had never heard of Dr McCrory or the CISG until last week. Hence, I am unable to comment about what all of this might mean for the CISG or its influence on the rules and practices of sport. However, although I try not to editorialise too much in this blog, I must say that, based on what I have found here, Dr McCrory does not strike me as an especially outstanding example of scientific integrity, and it does make me wonder what other aspects of his life as a scientist and influencer of public policy might not stand up to close scrutiny.


Data availability

All of the supporting files for this post can be found here. I imagine that this involves quite a few copyright violations of my own, in that many of the source documents are not open access. I hope that the publishers will forgive me for this, but if I receive a legal request to take down any specific file I will, of course, comply with that.


31 October 2021

A bug and a dilemma

A few months ago, I discovered that the SAS statistical software package, which is used worldwide by universities and other large organisations to analyse their data, contained—until quite recently—a bug that could result in information that the user thought they had successfully deleted (and which was no longer visible from within the application itself) still being present in the saved data file. This could lead to personally identifiable information (PII) about study participants being revealed, alongside whatever other data might have been collected from these participants, which—depending on the study—could potentially be extremely sensitive. I found this entirely by chance when looking at an SAS data file to try to work out why some numbers weren't coming out as expected, for which it would have been useful to know whether numbers are stored in ASCII or binary. (It turned out that they are stored in binary.)

Here's how this bug works: Suppose that as a researcher you have run a study on 80 named participants, and you now have a dataset containing their names, study ID numbers (for example, if the study code within your organisation is XYZ this code might be XYZ100, XYZ101, etc, up to XYZ179), and other relevant variables from the study. One day you decide to make a version of the dataset that can be shared without the participants being identifiable, either because you have to deposit this in an archive when you submit the study to a journal, or because somebody has read the article and asked for your data. You could share this in .CSV file format, and indeed that would normally be considered best practice for interoperability; but there may be good reasons to share it in SAS's native binary data file format with a .sas7bdat extension, which can in any case be opened in R (using a package named "sas7bdat", among others) or in SPSS.
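If you are on the receiving end of such a file, reading it into R takes one line; a minimal sketch, with a hypothetical file name:

```r
# install.packages("sas7bdat")   # the 'haven' package (read_sas) also works
library(sas7bdat)
df <- read.sas7bdat("mystudy.sas7bdat")   # hypothetical file name
str(df)   # shows only the columns that were *meant* to be shared
```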

So you open your file called participants-final.sas7bdat in the SAS data editor and delete the column with the participants' names (and any other PII, such as IP addresses, or perhaps dates of birth if those are not needed to establish the participants' ages, etc), then save it as deidentified-participants-final.sas7bdat, and share the latter file. But what you don't know is that, because of this bug, in some unknown percentage of cases the text of most of the names will still be sitting in the sas7bdat binary data file, close to the alphanumeric participant IDs. That is, if the bug has struck, someone who opens the "deidentified" file in a plain text editor (which could be as simple as Notepad on Windows) might see the names and IDs among the binary gloop, as shown in this image.

I am pretty sure these two people did not take part in this study.

This screenshot shows an actual extract from a data file that I found, with only the names and the study ID codes replaced with those of others selected from the phone book. The full names of about two-thirds of the participants in this study were readable. Of course, you can't read the binary data and it would take a lot of work to do so, but given the participant IDs (PRZ045 for Trump, PRZ046 for Biden) you can simply open the "anonymised" data file in SAS and find out all you want about those two people from within the application.
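You do not need SAS, or indeed any special tool, to check a file for this problem; the Unix strings utility, or a few lines of R like the following sketch (hypothetical file name), will pull out whatever readable text is lurking among the binary:

```r
f <- "deidentified-participants-final.sas7bdat"   # hypothetical file name
bytes <- as.integer(readBin(f, what = "raw", n = file.size(f)))

# keep printable ASCII (0x20-0x7E) and turn everything else into line breaks,
# then look at any runs of readable characters that remain
bytes[bytes < 0x20 | bytes > 0x7e] <- 0x0a
chunks <- strsplit(rawToChar(as.raw(bytes)), "\n+")[[1]]
sort(unique(chunks[nchar(chunks) >= 6]))   # candidate names, IDs, variable labels...
```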

Even worse, though, is the fact that unless the participant's name is extremely common, when combined with knowledge of approximately where and when the study was conducted it might very well let someone identify them with a high degree of confidence for relatively little effort. And by opening the file in SAS—for example, with the free service SAS OnDemand for Academics, or in SPSS or R as previously mentioned—and looking at the data that was intended to be shared, we will be able to see that our newly-identified participant is 1.73 metres tall, or takes warfarin, or is HIV-positive.

(A number of Microsoft products, including Word and Excel, used to have a bug like this, many versions ago. When you chose "Save" rather than "Save As", it typically would not physically overwrite on the disk any text that you had deleted, perhaps because the code had originally been written to minimise writing operations with diskettes, which are slow.)

I have been told by SAS support (see screenshot below) that this bug was fixed in version 9.4M4 of the software, which was released on 16 November 2016. The support agent told me that the problem was known to be present in version 9.4M3, which was released on 14 July 2015; however, I do not know whether the problem also existed in previous versions. I think it would be prudent to assume that any file in .sas7bdat format created by a version of SAS prior to 9.4M4 may have this issue. Neither the existence of the problem, nor the fact that it had been fixed, was documented by SAS in the release notes for version 9.4M4; equally, however, the support representative did not tell me that the problem is regarded as top secret or subject to any sort of embargo.

(The identity of the organisation that shared the files in which I found the bug has been redacted here.)

SAS is a complex software package and it will generally take a while for large organisations to migrate to a new version. Probably by now most installations have been upgraded to 9.4M4 or later, but quite a few sites might have been using the previous version containing this bug until quite recently, and as I already mentioned, it's not clear how old the bug is (i.e., at what point it was introduced to the software). So it could have been around for many years before it was fixed, and it could well have still been present at many sites for two or three years after that.

Now, this discovery caused me a dilemma. I worried that, if I were to go public with this bug, this might start a race between people who have already shared their datasets that were made with a version prior to 9.4M4 trying to replace or recall their files, and Bad People™ trying to find material online to exploit. That is, to reveal the existence of the problem might increase the risk of data leaking out. On the other hand, it's also possible that the bad people are already aware of the problem and are actively looking for that material, in which case every day that passes without the problem becoming public knowledge increases the risk, and going public would be the start of the solution.

Note that this is different from the typical "white hat"/"bug bounty" scenario, in which the Good People™ who find a vulnerability tell the software company about the bug and get paid to remain silent until a reasonable amount of time has passed to patch the systems, after which they are free to reveal the existence of the problem. In those cases, patching the software fixes the problem immediately, because the extent of the vulnerability is limited to the software itself. But here, the vulnerability is in the data files that were not anonymised as intended. There is no way to patch anything to stop those files from being read, because that only needs a text editor. The only remedy is for the files to be deleted from, or replaced in, repositories as their authors or guardians become aware of the issue.

In the original case where I discovered this issue, I reported it to the owner of the dataset and he arranged for the offending file to be recalled from the repository where he had placed it, namely the Open Science Framework. (I also gave a heads-up to the Executive Director of the Center for Open Science, Brian Nosek, at that time.) The dataset owner also reported the problem to their management, as they thought (and I completely agree) that dealing with this sort of issue is beyond the pay grade of any individual principal investigator. I do not know what has happened since, nor do I think it's really my business.  I would argue that SAS ought to have done something more about this than just sneaking out a fix without telling anybody; but perhaps they, too, looked at the trade-off described above and decided to keep quiet on that basis, rather than merely avoiding embarrassment.

I have spent several months wondering what to do about this knowledge. In the end, I decided that (a) there probably aren't too many corrupt files out there, and (b) there probably aren't too many Bad People™ who are likely to go hunting for sensitive data this way, because it just doesn't seem like a very productive way of being a Bad Person. So I am going public today, in the hope that the practical consequences of revealing the existence of this problem are unlikely to be major, and that giving people the chance to correct any SAS data files that they might have made public will be, on balance, a net win for the Good People. (For what it's worth, I asked two professors of ethics about this, one of them a specialist in data-related issues, and they both said "Ouch. Tough call. I don't know. Do what you think is best".)

Now, what does this discovery mean? Well, if you use SAS and have made your data available using the .sas7bdat file format, you might want to have a look in the data files with a text editor and check that there is nothing in there that you wouldn't expect. But even if you don't use SAS, there may still be a couple of lessons for you from this incident, because (a) the fact that this particular software bug is fixed doesn't mean there aren't others, and (b) everyone makes mistakes.

First, consider always using .CSV files to share your data, if there is no compelling reason not to do so. The other day I had to download a two-year-old .RData file from OSF, and it contained data structures that were already partly obsolete when read by newer versions of the package that had created them; I had to hunt around online for the solution, and that might not work at all at some future point. When I had sorted that out I saved the resulting data in a .CSV file, which turned out to be nearly 20% smaller than the .RData file anyway.

Second, try to keep all PII out of the dataset altogether. Build a separate file or files that connects each participant's study ID number to their name and any other information that is not going to be an analysed variable. If your study requires you to generate a personalised report for the participants that includes their name then this might represent a little extra effort, but generally this approach will greatly reduce the chances of a leak of PII. (I suspect that for every participant whose PII is revealed by bugs, several more are the victims of either data theft or simply failure on the part of the researchers to delete the PII before sharing their data.)
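Here is a minimal sketch of that arrangement (hypothetical file names, column names, and participants):

```r
# PII lives in its own file, keyed by the study ID, and never leaves your machine
pii <- data.frame(study_id = c("XYZ100", "XYZ101"),
                  name     = c("Jane Doe", "John Smith"),
                  dob      = as.Date(c("1980-05-01", "1975-11-23")))

# the analysis file contains the study ID and the analysed variables, nothing else
analysis <- data.frame(study_id  = c("XYZ100", "XYZ101"),
                       height_cm = c(173, 168),
                       outcome   = c(1, 0))
write.csv(analysis, "deidentified-participants-final.csv", row.names = FALSE)

# if a personalised report is needed, join the two files by study ID locally
merge(analysis, pii, by = "study_id")
```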

(Thanks to Marcus Munafò and Brian Nosek for valuable discussions about an earlier draft of this post.)


28 October 2021

A catastrophic failure of peer review in obstetrics and gynaecology

In this post I will discuss a set of 46 articles from the same institution that appear to show severe problems in many journals in the field of obstetrics and gynaecology. These are not entirely new discoveries; worrying overlaps among 35 of these articles have already been investigated in a commentary article from 2020 by Esmée Bordewijk and colleagues that critiqued 24 articles on which Dr Ahmed Badawy was lead author (19) or a co-author (5), plus 11 articles lead-authored by Dr Hatem Abu Hashim, who is Dr Badawy's colleague in the Department of Obstetrics and Gynaecology at Mansoura University in Egypt.

Bordewijk et al. reported that they had detected a large number of apparent duplications in the summary statistics across those articles, which mostly describe randomized controlled trials carried out in the Mansoura ObGyn department. Nine of these articles appear as chapters in Dr Badawy's PhD thesis, which he defended in December 2008 at the University of Utrecht in the Netherlands.

I think it is fair to say that Dr Badawy was not especially impressed by the arguments in Bordewijk et al.'s commentary; indeed, he wrote a reply with the uncompromising title of "Data integrity of randomized controlled trials: A hate speech or scientific work?" in which he questioned, among other things, the simulation techniques that Bordewijk et al. had used to demonstrate how unlikely it was that the patterns that they had observed across the 35 articles that they examined had arisen by chance.

The senior author on the Bordewijk et al. commentary was Dr Ben Mol of Monash University in Melbourne, Australia. Since their commentary was published, Dr Mol and his colleagues have been attempting to get the journal editors who published the 35 articles in question to take some form of action on them. To date, five articles have been retracted and another 10 have expressions of concern. The story, including its potential legal fallout, has been covered in considerable detail at Retraction Watch in December 2020 and again in August 2021.

However, to some extent, Dr Badawy may have a point: The evidence presented in the commentary is circumstantial and depends on a number of probabilistic assumptions, which editors may not be inclined to completely trust (although personally I find Bordewijk et al.'s analysis thoroughly convincing). And, as an editor, even if you believe that two or more articles are based on recycled data or summary statistics, how are you to know that the one in your journal is not the original ("good") one?

Fortunately (at least from the point of view of the error correction process) there is a much simpler approach to the problem at hand. It can be shown that almost all of the articles that were analysed by Bordewijk et al.—plus a few more that did not make it into their commentary—have very substantial statistical flaws at the level of each individual article. In my opinion, in most cases these errors would justify a rapid retraction based solely on the evidence that is to be found in each article's PDF file. There is no need for simulations or probability calculations; in the majority of cases, the numbers sitting there in the tables of results are demonstrably incorrect.


General description of the articles

As mentioned above, for this blog post I examined 46 articles from the Department of Obstetrics and Gynaecology at Mansoura University. Of these, 35 had already been analysed by Bordewijk et al., and the rest were included either at the suggestion of Ben Mol or after I searched for any other empirical studies that I could find in the Google Scholar profiles of Dr Badawy and Dr Abu Hashim. Seven of the 46 articles had neither Dr Badawy nor Dr Abu Hashim as co-authors, but for all seven of those Dr Tarek Shokeir was listed as a co-author (or, in one case, sole author).

These articles mostly describe RCTs of various interventions for conditions such as infertility, heavy menstrual bleeding, polycystic ovary syndrome, preterm labour, or endometriosis. Several of them have more than 100 citations according to Google Scholar. The studies seem to be well-powered, many with more than 100 participants in each group (e.g., for this one the authors claimed to have recruited 996 infertile women), and it is not hard to imagine that their findings may be affecting clinical practice around the world.

The typical article is relatively short, and contains a baseline table comparing the groups of patients (usually two), followed by one or sometimes more tables comparing the outcomes across those groups. These are usually expressed as simple unpaired comparisons of parameters (e.g., height, with mean and standard deviation reported for each group), or as tests of proportions (e.g., in the treatment group X% of N1 participants became pregnant, versus Y% of N2 participants in the control group). The statistics are therefore for the most part very simple; for example, there are no logistic regressions with covariates. This means that we can readily check most of the statistics from the tables themselves.


The t statistics

First up, I note that in about half of these articles, no t statistics at all are reported for the comparisons of continuous variables across groups. Sometimes we get just a p value. In other cases we are only told that individual comparisons (or all of the comparisons in a table, via a note at the end) were statistically significant or not; typically we are left to infer that that means p < 0.05. (In a few articles the authors reported using the Mann-Whitney U test when data were not normally distributed, but they do not generally indicate which variables are concerned by this in each case.)

In quite a few cases the errors in the implicit t statistics are visible from space, as in this example from 10.1111/j.1447-0756.2010.01383.x:

(This table has been truncated on the right in order to fit on the page.)



Have a look at the "Fasting glucose" numbers (fourth line from the bottom). The difference between the means is 5.4 (which means a minimum of 5.3 even after allowing for rounding) and just by approximating a weighted mean you can see that the pooled SD is going to be about 1.6, so this is a Cohen's d of around 3.3, which is never going to be non-significant at the 0.05 level. You don't have to carry the formula for the pooled standard error around in your head to know that with df = 136 the t statistic here is going to be huge, and indeed it is: the minimum t value is 17.31, the midpoint is 18.17, the maximum is 19.09, p = homeopathic. (Aside: physiologists might wonder about the homogeneity of the testosterone levels in the ovulation group, with an SD of just 0.01.)
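If you want to check numbers like these yourself, the calculation takes only a few lines of R. The inputs below are placeholders that I have chosen to be roughly consistent with the rounded figures discussed above (df = 136 implies about 69 patients per group); the exact t statistic obviously depends on the actual means, SDs, and group sizes in the table:

```r
m1 <- 95.0; sd1 <- 1.6; n1 <- 69   # hypothetical group 1 ("fasting glucose")
m2 <- 89.6; sd2 <- 1.7; n2 <- 69   # hypothetical group 2: difference of 5.4

sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d <- (m1 - m2) / sd_pooled                            # Cohen's d: about 3.3
t <- (m1 - m2) / (sd_pooled * sqrt(1/n1 + 1/n2))      # about 19: "huge"
2 * pt(abs(t), df = n1 + n2 - 2, lower.tail = FALSE)  # p = homeopathic
```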

When we do get to see t statistics, the majority are incorrect given the means, SDs, and sample sizes, even after allowing for rounding (but see also our RIVETS preprint for what happens if test statistics are derived with too much allowance for rounding). See this table from 10.1016/j.fertnstert.2008.06.013:


Several of the effect sizes here are huge; for example, the one on the first line is close to d = 4, for which the correct t statistic is not 4.4 but instead somewhere between 38.71 and 40.26. If you are more familiar with ANOVA, you can square those numbers to get the corresponding F statistic. (Spoiler: F = 1,500 is a lot.)

The overall results for the set of 46 articles are disastrous. In most of the articles the majority of the p values are incorrect, sometimes by a wide margin. Even in those articles where the authors did not explicitly report which test they used, leaving open the possibility that they might have used the Mann-Whitney U test, the discrepancies between the reported p values and those that I obtained from a t test were very often such that even the discovery that the Mann-Whitney test had been used would not be sufficient to explain them.

[Begin update 2021-10-28 19:57 UTC]
Brenton Wiernik asked whether it was possible that the authors had accidentally reported the variability of their means as the standard error of the mean (SEM) rather than the standard deviation. I must confess that when doing the analyses I did not consider this possibility, not least because all of the SDs for things like age (around 3) or height (around 5–6cm) seemed quite reasonable, although perhaps a bit on the low side of what one might expect. 

However, I did identify seven articles (10.3109/01443615.2010.497873, 10.1016/j.fertnstert.2008.04.065, 10.1016/j.fertnstert.2008.06.013, 10.1016/j.fertnstert.2007.08.034, 10.1016/s1472-6483(10)60148-4, 10.1016/j.fertnstert.2007.05.010, and 10.1016/j.fertnstert.2007.02.062) in which the authors claimed that their measure of variability (typically written as ± and a number after the mean) was indeed the SEM. But the sample sizes in these papers are such that the implied SDs would need to be huge. Across these seven papers, the implied SD for age would range from 29.50 to 62.39 years, and for height from 54.79 to 124.77 cm. I therefore stand by my interpretation that these values were intended by the authors to be standard deviations, although that then gives them another question to answer, namely why they claimed them to be SEMs when those numbers are patently absurd.
[End update 2021-10-28 19:57 UTC]


The Χ² (etc) statistics


In examining whether the tests of proportions had been reported correctly, I included only those articles (30 out of the total set of 46) that contained at least one exact numerical (i.e., not "NS" or "<0.001") p value from a Pearson chi-square test or Fisher's exact test of a 2x2 contingency table. If the authors also reported Χ² statistics and/or odds ratios, I also included those numbers. I then examined the extent to which these statistics matched the values that I calculated from the underlying data. When the subsample sizes were very small, I allowed the authors some more leeway, as the Pearson chi-square test does not always perform well in these cases.

As with the t test results (see previous section), the overall results revealed a large number of incorrect p values in almost every article for which I recalculated the tests of proportions.

Perhaps the most indisputable source of errors is situations in which what is effectively the same test is reported twice, with different chi-square statistics (if those are reported) and different p values, even though those values are necessarily identical. I counted 8 examples of this across 7 different articles. For example, consider this table from 10.3109/01443615.2010.508850:

You're either pregnant or you aren't. I don't make the rules.

After 6 months in the study, every participant either had, or had not, become pregnant. So the contingency table for the first outcome ("No pregnancy") is ((141, 114), (150, 107)) and for the second outcome ("Clinical pregnancy") it is ((114, 141), (107, 150)). Those will of course give exactly the same result, meaning that at least one of the Χ²/p-value pairs must be wrong. In fact the correct numbers are Χ²(1) = 0.49, p = 0.48, which means that neither the Χ² test statistics nor the p values in the first two lines of the table are even remotely valid. For that matter, neither (incorrect) p value in the table even matches its corresponding (incorrect) Χ² statistic; I will let you check this for yourself as an exercise.
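This one is easy to verify, since the counts are right there in the table; the sketch below assumes (as the numbers quoted above suggest) that the intended test was Pearson's Χ² without the Yates continuity correction:

```r
# the 2 x 2 table given above, with "No pregnancy" as the outcome
no_preg <- matrix(c(141, 114,
                    150, 107), nrow = 2, byrow = TRUE)
chisq.test(no_preg, correct = FALSE)          # X-squared ~ 0.49, df = 1, p ~ 0.48

# "Clinical pregnancy" is the same table with its columns swapped,
# so the test statistic and p value are necessarily identical
chisq.test(no_preg[, 2:1], correct = FALSE)
```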


The p values

There are a couple of basic things to keep in mind when reading tables of statistics:

  • A t statistic of 1.96 with 100 or more degrees of freedom gives a rounded two-tailed p value of 0.05 (although if you want it to be strictly less than 0.05000, you need a t statistic of 1.984 with 100 dfs).
  • For any given number of degrees of freedom, a larger t or Χ² statistic gives a smaller p value.
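Both of these points can be checked in a couple of lines of R:

```r
2 * pt(1.96, df = 100, lower.tail = FALSE)   # two-tailed p for t = 1.96 with df = 100: rounds to 0.05
qt(0.975, df = 100)                          # t needed for p < .05 (two-tailed) with df = 100: about 1.984
pchisq(c(1, 5, 10), df = 1, lower.tail = FALSE)   # for fixed df, a larger statistic always gives a smaller p
```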
With that in mind, let's look at this table (from 10.1016/j.fertnstert.2007.05.010), which I believe to be entirely typical of the articles under discussion here:

The good news is that the percentages and the Χ² statistics check out OK!


Neither of the above bits of very basic knowledge is respected here. First we have t(228) = 2.66, p = 0.1 (when the correct p value for that t is 0.008—although in any case the correct t statistic for the given means and SDs would be between 5.05 and 5.83). Second, between FSH (follicle-stimulating hormone) and LH (luteinizing hormone), the t statistic goes up and so does the p value (which is also clearly incorrect in both cases).

A number of the articles contain p values that are literally impossible (i.e., greater than 1.0; don't @ me to tell me about that time you did a Bonferroni correction by multiplying the p value instead of dividing the alpha). See 10.3109/01443615.2010.497873 (Table 1, "Parity", p = 1.13; see also "Other inconsistencies", below), 10.1016/j.fertnstert.2008.04.065 (Table 1, "Height", p = 1.01), and 10.1007/s00404-013-2866-0, which contains no fewer than four examples across its Tables 1 and 2:

1.12, 1.22, 1.32, 1.42: The impossible p values form a nice pattern.


The confidence intervals

A few of the articles have confidence intervals in the tables, perhaps added at the insistence of a reviewer or editor. But in most cases the point estimate falls outside the confidence interval. Sometimes this can become quite absurd, as in the following example (from 10.1016/j.fertnstert.2007.08.034). Those CI limits are ± 1.96 standard errors either side of... what exactly?

(Once you look beyond the CIs, there is a "bonus" waiting in the p value column here.)


Other inconsistencies

Within these 46 articles it is hard not to notice a considerable number of other inconsistencies, which make the reader wonder how much care and attention went into both the writing and review processes. These tables from 10.3109/01443615.2010.497873 provide a particularly egregious example, with the appearance of 41 and 42 extra patients in the respective groups between baseline and outcome. (As a bonus, we also have a p value of 1.13.)

(Some white space has been removed from this table.)


These results from 10.1016/j.ejogrb.2012.09.014 make no sense. The first comparison was apparently done using Fisher's exact test, the second with Pearson's Χ² test, and the third, well, your guess is as good as mine. But there is no reason to use the two different types of test here, and even less reason to use Fisher's test for the larger case numbers and Pearson's for the smaller ones. (The p values are all incorrect, and would be even if the other test were to have been used for every variable.)

(Some white space and some results that are not relevant to the point being illustrated have been removed from this table.)


Finally, fans of GRIM might also be interested to learn that two articles show signs of possible inconsistencies in their reported means:
10.1016/j.rbmo.2014.03.011, "Age" (both groups) and "Infertility" (control group)
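For readers who would like to run this kind of check themselves, the logic of GRIM is simply that the mean of n integer-valued observations, reported to a given number of decimal places, can only take certain values. A minimal sketch (the function and the example numbers are mine, not taken from the articles in question):

```r
# Can `reported_mean`, as printed to `dp` decimal places, be the mean of n integers?
grim_consistent <- function(reported_mean, n, dp = 2) {
  candidate_sums  <- floor(reported_mean * n) + (0:1)   # the only totals that could round this way
  candidate_means <- round(candidate_sums / n, dp)
  round(reported_mean, dp) %in% candidate_means
}

grim_consistent(5.19, n = 28)   # FALSE: no sum of 28 integers gives a mean that rounds to 5.19
grim_consistent(5.18, n = 28)   # TRUE: 145 / 28 = 5.1786, which rounds to 5.18
```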


Conclusion

The results of these analyses seem to indicate that something has gone very badly wrong in the writing, reviewing, and publication of these articles. Even though I tried to give the published numbers the benefit of the doubt as far as possible, I estimate that across these 46 articles, 346 (64%) of the 542 parametric tests (unpaired t tests, or, occasionally, ANOVA) and 151 (61%) of the 247 contingency table tests (Pearson's Χ² or Fisher's exact test) that I was able to check were incorrectly reported. I don't think that anybody should be relying on the conclusions of these articles as a guide to practice, and I suspect that the only solution for most of them will be retraction. (As already mentioned, five have already been retracted following the publication of Bordewijk et al.'s commentary.)

I have a few aims in writing this post.

First, I want to do whatever I can to help get these misleading (at best) papers retracted from the medical literature, where they would seem to have considerable potential to do serious harm to the health of women, especially those who are pregnant or trying to overcome infertility.

Second, I aim to show some of the techniques that can be used to detect obvious errors in published articles (or in manuscripts that you might be reviewing).

Third, and the most important reason for doing all this work (it took a lot of hours to do these analyses, as you will see if you download the Excel file!), is to draw attention to the utter failure of peer review that was required in order for most of these articles to get published. They appeared in 13 different journals, none of which would appear to correspond to most people's idea of a "predatory" outlet. It is very tempting to imagine that nobody—editors, reviewers, Dr Badawy's thesis committee at the University of Utrecht, or readers of the journals (until Ben Mol and Esmée Bordewijk came along)—even so much as glanced at the tables of results in these articles, given that they almost all contain multiple impossible numbers.

It is true that the majority of these articles are more than 10 years old, but I wonder how much has changed in the publication processes of medical journals since then. The reality of scientific peer review seems to be that, to a first approximation, nobody ever checks any of the numbers. I find that deeply worrying.


Supporting documents

The majority of the analyses underlying this post have been done with Microsoft Excel 2003. I have some R code that can do the same thing, but it seemed to me to make more sense to use Excel as the process of copying and pasting numbers from the tables in the articles was a lot more reliable, requiring only a text editor to replace the column separators with tab characters. I used my R code to compute the test statistics in a couple of cases where there were more than two groups and so I had to use the rpsychi package to calculate the results of a one-way ANOVA.

In my Excel file, each unpaired t test is performed on a separate line. The user enters the mean, standard deviation, and sample size for each of the two conditions, plus an indication of the rounding precision (i.e., the number of decimal places) for the means and SDs separately. The spreadsheet then calculates (using formulas that you can find in columns whose width I have in most cases reduced to zero) the minimum and maximum possible (i.e., before rounding) means and SDs, and from those it determines the minimum, notional (i.e., assuming that the rounded input values are exact), and maximum t statistics and the corresponding p values. It then highlights those t statistics (if available) and p values (or "significant/not-significant" claims) from the article that are not compatible with any point in the possible range of values. That is, at all times, I give the maximum benefit of the doubt to the authors. (Similar considerations apply, mutatis mutandis, to the table of chi-square tests in the same Excel file.)
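For readers who prefer R to Excel, here is a minimal sketch of the same idea, using my own function and argument names; for simplicity it assumes that the first mean is the larger one and that the means and SDs of both groups are rounded to the same number of decimal places:

```r
t_range <- function(m1, sd1, n1, m2, sd2, n2, dp_mean = 1, dp_sd = 1) {
  h_m  <- 0.5 * 10^(-dp_mean)   # half a unit in the last reported digit of the means
  h_sd <- 0.5 * 10^(-dp_sd)     # ... and of the SDs
  t_stat <- function(a, s1, b, s2) {
    sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
    (a - b) / (sp * sqrt(1 / n1 + 1 / n2))
  }
  t <- c(min      = t_stat(m1 - h_m, sd1 + h_sd, m2 + h_m, sd2 + h_sd),  # closest means, largest SDs
         notional = t_stat(m1,       sd1,        m2,       sd2),         # rounded values taken at face value
         max      = t_stat(m1 + h_m, sd1 - h_sd, m2 - h_m, sd2 - h_sd))  # furthest means, smallest SDs
  rbind(t = t, p = 2 * pt(abs(t), df = n1 + n2 - 2, lower.tail = FALSE))
}

t_range(28.1, 3.2, 70, 27.4, 3.4, 68)   # made-up summary statistics, for illustration
```

A reported t statistic or p value that falls outside the min-max range cannot be explained by rounding alone.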

The documents (an Excel file for the main analyses and some R code for the bits that I couldn't work out how to do easily in Excel) are available here. The article PDFs are all copyrighted and I cannot share them, but if you do not have institutional access then there is always the site whose name rhymes with Dry Club.


Appendix: List of examined articles

Articles that have not been retracted and have no expression of concern

Badawy et al. (2009). Gonadotropin-releasing hormone agonists for prevention of chemotherapy-induced ovarian damage: Prospective randomized study. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.12.044

Badawy et al. (2007). Induction of ovulation in idiopathic premature ovarian failure: A randomized double-blind trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60711-0

Badawy et al. (2010). Clomiphene citrate or aromatase inhibitors combined with gonadotropins for superovulation in women undergoing intrauterine insemination: A prospective randomised trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.497873

Badawy et al. (2009). Ultrasound-guided transvaginal ovarian needle drilling (UTND) for treatment of polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.044

Badawy et al. (2008). Low-molecular weight heparin in patients with recurrent early miscarriages of unknown aetiology. Journal of Obstetrics and Gynaecology. https://doi.org/10.1080/01443610802042688

Badawy et al. (2010). Laparoscopy--or not--for management of unexplained infertility. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.508850

Badawy et al. (2007). Plasma homocysteine and polycystic ovary syndrome: The missed link. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2006.10.015

Badawy et al. (2008). Extending clomiphene treatment in clomiphene-resistant women with PCOS: A randomized controlled trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60148-4

Badawy et al. (2006). Clomiphene citrate plus N-acetyl cysteine versus clomiphene citrate for augmenting ovulation in the management of unexplained infertility: A randomized double-blind controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2006.02.097

Badawy et al. (2007). Randomized controlled trial of three doses of letrozole for ovulation induction in patients with unexplained infertility. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)61046-2

Fawzy et al. (2007). Treatment options and pregnancy outcome in women with idiopathic recurrent miscarriage: A randomized placebo-controlled study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-007-0527-x

Gibreel et al. (2012). Endometrial scratching to improve pregnancy rate in couples with unexplained subfertility: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2012.02016.x

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus highly purified FSH for ovulation induction in clomiphene-resistant PCOS women: A randomised controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2010.488771

Abu Hashim et al. (2010). Letrozole versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-010-1566-2

Abu Hashim et al. (2011). Laparoscopic ovarian diathermy after clomiphene failure in polycystic ovary syndrome: is it worthwhile? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1983-x

Abu Hashim et al. (2012). Contraceptive vaginal ring treatment of heavy menstrual bleeding: A randomized controlled trial with norethisterone. Contraception. https://doi.org/10.1016/j.contraception.2011.07.012

Abu Hashim et al. (2010). N-acetyl cysteine plus clomiphene citrate versus metformin and clomiphene citrate in treatment of clomiphene-resistant polycystic ovary syndrome: A randomized controlled trial. Journal of Women's Health. https://doi.org/10.1089/jwh.2009.1920

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2010.01383.x

Abu Hashim et al. (2011). Minimal stimulation or clomiphene citrate as first-line therapy in women with polycystic ovary syndrome: A randomized controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2011.589924

Abu Hashim et al. (2011). Does laparoscopic ovarian diathermy change clomiphene-resistant PCOS into clomiphene-sensitive? Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1931-9

Abu Hashim et al. (2013). LNG-IUS treatment of non-atypical endometrial hyperplasia in perimenopausal women: A randomized controlled trial. Journal of Gynecologic Oncology. https://doi.org/10.3802/jgo.2013.24.2.128

Marzouk et al. (2014). Lavender-thymol as a new topical aromatherapy preparation for episiotomy: A randomised clinical trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2014.970522

Ragab et al. (2013). Does immediate postpartum curettage of the endometrium accelerate recovery from preeclampsia-eclampsia? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-013-2866-0

El Refaeey et al. (2014). Combined coenzyme Q10 and clomiphene citrate for ovulation induction in clomiphene-citrate-resistant polycystic ovary syndrome. Reproductive Biomedicine Online. https://doi.org/10.1016/j.rbmo.2014.03.011

Seleem et al. (2014). Superoxide dismutase in polycystic ovary syndrome patients undergoing intracytoplasmic sperm injection. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s10815-014-0190-7

Shokeir (2006). Tamoxifen citrate for women with unexplained infertility. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-006-0181-8 *

Shokeir et al. (2016). Hysteroscopic-guided local endometrial injury does not improve natural cycle pregnancy rate in women with unexplained infertility: Randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/jog.13077

Shokeir et al. (2009). The efficacy of Implanon for the treatment of chronic pelvic pain associated with pelvic congestion: 1-year randomized controlled pilot study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-009-0951-1 *

Shokeir & Mousa (2015). A randomized, placebo-controlled, double-blind study of hysteroscopic-guided pertubal diluted bupivacaine infusion for endometriosis-associated chronic pelvic pain. International Journal of Gynecology & Obstetrics. https://doi.org/10.1016/j.ijgo.2015.03.043

Articles that are subject to an editorial Expression of Concern

Badawy et al. (2012). Aromatase inhibitors or gonadotropin-releasing hormone agonists for the management of uterine adenomyosis: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2012.01350.x

Badawy et al. (2009). Extended letrozole therapy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.04.065

Badawy et al. (2008). Luteal phase clomiphene citrate for ovulation induction in women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.016

Badawy et al. (2007). N-Acetyl cysteine and clomiphene citrate for induction of ovulation in polycystic ovary syndrome: A cross-over trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340601090337

Badawy et al. (2009). Clomiphene citrate or anastrozole for ovulation induction in women with polycystic ovary syndrome? A prospective controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.08.034

Badawy et al. (2009). Pregnancy outcome after ovulation induction with aromatase inhibitors or clomiphene citrate in unexplained infertility. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340802638199

Abu Hashim et al. (2011). Intrauterine insemination versus timed intercourse with clomiphene citrate in polycystic ovary syndrome: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2010.01063.x

Abu Hashim et al. (2012). Randomized comparison of superovulation with letrozole versus clomiphene citrate in an IUI program for women with recently surgically treated minimal to mild endometriosis. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2011.01346.x

Shokeir et al. (2011). An RCT: use of oxytocin drip during hysteroscopic endometrial resection and its effect on operative blood loss and glycine deficit. Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2011.03.015

Shokeir et al. (2013). Does adjuvant long-acting gestagen therapy improve the outcome of hysteroscopic endometrial resection in women of low-resource settings with heavy menstrual bleeding? Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2012.11.006

Badawy & Gibreal (2011). Clomiphene citrate versus tamoxifen for ovulation induction in women with PCOS: A prospective randomized trial. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2011.07.015

Shokeir et al. (2013). Reducing blood loss at abdominal myomectomy with preoperative use of dinoprostone intravaginal suppository: A randomized placebo-controlled pilot study. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2012.09.014

Articles that have been retracted

Badawy et al. (2009). Clomiphene citrate or aromatase inhibitors for superovulation in women with unexplained infertility undergoing intrauterine insemination: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.06.013

Badawy et al. (2009). Clomiphene citrate or letrozole for ovulation induction in women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.02.062

Badawy et al. (2008). Anastrozole or letrozole for ovulation induction in clomiphene-resistant women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.05.010

Abu Hashim et al. (2010). Letrozole versus combined metformin and clomiphene citrate for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2009.07.985

El-Refaie et al. (2015). Vaginal progesterone for prevention of preterm labor in asymptomatic twin pregnancies with sonographic short cervix: A randomized clinical trial of efficacy and safety. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-015-3767-1

Note

The two articles marked with a * above are the only ones in which I did not identify any problems; in each of these articles, every statistical test is marked as either "S" (significant) or "NS" (not significant), and none of the calculations that I performed resulted in the opposite verdict for any test.