16 May 2020

The perils of improvising with linear regression: Stedman et al. (in press)

This article has been getting a lot of coverage of various kinds in the last few days, including the regional, UK national, and international news media:

Stedman, M., Davies, M., Lunt, M., Verma, A., Anderson, S. G., & Heald, A. H. (in press). A phased approach to unlocking during the COVID-19 pandemic – Lessons from trend analysis. International Journal of Clinical Practice. https://doi.org/10.1111/ijcp.13528

There doesn't seem to be a typeset version from the journal yet, but you can read the final draft version online here and also download it as a PDF file.

The basic premise of the article is that, according to the authors' model, COVID-19 infections are far more widespread in the community in the United Kingdom(*) than anyone seems to think. Their reasoning works in three stages. First, they built a linear model of the spread of the disease, one of whose predictors was the currently reported number of total cases (i.e., the official number of people who have tested positive for COVID-19). Second, they extrapolated that model to a situation in which the entire population was infected, assuming that the spread continues to be entirely linear. Third, they used the slope of their line to estimate what the official reported number of cases would be at that point. They concluded their model shows that the true number of cases in the population is 150 times larger than the number of positive tests that have been carried out, so that on the day when their data were collected (24 April 2020) 26.8% of the population were already infected.
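For concreteness, the headline 150x figure can be reconstructed from the fitted equation the article reports on p. 11 (RADIR = 1.06 - 0.16 x Current Total Cases/1,000). What follows is my reading of the extrapolation, not code from the paper: assume the fitted RADIR falls to zero exactly when the whole population is infected, and ask how many reported cases per 1,000 people the line predicts at that point.

```python
# A reconstruction of the arithmetic (my reading, not the authors' code).
# Fitted line reported on p. 11: RADIR = 1.06 - 0.16 * reported_cases_per_1000
intercept, slope = 1.06, 0.16

# Reported cases per 1,000 population at which the fitted RADIR hits zero,
# i.e. the point the model treats as the whole population being infected.
reported_per_1000_at_saturation = intercept / slope   # ~6.6

# If 1,000 per 1,000 people are truly infected at that point, the implied
# ratio of true to reported cases is:
multiplier = 1000 / reported_per_1000_at_saturation   # ~151, the paper's "150 times"
```

Under that reading, the entire 150x claim rests on nothing more than the ratio of the fitted intercept to the fitted slope, which makes the stability of those two numbers rather important.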

The above figures are from the Results section of the paper, on p. 11 of the final draft PDF. However, the Abstract contains different numbers, which seem to be based on data from 19 April 2020. The Abstract asserts that the true number of cases in the population may be 237 (versus 150) times the reported number, and that the percentage of the population who had been infected might have been 29% (versus 26.8% several days later). Aside from the question of why the Abstract includes different principal results from the Results section, it would appear to be something of a problem for the authors' assumptions of (ongoing) linearity in the relation between the spread of the disease and the number of reported cases if the slope of their model changed by roughly a third over five days.

But it seems to me that, apart from the rather tenuous assumptions of linearity and the validity of extrapolating a considerable way beyond the range of the data (which, to be fair, they mention in their Limitations paragraph), there is an even more fundamental problem with how the authors have used linear regression here. Their regression model contained at least nine covariates(**), and we are told that "The stepwise(***) regression of the local UTLA factors to RADIR showed that only one factor total reported cases/1,000 population [sic] was significantly linked" (p. 11). I take this to mean that, if the authors had reported the regression output in a table, this predictor would be the only one whose absolute value was at least twice its standard error. (The article is remarkably short on numerical detail, with neither a table of regression output nor a table of descriptives and correlations of the variables. Indeed, there are many points in the article where a couple of minutes spent on attention to detail would have greatly improved its quality, even in the context of the understandable desire to communicate results rapidly during a pandemic.)

Having established that only one predictor in this ten-predictor regression was statistically significant (in the sense of a 95-year-old throwaway remark by R. A. Fisher), the authors then proceeded to do something remarkable. Remember, they had built this model:

Y = B0 + B1X1 + B2X2 + B3X3 + ... + B10X10 + Error

(with Error apparently representing 78% of the variance, cf. line 3 of p. 11). But they then dropped ten of those terms (nine regression coefficients multiplied by the values of the predictors, plus the error term) to come up with this model (p. 11):

RADIR = 1.06 - 0.16 x Current Total Cases/1,000

What seems to have happened here is that the authors in effect decided to set the regression coefficients B2 through B10 to zero, apparently because their respective values were less than twice their standard errors and so "didn't count" somehow. However, they retained the intercept (B0) and the coefficient associated with their main variable of interest (B1) from the 10-variable regression, as if the presence of the nine covariates had had no effect on the calculation of these values. But of course, those covariates had had an effect on both the estimation of the intercept and the coefficient of the first variable. That was precisely what they were included in the regression for. If the authors had wanted to make a model with just one predictor (the number of current total cases), they could have done so quite simply by running a one-variable regression. You can't just run a multiple regression and keep the coefficients (with or without the intercept) that you think are important while throwing away the others.
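A toy example (my made-up numbers, nothing to do with the paper's data) shows why this matters: when predictors are correlated, the coefficient on the first predictor from a multiple regression is generally not the same as the coefficient you would get from a one-variable regression, so you cannot lift (B0, B1) out of the larger model and present them as a standalone line.

```python
import numpy as np

# Toy data: x2 is correlated with x1, and y depends on both.
# By construction, y = 1 + 2*x1 + 3*x2 exactly.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0, 2.0])
y = 1 + 2 * x1 + 3 * x2

# Multiple regression: intercept, b1, b2 estimated jointly.
X_full = np.column_stack([np.ones_like(x1), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)   # ~[1.0, 2.0, 3.0]

# One-variable regression on x1 alone: a different intercept AND slope,
# because x1 now has to absorb the variance that x2 explained.
X_one = np.column_stack([np.ones_like(x1), x1])
b_one, *_ = np.linalg.lstsq(X_one, y, rcond=None)     # ~[1.3, 3.8]

# Keeping (1.0, 2.0) from the multiple regression and silently dropping
# the x2 term gives neither model; it is just a different, unfitted line.
```

The two fitted lines disagree on both intercept and slope, even in this tiny noiseless example; with ten correlated predictors the discrepancy could go in any direction and be of any size.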

This seems to me to be a rather severe misinterpretation of what a regression model is and how it works. There are many other things that could be questioned about this study(****), and indeed several people are already doing precisely that, but this seems to me to be a very fundamental problem with the article, and something that the reviewers really ought to have picked up. The first two authors appear to be management consultants whose qualifications to conduct this sort of analysis are unclear, but the third author's faculty page suggests that he knows his stuff when it comes to statistics, so I'm not sure how this was allowed to happen.

Stedman et al. end with this sentence: "The manuscript is an honest, accurate, and transparent account of the study being reported. No important aspects of the study have been omitted." This is admirable, and I take it to mean that the authors did not in fact run a one-predictor regression to estimate the effect of their main IV of interest on their DV before they decided to run a stepwise regression with nine covariates. However, I suggest that it might be useful if they were to run that one-predictor regression now, and report the results along with those of the multiple regression (cf. Simmons, Nelson, & Simonsohn, 2011, p. 1362, Table 2, point 6). When they do that, they might also consider incorporating the latest testing data and seeing whether the slope of their regression has changed, because since 24 April the number of cases in the UK has more than doubled (240,161 at the moment I am writing this), suggesting that between 54% and 84% of the population has by now become infected, depending on whether we take the numbers from p. 11 of the article or those from the Abstract.

[[ Update 2020-05-18 00:15 UTC: There is a preprint of this paper, available here. It contains the same basic model, which is estimated using data from 11 days earlier than the accepted manuscript. In the preprint, the regression equation (p. 7) is:

RADIR = 1.20 - 0.26 x Current Total Cases/1,000

In other words, between the submission date of the preprint and the preparation date of the final manuscript, the slope of the regression line --- which the model assumes would be constant until everyone was infected --- changed from -0.26 to -0.16. And yet the authors did not apparently think that this was sufficient reason to question the idea that the progress of the disease would continue to match their linear model, despite direct evidence that it had failed to do so over the previous 11 days. This is quite astonishing. ]]
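If one reads the model as extrapolating to the point where RADIR reaches zero (my reconstruction of the method; the paper does not spell out the arithmetic), plugging the two fitted lines into the same calculation shows just how fragile the headline multiplier is:

```python
# My reconstruction: extrapolate each fitted line to RADIR = 0 and convert
# the reported-cases rate at that point into a true-to-reported multiplier.
def implied_multiplier(intercept, slope):
    reported_per_1000_at_zero = intercept / slope
    return 1000 / reported_per_1000_at_zero

preprint_m = implied_multiplier(1.20, 0.26)  # preprint equation, ~217
final_m = implied_multiplier(1.06, 0.16)     # accepted manuscript, ~151
```

Eleven more days of data move the implied multiplier by roughly 30%, which sits uneasily with a method whose central assumption is that the fitted line stays fixed all the way to full population infection.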

(*) Whether the authors claim that their model applies to the UK or just to England is not entirely clear, as both terms appear to be used more or less interchangeably. They use a figure of 60 million as the population of England, although the Office for National Statistics reports figures of 56.3 million for England and 66.8 million for the UK in mid 2019.

(**) I wrote here that there were nine, but one of them is ethnicity, which would typically have been coded as a series of categories, each of which would have functioned as a separate predictor in the regression. But maybe they used some kind of shorthand such as "Percentage who did/didn't identify in the 'White British' category", so I'll continue to assume that there were nine covariates and hence 10 predictors in total.
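For readers unfamiliar with why a categorical variable such as ethnicity inflates the predictor count, here is a minimal sketch of dummy coding (illustrative category labels, not the paper's actual coding scheme): one categorical covariate with k categories becomes k-1 separate indicator predictors in the regression.

```python
def dummy_code(values, reference):
    """Code a categorical variable as 0/1 indicator columns, one per
    non-reference category; each column is a separate predictor."""
    categories = [c for c in sorted(set(values)) if c != reference]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical data: four areas, three ethnic-group labels.
cats, cols = dummy_code(
    ["White British", "Asian", "Black", "White British"],
    reference="White British",
)
# cats == ['Asian', 'Black'] -- two extra predictors from one covariate.
```

So if ethnicity entered the model as a full set of categories, the "nine covariates" would actually correspond to more than nine columns in the design matrix, which is why the shorthand single-percentage coding is the charitable assumption.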

(***) Stepwise regression is generally not considered a good idea these days. See for example here. Thanks to Stuart Ritchie for this tip and for reading through a draft of this post.

(****) Looking at Figure 4, it occurs to me that a few data points at the top left might well be lifting the left-hand side of the regression line up to some extent, but that's all moot until we know more about the single-variable regression. Also, there is no confidence interval or any other measure of the uncertainty --- even assuming that the model is perfectly linear --- of the estimated reported case rate when the infection rate drops to zero.


  1. "suggesting that between 54% and 84% of the population has by now become infected, depending on whether we take the numbers from p. 11 of the article or those from the Abstract."

    In another week, the fraction of the English / UK population to have been infected with the virus will exceed 100% and there will be a Divide-by-Zero error.

  2. I am still trying to get my mind around a 10-variable linear regression, let alone one that morphs into a 1-variable regression.

    It may have been that I was tired when I was reading the paper, but I found it very hard to tease out what the authors were doing.

    1. In my experience, the articles with the worst deficiencies are also the ones that are hardest to understand. There may be a common cause of these two observed phenomena.

  3. On 20th May it was announced that antibody testing showed that about 5% of people in the UK have antibodies to SARS-CoV-2. (Or 6.5% if you read the 5% figure to refer only to areas outside of London, not the UK average)

    The antibody testing program was press released on 23rd April, i.e. after the 19th April end of data inclusion in this study.

    So the true figure during the period modelled in this paper was not 29% or 26.8%, it was - at most - 6.5%.