Nick Brown's blog: A bug and a dilemma

A few months ago, I discovered that the SAS statistical software package, which is used worldwide by universities and other large organisations to analyse their data, contained (until quite recently) a bug that could result in information that the user thought they had successfully deleted, and that was no longer visible from within the application itself, still being present in the saved data file. This could lead to personally identifiable information (PII) about study participants being revealed, alongside whatever other data might have been collected from these participants, which, depending on the study, could potentially be extremely sensitive. I found this entirely by chance when looking at an SAS data file to try to work out why some numbers weren't coming out as expected, for which it would have been useful to know whether numbers are stored in ASCII or binary. (It turned out that they are stored in binary.)

Here's how this bug works: Suppose that as a researcher you have run a study on 80 named participants, and you now have a dataset containing their names, study ID numbers (for example, if the study code within your organisation is XYZ this code might be XYZ100, XYZ101, etc, up to XYZ179), and other relevant variables from the study. One day you decide to make a version of the dataset that can be shared without the participants being identifiable, either because you have to deposit this in an archive when you submit the study to a journal, or because somebody has read the article and asked for your data. You could share this in .CSV file format, and indeed that would normally be considered best practice for interoperability; but there may be good reasons to share it in SAS's native binary data file format with a .sas7bdat extension, which can in any case be opened in R (using a package named "sas7bdat", among others) or in SPSS.

So you open your file called participants-final.sas7bdat in the SAS data editor and delete the column with the participants' names (and any other PII, such as IP addresses, or perhaps dates of birth if those are not needed to establish the participants' ages, etc), then save it as deidentified-participants-final.sas7bdat, and share the latter file. But what you don't know is that, because of this bug, in some unknown percentage of cases the text of most of the names is still sitting in the sas7bdat binary data file, close to the alphanumeric participant IDs. That is, if the bug has struck, someone who opens the "deidentified" file in a plain text editor (which could be as simple as Notepad on Windows) might see the names and IDs among the binary gloop, as shown in this image.

I am pretty sure these two people did not take part in this study.

This screenshot shows an actual extract from a data file that I found, with only the names and the study ID codes replaced with those of others selected from the phone book. The full names of about two-thirds of the participants in this study were readable. Of course, you can't read the binary data directly in a text editor, and decoding it by hand would take a lot of work, but given the participant IDs (PRZ045 for Trump, PRZ046 for Biden) you can simply open the "anonymised" data file in SAS and find out all you want about those two people from within the application.

Even worse, though, is the fact that unless the participant's name is extremely common, when combined with knowledge of approximately where and when the study was conducted it might very well let someone identify them with a high degree of confidence for relatively little effort. And by opening the file in SAS—for example, with the free service SAS OnDemand for Academics, or in SPSS or R as previously mentioned—and looking at the data that was intended to be shared, we will be able to see that our newly-identified participant is 1.73 metres tall, or takes warfarin, or is HIV-positive.

(A number of Microsoft products, including Word and Excel, used to have a bug like this, many versions ago. When you chose "Save" rather than "Save As", it typically would not physically overwrite on the disk any text that you had deleted, perhaps because the code had originally been written to minimise writing operations with diskettes, which are slow.)

I have been told by SAS support (see screenshot below) that this bug was fixed in version 9.4M4 of the software, which was released on 16 November 2016. The support agent told me that the problem was known to be present in version 9.4M3, which was released on 14 July 2015; however, I do not know whether the problem also existed in previous versions. I think it would be prudent to assume that any file in .sas7bdat format created by a version of SAS prior to 9.4M4 may have this issue. Neither the existence of the problem nor the fact that it had been fixed was documented by SAS in the release notes for version 9.4M4; equally, however, the support representative did not tell me that the problem is regarded as top secret or subject to any sort of embargo.

(The identity of the organisation that shared the files in which I found the bug has been redacted here.)

SAS is a complex software package and it will generally take a while for large organisations to migrate to a new version. Probably by now most installations have been upgraded to 9.4M4 or later, but quite a few sites might have been using the previous version containing this bug until quite recently, and as I already mentioned, it's not clear how old the bug is (i.e., at what point it was introduced to the software). So it could have been around for many years prior to being discovered, and it could well have still been around for two or three years after the fix was released at many sites.

Now, this discovery caused me a dilemma. I worried that, if I were to go public with this bug, this might start a race between people who have already shared their datasets that were made with a version prior to 9.4M4 trying to replace or recall their files, and Bad People™ trying to find material online to exploit. That is, to reveal the existence of the problem might increase the risk of data leaking out. On the other hand, it's also possible that the bad people are already aware of the problem and are actively looking for that material, in which case every day that passes without the problem becoming public knowledge increases the risk, and going public would be the start of the solution.

Note that this is different from the typical "white hat"/"bug bounty" scenario, in which the Good People™ who find a vulnerability tell the software company about the bug and get paid to remain silent until a reasonable amount of time has passed to patch the systems, after which they are free to reveal the existence of the problem. In those cases, patching the software fixes the problem immediately, because the extent of the vulnerability is limited to the software itself. But here, the vulnerability is in the data files that were not anonymised as intended. There is no way to patch anything to stop those files from being read, because that only needs a text editor. The only remedy is for the files to be deleted from, or replaced in, repositories as their authors or guardians become aware of the issue.

In the original case where I discovered this issue, I reported it to the owner of the dataset and he arranged for the offending file to be recalled from the repository where he had placed it, namely the Open Science Framework. (I also gave a heads-up to the Executive Director of the Center for Open Science, Brian Nosek, at that time.) The dataset owner also reported the problem to their management, as they thought (and I completely agree) that dealing with this sort of issue is beyond the pay grade of any individual principal investigator. I do not know what has happened since, nor do I think it's really my business. I would argue that SAS ought to have done something more about this than just sneaking out a fix without telling anybody; but perhaps they, too, looked at the trade-off described above and decided to keep quiet on that basis, rather than merely avoiding embarrassment.

I have spent several months wondering what to do about this knowledge. In the end, I decided that (a) there probably aren't too many corrupt files out there, and (b) there probably aren't too many Bad People™ who are likely to go hunting for sensitive data this way, because it just doesn't seem like a very productive way of being a Bad Person. So I am going public today, in the hope that the practical consequences of revealing the existence of this problem are unlikely to be major, and that giving people the chance to correct any SAS data files that they might have made public will be, on balance, a net win for the Good People. (For what it's worth, I asked two professors of ethics about this, one of them a specialist in data-related issues, and they both said "Ouch. Tough call. I don't know. Do what you think is best".)

Now, what does this discovery mean? Well, if you use SAS and have made your data available using the .sas7bdat file format, you might want to have a look in the data files with a text editor and check that there is nothing in there that you wouldn't expect. But even if you don't use SAS, there may still be a couple of lessons for you from this incident, because (a) the fact that this particular software bug is fixed doesn't mean there aren't others, and (b) everyone makes mistakes.
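
For SAS users, the "look in the file with a text editor" check can also be scripted, which is less tiring than scrolling through binary gloop by eye. What follows is only a minimal sketch in Python, not an official tool: the function names are my own invention, the sample bytes are a made-up stand-in rather than a real .sas7bdat file, and it assumes that any leftover text is stored as plain single-byte ASCII-compatible characters.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least min_len printable ASCII characters in data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def find_known_names(data: bytes, names: list[str]) -> list[str]:
    """Return the names whose bytes still occur somewhere in data."""
    # Assumption: names are stored in an ASCII-compatible encoding.
    return [name for name in names if name.encode("utf-8") in data]


if __name__ == "__main__":
    # Hypothetical stand-in for the raw bytes of a "deidentified" data file.
    sample = b"\x00\x01XYZ100\x00\x00Alice Example\x00\xff\x10\x02XYZ101\x00\x07"
    print(printable_runs(sample))
    print(find_known_names(sample, ["Alice Example", "Bob Sample"]))
```

To run this against a real file you would read its contents with something like `Path("deidentified-participants-final.sas7bdat").read_bytes()` and pass in the list of names from your original dataset. A clean result is reassuring but not proof: if the file is compressed, or the text is in a multi-byte encoding, leftover names may not show up in this kind of search.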

First, consider always using .CSV files to share your data, if there is no compelling reason not to do so. The other day I had to download a two-year-old .RData file from OSF and it contained data structures that were already partly obsolete when read by newer versions of the package that had created them; I had to hunt around online for the solution, and that might not work at all at some future point. When I had sorted that out I saved the resulting data in a .CSV file, which turned out to be nearly 20% smaller than the .RData file anyway.

Second, try to keep all PII out of the dataset altogether. Build a separate file or files that connects each participant's study ID number to their name and any other information that is not going to be an analysed variable. If your study requires you to generate a personalised report for the participants that includes their name then this might represent a little extra effort, but generally this approach will greatly reduce the chances of a leak of PII. (I suspect that for every participant whose PII is revealed by bugs, several more are the victims of either data theft or simply failure on the part of the researchers to delete the PII before sharing their data.)
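
As a rough illustration of this separation, here is a small Python sketch that splits one set of records into a key file (study ID plus PII) and an analysis file (study ID plus everything else). The column names, the example participants, and the helper functions are all hypothetical; the point is only that the names never enter the file you analyse and share.

```python
import csv
import io


def split_pii(rows, id_field, pii_fields):
    """Split rows into (key_rows, analysis_rows).

    key_rows keep the study ID and the PII columns; analysis_rows keep
    the study ID and every non-PII column.
    """
    pii = set(pii_fields)
    key_rows = [
        {id_field: row[id_field], **{field: row[field] for field in pii_fields}}
        for row in rows
    ]
    analysis_rows = [
        {key: value for key, value in row.items() if key not in pii}
        for row in rows
    ]
    return key_rows, analysis_rows


def to_csv_text(rows):
    """Render a non-empty list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up participants; the column names are assumptions for illustration.
    rows = [
        {"study_id": "XYZ100", "name": "Alice Example", "height_cm": "173"},
        {"study_id": "XYZ101", "name": "Bob Sample", "height_cm": "168"},
    ]
    key_rows, analysis_rows = split_pii(rows, "study_id", ["name"])
    print(to_csv_text(key_rows))       # keep this one locked away
    print(to_csv_text(analysis_rows))  # this is the only file to analyse and share
```

In practice the split would ideally happen at data collection time, so that a file containing both names and study variables never exists in the first place.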

(Thanks to Marcus Munafò and Brian Nosek for valuable discussions about an earlier draft of this post.)

7 comments:

Data_Geek, 2 November 2021 at 19:11
There are several hundred issues in that fix list, do you have the actual Issue # specifically? I wonder if this is related to files that were created with versioning/auditing enabled which would make some sense then to have this type of data included.
Anonymous, 3 November 2021 at 00:01
Can you clarify what you mean by "SAS data editor"? Does this behavior happen in a standard data step?
MaryKaye, 4 November 2021 at 17:44
The cancer lab I worked at transformed all patient names to ID codes right at the start, and never used patient names in *any* data documents. The key that connected patient names to IDs was kept locked up in a cabinet where only 2 people could access it.

This proved very helpful when I joined the group; I could make a clear case that I had no access to patient identifying information, which exempted me from a lot of onerous requirements. I strongly recommend it to biomedical researchers. Your bug is yet another reason why.

31 October 2021
