Assessing the real risks of re-identifying patient data

by Andrew Oram

This was originally published on O’Reilly Media’s Strata blog, September 24, 2012.

Daniel Barth-Jones, an epidemiologist and expert on health data privacy, has published an examination of the sensitive issue of re-identifying patients. This is worthwhile reading for anyone interested in the use of patient data for improving health care. He has blogged about his key findings, but I suggest reading his full paper for the recommendations he makes.

I discussed this article and exchanged a lot of email with Barth-Jones. Recognizing that his academic language and cautious reasoning may make it difficult for some readers to understand the implications of his work, I’ll explain the conclusions he wants to stress here.

Before HIPAA, common health data practices lacked systematic protections against re-identification, which could lead to discrimination, embarrassment, or even identity theft. But the reforms introduced by HIPAA (for example, the list of 18 types of Protected Health Information that must be removed, and storing only three digits of a ZIP code instead of five) have reduced the likelihood of re-identification dramatically. We should explore the vast benefits of sharing de-identified patient data without looking over our shoulders in fear.

Crossing swords with this perspective is another attitude that re-identification is eventually almost guaranteed to happen, as laid out in the prestigious Communications of the ACM (PDF), the press, and WIRED magazine.

Not only are data mining algorithms improving, but the skyrocketing quantity of personal data being collected provides the sheer mass to enable re-identification. For instance, one data set may carefully omit ZIP codes, but it could be combined with another commercial data set (not subject to HIPAA rules) that smacks the ZIP code on each individual. A simple SQL join can extract a unique person along with all data stored on him or her. Deborah Peel of Patient Privacy Rights calls this proliferation of commercial databases the key danger in re-identification. Barth-Jones downgrades these worries in another blog posting.
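To make the join concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is invented for illustration: the table names, the columns, and the three people are hypothetical, and real linkage attacks deal with far messier data. It joins a "de-identified" medical table to a named commercial table on birth date and sex.

```python
import sqlite3

def link_records(medical_rows, commercial_rows):
    """Join de-identified medical rows to named commercial rows.

    The schema is hypothetical: the medical table has no name or ZIP code,
    while the commercial table carries both.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE medical (birth_date TEXT, sex TEXT, diagnosis TEXT)")
    conn.execute("CREATE TABLE commercial (name TEXT, birth_date TEXT, sex TEXT, zip TEXT)")
    conn.executemany("INSERT INTO medical VALUES (?, ?, ?)", medical_rows)
    conn.executemany("INSERT INTO commercial VALUES (?, ?, ?, ?)", commercial_rows)
    # The join itself: match on the fields the two data sets share.
    rows = conn.execute(
        """
        SELECT c.name, c.zip, m.diagnosis
        FROM medical AS m
        JOIN commercial AS c
          ON c.birth_date = m.birth_date AND c.sex = m.sex
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows

# Made-up records, not real people.
medical = [("1950-03-14", "F", "asthma"), ("1962-11-02", "M", "diabetes")]
commercial = [
    ("Alice Example", "1950-03-14", "F", "02139"),
    ("Bob Example", "1962-11-02", "M", "02138"),
    ("Carl Example", "1962-11-02", "M", "02140"),
]

linked = link_records(medical, commercial)
for row in linked:
    print(row)
```

In this toy data the asthma record links to exactly one name, while the diabetes record links to two candidates, so the join alone cannot say which of them is the patient.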

Barth-Jones leaps into the controversy by re-examining the famous incident when Latanya Sweeney plucked patient information out of public records to identify the medical records of Massachusetts Governor William Weld. This brilliant and audacious research led to many changes in de-identification practices and regulation, including the HIPAA rules for PHI. But re-identification is hard in most real-life situations, because the lists of people being searched are almost always incomplete. For instance, voter registration rolls usually miss 30% of the people in their region.

This is the basic point behind a bit of key jargon in Barth-Jones’s paper, the "perfect population register." If you narrow your search to a single person, but there happens to be just one other person with the same demographics who is missing from your register, you have only a 50% chance of identifying the right person—pretty bad odds. No existing data, even the US Census, covers everybody, and the missing people always lurk like shades behind those that are being re-identified.
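The arithmetic behind that 50% figure can be sketched in a few lines of Python. This is my own simplification, not a formula from Barth-Jones's paper: it assumes the target is equally likely to be any of the people who share the demographics, whether or not they appear in the register. The function name and parameters are made up for this example.

```python
from fractions import Fraction

def chance_of_correct_match(matches_in_register: int, matches_missing: int) -> Fraction:
    """Chance that a candidate picked from the register is really the target.

    Assumes (a simplification) that the target is equally likely to be any
    person sharing the demographics, listed in the register or not.
    """
    if matches_in_register < 1:
        raise ValueError("need at least one matching person in the register")
    if matches_missing < 0:
        raise ValueError("the number of missing matches cannot be negative")
    return Fraction(1, matches_in_register + matches_missing)

# One match in the register and nobody missing: the identification is certain.
print(chance_of_correct_match(1, 0))  # 1
# One match in the register plus one look-alike missing from it.
print(chance_of_correct_match(1, 1))  # 1/2
```

Every additional unlisted look-alike pushes the odds down further, which is why the completeness of the register matters so much.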

Barth-Jones says that because voter registration rolls are incomplete, the combination of information Sweeney identified actually had only a two-thirds chance of being William Weld. The medical information did, however, prove her correct, because news reports had mentioned the exact date of a hospitalization that Weld had undergone, what hospital he was seen in, and what tests were performed on him.

When I asked Sweeney about his work, she pointed out that she did not need to know anything about Weld’s medical misadventures, and that the combination of date of birth, gender, and 5-digit ZIP code is unique for most Americans. (HIPAA now prohibits releasing it.)
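One way to check how identifying a combination of fields is in a given data set is to count how many records have a combination nobody else shares. The sketch below does this in Python on four invented records; the helper name and the data are hypothetical, and the result says nothing about real population figures. It also shows the effect of keeping only three digits of the ZIP code.

```python
from collections import Counter

def fraction_unique(records, key):
    """Share of records whose key (a combination of fields) appears only once."""
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(key(r) for r in records)
    unique = sum(1 for r in records if counts[key(r)] == 1)
    return unique / len(records)

# Invented (date of birth, gender, 5-digit ZIP) records, not real people.
people = [
    ("1950-03-14", "F", "02139"),
    ("1950-03-14", "F", "02138"),
    ("1962-11-02", "M", "02139"),
    ("1962-11-02", "M", "02139"),
]

# Full combination: date of birth, gender, and 5-digit ZIP.
full = fraction_unique(people, key=lambda r: r)
# Same fields, but with the ZIP code truncated to three digits.
coarse = fraction_unique(people, key=lambda r: (r[0], r[1], r[2][:3]))

print(full)    # 0.5
print(coarse)  # 0.0
```

In this toy set, half the records are unique on the full combination, and none are once the ZIP code is truncated. Real data sets will behave differently, so a measurement like this has to be run on the actual data being released.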

We should also consider the costs of re-identification. It’s cheap to use public information, such as voter registration rolls. Consulting commercial databases can raise the cost to hundreds of thousands of dollars to re-identify a single person, which raises the question of who would find it worthwhile. Barth-Jones predicts that it would be much more cost-effective for a malicious snoop to hire a private detective to tail you or dig through your garbage than to combine public records to make a positive identification.

Khaled El Emam, a professor at the University of Ottawa who is the founder and CEO of Privacy Analytics, Inc., told me in email that the Barth-Jones analysis is helpful because it describes in detail the real re-identification risks. He pointed to a study of his finding that most research claiming successful re-identification uses data that was de-identified using techniques that were in use before HIPAA, and that while a theoretical chance of re-identification remains, it has a tiny chance of success.

According to El Emam, "There are a number of other, oft-cited re-identification attacks where the methodology was poorly documented, so the real risks of them happening again cannot be easily determined. The Weld attack is a good example. There are multiple versions of the Weld attack story floating around, with different levels of drama involved. It was never actually published or fully documented. It got into privacy folklore and developed its own narrative with little actual evidence of how it was done."

I find the incomplete population register a powerful argument. But if you know something about a victim—whether it’s the date he entered a certain hospital or his possession of an American Express charge card—and you can get your hands on all hospital admissions or American Express customers, you may be able to hit the jackpot.

Celebrities, like Weld, are often in the news for health-related issues: when they deliver a baby, when they go into drug rehab, when they get serious diseases, and so forth. Hospitals and health information exchanges mark celebrity records as particularly sensitive, knowing that they’re often the targets of snoops.

You or I may also be the victim of an estranged relative or someone similar who happens to know details about our health and wants to ferret out something we’re trying to keep secret. One could imagine a child custody battle, for instance, being influenced by medical records that one spouse retrieves about another.

Good de-identification practices take highly specialized knowledge, and are important. I notice that the arguments for re-identification are similar to arguments that all encryption will eventually be broken. The vulnerability of encryption is predicated on continual increases in processing power, along with the routine discovery of weaknesses in existing encryption practices, plus the possibility of quantum computing, itself a quantum leap in the field. But we still encrypt our files and use secure web sites. So we can use de-identified patient data as well—let’s just hire experts to do it right.
