Combining patient data sets for better medical research

This was originally published on O’Reilly Media’s Strata blog, Oct 10, 2012.

I find Datalanche’s upcoming search application interesting because its database mixes public health data with patients’ clinical data from a private vendor. Practice Fusion opened up their data set of de-identified clinical information for a challenge that Datalanche won last week.

The Datalanche service hasn’t gone live yet, but I got a chance to peek at it. I entered searches for common drugs, where Datalanche can tell you what conditions they tend to be prescribed for, and for common diagnoses, where it can tell you what is commonly prescribed for them. The results also include standard public health information, such as the demographics of people who are diagnosed with a particular condition or receive a particular drug.

That’s just the consumer-facing part of the application. I talked to Ryan Pedela, co-founder of Datalanche, who told me it can be used to research interesting disparities, such as doctors who prescribe different medications for the same diagnosis. Of course, the site can’t tell you why they differ in their choices, but the data can be mined further to provide hints. For instance, if a patient is prescribed one medication but returns quickly for a follow-up visit and is prescribed another medication, one can guess that he experienced side effects or that the medication was ineffective. Even so, key insights such as "side effects" or "ineffective" would be buried in free-text notes that weren’t given to Datalanche.
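To make the follow-up heuristic concrete, here is a minimal Python sketch. It is not Datalanche's method: the record layout, the sample rows, the function name find_quick_switches, and the 30-day threshold are all assumptions chosen for illustration. It groups visits by patient and diagnosis, then flags consecutive visits where the medication changed within a short window.

```python
from collections import defaultdict
from datetime import date

# Hypothetical de-identified records (not real data):
# (patient_id, visit_date, diagnosis, medication)
VISITS = [
    ("p1", date(2012, 1, 5), "hypertension", "drug_a"),
    ("p1", date(2012, 1, 19), "hypertension", "drug_b"),
    ("p2", date(2012, 2, 1), "hypertension", "drug_a"),
    ("p2", date(2012, 6, 1), "hypertension", "drug_b"),
    ("p3", date(2012, 3, 1), "hypertension", "drug_a"),
    ("p3", date(2012, 3, 10), "hypertension", "drug_a"),
]


def find_quick_switches(visits, max_gap_days=30):
    """Flag medication changes on consecutive visits for the same
    patient and diagnosis that occur within max_gap_days.

    The 30-day default is an arbitrary assumption, not a clinical rule.
    """
    # Collect each patient's visit history per diagnosis.
    by_key = defaultdict(list)
    for patient_id, visit_date, diagnosis, medication in visits:
        by_key[(patient_id, diagnosis)].append((visit_date, medication))

    switches = []
    for (patient_id, diagnosis), history in by_key.items():
        history.sort()  # chronological order
        # Compare each visit with the one that follows it.
        for (d1, m1), (d2, m2) in zip(history, history[1:]):
            gap = (d2 - d1).days
            if m1 != m2 and gap <= max_gap_days:
                switches.append((patient_id, diagnosis, m1, m2, gap))
    return switches


if __name__ == "__main__":
    for row in find_quick_switches(VISITS):
        print(row)
```

A flagged switch is only a hint. Without the free-text notes, the data cannot say whether the change was due to side effects, ineffectiveness, or something else entirely.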

So where does Datalanche get its input? For patient data it draws on the Practice Fusion data set, which covers 10,000 patients (a sample of Practice Fusion patients), and it will also add public health data from the Centers for Disease Control. General descriptions of medical information, which don’t come from patient data, are taken from MedlinePlus, a free online encyclopedia offered by the National Library of Medicine, NIH, and the Department of Health and Human Services.

The data set from the CDC is large, but suffers from a key limitation: it records individual visits without linking them to one another. You can’t do longitudinal studies with it, because you can’t tell whether any two visits were made by a single patient or even were handled by a single doctor. That longitudinal depth comes from the Practice Fusion data set, which does link visits.

As more and more patient data enters public use, we can make finer and finer distinctions among medical conditions and which treatments seem to work. This will require sophisticated tools for hooking up different visits and tracing what is really happening, along with—of course—safeguards that ensure patient consent and hedge against re-identification.
