Data sharing drives diagnoses and cures, if we can get there

by Andrew Oram

This was originally published on O’Reilly Media’s Strata blog, April 30, 2013.

The glowing reports we read of biotech advances almost cause one’s brain to ache. They leave us thinking that medical researchers must command the latest in all technological tools. But the engines of genetic and pharmaceutical innovation are stuttering for lack of one key fuel: data. Here they are left with the equivalent of trying to build skyscrapers with lathes and screwdrivers.

Sage Congress, held this past week in San Francisco, investigated the multiple facets of data in these fields: gene sequences, models for finding pathways, patient behavior and symptoms (known as phenotypic data), and code to process all these inputs. A survey of efforts by the Congress’s organizer, Sage Bionetworks, along with other innovations in genetic data handling, can show how genetics resembles and differs from other disciplines.

An intense lesson in code sharing

At last year’s Congress, Sage announced a challenge, together with the DREAM project, intended to galvanize researchers in genetics while showing off the growing capabilities of Sage’s Synapse platform. Synapse ties together a number of data sets in genetics and provides tools for researchers to upload new data, while searching other researchers’ data sets. Its challenge highlighted the industry’s need for better data sharing, and some ways to get there.

The Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge was cleverly designed to demonstrate both Synapse’s capabilities and the value of sharing. The goal was to find a better way to predict the chances of survival among victims of breast cancer. This is done through computational models that search for patterns in genetic material.

To participate, competing teams had to upload models to Synapse, where they were immediately evaluated against a set of test data and ranked in their success in predicting outcomes. Each team could go online at any time to see who was ahead and examine the code used by the front-runners. Thus, teams could benefit from their competitors’ work. The value of Synapse as a cloud service was also manifest. The process is reminiscent of the collaboration among teams to solve the Netflix prediction challenge.
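To make that workflow concrete, here is a minimal sketch of what storing a model on Synapse might look like through its Python client (synapseclient). The project ID and file name are placeholders of my own, and the actual challenge defined its own submission mechanics.

    # Minimal sketch: store a model's code in a Synapse project so it can be
    # evaluated and inspected by other teams. The project ID and file name
    # below are hypothetical.
    import synapseclient
    from synapseclient import File

    syn = synapseclient.Synapse()
    syn.login()  # uses credentials cached from a prior login

    model = File('prognosis_model.R', parent='syn1234567')  # hypothetical project ID
    model = syn.store(model)
    print(model.id, model.versionNumber)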

Although this ability to steal freely from competing teams would seem to be a disincentive to participation, more than 1,400 models were submitted, and the winning model (which was chosen by testing the front-runners against another data set, assembled by a different research team at a different time and place) seems to work better than existing models, although it will still have to be tested in practice.

The winner’s prize was a gold coin in the currency recognized by researchers: publication in the prestigious journal Science Translational Medicine, which agreed in advance to recognize the competition as proof of the value of the work (although the article also went through traditional peer review). Supplementary materials were also posted online to fulfill the Sage mission of promoting reproducibility as well as reuse in new experiments.

Junctions and barriers

Synapse is a cloud-based service, but it is open source, so any organization can store its own data on servers of its choice and provide Synapse-like access. This is important because genetic data sets tend to be huge, and therefore hard to copy. On its own cloud servers, Synapse stores metadata, such as data annotations and provenance information, about data objects that can be located anywhere. This allows organizations to store data on their own servers while still using the Synapse services. Of course, because Synapse is open source, an organization could also choose to create its own instance, but this would eliminate some of the cross-fertilization across people and projects that has made the code-hosting site GitHub so successful.
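As a rough illustration of that division of labor, the sketch below (again using the Python client, with a made-up URL and project ID) registers a file that stays on an institution’s own server, so Synapse holds only the link, annotations, and provenance rather than the data itself.

    # Sketch: register externally hosted data with Synapse. Synapse keeps the
    # metadata and a link; the bytes stay on the institution's server.
    # The URL and parent ID are hypothetical.
    import synapseclient
    from synapseclient import File

    syn = synapseclient.Synapse()
    syn.login()

    external = File('https://data.example-lab.org/expression_matrix.tsv',
                    parent='syn2468101',
                    synapseStore=False)   # do not copy the file into Synapse storage
    external.assay = 'RNA-seq'            # annotations attach to the metadata record
    external = syn.store(external)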

Sage rents space on Amazon Web Services, so it turns to AWS offerings, such as DynamoDB for non-relational storage, to build each element of Synapse. More detail about Synapse’s purpose and goals can be found in my report from last year’s Congress.

A follow-up to this posting will summarize and compare some ways that the field of genetics is sharing data, and how it is being used both within research and to measure the researchers’ own value.


Data sharing is not an unfamiliar practice in genetics. Plenty of cell lines and other data stores are publicly available from such places as the TCGA data set from the National Cancer Institute, the Gene Expression Omnibus (GEO), and ArrayExpress (all of which can be accessed through Synapse). So to some extent the current revolution in sharing lies not in the data itself but in critical related areas.

First, many of the data sets are weakened by metadata problems. A Sage programmer told me that the famous TCGA set is enormous but poorly curated. For instance, different data sets in TCGA may refer to the same drug by different names, generic versus brand name. Provenance—a clear description of how the data was collected and prepared for use—is also weak in TCGA.

In contrast, GEO records tend to contain good provenance information (see an example), but only as free-form text, which presents the same barriers to searching and aggregation as free-form text in medical records. Synapse is developing a structured format for presenting provenance based on the W3C’s PROV standard. One researcher told me this was the most promising contribution of Synapse toward the shared use of genetic information.
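Here is a rough sketch of how such provenance might be recorded through the Python client, which exposes a PROV-style “activity” linking inputs, code, and outputs. All of the Synapse IDs and file names below are made up.

    # Sketch: attach structured, PROV-style provenance to a derived data file.
    # The Synapse IDs and file name are hypothetical.
    import synapseclient
    from synapseclient import File, Activity

    syn = synapseclient.Synapse()
    syn.login()

    activity = Activity(
        name='normalize expression data',
        description='Quantile normalization of the raw series',
        used=['syn1111111'],        # the raw input data set
        executed=['syn2222222'])    # the script that produced the output

    normalized = File('normalized_expression.tsv', parent='syn3333333')
    normalized = syn.store(normalized, activity=activity)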

Data can also be inaccessible to researchers because it reflects the diversity of patient experiences. One organizer of Army of Women, an organization that collects information from breast cancer patients, says it’s one of the largest available data repositories for this disease, but is rarely used because researchers cannot organize it.

Fragmentation in the field of genetics extends to nearly everything that characterizes data. One researcher told me about his difficulties combining the results of two studies, each comparing responses of the same genetic markers to the same medications, because the doses they compared were different.

The very size of data is a barrier. One speaker surveyed all the genotypic information that we know plays a role in creating disease. This includes not only the patient’s genome—already many gigabytes of information—but other material in the cell and even the parasitic bacteria that occupy our bodies. All told, he estimated that a complete record of our bodies would require a yottabyte of data, far beyond the capacity of any organization to store.

Synapse tries to make data easier to reuse by encouraging researchers to upload the code they use to manipulate the data. Still, this code may be hard to understand and adapt to new research. Most researchers learn a single programming language such as R or MATLAB and want only code in that language, which in turn restricts the data sets they’re willing to use.

Sage has clearly made a strategic choice here to gather as much data and code as possible by minimizing the burden on the researcher when uploading these goods. That puts more burden on the user of the data and code to understand what’s on Synapse. A Sage programmer told me that many sites with expert genetics researchers lack programming knowledge. This has got to change.

Measure your words

Standardized data can transform research far beyond the lab, including the critical areas of publication and attribution. Current scientific papers bear large strings of authors—what did each author actually contribute? The last author is often a chief scientist who did none of the experimentation or writing on the paper, but organized and directed the team. There are also analysts with valuable skills that indirectly make the research successful.

Publishers are therefore creating forms for entering author information that specify the role each author played, called multidimensional author descriptions. Data mining can produce measures of how many papers each author has worked on and the relative influence of each. Universities and companies can use these insights to hire candidates with the particular skills they need.

One of the first steps to data sharing is simply to identify and label it, at the relevant granularity. For scientific data, one linchpin is the Digital Object Identifier (DOI), which uniquely identifies each data set. When creating a DOI, a researcher provides critical metainformation such as contact information and when the data was created. Other researchers can then retrieve this information and use it when deciding whether to use the data set, as well as to cite the original researcher. Metrics can determine the “impact factor” of a data set, as they now do for journals.
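As a small illustration, the metadata behind a data set’s DOI can be pulled back programmatically through the DOI resolver’s content negotiation; the DOI below is a placeholder, not a real data set.

    # Sketch: fetch the descriptive metadata behind a data-set DOI using content
    # negotiation at the DOI resolver. The DOI shown is a placeholder.
    import json
    import urllib.request

    doi = '10.5061/dryad.example'   # hypothetical data-set DOI
    request = urllib.request.Request(
        'https://doi.org/' + doi,
        headers={'Accept': 'application/vnd.citationstyles.csl+json'})

    with urllib.request.urlopen(request) as response:
        record = json.loads(response.read().decode('utf-8'))

    # Fields such as title, author, and issue date can feed citations and
    # data-set "impact" metrics.
    print(record.get('title'), record.get('author'))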

Sage supports DOIs and is working on a version layer, so that if data changes, a researcher can gain access both to the original data set and the newer ones. Clearly, it’s important to get the original data set if one wants to reproduce an experiment’s result. Versioning allows a data set to keep up with advances, just as it does for source code.
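In practice, that means a researcher can ask for the exact version an experiment used. A minimal sketch with the Python client (and a made-up ID):

    # Sketch: retrieve a specific version of a data set so an analysis can be
    # reproduced against exactly the data it originally used. The ID is hypothetical.
    import synapseclient

    syn = synapseclient.Synapse()
    syn.login()

    original = syn.get('syn4455667', version=1)   # the data as first published
    latest = syn.get('syn4455667')                # the current version
    print(original.versionNumber, latest.versionNumber)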

Stephen Friend, founder of Sage, said in his opening remarks that the field needs to move from hypothesis-driven data analysis to data-driven data analysis. He highlighted funders as the key force that can drive this change, which affects the recruitment of patients, the collection and storage of data, and the collaboration of teams around the globe. Meanwhile, Sage has intervened surgically to provide tools and bring together the people who can make this shift happen. Parts of Sage Congress were videotaped and posted online.
