by Andrew Oram
This was originally published on O’Reilly Media’s Strata blog, May 7, 2014.
If your data consists of one million samples, but only 100 have the characteristics you’re looking for, and each of the million samples contains 250,000 attributes, each of which is built of thousands of basic elements, you have a big data problem. This is the kind of challenge faced by the 2,700 Bio-IT World attendees, who discover genetic interactions and create drugs for the rest of us.
Often they are looking for rare (orphan) diseases, or for cohorts who share a rare combination of genetic factors that require a unique treatment. The data sets get huge, particularly when the researchers start studying proteomics (the proteins active in the patients’ bodies).
So last week I took the subway downtown and crossed the two wind- and rain-whipped bridges that the city of Boston built to connect to the World Trade Center. I mingled for a day with attendees and exhibitors to find out what data-related challenges they’re facing and what the latest solutions are. Here are some of the major themes I turned up.
Clinical researchers have always worked hard to get their results used for treatment (why else would they do the research?), spawning a whole field known as "translational medicine" that turns clinical insights into action. But translation goes in the other direction too, and doctors’ information about patients has become more important to researchers than ever.
A talk by Steven Labkoff, MD, of AstraZeneca revealed what clinical data means to pharmaceutical companies. (AstraZeneca won one of the Best Practices Awards at Bio-IT World.) To be covered by insurance nowadays, drug companies have to prove not only that their drugs are safe and effective, but also that they are comparatively effective, that is, better than competing drugs. This requirement has been imposed, of course, because doctors have prescribed so many new, expensive drugs, only to discover later that an older drug could do just as well at one-tenth the price. Without proof of comparative effectiveness, the drug may never appear on a formulary and doctors won’t even think of prescribing it.
But drug companies can’t use conventional clinical trials to prove comparative effectiveness—there are too many variables to test and it would take too long. Instead, they need to collect big data from clinicians.
With AstraZeneca’s help, Pfizer identified 42 uses for EMR data in pharma. Examples include finding patients suitable for drug trials, recruiting those patients, and targeting a medicine at the right patients (patient segmentation) using biomarkers collected in EMRs.
Dr. Labkoff also lamented that the US, unlike many other countries, has ruled out the use of universal patient identifiers. Although complex matching techniques can fill most of the gap, the lack of a common identifier still contributes to what Labkoff called a "data archipelago" of many unlinked repositories.
The stakes resting on data collection illustrate the financial pressures driving data use in the life sciences. Big money was being spent in the exhibition hall.
An estimated 80% of patient information in doctors’ records is unstructured. That figure may be skewed by including images (which are huge) but there’s no doubt that doctors prefer to enter information in their own way, that structured fields tend to be overly rigid, and that God often lies in the details. So it’s also no surprise that natural language processing, which promises to aggregate useful data from free-form text, showed up a lot at Bio-IT World.
Cambridge Semantics offers natural language processing, and uses linked data technologies such as OWL under the hood. It also accepts data from open source data stores such as Cassandra and Hadoop.
Another company making heavy use of NLP is Sinequa, whose products are used by AstraZeneca and many other companies. Sinequa claims to support 19 languages and to index a wide range of data sources, from relational databases to tweets. MongoDB, SharePoint, Google Apps, and other sources can all be mined through more than 140 off-the-shelf connectors. The Sinequa "logical data warehouse" combines an index with a columnar database, a design similar to other search engines.
Although all the companies I found on the show floor were proprietary (offering either Software as a Service or conventionally licensed products), open source software pervaded the conference. Hadoop, the R language, and other free software came up routinely during talks.
For instance, the Department of Veterans Affairs runs patient notes through a system based on the Apache Software Foundation’s Unstructured Information Management Architecture (UIMA) project, a natural language processing framework that IBM donated and that also underlies IBM’s famous Watson. It helps the VA determine which patients are smokers, which have advanced stages of cancer, which have genome information in their records, and so on.
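The VA’s actual pipeline is built on UIMA and is far more elaborate, but a minimal Python sketch suggests the kind of rule-based extraction such a system performs on free text; the patterns and labels here are invented for illustration:

    import re

    def smoking_status(note):
        """Crude free-text extraction of a patient's smoking status."""
        text = note.lower()
        if re.search(r"non[- ]?smoker|never smoked|denies smoking", text):
            return "non-smoker"
        if re.search(r"former smoker|ex[- ]smoker|quit smoking", text):
            return "former smoker"
        if re.search(r"current smoker|smokes \d+|pack[- ]years?", text):
            return "current smoker"
        return "unknown"

    print(smoking_status("Pt is a former smoker; quit smoking in 2009."))

Real clinical NLP has to handle negation, abbreviations, and context far more carefully, which is exactly what frameworks such as UIMA are designed to support.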
Free software, along with collaboration around data, was central to the keynote by Stephen Friend (whose Sage Congresses I covered in 2011, 2012, and 2013). The meme he pushes is knowledge as a currency: a good that researchers can trade to the benefit of all.
Echoing the background behind the Sage Congresses, Friend laid out the growth in complexity of the phenomena geneticists are studying. We are studying wellness in addition to disease. And we are trying to move from describing diseases through their symptoms to representing them through their underlying causes, which can split a single visible condition into multiple diseases. Notably, the Editors’ Choice Award at the conference went to the Icahn School of Medicine at Mount Sinai in New York City, which found that Type 2 diabetes consists of three different diseases.
Further adding to the complexity of research, we know now that proteins have different functions in different contexts, and that removing one protein from a pathway changes the functions of other elements in the pathway.
For many reasons, researchers see different facets of the same data and report different findings on the same molecules or genes. Friend would like to see such "adjacent truths" treated as a boon to research, not a barrier. He warned that researchers feel pressured to draw conclusions from their individual results too fast in order to get published. These conclusions then get picked up by the press and reported as the gospel truth.
The two major current Sage projects, BRIDGE and Synapse, facilitate a shift from the traditional publishing cycle to sharing over open networks such as GitHub (which hosts both projects). BRIDGE also brings patients into the promotion and control of clinical research. Friend pointed out several challenges that remain in getting there.
Another powerful tribute to data sharing came from Helen M. Berman, who has spent more than 40 years ensuring the availability of open data through the Protein Data Bank (PDB) and who is this year’s Benjamin Franklin Award winner.
She described her journey as beginning with a 1971 petition to make protein data open. A committee was created in 1989 to put more force behind this mandate, and seven years later it published guidelines on how to deposit data. Journals started to refuse to publish articles until the underlying data was contributed to the PDB. As I have reported in my Sage Congress articles, this allows other researchers to validate experiments, as well as giving them raw material for further research.
In 1993 a committee was formed to create standards for structural and experimental data. There were many complaints that standards would inhibit creativity, but it eventually became clear that standards were necessary to make data portable.
Two more advances came in 2008: encouraging researchers to publish the data behind their experiments, and creating standards for validation. When data arrives at the database, it is checked and the results are returned to the depositor, who may have to repeat the experiment.
The TranSMART Foundation has also been working on open data sets and open source software for years, and won an award at Bio-IT World this year for its work on the European Union’s U-BIOPRED project. This project encourages multiple companies to work together on cures for asthma and other respiratory illnesses. The EU contributed two billion euros to the project, and pharma companies threw in another two billion euros. Keith Elliston, CEO of the TranSMART Foundation, told me U-BIOPRED is the first of multiple open projects that the EU will promote through its Innovative Medicines Initiative.
The TranSMART Foundation is one year old, and is dedicated to data sharing in the life sciences by producing free software meant to be run on multiple sites. Numerous forks of the software have produced useful enhancements, and version 1.2 of the platform will bring together 12 branches. The system can share deidentified clinical data, but users can also set up private instances to read in public data and meld it with their patient data.
Among the companies and projects that build on tranSMART, Elliston mentioned the European Translational Information and Knowledge Management Services (eTriks); Orion Bionetworks and the Accelerated Cure Project for Multiple Sclerosis, which use it to collaborate; Harvard Medical School; Takeda Cambridge; and the Laboratory of Neuro Imaging.
Data cleaning was a theme I discussed with BioDatomics, which uses an open core business model. Running data from an experiment through its system turns up a quick visualization of the range of samples, showing (among other things) how many outliers there are. Co-founder Maxim Mikheev said this visualization makes it much simpler for a researcher to build analytical pipelines, and subsequently to identify and remove bad data from the set or perform other operations on the data. An outsider such as the PDB can also check data using this tool and reject poor data sets.
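BioDatomics has not published its internals, but the screening step its visualization supports can be pictured in a few lines of Python; the z-score cutoff and the per-sample coverage numbers below are invented for the example:

    import statistics

    def flag_outliers(values, threshold=2.0):
        """Return indices of samples whose z-score exceeds the threshold."""
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        return [i for i, v in enumerate(values)
                if stdev > 0 and abs(v - mean) / stdev > threshold]

    # Hypothetical per-sample sequencing coverage; one sample is clearly off.
    coverage = [28.1, 30.4, 29.7, 31.0, 30.2, 95.3, 29.8]
    print(flag_outliers(coverage))   # [5]

A plot of the same values makes the point at a glance, which is what a visualization-first tool is selling.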
BioDatomics runs on Hadoop, which president Alan Taffel told me is ideal for genomic data: extremely large files that can be split up and submitted to clusters for parallel processing. BioDatomics includes 150 analysis tools (50 of them very commonly used) that let users analyze sequenced genomes through a point-and-click interface.
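This is not BioDatomics’ own tooling, but a toy Hadoop Streaming job in Python shows why splittable genomic files suit the model so well; the hypothetical job below simply counts variants per chromosome in a VCF file:

    # mapper.py -- emits "chromosome<TAB>1" for every variant line of a VCF
    import sys

    for line in sys.stdin:
        if line.startswith("#"):          # skip VCF header lines
            continue
        chrom = line.split("\t", 1)[0]
        print(chrom + "\t1")

    # reducer.py (a separate file) -- sums counts per chromosome;
    # Hadoop sorts the mapper output by key before it reaches the reducer.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        chrom, count = line.rstrip("\n").split("\t")
        if chrom != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = chrom, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

Submitted through the hadoop-streaming jar, each mapper chews on its own chunk of the file in parallel, which is what makes huge sequence files tractable.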
Provenance—knowing who created data, when it was collected, and other such metadata—also came up as a conference theme. Friend mentioned its heightened importance as researchers share data. I talked to staff at Lab7 Systems, whose Enterprise Sequencing Platform (ESP) software collects a huge amount of data about experiments and attaches provenance information that the tool culls automatically when staff enter the lab information. You can figure out, for instance, which bad lab kit is responsible for failed results, and can survey results over time. This level of data provenance in sequence data management and processing is critical as the technology progresses toward routine clinical use.
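Lab7’s ESP gathers this information automatically, but a minimal sketch (with invented field names) shows what "attaching provenance" to a result can mean in practice:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class Provenance:
        """Who and what produced a result, and when."""
        operator: str
        instrument_run: str
        reagent_lot: str
        recorded_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc))

    @dataclass
    class SequencingResult:
        sample_id: str
        reads_passing_filter: int
        provenance: Provenance

    result = SequencingResult(
        sample_id="S-1042",
        reads_passing_filter=18_500_000,
        provenance=Provenance("jsmith", "run-07", "LOT-2014-031"),
    )
    # With lot numbers recorded, a failed run can be traced back to a bad kit.
    print(result.provenance.reagent_lot)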
We’ve seen how big the data gets in life science research. Easy access to both vast storage resources and cheap computing clusters therefore facilitates the current growth in data-crunching by clinical researchers. John Quackenbush of the Dana-Farber Cancer Institute and the Harvard School of Public Health, in his opening keynote about precision medicine, pointed out that cloud computing makes processing affordable for this research. He apparently sees no bound on the size of the data to be shared: "The next large cohort study should be everybody."
There were too many cloud companies at the show to list them here, but you can imagine that all the likely players were there, and can check the complete list of exhibitors if you like.
Narges Bani Asadi, PhD, founder of Bina Technologies, told me that the explosion in life science research has been driven by two recent advances: cloud computing and the drastic reduction in the cost of DNA sequencing. In particular, complete genome sequencing has come within the means of individual clinicians, who are using it extensively in cancer treatment and pediatrics.
Bina’s service is aimed at these doctor and hospital settings, taking a researcher from a representation of molecular data to clinically useful results. It starts with gene mapping and variant detection, and provides more than 130 ways to annotate and interpret the variations. Shared data is not currently part of the service: for both privacy reasons and institutional preferences, data stays with the clinician.
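Bina’s 130-plus annotation methods are obviously far richer, but the basic step of interpreting a detected variant can be pictured as a lookup against reference knowledge; the coordinates and annotations below are purely illustrative:

    # Toy variant annotation: join detected variants against a reference table.
    annotations = {
        ("chr7", 140453136, "A", "T"): "well-known oncogenic variant",
        ("chr17", 41245466, "G", "A"): "variant of uncertain significance",
    }

    detected = [
        ("chr7", 140453136, "A", "T"),
        ("chr12", 25398284, "C", "T"),
    ]

    for variant in detected:
        chrom, pos, ref, alt = variant
        note = annotations.get(variant, "no annotation found")
        print("%s:%d %s>%s -> %s" % (chrom, pos, ref, alt, note))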
Liaison Healthcare Informatics offers a one-stop repository, which collects data from multiple sources and allows a researcher to run her own data through analytics that draw on these sources. The researcher therefore uses Liaison’s software as a service instead of having to install and learn a variety of stand-alone tools. In addition to collecting the data, Liaison validates and normalizes it, taking an additional burden off the researcher.
I talked to Dr. Pek Lum, Chief Data Scientist and VP of Solutions at Ayasdi, which participated in the prize-winning research with the Icahn School of Medicine mentioned earlier. She described Ayasdi’s strength as the application of TDA (topological data analysis) as a framework for machine learning algorithms. You can think of it as a relatively new form of data clustering or grouping.
Finding a line that best represents a set of scattered points (through least squares, for instance) is a fairly familiar way to make sense of data. Topological data analysis clusters data and finds patterns using much more sophisticated techniques, and it produces a graph with nodes and edges to show relationships (see section 1, Preliminary Mathematical Tools, of this paper).
Dr. Lum told me that this analysis allows the company to greatly reduce the time needed to find correlations in biological data, sometimes by two orders of magnitude over traditional analysis. The approach allows unsupervised analysis, meaning that you don’t have to formulate hypotheses and test them, but can instead let unexpected relationships emerge from the analysis. Visualizations can also be built on the analysis, because the graphs it produces are amenable to visual display.
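To make the contrast concrete, here is a small sketch: a least-squares line on one hand, and on the other a drastically simplified cousin of the Mapper construction often used in TDA, which bins points into overlapping intervals and links bins that share points to form a graph. This is only an illustration; Ayasdi’s algorithms are far more sophisticated.

    import numpy as np

    # The familiar summary: a least-squares line through scattered points.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.1, 1.9, 3.2, 3.9, 5.2, 5.8])
    slope, intercept = np.polyfit(x, y, 1)
    print("least-squares fit: y = %.2fx + %.2f" % (slope, intercept))

    # A toy TDA-flavored alternative: cover the data with overlapping bins
    # (the graph's nodes) and connect bins that share points (its edges).
    points = list(zip(x, y))
    bins = {}
    for px, py in points:
        for b in (int(px), int(px) + 1):   # each point lands in two overlapping bins
            bins.setdefault(b, []).append((px, py))
    edges = sorted((a, b) for a in bins for b in bins
                   if a < b and set(bins[a]) & set(bins[b]))
    print("nodes:", sorted(bins))
    print("edges:", edges)

The output of the second approach is a graph rather than a single line, which is what makes it amenable to the visualizations Lum described.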
One of the problems making clinical research harder is the need to compress trials, because the big blockbuster drugs seem to have all been found and companies have to mine less lucrative deposits of genetic data to produce drugs treating smaller populations. The identification of different patient strata (such as the Type 2 diabetes findings mentioned earlier) also forces companies to aim new drugs at fewer people.
Tim Carruthers, CEO of Neural ID, told me many companies are speeding trials by doing adaptive research, meaning that they change drug dosing or other parameters during the clinical trial. This added complexity increases the amount of data to be evaluated and may postpone deep analysis of information until the conclusion of the trial. For example, cardiac trace monitoring data may be captured, but its full impact must be found by sifting through an even larger pile of information. The problem is addressable through the application of machine learning.
Neural ID harmonizes the biosignal data that is generated in multiple formats during all phases of drug development: discovery, pre-clinical, and clinical. A researcher trains the system by showing it what to look for, then turns it loose to find those key elements. She can spot subtle, hard-to-find drug interactions such as QT interval changes (a measure of heart activity) across different phases. Once a relationship is found in the numbers, she then can drill back to the visualization to see what’s happening in the original data.
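Neural ID has not published its algorithms, so the following is only a generic sketch of that train-then-search pattern, using scikit-learn on hand-labeled QT-interval features; every number and label here is invented:

    from sklearn.ensemble import RandomForestClassifier

    # Invented feature rows: [baseline QT (ms), QT on drug (ms), heart rate (bpm)]
    X_train = [
        [400, 405, 72],   # labeled by the researcher as "no_effect"
        [410, 412, 68],
        [395, 452, 70],   # labeled as "prolongation"
        [405, 470, 75],
    ]
    y_train = ["no_effect", "no_effect", "prolongation", "prolongation"]

    # "Training the system by showing it what to look for" ...
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # ... then turning it loose on new, unlabeled traces.
    X_new = [[402, 448, 71], [398, 401, 69]]
    print(model.predict(X_new))   # the first trace shows a large QT increase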
Neural ID works with industry standards where possible, and looks forward to handling even more massive data, such as the multi-hour Holter traces gathered in clinical research and stored in HL7. Like Stephen Friend, Carruthers would like to see people upload data from fitness devices and run analytics on it the same way clinical researchers do with Neural ID.
A number of the companies at the show offered data repositories, filling a gap left by standards and organizational cultures that are still inadequate to support direct data sharing between researchers and data consumers. Of course, validation, cleaning, and text search are valuable services that make repositories worthwhile. But the Signet Accel Accelematics platform takes a different approach: it leaves data in the hands of its clients, rather than centralizing it in a cloud repository outside their control, while still enabling secure collaborative analytics.
Although Bio-IT World revealed a wealth of tools for data sharing and data analysis, we must remember that the industry is slow to change its culture, even when threatened by such inescapable challenges as the shrinking market for each drug, the public’s disgust at drugs that turn out to be risky or ineffective, and the intricacies that compound developers must juggle as research brings us deeper into the pathways taken by disease.
It’s a bit disconcerting, therefore, to see that a lot of researchers still use spreadsheets on laptops to store and manipulate data. And a number of companies offer Excel interfaces so that they don’t wrench researchers away from their familiar habits; one example is Helium by Ceiba, which offers very powerful capabilities based on a large database, NLP, and quick retrieval of information matched by a compound.
So the life sciences don’t have just a big data problem, as I mentioned at the start of the article—they have a big tool problem. When Hadoop becomes more popular than Excel for processing "omics" data, we’ll really have a sea change in clinical research.