This article explores the problems caused by a lack of data standards, focusing on an industry hackathon that analyzed data sets relating to rare diseases.
It is no secret that data is a life science company’s greatest asset, but few realize that it can also be its biggest downfall. The data ‘deluge’, left unmanaged, has the potential to overwhelm researchers. In 2016, Accenture said, “many life sciences companies do not yet have the digital infrastructure and, most important, the talent to use these data well.” The issue lies not with data generation but with how researchers make sense of and connect the data to improve patient outcomes or support the discovery of new medicines.

The development of techniques such as ‘machine learning’, ‘deep learning’, and ‘text mining’ has offered a new approach to data handling. Recently, for example, researchers at the University of Sydney used a combination of machine learning and omics technology to accurately indicate insulin resistance, or pre-diabetes, a major predictor of metabolic syndrome. The reality, though, is that progress through new technology is being hampered by a lack of data standards. This is an industry-wide problem. Overcoming it will require life science companies to work with technology firms and academic institutions to create data standards governing how data is stored and retrieved, so that it can support drug discovery.

This article will explore these issues, focusing on an industry hackathon that analyzed data sets relating to rare diseases. The hackathon furthered our understanding of the challenges data can present and indicated ways to overcome them, to the benefit of the whole industry.

The data formatting dilemma

Life science R&D has access to, and produces, a tremendous amount of data, whether generated in house or drawn from external sources such as scientific journals, electronic health records, CROs, partners, or licensed IP. The lack of data standards across these stakeholders significantly hampers research efforts and collaboration initiatives, both within and between organizations.
What’s more, the absence of a common data model costs both time and money. According to the Institute of Medicine, the use of data standards in the pharmaceutical industry would reduce the United States’ health care administration expenditure by 20%–30%. Traditionally, organizations have developed their own internal systems, resulting in duplicated effort and in infrastructures that are not interoperable. In addition, with data siloed and stored in different formats, harmonizing data so it can be integrated into research projects is a very difficult task. To lower this hurdle, one of the industry’s main goals should be to publish an open and freely available format for the storage and exchange of drug discovery data. Doing so would accelerate drug discovery research and remove a shared barrier to collaboration.

Case study: a collaborative hackathon exposes challenges

In March 2017, Elsevier set a task at The President’s Challenge Hackathon, an event organized in London, UK, by The Pistoia Alliance, a not-for-profit group that aims to lower barriers to innovation in R&D. The objective of Elsevier’s challenge was to demonstrate the ability of deep learning to help the UK-based charity Findacure, which aims to promote collaboration between rare disease stakeholders to facilitate treatment development for all. Elsevier wanted the challenge to accelerate treatment and clinical research for Friedreich’s Ataxia, a rare progressive neurodegenerative disease affecting balance, coordination, and speech.

When Find Me a Drug, a team of highly motivated students and recent graduate researchers, came together to aid drug discovery through drug repurposing for an orphan disease, they knew the data was going to be a huge challenge. Though the participants had access to very large amounts of data, it was trapped in disparate formats, which would slow the investigative process down significantly.
To begin with, the students were given a heterogeneous set of data in XML files related to the disease: biological pathway analysis, associated chemical compounds and bioactivities, potential candidates for drug repurposing, full-text scientific literature, and clinical trial data. Elsevier also wanted the hackathon to highlight the difficulty researchers face when using new data approaches without common data models.

Daniel Rhodes, a PhD candidate in drug repurposing at Queen Mary, University of London, said, “We spent most of our time the first day just trying to get our heads around the data, so we could start to find some solutions. Even opening the files was tricky.” The students used various tools to try to extract data from the provided XML files, but it was slow going. Rhodes commented that “we wound up having to do a lot of things manually, so we could at least read the files in plain text.”

Eventually the students and graduate researchers were able to cut through a lot of the ‘noise’ created by the large and unformatted dataset. “We ended up with a relatively simplified network graph that was a visualization of just one of the datasets, something that was definitely manageable,” Rhodes said. By keeping the approach simple, they identified areas where researchers might want to apply machine learning in the future.

The hackathon team gave its dataset visualization to Findacure, which in turn showed it to scientific advisors at the non-profit Ataxia UK. The visualization could be used to point to new research avenues or to validate work currently underway. Over the weekend, Team Find Me a Drug helped move the discovery project forward and learned how to streamline similar projects in the future.

The hackathon teams presented their findings to a panel of judges from The Pistoia Alliance. As a runner-up, the Elsevier/Findacure team received £500, some of which was donated to Findacure.
For Findacure CEO Dr Rick Thompson, the most important benefit of the event was that it “instantly raised the level of awareness for the concept of repurposing in rare diseases.” Highlighting how the team explored and began to work with the ataxia data, he said, “it helped everyone see how data-mining can very quickly begin to make a difference to patient groups for diseases that are otherwise not getting any attention or study.”

What the hackathon means for future R&D

Events like this hackathon demonstrate how future R&D will benefit from new techniques such as deep learning. Dr Thompson said, “Knowing how much this team did in essentially 24 hours, starting from having no clue about ataxia beforehand, is something that can hook researchers, and that patient groups need to know about as well, because if this can be done, then anything is possible.” The hackathon was evidence that techniques such as machine learning, artificial intelligence, and data visualization will be crucial in future research projects.

Elsevier’s education and development teams were also able to observe the frameworks the students used to approach the data problem. They saw first-hand the most common challenge for any end-user researcher: culling data for directional insight when developing a drug candidate. These observations can be fed back into future product and service development, helping to ensure that researchers’ needs are met.

What’s more, real-world initiatives such as the hackathon show the value of collaborating with parties that researchers and students wouldn’t normally have the chance to work with. With the right people in the room at the right time, those involved were able to work with and learn from a multidisciplinary team and to pool their knowledge. The event also gave students real-life experience that will prove invaluable when they enter the working world.
The volume of data will continue to grow

Data is the lifeblood of life science research today, and its volume will keep growing as technology advances. It is estimated that by 2020, knowledge will be doubling every 73 days. The industry, however, will continue to face significant roadblocks if it doesn’t overcome the problems caused by a lack of data standards.

Organizations must look to events such as the hackathon as a way of encouraging both collaboration and innovation, while preparing the next generation of scientists for the working world. Without such real-world initiatives, it will become harder for the industry to capture the key skills and evidence needed to deliver new therapies for all rare and endemic diseases.

It is an exciting time to be part of R&D, and the industry must make the most of the opportunity it has. With the right approach to the advanced technology we have access to, our potential to improve patient outcomes is significant.

Tim Hoctor, Vice President, Life Science Solutions Services at Elsevier