The Backbone of Innovation: High Quality Datasets

Feature Article

Understanding the benefits and concerns associated with implementing datasets into clinical trial workflows.

Paul Mancinelli, PhD, is the Chief Technology Officer of WCG

In today’s rapidly evolving technological landscape, robust datasets have become indispensable in clinical trials. With innovations such as artificial intelligence (AI) and large language models (LLMs), the quality and reliability of datasets form the backbone of these advancements. High-quality datasets (defined as meeting approved measures of accuracy, completeness, consistency, timeliness, relevance, validity, granularity, documentation, and so on) drive efficiency and streamline participant recruitment, retention, and site identification. Datasets also foster groundbreaking innovations within clinical trials, such as Google AI's development of a system that interprets chest x-ray scans for early signs of tuberculosis. They enable comprehensive analyses, shorter study start-up timelines, and greater efficiency, and they support the development of targeted therapies. As this technology continues to develop, it is paramount to consider data privacy, informed consent, potential bias, and the quality of input information when using datasets to create AI algorithms.

The role and impact of datasets

Data in and of itself can be interesting, but the true value of data lies in how you use it and the insights that come from it. When high-quality data is fed into an AI model, that is when it becomes useful, insightful, and impactful.

But not all data will do. The data fed into these models and AI algorithms needs to be vetted for accuracy in order to reduce bias and produce accurate predictions. A high-quality dataset is an essential foundation for building AI models and algorithms that can significantly improve all parts of clinical trials, from participant recruitment to outcomes.
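
As a minimal sketch of what such vetting can look like in practice, the following example (assuming tabular trial data in a pandas DataFrame with hypothetical column names) computes a few of the quality measures mentioned above, such as completeness, validity, and consistency.

```python
# Minimal sketch of automated data-quality checks; column names are hypothetical.
# Real pipelines validate against a data dictionary and the study's approved measures.
import pandas as pd

def quality_report(df: pd.DataFrame, required_cols: list[str]) -> dict:
    """Return simple completeness, validity, and consistency metrics."""
    report = {}
    # Completeness: share of missing values per required column.
    report["missing_rate"] = {
        col: float(df[col].isna().mean()) for col in required_cols if col in df
    }
    # Validity: example range check on a hypothetical 'age' field.
    if "age" in df:
        report["invalid_age_rows"] = int(((df["age"] < 0) | (df["age"] > 120)).sum())
    # Consistency: duplicate participant identifiers, if present.
    if "participant_id" in df:
        report["duplicate_ids"] = int(df["participant_id"].duplicated().sum())
    return report

# Example usage with toy data.
df = pd.DataFrame({"participant_id": [1, 2, 2], "age": [34, -5, 61]})
print(quality_report(df, ["participant_id", "age"]))
```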

Challenges in utilizing and incorporating datasets

There is currently a lack of consistency across the data that we bring into clinical trials, and it can even depend on location. The U.S. and the E.U., for example, have standardized listings of clinical trials that much of the rest of the world, including regions such as Asia and Africa, lacks. This lack of data normalization across the world is a challenge that also appears on a smaller scale at sites and within companies. It needs to be addressed for future data sharing, interoperability, and comparison of clinical trial outcomes across the globe. Companies at the forefront of innovation are now normalizing and standardizing business and technical definitions across clinical trial datasets and protocols, then building their own interoperable technology platforms. Efforts like these will be needed across the industry to address this global challenge.

Another potential hurdle is that most of the data being brought into these AI models is free text. One example is finding patients by scanning electronic medical records (EMRs), a large portion of which consists of doctors’ notes. Manually pulling this information is slow and introduces opportunities for human error, which can later affect those datasets and the models that run on them. The lack of digitization in how these types of data are collected is another area that needs to be addressed in order to maintain high-quality datasets.
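
For illustration only, the snippet below shows one very simplified way a structured value could be pulled out of a free-text note automatically; the note text, field, and pattern are invented, and real systems rely on clinical NLP or LLMs with careful validation rather than a single regular expression.

```python
# Hypothetical sketch of extracting a structured value from a free-text clinical note,
# standing in for the manual chart review described above.
import re

NOTE = "Pt reports stable pain. HbA1c 7.2% on 2024-01-15; continue metformin."

def extract_hba1c(note: str) -> float | None:
    """Return the first HbA1c value found in a note, if any."""
    match = re.search(r"HbA1c\s*(\d+(?:\.\d+)?)\s*%", note, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

print(extract_hba1c(NOTE))  # 7.2
```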

There is also a lack of interoperability in how data is treated across the board. The variety of software on the market means it can be difficult to incorporate datasets in a consistent way, especially if they do not share a common business definition. As a result, a large share of time is spent normalizing data so that every data field is defined in the same way. Subject matter experts and thought leaders currently do, and should continue to, come together to agree on the business terms of a data field to ensure interoperability across all trials. Once a common business definition for a field is secured, it becomes possible to begin incorporating automation into the mastering and normalization of the data.
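
The sketch below illustrates the idea of mastering data to a common business definition: source-specific field names and coded values (all hypothetical here) are mapped onto one shared schema before the data is combined.

```python
# Illustrative sketch of normalizing source fields to a shared business definition.
# The field mappings and value codes are hypothetical, not an industry standard.
FIELD_MAP = {
    "site_a": {"subj_sex": "sex", "dob": "date_of_birth"},
    "site_b": {"gender": "sex", "birth_date": "date_of_birth"},
}
VALUE_MAP = {"sex": {"m": "male", "f": "female", "male": "male", "female": "female"}}

def normalize_record(source: str, record: dict) -> dict:
    """Rename source-specific fields and harmonize coded values."""
    mapping = FIELD_MAP[source]
    out = {}
    for field, value in record.items():
        canonical = mapping.get(field, field)
        if canonical in VALUE_MAP and isinstance(value, str):
            value = VALUE_MAP[canonical].get(value.strip().lower(), value)
        out[canonical] = value
    return out

print(normalize_record("site_a", {"subj_sex": "F", "dob": "1980-03-02"}))
print(normalize_record("site_b", {"gender": "male", "birth_date": "1975-11-20"}))
```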

How AI is currently being implemented into clinical trials

We are currently able to aggregate and implement data into LLMs, machine learning, and other AI models. These models are increasingly being built into everything that we do, even outside the clinical trial space. In participant recruiting, data matching is an example of what can be simplified with the power of AI: we can look at medical records and use AI to compare them against the inclusion and exclusion criteria in a protocol to see whether a participant would be a good fit for the trial. Using AI algorithms in this way can help expedite the recruitment process without sacrificing the quality and integrity of the study start-up process.
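
As a simplified illustration of this kind of matching, the sketch below screens a structured patient record against inclusion and exclusion criteria expressed as predicates; the criteria and field names are hypothetical, and a production system would also need to interpret free-text EMR content.

```python
# Simplified sketch of screening a structured patient record against protocol criteria.
from typing import Callable

Patient = dict
Criterion = Callable[[Patient], bool]

INCLUSION: list[Criterion] = [
    lambda p: 18 <= p.get("age", 0) <= 75,
    lambda p: "type_2_diabetes" in p.get("diagnoses", []),
]
EXCLUSION: list[Criterion] = [
    lambda p: p.get("egfr", 100) < 30,   # severe renal impairment
    lambda p: p.get("pregnant", False),
]

def is_candidate(patient: Patient) -> bool:
    """True if all inclusion criteria pass and no exclusion criterion triggers."""
    return all(c(patient) for c in INCLUSION) and not any(c(patient) for c in EXCLUSION)

print(is_candidate({"age": 54, "diagnoses": ["type_2_diabetes"], "egfr": 82}))  # True
```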

These models can also work proactively by warning about anomalous data entries. In a very simple example, if a participant in an analgesic trial reports the same level of pain every day for an extended period of time, these models can flag that as a data anomaly. More complex outlier data points can be flagged in a similar way. The warning then allows site staff to look more closely and determine whether there is anything of concern that needs to be addressed. Catching these instances early and being able to address them enables more accurate representation in clinical endpoints and keeps the trial on track.
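
A toy version of that check might look like the following; the 14-day window and data layout are assumptions for illustration, not a validated monitoring rule.

```python
# Toy illustration of the anomaly check described above: flag a participant whose
# daily pain score has not changed over a configurable window.
def flag_flat_scores(daily_scores: list[int], window: int = 14) -> bool:
    """Return True if the last `window` entries are all identical."""
    recent = daily_scores[-window:]
    return len(recent) == window and len(set(recent)) == 1

scores = [5] * 20  # same pain level reported for 20 consecutive days
if flag_flat_scores(scores):
    print("Anomaly: identical pain scores for 14+ days; route to site staff for review.")
```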

It is also possible to use LLMs and AI to analyze study protocols and determine how optimal they are. We can see where certain criteria may be exclusionary or lead to an increased burden on participants, which can affect recruitment and retention. These can be instances where the location of a site is less than optimal, requiring long commutes or overnight accommodations. With the help of advanced datasets, we can use AI algorithms to find areas of unaddressed patient burden in study protocols and revise them before they can derail a study. Ultimately, we are able to optimize the set of participants and sites to ensure the highest probability of successful execution of a clinical trial.
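
As one illustrative, hypothetical way of quantifying participant burden, the sketch below scores candidate sites by travel distance, visit count, and the need for overnight stays; the weights are arbitrary, and real protocol-optimization models weigh far more factors.

```python
# Illustrative participant-burden score for candidate sites; weights are hypothetical.
def burden_score(distance_km: float, visits: int, overnight_needed: bool) -> float:
    """Higher scores indicate greater expected participant burden."""
    score = 0.02 * distance_km * visits      # cumulative travel over the study
    if overnight_needed:
        score += 5.0 * visits                # flat penalty per overnight stay
    return score

sites = {"Site A": (12, 10, False), "Site B": (180, 10, True)}
for name, (dist, visits, overnight) in sites.items():
    print(name, round(burden_score(dist, visits, overnight), 1))
```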

Various other manual processes can be automated with the help of AI algorithms, including the following (a brief sketch of the document-comparison item appears after the list):

  • Patient identification by processing electronic health records (EHR) and lab data.
  • Comparing new and old versions of government regulatory documents to identify changes.
  • Analyzing Medicare coverage prior to patient participation.
  • Translating site/participant-facing documents into different languages more accurately and efficiently.
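
As a brief sketch of the document-comparison item, the example below uses Python’s standard difflib to surface changes between an old and a new version of a regulatory document; the document text is invented for illustration.

```python
# Hedged sketch of automated document comparison using Python's standard library.
import difflib

old_text = ["Reports must be filed within 30 days.", "Sites must retain records for 5 years."]
new_text = ["Reports must be filed within 15 days.", "Sites must retain records for 5 years."]

diff = difflib.unified_diff(old_text, new_text, fromfile="old_version", tofile="new_version", lineterm="")
print("\n".join(diff))
```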

Looking forward

As we look to the future of using datasets in advanced algorithms, there are a few considerations to keep in mind. One is the responsible, safe, and ethical use of data and AI in clinical trials, which industry groups are in the process of defining. Bias can creep into AI models, so being aware of this allows us to build these models ethically and ensure that no biases are built in that would further skew the output data.

Another potential concern is data privacy and ensuring that, when these datasets and algorithms are used, trials abide by HIPAA and other regulations. Having informed consent from participants to use the data procured in the trial is also paramount to data privacy. Additionally, cybersecurity should be top of mind when handling these datasets. Whether through encryption or other means, ensuring the safety of participants’ data will be a key focus, now and as the technology continues to develop.

Furthermore, having transparency into the algorithms can be difficult but is necessary to instill trust in the accuracy of the output. Since these datasets are so vast and the information these algorithms use is so broad, it can be difficult to understand how certain LLMs arrive at an answer. Having that knowledge is another way to ensure the quality of the output and avoid unwanted bias tainting findings.

While there are concerns and considerations around using datasets, from ensuring the accuracy of the data to the safety of using it, these advancements will further improve current processes in clinical trials. AI algorithms and LLMs are already being used throughout the industry and will only become further integrated into all aspects of clinical trials. With well-defined datasets, we are able to apply generative AI, machine learning, and other analytical techniques to accelerate clinical trials, including identifying optimal sites and patients more quickly across geographies and all therapeutic areas. Understanding the benefits and concerns early on will mean more positive progress in clinical trials and greater importance placed on the development and use of high-quality datasets.

Paul Mancinelli, PhD, is the Chief Technology Officer of WCG
