Study seeks to understand how different forms of data meet the needs of researchers.
Individual patient data (IPD) generated through randomized clinical trials (RCTs) are some of the highest-quality research data available for secondary use. Clinical trial data is collected with the patient’s consent from trials designed with clearly defined endpoints derived from objective assessments. Effective reuse of these data has the potential to transform the clinical research process, improve trial design and execution, and respect the patients who donate their time and their data as part of the clinical development process.1
The transformative potential of effective data reuse has resulted in calls from groups including the World Health Organization,2 the National Institutes of Health,3 the G7,4 regulators,5 and patient advocacy groups for sponsors to openly share and reuse clinical trial data. However, achieving widespread use of clinical trial IPD also requires data contributed by trial sponsors to retain as much research utility as possible. A trial sponsor’s data sharing policies determine what data and supporting documentation to share, where and how that data will be shared, and what data transformations are needed (e.g., to protect patient privacy or intellectual property).
The policies applied to secondary use data contribution have a direct impact on determinants of research utility. These determinants include data comprehensiveness and completeness;6 the supporting documentation and contextual information provided (e.g., transformation reports and data dictionaries); and the metadata to aid end-users in determining whether the contributed data will be fit for purpose for the intended research use.
Across trial sponsors, there is significant variability in data sharing policies and applied data protection methodologies.7 This variability creates substantial challenges for Data Sharing Platforms (DSPs) and their end-users, especially when research use cases require pooling or integrating IPD from multiple sponsors. It also hinders efforts to make the shared data “FAIR” (Findable, Accessible, Interoperable, Reusable).8
Two dimensions influence how well the data sharing ecosystem serves the needs of the research community. The first of these dimensions is access. Access can be thought of as the number of trials available for secondary use from a given sponsor and on a given DSP in conjunction with the DSP policies that govern research access.9,10
The second dimension and focus of this paper is data utility, or how well contributed data meets the needs of researchers. While this paper focuses on the needs of the researcher, the results are also important for data contributors. Data contributing organizations are often consumers of data. This is particularly true for biopharma sponsors, many of whom have made significant investments in internal data sharing platforms bringing together a sponsor’s internal data and data from external data sources.11 Besides being a potential end-user beneficiary of a well-functioning data sharing ecosystem, data contributors may also benefit by better meeting the research community's needs and thus ensuring their investment in data sharing infrastructure delivers a maximum return to the community.
Objectives:
The research scope included surveying information and elements that may be available to end-user researchers accessing or using secondary IPD clinical trial data irrespective of therapeutic area and specific research objective. These areas included:
Other areas that may impact research use and utility were out of scope. Specific research use cases may impact the information and elements required by IPD end-users; however, exploring the nuances of specific use cases was out of scope. The authors did not examine the research impact of dataset or documentation content or completeness. The area of patient consent and/or legal basis for data contributions was not included because, while relevant to data contributors,12 it was deemed to be outside the general scope of end-user considerations. The Data Protection methodology applied to the processing of datasets and documentation is an important determinant of data utility13 but was not included in this research because the process steps are generally not evident to the contributed data end-user. Information about the result of the data protection process (e.g., a variable-level transformation report) is included in the research scope.
The survey design goal was to determine which elements of data contribution are important to IPD end-users. The survey was designed to accommodate a range of Data Sharing Platform (DSP) access models, ranging from controlled, gated, or organization-only (internal) access to open data sharing platforms. The target audience was DSP end-users who requested, downloaded, accessed, or used clinical trial datasets containing IPD for secondary research purposes.
The survey content was derived by collating the datasets, documentation, and collected metadata for secondary clinical trial contributions across multiple DSPs. Survey questions were divided into three areas:
The survey responses were returned anonymously. However, respondents had the option to provide contact information, and, in some cases, verbatim responses included information that could identify the respondent and/or their organization.
Survey distribution represented both academia and industry and encompassed the spectrum of DSP access models. The survey was distributed to relevant research communities through multiple DSPs, biopharma companies, and non-profits. Due to the breadth of distribution, which may include organizations or channels the authors were not aware of, a list of distributing organizations is not provided. In some cases, organizations were provided with an organization-specific survey version to allow for subsequent comparison of their respondent population to overall responses. All distributed survey versions were identical. Other distribution included CRDSA newsletters, social posts (LinkedIn), and conference mentions.
Prior to analyses, the data from all organizational surveys were merged into a single master dataset. All respondent identifiers were anonymized by removing or redacting identifying information, including within verbatim responses and email addresses (if provided).
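The merge-and-redact step described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the field names (`contact_email`, `verbatim`), the email pattern, and the `[REDACTED]` placeholder are all assumptions, and identifying details in free text would in practice also need manual review.

```python
import re

# Assumed pattern for email-like strings; illustrative only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_text(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED]", text)


def merge_surveys(surveys_by_org):
    """Merge per-organization response lists into one master list.

    The optional contact field (hypothetical name) is dropped and the
    free-text answer (hypothetical name) is redacted.
    """
    master = []
    for responses in surveys_by_org.values():
        for row in responses:
            cleaned = {k: v for k, v in row.items() if k != "contact_email"}
            cleaned["verbatim"] = redact_text(cleaned.get("verbatim", ""))
            master.append(cleaned)
    return master
```

Dropping the contact field outright, rather than masking it, keeps the master dataset free of a column that serves no analytic purpose.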
All analyses were carried out using Google Data Studio.14 The data variables were grouped into nominal categories (categorical variables). All responses were summarized with descriptive statistics (number and percentage of responses to the different nominal categories), and no statistical hypothesis testing was carried out.
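The descriptive summary described above (number and percentage of responses per nominal category) amounts to a simple frequency count. The sketch below shows the arithmetic in Python purely for illustration; the study itself used Google Data Studio, and the function name and category labels here are hypothetical.

```python
from collections import Counter


def summarize(responses):
    """Return {category: (count, percentage)} for a list of nominal responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


# Example using the engagement counts reported in the results
# (99 of 104 respondents engaging one or more times per year);
# the category labels are placeholders.
engagement = ["one_or_more_per_year"] * 99 + ["less_than_yearly"] * 5
summary = summarize(engagement)
```

With these inputs, `summary["one_or_more_per_year"]` is `(99, 95.2)`, matching the 95.2% figure reported for overall engagement.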
Demographic sub-groups with similar characteristics were consolidated into categories to enable the interpretation of responses. Organization types were consolidated into three categories for response analysis:
To facilitate comparison by organization type, results for organization types with low response frequencies were not presented separately when the sub-group's responses were consistently similar to those of another sub-group; they are presented only where they differ. The computed total frequencies, however, include all sub-groups.
The secondary use data experience level was also grouped into three categories and relative frequencies were computed for the different organization types.
Survey respondents who identified themselves as “Power Users” or “Very Experienced” were combined into a single group.
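As a rough illustration of this grouping, the sketch below combines the two most experienced levels and computes relative frequencies per organization type. The raw label strings are assumptions. The group totals come from the reported results, but the split between "Power User" and "Very Experienced" within each combined group is invented for the example, and the biopharma Moderately Experienced count (16) is derived by subtraction from the reported total of 49.

```python
from collections import Counter, defaultdict

# Assumed raw labels; both map to the single combined group.
GROUPS = {
    "Power User": "Very Experienced/Power User",
    "Very Experienced": "Very Experienced/Power User",
}


def experience_by_org(rows):
    """rows: iterable of (org_type, raw_experience_level) pairs.

    Returns {org_type: {grouped_level: percentage of that org type}}.
    """
    counts = defaultdict(Counter)
    for org, level in rows:
        counts[org][GROUPS.get(level, level)] += 1
    return {
        org: {lvl: round(100 * n / sum(c.values()), 1) for lvl, n in c.items()}
        for org, c in counts.items()
    }


rows = (
    # Biopharma: 16 in the combined group (9/7 split is hypothetical).
    [("Biopharma", "Power User")] * 9
    + [("Biopharma", "Very Experienced")] * 7
    + [("Biopharma", "Moderately Experienced")] * 16
    + [("Biopharma", "One-Time User")] * 17
    # Academic / Non-Profit: 10 in the combined group (4/6 split is hypothetical).
    + [("Academic / Non-Profit", "Power User")] * 4
    + [("Academic / Non-Profit", "Very Experienced")] * 6
    + [("Academic / Non-Profit", "Moderately Experienced")] * 21
    + [("Academic / Non-Profit", "One-Time User")] * 12
)
table = experience_by_org(rows)
```

Percentages are computed within each organization type, so each inner dictionary sums to roughly 100% regardless of how many respondents that type contributed.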
Respondents were asked which DSPs they use. Respondents could select or enter multiple DSPs. To allow for analysis of the responses to all data sharing platform engagements (single or multiple), responses were grouped into five ordered categories:
The frequency of engagement was then computed and compared across the different organization types.
Further, response analysis of data sharing platform usage (single or multiple) was done by first applying SAFE data platform categorization per Bamford, Arbuckle et al. “Sharing Anonymized and Functionally Effective (SAFE) Data Standard”15 and then computing the frequencies of responses for the different organization types. See Appendix A for the list of Data Sharing Platforms (inclusive of free text responses) included in the analysis.
The purposes of data request/usage, such as peer-reviewed journal publication of new knowledge or insights, informing clinical trial design, generating safety or adverse event reports, and supporting investigational new drug (IND) applications, were categorized into three groups: academic research/publications, internal organization use, and regulatory submission. To allow for further descriptive analysis, responses were categorized as:
The frequency of usage (number and percentage) was computed for each organization type.
The number of studies requested (expressed in percentages) and the relative usability frequency or yield, as well as the reason for non-use and their frequencies, were also calculated. The usage yield was then grouped or categorized by respondent experience level to allow for comparative analysis.
Dataset types such as ADaM and SDTM, data model description documents such as data dictionaries, and study supporting documents (e.g., the study protocol) are key components of secondary use data requests. Supporting documents provide context to the supplied IPD datasets and aid the researcher in understanding trial design and execution factors that may impact their research use.
Respondents were asked to assess the dataset and document types that may be supplied by data contributors and assign each to one of four ordered categories: Mandatory for Use, Important, Useful, and Not Relevant/Not Required. The percentage of responses in each category was then calculated. For analysis purposes, Mandatory for Use and Important have been grouped into a single Mandatory/Important category.
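The collapsing of the two top ratings can be sketched as below. The four category labels come from the survey description; the function name and the sample ratings are made up for illustration and are not survey results.

```python
from collections import Counter

ORDERED = ["Mandatory for Use", "Important", "Useful", "Not Relevant/Not Required"]

# The two highest ratings are reported together.
COLLAPSE = {
    "Mandatory for Use": "Mandatory/Important",
    "Important": "Mandatory/Important",
}


def importance_percentages(ratings):
    """Return {category: percentage} after merging the two top ratings."""
    unknown = set(ratings) - set(ORDERED)
    if unknown:
        raise ValueError(f"unexpected rating(s): {sorted(unknown)}")
    counts = Counter(COLLAPSE.get(r, r) for r in ratings)
    total = len(ratings)
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Invented ratings for a single document type (not survey data).
sample = (
    ["Mandatory for Use"] * 5
    + ["Important"] * 3
    + ["Useful"]
    + ["Not Relevant/Not Required"]
)
result = importance_percentages(sample)
```

Validating labels before counting guards against a misspelled rating silently forming its own category.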
The authors are aware that certain sponsors furnish a study-specific data transformation report documenting the transformations at a variable level that took place during the data contribution anonymization process. The authors understand that this type of data transformation report is not routinely or uniformly provided by data contributors. To evaluate the value to the research community, respondents were asked to assess the importance of this type of report on a 1 to 5 scale from Not Important (1) to Mandatory for Use (5).
Respondents were also asked to evaluate the importance of having key metadata and study parameters prior to download or access. For analysis, responses on each type of metadata were grouped into three categories:
The frequencies, expressed as percentages, were then computed for each type of metadata.
The timing of access to certain supporting documents and other relevant information can provide important context prior to data request, access, or download. Based on the authors' experience, key documents or information that may assist in determining the suitability of a study for a particular research question or need can include a redacted study protocol and information about key data redactions such as adverse events, demographics, and laboratory values.
For the purpose of this research, survey questions on access timing covered the redacted study protocol and redaction information for adverse events, demographics, and laboratory values.
Response analysis of the timing of redacted study protocol access and relative importance was done by calculating the response frequencies for the different timings per the following categories: Critical, Important, Useful, and Not Needed.
For each redaction type, response frequencies were calculated for whether data redaction information was needed and, if so, when it should be made available to the researcher.
The analysis results were summarized in tabular formats. All survey data, redacted for respondent privacy as noted above, is available in the supplemental tables referenced in Exhibit A.
A total of 104 respondents completed the survey. Respondents were grouped by organization type and are presented in Table 1.
The number of Service or Technology Vendor responses was too low to draw conclusions reliably representative of the category. In almost all cases, the Service or Technology Vendor responses were not meaningfully different from those provided by biopharma respondents. Therefore, the presentation of comparison results focuses on Academic/Non-Profit and biopharma responses, noting where Service/Technology responses differ from the biopharma responses. “All” is inclusive of total responses from the three categories, including Service or Technology Vendor responses.
Biopharma experience levels were almost evenly distributed (Figure 1). Biopharma respondents were more likely to be in the Very Experienced/Power User Category (32.7%, n = 16 of 49) when compared to Academic / Non-Profit (23.3%, 10 of 43). Conversely, Academic respondents were more likely to be Moderately Experienced (48.8%, 21 of 43), with a lower percentage of One-Time Users (27.9%, 12 of 43) when compared to biopharma (34.7%, 17 of 49).
95.2% of overall respondents (n = 99 of 104) engaged with data sharing platforms one or more times per year (Table 2). There were no meaningful frequency differences between organization types.
On platform usage, respondents noted 10 unique DSPs, with an average of just under two platforms (1.93) per respondent. The responses were categorized by applying the SAFE data categories and are presented in Table 3.
The Internal Secure Research Reuse category (1) includes data warehouses/data marts internal to organizations. External data sharing platforms are categorized as Highly Secure and Controlled, Moderately Controlled, Minimally Controlled, and Open Data License platforms. Compared to Academic / Non-Profit respondents, biopharma respondents are more likely to use internal data platforms (20% versus 6%, respectively). Service and Technology vendor responses (included in “All”) are not summarized separately in the table due to the limited number of respondents; however, those responses did diverge from both biopharma and Academic / Non-Profit, with 33% (4 of 12) of mentioned platforms in the Open Data License category (5).
The respondents' purposes (secondary use) for data requests, by organization type, are shown in Figure 2 below:
Service and Technology Vendor responses showed characteristics of both other categories, with a 92% response rate for Academic Research use (approaching Academic / Non-Profit response frequency) and a 42% Regulatory Submission response rate (similar to biopharma response frequency).
Respondents' usage yield by experience level is presented in Table 4 below.
As the respondent experience level decreases, the usable study yield decreases, with 100% yield decreasing by 29.3% for One-Time Users compared to the most experienced respondent group.
Responses were consistent across organization types and respondent experience levels and are presented as a pooled group in Tables 5 and 6.
Responses on the importance of the variable-level data transformation report were also consistent across organization types and respondent experience levels and are presented as a pooled group in Figure 3.
Results of the importance of key metadata required by researchers prior to access to study data are presented in Table 7 below:
Timely provision of supporting information and documentation can provide researchers with information to determine study suitability prior to data request, access, or download.
The results on when researchers require certain study supporting information and documentation, and when they require study data redaction information, are shown in Tables 8 and 9, respectively.
68.3% of all respondents evaluated redacted protocol access Prior to Data Request as Critical or Important, with 72.1% placing Critical or Important value on having access to the redacted protocol prior to study access or download.
The results on when transparency about adverse events, demographics, and laboratory values redactions is needed are presented in Table 9.
A high proportion of respondents overall identified a need for redaction transparency: 89.4% for Adverse Events and 93.3% each for Demographics and Laboratory Values, with almost half (46%, 48%, and 49%, respectively) responding that the information is needed prior to accessing or downloading the study. There were no major respondent demographic differences in requiring the availability of the different data redactions at study access/download.
The information from the survey results provides a snapshot of the current landscape of secondary use of clinical trial data through internal and external data sharing amongst the different organization types and use cases represented by the respondents.
Of the 104 respondents, the survey had a balanced sample of responses between biopharma companies and academic institutions/non-profit organizations (47% vs. 41%). Respondents were generally experienced users of one or more DSPs, with just over 68% indicating a moderate to power user experience level. This was supported by the frequency of DSP engagement, with over 95% of respondents engaging with DSPs one or more times a year. This level of experience and frequency of engagement may indicate that experienced users act as conduits to broader research teams within their organization. It's also notable that responses regarding the IPD data elements (datasets, documentation, and metadata) were consistent across organization types and experience levels. This suggests that the survey responses are informed by direct experience and represent the needs of broader teams using secondary clinical trial data.
Across all organization types, almost 4 in 5 respondents indicated usage of DSPs categorized as Internal Secure Research Reuse and Highly Secure and Controlled. From a data contributor perspective, these two categories provide the most control over downstream access and use and are, therefore, generally considered the “safest” data sharing environments. Given the usage prevalence, it is reasonable to assume that respondents generally calibrated expectations to what would, or should, be available on these more restrictive platforms. Based on the authors’ direct experience, it is also worth noting that some internal platforms may include both organization-specific secondary use data and datasets sourced from external DSPs.
The research objectives included understanding key factors of data utility and determining relative value to researchers of various types of IPD datasets, supporting documents, and metadata. It’s notable that responses regarding the importance of the various elements, as well as timing for researcher access, were consistent across all user types. While it is clear that certain documents or metadata are more universally useful (e.g. the Study Protocol), it’s important to remember that lower-ranking elements may be equally important for a specific research question or use. Also noteworthy is that when the Useful responses are included, the percentages rise to over 90% for almost all supporting documents and metadata elements.
An important piece of information that sometimes is underappreciated is an anonymization report describing the variable-level data transformations that took place. The survey results support the potential value of this report, with just under 79% of respondents rating its value as 4 or 5, where 5 was Mandatory for Use. The survey results further support the importance of redaction transparency to researchers, with 89.4% to 93.3% of respondents indicating a need for Adverse Event, Demographics, and Laboratory Values redaction transparency.
While the survey responses are consistent, a 2022 CRDSA review of publicly available biopharma data sharing policies13 found significant variability in the datasets and documentation sponsors make available to researchers. This suggests that community-wide agreement on provided supporting documentation and transformation information would be highly valuable for evaluating the potential utility of shared data.
The minority of respondents not seeing value in particular documents or supporting information may be an artifact of their specific research use but more likely reflects a knowledge gap. This highlights a need to educate the research community, particularly less experienced secondary data users, on the value and use of the various supporting documents and metadata elements.
When embarking on this research, the authors hypothesized that the timing of information availability is a critical consideration. The survey results support this hypothesis, indicating the importance not only of the provision of certain information, including the redacted study protocol and key redaction information, but also of access timing. One of the biggest challenges faced by researchers is ensuring that the data requested/accessed is fit for purpose for their specific research question or scientific problem. Over 97% of respondents indicated the importance of access to the study protocol prior to data request or data access/download. This suggests that providing key supporting documents and metadata earlier in the data access process can increase the likelihood that requested data will be used for its intended purpose and may lead to a significant increase in meaningful data reuse.
As we reflect on the survey results, it’s clear that to best serve the research community, there is a need for the data sharing ecosystem to establish and promulgate standards for the provision and timing of supporting documents and information provided as part of secondary data study contributions. Common standards for sharing data intended for secondary data use can create process efficiency and information transparency that would benefit the research community and, equally important, benefit data contributors by ensuring their investment in data preparation time and resources will maximize research outcomes. Greater standardization in the information shared and made available through DSPs could accelerate the reuse of trial data, improving clinical trial development and enabling innovative new medicines to reach patients faster.
Ernest Odame, director, global evidence and outcomes, oncology, Takeda Pharmaceuticals USA, Inc; Tracy Burgess, director, PD data sciences, Genentech, a member of the Roche Group; Luk Arbuckle, chief methodologist, Privacy Analytics, an IQVIA company; Andrei Belcin, senior data analyst, Privacy Analytics, an IQVIA company; Gwenyth Jones, data science and statistics, University of Michigan; Peter Mesenbrink, executive director, biostatistics, Novartis; Ramona Walls, executive director of data science, Data Collaboration Center, Critical Path Institute; and Aaron Mann, CEO, Clinical Research Data Sharing Alliance