Presenting how data transformation is measured as part of the SAFE Data Standard, which data variables influence the rating, and how the appropriate level of data transformation is calculated.
The clinical trial transparency landscape has been evolving, with rising expectations of openness and disclosure by all trial sponsors. Transparency brings benefits to trial participants, clinical trial sponsors, regulators, the scientific community, and, ultimately, patients.1-4
Disclosing trial information can inform funders and researchers on what trials are needed, enable better decision making by those who use evidence from trials, support a more robust system of quality, and foster trust between the public, sponsors, and regulators. Moreover, reuses of trial data by sponsors can improve the speed and effectiveness of future R&D.5 For example, combining data from trials can enable meta-analysis and the evaluation of new hypotheses, support improvements to trial design, foster innovation using artificial intelligence and machine learning, and enable other analysis and data science (eg, supporting pharmacoepidemiology by identifying participant groups who are at increased risk of an adverse event associated with a drug).
Research on the views of trial participants supports efforts to facilitate the sharing of clinical trial data. Although neither trial participants nor the public were consulted for this paper, participants generally believe that reuse of their data will speed up the research process and lead to greater scientific benefit.2 They are also generally comfortable when trial data is reused and even expect that this will happen.
In some cases, disclosing anonymized trial information is required for market authorization, such as clinical study document publication under the European Medicines Agency (EMA) Policy 0070 and Health Canada’s Public Release of Clinical Information (PRCI); however, in most cases, sharing or reusing anonymized trial data is voluntary and remains an important consideration for the sponsor.6 Sponsors should consider the benefits of and ethical considerations in data sharing, recognizing both reputational benefits and risks. Sharing and reusing data can build trust with stakeholders by bringing new life to trial data, furthering trial transparency and supporting health research while easing the burden on trial subjects.7 At the same time, ethically questionable uses of data can erode trust, even when privacy is protected and any identifying information about trial participants has been removed.
The SAFE Data Standard (a standard for Sharing Anonymous and Functionally Effective data) focuses on protecting privacy so that non-identifiable clinical trial data can be shared for ethical secondary uses and disclosures. Anonymization is a key privacy-enhancing tool for responsible data sharing, and trial sponsors should ensure such data are otherwise ethically used and consistent with broader organizational objectives.8
To share or reuse trial data while protecting the privacy of trial participants, trial sponsors first anonymize the data.4,9 While the primary focus of this SAFE Data Standard is on structured individual participant data given its analysis-friendly format, the term “data” and the use of the SAFE Data Standard can apply more broadly to other information collected or produced during a clinical trial, including clinical study documents. As an example, under the EMA’s Policy 0070 and Health Canada’s Public Release of Clinical Information,10,11 trial sponsors are required to anonymize clinical study documents so that they can be published on the EMA and Health Canada data access portals, respectively.
Statistical (or quantitative risk-based) anonymization measures the probability of re-identifying individuals through indirectly-identifying pieces of information—such as demographic information, medical history, and medical event dates—and then reduces this probability through the use of various data transformations, such as shifting dates, generalizing disease classifications or demographic values, or removing (suppressing) outlier values in the data.12
The anonymization process renders data non-identifiable, such that the probability of re-identifying trial participants in the data is sufficiently low.13,14 Identifiability can be viewed along a spectrum.15,16 As the data are increasingly transformed, the identifiability of the data is gradually reduced until it reaches a level that is below the applicable anonymization threshold. At this point, the data are no longer identifiable. The appropriate threshold is determined based on data disclosure precedents, industry benchmarks, and/or regulatory guidance. For example, both the EMA and Health Canada recommend a threshold of 0.09,11,17 which is equivalent to having at least 11 similar-looking individuals in every group, based on the well-understood and widely adopted concepts of cell-size rules and k-anonymity.18
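To make this equivalence concrete, the probability of correctly matching an individual who is one of k indistinguishable records is 1/k, so a cluster of 11 similar individuals corresponds to a match probability of roughly 1/11 ≈ 0.09. The short sketch below illustrates this relationship; it is a simplified illustration, not part of the regulatory guidance.

```python
# Illustrative only: the cell-size rule relates a minimum cluster (group) size
# to the maximum probability of re-identification for any record in that group.
def max_reid_probability(min_cluster_size: int) -> float:
    """If an individual is one of `min_cluster_size` indistinguishable records,
    the probability of a correct match is at most 1 / min_cluster_size."""
    return 1.0 / min_cluster_size

print(max_reid_probability(11))  # ~0.0909, matching the commonly cited cell size of 11
```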
Many of the publicized re-identification attacks pertain to data that were minimally transformed or pseudonymized, with no other controls in place.19-21 These examples demonstrate potential vulnerabilities and, as with any scientific discipline, serve as evidence to inform and evolve the field.22 Statistical anonymization that considers all data variables and the applicable technical and organizational controls is consistent with best practices and regulatory guidelines. Novartis commissioned a motivated intruder test to evaluate the strength of the privacy protection provided by anonymization under EMA Policy 0070, and the published results reported zero high-confidence matches despite substantial effort expended per record.23
The level of identifiability in the trial data is determined by the similarity of participants in the data compared to the population. But contextual factors also matter. The more data a researcher has available to link or combine with the trial data, and the less restricted the use and environment of the trial data, the more likely re-identification becomes.
This generalized concept of breaking identifiability down into sub-dimensions (eg, data and context) along a spectrum has been used in disclosure control to guide data access decision making for years.24-27 For example, the Five Safes28 framework has been widely adopted and can be used in a process of anonymization to balance multiple considerations in making data safely accessible.29,30 Extending the concept to clinical trial data disclosure, particularly given the global context in which clinical trials are conducted across privacy jurisdictions, can empower organizations to make decisions more efficiently and consistently. Despite a wide range of global stakeholders in the clinical trial data sharing landscape, the multiple contexts in which clinical trial sponsors share these data are relatively common and consistent across the industry, with shared data-sharing platforms and portals commonly used by many sponsors to enable researcher access to these data. These shared interests create opportunity for greater standardization. With the public benefits of making clinical trial data available for reuse,1-5 alongside inherent challenges with anonymizing these data types (for example, a typical study can have thousands of intricately correlated identifiable variables about each trial participant in the structured participant data), there is a practical need for a common framework or shared standard for sponsors to use in coordination with research platform hosts.4,31
As part of the anonymization process, one must consider the likelihood of an opportunity (including an attack) to identify data subjects.32 This involves assessing potential threats, or all the means reasonably likely to be used to identify data subjects, including deliberate re-identification attacks, inadvertent recognition of a close acquaintance by someone using the data, and data breaches (all referred to as “re-identification opportunities”).33 The re-identification opportunities over time should also be contemplated, with data retention, disclosure, and periodic assessments all being important considerations.
To support standardization, a process and framework for modelling data identifiability is needed to address a range of contextual re-identification opportunities. There are different ways in which identifiability can be modelled, and we opt to provide a conceptual representation of previously published and adopted statistical anonymization methodologies for measuring and managing re-identification risk. Industry consortia, such as PHUSE,34 TransCelerate35 and the Clinical Research Data Sharing Alliance (CRDSA),36 play a role in promoting standardization in the exchanges of clinical trial data and may advance this conceptual representation to help meet practical implementation needs.
The conceptual framework and basis for standardization we introduce for the SAFE Data Standard can be extended to other forms of data, such as the outputs from remote query systems or synthetic data, assuming that privacy metrics can be established and enforced under varying contexts. As an example, Stadler, Oprisanu and Troncoso recently evaluated synthetic data generation (with and without differential privacy) and found that the principles for assessing synthetic data are similar to those followed when assessing transformation methods for anonymization. The authors demonstrate empirically that synthetic data does not provide a better privacy-utility trade-off than transformation techniques used to anonymize data.37 The practical goal in all cases is to identify the disclosure contexts shared frequently across clinical trial sponsors and align privacy metrics to the contextual risks associated with each, for consistency and greater standardization in how data are shared across these contexts.
The process we adopt for measuring and managing identifiability is described by the equation Data x Context = Identifiability, where Data is the probability of re-identification given a re-identification opportunity, and Context is the probability of a re-identification opportunity. This product establishes the level of identifiability of a data set in a particular context. Moreover, we can define an inequality based on an identifiability threshold that a data set must not exceed to be deemed anonymized, using Data x Context ≤ Threshold.
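As a simple illustration, the inequality can be evaluated directly once the two probabilities are estimated. The sketch below uses assumed probability values for illustration only; the SAFE Data Standard does not prescribe these particular numbers.

```python
# A minimal sketch of the identifiability inequality: Data x Context <= Threshold.
# The probability values used here are assumptions for illustration.
def is_anonymized(data_prob: float, context_prob: float, threshold: float = 0.09) -> bool:
    """data_prob: probability of re-identification given an opportunity.
    context_prob: probability of a re-identification opportunity."""
    identifiability = data_prob * context_prob
    return identifiability <= threshold

# Example: data measured at 0.2 (an average cluster size of 5) shared in a
# context where the probability of an opportunity is assumed to be 0.3.
print(is_anonymized(data_prob=0.2, context_prob=0.3))  # 0.06 <= 0.09 -> True
```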
As previously highlighted, the process we use to measure and manage identifiability as a basis for the SAFE Data Standard can be extended in practice to accommodate additional measures of privacy. Within this context, the tolerance and “threshold” concept can be interpreted more broadly as a similarity metric and further extended to other privacy metrics,38 such as those applicable to other techniques for disclosure control (eg, differential privacy for remote query systems or synthetic data).39 The process can also be augmented with inspiration from other industry implementations of data disclosure management,40,41 once there is an established baseline and framework within which standardization in the sharing and reuse of clinical trial data can be adopted globally. We prioritize pragmatism in the SAFE Data Standard given the inherently subjective nature of disclosure control and diminishing returns in practice from advanced usage of objective measures.42 In balancing pragmatism with the need for an objective basis for a rating system, we hope to further develop the proposed standard should it be effective in enabling data sharing consistency and efficiency.
The specific factors considered for the inequality on identifiability described above are illustrated in Figure 1.
As trial data are increasingly shared and reused, a variety of data sharing platforms have been implemented to facilitate the process. Some were implemented to enable open access to trial documents, such as the EMA and Health Canada portals.11,17 Some were implemented to enable data sharing by sponsors with independent researchers, such as the Vivli platform,43 Yale’s Open Data Access (YODA) Project,44 and the Clinical Study Data Request consortium, ClinicalStudyDataRequest.com (CSDR).45 Others were designed to foster collaboration among pharmaceutical sponsors, such as TransCelerate.
Because contextual factors—such as platform security and enforceable terms of use—influence the likelihood of re-identification, the degree to which data are transformed in the anonymization process should be commensurate with these controls. However, without a common standard to define the degree of transformation, sponsors and platforms may adopt inconsistent methods, potentially resulting in unnecessary erosion of data utility or weaker privacy protection than needed.
Inconsistencies in the availability of individual participant data for meta-analyses have been documented, with anonymization cited as one of the barriers.46,47 The SAFE Data Standard may ease the data sharing burden for sponsors and bring greater consistency to data shared with researchers for quality meta-analyses. It can also minimize erosion of data utility through the process of anonymization, an important concern for secondary research,48,49 ultimately promoting better outcomes from the reuse of these data.
To promote standardization and efficiency in the sharing of data, this paper proposes a SAFE Data Standard rating corresponding to a certain level of data transformation that can be used to quickly align stakeholders and effectively protect privacy. Because the design of a data-sharing portal (eg, security controls) and terms of use remain relatively constant over time for a single data platform, and certain characteristics of clinical trial data are constant, the level of data transformation needed to protect privacy can be standardized along a common scale from 0 to 5, where 0 is the raw trial data (often referred to as “coded” data due to clinical trials being blinded) and 5 is data transformed to the full extent required for access under an open data license or similar terms of use (eg, publication on EMA50 or Health Canada transparency portals).
This concept is illustrated in Figure 2, with each rating having a defined context and degree of data transformation described further in the following sections. While data utility remains higher with statistical anonymization than with traditional methods such as redaction, the relative decrease in data utility reflects the degree to which data are transformed to compensate for an absence of other mitigations (such as security controls).
To maintain an adequate level of privacy, each level of data transformation on the 5-point scale also requires appropriate data protection measures, such as security and privacy controls and user contracts. The less the data are transformed, the greater the protections they will require. For each level on the scale, the standard specifies the appropriate measures for protecting the privacy of participants in the data. (Each of these levels is further defined in Figure 6 in the final section.)
If the data were made public without terms of use (eg, posted publicly on Google with no published terms of use), the data would have even less contextual protection than what is specified by a level 5 rating. The complete public release scenario is not addressed by the SAFE Data Standard, though it may warrant transformations greater than those recommended for level 5. If those accessing data do not agree to any terms of use, the data become even more susceptible to demonstration attacks. Demonstration attacks are typically launched by the media or academics striving to prove that re-identification is possible.51,52 Given that an equivalent level of transparency can be attained through approaches adopted by the EMA and Health Canada, publishing clinical trial data with no terms of use is not typically required or recommended.
The 5-point rating is valuable because it communicates to all viewers not only the level of transformation to the data set itself, but also the protection measures that one would expect to find on a platform with a given rating. Moreover, the standard specifies the protection measures that platforms must implement to accommodate data at a particular utility level while maintaining adequate privacy. The result is a simplified concept of a numeric rating that can quickly be used to classify a data sharing platform.
Each rating level prescribes an appropriate level of data transformation along two dimensions. The first dimension is the strict data tolerance, or equivalent minimum cluster size of similar-looking individuals across all trial participants. This is also known as a group size, which is related to the concept of equivalence classes in k-anonymity while accommodating different implementations of the same concept in complex data types, such as longitudinal clinical data.33 The second dimension is the average data tolerance, or equivalent average cluster-size value, across all trial participants.
Cluster size is determined by the number of individuals who share the same indirectly identifying information. Figure 3 provides an illustrative example. In Figure 3, the highlighted data subjects form a cluster size of three since they all share identical values for the indirect identifiers of gender and year of birth.
If the minimum cluster-size value in this data set is two (strict tolerance level of 0.5), then every subject in the data set must have the same indirect identifier values as at least one other subject. In contrast, if the average cluster-size value is five (average tolerance level of 0.2), then the individuals in the data set must on average have the exact same indirect identifier values as four other subjects in the data. If a data set does not meet the desired minimum and average tolerance levels, then the indirect identifiers in the data set must be further transformed.
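The sketch below shows one way these quantities could be computed for a small, hypothetical set of indirect identifiers (echoing the Figure 3 example). Tolerances are taken here as the reciprocal of the minimum and average cluster sizes, a simplification of the statistical estimators used in practice.

```python
import pandas as pd

# Hypothetical indirect identifiers; three subjects share gender F and year of birth 1961.
df = pd.DataFrame({
    "gender":        ["F", "F", "F", "M", "M", "F"],
    "year_of_birth": [1961, 1961, 1961, 1958, 1958, 1972],
})

# Cluster size = number of subjects sharing the same indirect identifier values.
cluster_sizes = df.groupby(["gender", "year_of_birth"]).size().rename("cluster_size")
per_subject = df.merge(cluster_sizes.reset_index(), on=["gender", "year_of_birth"])["cluster_size"]

min_cluster = per_subject.min()    # strict dimension (smallest cluster any subject falls in)
avg_cluster = per_subject.mean()   # average dimension (mean cluster size across subjects)
print(1 / min_cluster, 1 / avg_cluster)  # strict and average tolerances
```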
Average tolerances are relevant for private data-sharing releases in which the target of an adversary attempting re-identification could be any data subject (for example, an acquaintance such as an ex-spouse). The reason a strict condition is still applied to private releases is to ensure that no individual in the data is unique in the defined population.33 The strict condition helps prevent “singling out” and is applied in private releases to indirect identifiers that may be used to single individuals out (eg, demographics). Recital 26 of the GDPR explicitly mentions singling out as a means of identification and states that “all the means reasonably likely to be used” need to be considered. Singling out would therefore be one such method that would always seem reasonably likely (if not a prerequisite) for the purposes of identification.53-55 Eliminating the ability to single out individuals is therefore a minimum condition, and, depending on the context and risks, average tolerance can then be evaluated for larger cluster sizes.
Tolerances also need to reflect real-world risks, which means evaluating cluster sizes for the population of individuals that gave rise to the data itself. The population we are concerned with is the one that contributes to the ability of someone to identify an individual in the shared or released data set. This may include the trial population, the population of similar trials, and the population in the same geographic area.9 Cluster sizes to determine identifiability are therefore evaluated using statistical estimators for the defined population.
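As a deliberately simplified illustration of this point, the sketch below scales an observed cluster size by an assumed sampling fraction to approximate the corresponding population cluster size. The estimators used in statistical disclosure control are considerably more sophisticated, but the direction of the adjustment is the same.

```python
# Simplified, hypothetical illustration: a cluster observed in the trial data will
# generally correspond to a larger cluster in the defined population.
def estimate_population_cluster(sample_cluster_size: int, sampling_fraction: float) -> float:
    """Naively scale an observed cluster size by the assumed sampling fraction."""
    return sample_cluster_size / sampling_fraction

# A cluster of 3 participants, assuming the trial enrolled roughly 10% of the
# similar individuals in the defined population (an assumed value).
print(estimate_population_cluster(3, 0.10))  # ~30 similar individuals in the population
```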
When data are being made public, the minimum cluster size is more applicable in the statistical modeling because demonstration attacks are a risk. In a demonstration attack, an individual’s motive is to simply demonstrate that re-identification is possible, so the most identifiable record in the data is at greatest risk. Accordingly, for public releases, assume that an attack will occur and ensure a large minimum cluster size to protect against all types of attacks.
The degree to which data need to be transformed is determined by both the appropriate threshold and the context of the intended data disclosure (eg, security controls). This relationship is illustrated by the inequality on identifiability, shown in Figure 1, which we can rewrite as Data ≤ Threshold / Context. In other words, the probability of re-identification given a re-identification opportunity is less than or equal to the threshold divided by the probability of a re-identification opportunity.
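To illustrate the rearranged inequality, the sketch below converts the maximum allowable data probability into an equivalent minimum cluster size; the context probabilities shown are assumptions, not values prescribed by the standard.

```python
import math

# Illustrative rearrangement: Data <= Threshold / Context, expressed as a minimum cluster size.
def required_cluster_size(threshold: float, context_prob: float) -> int:
    """Smallest cluster size k such that the match probability 1/k stays within
    the maximum allowable data probability (threshold / context_prob)."""
    max_data_prob = threshold / context_prob
    return math.ceil(1.0 / max_data_prob)

print(required_cluster_size(0.09, 1.0))  # uncontrolled context: 12 (guidance commonly cites 11, since 1/11 ≈ 0.09)
print(required_cluster_size(0.09, 0.3))  # assumed more controlled context: 4
```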
The data transformation rating can then use the relationship expressed in this inequality to prescribe the statistical point at which anonymization is achieved for an applicable context along a spectrum of identifiability. The less controlled the environment, the farther to the right the anonymization line moves and the more the data must be transformed. The more controlled the environment, the farther the anonymization line moves to the left and the more granularity and utility can be preserved in the data. Determining the optimal balance is key to ensuring that the most useful data are shared while sustaining consistent, proven privacy protection.
The minimum and average cluster-size equivalents provide the necessary guidance for transforming the data to the degree necessary for the context and applicable threshold.
Dates can be invaluable for secondary analysis, though also individually identifying. For secondary analysis and research, the sequencing and spacing in time between events for trial participants matters more than specific calendar dates. Accordingly, in conjunction with addressing the SAFE Data Standard tolerances, sponsors should zero the dates in a study through a proven method such as PHUSE date shifting.56 Offsetting dates reduces the level of date precision for participants while retaining utility. This approach can be part of a data transformation strategy to achieve the data tolerance and is also incorporated by default for all SAFE Data Standard levels.
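A minimal sketch of this kind of per-participant date offsetting is shown below. The column names and anchor choice (each participant's earliest event) are hypothetical; the PHUSE guidance describes the approach in full.

```python
import pandas as pd

# Hypothetical event dates for two participants.
events = pd.DataFrame({
    "subject_id": ["P1", "P1", "P2", "P2"],
    "event_date": pd.to_datetime(["2019-03-02", "2019-03-16", "2019-05-10", "2019-06-01"]),
})

# Use each participant's earliest event as day 1, so calendar dates are removed
# while the sequencing and spacing between events are preserved.
anchor = events.groupby("subject_id")["event_date"].transform("min")
events["study_day"] = (events["event_date"] - anchor).dt.days + 1
events = events.drop(columns="event_date")
print(events)
```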
The data transformation rating defined by the cluster-size thresholds is designed to address the risks of a deliberate re-identification attempt and of a breach, both of which are mitigated by the controls in place, which is why the approach taken is proportionate to the level of control. However, in some cases, there is an additional risk to be managed: inadvertent or spontaneous recognition of an acquaintance in the data by the researcher or analyst working with the data.
At the level 5 rating for open data (under a license or terms of use), the transformations recommended would inherently address this risk. But for other data rating levels, additional protections may be warranted. If the data are going to be analysed by researchers who reside in the same region as a high concentration of trial participants, the sponsor may want to transform the data further to mitigate this risk. In most cases, due to the small populations for clinical trials and the infrequency of well-recognized celebrities enrolling in clinical trials, the likelihood of spontaneous recognition of an acquaintance is lower than the likelihood of a breach or deliberate attempt to re-identify. Accordingly, the recommended tolerances introduced will typically transform the data enough to protect against this threat.
While privacy protection and relative utility are consistent at a given level in the SAFE Data Standard, the usefulness of the data for a given analytic objective can still vary. The data thresholds prescribed for a given SAFE Data Standard level can be achieved in multiple ways. Take, for example, a scenario in which the demographic indirect identifiers in a study have been transformed more than those associated with medical events, vital statistics, and substance use. That study could achieve the same rating level if the demographic identifiers were transformed less and the other identifiers were transformed more. In either scenario, the privacy level and the overall utility would be the same, but the usefulness of the data for a specific objective could vary significantly. For certain analytics objectives, the usefulness of the data resulting from one set of transformations could be much greater than from the other. For example, if exact age is critically important to retain for the desired analysis, more transformation to medical histories or substance use may be acceptable.
Measures of utility will depend on the intended use of the data. Because such uses are not always known upfront when data are shared, it is not always easy to optimize the distribution of transformations over the entirety of a given study. To the extent possible, stakeholders expecting to use the anonymized data (eg, researchers, internal data scientists) should provide input on what is most important to them upfront. For more open, generalized data disclosures serving many different audiences (eg, publication by Health Canada or EMA), guiding principles can promote sound decision making on utility trade-offs. For example, each disease area or indication can have certain prognosis factors that are prioritized for retention in the anonymization process.
While qualitative aspects of data utility are important, the degree to which utility metrics can be quantified can further augment and standardize the anonymization process. Global utility metrics can play a role when combined with other measures. Tailored utility metrics based on qualitative measures of importance (eg, weights assigned to prognosis factors by indication) can be even more effective in practice, particularly when defined in consultation with stakeholders who may use the data. As a practical example, data-sharing platforms or even industry consortia may identify priorities by indication that can be translated to retention metrics measured across clinical trial sponsors contributing anonymized data.
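As a simple illustration of a tailored utility metric, the sketch below computes a weighted retention score; the variables, weights, and retention values are hypothetical and would in practice be defined with stakeholders by indication.

```python
# Hypothetical weights reflecting the assumed importance of variables for a given indication.
weights = {"age": 0.4, "medical_history": 0.3, "substance_use": 0.2, "country": 0.1}

# Hypothetical fraction of original information retained after transformation
# (1.0 = untouched, 0.5 = generalized, 0.0 = suppressed).
retention = {"age": 1.0, "medical_history": 0.6, "substance_use": 0.5, "country": 0.25}

utility_score = sum(weights[v] * retention[v] for v in weights)
print(round(utility_score, 3))  # 0.705 on a 0-1 scale
```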
Open communication and a commitment to producing useful data that are nonidentifiable go a long way to meeting the needs of everyone involved. Ideally, framing that conversation around the inherent trade-offs can promote stakeholder alignment and support collaboration to make the most useful data available.
A simplified rating system can standardize data utility and privacy for more effective, efficient, and trustworthy transparency. As presented, the data transformation rating specifies how much the data at each rating level have been transformed and gives an indication of how much utility the data have retained.
To ensure that a data transformation rating is appropriate for a given situation, the proposed assessment framework not only provides a rating scale for data transformations but also prescribes appropriate uses and contextual measures to accompany each data transformation level.
From our last inequality, Data ≤ Threshold / Context, one can see that the level of data transformation needed for a data set is proportionate to the desired threshold and its relation to the data release context, or Data ∝ Threshold / Context. We mean proportionate in the general sense of forming a relationship: once the threshold is selected, the relationship between data and context could be linear or simply monotonic.
While some trials may entail more sensitive personal information (eg, an HIV trial versus a rheumatoid arthritis trial), the potential privacy harm should be balanced with the potential non-disclosure harm (or looked at the other way, the harm should be weighed against the benefits). While the EMA and Health Canada consider that sponsors may adopt different thresholds with evidence-based rationale, the recommended 0.09 generally applies independently of trial sensitivity. Doing so prevents disclosure bias toward less sensitive diseases (ie, favoring more transparency and disclosure for research into some conditions over others) and ensures that privacy protection is the dominant influencer on the granularity of data shared (which will naturally result in less granular information shared on rare disease trials, as an example, but for reasons of privacy protection). Given this standard, several sponsors have adopted 0.09 as a guiding principle for their external disclosure practices. However, thresholds may be adjusted based on circumstances, such as those involving highly sensitive and stigmatizing data, while taking into account participant expectations.57
Given the established industry standard for external disclosures, we narrow our focus for determining the threshold based on the nature of data use, meaning the benefits and the reasonably expected approval by individuals of such uses. In other words, for Data ∝ Threshold / Context, the level of data transformation needed is proportionate to the threshold, which is informed by the nature of use, divided by the context, which is influenced by controls and trust. Threshold and context are in turn dependent on the type of data use, the controls in place to prevent re-identification, and the level of trust in data recipients, or data transformation ∝ use / (controls and trust).
Thus, there are three distinct variables in the assessment framework that must be considered to determine whether a data transformation is situationally appropriate.
To standardize and achieve a common, defensible rating, these contextual factors need to be evaluated consistently. The following sections provide an assessment framework for controls and recipient trust.
As introduced earlier, sponsors should consider the ethics and bioethics of data sharing. The SAFE Data Standard pertains to data reuses and disclosures in which there are potential public or societal benefits, whether through new drug discovery or R&D, trial transparency, advancements in clinical research, or otherwise. Data reuses and disclosures that do not pertain to human health are excluded from the envisioned application of the SAFE Data Standard. For the purposes of the rating scale, two generalized types of trial data reuse can be distinguished as follows, both of which carry an expectation of broader benefits but may differ in the degree of anticipated approval by trial participants:
If no controls are in place (for example, if data are being made available to the public for download), only the data need to be evaluated. However, for platforms or internal environments that do enforce privacy and data security, the following scale can be used to characterize the level of control. If the minimal level is not achieved (ie, if the basic controls are not in place), then a “zero control” context is assumed in determining the SAFE Data Standard rating.
If there are no enforceable terms of use established with data recipients, nothing needs assessing. However, if data access is restricted to known entities who agree to terms of use, the following scale can be used to characterize the level of recipient trust. If the enforceable criteria are not demonstrated, then no recipient trust is assumed in the SAFE Data Standard rating.
In summary, the proposed data transformation rating is from 0 to 5, where 0 is the raw data and the scale from 1 to 5 reflects varying degrees of data transformation proportional to the type of use and contextual controls in place to protect data from re-identification opportunities. Consistently evaluating the uses of anonymized data and the anonymization context can speed and standardize anonymization processes applied across sponsors and enforce a common baseline for privacy while maximizing data utility and analytic benefits. Figure 6 summarizes the data transformation ratings from 0 to 5.
Once data are transformed as part of the anonymization process, sponsors should retain reports detailing the anonymization approach taken and associated justifications for auditability.
To illustrate the concept of the SAFE Data Standard, we applied the data tolerances from levels 1 to 5 to simulate the transformation impacts on indirectly identifying data from a clinical study. (See Figure 7 below.) For clarity, directly identifying information or unique identifiers are masked or removed (for instance, subject IDs are replaced with pseudonyms and site IDs are removed) and non-identifying information, such as a blood glucose reading (which can change frequently), is preserved during the anonymization process.
While the individual variable-level transformations will depend on the study characteristics in practice, including how distinguishable the participants are in the defined population and what preferences for data utility are incorporated (eg, country may be generalized to continent), the general trend of greater transformation and lower utility as you progress from level 1 to 5 is consistent. Figure 7 summarizes the results for the simulation performed, providing an example (not a ruleset) of how the SAFE Data Standard can be applied in practice across a range of disclosure contexts.
The simulation summarized in Figure 7 was based on a randomized, double-blind diabetes study sponsored by Janssen, and Janssen has since made the anonymized data available for secondary research through The YODA Project.59
Stephen Bamford, Sarah Lyons, Luk Arbuckle, and Pierre Chetelat conceived the article and were involved in writing and revising it. Stephen had the initial idea for the standard and the article, and he provided critical feedback throughout. Sarah developed the framework for the standard, produced a first draft and sought feedback on the draft from others in the field. Luk reviewed multiple iterations and added important intellectual content. Pierre reviewed the draft, enhancing content and simplifying criteria from prior standards. All authors were responsible for the content and for approving the final submitted manuscript. Stephen Korte and Ahmad Hachem (non-author contributors) ran statistical simulations of the SAFE Data Standard.
Stephen Bamford, Head of Clinical Data Standards & Transparency, The Janssen Pharmaceutical Companies of Johnson & Johnson, Sarah Lyons, General Manager, Privacy Analytics, Luk Arbuckle, Chief Methodologist, Privacy Analytics, and Pierre Chetelat, Research Associate, Privacy Analytics