Best practices in data re-identification and control when designing future trials.
When it comes to human research, there is an inherent tension between enabling public access to data and protecting those who volunteer for research. Public discussions about how to balance this tension have evolved over the last decade and have informed both the public rulemaking process in the US and data stewardship practices within the medical research industry.
One of the first national policies to address this tension was the National Institutes of Health (NIH) Genomic Data Sharing Policy enacted in 2014, which established the expectation of broad and responsible sharing of genomic research data. During the public comment period and subsequent publication of the final policy, the issue of adequate informed consent for this kind of far-reaching research was raised. This resulted in NIH recommending that investigators seek the broadest consent possible when first obtaining consent from participants. However, NIH recognized that in some cases limits will still be required, and it allows for the use of controlled-access databases as a way to mitigate concerns.1
The policy requires that participants provide consent for sharing their data, even after it is de-identified, in order for the data to be deposited in an accessible database. The policy also requires that institutions, usually by way of their institutional review boards (IRBs), confirm the data sharing is consistent with the informed consent of study participants, and that consideration was given to the risks to individual participants and their families as well as groups or populations associated with the data.
The most recent US policy is the NIH Data Management and Sharing Policy, effective Jan. 25, 2023. NIH received program-level funding of $49.123 billion for fiscal year 2023.2 To maximize the value of these taxpayer dollars, NIH seeks to ensure that scientific data is broadly shared. Sharing data in an accessible way should accelerate medical discovery, enable validation of research results, and allow data to be reused in future studies.3 The policy requires that a plan for data management and sharing be submitted with each research proposal, including a budget for the storage, maintenance, and sharing of data and plans for protecting human participants’ data. The policy does not include any specific new requirements regarding human participant data, but it notes that data management plans must both enable stewardship of scientific data and protect individual participants within existing laws and regulations.
In addition to concerns about adequate informed consent, issues were raised regarding the usability of data. The scale of current scientific data accumulation is truly staggering. According to the National Human Genome Research Institute, genomic research is predicted to generate two to 40 exabytes of data within the next 10 years.4 An exabyte is one billion gigabytes, enough to fill two million average desktop computers. How should this ever-increasing and complex data be stored, curated, and annotated to facilitate responsible sharing?
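As a rough check on the scale described above, the arithmetic behind the desktop comparison can be sketched in a few lines. This is only an illustration under stated assumptions: it uses decimal (SI) units, the function names are hypothetical, and the per-desktop capacity is derived from the article's own figures rather than taken from any source.

```python
# Minimal sketch of the storage arithmetic, assuming decimal (SI) units:
# 1 gigabyte = 10**9 bytes and 1 exabyte = 10**18 bytes.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_EXABYTE = 10**18


def gigabytes_per_exabyte() -> int:
    """Number of gigabytes in one exabyte."""
    return BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE


def implied_desktop_capacity_gb(desktops: int = 2_000_000) -> float:
    """Per-desktop storage (GB) implied if one exabyte fills `desktops` machines.

    The default of two million desktops comes from the comparison in the text.
    """
    return gigabytes_per_exabyte() / desktops


def desktops_needed(exabytes: float, desktop_gb: float) -> float:
    """How many desktops of `desktop_gb` capacity would hold `exabytes` of data."""
    return exabytes * gigabytes_per_exabyte() / desktop_gb


if __name__ == "__main__":
    capacity = implied_desktop_capacity_gb()
    print(f"Gigabytes per exabyte: {gigabytes_per_exabyte():,}")
    print(f"Implied capacity per desktop: {capacity:,.0f} GB")
    # Low and high ends of the predicted range quoted in the text.
    for eb in (2, 40):
        print(f"{eb} EB -> {desktops_needed(eb, capacity):,.0f} desktops")
```

Under these assumptions, the comparison implies desktops of about 500 GB each, and the predicted range of two to 40 exabytes corresponds to roughly four million to 80 million such machines.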
In 2016, the FAIR Guiding Principles for scientific data management and stewardship were published.5 These principles direct that data should be findable, accessible, interoperable, and reusable.
Since the publication of the FAIR Guiding Principles, various stakeholders, including NIH and companies in the medical research industry, have considered FAIR to be an enabler of digital transformation. They have implemented the principles in crafting their data management strategies in order to realize the promise of real-world data (RWD) to generate real-world evidence (RWE) in healthcare, which requires linking of disparate data sets.6
Outside of NIH, the 2018 revised Common Rule public comment period similarly included extensive discussion around consent and the secondary use of data. The preamble to the final rule lays out the various ways in which advances in technology and the research enterprise led to the need for updated rules around how to protect human participants’ research data. While the original Common Rule focused on physical risks and harms to participants, research stakeholders recognize that due to advances in generating and sharing data, informational harms to participants must also be considered.7
One important question addressed is whether, given current data-mining technology, data can truly be de-identified. The final rule did not go so far as to require consent for future research with de-identified data. It did, however, revise the definition of human subject to include “using, studying, or analyzing individuals’ information or biospecimens or generating identifiable private information or identifiable biospecimens.”
It also expanded the definition of “identifiable,” stating that “identifiable private information is private information for which the identity of the subject is or may readily be ascertained by the investigator or associated with the information.” The revised Common Rule also requires federal departments to review the concept of “readily identifiable” periodically, in recognition of the fact that technological and scientific advances may change what is readily identifiable.
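Specifically, regarding consent, the revised Common Rule requires that consent forms for research involving identifiable private information or identifiable biospecimens contain one of two statements: either that identifiers might be removed and the information or biospecimens could then be used or distributed for future research without additional consent, or that the information or biospecimens will not be used or distributed for future research, even if identifiers are removed.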
Likewise, the 2022 notice of proposed rulemaking to harmonize FDA regulations with the revised Common Rule would add a basic element of informed consent requiring a description of how information or biospecimens may be used or distributed for future research. However, that description would not be limited to the two options available under the revised Common Rule.8
Technology for mining and linking patient data is advancing at a rapid pace. FAIR principles seek to enhance the ability of both machines and humans to use this data to further scientific and medical knowledge. These increasing opportunities to use new forms of data for research could reduce the need for some placebo-controlled studies, which can be ethically challenging even if scientifically ideal. At the same time, as data analysis techniques grow more sophisticated, identifying the risks and benefits to participants and explaining them in a comprehensible way becomes more challenging.
The policies and rules enacted since 2014 attempt to improve informed consent for use of data, but it continues to be a challenge to obtain consent for future research use from participants. It is often impossible to conceive of all future uses when consent is being sought. But participants are engaging in an altruistic act when they agree to enroll in a research study, and they deserve to know how their data will be used. Giving permission to link their research data to future non-research data sets enhances the value of their participation, but it may also increase the risk that their personal data is disclosed or that their data is used to drive conclusions they do not support.
When and how should meaningful consent for future use be sought? If consent is obtained upfront, information about what types of sharing and linking will be done in the future may be limited, and broad consent will need to be obtained. Some potential participants could reject this without more specific information. If consent is not obtained upfront and decisions to use data in ways that require consent are made later, participants may be lost to follow-up, obtaining consent will be more difficult, and the value of the data may be lost.
To maximize the benefits of data from research without compromising the protection of those who volunteer for research, care must be taken to ensure appropriate consent language is used to explain how data can be shared and linked and how far-reaching that can be. Best practices should include a clear explanation of whether and how data can be re-identified and how future research using the data will be controlled. If there are limits to a participant’s ability to withdraw their data, this should also be clearly explained. Researchers should think expansively about future uses of data when designing studies to ensure appropriate consent is obtained and that their plans for data stewardship are robust.