Evaluation of Clinical and Genomic Information Privacy Risks From Inference Attacks

Research Areas: Assured Identity and Privacy

Principal Investigator: Xukai Zou

In this study, we will examine the quantitative relationship between the disclosure of clinical and genomic information and the privacy risks that arise from inference attacks. By inference attack, we mean the inference of a person’s identity or other personal information without the information owner’s explicit consent or knowledge. In translational medical studies, identifiable personal information is usually anonymized and protected according to a set of high-level guidelines. However, there is no explicit guarantee that such anonymization serves the best interests of research participants, especially given the increasing demand for open access to biobanks from researchers worldwide and, in some cases, from patients themselves, who are allowed to access their own research results. Nor is there a method that helps researchers and biobank stakeholders gauge the risk of inference attacks if an anonymized clinical database is compromised through a security leak. Our specific aims are:

Aim 1. Evaluate how clinical information, disclosed through either authorized or unauthorized access, may be used by inference attackers to reconstruct personal identity.
- We will take metadata disclosed in actual cancer clinical studies in which the PI is currently involved and perform simulated inference attacks to assess how well the clinical information is protected (additional IRB approval will be sought).
- We will then broaden the research scope to include a survey of the metadata that other clinical studies report collecting in the literature, in order to assess whether the findings from our simulated attacks apply generally.
Aim 2. Determine which sets of personal attributes and genomic variation loci are most vulnerable to inference attacks if security is compromised.
- We will rank the common data attributes disclosed in clinical studies by risk scores that we will derive from our simulation results.
- We will also rank the common genomic variations disclosed in the Personal Genome Project (PGP) and the dbGaP database at NIH, to identify the single nucleotide polymorphism (SNP) loci that are most discriminative of individuals.
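
As a rough illustration of the kind of ranking described in Aim 2, the sketch below scores attributes on a small, made-up table. It is not the project's risk-scoring method, which is to be derived from the simulation results. It assumes two common stand-in measures: the fraction of records that are unique on a set of attributes, and the Shannon entropy of each attribute. The function names (`uniqueness`, `entropy_bits`, `rank_attributes`), the attribute names, and the records are all hypothetical.

```python
from collections import Counter
import math


def uniqueness(records, attrs):
    """Fraction of records whose combination of values on `attrs` is unique."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    values = list(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def rank_attributes(records, attrs):
    """Sort attributes from most to least discriminative by entropy."""
    scores = {a: entropy_bits(r[a] for r in records) for a in attrs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy records; real study metadata would replace these.
records = [
    {"zip3": "462", "birth_year": 1950, "sex": "F"},
    {"zip3": "462", "birth_year": 1950, "sex": "M"},
    {"zip3": "462", "birth_year": 1961, "sex": "F"},
    {"zip3": "463", "birth_year": 1961, "sex": "F"},
]

attrs = ["zip3", "birth_year", "sex"]
print(rank_attributes(records, attrs))  # per-attribute entropy ranking
print(uniqueness(records, attrs))       # share of records unique on all three
```

In this toy table no record is unique on birth year alone, yet every record is unique once the three attributes are combined, which is the effect an inference attacker exploits. Under the same assumptions, `entropy_bits` could be applied to the genotype calls at each SNP locus to obtain a first-pass ordering of loci by discriminative power.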

The study has high potential impact on future clinical and genomic data sharing and protection, which can be summarized as follows:

The findings from this study can immediately inform the data-sharing design of hundreds of biobanking and clinical trial projects worldwide. These projects, which involve billions of dollars and millions of participants, will be able to review the information privacy vulnerabilities revealed by this study and take action to improve privacy protection.

For biobanking or clinical study projects that require sharing of clinical data for research or individual use, the knowledge gained from this study (e.g., the privacy risk scores associated with each set of clinical or genotyping data) will help project investigators make informed decisions that balance the benefits and risks of information sharing.

Personnel

Other PIs: Jake Chen

Students: Huian Li

Keywords: inference attacks, privacy, security leaks