Anonymizing Textual Data and Its Impact on Utility
Principal Investigator: Chris Clifton
Data Protection laws that exempt data that is not individually identifiable have led to an explosion in anonymization research. Unfortunately, how well current de-identification and anonymization techniques control risks to privacy and confidentiality is not well understood. Neither is the usefulness of anonymized data for real-world applications. The project addresses anonymization on three fronts:
- Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of "phantom limb pain". Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as "identifying information". Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort with computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.
- Current anonymization research is based on unproven measures of identifiability. Through a re-identification challenge on synthetic data (but based on real healthcare data), the project is evaluating the efficacy of these measures. Interdisciplinary teams of students are given challenge problems - anonymized data with hypothetical healthcare data - and asked to make (hypothetical) inferences about health information of individuals. The results can be used to calibrate the effectiveness of different anonymization measures.
- The utility of anonymized data has been a concern among research: Does anonymized data provide credible research results? By partnering with healthcare studies at the Kinsey Institute and Purdue University School of Nursing, the project is comparing analyses on original data with analyses on anonymized data, and evaluating the impact of types of anonymization on research results. A related issue is determining the impact on data collection: Are individuals more candid in their responses if they know data will be anonymized? Outcomes are broadening the scope of research that can be performed on anonymized data, while ensuring that researchers know when access to individually identifiable data (with attendant restrictions and safeguards) is needed.
Through these tasks, the project is advancing our ability to utilize the wealth of data we now collect for the benefit of society, while ensuring individual privacy is protected.
Other PIs: Chyi-Kong (Karen) Chang (Purdue University) Racquel L. Hill (Indiana University) Wei Jiang (Missouri University of Science & Technology) Victor Raskin (Purdue University) Stephanie Sanders (Kinsey Institute) Luo Si (Purdue University)
Students: Balamurgan Anandan
Wei Jiang, Mummoorthy Murugesan, Chris Clifton, and Luo Si, “t-Plausibility: Semantic Preserving Text Sanitization”, 2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009.
Keywords: anonymization, Privacy