Privacy in Text and Search

Principal Investigator: Chris Clifton

Text and search pose particular privacy challenges, as the failed anonymization of the AOL query log demonstrated. We are developing techniques that identify relevant texts while controlling disclosure of information, both by those searching for information and by those providing content. This work builds on previous success in text mining and privacy-preserving data mining to allow search and analysis of documents while respecting privacy and security constraints. Recent advances include:

  • A new methodology to generate “cover queries” that effectively hide user intent from a search engine.
  • A technique for efficiently comparing two document corpora to identify similar documents, without disclosing document contents.
  • A method for generalizing text to protect against re-identification through information not removed by traditional de-identification techniques.
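
To make the first item more concrete, the Python sketch below illustrates the general idea of cover queries: the real query is submitted together with decoy queries, so the search engine cannot tell from the batch alone which one reflects the user's intent. This is not the project's methodology. The function name `with_cover_queries`, the `DECOY_POOL` list, and the uniform random choice of decoys are all assumptions made purely for illustration.

```python
import random

# Hypothetical pool of decoy queries (an assumption for this sketch).
# A real system would need a much larger pool of plausible queries.
DECOY_POOL = [
    "weather forecast chicago",
    "banana bread recipe",
    "used car prices",
    "marathon training plan",
    "how to repot orchids",
    "cheap flights to denver",
]


def with_cover_queries(real_query, k, rng):
    """Return the real query shuffled among k decoys drawn from DECOY_POOL."""
    candidates = [q for q in DECOY_POOL if q != real_query]
    if k > len(candidates):
        raise ValueError("not enough decoys in the pool")
    batch = rng.sample(candidates, k) + [real_query]
    # Shuffle so the position in the batch does not reveal the real query.
    rng.shuffle(batch)
    return batch


# Example: hide one real query among three decoys.
batch = with_cover_queries("diabetes symptoms", k=3, rng=random.Random(0))
for query in batch:
    print(query)
```

Decoys picked uniformly at random, as here, are likely to be easy to tell apart from genuine queries. Choosing decoys that are as plausible as the real query is the hard part of the problem, and this sketch does not attempt it.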

Ongoing research includes application of this work in support of healthcare research.

Keywords: anonymization, privacy-preserving data mining, security