Differentially Private Data Synthesis: Practical Algorithms and Statistical Foundations

Research Areas: Assured Identity and Privacy

Principal Investigator: Ninghui Li

Funded by National Science Foundation: Collaborative Research: SaTC: CORE: Small: Differentially Private Data Synthesis: Practical Algorithms and Statistical Foundations. 07/05/2023 - 06/30/2026.

Data collected by organizations and agencies are a key resource in today's information age and fuel a significant part of today's economy. However, the disclosure of those data poses serious threats to individual privacy. One important approach to using data while protecting privacy is differentially private data synthesis (DPDS). That is, given a private dataset as input, one uses a differentially private algorithm to generate synthetic datasets that are "similar" to the input dataset. While DPDS has received much attention in recent years, our understanding of this topic remains limited. This project takes a multi-disciplinary approach to advance our scientific understanding of DPDS as well as to improve practical techniques for it. More specifically, this project's novelties are as follows. First, it systematically explores the design space of marginal-based DPDS algorithms, which have proven effective in NIST competitions on DPDS, while also drawing insights from data synthesis techniques developed in related fields (which often do not satisfy DP). Second, it develops statistical theories that are both motivated by the empirical performance of DPDS algorithms and able to guide empirical research on these algorithms.

The project's broader significance and importance are as follows. We are in the information economy. Data of all kinds, such as online interaction data, medical sensor data, genomic data, and location data, are being collected. Practical techniques that enable the use of these data while protecting individual privacy are crucially needed and will greatly enhance the value of such data. Users will gain increased control of their private information, and society as a whole will benefit from extracting maximal value from aggregated data. The PIs plan to jointly develop and teach a graduate-level course on synthetic data, based on existing research in this area as well as research results from this project, and to involve undergraduate students in research.

This project has two thrusts. The first thrust aims to develop new marginal-based DPDS algorithms that improve upon the state of the art in empirical evaluations. The tasks include: performing an in-depth study of the "marginal-to-dataset" problem (how to synthesize a dataset when given a set of marginals); developing and evaluating new approaches for handling numerical attributes; and developing adaptive and automated techniques for selecting marginals so that the dataset synthesized from them captures as much useful information from the input dataset as possible. The second thrust complements the empirical research in the first thrust. It aims to develop statistical theory for high-dimensional marginal-based data synthesis algorithms, as well as a general learning-theory framework for evaluating the utility of synthetic data in downstream tasks. The two thrusts support each other: the experimental study in Thrust 1 will provide insights and directions for the theoretical studies in Thrust 2, which in turn will help explain the experimental findings and guide additional experimental studies.
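To make the marginal-based idea concrete, the following is a minimal sketch, not the project's algorithm. It assumes a single two-way marginal over categorical attributes with a fixed, public domain, uses the standard Laplace mechanism, and then samples synthetic records from the noisy table. All names (noisy_marginal, synthesize, the toy attributes) are hypothetical, and the floating-point Laplace sampler is for illustration only and is not hardened for production use.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(records, attrs, domains, epsilon, rng):
    """Return a noisy contingency table (marginal) over `attrs`.

    Assumption: neighboring datasets differ by adding or removing one
    record, so exactly one cell changes by 1 (L1 sensitivity 1) and
    Laplace noise of scale 1/epsilon gives epsilon-DP for this marginal.
    """
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    cells = product(*(domains[a] for a in attrs))  # full public domain
    return {c: counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng)
            for c in cells}


def synthesize(marginal, attrs, n, rng):
    """Sample n synthetic records from a noisy marginal (post-processing)."""
    cells = list(marginal)
    # Clip negative noisy counts to zero before using them as weights.
    weights = [max(marginal[c], 0.0) for c in cells]
    if sum(weights) <= 0:
        weights = [1.0] * len(cells)  # fall back to uniform sampling
    draws = rng.choices(cells, weights=weights, k=n)
    return [dict(zip(attrs, cell)) for cell in draws]


# Toy private dataset (hypothetical attributes and values).
rng = random.Random(0)
domains = {"smoker": ["no", "yes"], "age_group": ["<40", "40+"]}
attrs = ["smoker", "age_group"]
private = [
    {"smoker": "no", "age_group": "<40"},
    {"smoker": "no", "age_group": "<40"},
    {"smoker": "yes", "age_group": "40+"},
    {"smoker": "no", "age_group": "40+"},
]

marginal = noisy_marginal(private, attrs, domains, epsilon=1.0, rng=rng)
# n is chosen independently of the private data.
synthetic = synthesize(marginal, attrs, n=5, rng=rng)
```

With only one marginal, the "marginal-to-dataset" step reduces to sampling cells in proportion to their noisy counts. The harder problems named above arise when many overlapping noisy marginals must be reconciled, when numerical attributes must be discretized, and when the privacy budget must be divided among the marginals that are selected.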

Personnel

Students: Yuntao Du

Keywords: anonymization, Privacy, privacy preserving