Christine Task - Knexus Research Corporation

Data, Privacy---and the Interactions Between Them

Nov 09, 2022

Abstract

Data deidentification aims to provide data owners with edible cake: to allow them to freely use, share, store, and publicly release sensitive record data without risking the privacy of any of the individuals in the data set. And, surprisingly, given some constraints, that's not impossible to do. However, the behavior of a deidentification algorithm depends on the distribution of the data itself.

Privacy research often treats data as a black box: omitting formal data-dependent utility analysis, evaluating over simple homogeneous test data, and using simple aggregate performance metrics. As a result, there's less work formally exploring detailed algorithm interactions with realistic data contexts. This can result in tangible equity and bias harms when these technologies are deployed; this is true even of deidentification techniques such as cell suppression, which have been in widespread use for decades. At worst, diverse subpopulations can be unintentionally erased from the deidentified data.
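
As a rough illustration of how cell suppression can erase a small subpopulation (this is not taken from the talk), here is a minimal Python sketch. The function name `suppress_small_cells`, the threshold of 10, and the toy records are all hypothetical, and the sketch covers primary suppression only; production systems typically also apply complementary suppression so that hidden cells cannot be recovered from table totals.

```python
from collections import Counter


def suppress_small_cells(records, threshold):
    """Count records per cell and drop every cell with fewer than `threshold` records.

    `records` is an iterable of hashable cell keys, e.g. (region, group) tuples.
    Returns a dict mapping each surviving cell to its count.
    """
    counts = Counter(records)
    return {cell: n for cell, n in counts.items() if n >= threshold}


# Hypothetical toy data: each record is a (region, group) pair.
records = (
    [("north", "group_a")] * 40
    + [("north", "group_b")] * 3   # small subpopulation
    + [("south", "group_a")] * 25
    + [("south", "group_b")] * 12
)

released = suppress_small_cells(records, threshold=10)

# Cells present in the original data but absent from the released table.
erased = sorted(set(records) - set(released))

print("released:", released)
print("erased:", erased)
```

In this toy data, the three `group_b` records in the north region fall below the threshold, so that subpopulation is absent from the released table even though every published count is exact.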

Successful engineering requires understanding both the properties of the machine and how it responds to its running environment. In this talk I'll provide a basic outline of distribution properties, such as feature correlations, diverse subpopulations, deterministic edit constraints, and feature space qualities (cardinality, ordinality), that may impact algorithm behavior in real-world contexts. I'll then use new (publicly available) tools from the National Institute of Standards and Technology to show unprecedentedly detailed performance analysis for a spectrum of recent and historic deidentification techniques on diverse community benchmark data. We'll combine the two and consider a few basic rules that help explain the behavior of different techniques in terms of data distribution properties. But we're very far from explaining everything, so I'll describe some potential next steps on the path to well-engineered data privacy technology that I hope future research will explore. It's a path I hope some CERIAS members might join us on later this year.

This talk will be accessible to anyone who's interested; no background in statistics or data, and no familiarity with any of the above jargon, is required.

About the Speaker

Christine Task is a CERIAS alumna who earned her PhD in Computer Science at Purdue University in 2015 and joined Knexus Research Corporation later that year. Since then she has led the first National Challenges in Differential Privacy for the National Institute of Standards and Technology, contributed to the 2020 Census Differentially Private Disclosure Avoidance System, served as technical lead for non-DP Synthetic Data projects for the US Census Bureau's American Community Survey, American Housing Survey and American Business Survey, been co-lead on the United Nations UNECE Synthetic Data Working Group, and led the development of the SDNist data deidentification benchmarking library. Back in 2012, as a doctoral student at Purdue, she gave a CERIAS seminar titled "Practical Beginner's Guide to Differential Privacy", whose success was very valuable to her career. A decade after that first seminar, she was thrilled to be invited back to present what amounts to an update on that work.
