The Center for Education and Research in Information Assurance and Security (CERIAS)

Chris Clifton - Department of Computer Science, Purdue University

Data mining technology

Aug 28, 2002


Data mining technology has emerged as a means of identifying patterns and trends in large quantities of data. One of the key requirements of a data mining project is access to the relevant data. Privacy and security concerns can constrain such access, threatening to derail data mining projects. This tutorial discusses the constraints imposed by privacy and security, and presents technical solutions that enable data mining projects to proceed without violating these constraints.

Privacy and security have been a concern since early in the age of automated data processing, and these issues were raised early in the data mining community [CM96]. There has recently been a surge in solutions to these issues [AS00, LP00, AA01, KC02, VC02]. The combination of growing use of data mining (leading to more conflict with privacy and security concerns), along with the growing number of technical solutions, makes the time ripe for a tutorial on the subject.

There are many data mining situations where these privacy and security issues arise. A few examples are:

  • Identifying public health problem outbreaks (e.g., epidemics, biological warfare instances). There are many data collectors (insurance companies, HMOs, public health agencies). Individual privacy concerns limit the willingness of the data custodians to share data, even with government agencies such as the U.S. Centers for Disease Control. Can we accomplish the desired results while still preserving privacy of individual entities?

  • Collaborative corporations or entities. Ford and Firestone shared a problem with a jointly produced product: Ford Explorers with Firestone tires. Ford and Firestone may have been able to use association rule techniques to detect problems earlier. This would have required extensive data sharing. Factors such as trade secrets and agreements with other manufacturers stand in the way of the necessary sharing. Could we obtain the same results, while still preserving the secrecy of each side's data? Government entities face similar problems, such as limitations on sharing between law enforcement, intelligence agencies, and tax collection.

  • Multi-national corporations. An individual country's legal system may prevent sharing of customer data between a subsidiary and its parent.

These examples each define a different problem, or set of problems. The problems can be characterized by the following three parameters:

What is the desired data mining result? Do we want to cluster the data, as in the disease outbreak example? Are we looking for association rules identifying relationships among the attributes? There are several such data mining tasks, and each poses a new set of challenges.

Control of Data
Who controls the data? Is each entity found only at a single site (as with medical insurance records)? Or do different sites contain different types of data (Ford on vehicles, Firestone on tires)?

What are the privacy requirements? If the concern is solely that values associated with an individual entity not be released (e.g., "personally identifiable information"), we can develop techniques that provably protect such information. In other cases, what counts as "sensitive" may not be known in advance. In such cases, the intermediate results would require human vetting.

Sometimes it may be difficult (or impossible) to develop an exact solution that meets the privacy constraints. In data mining an approximate solution is often sufficient. The goal, then, is to obtain a solution with bounded error.

Last winter's seminar on this topic described solutions based on distributing the data: sites are trusted with their own data, but nobody is trusted with all the data. Today's seminar will discuss another class of solutions to the problem: perturbing the data. Several techniques for learning from distorted data will be presented.
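
As a rough illustration of what learning from distorted data can mean, the Python sketch below has each individual add random noise to a value before releasing it, and then recovers aggregate statistics from the released values alone. It is a minimal sketch under stated assumptions, not necessarily one of the techniques presented in the seminar: the uniform noise model, the synthetic "ages", and the function names `perturb` and `estimate_mean_and_variance` are all hypothetical choices made for this example.

```python
import random
import statistics


def perturb(values, spread, rng):
    """Return each value plus independent uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


def estimate_mean_and_variance(perturbed, spread):
    """Estimate the mean and variance of the original values.

    Assumes the noise is independent of the data and that the noise
    distribution (here Uniform(-spread, spread)) is publicly known.
    """
    # Variance of a uniform distribution on an interval of width 2 * spread.
    noise_variance = (2 * spread) ** 2 / 12
    # The noise has zero mean, so the mean of the perturbed values
    # is an unbiased estimate of the mean of the original values.
    est_mean = statistics.fmean(perturbed)
    # Var(original + noise) = Var(original) + Var(noise) under independence.
    est_var = statistics.pvariance(perturbed) - noise_variance
    return est_mean, max(est_var, 0.0)


# Synthetic data for illustration only (not real records).
rng = random.Random(0)
true_ages = [rng.gauss(40, 10) for _ in range(20000)]

# Only the perturbed values would be released to the data miner.
released = perturb(true_ages, spread=30, rng=rng)
est_mean, est_var = estimate_mean_and_variance(released, spread=30)

print(f"true mean      {statistics.fmean(true_ages):.2f}")
print(f"estimated mean {est_mean:.2f}")
print(f"true variance      {statistics.pvariance(true_ages):.2f}")
print(f"estimated variance {est_var:.2f}")
```

In this sketch, any single released value can differ from the true one by up to `spread`, yet the estimates of the aggregate statistics tighten as the number of records grows. This is one way an approximate result can still be useful when exact values cannot be shared.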

About the Speaker

Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue in 2001, Chris served as a Principal Scientist at The MITRE Corporation and as an Assistant Professor of Computer Science at Northwestern University. His research interests include data mining, data security, database support for text, and heterogeneous databases.
