# Privacy Preserving Data Mining over Vertically Partitioned Data

Jaideep Vaidya

### Tech report number

CERIAS TR 2004-40

phdthesis

### Abstract

The goal of data mining is to extract or mine'' knowledge from large amounts of data. However, data is often collected by several different sites. Privacy, legal and commercial concerns restrict centralized access to this data. Theoretical results from the area of secure multiparty computation in cryptography prove that assuming the existence of trapdoor permutations, one may provide secure protocols for \emph two-party computation as well as for \emph multiparty computation with honest majority. However, the general methods are far too inefficient and impractical for computing complex functions on inputs consisting of large sets of data. What remains open is to come up with a set of techniques to achieve this efficiently within a quantifiable security framework. The distributed data model considered is the heterogeneous database scenario with different features of the same set of data being collected by different sites. This thesis argues that it is indeed possible to have \emph and \emph techniques for useful privacy-preserving mining of knowledge from large amounts of data. The dissertation presents several privacy preserving data mining algorithms operating over vertically partitioned data. The set of underlying techniques solving independent sub-problems are also presented. Together, these enable the secure mining'' of knowledge.

2004 – 08 – 07

Vaidya

### School

Purdue University

### Affiliation

Rutgers, The State University of New Jersey

2004-08-07

### Contents

Introduction State of the Art in Privacy, Security and Data Mining {2.1}State of the Art in Data Mining Techniques {2.2}Distributed Data Mining {2.2.1}Vertical Partitioning {2.2.2}Horizontal Partitioning {2.3}State of the Art in Secure Multiparty Computation {2.3.1}Trusted Third Party Model {2.3.2}Semi-honest Model {2.3.3}Malicious Model {2.3.4}Other (Partial) Models -- Incentive Compatibility {2.4}State of the Art in Privacy Preserving Data Mining Algorithms {2.4.1}Data Perturbation Techniques {2.4.2}Secure Multiparty Computation Based Solutions {2.5}Other Work Problems Addressed {3.1}Two-party Association Rule Mining {3.1.1}Problem Definition {3.1.2}Algorithm {3.1.3}Security Analysis {3.1.4}Computation and Communication Analysis {3.2}Three or More Party Association Rule Mining {3.2.1}Problem Definition {3.2.2}Algorithm {3.2.3}Proof of Correctness {3.2.4}Computation and Communication Analysis {3.2.5}Security Analysis {3.3}$k$-means Clustering {3.3.1}Basic Approach {3.3.2}Algorithm {3.3.3}Security Discussion {3.3.4}Handling Collusion {3.3.5}Communication Analysis {3.4}Na\"\i ve Bayes Classification {3.4.1}Na\"\i ve Bayes Classifier {3.4.2}Model Issues -- Splitting of Model Parameters {3.4.3}Building the Classifier Model {3.4.4}Evaluation of an Instance {3.4.5}Security Analysis {3.4.6}Computation and Communication Analysis {3.5}Decision Tree Classification {3.5.1}Privacy-Preserving ID3: Creating the Tree {3.5.2}Using the Tree {3.5.3}Security Analysis {3.5.4}Computation and Communication Analysis {3.6}Outlier Detection {3.6.1}Basic Approach {3.6.2}Algorithm {3.6.3}Security Analysis {3.6.4}Computation and Communication Analysis Primitives Developed {4.1}Securely Computing the Size of Set Intersection {4.1.1}Problem Definition {4.1.2}Algorithm / Protocol {4.1.3}Communication and Computation Analysis {4.1.4}Security Analysis {4.2}A More Efficient Set Intersection Protocol {4.2.1}Problem Definition {4.2.2}Algorithm / Protocol {4.2.3}Communication and Computation Analysis {4.2.4}Security Analysis {4.3}Algebraic Method for Computing Dot Product {4.3.1}Problem Definition {4.3.2}Protocol {4.3.3}Communication and Computation Analysis {4.3.4}Security Analysis {4.4}Cryptographic Method for Computing Boolean Dot Product {4.4.1}Problem Definition {4.4.2}Generic Encryption System {4.4.3}Algorithm {4.4.4}Communication and Computation Analysis {4.4.5}Security Analysis {4.5}Modified Secure Comparison Protocol Experimental Validation {5.1}Weka {5.2}Decision Tree Classification {5.3}Association Rule Mining {5.4}Summary Summary {LIST OF REFERENCES {APPENDIX Discussion of Secure Multiparty Computation Techniques {A.1}Primitive Used from the Literature {A.1.1}Permutation Algorithm {A.1.2}Scalar Product Protocol {A.1.3}Square Computation {A.1.4}Privately Computing $\mathop {\mathgroup \symoperators ln}\nolimits$ {A.1.5}Division Protocol {A.2}A Complete Solution and Analysis using the General Secure Multiparty Approach

English

### Subject

Data Mining, Privacy, Security, Secure MultiParty Computation

## BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.