The Center for Education and Research in Information Assurance and Security (CERIAS)

Selective Data Expansion for Model Performance

Principal Investigator: Romila Pradhan

Machine learning systems are increasingly used for critical decision-making in domains such as healthcare, finance, and criminal justice. Concerns about the fairness of these systems have led to several mitigation techniques that emphasize the need for high-quality data to ensure fairer decisions. However, the role of earlier stages of the machine learning pipeline in addressing model unfairness remains underexplored.
We focus on selective data expansion, the task of carefully selecting additional data points from a data pool and adding them to the training data, to rapidly improve the fairness of a model learned on the expanded data while preserving its accuracy. Since not all points in the pool are equally beneficial, we propose DataSift, a data expansion framework that combines data valuation with multi-armed bandits to identify the most valuable data points for inclusion in the training data. Unlike prior methods that mitigate unfairness through data transformation or through in-processing and post-processing, DataSift addresses the problem directly by selecting the right data. Over successive iterations, DataSift selects a partition of the pool, samples a batch of points using influence functions, evaluates their impact, and updates the partition utilities accordingly.
Empirical evaluation of DataSift on multiple real-world and synthetic datasets shows that model unfairness is largely mitigated by adding as little as 4% additional data, with at most a 2.6% reduction (and as much as a 27.4% increase) in accuracy.
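To make the iterative selection loop concrete, the sketch below shows one way such a bandit-over-partitions expansion could be organized. It is a minimal illustration, not the actual DataSift implementation: the functions influence_scores and fairness_gap are hypothetical placeholders for the influence-function valuation and the fairness evaluation described above, and the UCB update is one standard bandit strategy assumed here for concreteness.

import numpy as np

rng = np.random.default_rng(0)

def influence_scores(candidate_idx):
    # Placeholder: in practice, approximate each candidate point's influence
    # on the model's fairness/accuracy via influence functions.
    return rng.normal(size=len(candidate_idx))

def fairness_gap(train_idx):
    # Placeholder: retrain (or incrementally update) the model on train_idx
    # and return a group-fairness gap (smaller is better).
    return max(0.0, 0.3 - 0.001 * len(train_idx) + rng.normal(scale=0.01))

def selective_expansion(partitions, train_idx, batch_size=32, rounds=50):
    """Iteratively pick a partition (UCB) and add its most influential points."""
    k = len(partitions)
    counts, utilities = np.zeros(k), np.zeros(k)
    current_gap = fairness_gap(train_idx)
    for t in range(1, rounds + 1):
        # Upper-confidence-bound selection over partitions; unplayed arms first.
        ucb = utilities + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
        arm = int(np.argmax(np.where(counts == 0, np.inf, ucb)))
        pool = partitions[arm]
        if len(pool) == 0:
            counts[arm] += 1
            continue
        # Rank the partition's candidates by influence and take the top batch.
        scores = influence_scores(pool)
        top = pool[np.argsort(scores)[-batch_size:]]
        train_idx = np.concatenate([train_idx, top])
        partitions[arm] = np.setdiff1d(pool, top)
        # Reward = observed reduction in the fairness gap after adding the batch.
        new_gap = fairness_gap(train_idx)
        reward = current_gap - new_gap
        current_gap = new_gap
        counts[arm] += 1
        utilities[arm] += (reward - utilities[arm]) / counts[arm]
    return train_idx, current_gap

# Example usage with a toy pool split into four partitions (e.g., by group):
pool_idx = np.arange(1000)
parts = [pool_idx[i::4].copy() for i in range(4)]
expanded_idx, final_gap = selective_expansion(parts, train_idx=np.arange(1000, 1200))

In this sketch the partition utility is a running average of the per-batch fairness improvement, so partitions whose points consistently reduce the fairness gap are sampled more often over time.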

Personnel

Students: Jahid Hasan