Efficient Source Selection for Machine Learning Tasks
Principal Investigator: Romila Pradhan
Data quality plays a pivotal role in the predictive
performance of machine learning (ML) tasks – a challenge
amplified by the deluge of data sources available in modern
organizations. Recent efforts support data science tasks through
the discovery/construction of training datasets from a fixed set
of tables or the augmentation of training datasets by finding
new features to improve model performance. However, none
of these works consider the problem of selecting a subset of
sources that maximizes downstream task utility. We present
SPLICE, a system designed to efficiently select a suitable subset
of sources that maximizes the utility of the downstream ML
model. SPLICE is inspired by the idea of gene splicing, a core
concept used in protein synthesis. SPLICE begins with an initial
set of data sources and evaluates their utility for the target task.
Throughout execution, it maintains two sets, namely the active
set and the inactive set. Each iteration evaluates a source in the
active set by computing the marginal gain of continuing to keep
it in the active set. Sources in the inactive set are evaluated by
computing the marginal gain of adding them to the active set.
Over multiple iterations, SPLICE cross-combines and mutates
high-utility sources to converge on an optimal subset that maximizes
task performance. To the best of our knowledge, SPLICE
is the first framework to leverage gene splicing for the task of
utility-aware data source selection.
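
The active/inactive bookkeeping described above can be illustrated with a minimal sketch. It is not the SPLICE implementation: the names select_sources and toy_utility, the max_iters cap, and the additive toy scores are all assumptions made for illustration, and the cross-combination and mutation steps are not modeled. In practice the utility function would train and evaluate the downstream ML model on the selected sources.

```python
from typing import Callable, FrozenSet, Iterable, Set

# Hypothetical signature: maps a set of source names to a task utility score.
Utility = Callable[[FrozenSet[str]], float]


def select_sources(
    sources: Iterable[str],
    initial: Iterable[str],
    utility: Utility,
    max_iters: int = 100,  # assumed safety cap, not from the abstract
) -> Set[str]:
    """Greedy keep/add loop over an active and an inactive set."""
    active = set(initial)
    inactive = set(sources) - active

    for _ in range(max_iters):
        changed = False

        # Marginal gain of continuing to keep each active source.
        for s in sorted(active):
            gain = utility(frozenset(active)) - utility(frozenset(active - {s}))
            if gain < 0:  # keeping s hurts the task, so deactivate it
                active.remove(s)
                inactive.add(s)
                changed = True

        # Marginal gain of adding each inactive source to the active set.
        for s in sorted(inactive):
            gain = utility(frozenset(active | {s})) - utility(frozenset(active))
            if gain > 0:  # adding s helps the task, so activate it
                inactive.remove(s)
                active.add(s)
                changed = True

        if not changed:  # no source moved in a full pass
            break

    return active


# Toy stand-in for model utility: an additive score per source (assumed values).
TOY_SCORES = {"a": 3, "b": 2, "c": -1, "d": 0}


def toy_utility(selected: FrozenSet[str]) -> float:
    return float(sum(TOY_SCORES[s] for s in selected))
```

With these toy scores and an initial active set of c and d, select_sources(["a", "b", "c", "d"], ["c", "d"], toy_utility) moves c to the inactive set and activates a and b. Every move strictly increases the utility, so the loop stops once a full pass changes nothing.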
Personnel
Students: Ambarish Singh

