The Role of Data Selection on the Fairness of Large Language Models
Principal Investigator: Romila Pradhan
Large Language Models (LLMs) are now a core component of modern NLP systems, supporting applications such as search, question answering, summarization, and conversational assistants. Because these models are pretrained on large, heterogeneous corpora, their behavior is strongly shaped by the statistical structure of the data they observe. In practice, this means that unfairness in model behavior is often not an isolated failure mode but a consequence of training distributions that reflect real-world stereotypes and imbalances. Prior work has shown that models with similar architectures and objectives can exhibit substantially different bias profiles depending on how pretraining data is collected and curated. Yet, in much of the fairness literature, pretraining is treated as an upstream constant: large corpora are assembled once, and most fairness interventions are applied later during finetuning or at deployment time. These approaches can reduce measured bias, but they operate on representations that may already encode biased associations learned during pretraining.
In this project, we shift the focus upstream. Instead of treating the pretraining corpus as fixed, we study pretraining data selection as a primary experimental variable and ask how much it influences fairness outcomes. Concretely, we hold the model family fixed (a BERT-style masked language model) and construct controlled pretraining corpora with different demographic and representational properties. We then evaluate fairness both (i) immediately after pretraining and (ii) after finetuning on a downstream task. This design allows us to isolate the effect of pretraining data selection from confounding factors such as architecture changes or downstream-only interventions.
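As an illustration of what a controlled pretraining corpus could look like, the sketch below samples sentences so that a chosen share of them mention one demographic group. It is a minimal example under stated assumptions, not the project's actual pipeline: the term lists, the binary grouping, and the names `GROUP_TERMS`, `group_of`, and `build_corpus` are all hypothetical and chosen only for illustration.

```python
import random
import re

# Hypothetical term lists, used only for illustration (an assumption,
# not the project's actual demographic lexicon).
GROUP_TERMS = {
    "female": {"she", "her", "hers", "woman", "women"},
    "male": {"he", "him", "his", "man", "men"},
}


def group_of(sentence):
    """Return the one group a sentence mentions, or None if it mentions none or both."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    hits = [group for group, terms in GROUP_TERMS.items() if tokens & terms]
    return hits[0] if len(hits) == 1 else None


def build_corpus(sentences, female_share, size, seed=0):
    """Sample `size` group-mentioning sentences with a target share of female mentions."""
    rng = random.Random(seed)
    pools = {"female": [], "male": []}
    for sentence in sentences:
        group = group_of(sentence)
        if group is not None:
            pools[group].append(sentence)

    n_female = round(size * female_share)
    n_male = size - n_female
    if n_female > len(pools["female"]) or n_male > len(pools["male"]):
        raise ValueError("not enough sentences to reach the requested mix")

    # Sample without replacement from each pool, then shuffle the result.
    corpus = rng.sample(pools["female"], n_female) + rng.sample(pools["male"], n_male)
    rng.shuffle(corpus)
    return corpus
```

Calling `build_corpus` with several values of `female_share` on the same source sentences would yield corpora that differ only in their demographic mix, which is the kind of variable the design above holds apart from architecture and finetuning choices.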
Personnel
Students: Harshita Rathee

