The Center for Education and Research in Information Assurance and Security (CERIAS)

Probabilistic Red-Teaming for Large Language and Vision-Language Models

Principal Investigator: Ruqi Zhang

As large language models (LLMs) and vision-language models (VLMs) grow more capable and widely deployed, they have also become increasingly susceptible to jailbreaks that bypass safety guardrails. Traditional red-teaming approaches often depend on heuristic search, genetic algorithms, or manually curated prompt pools, leading to limited coverage and poor scalability. These methods optimize adversarial examples one at a time, failing to capture the broader distribution of vulnerabilities underlying model behavior.

This project develops probabilistic red-teaming, a new framework that reframes adversarial prompt discovery as a problem of probabilistic inference. We introduce VERA (Variational infErence fRamework for jAilbreaking), which models jailbreak prompting as variational inference over the posterior distribution of adversarial prompts. A lightweight attacker model is trained to approximate this posterior, enabling it to efficiently generate diverse, high-quality jailbreak prompts for unseen queries without repeated optimization. By treating red-teaming as inference, VERA captures the underlying uncertainty and structure of vulnerabilities, providing a richer characterization of model failure modes.
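The inference view above can be illustrated with a toy sketch. Here the "posterior over adversarial prompts" is taken to be proportional to a prior times a jailbreak-success score, and a variational distribution is fit to it by maximizing the evidence lower bound. All names and the categorical setup are illustrative assumptions; the actual VERA attacker is a trained language model, not a categorical over a fixed pool.

```python
# Toy sketch (hypothetical setup): posterior over adversarial prompts
#   p(x | success) ∝ p(x) * r(x),
# where p(x) is a prior over candidate prompts and r(x) a success score.
# Fit a variational q_theta by maximizing the ELBO:
#   E_q[ log p(x) + log r(x) - log q(x) ].
import numpy as np

K = 5                                      # tiny candidate-prompt pool
log_prior = np.full(K, -np.log(K))         # uniform prior over prompts
r = np.array([0.05, 0.6, 0.1, 0.9, 0.2])   # toy jailbreak-success scores

theta = np.zeros(K)                        # logits of the variational dist.
for step in range(500):
    q = np.exp(theta - theta.max())
    q /= q.sum()
    # Exact ELBO gradient for a categorical q with logits theta:
    # grad_k = q_k * (f_k - E_q[f]),  f = log p + log r - log q
    f = log_prior + np.log(r) - np.log(q)
    theta += 0.5 * q * (f - np.dot(q, f))

# The true posterior is available in closed form here, so we can compare.
posterior = np.exp(log_prior) * r
posterior /= posterior.sum()
print(np.round(q, 3))          # variational approximation
print(np.round(posterior, 3))  # q converges to the exact posterior
```

In the full framework the expectation is intractable and the attacker model is updated with stochastic gradients, but the objective has the same shape: concentrate probability mass on prompts that score well while staying close to the prior, which is what yields diverse rather than single-point attacks.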

Extending this approach to multimodal models, VERA-V formulates jailbreak discovery as learning a joint distribution over coupled text-image prompts. This probabilistic perspective allows for coordinated attacks that combine linguistic and visual perturbations to evade detection and safety filters. VERA-V integrates complementary mechanisms (typography-based text embedding, diffusion-guided adversarial image synthesis, and structured visual distractors) to fragment model attention and expose hidden weaknesses in visual reasoning. Empirical results demonstrate substantial improvements over existing methods.
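In notation, the multimodal extension can be sketched as follows. The symbols below are assumptions chosen for exposition, not taken from the papers:

```latex
% x_t : adversarial text prompt,  x_i : adversarial image,
% J   : the event that the target model is jailbroken.
% Posterior over coupled text-image prompts:
p^{*}(x_t, x_i \mid \mathcal{J}) \;\propto\; p(x_t, x_i)\, p(\mathcal{J} \mid x_t, x_i)

% Joint variational attacker q_\phi, trained by maximizing the ELBO:
\mathcal{L}(\phi) \;=\; \mathbb{E}_{q_\phi(x_t, x_i)}
  \bigl[\, \log p(x_t, x_i) + \log p(\mathcal{J} \mid x_t, x_i)
          - \log q_\phi(x_t, x_i) \,\bigr]
```

The key point is that the text and image components are sampled jointly rather than optimized independently, which is what permits coordinated attacks across the two modalities.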

Together, these efforts establish a principled, scalable foundation for red-teaming large language and multimodal models. By combining probabilistic inference with adversarial testing, this project aims to (1) advance the science of model evaluation under uncertainty, (2) provide actionable insights into real-world vulnerabilities, and (3) lay the groundwork for trustworthy, probabilistically aligned AI systems.


Representative Publications

  • VERA: Variational Inference Framework for Jailbreaking Large Language Models. Neural Information Processing Systems (NeurIPS), 2025

  • VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models. Preprint, 2025