🔍 Challenge: Frontier Model Systems (FMS), applications built on LLMs and VLMs, are rapidly automating complex workflows, but they often exhibit unpredictable brittleness (biased, unsafe, or incorrect behavior), and existing benchmarks and adversarial tests provide no formal guarantees.
💡 Idea: We introduce black-box probabilistic certification for frontier model systems: formal guarantees that quantify the probability of desirable behavior over realistic prompt distributions.
📏 LLMCert is a family of certification frameworks for counterfactual fairness, robust reading comprehension, and agentic tool-selection reliability.
🧩 Lumos is a DSL that models prompt distributions as text-rich graphs and probabilistic programs to scale certification across domains and modalities.
✅ Takeaway: We provide a foundation for specifying and certifying trustworthiness in frontier AI systems.
We introduce Lumos, the first principled framework for specifying and formally certifying behaviors of Language Model Systems (LMS). Lumos is an imperative probabilistic programming DSL over graphs that generates i.i.d. prompts by sampling subgraphs from structured prompt distributions. It integrates with statistical certifiers to verify LMS behavior under arbitrary prompt distributions. Despite using only a small set of composable constructs, Lumos can encode existing LMS specifications, including complex relational and temporal properties. It also enables new safety specifications; for example, we develop the first specifications for vision-language models in autonomous driving. Using these, we show that Qwen-VL exhibits critical safety failures, producing unsafe responses with ≥90% probability in rainy right-turn scenarios. Lumos’s modular design makes specifications easy to modify as threats evolve and helps uncover concrete failure cases in state-of-the-art LMS.
ICLR 2025
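As a toy illustration of the idea behind Lumos-style prompt generation (not the actual DSL, whose constructs are richer), one can view a structured prompt distribution as a text-rich graph and draw i.i.d. prompts by sampling paths through it. All node names and texts below are hypothetical:

```python
import random

# Toy "prompt graph": each node carries a text fragment, and edges list
# allowed continuations. A sampled path corresponds to one subgraph/prompt.
PROMPT_GRAPH = {
    "scene":    ["weather", "maneuver"],
    "weather":  ["maneuver"],
    "maneuver": [],
}
NODE_TEXT = {
    "scene":    "You are driving in an urban scene.",
    "weather":  "It is raining heavily.",
    "maneuver": "Describe how to make a safe right turn.",
}

def sample_prompt(rng):
    """Sample one i.i.d. prompt by walking a random path from the root."""
    node, parts = "scene", []
    while True:
        parts.append(NODE_TEXT[node])
        successors = PROMPT_GRAPH[node]
        if not successors:
            return " ".join(parts)
        node = rng.choice(successors)

rng = random.Random(0)
prompts = [sample_prompt(rng) for _ in range(3)]
```

Because each draw is independent, the sampled prompts can feed directly into a statistical certifier that bounds the probability of a target behavior over the whole distribution.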
LLMCert-B is a quantitative certification framework that certifies counterfactual bias in a target LLM's responses to sets of prompts differing only in a sensitive attribute. In a given instantiation, LLMCert-B samples prefixes from a specified distribution and prepends them to a prompt set to form the prompts given to the target LLM. A bias detector checks the target LLM's responses, and its results feed into a certifier, which computes bounds on the probability of obtaining biased responses for any prompt set formed with a random prefix from the distribution.
AISTATS 2026 SPOTLIGHT
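The certifier step can be illustrated with a generic Hoeffding-style confidence interval on the observed fraction of biased responses. This is a sketch of the general statistical technique, not LLMCert-B's actual certifier, and the function name is hypothetical:

```python
import math

def bias_certificate(num_biased, num_samples, delta=0.05):
    """Two-sided Hoeffding bound on the probability that a random prefix
    from the distribution yields a biased response, holding with
    confidence at least 1 - delta over the sampled prefixes."""
    p_hat = num_biased / num_samples
    eps = math.sqrt(math.log(2 / delta) / (2 * num_samples))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# E.g., 10 biased responses out of 100 sampled prefixes.
lower, upper = bias_certificate(10, 100)
```

The bound is distribution-free: it holds for any prefix distribution, at the cost of requiring more samples than exact binomial (Clopper-Pearson) intervals for the same width.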
LLMCert-C evaluates an LLM's knowledge comprehension by (1) using knowledge graphs to represent prompt distributions, (2) generating diverse prompts with natural variations, (3) evaluating responses against ground truth, and (4) producing formal certificates with probabilistic guarantees. By applying our framework to precision medicine and general question-answering domains, we demonstrate how naturally occurring noise in prompts can affect response accuracy in state-of-the-art LLMs. We establish novel performance hierarchies with high confidence among SOTA LLMs and provide quantitative metrics that can guide their future development and deployment in knowledge-critical applications.
AIWILD @ ICLR 2026
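Steps (1)–(3) above can be sketched as sampling a fact from a knowledge graph and rendering it through paraphrase templates, with the tail entity serving as ground truth. The triples, templates, and function name below are illustrative assumptions, not LLMCert-C's actual data or API:

```python
import random

# Toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]

# Paraphrase templates supply the "natural variations" in the prompts.
TEMPLATES = [
    "What does {h} {r}?",
    "Name something that {h} is known to {r}.",
]

def make_prompt(rng):
    """Sample a triple and a template; ground truth is the tail entity."""
    h, r, t = rng.choice(TRIPLES)
    template = rng.choice(TEMPLATES)
    prompt = template.format(h=h, r=r.replace("_", " "))
    return prompt, t

rng = random.Random(1)
prompt, answer = make_prompt(rng)
```

Checking the model's response against `answer` yields a Bernoulli outcome per sampled prompt, which a statistical certifier (step 4) can aggregate into a probabilistic accuracy guarantee.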
LLMCert-T is the first statistical certification framework for robustness in agentic tool selection. It models tool selection as a Bernoulli random variable under a strong adaptive attacker that injects adversarial tools with misleading metadata. Our evaluation reveals severe fragility: under adaptive attacks, the certified bounds on accuracy collapse to nearly 0% for state-of-the-art models such as Llama-3.1 and Gemma-3.
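A certified lower bound of this kind can be obtained with an exact one-sided binomial (Clopper-Pearson-style) bound on the Bernoulli success probability. This stdlib-only sketch illustrates the general technique; it is not LLMCert-T's actual implementation, and the function names are hypothetical:

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def certified_accuracy_lower(k, n, delta=0.05):
    """Exact one-sided lower confidence bound on tool-selection accuracy,
    given k correct selections out of n i.i.d. trials: the largest p_lo
    with P(X >= k | p_lo) <= delta. binom_sf is increasing in p, so we
    can locate p_lo by bisection."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_sf(k, n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return lo
```

With few successes under an adaptive attack (e.g., 1 correct selection in 100 trials), the certified lower bound drops below 1%, matching the "collapse to nearly 0%" behavior described above.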
Isha Chaudhary
Vedaant Jain
Jehyeok Yeon
Kavya Sachdeva
Sayan Ranu
Gagandeep Singh
Qian Hu
Manoj Kumar
Morteza Ziyadi
Rahul Gupta