Certifying Large Language Models with LLMCert

TL;DR

Large Language Models (LLMs) have shown impressive performance as chatbots and are hence used by millions of people worldwide. This brings their safety and trustworthiness to the forefront, making it imperative to guarantee their reliability. Prior work has generally focused on establishing trust in LLMs through evaluations on standard benchmarks. Such analysis is insufficient, however, due to the limitations of benchmarking datasets, their use in LLMs' safety training, and the lack of guarantees that benchmarking provides. As an alternative, we propose quantitative certificates for LLMs. We present LLMCert, the first family of frameworks that gives formal guarantees on the behaviors of LLMs. Individual frameworks certify LLMs for counterfactual bias and knowledge comprehension. We provide details of each framework on its respective project page, linked below.

Certifying counterfactual bias with LLMCert-B


LLMCert-B
LLMCert-B is a quantitative certification framework that certifies counterfactual bias in the responses of a target LLM over a random set of prompts differing only in their sensitive attribute. In specific instantiations, LLMCert-B samples (a) a set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM's responses are checked for (d) bias by a bias detector, whose results are fed into a certifier. (e) The certifier computes bounds on the probability of obtaining biased responses from the target LLM for any prompt set formed with a random prefix from the distribution.
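
The loop below is a minimal sketch of this pipeline, not the published implementation: query_llm and detect_bias are hypothetical stand-ins for the target LLM interface and the bias detector, the prefix distribution is approximated by a uniform choice over a finite pool, and a Clopper-Pearson interval is used only as one standard way to obtain high-confidence bounds; LLMCert-B's actual certifier may differ.

```python
# Minimal sketch of the LLMCert-B certification loop (illustrative only).
# `query_llm` and `detect_bias` are hypothetical callables standing in for the
# target LLM interface and the bias detector; `prefix_pool` approximates the
# prefix distribution by a uniform choice over a finite pool.
import random
from scipy.stats import beta


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) Clopper-Pearson bounds on a binomial proportion."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper


def certify_bias(query_llm, detect_bias, prefix_pool, prompt_set,
                 n_samples=500, alpha=0.05):
    """Bound the probability that a random prefix leads to biased responses."""
    biased = 0
    for _ in range(n_samples):
        prefix = random.choice(prefix_pool)            # (a) sample a prefix
        prompts = [prefix + p for p in prompt_set]     # (b) form the prompt set
        responses = [query_llm(p) for p in prompts]    # (c) query the target LLM
        if detect_bias(responses):                     # (d) run the bias detector
            biased += 1
    return clopper_pearson(biased, n_samples, alpha)   # (e) certified bounds
```

With the stand-ins replaced by a real LLM client and bias detector, the returned interval bounds, with confidence 1 - alpha, the probability that a random prefix from the distribution yields biased responses.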

Check the project page here.

Certifying knowledge comprehension with LLMCert-C


LLMCert-C
LLMCert-C evaluates an LLM's knowledge comprehension by (1) using knowledge graphs to represent prompt distributions, (2) generating diverse prompts with natural variations, (3) evaluating responses against ground truth, and (4) producing formal certificates with probabilistic guarantees. By applying our framework to the precision medicine and general question-answering domains, we demonstrate how naturally occurring noise in prompts can affect response accuracy in state-of-the-art LLMs. We establish, with high confidence, novel performance hierarchies among state-of-the-art LLMs and provide quantitative metrics that can guide their future development and deployment in knowledge-critical applications.
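
As an illustration only, the sketch below mirrors steps (1)-(4) under simplifying assumptions: the knowledge graph is reduced to a list of (subject, relation, object) triples, and generate_prompt, check_answer, and query_llm are hypothetical helpers; the published framework's prompt generation over knowledge graphs and its certifier may differ.

```python
# Minimal sketch of the LLMCert-C certification loop (illustrative only).
# The knowledge graph is simplified to (subject, relation, object) triples;
# `generate_prompt`, `check_answer`, and `query_llm` are hypothetical helpers.
import random
from scipy.stats import beta


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) Clopper-Pearson bounds on a binomial proportion."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper


def certify_comprehension(query_llm, triples, generate_prompt, check_answer,
                          n_samples=500, alpha=0.05):
    """Bound the probability that the LLM answers a random knowledge prompt correctly."""
    correct = 0
    for _ in range(n_samples):
        # (1)-(2) sample a fact from the knowledge graph and phrase it with natural variation
        subject, relation, ground_truth = random.choice(triples)
        prompt = generate_prompt(subject, relation)
        # (3) evaluate the LLM's response against the ground-truth object
        if check_answer(query_llm(prompt), ground_truth):
            correct += 1
    # (4) certificate: high-confidence bounds on comprehension accuracy
    return clopper_pearson(correct, n_samples, alpha)
```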

Check the project page here.

Media coverage

  • [Sep 2024] IBM references LLMCert-B (QuaCer-B) as a provable measure for LLM bias: https://www.ibm.com/think/insights/ai-ethics-tools
  • [Jul 2024] Thanks to Bruce Adams and the Siebel School of Computing and Data Science for writing about our work: https://siebelschool.illinois.edu/news/bias-LLMs