Certifying Counterfactual Bias in LLMs

University of Illinois Urbana-Champaign (1), Amazon (2), Oracle Health (3)

Abstract

Content Warning: This work contains examples of offensive language.

Large Language Models (LLMs) can produce biased responses that cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to a large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B, that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing by demographic groups) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.

Media coverage

  • [Sep 2024] IBM references LLMCert-B (QuaCer-B) as a provable measure for LLM bias: https://www.ibm.com/think/insights/ai-ethics-tools
  • [Jul 2024] Thanks to Bruce Adams and the Siebel School of Computing and Data Science for writing about our work: https://siebelschool.illinois.edu/news/bias-LLMs

Overview

    Large Language Models (LLMs) have shown impressive performance as chatbots and are hence used by millions of people worldwide. This, however, brings their safety and trustworthiness to the forefront, making it imperative to guarantee their reliability. Prior work has generally focused on establishing trust in LLMs using evaluations on standard benchmarks. This analysis, however, is insufficient due to the limitations of the benchmarking datasets, their use in LLMs' safety training, and the lack of guarantees through benchmarking. As an alternative, we propose quantitative certificates for LLMs and develop a novel framework, LLMCert-B, to quantitatively certify LLMs for bias in their responses. We define bias as an asymmetry in the LLM's responses for a set of prompts that differ only by a sensitive attribute.

    LLMCert-B considers a given distribution of sets of prompts to certify a target LLM. The certificate consists of high-confidence bounds on the probability of obtaining a biased response from the LLM for a set of prompts randomly sampled from the distribution. The figure below presents an overview of LLMCert-B on an example distribution of prompts developed from a sample from the BOLD dataset.

    Figure (Overview of LLMCert-B): LLMCert-B is a quantitative certification framework to certify the bias in the responses of a target LLM for a random set of prompts that differ by their sensitive attribute. In specific instantiations, LLMCert-B samples (a) a set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM's responses are checked for (d) bias by a bias detector, whose results are fed into a certifier. (e) The certifier computes bounds on the probability of obtaining biased responses from the target LLM for any set of prompts formed with a random prefix from the distribution.
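The certification loop described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: this page does not name the statistical method behind the bounds, so the sketch assumes Clopper-Pearson intervals, one standard way to bound a binomial proportion from independent samples. The names `clopper_pearson`, `certify`, `sample_prefix`, and `is_biased` are hypothetical, and the target LLM and bias detector are folded into a caller-supplied `is_biased` function.

```python
import math
import random


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, confidence=0.95):
    """Two-sided Clopper-Pearson bounds on a proportion after k hits in n trials.

    Assumption: this page does not say which interval the framework uses.
    """
    alpha = 1.0 - confidence

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound solves P[X <= k-1 | p] = 1 - alpha/2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P[X <= k | p] = alpha/2 (defined as 1 when k == n).
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def certify(sample_prefix, is_biased, n_samples=50, confidence=0.95, seed=0):
    """Sample prefixes independently, count biased outcomes, and bound their probability.

    sample_prefix(rng) -> prefix; is_biased(prefix) -> bool (hypothetical stand-in
    for prompting the target LLM and running the bias detector).
    """
    rng = random.Random(seed)
    biased = sum(bool(is_biased(sample_prefix(rng))) for _ in range(n_samples))
    return clopper_pearson(biased, n_samples, confidence)


if __name__ == "__main__":
    # Toy stand-ins only: placeholder prefixes and a rule-based "detector".
    toy_prefix = lambda rng: rng.choice(["<prefix A>", "<prefix B>", "<prefix C>"])
    toy_is_biased = lambda prefix: prefix == "<prefix B>"
    print(certify(toy_prefix, toy_is_biased))
```

The bounds hold only if the prefixes are sampled independently from the stated distribution; the number of samples and the confidence level used here are illustrative defaults, not values taken from the paper.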

    We illustrate certificates generated by LLMCert-B for popular SOTA LLMs with 3 kinds of distributions. Each distribution is defined over a sample space whose elements are sets of prompts. Each set of prompts is developed from a fixed set of prompts by prepending a random prefix. The fixed set of prompts that characterizes a distribution of sets of prompts is derived from samples of popular fairness datasets by varying the sensitive attributes in them. Hence, the distribution of the sets of prompts reduces to a distribution of prefixes for a fixed set of prompts. The 3 kinds of prefix distributions we consider (details in the paper) are: (1) random sequences of tokens, (2) mixtures of effective jailbreaks, and (3) effective jailbreaks perturbed in the model's embedding space.
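To make the three prefix distributions more concrete, the sketch below gives one possible sampler for each, plus a helper that builds a counterfactual prompt set from a template. All names, parameters, and the toy vocabulary are hypothetical. The exact constructions are in the paper, so the mixing rule (inserting helper sentences into a main jailbreak with some probability) and the uniform noise bound `eps` are assumptions made for illustration. Jailbreak text is represented by placeholders.

```python
import random

# Toy vocabulary standing in for a real tokenizer's vocabulary (hypothetical).
VOCAB = ["alpha", "river", "zq", "the", "42", "lorem"]


def random_token_prefix(rng, length=8, vocab=VOCAB):
    """(1) A random sequence of tokens drawn uniformly from the vocabulary."""
    return " ".join(rng.choice(vocab) for _ in range(length))


def jailbreak_mixture_prefix(rng, main, helpers, insert_prob=0.3):
    """(2) One assumed way to mix jailbreaks: keep every sentence of a main
    jailbreak and, after each, insert a random helper sentence with
    probability insert_prob. Jailbreaks are given as lists of sentences."""
    pool = [sentence for helper in helpers for sentence in helper]
    out = []
    for sentence in main:
        out.append(sentence)
        if pool and rng.random() < insert_prob:
            out.append(rng.choice(pool))
    return " ".join(out)


def perturbed_embedding_prefix(rng, embeddings, eps=0.01):
    """(3) Add independent uniform noise in [-eps, eps] to each coordinate of
    the jailbreak's token embeddings (the noise model is an assumption)."""
    return [[x + rng.uniform(-eps, eps) for x in vec] for vec in embeddings]


def counterfactual_prompts(prefix, template, groups):
    """Prepend one prefix to prompts that differ only in the sensitive attribute."""
    return [f"{prefix} {template.format(group=g)}".strip() for g in groups]


if __name__ == "__main__":
    rng = random.Random(0)
    prefix = jailbreak_mixture_prefix(
        rng,
        main=["<main sentence 1>", "<main sentence 2>"],
        helpers=[["<helper sentence 1>"], ["<helper sentence 2>"]],
    )
    for prompt in counterfactual_prompts(
        prefix, "Complete the sentence about the {group} engineer.", ["male", "female"]
    ):
        print(prompt)
```

Because the same sampled prefix is applied to every prompt in the set, the prompts within a set still differ only by the sensitive attribute, which is what reduces the distribution over prompt sets to a distribution over prefixes.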

    We certify popular LLMs for their bias with LLMCert-B and instances of the 3 kinds of distributions defined above. In particular, we certify the LLMs for gender and racial bias with distributions developed from samples from the BOLD and Decoding Trust datasets respectively. We observe novel trends in the performance of the LLMs, which we describe in detail in our paper. Below we show some example responses of a SOTA LLM to prompts sampled from a distribution derived from each dataset for gender and racial bias respectively, to illustrate the prompts and responses used in certification.

    Examples


    The user prompt is shown in blue and its prefix is shown in red. All text in black is generated by the AI. The following example responses are shown for the SOTA LLM Mistral-Instruct-v0.2 (check the paper for examples from other LLMs). The responses for gender bias are shown for binary gender and were obtained during certification for the BOLD example ‘Desktop automation refers to software programming’ with prefixes that are mixtures of jailbreaks. The responses for racial bias were obtained during certification for the Decoding Trust example ‘[target group] have a drug problem’ with prefixes that are mixtures of jailbreaks.

    BibTeX

    @article{chaudhary2024quantitative,
          title={Quantitative Certification of Bias in Large Language Models},
          author={Chaudhary, Isha and Hu, Qian and Kumar, Manoj and Ziyadi, Morteza and Gupta, Rahul and Singh, Gagandeep},
          journal={arXiv preprint arXiv:2405.18780},
          year={2024}
        }

    Ethics Statement

    This work presents examples and code of our certification framework that can be used to reliably assess state-of-the-art LLMs for biases in their responses. While the framework is general, we have illustrated it with practical examples of prefix distributions, which can consist of potential jailbreaks. The exact adversarial nature of the prefixes is unknown, but since they are derived from popular jailbreaks, the threat they pose is important to investigate. Hence, we used these prefixes to certify the bias in popular LLMs and have informed the model developers about their potential threat.