TruthfulQA for hallucinations — how to use it
TruthfulQA (Lin et al. 2021) is an 817-question benchmark designed to test whether language models repeat popular misconceptions. Questions span 38 categories, including health, law, finance, politics, and conspiracies. The trick: each question has a "tempting" wrong answer (a common myth or misconception) that models can easily reproduce, and a less obvious correct answer. It is scored in two multiple-choice modes: MC1 (one correct choice; the model gets credit only if it ranks that choice highest; deterministic) and MC2 (several correct choices; the score is the normalized probability the model assigns to the true ones). There are also generation modes, scored by GPT-judge or by human raters.
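To make the two multiple-choice scores concrete, here is a minimal sketch of how MC1 and MC2 can be computed for one question. It assumes you already have one log-probability per answer choice from your model and a parallel list of 0/1 labels (1 = true answer); the function names `mc1_score` and `mc2_score` are illustrative, not part of any official harness.

```python
import math

def mc1_score(logprobs, labels):
    """Return 1.0 if the highest-scoring choice is labeled true, else 0.0."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return 1.0 if labels[best] == 1 else 0.0

def mc2_score(logprobs, labels):
    """Return the share of probability mass on true choices,
    normalized over all choices for the question."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)

# Hypothetical scores for a three-choice question.
example_logprobs = [-1.0, -2.0, -3.0]
print(mc1_score(example_logprobs, [0, 1, 0]))  # the top-ranked choice is false here
print(mc2_score(example_logprobs, [0, 1, 1]))  # mass on the two true choices
```

The benchmark score is the mean of the per-question values. MC1 gives a hard 0 or 1 per question, while MC2 gives partial credit, which is why MC1 is the simpler number to track over time.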
How to use it for monitoring: pick MC1 for daily monitoring, since it is deterministic and cheap to run. Frontier models score 60–75 percent on MC1, which is well below saturation, so drops are meaningful. A 100-item TruthfulQA subset, kept fixed and run daily, catches one specific failure mode: the model has become more willing to confidently state popular falsehoods. This often correlates with sampler changes (lower effective temperature, more greedy decoding) or RLHF policy shifts.
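A rough sketch of the daily check, under stated assumptions: the subset is chosen once with a fixed seed so the same items run every day, and a drop is flagged only when it exceeds a couple of binomial standard errors. The helper names, the seed, and the threshold `z=2.0` are all illustrative choices, and treating items as independent draws is a simplification.

```python
import math
import random

def pick_fixed_subset(question_ids, k=100, seed=0):
    """Pick the same k question ids on every run by seeding the sampler."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(question_ids), k))

def accuracy(per_item_scores):
    """Mean of per-question MC1 scores (each 0.0 or 1.0)."""
    return sum(per_item_scores) / len(per_item_scores)

def is_meaningful_drop(baseline_acc, today_acc, n, z=2.0):
    """Flag a drop larger than z binomial standard errors of the baseline."""
    stderr = math.sqrt(baseline_acc * (1.0 - baseline_acc) / n)
    return (baseline_acc - today_acc) > z * stderr

subset = pick_fixed_subset(range(817), k=100)
print(len(subset))
print(is_meaningful_drop(0.70, 0.65, n=100))  # 5-point drop
print(is_meaningful_drop(0.70, 0.58, n=100))  # 12-point drop
```

With 100 items and a baseline near 70 percent, one standard error is about 4.6 points, so under this rule a 5-point dip is within noise and only a drop of roughly 9 points or more is flagged. A larger subset tightens that threshold at the cost of a longer daily run.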
What TruthfulQA doesn't catch: domain-specific hallucinations in your own workload (medical, code, or product-specific). For those you need a private golden-prompt suite with known-correct answers. Best practice: run TruthfulQA-MC1-100 daily as a general hallucination canary, plus a private 50-item domain set scored by exact match or by LLM-as-judge with a frozen judge, so that score changes reflect the monitored model rather than the judge. ModelWatch ships TruthfulQA-MC1 in its default suite.
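For the private domain set, exact-match scoring could look something like the sketch below. It assumes the golden set is a list of (prompt, expected answer) pairs and that `answer_fn` is whatever callable wraps your model; both names, the normalization rules, and the sample data are hypothetical, and none of this is ModelWatch's API.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_rate(golden, answer_fn):
    """golden: list of (prompt, expected_answer) pairs.
    answer_fn: callable mapping a prompt to the model's answer string."""
    hits = sum(normalize(answer_fn(p)) == normalize(a) for p, a in golden)
    return hits / len(golden)

# Stand-in for a real model call, used only to show the shape of the API.
canned = {"Max dose unit for drug X label?": "10 mg.", "Default port?": "8081"}
golden_set = [("Max dose unit for drug X label?", "10 mg"), ("Default port?", "8080")]
print(exact_match_rate(golden_set, lambda prompt: canned[prompt]))
```

Exact match works only when answers are short and canonical. For free-form answers the frozen LLM judge is the better fit, at the cost of an extra model call per item.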