Hallucination Is a Distribution Problem, Not a Bug to Patch
Every few months a new model release announces reduced hallucination rates. Every few months, teams deploying those models encounter hallucinations anyway: in new forms and different proportions, but present. The expectation that hallucination is a bug that will eventually be fixed misunderstands what hallucination is. It is a predictable statistical property of autoregressive generation: a model that generates the most likely next token given its training distribution will occasionally generate tokens that are fluent and contextually plausible but factually false, because fluency and factual accuracy are correlated but not identical in the training distribution. Managing hallucination in production is an engineering problem, not a waiting problem.
Hallucination taxonomy: three distinct failure modes
Conflating all hallucinations into one category makes them impossible to address systematically. Three failure modes have distinct root causes and distinct mitigations.
Closed-domain factual hallucination occurs when a model is provided with a context (a document, a retrieved passage, a database record) and generates claims that are not supported by that context or contradict it. The model "reads" the provided text but generates output that diverges from what the text actually says. The root cause is that the model's pre-trained priors about what typically follows a given context are stronger than its grounding in the specific provided text. A model that has learned that medical documents typically mention drug dosages may generate a dosage figure that is not in the provided document because the context strongly activates the dosage-generating patterns from pre-training. The document provides a signal, but the pre-training prior overrides it.
Open-domain fabrication occurs when a model generates plausible-sounding but false facts from its parametric memory: named entities that do not exist (papers that were never written, companies that were never founded, people who do not hold the positions attributed to them), incorrect dates, fabricated statistics. The root cause is the FFN key-value memory structure covered in Post 4. Facts that appeared rarely in pre-training have weak keys; the model generates a value that fits the context without a strong memory signal to anchor it to a specific correct fact. The output is fluent, confident, and wrong.
Instruction hallucination occurs when the model follows the form of an instruction but violates its substance. A model asked to extract only explicitly stated claims generates inferred claims. A model asked to produce JSON with exactly five fields produces JSON with four or six. A model asked to respond only if it is confident generates uncertain responses without flagging uncertainty. The root cause is that RLHF training on helpfulness may have rewarded complete-seeming responses over strict instruction adherence in cases where annotators could not easily detect the violation. The model has learned that appearing to comply is rewarded; strict compliance is harder to verify.
These are separate failure modes with separate root causes. A mitigation that addresses one does not address the others. Retrieval grounding dramatically reduces closed-domain factual hallucination; it has essentially no effect on instruction hallucination. Constrained decoding reduces instruction hallucination; it has no effect on open-domain fabrication. Any strategy that treats hallucination as a single phenomenon will be effective against some of its forms and blind to others.
Retrieval grounding and its limits
Retrieval-augmented generation is the most effective single intervention for closed-domain factual hallucination. By placing the relevant document in context and instructing the model to answer only from the provided context, hallucination rates on factual tasks drop by 60 to 80% in benchmarked evaluations (Maynez et al., 2020). The model shifts from recalling facts from parametric memory to reading facts from context, and reading from context is substantially more reliable than memory recall.
The reduction is not elimination. Two residual failure modes remain after RAG is applied.
The first is faithful-but-wrong-source hallucination: the model correctly reads from the retrieved document, but the retrieved document is wrong or outdated. The retrieval system returned a stale version of the document, or the most semantically similar document was not the most factually relevant one. The model generates a confidently grounded response that is wrong because the ground truth was in a different document that was not retrieved. This is a retrieval quality problem, not a generation problem. Fixing it requires improving chunk quality, embedding model fit, and recall metrics as described in Posts 10 and 11. The generation layer cannot compensate for retrieval failures upstream.
The second is fabricated citation hallucination: the model is instructed to cite the source for every claim, but generates plausible-looking citations that do not correspond to the provided documents. The model generates the citation form without having actually located the relevant passage. This failure mode is particularly insidious because citations are the primary trust signal for grounded generation systems. Users and downstream systems interpret a citation as proof of grounding; a fabricated citation is indistinguishable from a real one without verification.
The mitigation for fabricated citations is precise: require the model to quote the exact passage it is citing, not just the source identifier. A model that must reproduce the exact quoted text cannot fabricate the citation without producing a quotation that does not match any provided document, and that mismatch can be verified programmatically with a simple string search. The verification cost is near zero. The coverage is high. This is one of the rare cases where a production mitigation is both cheap and effective.
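The verification step can be sketched in a few lines. This assumes the model has been instructed to emit the exact quote backing each claim, and that source documents are available as plain strings; `verify_quotes` and `_normalize` are illustrative helpers, not a library API, and the whitespace/case normalization is one possible choice for tolerating trivial formatting differences.

```python
import re

def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting
    # differences do not cause false mismatches.
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(response_quotes, source_documents):
    """Check that each quoted passage appears verbatim in at least
    one provided document. Returns the list of unverified quotes."""
    normalized_docs = [_normalize(d) for d in source_documents]
    unverified = []
    for quote in response_quotes:
        q = _normalize(quote)
        if not any(q in doc for doc in normalized_docs):
            unverified.append(quote)
    return unverified
```

An empty return list means every citation survived the string-match check; any non-empty list flags the response as likely containing fabricated citations.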
Self-consistency decoding: the statistical argument
If a model hallucinates by generating a plausible-but-wrong answer when it lacks a strong memory signal for the correct answer, then sampling multiple responses and taking the majority answer should filter out low-confidence hallucinations: the correct answer has a higher probability and will be the modal response across samples. Self-consistency (Wang et al., 2023) makes this argument formal: sample k responses, take the plurality answer, and use the consistency rate as a proxy for confidence.
In practice, self-consistency reduces hallucination rates by 30 to 50% on factual QA tasks at k=10 to 20. The reduction is larger for tasks where the hallucination involves generating a specific fabricated fact (a wrong date, a wrong name) and smaller for tasks where the hallucination is a subtle mischaracterization (a claim that is technically true but misleadingly framed). Self-consistency is a probability argument; it filters out noise but not systematic bias. If the model consistently misremembers a fact, k=20 samples will all produce the same wrong answer.
The cost is the obvious one: k=10 means 10 model calls per request. At $0.01 per call, 1 million requests per day costs $100,000 per day at k=10. Self-consistency is economically viable only for high-value queries where the cost of a hallucinated answer exceeds $0.10 per request. For consumer-facing applications at high volume, it is almost never the right default.
A more efficient variant is adaptive self-consistency: use a cheap heuristic (token log-probability entropy) to identify low-confidence responses, then apply self-consistency only to those requests. A response generated with low token entropy reflects high model confidence and is unlikely to be a hallucination, so it needs no extra sampling; only requests where the model's confidence signal is weak are routed to multi-sample verification. This can reduce the k=10 cost to a k=1.3 effective average while retaining most of the hallucination reduction. The engineering requirement is a log-probability extraction layer on the generation endpoint, which most inference frameworks expose natively.
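The routing logic can be sketched as follows. The callables, the entropy threshold of 1.0 nats, and the shape of the per-token distributions are all assumptions for illustration: `generate_once` stands in for a single model call that also returns per-token probability distributions, and `sample_answer` for one temperature-sampled call.

```python
import math
from collections import Counter

def mean_token_entropy(token_distributions):
    """Mean Shannon entropy (nats) over per-token probability
    distributions returned by the inference framework."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_distributions]
    return sum(entropies) / len(entropies)

def adaptive_generate(generate_once, sample_answer,
                      entropy_threshold=1.0, k=10):
    """Single-call fast path when entropy is low; route only
    uncertain responses to multi-sample self-consistency."""
    answer, dists = generate_once()
    if mean_token_entropy(dists) <= entropy_threshold:
        return answer  # confident: keep the single sample
    # Uncertain: fall back to a plurality vote over k samples.
    counts = Counter(sample_answer() for _ in range(k))
    return counts.most_common(1)[0][0]
```

The effective average cost depends on what fraction of traffic crosses the entropy threshold; the threshold itself should be calibrated against a labeled sample of hallucinated and correct responses.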
Uncertainty quantification: token log-probability as a signal
Every autoregressive generation step produces a probability distribution over the vocabulary. The probability assigned to the generated token is a rough signal of the model's confidence in that choice. The entropy of the probability distribution is a measure of uncertainty: low entropy means the model strongly preferred the generated token; high entropy means many tokens had similar probability.
Log-probability calibration for factual accuracy is imperfect but useful. Models tend to assign lower probability to tokens in hallucinated spans than to tokens in correct spans, though the correlation is loose. Studies on calibration (Guo et al., 2017; Kadavath et al., 2022) find that model self-reported confidence is systematically overconfident: a model that assigns 90% probability to an answer is correct approximately 75 to 80% of the time on factual tasks. The overconfidence is larger for rare facts and domain-specific knowledge, exactly the cases where hallucination risk is highest.
Despite imperfect calibration, token log-probability is useful as a coarse filter. Computing the mean token log-probability over the response's factual claim spans and flagging responses below a threshold (typically below -2.0 nats per token) catches a meaningful fraction of hallucinated responses with a low false positive rate on truthful responses. This signal can be used to route low-confidence responses to human review or to trigger a second-pass self-consistency check. The computational cost is zero: log-probabilities are produced during generation and require no additional inference pass.
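A minimal version of that filter, assuming the inference layer returns per-token log-probabilities and that factual claim spans have already been identified as token index ranges (the span-identification step is outside this sketch):

```python
def flag_low_confidence(token_logprobs, claim_spans, threshold=-2.0):
    """Flag a response when the mean log-probability over the tokens
    of its factual-claim spans falls below the threshold (nats).
    claim_spans is a list of (start, end) token index pairs."""
    claim_lps = [lp for start, end in claim_spans
                 for lp in token_logprobs[start:end]]
    if not claim_lps:
        return False  # no factual claims identified, nothing to flag
    return sum(claim_lps) / len(claim_lps) < threshold
```

Averaging only over claim spans rather than the whole response matters: filler tokens tend to have high probability and would otherwise dilute the signal from the uncertain factual tokens.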
Asking the model to self-assess its confidence produces better-calibrated estimates than log-probability alone for multi-step reasoning tasks. Kadavath et al. (2022) showed that models can be reasonably well-calibrated when explicitly asked "Is the above answer correct? P(correct) = ?" The explicit self-assessment leverages the model's reasoning capabilities rather than just its generation probabilities. The combination of implicit probability signal and explicit self-assessment provides a practical uncertainty quantification layer for production systems: use log-probability as the cheap first-pass filter, and explicit self-assessment on the subset of responses that pass the log-probability threshold but involve high-stakes claims.
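The self-assessment step reduces to a prompt template and a parser for the model's reply. The template wording below is an assumption modeled on the Kadavath et al. (2022) setup, not a prescribed format, and `parse_p_correct` is an illustrative helper.

```python
import re

SELF_ASSESS_PROMPT = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the above answer correct? Reply with a single line:\n"
    "P(correct) = <probability between 0 and 1>"
)

def parse_p_correct(reply: str):
    """Extract the self-reported probability from the model reply,
    or None if the reply does not follow the requested format."""
    m = re.search(r"P\(correct\)\s*=\s*([01](?:\.\d+)?)", reply)
    return float(m.group(1)) if m else None
```

A `None` result (the model ignored the format) should itself be treated as a low-confidence signal and routed to the same escalation path as a low probability.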
Constitutional self-critique: the 30% ceiling
Constitutional AI (Bai et al., 2022) introduced the self-critique pattern: after generating a response, the model critiques the response against a set of principles, identifies violations, and revises. For harmlessness violations, this works well. For factual hallucinations, the self-critique pattern has a structural limitation: the model that generated the hallucination is the same model that is asked to detect it.
Studies of self-critique for factual accuracy consistently find that models miss 25 to 35% of their own factual errors. The errors they miss are precisely the ones where the model had a strong but wrong prior: if the model confidently believes the wrong answer, its critique will not flag it as wrong. The model cannot identify an error it does not know is an error. This is not a failure of the self-critique mechanism; it is a fundamental property of using the same knowledge distribution for generation and verification.
The self-critique pattern is valuable for catching formatting violations, instruction non-adherence, and obvious logical contradictions. It is not reliable for catching factual errors on claims the model does not know are wrong. Relying on self-critique as the primary factual accuracy check will produce a system that appears to have a verification layer while actually having a 25 to 35% blind spot on the errors that matter most.
A more reliable variant: use a different model family for the critique step. A Claude-class model critiquing a GPT-4 response will catch errors that GPT-4's self-critique misses, because the two models have different training distributions and therefore different blind spots. Cross-model verification is more expensive (two model calls per response) but substantially more reliable than same-model self-critique. The key insight is that the blind spots of two independently trained model families have low overlap: errors that are invisible to one model because of its training distribution are often visible to a model trained on a different corpus with different sources.
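The orchestration around cross-model critique is thin; most of the work is in the critic prompt. In this sketch, `critic` is a hypothetical callable wrapping a call to the second model family that returns True when it judges a claim supported; the escalate-on-any-rejection policy is one conservative choice.

```python
def cross_model_verify(response_claims, critic):
    """Ask a critic model from a different family to judge each
    factual claim; escalate the response if any claim is rejected.
    Returns (needs_escalation, list_of_rejected_claims)."""
    verdicts = {claim: critic(claim) for claim in response_claims}
    rejected = [claim for claim, ok in verdicts.items() if not ok]
    return len(rejected) > 0, rejected
```

Claim-level verdicts, rather than a single pass/fail on the whole response, let the escalation path show a human reviewer exactly which claims the critic rejected.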
The production mitigation stack
No single mitigation eliminates hallucination. The production approach is a stack of interventions applied in order of cost-effectiveness, where each layer reduces residual hallucination from the previous.
Retrieval grounding is the first and most effective intervention. Apply it to any task that involves factual claims about specific entities, dates, or statistics. The implementation requires a retrieval infrastructure and a grounding instruction in the prompt, but the cost is dominated by retrieval latency and embedding cost, not model inference cost. Expected reduction: 60 to 80% of closed-domain factual hallucinations.
Citation verification is the second layer, applicable to RAG systems. Require the model to quote the exact passage supporting each factual claim. Programmatically verify that the quoted passage exists in the provided context using string matching. Flag responses with quotations that do not match any provided document as likely hallucinations. This catches the fabricated-citation failure mode at near-zero additional cost and is one of the highest-leverage interventions available for grounded generation systems.
Confidence filtering using token log-probability identifies low-confidence responses for escalation. Flag responses where the mean log-probability of factual claim tokens falls below a threshold. Route flagged responses to human review or to a second-pass self-consistency check. The threshold is calibrated per domain: a legal document extraction system needs a different threshold than a general knowledge QA system. Expected reduction in hallucination rate for flagged responses: 30 to 50% through escalated verification.
Cross-model verification for high-value requests: send the response to a second model from a different family for critique on specific factual claims. This is expensive (2x model cost) but effective for the subset of requests where the cost of a hallucinated answer is significant. Define this subset explicitly by request type, not by attempting to identify hallucinated requests after the fact. The request type definition is a business decision, not a technical one: which query categories carry enough risk that 2x model cost is justified?
Human escalation is the final backstop. Define the categories of factual claims that require human verification regardless of model confidence: medical dosages, legal citations, financial figures used in decision-making. Treat any hallucination that passes all automated checks in these categories as an expected operational risk, not a system failure, and include human review in the workflow by design. The categories are defined by regulatory and liability requirements, not by observed error rates.
No stack eliminates hallucination. The goal is to reduce it to a rate where the expected cost of residual errors is acceptable given the value of the application. That rate should be defined explicitly before deployment, not discovered empirically after users encounter errors. A 2% residual hallucination rate on a low-stakes recommendation system is acceptable. A 0.1% residual hallucination rate on a medical dosage system may not be. The engineering question is not "can we reach zero?" but "what rate does this application require, and which combination of layers gets us there at acceptable cost?"
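That cost-versus-rate question is an explicit calculation, not a judgment call. The numbers below are assumed for illustration, not benchmarks:

```python
def expected_daily_error_cost(residual_rate, cost_per_error, daily_volume):
    """Expected daily cost of residual hallucinations: the budget
    against which an additional mitigation layer must justify itself."""
    return residual_rate * cost_per_error * daily_volume

# Assumed example: a 2% residual rate at $0.50 per bad answer and
# 100k requests/day costs $1,000/day in expected errors, so a layer
# that halves the rate is worth up to $500/day in added spend.
```

Running this calculation per query category, rather than globally, is what justifies applying expensive layers like cross-model verification only to the high-stakes subset.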
References
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. "On Faithfulness and Factuality in Abstractive Summarization." ACL, 2020. https://arxiv.org/abs/2005.00661
Wang, X., Wei, J., Schuurmans, D., et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR, 2023. https://arxiv.org/abs/2203.11171
Kadavath, S., Conerly, T., Askell, A., et al. "Language Models (Mostly) Know What They Know." arXiv, 2022. https://arxiv.org/abs/2207.05221
Bai, Y., Kadavath, S., Kundu, S., et al. "Constitutional AI: Harmlessness from AI Feedback." Anthropic, 2022. https://arxiv.org/abs/2212.08073
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. "On Calibration of Modern Neural Networks." ICML, 2017. https://arxiv.org/abs/1706.04599
Manakul, P., Liusie, A., and Gales, M. J. F. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP, 2023. https://arxiv.org/abs/2303.08896
Min, S., Krishna, K., Lyu, X., et al. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." EMNLP, 2023. https://arxiv.org/abs/2305.14251