How to Detect Hallucinations in AI Agents: A Technical Guide

Learn proven techniques for detecting and preventing hallucinations in autonomous AI agents, from statistical methods to semantic verification and chain-of-thought analysis.

Anchorate Team · 6 min read

What Are AI Hallucinations?#

AI hallucinations occur when a language model generates information that appears plausible but is factually incorrect, fabricated, or unsupported by the input data. For autonomous AI agents — systems that take actions based on LLM outputs — hallucinations are not just an inconvenience; they can trigger real-world consequences.

Consider these scenarios:

  • A financial advisory agent hallucinating a regulatory clause that doesn't exist, leading to non-compliant investment decisions
  • A customer support agent fabricating a refund policy, creating legal liability
  • A medical triage agent inventing a drug interaction, potentially endangering patient safety
  • A legal research agent citing a case that was never decided by any court

When agents act on hallucinated information, the damage compounds — incorrect outputs become incorrect actions.

Types of Hallucinations#

Understanding the different types of hallucinations is crucial for building effective detection systems.

Intrinsic Hallucinations#

Outputs that directly contradict the provided source material. For example, an agent reads a document stating "revenue increased by 12%" but reports "revenue decreased by 12%."

Extrinsic Hallucinations#

Outputs that introduce information not present in any source material. The agent fabricates facts, citations, or reasoning that cannot be traced to any input.

Reasoning Hallucinations#

Errors in the agent's chain of thought where the logic is flawed even if individual facts are correct. For example, drawing incorrect conclusions from valid premises.

Temporal Hallucinations#

Generating information that was once true but is now outdated, or conflating events from different time periods.

Detection Techniques#

1. Semantic Consistency Checking#

Compare the agent's output against its input sources using embedding similarity. If the semantic distance between source material and generated claims exceeds a threshold, flag the output for review.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def check_consistency(source_text: str, generated_text: str, threshold: float = 0.3) -> bool:
    """Return True if the generated text is semantically consistent with the source."""
    source_embedding = model.encode(source_text)
    generated_embedding = model.encode(generated_text)

    # Cosine similarity between the two embeddings (1.0 = identical meaning)
    similarity = np.dot(source_embedding, generated_embedding) / (
        np.linalg.norm(source_embedding) * np.linalg.norm(generated_embedding)
    )

    # A semantic distance (1 - similarity) above the threshold fails the check
    return similarity > (1 - threshold)

2. Token Probability Analysis#

When token-level log probabilities are available, examining them can reveal uncertainty in the LLM's output. Low-confidence tokens in critical parts of the response (names, numbers, dates) are strong indicators of potential hallucination.

Key indicators:

  • Low average log probability across the response suggests the model is uncertain
  • Sudden probability drops at specific tokens indicate fabrication at those points
  • High entropy in the output distribution suggests the model is "guessing"
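The indicators above can be checked mechanically. Here is a minimal sketch, assuming the model API returns token-level log probabilities as (token, logprob) pairs; the threshold of -2.5 is an illustrative assumption, not a calibrated value:

```python
def flag_uncertain_tokens(token_logprobs, min_logprob=-2.5):
    """Return tokens whose log probability falls below the threshold.

    `token_logprobs` is a list of (token, logprob) pairs as returned by
    APIs that expose logprobs. The -2.5 cutoff is illustrative only.
    """
    return [tok for tok, lp in token_logprobs if lp < min_logprob]

def average_logprob(token_logprobs):
    """Mean log probability across the response; lower means less confident."""
    return sum(lp for _, lp in token_logprobs) / len(token_logprobs)
```

In practice you would restrict the per-token check to entity-like spans (names, numbers, dates), since function words naturally vary in probability.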

3. Chain-of-Thought Verification#

Analyze the agent's reasoning trace for logical consistency. Modern reasoning models (like those with "thinking" capabilities) expose their internal deliberation, which can be intercepted and analyzed.

Red flags in reasoning traces:

  • Contradictions between stated reasoning and final output
  • References to information not present in the context
  • Circular reasoning or unjustified logical leaps
  • Phrases like "I believe," "I think," or "it seems" in critical assertions
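A first pass over these red flags can be automated with simple heuristics. The sketch below assumes the reasoning trace is available as plain text; the phrase list is illustrative, not exhaustive, and quoted-span matching is a crude stand-in for real entailment checking:

```python
import re

# Hedging phrases that signal low confidence in critical assertions
HEDGE_PHRASES = ["i believe", "i think", "it seems", "probably", "might be"]

def scan_reasoning_trace(trace: str, context: str) -> list:
    """Return a list of red flags found in a reasoning trace."""
    flags = []
    lowered = trace.lower()
    for phrase in HEDGE_PHRASES:
        if phrase in lowered:
            flags.append(f"hedging language: '{phrase}'")
    # Flag quoted "facts" in the trace that never appear in the context
    for quoted in re.findall(r'"([^"]+)"', trace):
        if quoted.lower() not in context.lower():
            flags.append(f"unsupported reference: '{quoted}'")
    return flags
```

A production system would replace the substring check with semantic matching, but even this heuristic surfaces traces worth a closer look.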

4. Cross-Reference Validation#

For claims that can be objectively verified (dates, statistics, citations), implement automated fact-checking against trusted knowledge bases.

def validate_citations(generated_text: str, known_sources: dict) -> list:
    """Check whether cited sources exist and contain the claimed information.

    Assumes two helpers: `extract_citations`, which parses citation/claim
    pairs out of the text, and `source_contains_claim`, which verifies a
    claim against a source document (e.g. via entailment or embedding search).
    """
    citations = extract_citations(generated_text)
    invalid = []

    for citation in citations:
        if citation.reference not in known_sources:
            # Fabricated source: the strongest hallucination signal
            invalid.append({
                "citation": citation.reference,
                "reason": "Source does not exist",
                "severity": "HIGH"
            })
        elif not source_contains_claim(known_sources[citation.reference], citation.claim):
            # Real source, but it does not back the claim attributed to it
            invalid.append({
                "citation": citation.reference,
                "reason": "Source does not support this claim",
                "severity": "MEDIUM"
            })

    return invalid

5. Multi-Agent Debate#

Use multiple independent models to evaluate the same claim. If models disagree on factual assertions, the claim warrants human review. This technique is particularly effective because different models tend to hallucinate in different directions.
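The disagreement check reduces to measuring agreement across answers. A minimal sketch, assuming each answer comes from a different model queried with the same question (here represented as plain strings, with exact-match voting standing in for semantic comparison):

```python
from collections import Counter

def needs_review(answers: list, agreement_threshold: float = 0.5) -> bool:
    """Return True when independent models disagree enough to warrant review.

    `answers` holds one response per model. Agreement is the share of
    answers matching the most common one; the 0.5 default is illustrative.
    """
    if not answers:
        return True
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized) < agreement_threshold
```

Real deployments compare answers semantically rather than by exact string match, since models phrase the same fact differently.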

6. Temporal Drift Detection#

Monitor how an agent's outputs change over time for the same types of queries. Sudden shifts in factual claims (without corresponding changes in the underlying model or data) may indicate hallucination drift.
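One way to quantify this is to compare an agent's answers to a recurring query across two time windows. The sketch below uses normalized string comparison to stay dependency-free; a real system would compare embeddings:

```python
def answer_drift(previous: list, current: list) -> float:
    """Fraction (0.0-1.0) of current-window answers unseen in the previous window.

    A sudden jump in this score, absent model or data changes, is a
    candidate signal of hallucination drift.
    """
    if not current:
        return 0.0
    seen = set(a.strip().lower() for a in previous)
    novel = sum(1 for a in current if a.strip().lower() not in seen)
    return novel / len(current)
```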

Building a Hallucination Detection Pipeline#

A production-grade hallucination detection system combines multiple techniques in a pipeline:

  1. Capture — Log every agent input, reasoning trace, and output
  2. Score — Apply semantic consistency, confidence, and cross-reference checks
  3. Classify — Rate hallucination risk as LOW, MEDIUM, HIGH, or CRITICAL
  4. Act — Route high-risk outputs for human review before agent action
  5. Learn — Feed confirmed hallucinations back into the detection model
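The Score, Classify, and Act stages can be sketched as follows. The score names and risk thresholds are illustrative assumptions; in practice each score would come from one of the detectors described above:

```python
def classify_risk(scores: dict) -> str:
    """Map detector scores (each 0.0-1.0, higher = riskier) to a risk level.

    Thresholds are illustrative and should be calibrated per deployment.
    """
    worst = max(scores.values())
    if worst < 0.25:
        return "LOW"
    if worst < 0.5:
        return "MEDIUM"
    if worst < 0.75:
        return "HIGH"
    return "CRITICAL"

def route(record: dict) -> str:
    """Decide whether an output may proceed or needs human review."""
    level = classify_risk(record["scores"])
    record["risk"] = level  # persisted for the Capture/Learn stages
    return "human_review" if level in ("HIGH", "CRITICAL") else "proceed"
```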

Prevention Strategies#

Detection alone isn't enough. Here are proven strategies to reduce hallucination frequency:

  • Retrieval-Augmented Generation (RAG) — Ground responses in verified documents
  • Constrained decoding — Limit output vocabulary to domain-specific terms
  • Temperature tuning — Lower temperatures reduce creative (but potentially hallucinated) outputs
  • System prompt engineering — Explicitly instruct agents to say "I don't know" instead of guessing
  • Tool-use verification — Verify that tool outputs match agent claims about those outputs
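Two of these strategies, temperature tuning and system prompt engineering, show up directly in how a request is built. A minimal sketch, using a generic chat-completion payload shape as an assumption rather than any specific provider's API:

```python
def build_request(user_query: str, context: str) -> dict:
    """Assemble a hallucination-resistant request for a chat-style LLM API."""
    return {
        "temperature": 0.2,  # lower temperature curbs creative-but-risky sampling
        "messages": [
            {
                "role": "system",
                # Explicitly legitimize "I don't know" instead of guessing
                "content": (
                    "Answer only from the provided context. If the context "
                    "does not contain the answer, reply exactly: I don't know."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_query}",
            },
        ],
    }
```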

How Anchorate Detects Hallucinations#

Anchorate's governance platform includes built-in hallucination detection as part of its real-time risk analysis:

  • Cognitive Firewall — Intercepts agent reasoning traces to detect inconsistencies before actions are taken
  • Multi-perspective analysis — Virtual ML experts, compliance officers, and security auditors each evaluate incidents from their domain
  • Historical pattern matching — Compares current agent behavior against known hallucination patterns from the vector database
  • Confidence scoring — Every assertion in the analysis is annotated with an evidence-backed confidence score

Frequently Asked Questions#

Can hallucinations be completely eliminated?#

No. Hallucinations are an inherent property of language models, which generate text probabilistically. However, their frequency and impact can be dramatically reduced through detection, prevention, and governance strategies.

Which LLMs hallucinate the most?#

Hallucination rates vary by model, task type, and domain. Generally, larger models with more training data hallucinate less frequently, but no model is immune. The key is having detection systems regardless of which model you use.

How does RAG help reduce hallucinations?#

Retrieval-Augmented Generation grounds the model's responses in factual documents retrieved at inference time. By providing relevant source material directly in the context, the model is less likely to fabricate information. However, RAG doesn't eliminate hallucinations entirely — the model can still misinterpret or ignore retrieved documents.

Ready to govern your AI agents?

Deploy production-grade governance, compliance, and forensic analysis in under 24 hours.

Join the Waitlist