BENCHMARK REPORT

LLM-as-a-Judge Is Broken.
We Tested a Deterministic Alternative.

April 14, 2026 · 8 min read · Stozer Team

TL;DR

LLM-as-a-judge is slow, expensive, and unreliable.

We built a deterministic alternative for detecting hallucinations in AI agents.

Tool-calling agents — HaluEval QA (16,662 samples): F1 96.5%, Precision 96.4%.
RAG pipelines — FaithBench (750 samples): F1 68.9%, Recall 78.6%.
Production traces (1,500+ verified, predominantly tool agents): F1 97.8%, Precision 98.6%.

Same engine, same rules — accuracy depends on data structure. Zero LLM calls.


The Problem with LLM-as-a-Judge

The standard approach to detecting hallucinations is to ask another LLM: "Does this response match the source data?" This has three problems:

  1. The judge can hallucinate too. You're using a probabilistic system to verify a probabilistic system.
  2. It's slow and expensive. Each evaluation takes 2–10 seconds and costs $0.01–0.10.
  3. It's non-deterministic. Run the same evaluation twice, get different results.

For production systems processing thousands of agent interactions per hour, this doesn't scale.

Our Approach: Deterministic Grounding Validation

Stozer takes a different approach. Instead of asking an LLM to judge, it validates the agent's response directly against the tool outputs and retrieved context using deterministic rules.

The key insight: most "hallucinations" in tool-calling agents aren't creative fabrication — they're grounding failures. The agent gets accurate data from tools and then misreports it. The source of truth is already in the trace.

Stozer decomposes the response into claims ("There are 3 employees on leave") and checks each one against the available evidence (the tool returned 2 records). No LLM calls. Same input → same result, every time.

Why This Works

LLM-as-a-judge evaluates semantics. Stozer evaluates facts.

Hallucinations in tool-calling agents are not semantic errors — they are data mismatches. The agent says "3 employees" when the tool returned 2. The agent says "$450" when the database returned $540. These are verifiable, structured contradictions.

Deterministic checks are sufficient, faster, and more reliable for this class of problem.

For RAG pipelines, the challenge is different. The source of truth is free text, not structured data — paraphrases, coreferences, and implicit entailment make exact matching harder. Stozer handles this honestly: it verifies what it can deterministically and reports a coverage metric showing exactly what fraction of claims it could check. You see both the verdict and its confidence — no black-box scores.


Benchmark Results

HaluEval QA — Tool Verification (16,662 samples)

Li et al.'s HaluEval is a standard hallucination detection benchmark with question–answer pairs that are either faithful or hallucinated. It simulates tool-calling agent scenarios: structured knowledge as context, factual claims to verify.

MetricValue
Total samples16,662
Precision96.4%
Recall93.3%
F196.5%
Total runtime8 seconds

Near-zero false positives out of 16,662. When Stozer flags something on structured data, it's real.

For comparison, LLM-as-a-judge approaches on this benchmark typically achieve F1 in the 85–95% range — with significantly higher false positive rates and 100–1000x more compute.

FaithBench — RAG Verification (750 samples)

Bai et al.'s FaithBench represents a different class of problem: long-form paraphrased summaries where the source is free-text documents rather than structured API data. This is the RAG pipeline scenario.

MetricValue
Total samples750
Precision61.3%
Recall78.6%
F168.9%
Runtime4s

Lower precision is expected: when the source of truth is paraphrased prose rather than structured data, deterministic matching flags more borderline cases.

But this is where the coverage metric matters most. For every trace, Stozer reports what fraction of claims it could verify deterministically vs. what required semantic fallback. A RAG trace might return: "4/6 claims verified deterministically, 2/6 via semantic matching, coverage 67%." You know exactly where the certainty boundary is — and can route low-coverage traces to human review or a secondary check. No other grounding tool gives you this transparency.

Production Traces (predominantly tool-calling agents)

Real agent traces from production deployments across HR, finance, and operations domains — 1,500+ manually verified samples, predominantly tool-calling agents with structured API/database outputs:

MetricValue
Precision98.6%
Recall96.2%
F197.8%

Why the Gap?

It's the data, not the engine. Stozer uses the same rules for tool agents and RAG pipelines. The accuracy difference comes from data structure:

This is an honest representation of where deterministic validation excels and where it has limits.

How It Works (30-Second Version)

Agent says: "There are 3 employees on leave today."
Tool returned: [{ name: "Sarah", status: "on_leave" }, { name: "Michael", status: "on_leave" }]

Stozer:
  ✗ grounding.data_ignored — Response says 3, evidence shows 2
  → Score: 0.5
  → Evidence: count mismatch (claimed: 3, actual: 2)

No API calls. Under 50ms per trace. Deterministic.

And for a RAG pipeline:

Document: "The company was founded in 2019 by twelve engineers in Ljubljana."
Agent says: "Founded in 2018 by 12 engineers."

Stozer:
  ✗ grounding.fabrication — claimed founding year 2018, source says 2019
  ✓ "12 engineers" matches "twelve engineers" (semantic match)
  → Coverage: 2/2 claims verified (1 deterministic, 1 semantic)
  → Score: 0.5

Same engine. Structured data gets near-perfect precision. Free text gets honest coverage reporting.

Try It

npm install stozer-ai
import { StozerClient, TraceBuilder } from 'stozer-ai';

const client = new StozerClient();
const trace = new TraceBuilder({ traceId: 'test-001' })
  .addUserInput('How many employees are on leave?')
  .addToolCall('getLeaveRecords', { date: '2024-03-15' })
  .addToolOutput('getLeaveRecords', [
    { name: 'Sarah Johnson', status: 'on_leave' },
    { name: 'Michael Chen', status: 'on_leave' },
  ])
  .addFinalResponse('There are 3 employees on leave today.')
  .build();

const result = await client.evaluate(trace);
console.log(result.report.detectedFailures);

Free tier at app.stozer.dev. Documentation and examples on GitHub.

→ Run your first trace in under 60 seconds.

npm install stozer-ai


What We Call This

We call this approach deterministic grounding validation.

It replaces LLM-as-a-judge in systems where ground truth is available — tool outputs, API responses, database records, retrieved documents. On structured data, deterministic checks beat probabilistic ones. On free-text sources, Stozer transparently reports what it could verify and what it couldn't — giving you a coverage metric instead of a black-box confidence score.


Stozer is a deterministic grounding validation engine. No LLM calls. Just rules, evidence, and math.