BENCHMARK REPORT

LLM-as-a-Judge Is Broken.
We Tested a Deterministic Alternative.

April 14, 2026 · 8 min read · Stozer Team

TL;DR

LLM-as-a-judge is slow, expensive, and unreliable.

We built a deterministic alternative for detecting hallucinations in tool-calling AI agents. On the HaluEval benchmark (16,662 samples), it achieves an F1 of 99.3% with near-zero false positives (2 out of 16K) — in 8 seconds total, no API calls.


The Problem with LLM-as-a-Judge

The standard approach to detecting hallucinations is to ask another LLM: "Does this response match the source data?" This has three problems:

  1. The judge can hallucinate too. You're using a probabilistic system to verify a probabilistic system.
  2. It's slow and expensive. Each evaluation takes 2–10 seconds and costs $0.01–0.10.
  3. It's non-deterministic. Run the same evaluation twice, get different results.

For production systems processing thousands of agent interactions per hour, this doesn't scale.

Our Approach: Deterministic Grounding Validation

Stozer takes a different approach. Instead of asking an LLM to judge, it validates the agent's response directly against the tool outputs and retrieved context using deterministic rules.

The key insight: most "hallucinations" in tool-calling agents aren't creative fabrication — they're grounding failures. The agent gets accurate data from tools and then misreports it. The source of truth is already in the trace.

Stozer decomposes the response into claims ("There are 3 employees on leave") and checks each one against the available evidence (the tool returned 2 records). No LLM calls. Same input → same result, every time.
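One such rule can be sketched in a few lines. The helper names below are illustrative, not Stozer's actual API: extract the count the response claims, then compare it to the number of records the tool actually returned.

```typescript
// Hypothetical sketch of a single deterministic grounding rule.
// Function names and shapes are illustrative, not Stozer's real API.
type ToolRecord = Record<string, unknown>;

// Pull the first integer out of a claim like "There are 3 employees on leave".
function extractClaimedCount(claim: string): number | null {
  const match = claim.match(/\b(\d+)\b/);
  return match ? parseInt(match[1], 10) : null;
}

// Deterministic check: does the claimed count match the evidence?
function checkCountGrounding(claim: string, records: ToolRecord[]) {
  const claimed = extractClaimedCount(claim);
  return {
    grounded: claimed !== null && claimed === records.length,
    claimed,
    actual: records.length,
  };
}

const result = checkCountGrounding('There are 3 employees on leave today.', [
  { name: 'Sarah', status: 'on_leave' },
  { name: 'Michael', status: 'on_leave' },
]);
// result.grounded === false: claimed 3, evidence shows 2
```

Because the rule is pure string-and-number comparison, running it twice on the same trace necessarily yields the same verdict.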

Why This Works

LLM-as-a-judge evaluates semantics. Stozer evaluates facts.

Hallucinations in tool-calling agents are not semantic errors — they are data mismatches. The agent says "3 employees" when the tool returned 2. The agent says "$450" when the database returned $540. These are verifiable, structured contradictions.

Deterministic checks are sufficient, faster, and more reliable for this class of problem.


Benchmark Results

HaluEval QA (16,662 samples)

Li et al.'s HaluEval is a standard hallucination detection benchmark with question–answer pairs that are either faithful or hallucinated.

Metric            Value
Total samples     16,662
True Positives    8,215
True Negatives    8,329
False Positives   2
False Negatives   116
Precision         99.98%
Recall            98.6%
F1                99.3%
Total runtime     8 seconds

Two false positives out of 16,662. 98.6% of hallucinated answers caught.
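For readers who want to check the arithmetic, the headline metrics follow directly from the confusion-matrix counts above:

```typescript
// Recomputing precision, recall, and F1 from the reported counts
// (TP = 8,215, FP = 2, FN = 116).
function f1Stats(tp: number, fp: number, fn: number) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const m = f1Stats(8215, 2, 116);
// m.precision ≈ 0.9998, m.recall ≈ 0.9861, m.f1 ≈ 0.9929
```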

For comparison, LLM-as-a-judge approaches on this benchmark typically achieve F1 in the 85–95% range — with significantly higher false positive rates and 100–1000x more compute.

FaithBench (750 samples)

Bao et al.'s FaithBench represents a different class of problem: long-form paraphrased summaries without structured source data.

Stozer is not optimized for this setting — by design. Our focus is production AI agents where ground truth exists (tools, APIs, databases).

Metric          Value
Total samples   750
Precision       59.9%
Recall          96.8%
F1              74.0%
Runtime         4 seconds

We include this benchmark for transparency. Lower precision is expected when the source of truth is paraphrased prose rather than structured data.

Production Traces

Real agent traces from production deployments across HR, finance, and operations domains (~200 manually verified samples):

Metric      Value
Precision   97.9%
Recall      92.0%
F1          94.8%

How It Works (30-Second Version)

Agent says: "There are 3 employees on leave today."
Tool returned: [{ name: "Sarah", status: "on_leave" }, { name: "Michael", status: "on_leave" }]

Stozer:
  ✗ grounding.data_ignored — Response says 3, evidence shows 2
  → Score: 0.5
  → Evidence: count mismatch (claimed: 3, actual: 2)

No API calls. Under 50ms per trace. Deterministic.

Try It

npm install stozer-ai

import { StozerClient, TraceBuilder } from 'stozer-ai';

const client = new StozerClient();
const trace = new TraceBuilder({ traceId: 'test-001' })
  .addUserInput('How many employees are on leave?')
  .addToolCall('getLeaveRecords', { date: '2024-03-15' })
  .addToolOutput('getLeaveRecords', [
    { name: 'Sarah Johnson', status: 'on_leave' },
    { name: 'Michael Chen', status: 'on_leave' },
  ])
  .addFinalResponse('There are 3 employees on leave today.')
  .build();

const result = await client.evaluate(trace);
console.log(result.report.detectedFailures);

Free tier at app.stozer.dev. Documentation and examples on GitHub.

→ Run your first trace in under 60 seconds.



What We Call This

We call this approach deterministic grounding validation.

It replaces LLM-as-a-judge in systems where ground truth is available — tool outputs, API responses, database records, retrieved documents. Anywhere the source of truth is structured, deterministic checks beat probabilistic ones.


Stozer is a deterministic grounding validation engine. No LLM calls. Just rules, evidence, and math.