## The Problem with LLM-as-a-Judge
The standard approach to detecting hallucinations is to ask another LLM: "Does this response match the source data?" This has three problems:
- The judge can hallucinate too. You're using a probabilistic system to verify a probabilistic system.
- It's slow and expensive. Each evaluation takes 2–10 seconds and costs $0.01–0.10.
- It's non-deterministic. Run the same evaluation twice, get different results.
For production systems processing thousands of agent interactions per hour, this doesn't scale.
## Our Approach: Deterministic Grounding Validation
Stozer takes a different approach. Instead of asking an LLM to judge, it validates the agent's response directly against the tool outputs and retrieved context using deterministic rules.
The key insight: most "hallucinations" in tool-calling agents aren't creative fabrication — they're grounding failures. The agent gets accurate data from tools and then misreports it. The source of truth is already in the trace.
Stozer decomposes the response into claims ("There are 3 employees on leave") and checks each one against the available evidence (the tool returned 2 records). No LLM calls. Same input → same result, every time.
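The count check is simple to illustrate. Here is a minimal sketch of the idea — illustrative only, not Stozer's internals; the regex-based claim extraction and the `checkClaimedCount` helper are assumptions made for this example:

```typescript
// Illustrative sketch of deterministic grounding validation (NOT Stozer's
// actual implementation): extract a claimed count from the response and
// compare it against the number of records the tool returned.

interface LeaveRecord {
  name: string;
  status: string;
}

function checkClaimedCount(response: string, records: LeaveRecord[]): string {
  // Naive claim extraction: take the first integer in the response.
  const match = response.match(/\d+/);
  if (!match) return "no countable claim found";

  const claimed = parseInt(match[0], 10);
  const actual = records.length;

  // Deterministic comparison: the same input always yields the same verdict.
  return claimed === actual
    ? "grounded"
    : `count mismatch (claimed: ${claimed}, actual: ${actual})`;
}

const records: LeaveRecord[] = [
  { name: "Sarah", status: "on_leave" },
  { name: "Michael", status: "on_leave" },
];

console.log(checkClaimedCount("There are 3 employees on leave today.", records));
// → "count mismatch (claimed: 3, actual: 2)"
```

Real claim decomposition has to handle far more than bare integers, but the core property carries over: the verdict is a pure function of the trace, with no model in the loop.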
## Why This Works
LLM-as-a-judge evaluates semantics. Stozer evaluates facts.
Hallucinations in tool-calling agents are not semantic errors — they are data mismatches. The agent says "3 employees" when the tool returned 2. The agent says "$450" when the database returned $540. These are verifiable, structured contradictions.
Deterministic checks are sufficient, faster, and more reliable for this class of problem.
## Benchmark Results
### HaluEval QA (16,662 samples)
Li et al.'s HaluEval is a standard hallucination detection benchmark with question–answer pairs that are either faithful or hallucinated.
| Metric | Value |
|---|---|
| Total samples | 16,662 |
| True Positives | 8,215 |
| True Negatives | 8,329 |
| False Positives | 2 |
| False Negatives | 116 |
| Precision | 99.98% |
| Recall | 98.6% |
| F1 | 99.3% |
| Total runtime | 8 seconds |
Two false positives out of 16,662. 98.6% of hallucinated answers caught.
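The headline metrics follow directly from the confusion matrix above; as a quick sanity check:

```typescript
// Recompute precision, recall, and F1 from the confusion-matrix values in the table.
const tp = 8215; // hallucinated answers correctly flagged
const tn = 8329; // faithful answers correctly passed
const fp = 2;    // faithful answers wrongly flagged
const fn = 116;  // hallucinated answers missed

const precision = tp / (tp + fp);
const recall = tp / (tp + fn);
const f1 = (2 * precision * recall) / (precision + recall);

console.log(`samples:   ${tp + tn + fp + fn}`);             // 16662
console.log(`precision: ${(precision * 100).toFixed(2)}%`); // 99.98%
console.log(`recall:    ${(recall * 100).toFixed(1)}%`);    // 98.6%
console.log(`f1:        ${(f1 * 100).toFixed(1)}%`);        // 99.3%
```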
For comparison, LLM-as-a-judge approaches on this benchmark typically achieve F1 in the 85–95% range — with significantly higher false positive rates and 100–1000x more compute.
### FaithBench (750 samples)
Bao et al.'s FaithBench represents a different class of problem: long-form paraphrased summaries without structured source data.
Stozer is not optimized for this setting — by design. Our focus is production AI agents where ground truth exists (tools, APIs, databases).
| Metric | Value |
|---|---|
| Total samples | 750 |
| Precision | 59.9% |
| Recall | 96.8% |
| F1 | 74.0% |
| Total runtime | 4 seconds |
We include this benchmark for transparency. Lower precision is expected when the source of truth is paraphrased prose rather than structured data.
### Production Traces
Real agent traces from production deployments across HR, finance, and operations domains (~200 manually verified samples):
| Metric | Value |
|---|---|
| Precision | 97.9% |
| Recall | 92.0% |
| F1 | 94.8% |
## How It Works (30-Second Version)
**Agent says:** "There are 3 employees on leave today."

**Tool returned:**

```json
[
  { "name": "Sarah", "status": "on_leave" },
  { "name": "Michael", "status": "on_leave" }
]
```

**Stozer reports:**

```
✗ grounding.data_ignored — Response says 3, evidence shows 2
→ Score: 0.5
→ Evidence: count mismatch (claimed: 3, actual: 2)
```
No API calls. Under 50ms per trace. Deterministic.
## Try It
```shell
npm install stozer-ai
```
```typescript
import { StozerClient, TraceBuilder } from 'stozer-ai';

const client = new StozerClient();

const trace = new TraceBuilder({ traceId: 'test-001' })
  .addUserInput('How many employees are on leave?')
  .addToolCall('getLeaveRecords', { date: '2024-03-15' })
  .addToolOutput('getLeaveRecords', [
    { name: 'Sarah Johnson', status: 'on_leave' },
    { name: 'Michael Chen', status: 'on_leave' },
  ])
  .addFinalResponse('There are 3 employees on leave today.')
  .build();

const result = await client.evaluate(trace);
console.log(result.report.detectedFailures);
```
Free tier at app.stozer.dev. Documentation and examples on GitHub.
→ Run your first trace in under 60 seconds.
## What We Call This
We call this approach deterministic grounding validation.
It replaces LLM-as-a-judge in systems where ground truth is available — tool outputs, API responses, database records, retrieved documents. Anywhere the source of truth is structured, deterministic checks beat probabilistic ones.
Stozer is a deterministic grounding validation engine. No LLM calls. Just rules, evidence, and math.