Early access — npm package is live

Your AI agent is making things up.
Stozer catches it in 50ms.

Grounding validation for tool-calling AI agents and RAG pipelines. Deterministic rules. No LLM judge. Free tier available.

Get Early Access
npm install stozer-ai
TypeScript · Python (soon)
OpenAI · Anthropic · Gemini

The problem

AI doesn't crash.
It confidently returns wrong answers.

Your agent queries a database, gets balance: 2450, and tells the user $2,540. That's a grounding failure: the correct answer was already in the trace.

TOOL OUTPUT — source of truth
{
  "name": "Emily Carter",
  "balance": 2450,
  "status": "active",
  "department": "D01"
}
AGENT RESPONSE — sent to user

"Emily's balance is $2,540 and her account is active."

FAIL numeric_mismatch — balance: expected 2450, found 2540
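
To make the failure above concrete, here is a minimal sketch of a numeric grounding check. It is not Stozer's implementation: `findNumericMismatches` and its heuristic (take the first number that follows a field name in the response and compare it with the tool output) are assumptions made for illustration only.

```typescript
type ToolOutput = Record<string, string | number>;

interface Mismatch {
  field: string;
  expected: number;
  found: number;
}

// Hypothetical helper (not part of stozer-ai): for each numeric field in the
// tool output, find the first number after the field name in the response
// and compare it with the source value.
function findNumericMismatches(source: ToolOutput, response: string): Mismatch[] {
  const mismatches: Mismatch[] = [];
  const lower = response.toLowerCase();
  for (const [field, expected] of Object.entries(source)) {
    if (typeof expected !== "number") continue;
    const at = lower.indexOf(field.toLowerCase());
    if (at === -1) continue; // field not mentioned, nothing to check
    // First number after the field name, allowing "$" and thousands separators.
    const match = /\$?(\d[\d,]*(?:\.\d+)?)/.exec(response.slice(at + field.length));
    if (!match) continue;
    const found = Number(match[1].replace(/,/g, ""));
    if (found !== expected) mismatches.push({ field, expected, found });
  }
  return mismatches;
}

const toolOutput: ToolOutput = {
  name: "Emily Carter",
  balance: 2450,
  status: "active",
  department: "D01",
};
const agentResponse = "Emily's balance is $2,540 and her account is active.";

for (const m of findNumericMismatches(toolOutput, agentResponse)) {
  console.log(`FAIL numeric_mismatch: ${m.field}: expected ${m.expected}, found ${m.found}`);
}
```

A production rule set would need to handle units, rounding, and numbers that appear before the field name; this sketch only shows why the check needs no LLM.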

How it works

Three steps. Zero infrastructure.

Drop in a few lines of code and get a verdict in milliseconds.

1. Capture

Send the agent trace: tool calls, retrieved context, and the final response.

2. Validate

50+ deterministic rules check every claim against the ground truth. The deterministic core uses no LLM judge and makes zero LLM API calls.

3. Act

Get a pass/fail verdict with exact failure reasons. Block, warn, or log: you choose.

agent.ts
import { StozerClient, TraceBuilder } from 'stozer-ai';

const stozer = new StozerClient({ apiKey: 'stozer_xxx' });

// Build the trace as your agent runs
const trace = new TraceBuilder()
  .addToolCall('getUser', { userId: 'U-42' })
  .addToolOutput('getUser', {
    name: 'Emily Carter',
    balance: 2450,
    status: 'active'
  })
  .addFinalResponse(
    "Emily's balance is $2,450 and her account is active."
  )
  .build();

// One call — get the verdict
const { report } = await stozer.evaluate(trace);

console.log(report.groundingScore);  // 1.0 — all claims grounded ✓
≈99.3%
F1 Score
HaluEval benchmark (derived from precision 99.98% and recall 98.6%)
<50ms
Latency
per trace evaluation
0
LLM Judge Calls
deterministic core
50+
Detection Rules
across 4 categories

Detection

Six types of grounding failures.
All caught deterministically.

Every claim in the response is extracted, matched to source data, and verified — without an LLM judge.

Numeric Mismatches

Prices, dates, quantities, percentages — any number that drifts from the source data.

Entity Substitution

Wrong name, wrong company, wrong product. Cross-contamination between records.

Unsupported Claims

Statements with no basis in retrieved context. Fabricated policies, invented features.

Status & State Errors

Order marked "shipped" when it's "processing". Account shown "active" when suspended.

Missing Qualifications

Omitted disclaimers, dropped conditions, ignored caveats from the source material.

Temporal Errors

Outdated information presented as current. Wrong dates, expired offers, stale data.
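
As an illustration of how one of these checks can run without an LLM, here is a minimal sketch of a status and state check. The status vocabulary, the `checkStatus` helper, and the `status_mismatch` code are hypothetical names chosen for this example, not Stozer's actual rules or failure codes.

```typescript
// Hypothetical status vocabulary; a real deployment would derive this from
// the schema of the tool being validated.
const KNOWN_STATUSES = ["active", "suspended", "processing", "shipped", "delivered"] as const;
type Status = (typeof KNOWN_STATUSES)[number];

interface StatusFinding {
  code: "status_mismatch"; // assumed failure code, for illustration only
  expected: Status;
  found: Status;
}

// Flags every known status word in the response that differs from the source.
function checkStatus(sourceStatus: Status, response: string): StatusFinding[] {
  const findings: StatusFinding[] = [];
  for (const status of KNOWN_STATUSES) {
    // Whole-word match, so "inactive" does not count as "active".
    const mentioned = new RegExp(`\\b${status}\\b`, "i").test(response);
    if (mentioned && status !== sourceStatus) {
      findings.push({ code: "status_mismatch", expected: sourceStatus, found: status });
    }
  }
  return findings;
}

// The source says "processing" but the agent told the user "shipped".
const statusFindings = checkStatus("processing", "Good news: your order has shipped.");
console.log(statusFindings);
```

Because the comparison is a lookup against the tool output, the same input always produces the same finding.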

Why Stozer

Not LLM-as-a-judge.
Deterministic-first.

On 30–70% of traces (depending on structure), Stozer closes the verdict before any LLM is called. When an LLM is needed, it gets a focused batch call — pre-filtered claims with verified anchors, not an open-ended judgment.

| | LLM-as-a-Judge | Stozer |
|---|---|---|
| Structured data traces | Full LLM call every time | Deterministic, zero LLM cost |
| Ambiguous edge cases | Full LLM call every time | One focused batch call |
| Verdict reliability | Non-deterministic | Deterministic where provable |
| Hallucination risk | Judge can hallucinate | Only where evidence is ambiguous |
| Explainability | Black box score | Exact failure code + evidence |
| Scalability | Rate-limited by provider | 10K+ evals/sec |
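
The deterministic-first idea can be sketched as a small routing function. This is an assumption-laden illustration, not Stozer's internals: the `Claim` shape, the three verdict labels, and the `route` function are hypothetical.

```typescript
type ClaimVerdict = "grounded" | "ungrounded" | "ambiguous";

interface Claim {
  text: string;
  // Result of the deterministic pass; "ambiguous" means the rules could not decide.
  verdict: ClaimVerdict;
}

interface RoutingResult {
  closedDeterministically: boolean;
  pass: boolean | null; // null until the LLM batch has been resolved
  llmBatch: Claim[];    // claims that still need one focused LLM check
}

function route(claims: Claim[]): RoutingResult {
  // Any provably ungrounded claim fails the trace without an LLM call.
  if (claims.some((c) => c.verdict === "ungrounded")) {
    return { closedDeterministically: true, pass: false, llmBatch: [] };
  }
  const llmBatch = claims.filter((c) => c.verdict === "ambiguous");
  if (llmBatch.length === 0) {
    // Every claim was verified by rules alone.
    return { closedDeterministically: true, pass: true, llmBatch: [] };
  }
  // Only the undecided claims are sent on, in a single batch.
  return { closedDeterministically: false, pass: null, llmBatch };
}

const structuredTrace: Claim[] = [
  { text: "balance is $2,450", verdict: "grounded" },
  { text: "account is active", verdict: "grounded" },
];
const mixedTrace: Claim[] = [
  { text: "balance is $2,450", verdict: "grounded" },
  { text: "she is a long-standing customer", verdict: "ambiguous" },
];

console.log(route(structuredTrace).closedDeterministically); // true
console.log(route(mixedTrace).llmBatch.length);              // 1
```

The design choice shown here is that the LLM never sees claims the rules have already settled, which keeps its input small and its task narrow.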

Adoption

Start with debug.
Graduate to blocking.

Five modes let you adopt incrementally.

Debug

Explore historical traces

CI/CD

Fail builds on regressions

Observe

Silent production monitor

Warn

Alert teams on failures

Block

Stop bad responses
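
One way to picture the five modes is as a mapping from a failed verdict to an action. The sketch below is hypothetical: the mode identifiers, the `Outcome` shape, and `applyMode` are illustrative names, not the stozer-ai configuration API.

```typescript
type Mode = "debug" | "ci" | "observe" | "warn" | "block";

interface Outcome {
  deliverResponse: boolean; // whether the agent's answer reaches the user
  action: string;           // side effect taken for this evaluation
}

// Hypothetical mapping from adoption mode to behavior on a failed verdict.
function applyMode(mode: Mode, passed: boolean): Outcome {
  if (passed) return { deliverResponse: true, action: "none" };
  switch (mode) {
    case "debug":
      return { deliverResponse: true, action: "record for offline review" };
    case "ci":
      return { deliverResponse: true, action: "fail the build" };
    case "observe":
      return { deliverResponse: true, action: "log silently" };
    case "warn":
      return { deliverResponse: true, action: "alert the team" };
    case "block":
      return { deliverResponse: false, action: "withhold the response" };
  }
}

console.log(applyMode("warn", false));
console.log(applyMode("block", false));
```

In this sketch only the last mode changes what the user sees, which is what makes incremental adoption low-risk.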

Built for regulated, high-stakes industries.

Pre-configured rules for domain-specific grounding — medical records, financial transactions, legal entities, and more.

Healthcare · Finance · Insurance · Legal · Government · HR · E-commerce · Retail · Logistics · Education · Telecom · Manufacturing · Support

11 languages — EN, SR, ES, FR, PT, DE, IT, RU, HI, AR, BN

Benchmarks

Don't take our word for it.
Check the benchmarks.

Reproducible results on public and production datasets.

Read the full benchmark report
HALUEVAL QA
F1 ≈ 99.3% (derived from the precision and recall shown here)
Precision 99.98% · Recall 98.6%

16,662 question-answer pairs. Near-perfect detection of fabricated answers.

FAITHBENCH
F1 ≈ 74.0% (derived from the precision and recall shown here)
Precision 59.9% · Recall 96.8%

750 expert-annotated LLM summaries. Harder task — free-form text.

PRODUCTION TRACES
F1 ≈ 94.9% (derived from the precision and recall shown here)
Precision 97.9% · Recall 92.0%

Real customer-support agent traces. 50 rules, 4 failure categories.

All benchmarks reproducible. npm package available.
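
F1 is the harmonic mean of precision and recall, so it can be recomputed from the precision and recall figures quoted above. The short sketch below does that arithmetic; the results are derived from the rounded published figures, so they may differ from the full benchmark report in the last digit.

```typescript
// F1 is the harmonic mean of precision and recall (both as fractions in [0, 1]).
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

const benchmarks = [
  { name: "HaluEval QA", precision: 0.9998, recall: 0.986 },
  { name: "FaithBench", precision: 0.599, recall: 0.968 },
  { name: "Production traces", precision: 0.979, recall: 0.92 },
];

for (const b of benchmarks) {
  console.log(`${b.name}: F1 = ${(100 * f1(b.precision, b.recall)).toFixed(1)}%`);
}
// HaluEval QA: F1 = 99.3%
// FaithBench: F1 = 74.0%
// Production traces: F1 = 94.9%
```

The FaithBench figure shows why both numbers matter: high recall with lower precision still pulls F1 down.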

Be among the first
to ship AI you can trust.

Stozer is in early access. The npm package is live. The hosted platform is coming soon.

Priority dashboard access · Free tier (full quality) · Direct line to founders

Or start now: npm install stozer-ai