Early access — npm package is live

Your AI agent is making things up.
Stozer catches it in 50ms.

Name: Stozer
Author: Stozer

Grounding validation for tool-calling AI agents and RAG pipelines.
Deterministic rules. No LLM judge. Free tier available.

Get Early Access

npm install stozer-ai

TypeScript · Python (soon) · OpenAI · Anthropic · Gemini

The problem

AI doesn't crash.
It confidently returns wrong answers.

Your agent calls a database and gets balance: $2,450 but tells the user $2,540 — that's a grounding failure. The correct answer was already in the trace.

TOOL OUTPUT — source of truth

{
  "name": "Emily Carter",
  "balance": 2450,
  "status": "active",
  "department": "D01"
}

AGENT RESPONSE — sent to user

"Emily's balance is $2,540 and her account is active."

FAIL numeric_mismatch — balance: expected 2450, found 2540
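
To make the failure above concrete, here is a minimal sketch of one deterministic numeric check. It is an illustration only, not Stozer's rule engine: `extractNumbers`, `checkNumericFields`, and the "first number after the field name" heuristic are all assumptions made for this example.

```typescript
// Hypothetical sketch of a numeric grounding check (not the stozer-ai implementation).
type NumericFinding = { field: string; expected: number; found: number };

// Pull numbers out of free text, dropping thousands separators ("2,540" becomes 2540).
function extractNumbers(text: string): number[] {
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/,/g, "")));
}

// For each numeric field in the tool output, look for the field name in the
// response and compare the first number that follows it with the source value.
function checkNumericFields(
  toolOutput: Record<string, unknown>,
  response: string
): NumericFinding[] {
  const findings: NumericFinding[] = [];
  const lower = response.toLowerCase();
  for (const [field, expected] of Object.entries(toolOutput)) {
    if (typeof expected !== "number") continue; // only numeric fields are checked here
    const at = lower.indexOf(field.toLowerCase());
    if (at === -1) continue; // field not mentioned, nothing to verify
    const [found] = extractNumbers(response.slice(at + field.length));
    if (found !== undefined && found !== expected) {
      findings.push({ field, expected, found });
    }
  }
  return findings;
}

const findings = checkNumericFields(
  { name: "Emily Carter", balance: 2450, status: "active", department: "D01" },
  "Emily's balance is $2,540 and her account is active."
);
// findings[0] is { field: "balance", expected: 2450, found: 2540 }
console.log(findings);
```

A real rule would need more than this heuristic (units, dates, rounding, several numbers per sentence), but the idea is the same: the source of truth is already in the trace, so the comparison needs no model.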

How it works

Three steps. Zero infrastructure.

Drop in a few lines of code and get a verdict in milliseconds.

1. Capture
Send the agent trace: tool calls, retrieved context, and the final response.

2. Validate
50+ deterministic rules check every claim against the ground truth. No LLM. Zero API calls.

3. Act
Get a pass/fail verdict with exact failure reasons. Block, warn, or log: you choose.

agent.ts

import { StozerClient, TraceBuilder } from 'stozer-ai';

const stozer = new StozerClient({ apiKey: 'stozer_xxx' });

// Build the trace as your agent runs
const trace = new TraceBuilder()
  .addToolCall('getUser', { userId: 'U-42' })
  .addToolOutput('getUser', {
    name: 'Emily Carter',
    balance: 2450,
    status: 'active'
  })
  .addFinalResponse(
    "Emily's balance is $2,450 and her account is active."
  )
  .build();

// One call — get the verdict
const { report } = await stozer.evaluate(trace);

console.log(report.groundingScore);  // 1.0 — all claims grounded ✓

≈99.3% (derived from the reported precision and recall)

F1 Score

HaluEval benchmark

<50ms

Latency

per trace evaluation

0

LLM Judge Calls

deterministic core

50+

Detection Rules

across 4 categories

Detection

Six types of grounding failures.
All caught deterministically.

Every claim in the response is extracted, matched to source data, and verified — without an LLM judge.

Numeric Mismatches

Prices, dates, quantities, percentages — any number that drifts from the source data.

Entity Substitution

Wrong name, wrong company, wrong product. Cross-contamination between records.

Unsupported Claims

Statements with no basis in retrieved context. Fabricated policies, invented features.

Status & State Errors

Order marked "shipped" when it's "processing". Account shown "active" when suspended.

Missing Qualifications

Omitted disclaimers, dropped conditions, ignored caveats from the source material.

Temporal Errors

Outdated information presented as current. Wrong dates, expired offers, stale data.
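
As one illustration of how a category like status and state errors could be checked without a model, here is a small sketch. The status vocabulary, the `checkStatus` name, and the word-level matching are assumptions for this example, not Stozer's actual rules.

```typescript
// Hypothetical sketch of a status/state check (not the stozer-ai implementation).
// The vocabulary below is illustrative; a real deployment would load its own.
const STATUS_WORDS = ["active", "suspended", "processing", "shipped", "delivered", "cancelled"];

type StatusFinding = { expected: string; found: string };

// Flag any known status word in the response that differs from the source status.
// Matching is done on whole words, so "inactive" is not mistaken for "active".
function checkStatus(sourceStatus: string, response: string): StatusFinding[] {
  const expected = sourceStatus.toLowerCase();
  const words = new Set(response.toLowerCase().match(/[a-z]+/g) ?? []);
  return STATUS_WORDS
    .filter((status) => status !== expected && words.has(status))
    .map((found) => ({ expected, found }));
}

// Source says "processing", response claims "shipped".
console.log(checkStatus("processing", "Good news, your order has shipped."));
// Source and response agree, so the list is empty.
console.log(checkStatus("active", "Her account is active."));
```

This sketch ignores negation ("has not shipped yet") and multiple entities in one response, which is where evidence becomes ambiguous and a deterministic rule alone would not be enough.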

Why Stozer

Not LLM-as-a-judge.
Deterministic-first.

On 30–70% of traces (depending on structure), Stozer closes the verdict before any LLM is called. When an LLM is needed, it gets a focused batch call — pre-filtered claims with verified anchors, not an open-ended judgment.
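
The deterministic-first flow described above can be sketched as a simple triage step. Everything here is hypothetical (`Claim`, `Rule`, `triage`, and the three-valued verdict are names invented for the example); it only shows the shape of the idea: rules close what they can, and only the leftovers are collected for a single batch.

```typescript
// Hypothetical sketch of deterministic-first triage (not the stozer-ai implementation).
type Claim = { id: string; text: string };
type Verdict = "pass" | "fail" | "unresolved";
type Rule = (claim: Claim) => Verdict;

// Return the first definite verdict any rule reaches, or "unresolved" if none does.
function firstDefinite(claim: Claim, rules: Rule[]): Verdict {
  for (const rule of rules) {
    const verdict = rule(claim);
    if (verdict !== "unresolved") return verdict;
  }
  return "unresolved";
}

// Split claims into those closed by rules and those left for one batched LLM call.
function triage(claims: Claim[], rules: Rule[]) {
  const resolved: { claim: Claim; verdict: "pass" | "fail" }[] = [];
  const needsLlm: Claim[] = [];
  for (const claim of claims) {
    const verdict = firstDefinite(claim, rules);
    if (verdict === "unresolved") needsLlm.push(claim);
    else resolved.push({ claim, verdict });
  }
  return { resolved, needsLlm };
}

// Example rule: a claim containing a number passes only if that number is in the source.
const sourceNumbers = new Set([2450]);
const numericRule: Rule = (claim) => {
  const match = claim.text.match(/\d+/);
  if (!match) return "unresolved";
  return sourceNumbers.has(Number(match[0])) ? "pass" : "fail";
};

const result = triage(
  [
    { id: "c1", text: "The balance is 2540" },
    { id: "c2", text: "The customer seems satisfied" },
  ],
  [numericRule]
);
console.log(result.resolved.length, result.needsLlm.length); // 1 1
```

When `needsLlm` is empty, the verdict is closed with no model call at all; otherwise the remaining claims would go out together in one request rather than one call per claim.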

	LLM-as-a-Judge	Stozer
Structured data traces	Full LLM call every time	Deterministic — zero LLM cost
Ambiguous edge cases	Full LLM call every time	One focused batch call
Verdict reliability	Non-deterministic	Deterministic where provable
Hallucination risk	Judge can hallucinate	Only where evidence is ambiguous
Explainability	Black box score	Exact failure code + evidence
Scalability	Rate-limited by provider	10K+ evals/sec

Adoption

Start with debug.
Graduate to blocking.

Five modes let you adopt incrementally.

Debug: Explore historical traces
CI/CD: Fail builds on regressions
Observe: Silent production monitor
Warn: Alert teams on failures
Block: Stop bad responses
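
One way to picture incremental adoption is a single switch that decides what a failed verdict does. This is a sketch under assumptions: the `Mode` and `Action` names and the mapping below are invented for illustration and are not configuration options documented by Stozer.

```typescript
// Hypothetical sketch of mode-based handling (names are assumptions, not stozer-ai options).
type Mode = "debug" | "ci" | "observe" | "warn" | "block";
type Action = "none" | "log" | "fail_build" | "alert" | "block_response";

// What a failed verdict triggers in each mode, from least to most intrusive.
const ON_FAILURE: Record<Mode, Action> = {
  debug: "log",
  observe: "log",
  ci: "fail_build",
  warn: "alert",
  block: "block_response",
};

// A passing verdict never triggers anything; a failing one follows the mode.
function actionFor(mode: Mode, passed: boolean): Action {
  return passed ? "none" : ON_FAILURE[mode];
}

console.log(actionFor("observe", false)); // "log"
console.log(actionFor("block", false));   // "block_response"
console.log(actionFor("block", true));    // "none"
```

Keeping the mapping in one table means moving from observing to blocking is a one-word change rather than a rewrite of the agent.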

Built for regulated, high-stakes industries.

Pre-configured rules for domain-specific grounding — medical records, financial transactions, legal entities, and more.

Healthcare · Finance · Insurance · Legal · Government · HR · E-commerce · Retail · Logistics · Education · Telecom · Manufacturing · Support

11 languages — EN, SR, ES, FR, PT, DE, IT, RU, HI, AR, BN

Benchmarks

Don't take our word for it.
Check the benchmarks.

Reproducible results on public and production datasets.

Read the full benchmark report

HALUEVAL QA

F1 ≈ 99.3% (derived from the precision and recall shown)

Precision 99.98% · Recall 98.6%

16,662 question-answer pairs. Near-perfect detection of fabricated answers.

FAITHBENCH

F1 ≈ 74.0% (derived from the precision and recall shown)

Precision 59.9% · Recall 96.8%

750 expert-annotated LLM summaries. Harder task — free-form text.

PRODUCTION TRACES

F1 ≈ 94.9% (derived from the precision and recall shown)

Precision 97.9% · Recall 92.0%

Real customer-support agent traces. 50 rules, 4 failure categories.
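
For readers who want to sanity-check the numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the figures on this page. The snippet below is a plain arithmetic sketch, assuming the standard F1 definition; the `f1` helper and the `reported` array are names made up for the example, and the results may differ slightly from the official report because of rounding.

```typescript
// F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0; // avoid division by zero
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall as stated in the benchmark cards above.
const reported = [
  { name: "HaluEval QA", precision: 0.9998, recall: 0.986 },
  { name: "FaithBench", precision: 0.599, recall: 0.968 },
  { name: "Production traces", precision: 0.979, recall: 0.92 },
];

for (const b of reported) {
  console.log(`${b.name}: F1 ≈ ${(100 * f1(b.precision, b.recall)).toFixed(1)}%`);
}
// HaluEval QA: F1 ≈ 99.3%
// FaithBench: F1 ≈ 74.0%
// Production traces: F1 ≈ 94.9%
```

The FaithBench figure shows why F1 is worth reading next to its parts: recall stays high on free-form summaries, while precision is what pulls the combined score down.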

All benchmarks reproducible. npm package available.

Be among the first to ship AI you can trust.

Stozer is in early access. The npm package is live. The hosted platform is coming soon.

Priority dashboard access · Free tier with full quality · Direct line to founders

Or start now: npm install stozer-ai
