An agent is only as good as its evaluations. Without an evaluation flywheel, drift goes undetected, hallucinations reach production, and quality cannot be verified. AI Eval Service wraps the eval tooling ecosystem (open-source projects such as Ragas, Phoenix, and Langfuse, alongside LangSmith) into a single delivery pattern.

The four stages

AI Eval Service implements a closed-loop evaluation flywheel with four stages, each engineered to be auditable and repeatable.

01 Trace
Every agent invocation, tool call, retrieval, and LLM response logged with full payload, latency, and cost. OpenTelemetry-compatible tracing through Langfuse or LangSmith depending on the customer's stack.

02 Verify
Outputs checked against ground-truth datasets curated with the customer's subject-matter experts. Both pointwise verification (single-output correctness) and pairwise (preference comparison) supported.

03 Score
Quantitative scoring across the metrics that matter for the use case: faithfulness, answer relevance, context precision/recall, tool-call effectiveness, safety, and latency.

04 Retrain
Findings feed back into prompt refinement, retrieval tuning, fine-tuning datasets, or agent redesign. The loop closes; the next iteration starts on the next deploy.
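
As a rough illustration of what the Trace stage captures, the sketch below models one run as a list of events carrying payload, latency, and cost. The class and field names (`TraceEvent`, `Trace`, `kind`, `cost_usd`) are hypothetical and are not the Langfuse or LangSmith schema; a real deployment would use the tracing SDK of the customer's stack.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceEvent:
    """One logged step: an agent invocation, tool call, retrieval, or LLM response."""
    kind: str                 # hypothetical labels: "agent", "tool", "retrieval", "llm"
    payload: dict[str, Any]   # full input/output of the step
    latency_ms: float
    cost_usd: float = 0.0


@dataclass
class Trace:
    """All events recorded for a single agent run (illustrative only)."""
    run_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def log(self, kind, payload, latency_ms, cost_usd=0.0):
        """Append one event to the run."""
        self.events.append(TraceEvent(kind, payload, latency_ms, cost_usd))

    def total_latency_ms(self):
        return sum(e.latency_ms for e in self.events)

    def total_cost_usd(self):
        return sum(e.cost_usd for e in self.events)


trace = Trace(run_id="run-1")
trace.log("retrieval", {"query": "refund policy"}, latency_ms=120.0)
trace.log("llm", {"prompt": "...", "response": "..."}, latency_ms=230.0, cost_usd=0.002)
```

Keeping payload, latency, and cost on the same record is what lets the later Verify and Score stages work from the trace alone.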

The metrics we measure

AI Eval Service scoring aligns with industry-standard frameworks (notably Ragas for RAG and OpenAI Evals patterns for general LLM tasks). The metric set is configured per use case.

For RAG & retrieval

Faithfulness — does the answer reflect what the retrieved sources actually say?
Answer relevance — does the answer address the question asked?
Context precision — were the retrieved chunks actually relevant?
Context recall — did retrieval find what was needed?
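
To make the two retrieval metrics concrete, here is a simplified sketch that assumes each golden-set question comes with a labelled list of relevant chunk ids. Frameworks such as Ragas typically estimate these scores with an LLM judge rather than exact id matching, so treat this as an illustration of the definitions, not of any library's implementation. The function names are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the needed chunks that retrieval actually found."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c1", "c3", "c9"]
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 needed were found
```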

For agents & tool use

Tool-call effectiveness — did the agent pick the right tool with the right arguments?
Trajectory correctness — did the multi-step path reach the goal?
Step efficiency — how many steps vs. the minimum needed?
Recovery on failure — did the agent handle tool errors gracefully?
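
Two of the agent metrics above reduce to simple ratios once a reference trajectory exists. The sketch below assumes each tool call is a dict with `name` and `args` keys and that the golden set records the expected calls and the minimum number of steps; both assumptions, and all function names, are hypothetical.

```python
def tool_call_matches(actual, expected):
    """True when both the tool name and its arguments match the reference call."""
    return actual["name"] == expected["name"] and actual["args"] == expected["args"]


def tool_call_effectiveness(actual_calls, expected_calls):
    """Share of expected calls matched, compared position by position."""
    if not expected_calls:
        return 1.0
    matched = sum(
        tool_call_matches(a, e) for a, e in zip(actual_calls, expected_calls)
    )
    return matched / len(expected_calls)


def step_efficiency(steps_taken, min_steps):
    """Minimum steps needed divided by steps taken, capped at 1.0."""
    if steps_taken <= 0:
        raise ValueError("steps_taken must be positive")
    return min(1.0, min_steps / steps_taken)


expected = [
    {"name": "search", "args": {"q": "invoice 42"}},
    {"name": "fetch", "args": {"id": 1}},
]
actual = [
    {"name": "search", "args": {"q": "invoice 42"}},
    {"name": "fetch", "args": {"id": 2}},   # right tool, wrong argument
    {"name": "fetch", "args": {"id": 1}},   # extra corrective step
]
effectiveness = tool_call_effectiveness(actual, expected)
efficiency = step_efficiency(steps_taken=len(actual), min_steps=len(expected))
```

Positional matching is the strictest choice; a production scorer might instead allow reordered or retried calls, which is a design decision to settle per use case.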

For safety & trust

Hallucination rate — NLI-based fact-checking against sources.
Toxicity / PII leak — output classifiers (Detoxify, Presidio).
Prompt-injection robustness — jailbreak attempt detection (per OWASP LLM01).
Refusal correctness — does the agent refuse when it should, and not when it shouldn't?
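
Refusal correctness has two failure modes, and it helps to count them separately. The sketch below assumes each test case is labelled with whether the agent should have refused and whether it did; the function name and result keys are hypothetical.

```python
def refusal_correctness(cases):
    """Score (should_refuse, did_refuse) pairs.

    Over-refusals are benign requests that were refused;
    under-refusals are harmful requests that were answered.
    """
    cases = list(cases)
    over = sum(1 for should, did in cases if did and not should)
    under = sum(1 for should, did in cases if should and not did)
    correct = len(cases) - over - under
    return {
        "accuracy": correct / len(cases) if cases else 0.0,
        "over_refusals": over,
        "under_refusals": under,
    }


labelled_cases = [
    (True, True),    # should refuse, refused
    (True, False),   # should refuse, answered
    (False, False),  # should answer, answered
    (False, True),   # should answer, refused
]
report = refusal_correctness(labelled_cases)
```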

For operations

Latency — p50, p95, p99 per workflow stage.
Cost-per-query — tokens, embeddings, tool calls, retrieval, generation.
Drift — embedding-distribution shift, response-distribution shift over time.
Success rate — percentage of agent runs that complete without errors.
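
The latency percentiles and success rate above can be computed from trace data with the standard library alone. This sketch uses the nearest-rank percentile definition, which is one of several common conventions; the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """p50, p95, and p99 for one workflow stage."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def success_rate(run_outcomes):
    """Share of runs that completed without errors (True means success)."""
    return sum(run_outcomes) / len(run_outcomes) if run_outcomes else 0.0


summary = latency_summary(list(range(1, 101)))   # latencies of 1..100 ms
rate = success_rate([True, True, False, True])
```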

The stack

AI Eval Service is BYOT (bring-your-own-tools) by design. We integrate with the tools you already use and recommend defaults where you have none.

Tracing & observability — Langfuse, LangSmith, Phoenix (Arize). OpenTelemetry-compatible.
Evaluation frameworks — Ragas for RAG metrics, DeepEval for unit-test-style assertions, TruLens for agent feedback functions, OpenAI Evals for benchmarking.
LLM-as-Judge — Claude or GPT-4 as judge with rubric prompts; outputs validated against human-labelled samples for judge calibration.
Human-in-the-loop labelling — Argilla or LangSmith annotations; subject-matter experts grade a sample, the labelled set becomes the ground truth.
Custom scorers — we write domain-specific scorers in Python (e.g., for regulatory compliance, financial accuracy, medical safety).
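
Judge calibration, as mentioned for LLM-as-Judge above, comes down to measuring how often the judge agrees with human labels beyond what chance would produce. One common statistic for this is Cohen's kappa; the source does not say which statistic is used, so the choice here is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human labels on the same samples."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
```

A kappa near 1 suggests the judge can stand in for human graders on this rubric; a low value suggests the rubric prompt needs revising before its scores are trusted.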

Where it fits

AI Eval Service sits across the AI SDLC, not at the end of it. Evaluations run at multiple stages.

Offline eval (development) — before any deploy, run the candidate version against the golden set.
CI/CD gates — pull requests that touch prompts, retrieval, or model configs run the eval suite. A score regression fails the build.
A/B online eval — in production, route a percentage of traffic to a candidate; compare scores side-by-side.
Continuous monitoring — live traffic is sampled, scored, and alerted on drift or regression.
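
The CI/CD gate described above can be sketched as a comparison between baseline and candidate scores that yields a non-zero exit code on regression. The tolerance value, metric names, and function names are hypothetical, and the sketch assumes every metric is higher-is-better.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return metrics where the candidate is missing or dropped by more than tolerance."""
    regressions = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is None or base_score - cand_score > tolerance:
            regressions[metric] = (base_score, cand_score)
    return regressions


def gate_exit_code(baseline, candidate, tolerance=0.02):
    """0 when the build may proceed, 1 when a score regression should fail it."""
    return 1 if find_regressions(baseline, candidate, tolerance) else 0


baseline_scores = {"faithfulness": 0.90, "answer_relevance": 0.85}
candidate_scores = {"faithfulness": 0.80, "answer_relevance": 0.86}
regressions = find_regressions(baseline_scores, candidate_scores)
exit_code = gate_exit_code(baseline_scores, candidate_scores)
```

In a pipeline, the script would end with `sys.exit(exit_code)` so that the CI runner marks the build as failed.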

This is the same flywheel pattern documented on our Agent Evaluations page; AI Eval Service is the productised, repeatable version of it.

AI Eval Service — The AI evaluation flywheel that catches drift before users do.