The four stages
AI Eval Service implements a closed-loop evaluation flywheel with four stages, each engineered to be auditable and repeatable.
- 01 Trace: Every agent invocation, tool call, retrieval, and LLM response is logged with full payload, latency, and cost. Tracing is OpenTelemetry-compatible and runs through Langfuse or LangSmith, depending on the customer's stack.
- 02 Verify: Outputs are checked against ground-truth datasets curated with the customer's subject-matter experts. Both pointwise verification (single-output correctness) and pairwise verification (preference comparison) are supported.
- 03 Score: Quantitative scoring across the metrics that matter for the use case: faithfulness, answer relevance, context precision/recall, tool-call effectiveness, safety, and latency.
- 04 Retrain: Findings feed back into prompt refinement, retrieval tuning, fine-tuning datasets, or agent redesign. The loop closes; the next iteration starts on the next deploy.
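To make the four stages concrete, here is a minimal Python sketch of one pass through the loop. It is an illustration under stated assumptions, not the service's implementation: `TraceRecord`, `run_iteration`, `toy_agent`, and the golden-set layout are hypothetical names, verification is reduced to a normalised exact match, and tracing is kept in memory rather than sent to Langfuse or LangSmith.

```python
import time
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """One logged invocation: payload plus latency and cost (hypothetical shape)."""
    input: str
    output: str
    latency_ms: float
    cost_usd: float


def run_iteration(agent, golden_set, cost_per_call_usd=0.0):
    """Run one pass of trace -> verify -> score and return the retrain inputs."""
    traces = []
    failures = []
    for case in golden_set:
        # Stage 01, Trace: record payload, latency, and cost for every call.
        start = time.perf_counter()
        output = agent(case["input"])
        latency_ms = (time.perf_counter() - start) * 1000.0
        traces.append(TraceRecord(case["input"], output, latency_ms, cost_per_call_usd))

        # Stage 02, Verify: pointwise check against the ground truth
        # (assumption: case-insensitive exact match is good enough here).
        if output.strip().lower() != case["expected"].strip().lower():
            failures.append(case)

    # Stage 03, Score: fraction of golden-set cases that passed.
    score = 1.0 - len(failures) / len(golden_set) if golden_set else 0.0

    # Stage 04, Retrain: the failing cases are what feeds the next iteration.
    return score, failures, traces


def toy_agent(question: str) -> str:
    """Stand-in for a real agent; one answer is deliberately wrong."""
    answers = {"capital of France?": "Paris", "2 + 2?": "5"}
    return answers.get(question, "")


golden_set = [
    {"input": "capital of France?", "expected": "paris"},
    {"input": "2 + 2?", "expected": "4"},
]
score, failures, traces = run_iteration(toy_agent, golden_set)
```

In a real engagement the verification step would usually be a richer check (an LLM judge or a pairwise comparison) and the traces would go to the customer's tracing backend.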
The metrics we measure
AI Eval Service scoring aligns with industry-standard frameworks (notably Ragas for RAG and OpenAI Evals patterns for general LLM tasks). The metric set is configured per use case.
For RAG & retrieval
- Faithfulness — does the answer reflect what the retrieved sources actually say?
- Answer relevance — does the answer address the question asked?
- Context precision — were the retrieved chunks actually relevant?
- Context recall — did retrieval find what was needed?
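The two retrieval metrics can be sketched with plain set arithmetic. This is a simplified illustration that assumes each golden-set question comes with human-labelled relevant chunk IDs; frameworks such as Ragas derive relevance from LLM judgments and use rank-aware variants, so treat the function names and formulas below as assumptions rather than the Ragas definitions.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the relevant chunks that retrieval managed to find."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    found = set(retrieved) & relevant
    return len(found) / len(relevant)


# Hypothetical example: four chunks retrieved, three chunks labelled relevant.
retrieved_ids = ["a", "b", "c", "d"]
relevant_ids = {"a", "c", "e"}
precision = context_precision(retrieved_ids, relevant_ids)
recall = context_recall(retrieved_ids, relevant_ids)
```

Here two of the four retrieved chunks are relevant, and two of the three relevant chunks were found.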
For agents & tool use
- Tool-call effectiveness — did the agent pick the right tool with the right arguments?
- Trajectory correctness — did the multi-step path reach the goal?
- Step efficiency — how many steps vs. the minimum needed?
- Recovery on failure — did the agent handle tool errors gracefully?
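As a rough sketch of how two of these agent metrics might be computed, the example below compares an agent's tool calls against an expected trajectory and relates steps taken to the minimum needed. `ToolCall`, `tool_call_accuracy`, and `step_efficiency` are hypothetical names, and position-by-position exact matching is an assumption; real trajectories often need looser matching.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation: which tool, and with which arguments."""
    name: str
    args: dict = field(default_factory=dict)


def tool_call_accuracy(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected calls matched, position by position, on name and args."""
    if not expected:
        return 1.0
    matches = sum(1 for a, e in zip(actual, expected) if a == e)
    return matches / len(expected)


def step_efficiency(steps_taken: int, min_steps: int) -> float:
    """Ratio of the minimum steps needed to the steps actually taken, capped at 1.0."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, min_steps / steps_taken)


# Hypothetical trajectory: right tools, but the second call has a wrong argument.
expected_calls = [ToolCall("search", {"q": "invoice 42"}), ToolCall("fetch", {"id": 1})]
actual_calls = [ToolCall("search", {"q": "invoice 42"}), ToolCall("fetch", {"id": 2})]
accuracy = tool_call_accuracy(actual_calls, expected_calls)
efficiency = step_efficiency(steps_taken=6, min_steps=3)
```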
For safety & trust
- Hallucination rate — NLI-based fact-checking against sources.
- Toxicity / PII leak — output classifiers (Detoxify, Presidio).
- Prompt-injection robustness — jailbreak attempt detection (per OWASP LLM01).
- Refusal correctness — does the agent refuse when it should, and not when it shouldn't?
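Refusal correctness is two-sided, so it helps to report under-refusal and over-refusal separately. The sketch below assumes each test case is labelled with whether the agent should have refused and whether it did; `refusal_report` and the tuple layout are hypothetical, not part of any named library.

```python
def refusal_report(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """Summarise refusal behaviour from (should_refuse, did_refuse) pairs."""
    should = [c for c in cases if c[0]]
    should_not = [c for c in cases if not c[0]]

    correct = sum(1 for s, d in cases if s == d)
    under = sum(1 for _, d in should if not d)   # answered when it should have refused
    over = sum(1 for _, d in should_not if d)    # refused a legitimate request

    return {
        "accuracy": correct / len(cases) if cases else 0.0,
        "under_refusal_rate": under / len(should) if should else 0.0,
        "over_refusal_rate": over / len(should_not) if should_not else 0.0,
    }


# Hypothetical labelled sample: two harmful prompts, three benign ones.
labelled = [
    (True, True),    # harmful, refused: correct
    (True, False),   # harmful, answered: under-refusal
    (False, False),  # benign, answered: correct
    (False, False),  # benign, answered: correct
    (False, True),   # benign, refused: over-refusal
]
report = refusal_report(labelled)
```

Reporting both rates guards against a model that scores well on safety simply by refusing everything.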
For operations
- Latency — p50, p95, p99 per workflow stage.
- Cost-per-query — tokens, embeddings, tool calls, retrieval, generation.
- Drift — embedding-distribution shift, response-distribution shift over time.
- Success rate — percentage of agent runs that complete without errors.
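For the operational metrics, a minimal sketch of latency percentiles and success rate might look like the following. It uses the nearest-rank percentile definition, which is one of several common conventions and is an assumption here; `percentile` and `success_rate` are hypothetical helper names.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def success_rate(run_errors: list[bool]) -> float:
    """Percentage of runs that completed without errors (True means the run errored)."""
    if not run_errors:
        return 0.0
    ok = sum(1 for errored in run_errors if not errored)
    return 100.0 * ok / len(run_errors)


# Hypothetical sample: latencies of 1..100 ms for one workflow stage.
latencies_ms = [float(ms) for ms in range(1, 101)]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
rate = success_rate([False, False, False, True])
```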
The stack
AI Eval Service is BYOT (bring-your-own-tools) by design. We integrate with whatever you already use, and bring opinions where there are none.
- Tracing & observability — Langfuse, LangSmith, Phoenix (Arize). OpenTelemetry-compatible.
- Evaluation frameworks — Ragas for RAG metrics, DeepEval for unit-test-style assertions, TruLens for agent feedback functions, OpenAI Evals for benchmarking.
- LLM-as-Judge — Claude or GPT-4 as judge with rubric prompts; outputs validated against human-labelled samples for judge calibration.
- Human-in-the-loop labelling — Argilla or LangSmith annotations; subject-matter experts grade a sample, the labelled set becomes the ground truth.
- Custom scorers — we write domain-specific scorers in Python (e.g., for regulatory compliance, financial accuracy, medical safety).
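Judge calibration can be illustrated by comparing judge verdicts with human labels on the same sample. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance; choosing kappa as the calibration statistic is our assumption for this example, and the function names are hypothetical.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of samples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum((judge_counts[l] / n) * (human_counts[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration sample of four graded outputs.
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
agreement = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

A judge with high raw agreement but low kappa is mostly agreeing by chance, which is a signal to revise the rubric prompt before trusting its scores.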
Where it fits
AI Eval Service sits across the AI SDLC, not at the end of it. Evaluations run at multiple stages.
- Offline eval (development) — before any deploy, run the candidate version against the golden set.
- CI/CD gates — pull requests that touch prompts, retrieval, or model configs run the eval suite. A score regression fails the build.
- A/B online eval — in production, route a percentage of traffic to a candidate; compare scores side-by-side.
- Continuous monitoring — live traffic is sampled, scored, and alerted on drift or regression.
This is the same flywheel pattern documented on our Agent Evaluations page; AI Eval Service is the productised, repeatable version of it.