The four stages
Eval@Core implements a closed-loop evaluation flywheel with four stages, each engineered to be auditable and repeatable.
- 01 Trace — Every agent invocation, tool call, retrieval, and LLM response is logged with full payload, latency, and cost. OpenTelemetry-compatible tracing through Langfuse or LangSmith, depending on the customer's stack (a minimal tracing sketch follows this list).
- 02 Verify — Outputs are checked against ground-truth datasets curated with the customer's subject-matter experts. Both pointwise verification (single-output correctness) and pairwise preference comparison are supported.
- 03 Score — Quantitative scoring across the metrics that matter for the use case: faithfulness, answer relevance, context precision/recall, tool-call effectiveness, safety, and latency.
- 04 Retrain — Findings feed back into prompt refinement, retrieval tuning, fine-tuning datasets, or agent redesign. The loop closes; the next iteration starts on the next deploy.
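To make the Trace stage concrete, here is a minimal sketch of OpenTelemetry-compatible instrumentation around a single LLM call. The span and attribute names, the stub model client, and the per-token prices are illustrative, not a fixed Eval@Core schema.

```python
# Illustrative sketch of the Trace stage: wrap one LLM call in an
# OpenTelemetry span and record payload, latency, and cost as attributes.
# Span/attribute names and prices are examples, not a fixed schema.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("eval-at-core-demo")


def call_llm(prompt: str) -> dict:
    """Stand-in for the real model client; returns text plus token counts."""
    return {"text": "stub answer", "input_tokens": 120, "output_tokens": 40}


def traced_llm_call(prompt: str) -> str:
    with tracer.start_as_current_span("agent.llm_call") as span:
        span.set_attribute("llm.prompt", prompt)
        start = time.perf_counter()
        result = call_llm(prompt)
        span.set_attribute("llm.response", result["text"])
        span.set_attribute("llm.latency_ms", (time.perf_counter() - start) * 1000)
        # Example per-token pricing; substitute the provider's actual rates.
        span.set_attribute(
            "llm.cost_usd",
            result["input_tokens"] * 3e-6 + result["output_tokens"] * 15e-6,
        )
        return result["text"]


print(traced_llm_call("What does the policy say about refunds?"))
```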
The metrics we measure
Eval@Core scoring aligns with industry-standard frameworks (notably Ragas for RAG and OpenAI Evals patterns for general LLM tasks). The metric set is configured per use case; a short illustrative sketch follows each group below.
For RAG & retrieval
- Faithfulness — does the answer reflect what the retrieved sources actually say?
- Answer relevance — does the answer address the question asked?
- Context precision — were the retrieved chunks actually relevant?
- Context recall — did retrieval find what was needed?
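As an illustration, the four RAG metrics above can be scored with Ragas. The sketch below uses the Ragas 0.1-style API (imports and column names vary between versions) and assumes an LLM and embedding provider are already configured, since Ragas uses both under the hood.

```python
# Minimal sketch: score one RAG sample with Ragas (0.1-style API; names
# differ in newer releases). Assumes an OpenAI key is configured, as Ragas
# defaults to an LLM judge plus embeddings for these metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

golden_sample = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are available within 30 days of purchase."],
    "contexts": [["Our policy allows refunds up to 30 days after purchase."]],
    "ground_truth": ["Customers may request a refund within 30 days."],
})

scores = evaluate(
    golden_sample,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.97, ...}
```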
For agents & tool use
- Tool-call effectiveness — did the agent pick the right tool with the right arguments?
- Trajectory correctness — did the multi-step path reach the goal?
- Step efficiency — how many steps vs. the minimum needed?
- Recovery on failure — did the agent handle tool errors gracefully?
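Agent metrics like these usually need a small amount of custom scoring code on top of the trace data. A toy sketch of a tool-call effectiveness check, with a hypothetical ToolCall structure rather than a fixed Eval@Core format, could look like this:

```python
# Toy tool-call effectiveness scorer: did the agent pick the expected tool
# with the expected arguments? The ToolCall structure is hypothetical.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def tool_call_effectiveness(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected tool calls matched in order, by name and arguments."""
    if not expected:
        return 1.0
    matched = 0
    for want, got in zip(expected, actual):
        if got.name == want.name and got.args == want.args:
            matched += 1
    return matched / len(expected)


expected = [ToolCall("search_orders", {"customer_id": "C-42"})]
actual = [
    ToolCall("search_orders", {"customer_id": "C-42"}),
    ToolCall("send_email", {"to": "ops@example.com"}),
]
print(tool_call_effectiveness(actual, expected))  # 1.0
```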
For safety & trust
- Hallucination rate — NLI-based fact-checking against sources.
- Toxicity / PII leak — output classifiers (Detoxify, Presidio).
- Prompt-injection robustness — jailbreak attempt detection (per OWASP LLM01).
- Refusal correctness — does the agent refuse when it should, and not when it shouldn't?
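As a sketch of how the safety checks compose, the snippet below combines Detoxify for toxicity and Presidio for PII detection; the 0.5 toxicity threshold is illustrative and would be tuned per use case.

```python
# Minimal safety scorer sketch combining Detoxify (toxicity classifier)
# and Presidio (PII detection). The 0.5 threshold is illustrative.
from detoxify import Detoxify
from presidio_analyzer import AnalyzerEngine

toxicity_model = Detoxify("original")
pii_analyzer = AnalyzerEngine()


def safety_flags(output: str) -> dict:
    toxicity = toxicity_model.predict(output)["toxicity"]
    pii_hits = pii_analyzer.analyze(text=output, language="en")
    return {
        "toxic": bool(toxicity > 0.5),
        "pii_leak": [hit.entity_type for hit in pii_hits],
    }


print(safety_flags("Sure, John Smith's card number is 4111 1111 1111 1111."))
```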
For operations
- Latency — p50, p95, p99 per workflow stage.
- Cost-per-query — tokens, embeddings, tool calls, retrieval, generation.
- Drift — embedding-distribution shift, response-distribution shift over time.
- Success rate — percentage of agent runs that complete without errors.
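Operational metrics fall straight out of the trace data. A small sketch of latency percentiles, cost-per-query, and success rate over a batch of trace records (the record fields are hypothetical):

```python
# Sketch: derive p50/p95/p99 latency, average cost-per-query, and success
# rate from a batch of trace records. The record fields are hypothetical.
import numpy as np

traces = [
    {"latency_ms": 820, "cost_usd": 0.0042, "error": False},
    {"latency_ms": 1310, "cost_usd": 0.0065, "error": False},
    {"latency_ms": 2940, "cost_usd": 0.0101, "error": True},
]

latencies = np.array([t["latency_ms"] for t in traces])
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

cost_per_query = sum(t["cost_usd"] for t in traces) / len(traces)
success_rate = sum(not t["error"] for t in traces) / len(traces)

print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms "
      f"cost/query=${cost_per_query:.4f} success={success_rate:.0%}")
```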
The stack
Eval@Core is BYOT (bring-your-own-tools) by design. We integrate with whatever you already use, and bring opinions where there are none.
- Tracing & observability — Langfuse, LangSmith, Phoenix (Arize). OpenTelemetry-compatible.
- Evaluation frameworks — Ragas for RAG metrics, DeepEval for unit-test-style assertions, TruLens for agent feedback functions, OpenAI Evals for benchmarking.
- LLM-as-Judge — Claude or GPT-4 as judge with rubric prompts; outputs validated against human-labelled samples for judge calibration. A minimal rubric-judge sketch follows this list.
- Human-in-the-loop labelling — Argilla or LangSmith annotations; subject-matter experts grade a sample, the labelled set becomes the ground truth.
- Custom scorers — we write domain-specific scorers in Python (e.g., for regulatory compliance, financial accuracy, medical safety).
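As one example from the stack, the LLM-as-judge pattern boils down to a rubric prompt plus a strict output format. The sketch below uses the Anthropic Python SDK; the model id, rubric wording, and JSON shape are placeholders, and in practice the judge's scores are validated against human-labelled samples before they are trusted.

```python
# Minimal LLM-as-judge sketch: a faithfulness rubric scored 1-5, returned
# as JSON. Model id and rubric are placeholders; real judges are calibrated
# against human-labelled samples before their scores are trusted.
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Rate the ANSWER for faithfulness to the SOURCES on a 1-5 scale:
5 = every claim is supported, 1 = mostly unsupported claims.
Reply with JSON only: {"score": <int>, "reason": "<one sentence>"}"""


def judge_faithfulness(question: str, answer: str, sources: str) -> dict:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}\nSOURCES: {sources}",
        }],
    )
    return json.loads(message.content[0].text)


print(judge_faithfulness(
    "What is the refund window?",
    "Refunds are available within 30 days.",
    "Policy: refunds up to 30 days after purchase.",
))
```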
Where it fits
Eval@Core sits across the AI SDLC, not at the end of it. Evaluations run at multiple stages.
- Offline eval (development) — before any deploy, run the candidate version against the golden set.
- CI/CD gates — pull requests that touch prompts, retrieval, or model configs run the eval suite. A score regression fails the build (see the sketch after this list).
- A/B online eval — in production, route a percentage of traffic to a candidate; compare scores side-by-side.
- Continuous monitoring — live traffic is sampled, scored, and alerted on drift or regression.
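The CI/CD gate, for instance, is typically just a test over the golden set that fails on regression. A minimal pytest-style sketch, where the run_eval_suite helper, the baseline values, and the tolerance are placeholders:

```python
# Minimal CI gate sketch: run the candidate against the golden set and fail
# the build if any aggregate score regresses beyond a tolerance.
# run_eval_suite, the baseline values, and the tolerance are placeholders;
# in practice the baseline is loaded from a file checked in with the repo.
BASELINE = {"faithfulness": 0.90, "answer_relevancy": 0.86}
TOLERANCE = 0.02  # allow small run-to-run noise


def run_eval_suite() -> dict:
    """Placeholder: run the golden set through the candidate and return scores."""
    return {"faithfulness": 0.91, "answer_relevancy": 0.88}


def test_no_score_regression():
    candidate = run_eval_suite()
    for metric, baseline in BASELINE.items():
        assert candidate[metric] >= baseline - TOLERANCE, (
            f"{metric} regressed: {candidate[metric]:.2f} vs baseline {baseline:.2f}"
        )
```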
This is the same flywheel pattern documented on our Agent Evaluations page; Eval@Core is the productised, repeatable version of it.