Eval@Core — The AI evaluation flywheel that catches drift before users do.

A continuous evaluation framework for AI agents and LLM applications. Trace every invocation, verify against ground truth, score the output, feed findings back into the system.

Built on Ragas, LangSmith, Phoenix (Arize), Langfuse, DeepEval, TruLens, OpenAI Evals — with LLM-as-Judge plus human-in-the-loop labelling for production-grade agent evaluation.

An agent is only as good as its evaluations. Without a flywheel, drift goes undetected, hallucinations reach production, and quality is unverifiable. Eval@Core wraps the open-source eval ecosystem — Ragas, LangSmith, Phoenix, Langfuse — into a single delivery pattern.

The four stages

Eval@Core implements a closed-loop evaluation flywheel with four stages, each engineered to be auditable and repeatable.

[Figure: the Eval@Core flywheel, a four-stage continuous loop (01 Trace, 02 Verify, 03 Score, 04 Retrain); each stage feeds the next.]
  1. Trace: Every agent invocation, tool call, retrieval, and LLM response is logged with full payload, latency, and cost. OpenTelemetry-compatible tracing through Langfuse or LangSmith, depending on the customer's stack (see the tracing sketch after this list).
  2. Verify: Outputs are checked against ground-truth datasets curated with the customer's subject-matter experts. Both pointwise verification (single-output correctness) and pairwise verification (preference comparison) are supported, as sketched below.
  3. Score: Quantitative scoring across the metrics that matter for the use case: faithfulness, answer relevance, context precision/recall, tool-call effectiveness, safety, and latency.
  4. Retrain: Findings feed back into prompt refinement, retrieval tuning, fine-tuning datasets, or agent redesign. The loop closes; the next iteration starts on the next deploy.
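
What stage 01 looks like in practice: a minimal tracing sketch, assuming the Langfuse Python SDK's @observe decorator (LangSmith's @traceable plays the same role; import paths vary across SDK versions). The function names and bodies are placeholders.

```python
# A minimal sketch of stage 01 (Trace), assuming the Langfuse Python SDK's
# @observe decorator (import path as in the v2 SDK; later versions moved it).
# retrieve/answer/agent_run are placeholder functions, not Eval@Core APIs.
from langfuse.decorators import observe

@observe()  # recorded as a span: inputs, outputs, latency
def retrieve(query: str) -> list[str]:
    return ["placeholder chunk"]  # call your vector store here

@observe(as_type="generation")  # recorded with model/token/cost metadata
def answer(query: str, chunks: list[str]) -> str:
    return "placeholder answer"  # call your LLM here

@observe()  # the top-level trace: one per agent invocation
def agent_run(query: str) -> str:
    return answer(query, retrieve(query))
```

Because the decorators nest, one agent_run call yields a single trace with the retrieval and generation spans already linked, which is what the Verify and Score stages consume.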

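And stage 02's two verification modes in miniature. Both helpers are hypothetical illustrations rather than APIs from any of the frameworks above; production pointwise checks are usually semantic rather than exact-match.

```python
# Hypothetical illustrations of the two verification modes; not framework APIs.
from typing import Callable

def pointwise_correct(answer: str, ground_truth: str) -> bool:
    """Single-output correctness (exact match here for brevity; production
    checks are usually semantic, e.g. embedding similarity or an LLM judge)."""
    return answer.strip().lower() == ground_truth.strip().lower()

def pairwise_preference(answer_a: str, answer_b: str,
                        judge: Callable[[str], str]) -> str:
    """Preference comparison: an LLM judge picks the better of two answers."""
    prompt = (f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
              "Which answer is better? Reply with exactly 'A' or 'B'.")
    return "A" if judge(prompt).strip().startswith("A") else "B"
```
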
The metrics we measure

Eval@Core scoring aligns with industry-standard frameworks (notably Ragas for RAG and OpenAI Evals patterns for general LLM tasks). The metric set is configured per use case; each metric group below is paired with a short sketch of how it can be scored.

For RAG & retrieval

  • Faithfulness — does the answer reflect what the retrieved sources actually say?
  • Answer relevance — does the answer address the question asked?
  • Context precision — were the retrieved chunks actually relevant?
  • Context recall — did retrieval find what was needed?
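
These four metrics map one-to-one onto Ragas's built-ins. A minimal sketch, assuming the Ragas v0.1-style evaluate() API and dataset schema (later Ragas releases rename some columns and metric objects):

```python
# A minimal Ragas run over one labelled example; real runs use the curated
# ground-truth dataset and need a judge LLM configured (e.g. OPENAI_API_KEY).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are allowed within 30 days of purchase."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores in [0, 1]
```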

For agents & tool use

  • Tool-call effectiveness — did the agent pick the right tool with the right arguments?
  • Trajectory correctness — did the multi-step path reach the goal?
  • Step efficiency — how many steps vs. the minimum needed?
  • Recovery on failure — did the agent handle tool errors gracefully?
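
Tool-call effectiveness, for example, can be scored pointwise against a labelled trajectory. A hypothetical scorer; the ToolCall type and the partial-credit weighting are illustrative choices, not from any listed framework:

```python
# A hypothetical pointwise scorer; the ToolCall type and the 0.5/0.5 weighting
# are illustrative choices, not part of any listed framework.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def tool_call_score(actual: ToolCall, expected: ToolCall) -> float:
    """1.0 = right tool, right args; 0.0 = wrong tool; partial credit for
    the right tool with partially matching arguments."""
    if actual.name != expected.name:
        return 0.0
    if not expected.args:
        return 1.0
    matched = sum(1 for key, value in expected.args.items()
                  if actual.args.get(key) == value)
    return 0.5 + 0.5 * matched / len(expected.args)

# Right tool, one of two expected arguments correct -> 0.75
print(tool_call_score(
    ToolCall("search_orders", {"customer_id": "c-42", "status": "open"}),
    ToolCall("search_orders", {"customer_id": "c-42", "status": "shipped"}),
))
```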

For safety & trust

  • Hallucination rate — NLI-based fact-checking against sources.
  • Toxicity / PII leak — output classifiers (Detoxify, Presidio).
  • Prompt-injection robustness — jailbreak attempt detection (per OWASP LLM01).
  • Refusal correctness — does the agent refuse when it should, and not when it shouldn't?
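
The toxicity and PII checks run as output classifiers using exactly the libraries named above. A minimal sketch; the 0.5 and 0.7 thresholds are illustrative and must be tuned per use case:

```python
# A minimal sketch of the toxicity / PII checks with Detoxify and Presidio;
# thresholds are illustrative, not recommended defaults.
from detoxify import Detoxify
from presidio_analyzer import AnalyzerEngine

toxicity_model = Detoxify("original")  # multi-label toxicity classifier
pii_analyzer = AnalyzerEngine()        # rule/NER-based PII detector

def safety_flags(output: str) -> dict:
    toxicity = toxicity_model.predict(output)["toxicity"]
    pii_hits = pii_analyzer.analyze(text=output, language="en")
    return {
        "toxic": toxicity > 0.5,
        "pii_leak": any(hit.score > 0.7 for hit in pii_hits),
        "pii_entities": sorted({hit.entity_type for hit in pii_hits}),
    }
```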

For operations

  • Latency — p50, p95, p99 per workflow stage.
  • Cost-per-query — tokens, embeddings, tool calls, retrieval, generation.
  • Drift — embedding-distribution shift, response-distribution shift over time.
  • Success rate — percentage of agent runs that complete without errors.
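
The operational roll-up falls out of the trace data captured in stage 01. A sketch, assuming exported trace records with latency_ms, cost_usd, and error fields (the field names are our assumption, not a fixed schema); drift detection needs distribution tests beyond this summary:

```python
# A sketch of the operational roll-up from exported trace records.
import statistics

def ops_summary(records: list[dict]) -> dict:
    latencies = [r["latency_ms"] for r in records]
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "cost_per_query_usd": sum(r["cost_usd"] for r in records) / len(records),
        "success_rate": sum(1 for r in records if not r["error"]) / len(records),
    }
```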

The stack

Eval@Core is BYOT (bring-your-own-tools) by design. We integrate with whatever you already use, and bring opinions where there are none.

Where it fits

Eval@Core sits across the AI SDLC, not at the end of it. Evaluations run at multiple stages, from development through CI to production monitoring; the simplest of these is a regression gate in CI, sketched below.
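
One way that gate can look: a pytest suite that fails the build when a scored metric regresses. The run_eval_suite helper, thresholds, and metric names are hypothetical stand-ins, not part of Eval@Core's public surface.

```python
# A pytest-style regression gate; run_eval_suite and thresholds are
# hypothetical stand-ins.
import pytest

THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}

def run_eval_suite() -> dict[str, float]:
    # Placeholder: run the eval suite (e.g. the Ragas call above) against
    # the golden dataset; hard-coded scores stand in for a real run.
    return {"faithfulness": 0.91, "answer_relevancy": 0.83}

@pytest.mark.parametrize("metric,threshold", sorted(THRESHOLDS.items()))
def test_no_metric_regression(metric: str, threshold: float) -> None:
    scores = run_eval_suite()
    assert scores[metric] >= threshold, (
        f"{metric} regressed below {threshold}: {scores[metric]:.2f}")
```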

This is the same flywheel pattern documented on our Agent Evaluations page; Eval@Core is the productised, repeatable version of it.
