How routing decides
Every query arrives with the user's identity, which maps to a team/project in the registry. From there RouteIQ runs a deterministic, auditable decision before any expensive model is called.
- Identity → project. The user maps to a default project, which carries its own allowed models, budget, documentation, and routing rules.
- Project-relevance score. A hybrid signal combines semantic similarity (cosine similarity of embeddings) with lexical overlap (keyword/term match) against the project docs plus the last few conversation turns. The combination is robust to both paraphrasing and project-specific vocabulary.
- On-topic vs off-topic. On-topic queries go to mid- or premium-tier models; off-topic queries are routed to cheaper models. The cut-off is a per-project percentile-calibrated threshold (an optional Bayesian confidence score is on the roadmap).
- Ambiguous band. When relevance lands in an uncertain region, a lightweight classifier (Nova Lite) makes the final call rather than defaulting to the expensive tier.
- Model selection within the tier. The admin chooses the strategy — cheapest, lowest latency, round-robin, or availability-aware (skip recently failed models). Complexity-based selection arrives in Phase 2.
- Log everything. The query, model choice, cost, and outcome signals are recorded — with a schema designed to feed future capability profiles.
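The scoring and tiering steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not RouteIQ's implementation: the function names, the 0.7 semantic weight, the whitespace tokeniser, the 0.05 band width, and placing the ambiguous band around the mid threshold are all hypothetical. Only the 0.80 and 0.50 default thresholds come from the policy defaults described in this document.

```python
import math
from typing import Callable, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_overlap(query: str, reference: str) -> float:
    """Fraction of distinct query terms that also appear in the reference text."""
    q_terms = set(query.lower().split())
    r_terms = set(reference.lower().split())
    return len(q_terms & r_terms) / len(q_terms) if q_terms else 0.0


def relevance_score(query_vec: Sequence[float], doc_vec: Sequence[float],
                    query: str, reference: str,
                    semantic_weight: float = 0.7) -> float:
    """Weighted blend of semantic and lexical signals (the weight is an assumption)."""
    semantic = max(0.0, cosine_similarity(query_vec, doc_vec))
    lexical = lexical_overlap(query, reference)
    return semantic_weight * semantic + (1.0 - semantic_weight) * lexical


def decide_tier(score: float,
                premium_threshold: float = 0.80,
                mid_threshold: float = 0.50,
                band: float = 0.05,
                is_on_topic: Optional[Callable[[], bool]] = None) -> str:
    """Map a relevance score to a tier name.

    Scores within `band` of the mid threshold count as ambiguous; if a
    classifier callback is supplied, it makes the on-topic/off-topic call.
    """
    if is_on_topic is not None and abs(score - mid_threshold) <= band:
        return "mid" if is_on_topic() else "cheap"
    if score >= premium_threshold:
        return "premium"
    if score >= mid_threshold:
        return "mid"
    return "cheap"
```

In this sketch the classifier is passed as a callback so it is invoked only for ambiguous scores, which keeps the common path free of any model call.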
Three model tiers
Each project defines its own model pool, organised by cost tier. The router only ever picks within the tier the relevance score and budget allow.
Premium
High-relevance, complex, project-critical work.
- Typical models: Opus 4.8, GPT-4o.
- Approx. cost: $0.015–$0.060 per query.
Mid
Medium-relevance or ambiguous queries.
- Typical models: Sonnet 4.6, GPT-4o-mini.
- Approx. cost: $0.003–$0.010 per query.
Cheap
Off-topic queries and simple in-scope work.
- Typical models: Qwen-7B, Haiku 4.5, Nova Lite.
- Approx. cost: $0.0005–$0.002 per query.
Selection strategy
Within a tier, route by the admin's chosen lever.
- Cheapest — lowest cost per input token.
- Lowest latency — best p50.
- Round-robin & availability-aware failover.
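One way to picture these levers is a small selector over a tier's model pool. The sketch below is hypothetical: `ModelInfo`, `TierSelector`, the 60-second failure cooldown, and the strategy strings are illustrative names and values, not RouteIQ's API. Availability-aware selection is modelled here as "first model in pool order that has not failed within the cooldown".

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelInfo:
    name: str
    input_cost_per_1k: float          # cost per 1K input tokens (illustrative unit)
    p50_latency_ms: float             # median observed latency
    last_failure_ts: Optional[float] = None  # epoch seconds of the last failure


class TierSelector:
    def __init__(self, models: List[ModelInfo], cooldown_s: float = 60.0):
        if not models:
            raise ValueError("a tier needs at least one model")
        self.models = list(models)
        self.cooldown_s = cooldown_s   # assumed meaning of "recently failed"
        self._next = 0                 # round-robin cursor

    def _available(self, now: float) -> List[ModelInfo]:
        healthy = [m for m in self.models
                   if m.last_failure_ts is None
                   or now - m.last_failure_ts >= self.cooldown_s]
        # If every model failed recently, fall back to the full pool.
        return healthy or self.models

    def select(self, strategy: str, now: Optional[float] = None) -> ModelInfo:
        now = time.time() if now is None else now
        if strategy == "cheapest":
            return min(self.models, key=lambda m: m.input_cost_per_1k)
        if strategy == "lowest_latency":
            return min(self.models, key=lambda m: m.p50_latency_ms)
        if strategy == "round_robin":
            model = self.models[self._next % len(self.models)]
            self._next += 1
            return model
        if strategy == "availability_aware":
            return self._available(now)[0]
        raise ValueError(f"unknown strategy: {strategy}")
```

A production router would probably combine the availability filter with the other strategies; they are kept separate here to mirror the list above.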
Budget & cost governance
RouteIQ watches spend continuously and acts before the bill does. Team-level monthly budgets are tracked with burn-rate anomaly detection, and routing degrades gracefully as thresholds are crossed.
- Tiered alerts & auto-actions. Normal (≤25% of monthly budget) — no action. Yellow (>35%) — alert only. Orange (>50%) — downgrade premium to mid, notify with a cost report. Red (>75%) — downgrade all routing to cheap, urgent escalation. Hard stop (100%) — reject with a budget-exhausted message and escalate to finance.
- Per-user-per-day cost cap. Prevents a single user from draining the team budget.
- Premium-token cap per user. Bounds how much premium-tier capacity any one user can consume per day.
- Retry-detection escalation. If a user retries the same query, RouteIQ upgrades one tier — better an answer than a loop.
- Automatic tier downgrade. When the budget burns too fast, the system lowers the routing ceiling and alerts the admin.
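The alert levels and auto-actions above amount to a threshold ladder plus a tier ceiling. The following is a minimal sketch with hypothetical function names. It makes two assumptions: any usage that does not exceed the yellow threshold is treated as normal, and a retry escalation is applied before the budget ceiling clamps the result. The text does not specify either point.

```python
from typing import Optional

TIER_ORDER = ["cheap", "mid", "premium"]

# Highest tier the router may use at each budget level.
CEILING = {"normal": "premium", "yellow": "premium", "orange": "mid", "red": "cheap"}


def budget_level(spent: float, monthly_budget: float) -> str:
    """Classify month-to-date spend against the monthly budget."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = spent / monthly_budget
    if used >= 1.0:
        return "hard_stop"
    if used > 0.75:
        return "red"
    if used > 0.50:
        return "orange"
    if used > 0.35:
        return "yellow"
    return "normal"  # assumption: everything up to 35% counts as normal


def escalate_on_retry(tier: str, is_retry: bool) -> str:
    """Move up one tier when the user repeats the same query."""
    if not is_retry:
        return tier
    index = TIER_ORDER.index(tier)
    return TIER_ORDER[min(index + 1, len(TIER_ORDER) - 1)]


def apply_budget(requested_tier: str, level: str) -> Optional[str]:
    """Clamp the requested tier to the budget ceiling; None means reject."""
    if level == "hard_stop":
        return None
    return min(requested_tier, CEILING[level], key=TIER_ORDER.index)
```

For example, a retried mid-tier query at the orange level would escalate to premium and then be clamped back to mid under these assumptions.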
Project policy rules
Each project in the registry exposes a set of levers to enterprise admins — routing behaviour is configuration, not code.
- Relevance thresholds — minimum score to reach the premium tier (default 0.80) and the mid tier (default 0.50), set per project.
- Off-topic policy — what happens below the mid threshold (default: downgrade to a cheaper tier).
- Monthly budget ceiling — per project, set by the admin.
- Per-user limits: each user's percentage of the daily budget, plus a premium-token cap per user.
- Query length cap — maximum input tokens per query (default 32K).
- Model selection strategy: cheapest, round-robin, lowest latency, availability-aware, or complexity-based (Phase 2).
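As a rough illustration of "configuration, not code", the levers above could be captured in a single policy record. The class and field names below are hypothetical. The defaults shown (0.80, 0.50, downgrade, 32K) are the ones stated in the list, with 32K read as 32,000 tokens as an assumption. Complexity-based selection is left out of the allowed strategies because it is described as Phase 2.

```python
from dataclasses import dataclass
from typing import Optional

PHASE_1_STRATEGIES = {"cheapest", "round_robin", "lowest_latency", "availability_aware"}


@dataclass(frozen=True)
class ProjectPolicy:
    monthly_budget_usd: float                 # set by the admin, no default
    selection_strategy: str                   # one of PHASE_1_STRATEGIES
    premium_threshold: float = 0.80           # minimum score for the premium tier
    mid_threshold: float = 0.50               # minimum score for the mid tier
    off_topic_policy: str = "downgrade"       # behaviour below the mid threshold
    daily_budget_pct_per_user: Optional[float] = None
    premium_token_cap_per_user: Optional[int] = None
    max_input_tokens: int = 32_000            # "32K"; exact value is an assumption

    def __post_init__(self) -> None:
        if self.monthly_budget_usd <= 0:
            raise ValueError("monthly_budget_usd must be positive")
        if not 0.0 <= self.mid_threshold < self.premium_threshold <= 1.0:
            raise ValueError("need 0 <= mid_threshold < premium_threshold <= 1")
        if self.selection_strategy not in PHASE_1_STRATEGIES:
            raise ValueError(f"unsupported strategy: {self.selection_strategy}")
```

Validating at construction time means a misconfigured project fails when the policy is loaded rather than when a query is routed.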
The stack
RouteIQ is model-agnostic by design, spanning managed and self-hosted inference so you inherit better models without re-architecting.
- Adapters. Amazon Bedrock (primarily Claude models and Nova Lite) and a vLLM adapter on SageMaker hosting open-weight models (Qwen2.5-7B, Qwen3-8B, Llama-3.1-8B).
- Embeddings. bge-base-en-v1.5 or Amazon Titan; project documentation is embedded and refreshed on a weekly cadence.
- Short-term memory. The last few conversation turns per session inform relevance scoring (in-process, no persistence layer).
- Outcome logging. Query, model choice, cost, and outcome signals captured with a schema designed for future capability profiles.
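The short-term memory and outcome logging pieces are simple enough to sketch in a few lines. This is a hypothetical illustration: `SessionMemory`, `OutcomeRecord`, the three-turn default, and the JSON-lines serialisation are assumptions. The record fields follow the four items named above (query, model choice, cost, outcome signals).

```python
import json
from collections import deque
from dataclasses import asdict, dataclass, field
from typing import Deque, Dict


class SessionMemory:
    """In-process store of the last few turns per session (no persistence)."""

    def __init__(self, max_turns: int = 3):
        self.max_turns = max_turns
        self._turns: Dict[str, Deque[str]] = {}

    def add_turn(self, session_id: str, text: str) -> None:
        # A bounded deque drops the oldest turn automatically once full.
        turns = self._turns.setdefault(session_id, deque(maxlen=self.max_turns))
        turns.append(text)

    def context(self, session_id: str) -> str:
        """Recent turns joined into one string for relevance scoring."""
        return " ".join(self._turns.get(session_id, ()))


@dataclass
class OutcomeRecord:
    query: str
    model: str
    cost_usd: float
    outcome_signals: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        """Serialise the record as one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the outcome signals as an open dictionary leaves room for the schema to grow toward capability profiles without changing the record type.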
Phased rollout
Phase 1 delivers project-policy-based routing — the project registry, identity mapping, hybrid relevance scoring with percentile-calibrated thresholds, three-tier routing, budget tracking with auto-downgrade, per-user caps, retry escalation, and comprehensive logging. Phase 2 adds complexity-aware model selection (query complexity plus benchmark scores) and learned, per-model capability profiles, plus calibrated confidence scoring.
Where it fits
RouteIQ sits in front of your agents and LLM apps as the routing and cost-governance layer. It is governed by Responsible AI controls, its routing quality is measured by AI Eval Service, and it pairs naturally with the GenAI Delivery Factory for production rollout.