Enterprise GenAI spend is unpredictable for structural reasons. Multiple teams build in parallel. Applications invoke multiple models per workflow. Agentic loops chain LLM calls until a task is done — what looks like a one-cent prompt becomes a multi-step run that consumes ten or twenty times more. Industry analysts have repeatedly flagged that a meaningful share of enterprise GenAI projects overrun their budgets due to architectural choices and operational immaturity rather than malice or inflation.
This guide is a practical playbook for bringing that under control: the five levers that actually move cost, the architecture choices that make optimisation possible (or impossible) later, the FinOps discipline that turns AI spend from a variable mystery into a managed line item, and the patterns we see working in Indian enterprise deployments in 2026.
Lever 1 — Model Routing (BYOM)
Model selection is the single biggest lever on cost. Token prices can vary by an order of magnitude or more between frontier models and efficient mid-tier or small models for the same task. A workload that runs every call against a frontier model when a mid-tier model would suffice is paying a premium for capability it does not use.
The discipline: route each workload to the smallest model that meets its accuracy bar. Triage queries with an SLM. Reserve frontier models for genuine reasoning load. Build a routing layer that can switch models without touching application code. This is what BYOM (Bring Your Own Model) exists to enable. Enterprises that lock to one vendor's API surface foreclose this lever entirely.
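A minimal sketch of what such a routing layer can look like, assuming a hypothetical model catalogue and a complexity score produced by your own triage step (the model names, prices, and thresholds below are illustrative, not a recommendation):

```python
from dataclasses import dataclass

# Hypothetical model names and per-million-token prices; substitute your own catalogue.
@dataclass(frozen=True)
class ModelSpec:
    name: str
    input_usd_per_mtok: float
    output_usd_per_mtok: float

CATALOGUE = {
    "small":    ModelSpec("small-model",    0.25,  1.25),
    "mid":      ModelSpec("mid-tier-model", 3.00, 15.00),
    "frontier": ModelSpec("frontier-model", 15.00, 75.00),
}

def route(workload: str, complexity: float) -> ModelSpec:
    """Pick the smallest model that meets the workload's accuracy bar.

    `complexity` is whatever your triage step produces: an SLM classifier
    score, a heuristic on the prompt, or an eval-backed lookup per workload.
    """
    if complexity < 0.3:
        return CATALOGUE["small"]
    if complexity < 0.7:
        return CATALOGUE["mid"]
    return CATALOGUE["frontier"]

# Application code calls route(), never a vendor SDK directly, so swapping
# models is a catalogue edit rather than a refactor.
print(route("ticket-summarisation", 0.2).name)
```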
Lever 2 — Context and Prompt Discipline
Input cost scales linearly with context length: every token you send is a token you pay for. Two patterns capture most of the savings:
- Retrieve, don't stuff. Pull in just the relevant context per query (RAG, summarised history, scoped documents) rather than concatenating the whole knowledge base into every prompt. See RAG vs Fine-Tuning.
- Prune aggressively. Conversation history, system instructions, and few-shot examples accumulate. Trim what is not load-bearing. Summarise long histories before passing them on.
For agentic systems, the same discipline applies between steps — pass the next agent only what it needs, not the full transcript.
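As an illustration, a history-pruning helper of the kind described above might look like the sketch below; the `keep_last` threshold and the summarisation hook are assumptions to adapt to your own stack:

```python
from typing import Callable, Optional

def prune_history(messages: list[dict],
                  keep_last: int = 6,
                  summarise: Optional[Callable[[list[dict]], str]] = None) -> list[dict]:
    """Keep the system prompt and the most recent turns; compress the rest.

    `summarise` is whatever cheap path you already have (an SLM call, an
    extractive summary). Without it, older turns are dropped with a note.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_last:
        return system + turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarise(old) if summarise else f"[{len(old)} earlier turns omitted]"
    note = {"role": "user", "content": f"Earlier conversation summary: {summary}"}
    return system + [note] + recent
```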
Lever 3 — Caching
All three major model providers (Anthropic, OpenAI, Google) offer prompt caching that materially reduces the cost of repeated prompt prefixes — system prompts, document context, few-shot examples that recur across requests. Anthropic's prompt caching is opt-in with cache writes priced higher and cache reads priced significantly lower; OpenAI introduced automatic caching for repeated prefixes; Google's Vertex AI offers context caching with similar economics.
Where it helps: templated prompts, document-grounded chat, RAG with stable system prompts. Where it doesn't: one-off queries, prompts that change every call. Worth measuring on your actual traffic before assuming a percentage saving.
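For concreteness, here is a hedged sketch of opt-in caching with the Anthropic Python SDK. The `cache_control` shape follows Anthropic's documented prompt-caching API at the time of writing, but the model id and prompt text are placeholders; verify against current provider docs before relying on it:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...your stable system prompt and grounding documents..."

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; subsequent requests that
            # reuse this exact prefix are billed at the cheaper cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise the attached policy."}],
)

# The usage fields show whether the call wrote to or read from the cache -
# the raw material for the cache hit-rate metric discussed later in this guide.
usage = response.usage
print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens, usage.input_tokens)
```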
Lever 4 — Agent Budget Caps
Agentic loops are where unpredictable spend usually lives. A multi-step run can chain 10–50 LLM calls before terminating, and a bad reasoning step can chain many more. Without hard caps, a runaway loop can cost more than a month of single-prompt traffic in an afternoon.
Three controls are mandatory:
- Per-run hard caps on tokens, dollars, and wall-clock time, enforced at the orchestration layer
- Per-tenant budget caps in shared services, with explicit overage paths
- Pre-flight estimation that warns or refuses when a planned run looks likely to exceed budget
For more on agent-loop discipline, see our multi-agent orchestration guide.
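A minimal sketch of the first control, per-run hard caps enforced in the orchestration loop; the thresholds are assumptions you would replace with figures from your own P95/P99 run costs:

```python
import time

class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Per-run hard caps on tokens, dollars, and wall-clock time (illustrative)."""

    def __init__(self, max_tokens: int = 200_000, max_usd: float = 5.0,
                 max_seconds: float = 300.0):
        self.max_tokens, self.max_usd, self.max_seconds = max_tokens, max_usd, max_seconds
        self.tokens, self.usd, self.started = 0, 0.0, time.monotonic()

    def charge(self, tokens: int, usd: float) -> None:
        """Call after every model call; raises as soon as any cap is breached."""
        self.tokens += tokens
        self.usd += usd
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens}")
        if self.usd > self.max_usd:
            raise BudgetExceeded(f"dollar cap hit: ${self.usd:.2f}")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock cap hit")

# In the agent loop: budget.charge(...) after each step, and terminate the run
# (or hand off to a human) when BudgetExceeded is raised.
```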
Lever 5 — SLMs and Mixed Inference
Small language models (SLMs) are now competitive with frontier models on focused enterprise tasks at a fraction of the inference cost. Patterns that work in production:
- Triage with an SLM, escalate to a frontier model only on uncertainty. The SLM handles the easy 80% cheaply; the frontier model handles the hard 20%.
- Run sensitive workloads on private SLMs. On-prem or VPC deployment of an SLM removes per-token pricing for internal high-volume use cases.
- Use SLMs for embeddings and classifiers. Embedding generation, content classification, and routing decisions rarely need frontier capability.
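A compressed sketch of the triage pattern, assuming you already have a cheap SLM callable and a frontier-model callable wired up through your routing layer; the confidence threshold is an illustrative assumption to be set from your own evals:

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # illustrative; calibrate against your accuracy bar

def answer(query: str,
           slm_call: Callable[[str], tuple[str, float]],
           frontier_call: Callable[[str], str]) -> str:
    """Triage with an SLM, escalate to a frontier model only on low confidence."""
    draft, confidence = slm_call(query)   # (text, classifier or self-reported score)
    if confidence >= CONFIDENCE_THRESHOLD:
        return draft                       # the easy ~80% stays cheap
    return frontier_call(query)            # the hard ~20% gets frontier capability
```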
FinOps for AI — Operating Model
Optimisation discipline only sticks when there is an operating model behind it. The pattern that works:
- Visibility per team and per workload. Token spend, dollar spend, latency — broken down by application, agent, model, and tenant.
- Unit economics. Cost per task, per user, per ticket resolved, per claim processed. Track the unit cost as a first-class metric, not just total spend.
- Cost allocation. Spend rolls up to business units, not the central AI team. The teams that consume the budget own the optimisation.
- Continuous optimisation cadence. Quarterly review of routing, caching, and prompt efficiency for top-spend workloads.
- Anomaly alerting. A sudden spike in tokens or dollars for a workload triggers an investigation in the moment, not an after-the-fact post-mortem.
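To make the unit-economics point concrete, here is a small sketch of rolling tagged call records up into cost per task and per team; the record shape and figures are illustrative:

```python
from collections import defaultdict

# Each record is one model call, tagged at request time (see the architecture
# section below). Teams, workloads, and dollar figures are made up.
calls = [
    {"team": "claims",  "workload": "claim-triage",   "usd": 0.004, "task_id": "c-101"},
    {"team": "claims",  "workload": "claim-triage",   "usd": 0.012, "task_id": "c-101"},
    {"team": "support", "workload": "ticket-summary", "usd": 0.002, "task_id": "t-55"},
]

spend = defaultdict(float)
tasks = defaultdict(set)
for c in calls:
    key = (c["team"], c["workload"])
    spend[key] += c["usd"]
    tasks[key].add(c["task_id"])

for key in spend:
    unit_cost = spend[key] / len(tasks[key])  # cost per task: the first-class metric
    print(key, f"total=${spend[key]:.3f}", f"per-task=${unit_cost:.3f}")
```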
Architecture Choices That Make Optimisation Possible
Three architecture decisions dictate whether the levers above are even available to you later:
- Abstraction over the model layer. Application code calls a routing layer, not a vendor SDK. Without this, switching models is a refactor.
- Centralised observability. Every model call traced through one observability pipeline. Without this, you cannot see where cost actually lives.
- Per-tenant and per-workload tagging. Every request carries metadata (team, application, agent, user) that the cost reports break down by. Without tags, you have a total bill and no path to action.
These decisions are easy on day one and expensive on day three hundred. Make them up front.
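As a sketch of the third decision, per-request tagging can be as simple as a metadata object attached to every call that passes through the routing layer; the field names below are illustrative, so align them with your own observability pipeline:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class RequestTags:
    """Metadata carried on every model call so cost reports can break spend
    down by team, application, agent, and tenant."""
    team: str
    application: str
    agent: Optional[str]
    tenant: str
    user_id: str

tags = RequestTags(team="claims", application="claim-triage",
                   agent="document-extractor", tenant="acme-insurance",
                   user_id="u-4821")
# Pass asdict(tags) as metadata or headers on every call through the routing layer.
print(asdict(tags))
```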
What to Measure First
If you are starting today, instrument these in the next 30 days:
- Total monthly token spend, broken down by model and by application
- Per-application cost trend (week over week, month over month)
- Per-task or per-user unit cost for the top three GenAI workloads
- Cache hit rate where caching is enabled
- Agent run distribution: median run cost, P95 run cost, P99 run cost — the tail is where the surprises live
With these in hand, the optimisation priorities become obvious: pick the two highest-spend workloads, apply model routing and context pruning, measure the delta, repeat. In the deployments we see, enterprises that run this loop typically cut inference cost by 30–60% in the first focused quarter with no measurable loss of accuracy.
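For the agent run distribution mentioned above, a small sketch of the tail metrics; the run costs are made-up figures purely for illustration:

```python
import statistics

# Per-run costs in dollars for one agentic workload over a day (illustrative).
run_costs = sorted([0.03, 0.05, 0.04, 0.06, 0.05, 0.41, 0.07, 0.04, 1.92, 0.05])

def percentile(sorted_costs: list[float], p: float) -> float:
    """Nearest-rank percentile - good enough for a spend dashboard."""
    idx = min(len(sorted_costs) - 1, round(p / 100 * (len(sorted_costs) - 1)))
    return sorted_costs[idx]

print(f"median=${statistics.median(run_costs):.2f}",
      f"P95=${percentile(run_costs, 95):.2f}",
      f"P99=${percentile(run_costs, 99):.2f}")
# The tail is where the surprises live: a handful of runaway runs can cost
# more than the rest of the day's traffic combined.
```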
Cost Optimisation and Quality Are Not Opposites
The myth worth retiring: that cheaper means worse. In production, the opposite is more often true. The right-sized model for a task often performs better than an oversized frontier model that confuses the question with its own training tendencies. Discipline on context length usually improves accuracy by removing noise. Routing through evaluations means each workload runs on the model that scored best for it — not the one that was most fashionable when the application was built.
Cost optimisation is GenAI's version of code quality: a continuous habit, not a one-time project. Bake it into the engineering culture and you end up with both lower bills and better systems.