Enterprise GenAI spend is unpredictable for structural reasons. Multiple teams build in parallel. Applications invoke multiple models per workflow. Agentic loops chain LLM calls until a task is done — what looks like a one-cent prompt becomes a multi-step run that consumes ten or twenty times more. Industry analysts have repeatedly flagged that a meaningful share of enterprise GenAI projects overrun their budgets due to architectural choices and operational immaturity rather than malice or inflation.
This guide is a practical playbook for bringing that under control: the five levers that actually move cost, the architecture choices that make optimisation possible (or impossible) later, the FinOps discipline that turns AI spend from a variable mystery into a managed line item, and the patterns we see working in Indian enterprise deployments in 2026.
Lever 1 — Model Routing (BYOM)
Model selection is the single biggest lever on cost. Token prices can vary by an order of magnitude or more between frontier models and efficient mid-tier or small models for the same task. A workload that runs every call against a frontier model when a mid-tier model would suffice is paying a premium for capability it does not use.
The discipline: route each workload to the smallest model that meets its accuracy bar. Triage queries with an SLM. Reserve frontier models for genuine reasoning load. Build a routing layer that can switch models without touching application code. This is what BYOM (Bring Your Own Model) exists to enable. Enterprises that lock to one vendor's API surface foreclose this lever entirely.
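A minimal sketch of what such a routing layer can look like, assuming a hypothetical model catalogue and a complexity score produced by your own triage step (the model names, prices, and thresholds below are illustrative, not a recommendation):

```python
from dataclasses import dataclass

# Hypothetical model names and per-million-token prices; substitute your own catalogue.
@dataclass(frozen=True)
class ModelSpec:
    name: str
    input_usd_per_mtok: float
    output_usd_per_mtok: float

CATALOGUE = {
    "small":    ModelSpec("small-model",    0.25,  1.25),
    "mid":      ModelSpec("mid-tier-model", 3.00, 15.00),
    "frontier": ModelSpec("frontier-model", 15.00, 75.00),
}

def route(workload: str, complexity: float) -> ModelSpec:
    """Pick the smallest model that meets the workload's accuracy bar.

    `complexity` is whatever your triage step produces: an SLM classifier
    score, a heuristic on the prompt, or an eval-backed lookup per workload.
    """
    if complexity < 0.3:
        return CATALOGUE["small"]
    if complexity < 0.7:
        return CATALOGUE["mid"]
    return CATALOGUE["frontier"]

# Application code calls route(), never a vendor SDK directly, so swapping
# models is a catalogue edit rather than a refactor.
print(route("ticket-summarisation", 0.2).name)
```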
Lever 2 — Context and Prompt Discipline
Input cost scales linearly with context length: every token you send is a token you pay for. Two patterns capture most of the savings:
- Retrieve, don't stuff. Pull in just the relevant context per query (RAG, summarised history, scoped documents) rather than concatenating the whole knowledge base into every prompt. See RAG vs Fine-Tuning.
- Prune aggressively. Conversation history, system instructions, and few-shot examples accumulate. Trim what is not load-bearing. Summarise long histories before passing them on.
For agentic systems, the same discipline applies between steps — pass the next agent only what it needs, not the full transcript.
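As an illustration, a history-pruning helper of the kind described above might look like the sketch below; the `keep_last` threshold and the summarisation hook are assumptions to adapt to your own stack:

```python
from typing import Callable, Optional

def prune_history(messages: list[dict],
                  keep_last: int = 6,
                  summarise: Optional[Callable[[list[dict]], str]] = None) -> list[dict]:
    """Keep the system prompt and the most recent turns; compress the rest.

    `summarise` is whatever cheap path you already have (an SLM call, an
    extractive summary). Without it, older turns are dropped with a note.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_last:
        return system + turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarise(old) if summarise else f"[{len(old)} earlier turns omitted]"
    note = {"role": "user", "content": f"Earlier conversation summary: {summary}"}
    return system + [note] + recent
```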
Lever 3 — Caching
All three major model providers (Anthropic, OpenAI, Google) offer prompt caching that materially reduces the cost of repeated prompt prefixes — system prompts, document context, few-shot examples that recur across requests. Anthropic's prompt caching is opt-in with cache writes priced higher and cache reads priced significantly lower; OpenAI introduced automatic caching for repeated prefixes; Google's Vertex AI offers context caching with similar economics.
Where it helps: templated prompts, document-grounded chat, RAG with stable system prompts. Where it doesn't: one-off queries, prompts that change every call. Worth measuring on your actual traffic before assuming a percentage saving.
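For concreteness, here is a hedged sketch of opt-in caching with the Anthropic Python SDK. The `cache_control` shape follows Anthropic's documented prompt-caching API at the time of writing, but the model id and prompt text are placeholders; verify against current provider docs before relying on it:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...your stable system prompt and grounding documents..."

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; subsequent requests that
            # reuse this exact prefix are billed at the cheaper cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise the attached policy."}],
)

# The usage fields show whether the call wrote to or read from the cache -
# the raw material for the cache hit-rate metric discussed later in this guide.
usage = response.usage
print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens, usage.input_tokens)
```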
Lever 4 — Agent Budget Caps
Agentic loops are where unpredictable spend usually lives. A multi-step run can chain 10–50 LLM calls before terminating, and a bad reasoning step can chain many more. Without hard caps, a runaway loop can cost more than a month of single-prompt traffic in an afternoon.
Three controls are mandatory:
- Per-run hard caps on tokens, dollars, and wall-clock time, enforced at the orchestration layer
- Per-tenant budget caps in shared services, with explicit overage paths
- Pre-flight estimation that warns or refuses when a planned run looks likely to exceed budget
For more on agent-loop discipline, see our multi-agent orchestration guide.
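A minimal sketch of the first control, per-run hard caps enforced in the orchestration loop; the thresholds are assumptions you would replace with figures from your own P95/P99 run costs:

```python
import time

class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Per-run hard caps on tokens, dollars, and wall-clock time (illustrative)."""

    def __init__(self, max_tokens: int = 200_000, max_usd: float = 5.0,
                 max_seconds: float = 300.0):
        self.max_tokens, self.max_usd, self.max_seconds = max_tokens, max_usd, max_seconds
        self.tokens, self.usd, self.started = 0, 0.0, time.monotonic()

    def charge(self, tokens: int, usd: float) -> None:
        """Call after every model call; raises as soon as any cap is breached."""
        self.tokens += tokens
        self.usd += usd
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens}")
        if self.usd > self.max_usd:
            raise BudgetExceeded(f"dollar cap hit: ${self.usd:.2f}")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock cap hit")

# In the agent loop: budget.charge(...) after each step, and terminate the run
# (or hand off to a human) when BudgetExceeded is raised.
```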
Lever 5 — SLMs and Mixed Inference
Small language models (SLMs) are now competitive with frontier models on focused enterprise tasks at a fraction of the inference cost. Patterns that work in production:
- Triage with an SLM, escalate to a frontier model only on uncertainty. The SLM handles the easy 80% cheaply; the frontier model handles the hard 20%.
- Run sensitive workloads on private SLMs. On-prem or VPC deployment of an SLM removes per-token pricing for internal high-volume use cases.
- Use SLMs for embeddings and classifiers. Embedding generation, content classification, and routing decisions rarely need frontier capability.
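A compressed sketch of the triage pattern, assuming you already have a cheap SLM callable and a frontier-model callable wired up through your routing layer; the confidence threshold is an illustrative assumption to be set from your own evals:

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # illustrative; calibrate against your accuracy bar

def answer(query: str,
           slm_call: Callable[[str], tuple[str, float]],
           frontier_call: Callable[[str], str]) -> str:
    """Triage with an SLM, escalate to a frontier model only on low confidence."""
    draft, confidence = slm_call(query)   # (text, classifier or self-reported score)
    if confidence >= CONFIDENCE_THRESHOLD:
        return draft                       # the easy ~80% stays cheap
    return frontier_call(query)            # the hard ~20% gets frontier capability
```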
FinOps for AI — Operating Model
Optimisation discipline only sticks when there is an operating model behind it. The pattern that works:
- Visibility per team and per workload. Token spend, dollar spend, latency — broken down by application, agent, model, and tenant.
- Unit economics. Cost per task, per user, per ticket resolved, per claim processed. Track the unit cost as a first-class metric, not just total spend.
- Cost allocation. Spend rolls up to business units, not the central AI team. The teams that consume the budget own the optimisation.
- Continuous optimisation cadence. Quarterly review of routing, caching, and prompt efficiency for top-spend workloads.
- Anomaly alerting. A sudden spike in tokens or dollars for a workload triggers an investigation in the moment, not an after-the-fact post-mortem.
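To make the unit-economics point concrete, here is a small sketch of rolling tagged call records up into cost per task and per team; the record shape and figures are illustrative:

```python
from collections import defaultdict

# Each record is one model call, tagged at request time (see the architecture
# section below). Teams, workloads, and dollar figures are made up.
calls = [
    {"team": "claims",  "workload": "claim-triage",   "usd": 0.004, "task_id": "c-101"},
    {"team": "claims",  "workload": "claim-triage",   "usd": 0.012, "task_id": "c-101"},
    {"team": "support", "workload": "ticket-summary", "usd": 0.002, "task_id": "t-55"},
]

spend = defaultdict(float)
tasks = defaultdict(set)
for c in calls:
    key = (c["team"], c["workload"])
    spend[key] += c["usd"]
    tasks[key].add(c["task_id"])

for key in spend:
    unit_cost = spend[key] / len(tasks[key])  # cost per task: the first-class metric
    print(key, f"total=${spend[key]:.3f}", f"per-task=${unit_cost:.3f}")
```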
Architecture Choices That Make Optimisation Possible
Three architecture decisions dictate whether the levers above are even available to you later:
- Abstraction over the model layer. Application code calls a routing layer, not a vendor SDK. Without this, switching models is a refactor.
- Centralised observability. Every model call traced through one observability pipeline. Without this, you cannot see where cost actually lives.
- Per-tenant and per-workload tagging. Every request carries metadata (team, application, agent, user) that the cost reports break down by. Without tags, you have a total bill and no path to action.
These decisions are easy on day one and expensive on day three hundred. Make them up front.
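As a sketch of the third decision, per-request tagging can be as simple as a metadata object attached to every call that passes through the routing layer; the field names below are illustrative, so align them with your own observability pipeline:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class RequestTags:
    """Metadata carried on every model call so cost reports can break spend
    down by team, application, agent, and tenant."""
    team: str
    application: str
    agent: Optional[str]
    tenant: str
    user_id: str

tags = RequestTags(team="claims", application="claim-triage",
                   agent="document-extractor", tenant="acme-insurance",
                   user_id="u-4821")
# Pass asdict(tags) as metadata or headers on every call through the routing layer.
print(asdict(tags))
```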
What to Measure First
If you are starting today, instrument these in the next 30 days:
- Total monthly token spend, broken down by model and by application
- Per-application cost trend (week over week, month over month)
- Per-task or per-user unit cost for the top three GenAI workloads
- Cache hit rate where caching is enabled
- Agent run distribution: median run cost, P95 run cost, P99 run cost — the tail is where the surprises live
With these in hand, the optimisation priorities become obvious: pick the two highest-spend workloads, apply model routing and context pruning, measure the delta, repeat. In the deployments we see, enterprises that run this loop typically cut inference cost by 30–60% in the first focused quarter with no measurable loss of accuracy.
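For the agent run distribution mentioned above, a small sketch of the tail metrics; the run costs are made-up figures purely for illustration:

```python
import statistics

# Per-run costs in dollars for one agentic workload over a day (illustrative).
run_costs = sorted([0.03, 0.05, 0.04, 0.06, 0.05, 0.41, 0.07, 0.04, 1.92, 0.05])

def percentile(sorted_costs: list[float], p: float) -> float:
    """Nearest-rank percentile - good enough for a spend dashboard."""
    idx = min(len(sorted_costs) - 1, round(p / 100 * (len(sorted_costs) - 1)))
    return sorted_costs[idx]

print(f"median=${statistics.median(run_costs):.2f}",
      f"P95=${percentile(run_costs, 95):.2f}",
      f"P99=${percentile(run_costs, 99):.2f}")
# The tail is where the surprises live: a handful of runaway runs can cost
# more than the rest of the day's traffic combined.
```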
Cost Optimisation and Quality Are Not Opposites
The myth worth retiring: that cheaper means worse. In production, the opposite is more often true. The right-sized model for a task often performs better than an oversized frontier model that confuses the question with its own training tendencies. Discipline on context length usually improves accuracy by removing noise. Routing through evaluations means each workload runs on the model that scored best for it — not the one that was most fashionable when the application was built.
Cost optimisation is GenAI's version of code quality: a continuous habit, not a one-time project. Bake it into the engineering culture and you end up with both lower bills and better systems.