How to cut LLM costs 30–60% without losing quality
Most teams running AI in production are paying far more than they should. Not because they made bad choices, but because the bill grew faster than anyone’s ability to see inside it. Spend becomes a single line item on a cloud invoice, and every proposed cut is a guess.
This is the exact playbook we run in a four-week cost engagement. It is vendor-agnostic (it works whether you’re on OpenAI, Anthropic, open-weight models, or a mix), and it consistently finds 30–60% savings with no measurable quality loss.
Step 0: You can’t cut what you can’t see
Before touching a single prompt, instrument every LLM call. For each request you want four numbers: input tokens, output tokens, latency, and cost, tagged by route (which feature/endpoint made the call) and model.
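A minimal sketch of what that instrumentation could look like. The model names, the prices, and the `call_llm` signature are all placeholders we made up for illustration; swap in your own client and your provider's real rates.

```python
import time
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES_PER_MTOK = {
    "big-model": {"input": 10.00, "output": 30.00},
    "small-model": {"input": 0.50, "output": 1.50},
}


@dataclass
class CallRecord:
    route: str           # which feature/endpoint made the call
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


LOG: list[CallRecord] = []  # in production this would go to your metrics store


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price one call from its token counts."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def instrumented_call(route, model, call_llm, prompt):
    """Wrap any client. Assumes call_llm(model, prompt) returns
    (text, input_tokens, output_tokens)."""
    start = time.perf_counter()
    text, in_tok, out_tok = call_llm(model, prompt)
    latency = time.perf_counter() - start
    LOG.append(
        CallRecord(route, model, in_tok, out_tok, latency,
                   cost_usd(model, in_tok, out_tok))
    )
    return text
```

The point of the wrapper is that every call site passes a `route` tag, so nothing reaches the provider without being attributable to a feature.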
This is the highest-leverage hour in the whole engagement. The moment you can group spend by route, the waste announces itself: one background job quietly burning 40% of the budget, a “summarize” endpoint sending the entire document every time, a retry loop double-paying on timeouts.
If you take one thing from this post: spend is a distribution, not a number. The average call is fine. The p99 call is where your money goes.
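To make the "distribution, not a number" point concrete, here is a rough sketch of a per-route report that puts the mean next to the p99. It assumes you already have (route, cost) pairs from your own logging; the function names are ours, not from any library.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def spend_by_route(records):
    """records: iterable of (route, cost_usd) pairs."""
    costs = defaultdict(list)
    for route, cost in records:
        costs[route].append(cost)

    grand_total = sum(sum(c) for c in costs.values())
    report = {}
    for route, c in costs.items():
        total = sum(c)
        report[route] = {
            "calls": len(c),
            "total": total,
            "share": total / grand_total,   # fraction of the whole bill
            "mean": total / len(c),
            "p99": percentile(c, 99),
        }
    return report
```

A route whose p99 is many times its mean is usually the first place to look: a few oversized requests are carrying most of its cost.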
Step 1: Right-size the model per task
The single biggest lever is to stop using an expensive frontier model for work a cheaper one does just as well. But “just use the cheap model” is how you tank quality and get the project killed. The discipline is per-route model selection with quality guards:
- Classify each route by how much reasoning it actually needs.
- Route the easy ones (classification, extraction, formatting, routing itself) to a smaller/cheaper model.
- Keep the frontier model on the routes where quality is load-bearing.
- Gate every downgrade behind an eval (see Step 4) so you can prove quality held.
A realistic split: 60–70% of call volume is low-stakes and can move down a tier, while the 30–40% that matters stays on the best model. That alone is often a 40% cost cut.
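As a sketch, per-route selection can be as plain as a lookup table with a safe default. The route names and tier labels below are invented for illustration. The small helper is a back-of-envelope check on the saving, under the simplifying assumption that the moved routes' share of spend is about the same as their share of volume.

```python
# Hypothetical routes and tiers; yours will come out of the Step 0 audit.
ROUTE_TIERS = {
    "classify_ticket": "small",
    "extract_fields": "small",
    "format_reply": "small",
    "contract_review": "frontier",
}

# Unknown routes stay on the best model until an eval says otherwise.
DEFAULT_TIER = "frontier"


def pick_tier(route: str) -> str:
    return ROUTE_TIERS.get(route, DEFAULT_TIER)


def blended_saving(share_moved: float, price_ratio: float) -> float:
    """Fraction of total spend saved when `share_moved` of current spend
    moves to a tier that costs `price_ratio` times the frontier price."""
    return share_moved * (1.0 - price_ratio)


# Assumed figures: 65% of spend moves to a tier at 40% of the frontier price.
saving = blended_saving(0.65, 0.40)  # roughly 0.39, i.e. about a 39% cut
```

The real ratio between tiers is often much lower than 0.4, but token-heavy routes tend to be the ones that stay on the frontier model, so treat this as an estimate to verify against the instrumented numbers.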
Step 2: Cache the repeats
Production traffic is far more repetitive than it feels. Two caches pay off:
- Exact / normalized prompt cache: identical (or trivially normalized) requests return a stored response. It pays off most on routes with a hot set of common inputs, such as FAQ-style questions or repeated classification of the same items.
- Provider prompt caching: most major APIs now bill cached prompt prefixes at a steep discount. If you have a large static system prompt, put the static content first and the per-request content last so the provider can cache the prefix. That can cut input cost substantially for very little engineering effort.
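Here is a minimal in-memory sketch of the first kind of cache. The normalization (collapsing whitespace only) and the `call_llm(model, system, user)` signature are assumptions for the example. A production version would add a TTL and eviction, and would normally be limited to routes where a repeated answer is acceptable.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache keyed on model plus whitespace-normalized prompts."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(text: str) -> str:
        # Collapse runs of whitespace and trim. Deliberately conservative:
        # anything more aggressive can change the meaning of the prompt.
        return " ".join(text.split())

    def _key(self, model: str, system: str, user: str) -> str:
        payload = json.dumps(
            [model, self._normalize(system), self._normalize(user)]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, system, user, call_llm):
        """Return the stored response, or call the model once and store it."""
        key = self._key(model, system, user)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, system, user)
        self._store[key] = response
        return response
```

Tracking `hits` and `misses` per route tells you quickly whether a cache is earning its keep or just adding a lookup.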
Step 3: Batch and fall back
- Batching: for any work where no user is waiting on the response, batch APIs run at a large discount in exchange for asynchronous, delayed results. Nightly enrichment, evals, and backfills all belong in batch.
- Fallback chains: a single provider hiccup shouldn’t mean a failed request or an expensive panic-retry on your priciest model. Define an explicit fallback order with timeouts so you degrade gracefully and cheaply.
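A fallback chain can be sketched in a few lines. This is an illustrative version using only the standard library, not the API of any particular router; the `call_llm(model, prompt)` signature and the model names in the usage note are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_fallback(prompt, chain, call_llm):
    """Try each (model, timeout_seconds) in order; return (model, text).

    Raises RuntimeError if every model in the chain fails or times out.
    """
    errors = []
    pool = ThreadPoolExecutor(max_workers=max(1, len(chain)))
    try:
        for model, timeout_s in chain:
            future = pool.submit(call_llm, model, prompt)
            try:
                return model, future.result(timeout=timeout_s)
            except FutureTimeout:
                # The abandoned call keeps running in its thread; a real
                # client should also set its own request timeout.
                errors.append((model, "timeout"))
            except Exception as exc:
                errors.append((model, repr(exc)))
        raise RuntimeError(f"all models failed: {errors}")
    finally:
        pool.shutdown(wait=False)  # do not block on timed-out calls
```

Usage might look like `call_with_fallback(prompt, [("primary", 10.0), ("secondary", 10.0), ("cheap-backup", 20.0)], call_llm)`. The order is the design decision: the chain should end on something cheap and reliable, not loop back to the most expensive model.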
This is exactly the logic we encoded into our open-source router, fast-litellm — cost-aware routing, caching, batching, and fallback as configuration rather than scattered application code.
Step 4: Lock the gains behind evals
Every change above is a quality risk until proven otherwise. Before you ship a downgrade or a cache, you need a small eval set per route — a few dozen representative inputs with a way to score outputs (exact match, an LLM judge, or a rubric). Then each optimization becomes a measurable trade, not a leap of faith.
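In its simplest form, that gate is a function that scores the current model and the candidate on the same eval set and refuses the change if quality drops past a tolerance. The sketch below uses exact match as the scorer and a 2-point tolerance; both are assumptions, and an LLM judge or rubric would slot in as a different `score` function.

```python
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(model_fn, eval_set, score=exact_match) -> float:
    """eval_set: list of (input, expected) pairs. Returns the mean score."""
    scores = [score(model_fn(inp), exp) for inp, exp in eval_set]
    return sum(scores) / len(scores)


def downgrade_allowed(baseline_fn, candidate_fn, eval_set, max_drop=0.02):
    """Return (allowed, baseline_score, candidate_score).

    The change is allowed only if the candidate scores within `max_drop`
    of the baseline on the same inputs.
    """
    baseline = run_eval(baseline_fn, eval_set)
    candidate = run_eval(candidate_fn, eval_set)
    return candidate >= baseline - max_drop, baseline, candidate
```

Running this in CI for each route turns every downgrade, cache, or prompt tweak into a pass/fail check with two numbers attached.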
Evals also stop the savings from rotting. Without them, the next prompt tweak silently re-inflates cost or quietly drops quality, and six months later you’re back where you started.
Step 5: Make the savings visible — and keep them
The work isn’t done when the number drops; it’s done when the number stays down without you. That means a live dashboard of cost and latency by route, plus alerts when a route’s spend or tokens-per-request jumps. Savings that nobody can see get re-spent within a quarter.
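The alerting side can start very small. This sketch compares a current window against a baseline window per route and flags jumps in spend or in tokens per request. The 1.5x threshold and the field names are assumptions for the example, not a recommendation.

```python
def _tokens_per_request(stats) -> float:
    return stats["tokens"] / stats["requests"] if stats["requests"] else 0.0


def find_alerts(baseline, current, max_ratio=1.5):
    """baseline/current: {route: {"spend_usd": float, "tokens": int, "requests": int}}

    Returns a list of (route, reason) tuples.
    """
    alerts = []
    for route, now in current.items():
        before = baseline.get(route)
        if before is None:
            # A route nobody has budgeted for is worth a look by itself.
            alerts.append((route, "new_route"))
            continue

        if before["spend_usd"] > 0 and now["spend_usd"] / before["spend_usd"] > max_ratio:
            alerts.append((route, "spend"))

        tpr_before = _tokens_per_request(before)
        tpr_now = _tokens_per_request(now)
        if tpr_before > 0 and tpr_now / tpr_before > max_ratio:
            alerts.append((route, "tokens_per_request"))
    return alerts
```

Watching tokens per request separately from spend matters because a prompt that doubles in size can hide behind flat traffic for weeks before it shows up on the invoice.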
What this looks like end to end
In a typical four-week engagement: week one is instrumentation and the audit; week two is routing and caching behind evals; week three is batching, fallbacks, and dashboards; week four is verification and handoff. The deliverable is a documented before/after, code you own, and a runbook so your team keeps tuning.
If your LLM bill is climbing and nobody can say why, that’s the gap we close. The first conversation is free, and we’ll tell you straight whether there’s real money to find.