Optimizing LLM Infrastructure Spend at Production Scale

Model selection, context management, caching, and request shaping strategies that reduce inference spend without degrading production quality.

Overview

LLM cost optimization at production scale is a structured engineering discipline, not a one-time configuration change. The highest-leverage optimizations require understanding which cost driver is dominant for a given workload, applying targeted reduction strategies, and measuring quality impact before accepting the optimization.

This guide covers the four primary optimization levers — model selection, context management, caching, and request shaping — with before/after cost scenarios for common enterprise workload patterns.

Tip

Measure before optimizing. Pull the actual cost breakdown per feature — input %, output %, model — before deciding which lever to pull. Optimizing the wrong cost driver is a common waste of engineering time.
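
A minimal sketch of that breakdown, assuming per-request usage records with feature, model, and token counts are already being logged; the price table, field names, and sample records below are illustrative placeholders, not real rates:

```python
from collections import defaultdict

# Illustrative per-1K-token prices; substitute your providers' current rates.
PRICES = {
    "gpt-4.1": {"input": 0.002, "output": 0.008},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def cost_breakdown(records):
    """Aggregate input/output spend per feature from usage records.

    Each record is assumed to look like:
    {"feature": "support_bot", "model": "gpt-4.1",
     "input_tokens": 3200, "output_tokens": 450}
    """
    totals = defaultdict(lambda: {"input": 0.0, "output": 0.0})
    for r in records:
        price = PRICES[r["model"]]
        totals[r["feature"]]["input"] += r["input_tokens"] / 1000 * price["input"]
        totals[r["feature"]]["output"] += r["output_tokens"] / 1000 * price["output"]
    return dict(totals)

for feature, spend in cost_breakdown([
    {"feature": "support_bot", "model": "gpt-4.1", "input_tokens": 8000, "output_tokens": 600},
    {"feature": "doc_tagging", "model": "gpt-4o-mini", "input_tokens": 1200, "output_tokens": 40},
]).items():
    total = spend["input"] + spend["output"]
    print(f"{feature}: ${total:.4f} ({spend['input'] / total:.0%} input, {spend['output'] / total:.0%} output)")
```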

When to use this guide

  • Monthly AI spend has grown faster than request volume (cost per request is increasing)
  • A feature's cost per workflow is above acceptable margin thresholds
  • Engineering is being asked to reduce AI infrastructure spend without reducing feature scope
  • A workload is scaling and projected costs require optimization before the growth hits
  • A new feature is being designed and cost targets have been set in advance

Key concepts

Routing policy
A decision rule that selects which model handles a given request based on task complexity, latency requirements, or cost constraints. Effective routing policies reduce expensive model usage to cases where the capability is genuinely required.
Context compression
The process of reducing input token count without losing information required for the completion. Techniques include prompt trimming, chunk summarization, and dynamic context selection in RAG pipelines.
Prompt caching
Provider-side reuse of frequently repeated prompt prefixes that bills cache-read tokens at a fraction of the full input token price. Effective for static system prompts, tool definitions, and reused document context.
Output shaping
Constraints applied to generation that reduce output token count — explicit max_tokens limits, structured output schemas, and task decomposition. Output shaping is the highest-leverage lever for models with a 4–6× output premium.
Quality regression testing
Evaluation of model outputs against a set of labeled test cases before and after applying an optimization. Required before accepting any optimization that changes model or prompt — cost savings that come at the cost of undetected quality degradation are not real savings.
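
One minimal shape such a regression check can take, here for an exact-match task; the `call_model` callable, the metric, and the 0.92 threshold are assumptions to be replaced with the task's own acceptance criteria (for example F1 for classification, field-level accuracy for extraction):

```python
def evaluate_candidate(test_cases, call_model, threshold=0.92):
    """Score a candidate model against labeled test cases before accepting an optimization.

    test_cases: list of {"input": ..., "expected": ...} dicts with ground-truth labels
    call_model: callable that sends the input to the candidate model and returns its answer
    """
    failures = []
    for case in test_cases:
        predicted = call_model(case["input"])
        if predicted != case["expected"]:
            failures.append({"input": case["input"],
                             "expected": case["expected"],
                             "got": predicted})
    accuracy = 1 - len(failures) / len(test_cases)
    return {"accuracy": accuracy, "passed": accuracy >= threshold, "failures": failures}
```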

Lever 1: Model selection

Model selection is the single highest-leverage cost decision. A task routed to claude-opus-4 instead of claude-3-5-haiku costs 19× more at equivalent token counts. Most production workloads contain a mix of tasks where frontier model capability is required and tasks where a capable economy model is sufficient.

| Task type | Frontier model needed? | Economy alternative | Typical savings |
| --- | --- | --- | --- |
| Multi-step reasoning over ambiguous data | Yes | None — quality risk is high | N/A |
| Document classification (3–10 classes) | No | gpt-4o-mini, claude-3-5-haiku, gemini-2.0-flash | 85–95% |
| Structured data extraction (fixed schema) | No | gpt-4.1-mini, claude-3-5-haiku | 80–90% |
| Summary of structured documents | Rarely | gpt-4.1-mini, gemini-2.0-flash | 75–85% |
| Code review comments | Situational | gpt-4.1 (not o1) | 50–80% vs o1 |
| Conversational RAG (factual) | Rarely | gpt-4o-mini with strong retrieval | 70–80% |

  1. Audit current model usage by feature

    • Pull cost by model and feature for the previous 30 days
    • Identify features where frontier models are used for tasks that do not require frontier capability
  2. Define quality acceptance criteria before testing

    • What output quality is acceptable for this task? Define a measurable criterion (e.g., classification F1 > 0.92, extraction accuracy > 95%)
    • Without a pre-defined threshold, optimization evaluation becomes subjective
  3. Run the economy model on a representative sample

    • Test the economy model on 100–500 real production examples with ground truth
    • Measure quality against the acceptance criteria before making any routing change
  4. Deploy routing change with a feature flag

    • Route a percentage of traffic to the economy model while monitoring quality metrics and cost
    • Ramp from 10% → 50% → 100% with review at each stage
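
A sketch of step 4's gated rollout, assuming the economy model has already passed the offline evaluation; the task-type names and ramp fraction are placeholders, and in practice the fraction would come from a feature-flag service rather than a constant:

```python
import random

ECONOMY_TRAFFIC_FRACTION = 0.10  # ramp 0.10 -> 0.50 -> 1.00 with a quality review at each stage

# Task types where the economy model met the pre-defined acceptance criteria offline.
ECONOMY_ELIGIBLE = {"classification", "extraction", "structured_summary"}

def select_model(task_type: str) -> str:
    """Route a fraction of eligible traffic to the economy model; default to the frontier model."""
    if task_type in ECONOMY_ELIGIBLE and random.random() < ECONOMY_TRAFFIC_FRACTION:
        return "claude-3-5-haiku"
    return "claude-opus-4"
```

Logging the selected model alongside quality metrics and per-request cost makes the comparison at each ramp stage straightforward.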

Warning

Never migrate a production feature to a cheaper model without a quality evaluation. Cost reductions that introduce silent quality regressions are liabilities, not savings.

Lever 2: Context window management

Input token count grows with context window usage: system prompt length, conversation history accumulation, retrieved document size, and tool definition count. Each of these is a controllable variable. Unbounded context accumulation is the most common source of slow-burn cost increases in production.

  • System prompt audit: review system prompt length and identify redundant instructions; prompts that grew through iteration often contain contradictions and obsolete instructions that add tokens without adding value
  • Conversation history truncation: set a maximum turn count for conversations; summarize older turns into a compact context block rather than preserving verbatim history beyond 4–6 turns
  • RAG chunk sizing: calibrate retrieved chunk count to the minimum required for answer quality; passing top-20 chunks when top-5 is sufficient spends 4× the input tokens, most of it on context the model never uses
  • Tool definition trimming: remove tool definitions from requests that do not use those tools; dynamic tool selection reduces per-request input tokens when not all tools are relevant to every request
  • Dynamic context selection: score retrieved chunks by relevance before including them; low-relevance chunks add input cost without improving answer quality

Before/after example — RAG pipeline: Original configuration sends 8,000-token retrieved context (top-20 chunks, 400 tokens each) per query. After switching to top-5 precision-optimized retrieval: 2,000-token context. At gpt-4.1 input rates and 100,000 monthly queries: (8000-2000)/1000 × $0.002 × 100,000 = $1,200/month saved on input tokens alone, before output token impact.
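
The top-20 to top-5 change above can be combined with dynamic context selection as a simple filter over scored chunks; the chunk cap, relevance threshold, and token budget below are illustrative values that should be set by a retrieval-quality evaluation:

```python
def select_context(scored_chunks, max_chunks=5, min_score=0.5, token_budget=2000):
    """Keep only the highest-relevance retrieved chunks, capped by count and token budget.

    scored_chunks: iterable of (score, text, token_count) tuples, where score is a
    relevance estimate such as a reranker output.
    """
    selected, used_tokens = [], 0
    for score, text, tokens in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        if len(selected) >= max_chunks or score < min_score:
            break  # chunks are sorted, so everything remaining is lower-relevance
        if used_tokens + tokens > token_budget:
            continue
        selected.append(text)
        used_tokens += tokens
    return selected
```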

Lever 3: Prompt caching

Prompt caching eliminates redundant input token charges for repeated prefix content. The optimization is most valuable when a large static block — system prompt, tool definitions, or document context — appears in every request for a session or across multiple sessions.

  1. Identify cacheable content

    • Static system prompts that repeat verbatim across requests are the primary candidate
    • Tool definitions for a stable tool set (not dynamically selected) are cacheable if they exceed the minimum prefix length
    • Document context in RAG can be cached per session if the same document is referenced multiple times
  2. Restructure prompts to place static content first

    • Cached prefixes must appear at the start of the prompt — dynamic content (user message, current date, personalization) must be placed after the static prefix
    • Inserting dynamic content within the system prompt breaks caching for everything after it
  3. Validate cache hit rate from response metadata

    • Check usage.prompt_tokens_details.cached_tokens in responses (OpenAI) or cache_read_input_tokens (Anthropic)
    • Cache miss rates above 30% for expected cache-eligible content indicate a configuration issue — prefix instability or content below minimum length
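
A minimal sketch of steps 2 and 3 using the OpenAI Python SDK; the model name and system prompt are placeholders, and the usage field names reflect the provider's response format at the time of writing (Anthropic reports cache_read_input_tokens instead):

```python
from openai import OpenAI

client = OpenAI()

# Static block first, verbatim on every call; it must exceed the provider's
# minimum prefix length for caching to apply.
STATIC_SYSTEM_PROMPT = "...2,400-token support-agent instructions..."

def answer(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            # Dynamic content (user message, date, personalization) goes after the prefix.
            {"role": "user", "content": user_message},
        ],
    )
    details = response.usage.prompt_tokens_details
    cached = getattr(details, "cached_tokens", 0) or 0
    print(f"cache read: {cached} of {response.usage.prompt_tokens} prompt tokens")
    return response.choices[0].message.content
```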

Before/after example — customer support system: System prompt is 2,400 tokens and appears on every call. At $0.002/1k input rate, 500,000 monthly calls: original cost for system prompt tokens = 2400/1000 × $0.002 × 500,000 = $2,400/month. With caching at $0.0005/1k: 2400/1000 × $0.0005 × 500,000 = $600/month. Savings: $1,800/month from one configuration change.

Lever 4: Output shaping

Output tokens are priced at a 4–8× premium over input tokens depending on the model. Reducing output token count without losing required information content is the highest per-token cost lever available.

  • Set explicit max_tokens limits on every production inference call — open-ended generation is appropriate for prototypes, not production workloads
  • Use structured output schemas (JSON with defined fields) to prevent verbose prose responses for extraction and classification tasks
  • Decompose multi-part tasks into sequential focused calls rather than one large call that generates a long response
  • Use system prompt instructions that enforce terse output: 'respond in 2–3 sentences', 'return only the requested fields', 'do not include explanations unless asked'
  • For summarization tasks, specify output length explicitly in tokens or words rather than asking the model to determine appropriate length

Before/after example — analysis report generation: Original prompt asks for 'a comprehensive analysis' with no length constraint; average output is 1,800 tokens. After adding max_tokens=600 and a structured template, output averages 520 tokens. At claude-3-7-sonnet output rates ($0.015/1k) and 20,000 monthly calls: (1800-520)/1000 × $0.015 × 20,000 = $384/month saved on output tokens.
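
A sketch of the optimized call from the example above, using the Anthropic Python SDK; the model alias, template sections, and 600-token ceiling are illustrative, and the stop_reason check guards against the truncation pitfall listed below:

```python
import anthropic

client = anthropic.Anthropic()

# A fixed template plus terse-output instructions keep responses compact.
SYSTEM = (
    "You write analysis reports. Return only these sections: "
    "'Findings' (max 3 bullets), 'Risks' (max 2 bullets), "
    "'Recommendation' (1 sentence). Do not include explanations."
)

def generate_report(document: str) -> str:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=600,  # hard ceiling on output spend per call
        system=SYSTEM,
        messages=[{"role": "user", "content": f"Analyze the following:\n\n{document}"}],
    )
    if response.stop_reason == "max_tokens":
        # A truncated response is a silent quality failure; log it and review the limit.
        print("warning: output truncated at max_tokens")
    return response.content[0].text
```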

Prioritizing optimization work

With four optimization levers available, engineering effort should be prioritized by cost impact per hour of work. For most workloads, the sequence is: output shaping (high impact, low effort), prompt caching for large static prompts (high impact, low effort), model routing audit (highest impact, moderate effort), context compression (moderate impact, higher effort for RAG pipelines).

| Lever | Typical impact | Engineering effort | Quality risk |
| --- | --- | --- | --- |
| Output shaping (max_tokens, schema) | 20–60% output cost reduction | Low — prompt change only | Low with testing |
| Prompt caching (large static prefix) | 50–80% input savings on cached tokens | Low — structural prompt change | Minimal |
| Model routing (economy model for simple tasks) | 70–95% cost reduction on routed tasks | Medium — requires eval harness | Medium — requires quality validation |
| Context compression (RAG chunk reduction) | 30–70% input cost reduction | Medium — retrieval changes | Medium — retrieval quality must be validated |
| Conversation history trimming | 15–40% input cost reduction | Low to medium | Low — diminishing return beyond 6 turns |

Common pitfalls

  • Optimizing before measuring — applying cost reduction to the wrong workload or the wrong cost driver
  • Accepting cost savings without quality validation — classification accuracy, extraction correctness, and output usefulness must be verified against labeled data
  • Caching dynamic content — embedding user names, timestamps, or session data within the cached prefix invalidates the cache
  • Over-restricting max_tokens — setting max_tokens too low truncates valid responses and creates silent quality failures
  • Routing all traffic to an economy model instead of a percentage — ramp routing changes gradually and monitor quality metrics at each stage
  • Removing context without verifying it is unused — some retrieved chunks or system instructions that appear redundant are actually load-bearing for edge cases

Recommended approach

  1. Always start with cost attribution data

    • You need to know which feature is most expensive before deciding where to optimize
  2. Apply output shaping and caching first

    • These changes have the lowest quality risk and the highest return on engineering time for most workloads
  3. Use a feature flag for every model routing change

    • Ramp from 5% to 100%; measure quality metrics at each step; never flip the entire workload at once
  4. Define and measure quality before optimizing

    • Without a quality baseline, you cannot know whether an optimization has introduced a regression
  5. Validate savings in the cost dashboard post-deployment

    • Confirm that the expected cost reduction appears in actual per-feature spend; verify it is not offset by increased retry rates

CostLynx alignment

Use the Costs dashboard to identify cost-per-request trends by feature and model, which surfaces the highest-impact optimization candidates. The Spend by Model view shows which models are receiving traffic that might be better served by an economy model. Alert rules on cost-per-day by feature catch regressions after optimization changes are deployed and help confirm that each change delivers the expected reduction.