Understanding Token Costs in Production LLM Systems
How input, output, cached, and reasoning tokens accumulate in production, and how to calculate per-request cost accurately across providers.
Overview
Token billing is the primary cost driver for every production LLM workload. The mechanics are straightforward at small scale but become complex under production conditions: multiple token types, varying per-provider pricing semantics, model-specific reasoning overhead, and prompt caching eligibility all interact to produce costs that are difficult to forecast from first principles.
This guide explains each token category in production context, shows how to compute accurate per-request cost, and identifies which token types drive the most variance in enterprise workloads.
Note
All cost calculations should use token counts from provider response payloads — not character estimates, not model-specific heuristics. Provider counts are the billing source of truth.
When to use this guide
- Estimating inference cost for a new feature before it ships to production
- Investigating why monthly AI spend increased despite stable request volume
- Setting realistic token budgets for a project or team
- Onboarding engineers who are instrumenting LLM calls for the first time
- Validating that cost estimates in planning documents reflect how providers actually bill
Key concepts
- Input tokens
- All tokens submitted before generation: system prompt, conversation history, retrieved context, tool definitions, and tool outputs. Input tokens are the highest-variance cost driver because context window usage grows with every turn in a conversation and every chunk added by a retrieval system.
- Output tokens
- Tokens generated during the model response. Output tokens are priced 4–8× higher per token than input across current frontier models (see the pricing table below). They vary with requested verbosity, chain-of-thought instructions, structured output complexity, and the absence of explicit max_tokens limits.
- Cached tokens
- Prompt-cache reads that reuse a previously stored prefix. Providers charge a reduced rate for cache hits — typically 50–90% less than standard input pricing — but caching applies only to eligible prefixes above a minimum length. Cache misses are billed at full input rate.
- Reasoning tokens
- Internal chain-of-thought tokens consumed by reasoning models (OpenAI o1, o3; Anthropic extended thinking). These tokens are not visible in the completion but appear in usage metadata. They can exceed the visible output token count and are billed at output token rates.
- Tool tokens
- Tokens consumed by function/tool definitions injected into the system prompt and by structured tool outputs returned from tool calls. Each tool definition adds to input tokens on every request that includes it.
Cost per request: calculation
Per-request cost is the sum of three token charges, where input_tokens is the uncached portion of the prompt (providers such as OpenAI report cached tokens as a subset of total prompt tokens, so subtract cached_tokens before applying the full input rate). Using provider-published per-1k-token rates:
cost_per_request =
(input_tokens / 1000) × input_rate_per_1k
+ (output_tokens / 1000) × output_rate_per_1k
+ (cached_tokens / 1000) × cached_rate_per_1k
// Example: gpt-4.1 ($0.002 input / $0.008 output / $0.0005 cached per 1k)
// 1,500 input tokens, 600 output tokens, 800 cached tokens:
cost = (1500/1000 × 0.002) + (600/1000 × 0.008) + (800/1000 × 0.0005)
= $0.003 + $0.0048 + $0.0004
= $0.0082 per request
Warning
For reasoning models, add reasoning_tokens × output_rate to the calculation. Reasoning token counts appear in usage.completion_tokens_details.reasoning_tokens in OpenAI responses; note that usage.completion_tokens already includes them, so add them separately only when output_tokens counts visible tokens alone.
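A minimal sketch of this calculation in Python, assuming the caller supplies per-1k rates from a current pricing catalog (the function name and signature are illustrative, not part of any provider SDK):

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    reasoning_tokens: int = 0,
    *,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
    cached_rate_per_1k: float = 0.0,
) -> float:
    """Per-request cost from provider-reported token counts.

    input_tokens is the uncached portion of the prompt; reasoning tokens,
    when passed separately, are billed at the output rate.
    """
    return (
        (input_tokens / 1000) * input_rate_per_1k
        + (output_tokens / 1000) * output_rate_per_1k
        + (cached_tokens / 1000) * cached_rate_per_1k
        + (reasoning_tokens / 1000) * output_rate_per_1k
    )


# The gpt-4.1 example above: 1,500 input, 600 output, 800 cached tokens.
cost = cost_per_request(
    1500, 600, 800,
    input_rate_per_1k=0.002,
    output_rate_per_1k=0.008,
    cached_rate_per_1k=0.0005,
)
print(f"${cost:.4f}")  # $0.0082
```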
At scale, even small per-request cost differences compound significantly. A workload running 100,000 requests per day where output tokens grow from 400 to 600 — a 200-token increase — adds $160/day at gpt-4.1 output rates, or approximately $4,800/month.
Input vs output cost dynamics
Output tokens dominate per-request cost in most production workloads despite being fewer in count. A typical enterprise workload with 2,000 input tokens and 500 output tokens on gpt-4.1 splits cost as follows: input contributes $0.004, output contributes $0.004 — already equal, even at a 4:1 token ratio. Workloads with chain-of-thought completions, verbose structured outputs, or no max_tokens constraint shift this split heavily toward output.
| Model | Input $/1M | Output $/1M | Output premium | Cached read $/1M |
|---|---|---|---|---|
| gpt-4.1 | $2.00 | $8.00 | 4× | $0.50 |
| gpt-4o-mini | $0.15 | $0.60 | 4× | $0.075 |
| claude-opus-4 | $15.00 | $75.00 | 5× | $1.50 |
| claude-3-5-haiku | $0.80 | $4.00 | 5× | $0.08 |
| gemini-2.5-pro | $1.25 | $10.00 | 8× | provider-dependent |
| gemini-2.0-flash | $0.10 | $0.40 | 4× | provider-dependent |
The output premium ratio means that output token reduction has the highest cost impact per token saved. Setting explicit max_tokens limits, using structured output schemas that constrain length, and breaking multi-part responses into separate focused calls all reduce output token spend without changing model quality.
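For illustration, capping completion length with the OpenAI Python SDK might look like the sketch below; the model name and limit are placeholders, and newer reasoning models expect max_completion_tokens rather than max_tokens, so check the SDK documentation for your model family:

```python
from openai import OpenAI

client = OpenAI()

# An explicit ceiling keeps verbose completions from silently inflating output spend.
response = client.chat.completions.create(
    model="gpt-4.1",   # placeholder model name
    max_tokens=300,    # hard cap on billable output tokens for this call
    messages=[
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Summarize the attached incident report."},
    ],
)

print(response.usage.completion_tokens)  # authoritative billed output count
```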
Prompt caching: when it applies
Prompt caching reduces input token cost for repeated prefix content. It applies when the prefix — system prompt, tool definitions, static instructions, or document context — exceeds the provider's minimum cacheable length and has not expired.
Cache eligibility is provider-specific. OpenAI automatically caches prefixes of at least 1,024 tokens with a 5–10 minute TTL. Anthropic requires explicit cache_control markers and supports longer TTLs. Neither provider guarantees a cache hit; billing reflects actual hit/miss status in response metadata.
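As a sketch of the Anthropic side, the explicit marker looks roughly like this with the anthropic Python SDK; the model identifier and prompt content are placeholders, and the provider's prompt-caching documentation governs the exact field semantics:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder for a long, stable prefix (system prompt, tool definitions, policy text).
LONG_STATIC_SYSTEM_PROMPT = "..."

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model identifier
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # marks the prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's escalations."}],
)

# Billing reflects what actually happened, not intent:
print(response.usage.cache_read_input_tokens)      # billed at the reduced cache-read rate
print(response.usage.cache_creation_input_tokens)  # billed at the cache-write rate
```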
- Static system prompts exceeding 1,024 tokens are primary caching candidates — they repeat verbatim on every request
- Tool definitions for large tool sets add significant per-request input overhead; caching eliminates this if the tool set is stable
- Document context in RAG pipelines can be cached per document if the retrieved chunk set is reused across a session
- Cache hit rate degrades when system prompt content changes frequently — dynamic fields in prompts (user names, current timestamps) must be placed after the cacheable prefix, not within it
- Measure actual cache hit rate from response metadata before assuming caching is reducing costs
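A small sketch of that measurement, assuming usage payloads have been logged as dictionaries shaped like OpenAI's usage object (the helper name and sample events are illustrative):

```python
def cache_hit_ratio(usage_events: list[dict]) -> float:
    """Fraction of prompt tokens billed at the cached rate across logged usage
    payloads. Field names mirror OpenAI's usage object; adapt for other providers."""
    prompt = sum(e["prompt_tokens"] for e in usage_events)
    cached = sum(
        e.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        for e in usage_events
    )
    return cached / prompt if prompt else 0.0


# Two logged requests: one full miss, one partial hit on a 1,024-token prefix.
events = [
    {"prompt_tokens": 2300, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 2300, "prompt_tokens_details": {"cached_tokens": 1024}},
]
print(f"{cache_hit_ratio(events):.1%}")  # ~22.3%
```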
Reasoning tokens: the invisible cost layer
Reasoning models generate internal chain-of-thought tokens before producing the visible response. These tokens are billed at output rates but are not visible in the completion text. For OpenAI o1 and o3 models, reasoning tokens often exceed the visible output token count by 2–5× for complex tasks.
For a request to o1 with 2,000 input tokens, 300 visible output tokens, and 1,500 reasoning tokens: total cost is (2000/1000 × $0.015) + (300/1000 × $0.06) + (1500/1000 × $0.06) = $0.03 + $0.018 + $0.09 = $0.138. A naive estimate ignoring reasoning tokens would calculate $0.048 — a 65% underestimate.
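The same arithmetic, sketched against the shape of an OpenAI usage payload. Values are hardcoded to match the worked example; note that completion_tokens already includes the reasoning tokens on o-series models, so the details breakdown is what separates visible from hidden output:

```python
# Mirrors usage from an OpenAI o-series response; values match the example above.
usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 1800,  # includes reasoning tokens on o-series models
    "completion_tokens_details": {"reasoning_tokens": 1500},
}
input_rate, output_rate = 0.015, 0.06  # o1 per-1k rates used above

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning  # 300 visible output tokens

cost = (
    usage["prompt_tokens"] / 1000 * input_rate
    + visible / 1000 * output_rate
    + reasoning / 1000 * output_rate
)
print(f"${cost:.3f}")  # $0.138, vs $0.048 if reasoning tokens were ignored
```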
Warning
Do not use reasoning models for tasks that do not benefit from multi-step reasoning. Classification, summarization, and extraction workloads rarely justify reasoning model costs. Route to reasoning models only for tasks where the reasoning process demonstrably improves output quality.
Cost flow: from call site to invoice
1. Application sends inference request
   - The LLM client assembles the prompt (system + conversation + context) and sends it to the provider API
   - Token count at this stage is not known precisely — only estimated if counted client-side
2. Provider tokenizes and executes
   - The provider tokenizer determines the exact input token count, including system, user, tool definitions, and tool outputs
   - The model generates completion tokens; reasoning models consume reasoning tokens before the visible response
3. Provider returns usage in response
   - The response payload includes usage.prompt_tokens, usage.completion_tokens, and, for cached calls, usage.prompt_tokens_details.cached_tokens
   - These are the authoritative numbers for cost calculation — not estimates
4. Application emits usage event
   - Extract inputTokens, outputTokens, and cachedTokens from the response usage object (a sketch of this extraction follows the list)
   - Emit to the cost ingestion pipeline with provider, model, requestId, and attribution labels
   - If estimatedCostUsd is available from the provider response (some SDKs surface this), include it to bypass catalog lookup
5. Ingestion pipeline stores and attributes cost
   - Catalog lookup assigns per-token rates if estimatedCostUsd is not supplied
   - The event is stored under org, project, environment, and feature scope for dashboard attribution
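Steps 3 and 4 in code, as a rough sketch: the response shape follows OpenAI's usage object, the event field names follow the conventions above, and build_usage_event plus the choice to make inputTokens exclude the cached portion are illustrative assumptions, not a documented contract:

```python
def build_usage_event(response, *, provider: str, model: str, feature: str) -> dict:
    """Translate a provider response into a cost ingestion event (step 4).

    Token counts come from the provider-reported usage object, never from
    client-side estimates.
    """
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return {
        "provider": provider,
        "model": model,
        "requestId": response.id,
        "feature": feature,                           # attribution label
        "inputTokens": usage.prompt_tokens - cached,  # assumes cached is a subset of prompt_tokens
        "cachedTokens": cached,
        "outputTokens": usage.completion_tokens,
    }
```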
Practical examples
Customer support ticket classifier — gpt-4o-mini, system prompt 800 tokens, user message 300 tokens, classification output 50 tokens, no caching: cost = (1100/1000 × $0.00015) + (50/1000 × $0.0006) = $0.000165 + $0.00003 = $0.000195 per request. At 200,000 tickets/month: $39/month.
Document analysis — claude-opus-4, 12,000-token document + 500-token instruction, 800-token analysis output: cost = (12500/1000 × $0.015) + (800/1000 × $0.075) = $0.1875 + $0.06 = $0.2475 per document. At 5,000 documents/month: $1,237.50/month. Adding prompt caching for the 12,000-token document at $0.0015/1k reduces per-document cost to: (500/1000 × $0.015) + (12000/1000 × $0.0015) + (800/1000 × $0.075) = $0.0075 + $0.018 + $0.06 = $0.0855 — a 65% reduction.
RAG pipeline — gemini-2.5-pro, 500-token query + 3,000-token retrieved context, 400-token answer: cost = (3500/1000 × $0.00125) + (400/1000 × $0.01) = $0.004375 + $0.004 = $0.008375 per query. At 50,000 queries/month: $418.75/month.
Common pitfalls
- Estimating token counts from character length — tokenization varies by model and language; character-to-token ratios are not stable
- Ignoring tool definition tokens — a set of 20 function definitions can add 800–2,000 input tokens on every call that passes them
- Assuming caching reduces cost without verifying hit rate — cache misses are billed at full input rates
- Using output token averages without tracking the distribution — p95 and p99 output token counts are often 3–5× the median and drive disproportionate cost
- Conflating completion tokens with reasoning tokens for o1/o3 models — visible output is only part of what is billed
- Treating cost estimates as fixed — provider pricing changes; catalog lookups should use current rates, not hardcoded constants
Recommended approach
1. Always extract token counts from provider response objects
   - Use usage.prompt_tokens and usage.completion_tokens from the response, not pre-call estimates
2. Track output token distribution, not just the mean
   - p50, p90, and p99 output token counts reveal whether long-tail completions are inflating cost (see the percentile sketch after this list)
3. Identify your largest cost driver before optimizing
   - For most workloads, output token reduction yields more savings per engineering hour than input compression
4. Measure cache hit rate if caching is enabled
   - Use response metadata to verify that cache hit rates justify any prompt restructuring investment
5. Treat reasoning tokens as a separate budget category
   - Reasoning models require separate cost modeling — do not apply non-reasoning model rates to reasoning workloads
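A minimal sketch for step 2, assuming output token counts have already been collected from response usage objects (the helper name and sample values are illustrative):

```python
import statistics


def output_token_percentiles(counts: list[int]) -> dict[str, float]:
    """p50/p90/p99 of observed output token counts from logged usage data."""
    qs = statistics.quantiles(counts, n=100, method="inclusive")
    return {"p50": qs[49], "p90": qs[89], "p99": qs[98]}


# A long-tail distribution: the largest completions are several times the median.
sample = [180, 220, 240, 250, 260, 280, 300, 320, 900, 1500]
print(output_token_percentiles(sample))
```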
CostLynx alignment
CostLynx ingests inputTokens, outputTokens, and cachedTokens per event via POST /api/v1/usage/ingest. If estimatedCostUsd is provided in the event payload, it is stored directly without a pricing catalog lookup — useful for reasoning models where provider-supplied cost figures are more accurate than catalog estimates. The Costs dashboard breaks down spend by input, output, and model dimension over time.
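A hedged sketch of the ingest call using the requests library; the base URL, bearer-token auth scheme, and environment variable are placeholders, and only the field names already described above are assumed to be part of the payload:

```python
import os

import requests

event = {
    "provider": "openai",
    "model": "gpt-4.1",
    "requestId": "req_123",        # placeholder request identifier
    "project": "support-triage",   # attribution scope, as described above
    "environment": "production",
    "feature": "ticket-classifier",
    "inputTokens": 1500,
    "outputTokens": 600,
    "cachedTokens": 800,
    # "estimatedCostUsd": 0.0082,  # optional: stored directly, skipping the catalog lookup
}

resp = requests.post(
    "https://costlynx.example.com/api/v1/usage/ingest",  # placeholder host
    json=event,
    headers={"Authorization": f"Bearer {os.environ['COSTLYNX_API_KEY']}"},  # placeholder auth
    timeout=10,
)
resp.raise_for_status()
```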