Understanding Token Costs in Production LLM Systems
How input, output, cached, and reasoning tokens accumulate in production, and how to calculate per-request cost accurately across providers.
Overview
Token billing is the primary cost driver for every production LLM workload. The mechanics are straightforward at small scale but become complex under production conditions: multiple token types, varying per-provider pricing semantics, model-specific reasoning overhead, and prompt caching eligibility all interact to produce costs that are difficult to forecast from first principles.
This guide explains each token category in production context, shows how to compute accurate per-request cost, and identifies which token types drive the most variance in enterprise workloads.
Note
All cost calculations should use token counts from provider response payloads — not character estimates, not model-specific heuristics. Provider counts are the billing source of truth.
When to use this guide
- Estimating inference cost for a new feature before it ships to production
- Investigating why monthly AI spend increased despite stable request volume
- Setting realistic token budgets for a project or team
- Onboarding engineers who are instrumenting LLM calls for the first time
- Validating that cost estimates in planning documents reflect how providers actually bill
Key concepts
- Input tokens
- All tokens submitted before generation: system prompt, conversation history, retrieved context, tool definitions, and tool outputs. Input tokens are the highest-variance cost driver because context window usage grows with every turn in a conversation and every chunk added by a retrieval system.
- Output tokens
- Tokens generated during the model response. Output tokens are priced 4–8× higher per token than input across current frontier models (see the pricing table below). They vary with requested verbosity, chain-of-thought instructions, structured output complexity, and the absence of explicit max_tokens limits.
- Cached tokens
- Prompt-cache reads that reuse a previously stored prefix. Providers charge a reduced rate for cache hits — typically 50–90% less than standard input pricing — but caching applies only to eligible prefixes above a minimum length. Cache misses are billed at full input rate.
- Reasoning tokens
- Internal chain-of-thought tokens consumed by reasoning models (OpenAI o1, o3; Anthropic extended thinking). These tokens are not visible in the completion but appear in usage metadata. They can exceed the visible output token count and are billed at output token rates.
- Tool tokens
- Tokens consumed by function/tool definitions injected into the system prompt and by structured tool outputs returned from tool calls. Each tool definition adds to input tokens on every request that includes it.
Cost per request: calculation
Per-request cost is the sum of three token charges, where input_tokens is the uncached portion of the prompt (providers such as OpenAI report cached tokens as a subset of total prompt tokens, so subtract cached_tokens before applying the full input rate). Using provider-published per-1k-token rates:
cost_per_request =
(input_tokens / 1000) × input_rate_per_1k
+ (output_tokens / 1000) × output_rate_per_1k
+ (cached_tokens / 1000) × cached_rate_per_1k
// Example: gpt-4.1 ($0.002 input / $0.008 output / $0.0005 cached per 1k)
// 1,500 input tokens, 600 output tokens, 800 cached tokens:
cost = (1500/1000 × 0.002) + (600/1000 × 0.008) + (800/1000 × 0.0005)
= $0.003 + $0.0048 + $0.0004
= $0.0082 per request
Warning
For reasoning models, add reasoning_tokens × output_rate to the calculation. Reasoning token counts appear in usage.completion_tokens_details.reasoning_tokens in OpenAI responses; note that usage.completion_tokens already includes them, so add them separately only when output_tokens counts visible tokens alone.
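A minimal sketch of this calculation in Python, assuming the caller supplies per-1k rates from a current pricing catalog (the function name and signature are illustrative, not part of any provider SDK):

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    reasoning_tokens: int = 0,
    *,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
    cached_rate_per_1k: float = 0.0,
) -> float:
    """Per-request cost from provider-reported token counts.

    input_tokens is the uncached portion of the prompt; reasoning tokens,
    when passed separately, are billed at the output rate.
    """
    return (
        (input_tokens / 1000) * input_rate_per_1k
        + (output_tokens / 1000) * output_rate_per_1k
        + (cached_tokens / 1000) * cached_rate_per_1k
        + (reasoning_tokens / 1000) * output_rate_per_1k
    )


# The gpt-4.1 example above: 1,500 input, 600 output, 800 cached tokens.
cost = cost_per_request(
    1500, 600, 800,
    input_rate_per_1k=0.002,
    output_rate_per_1k=0.008,
    cached_rate_per_1k=0.0005,
)
print(f"${cost:.4f}")  # $0.0082
```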
At scale, even small per-request cost differences compound significantly. A workload running 100,000 requests per day where output tokens grow from 400 to 600 — a 200-token increase — adds $160/day at gpt-4.1 output rates, or approximately $4,800/month.
Input vs output cost dynamics
Output tokens dominate per-request cost in most production workloads despite being fewer in count. A typical enterprise workload with 2,000 input tokens and 500 output tokens on gpt-4.1 splits cost as follows: input contributes $0.004, output contributes $0.004 — already equal, even at a 4:1 token ratio. Workloads with chain-of-thought completions, verbose structured outputs, or no max_tokens constraint shift this split heavily toward output.
| Model | Input $/1M | Output $/1M | Output premium | Cached read $/1M |
|---|---|---|---|---|
| gpt-4.1 | $2.00 | $8.00 | 4× | $0.50 |
| gpt-4o-mini | $0.15 | $0.60 | 4× | $0.075 |
| claude-opus-4 | $15.00 | $75.00 | 5× | $1.50 |
| claude-3-5-haiku | $0.80 | $4.00 | 5× | $0.08 |
| gemini-2.5-pro | $1.25 | $10.00 | 8× | provider-dependent |
| gemini-2.0-flash | $0.10 | $0.40 | 4× | provider-dependent |
The output premium ratio means that output token reduction has the highest cost impact per token saved. Setting explicit max_tokens limits, using structured output schemas that constrain length, and breaking multi-part responses into separate focused calls all reduce output token spend without changing model quality.
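For illustration, capping completion length with the OpenAI Python SDK might look like the sketch below; the model name and limit are placeholders, and newer reasoning models expect max_completion_tokens rather than max_tokens, so check the SDK documentation for your model family:

```python
from openai import OpenAI

client = OpenAI()

# An explicit ceiling keeps verbose completions from silently inflating output spend.
response = client.chat.completions.create(
    model="gpt-4.1",   # placeholder model name
    max_tokens=300,    # hard cap on billable output tokens for this call
    messages=[
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Summarize the attached incident report."},
    ],
)

print(response.usage.completion_tokens)  # authoritative billed output count
```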
Prompt caching: when it applies
Prompt caching reduces input token cost for repeated prefix content. It applies when the prefix — system prompt, tool definitions, static instructions, or document context — exceeds the provider's minimum cacheable length and has not expired.
Cache eligibility is provider-specific. OpenAI automatically caches prefixes of at least 1,024 tokens with a 5–10 minute TTL. Anthropic requires explicit cache_control markers and supports longer TTLs. Neither provider guarantees a cache hit; billing reflects actual hit/miss status in response metadata.
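As a sketch of the Anthropic side, the explicit marker looks roughly like this with the anthropic Python SDK; the model identifier and prompt content are placeholders, and the provider's prompt-caching documentation governs the exact field semantics:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder for a long, stable prefix (system prompt, tool definitions, policy text).
LONG_STATIC_SYSTEM_PROMPT = "..."

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model identifier
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # marks the prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's escalations."}],
)

# Billing reflects what actually happened, not intent:
print(response.usage.cache_read_input_tokens)      # billed at the reduced cache-read rate
print(response.usage.cache_creation_input_tokens)  # billed at the cache-write rate
```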
- Static system prompts exceeding 1,024 tokens are primary caching candidates — they repeat verbatim on every request
- Tool definitions for large tool sets add significant per-request input overhead; caching eliminates this if the tool set is stable
- Document context in RAG pipelines can be cached per document if the retrieved chunk set is reused across a session
- Cache hit rate degrades when system prompt content changes frequently — dynamic fields in prompts (user names, current timestamps) must be placed after the cacheable prefix, not within it
- Measure actual cache hit rate from response metadata before assuming caching is reducing costs
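A small sketch of that measurement, assuming usage payloads have been logged as dictionaries shaped like OpenAI's usage object (the helper name and sample events are illustrative):

```python
def cache_hit_ratio(usage_events: list[dict]) -> float:
    """Fraction of prompt tokens billed at the cached rate across logged usage
    payloads. Field names mirror OpenAI's usage object; adapt for other providers."""
    prompt = sum(e["prompt_tokens"] for e in usage_events)
    cached = sum(
        e.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        for e in usage_events
    )
    return cached / prompt if prompt else 0.0


# Two logged requests: one full miss, one partial hit on a 1,024-token prefix.
events = [
    {"prompt_tokens": 2300, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 2300, "prompt_tokens_details": {"cached_tokens": 1024}},
]
print(f"{cache_hit_ratio(events):.1%}")  # ~22.3%
```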
Reasoning tokens: the invisible cost layer
Reasoning models generate internal chain-of-thought tokens before producing the visible response. These tokens are billed at output rates but are not visible in the completion text. For OpenAI o1 and o3 models, reasoning tokens often exceed the visible output token count by 2–5× for complex tasks.
For a request to o1 with 2,000 input tokens, 300 visible output tokens, and 1,500 reasoning tokens: total cost is (2000/1000 × $0.015) + (300/1000 × $0.06) + (1500/1000 × $0.06) = $0.03 + $0.018 + $0.09 = $0.138. A naive estimate ignoring reasoning tokens would calculate $0.048 — a 65% underestimate.
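The same arithmetic, sketched against the shape of an OpenAI usage payload. Values are hardcoded to match the worked example; note that completion_tokens already includes the reasoning tokens on o-series models, so the details breakdown is what separates visible from hidden output:

```python
# Mirrors usage from an OpenAI o-series response; values match the example above.
usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 1800,  # includes reasoning tokens on o-series models
    "completion_tokens_details": {"reasoning_tokens": 1500},
}
input_rate, output_rate = 0.015, 0.06  # o1 per-1k rates used above

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning  # 300 visible output tokens

cost = (
    usage["prompt_tokens"] / 1000 * input_rate
    + visible / 1000 * output_rate
    + reasoning / 1000 * output_rate
)
print(f"${cost:.3f}")  # $0.138, vs $0.048 if reasoning tokens were ignored
```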
Warning
Do not use reasoning models for tasks that do not benefit from multi-step reasoning. Classification, summarization, and extraction workloads rarely justify reasoning model costs. Route to reasoning models only for tasks where the reasoning process demonstrably improves output quality.
Cost flow: from call site to invoice
1. Application sends inference request
   - The LLM client assembles the prompt (system + conversation + context) and sends it to the provider API
   - Token count at this stage is not known precisely — only estimated if counted client-side
2. Provider tokenizes and executes
   - The provider tokenizer determines the exact input token count, including system, user, tool definitions, and tool outputs
   - The model generates completion tokens; reasoning models consume reasoning tokens before the visible response
3. Provider returns usage in response
   - The response payload includes usage.prompt_tokens, usage.completion_tokens, and, for cached calls, usage.prompt_tokens_details.cached_tokens
   - These are the authoritative numbers for cost calculation — not estimates
4. Application emits usage event
   - Extract inputTokens, outputTokens, and cachedTokens from the response usage object (a sketch of this extraction follows the list)
   - Emit to the cost ingestion pipeline with provider, model, requestId, and attribution labels
   - If estimatedCostUsd is available from the provider response (some SDKs surface this), include it to bypass catalog lookup
5. Ingestion pipeline stores and attributes cost
   - Catalog lookup assigns per-token rates if estimatedCostUsd is not supplied
   - The event is stored under org, project, environment, and feature scope for dashboard attribution
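Steps 3 and 4 in code, as a rough sketch: the response shape follows OpenAI's usage object, the event field names follow the conventions above, and build_usage_event plus the choice to make inputTokens exclude the cached portion are illustrative assumptions, not a documented contract:

```python
def build_usage_event(response, *, provider: str, model: str, feature: str) -> dict:
    """Translate a provider response into a cost ingestion event (step 4).

    Token counts come from the provider-reported usage object, never from
    client-side estimates.
    """
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return {
        "provider": provider,
        "model": model,
        "requestId": response.id,
        "feature": feature,                           # attribution label
        "inputTokens": usage.prompt_tokens - cached,  # assumes cached is a subset of prompt_tokens
        "cachedTokens": cached,
        "outputTokens": usage.completion_tokens,
    }
```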
Practical examples
Customer support ticket classifier — gpt-4o-mini, system prompt 800 tokens, user message 300 tokens, classification output 50 tokens, no caching: cost = (1100/1000 × $0.00015) + (50/1000 × $0.0006) = $0.000165 + $0.00003 = $0.000195 per request. At 200,000 tickets/month: $39/month.
Document analysis — claude-opus-4, 12,000-token document + 500-token instruction, 800-token analysis output: cost = (12500/1000 × $0.015) + (800/1000 × $0.075) = $0.1875 + $0.06 = $0.2475 per document. At 5,000 documents/month: $1,237.50/month. Adding prompt caching for the 12,000-token document at $0.0015/1k reduces per-document cost to: (500/1000 × $0.015) + (12000/1000 × $0.0015) + (800/1000 × $0.075) = $0.0075 + $0.018 + $0.06 = $0.0855 — a 65% reduction.
RAG pipeline — gemini-2.5-pro, 500-token query + 3,000-token retrieved context, 400-token answer: cost = (3500/1000 × $0.00125) + (400/1000 × $0.01) = $0.004375 + $0.004 = $0.008375 per query. At 50,000 queries/month: $418.75/month.
Common pitfalls
- Estimating token counts from character length — tokenization varies by model and language; character-to-token ratios are not stable
- Ignoring tool definition tokens — a set of 20 function definitions can add 800–2,000 input tokens on every call that passes them
- Assuming caching reduces cost without verifying hit rate — cache misses are billed at full input rates
- Using output token averages without tracking the distribution — p95 and p99 output token counts are often 3–5× the median and drive disproportionate cost
- Conflating completion tokens with reasoning tokens for o1/o3 models — visible output is only part of what is billed
- Treating cost estimates as fixed — provider pricing changes; catalog lookups should use current rates, not hardcoded constants
Recommended approach
1. Always extract token counts from provider response objects
   - Use usage.prompt_tokens and usage.completion_tokens from the response, not pre-call estimates
2. Track output token distribution, not just the mean
   - p50, p90, and p99 output token counts reveal whether long-tail completions are inflating cost (see the percentile sketch after this list)
3. Identify your largest cost driver before optimizing
   - For most workloads, output token reduction yields more savings per engineering hour than input compression
4. Measure cache hit rate if caching is enabled
   - Use response metadata to verify that cache hit rates justify any prompt restructuring investment
5. Treat reasoning tokens as a separate budget category
   - Reasoning models require separate cost modeling — do not apply non-reasoning model rates to reasoning workloads
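A minimal sketch for step 2, assuming output token counts have already been collected from response usage objects (the helper name and sample values are illustrative):

```python
import statistics


def output_token_percentiles(counts: list[int]) -> dict[str, float]:
    """p50/p90/p99 of observed output token counts from logged usage data."""
    qs = statistics.quantiles(counts, n=100, method="inclusive")
    return {"p50": qs[49], "p90": qs[89], "p99": qs[98]}


# A long-tail distribution: the largest completions are several times the median.
sample = [180, 220, 240, 250, 260, 280, 300, 320, 900, 1500]
print(output_token_percentiles(sample))
```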
CostLynx alignment
CostLynx ingests inputTokens, outputTokens, and cachedTokens per event via POST /api/v1/usage/ingest. If estimatedCostUsd is provided in the event payload, it is stored directly without a pricing catalog lookup — useful for reasoning models where provider-supplied cost figures are more accurate than catalog estimates. The Costs dashboard breaks down spend by input, output, and model dimension over time.
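A hedged sketch of the ingest call using the requests library; the base URL, bearer-token auth scheme, and environment variable are placeholders, and only the field names already described above are assumed to be part of the payload:

```python
import os

import requests

event = {
    "provider": "openai",
    "model": "gpt-4.1",
    "requestId": "req_123",        # placeholder request identifier
    "project": "support-triage",   # attribution scope, as described above
    "environment": "production",
    "feature": "ticket-classifier",
    "inputTokens": 1500,
    "outputTokens": 600,
    "cachedTokens": 800,
    # "estimatedCostUsd": 0.0082,  # optional: stored directly, skipping the catalog lookup
}

resp = requests.post(
    "https://costlynx.example.com/api/v1/usage/ingest",  # placeholder host
    json=event,
    headers={"Authorization": f"Bearer {os.environ['COSTLYNX_API_KEY']}"},  # placeholder auth
    timeout=10,
)
resp.raise_for_status()
```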