
Glossary

Enterprise AI FinOps definitions for production cost governance, ingestion, and multi-provider operations.


Core LLM and Pricing Fundamentals

Foundational concepts used in production AI cost accounting and model operations.

Input tokens (prompt tokens)

#input-tokens

Input tokens (prompt tokens) include all tokens sent before generation: system instructions, user content, tool output, and retrieved context. In enterprise AI workloads, prompt growth is usually the largest driver of inference spend variance.

Why it matters: Controlling input token volume is one of the highest-leverage cost controls in production systems.

See also: Context window, Cost per request, Cost optimization

Output tokens (completion tokens)

#output-tokens

Output tokens (completion tokens) are the tokens generated by the model during response decoding. They are commonly priced higher per token than input tokens, and their volume can vary significantly with prompt design and output constraints.

Why it matters: Unbounded output token growth can rapidly increase per-request cost and destabilize budget controls.

See also: Cost per request, Cost anomaly detection

Context window

#context-window

The context window is the maximum number of tokens a model can process in a single inference request; for many models this budget covers both the prompt and the generated output. It constrains how much conversation history, retrieval context, and tool state can be included before truncation or summarization.

Why it matters: Context-window strategy directly impacts latency, response quality, and total token spend.

See also: Input tokens (prompt tokens), Throughput vs cost tradeoffs
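
A minimal sketch of one common context-window strategy: trimming the oldest non-system conversation turns until the prompt fits a token budget. The 4-characters-per-token estimate is a rough stand-in for a real tokenizer.

```typescript
type Turn = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic (~4 characters per token); replace with a real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function fitToContextWindow(turns: Turn[], maxInputTokens: number): Turn[] {
  const kept = [...turns];
  const total = () =>
    kept.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  // Keep the system turn at index 0; drop the oldest other turns first.
  while (kept.length > 1 && total() > maxInputTokens) {
    kept.splice(1, 1);
  }
  return kept;
}
```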

Inference

#inference

Inference is the runtime execution path from prompt submission to model response. FinOps-grade inference telemetry should capture provider, model, input tokens, output tokens, latency, and request identifiers.

Why it matters: Inference is the billable operation, so operational cost control depends on request-level instrumentation.

See also: Request-level tracking, Observability (AI context), Rate limiting

LLM

#llm

An LLM (large language model) is a model used for generation, transformation, and reasoning tasks through provider APIs or managed endpoints. Enterprise platforms typically run multiple LLMs under routing policies rather than relying on a single model.

Why it matters: LLM portfolio design determines baseline cost structure, resilience options, and operational complexity.

See also: Multi-provider architecture, Vendor lock-in, Model pricing

Model pricing

#model-pricing

Model pricing is the provider's tariff for inference, usually expressed as separate rates per million input tokens and per million output tokens. Effective pricing depends on route mix, regional contract terms, and selected provider tiers.

Why it matters: Accurate model-pricing assumptions are required for realistic forecasting and policy guardrails.

See also: Pricing catalog normalization, Cost per request, Unit economics
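
An illustrative pricing record with the resulting per-request arithmetic. The model names and dollar figures are placeholders, not real tariffs.

```typescript
type ModelPrice = {
  inputPerMTok: number; // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
};

const catalog: Record<string, ModelPrice> = {
  "example-large": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  "example-small": { inputPerMTok: 0.25, outputPerMTok: 1.25 },
};

// 12,000 input + 800 output tokens on "example-large":
// 12000/1e6 * 3.00 + 800/1e6 * 15.00 = 0.036 + 0.012 = 0.048 USD
```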

AI FinOps and Unit Economics

Financial operating concepts that convert model usage into accountable business outcomes.

AI FinOps

#ai-finops

AI FinOps is the operating discipline for managing inference cost with shared ownership across engineering, platform, and finance. It links technical drivers (tokens, retries, model routing) to financial outcomes (variance, margin, and budget adherence).

Why it matters: Without AI FinOps, production scale amplifies cost leakage faster than teams can correct it.

See also: Cost governance, Cost visibility, Cost attribution

Cost attribution

#cost-attribution

Cost attribution maps spend to accountable dimensions such as team, feature, project, environment, and customer segment. In production systems, attribution requires labels captured at call time and preserved through usage ingestion.

Why it matters: Attribution is the prerequisite for chargeback, ownership, and targeted optimization.

See also: Feature-level attribution, Project/environment segmentation, Chargeback

Unit economics

#unit-economics

Unit economics measures cost relative to a business unit of value, such as a resolved ticket or completed workflow. For AI systems, this requires combining inference spend with product outcomes rather than relying only on provider invoice aggregates.

Why it matters: Unit economics determines whether AI features scale with positive margin or hidden loss.

See also: Cost per workflow, Cost per request, Budgeting
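
A minimal sketch of the calculation, using a resolved support ticket as the hypothetical unit of value; the field names are illustrative, not a fixed schema.

```typescript
type Period = { inferenceSpendUsd: number; resolvedTickets: number };

function costPerResolvedTicket(p: Period): number {
  return p.resolvedTickets > 0
    ? p.inferenceSpendUsd / p.resolvedTickets
    : NaN; // no outcomes: unit cost is undefined, not zero
}

// e.g. $4,200 of inference spend across 10,500 resolved tickets:
// costPerResolvedTicket({ inferenceSpendUsd: 4200, resolvedTickets: 10500 })
//   => 0.40 USD per resolved ticket
```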

Cost per request

#cost-per-request

Cost per request is the observed or expected inference cost for one model invocation based on input tokens, output tokens, and model pricing. It is most informative when request templates and model routes are stable.

Why it matters: It is an early-warning metric for prompt regressions, routing drift, and retry amplification.

See also: Model pricing, Cost anomaly detection, Cost optimization
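
A minimal sketch of the calculation from token counts and per-million-token rates (placeholder prices):

```typescript
function costPerRequest(
  inputTokens: number,
  outputTokens: number,
  inputPerMTok: number, // USD per 1M input tokens
  outputPerMTok: number, // USD per 1M output tokens
): number {
  return (
    (inputTokens / 1e6) * inputPerMTok + (outputTokens / 1e6) * outputPerMTok
  );
}

// costPerRequest(4000, 500, 3.0, 15.0) => 0.012 + 0.0075 = 0.0195 USD
```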

Cost per workflow

#cost-per-workflow

Cost per workflow is the total inference cost across all model calls and orchestration steps required to complete one end-to-end task. It captures multi-step system behavior that per-request metrics cannot represent alone.

Why it matters: Workflow-level cost is required for pricing strategy, roadmap prioritization, and margin governance.

See also: Unit economics, Feature-level attribution, Cost optimization
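
A minimal roll-up sketch, assuming a workflowId label is propagated onto every per-call cost record:

```typescript
type CallCost = { workflowId: string; costUsd: number };

function costPerWorkflow(calls: CallCost[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of calls) {
    // Sum every model call and orchestration step under its workflow.
    totals.set(c.workflowId, (totals.get(c.workflowId) ?? 0) + c.costUsd);
  }
  return totals;
}
```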

Budgeting

#budgeting

Budgeting defines spend targets and thresholds by accountable scope, including model, project, environment, and feature. Effective budget controls combine static limits with threshold-based escalation rules.

Why it matters: Budgeting turns cost management into a controllable operational process instead of month-end reconciliation.

See also: Cost governance, Cost visibility, Cost anomaly detection
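
An illustrative budget policy combining a static limit with threshold-based escalation rules; the shape and action names are assumptions, not a fixed schema.

```typescript
type BudgetPolicy = {
  scope: { project: string; environment: string };
  monthlyLimitUsd: number;
  thresholds: { atPercent: number; action: "notify" | "alert" | "block" }[];
};

const exampleBudget: BudgetPolicy = {
  scope: { project: "support-bot", environment: "production" },
  monthlyLimitUsd: 20000,
  thresholds: [
    { atPercent: 50, action: "notify" }, // early awareness
    { atPercent: 80, action: "alert" }, // escalation to owners
    { atPercent: 100, action: "block" }, // hard enforcement
  ],
};
```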

Cost optimization

#cost-optimization

Cost optimization is the structured reduction of inference spend while preserving quality, latency, and reliability requirements. Typical levers include prompt compression, route-to-cheaper-model policies, output controls, and retry tuning.

Why it matters: Optimization increases throughput capacity under fixed spend and delays forced budget expansion.

See also: Cost per request, Throughput vs cost tradeoffs, Rate limiting
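
A minimal sketch of one such lever, a route-to-cheaper-model policy; the tier names and the 2,000-token cutoff are hypothetical.

```typescript
type Route = { model: string; maxOutputTokens: number };

function chooseRoute(promptTokens: number, needsReasoning: boolean): Route {
  if (!needsReasoning && promptTokens < 2000) {
    // Simple, short requests go to the cheaper tier with a tight output cap.
    return { model: "example-small", maxOutputTokens: 512 };
  }
  return { model: "example-large", maxOutputTokens: 2048 };
}
```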

Platform Instrumentation and Allocation

Operational mechanics required for trustworthy multi-provider cost tracking.

Usage ingestion

#usage-ingestion

Usage ingestion is the controlled capture of model-usage events into a centralized cost pipeline, such as `/api/v1/usage/ingest`. Each event should include provider, model, input tokens, output tokens, requestId, and feature/project/environment labels.

Why it matters: Centralized ingestion enables consistent cross-service and cross-provider spend accounting.

See also: Idempotency, Request-level tracking, Pricing catalog normalization
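
A minimal sketch of an event posted to the endpoint named above. The field names mirror this definition, though the exact schema may differ.

```typescript
const event = {
  provider: "example-provider",
  model: "example-large",
  inputTokens: 4000,
  outputTokens: 500,
  requestId: "req_01J9Z4Q8", // stable ID used for idempotent deduplication
  labels: {
    feature: "summarize",
    project: "support-bot",
    environment: "production",
  },
};

await fetch("/api/v1/usage/ingest", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(event),
});
```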

Idempotency

#idempotency

Idempotency ensures repeated submissions of the same logical usage event are counted once. In production ingestion pipelines, a stable requestId is commonly used to deduplicate retries and delayed replays.

Why it matters: Without idempotency, duplicate events inflate spend metrics and break financial trust.

See also: Usage ingestion, Request-level tracking, Cost visibility
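
A minimal in-memory dedup sketch keyed on requestId; a production pipeline would persist seen IDs in durable storage rather than a process-local set.

```typescript
const seen = new Set<string>();

function recordOnce(requestId: string, record: () => void): boolean {
  if (seen.has(requestId)) {
    return false; // duplicate: a retry or a delayed replay, counted once
  }
  seen.add(requestId);
  record();
  return true;
}
```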

Request-level tracking

#request-tracking

Request-level tracking ties all inference telemetry to a unique request identifier propagated across services, queues, and workers. It creates end-to-end lineage between application behavior and provider billing records.

Why it matters: Request-level lineage is critical for fast root-cause analysis of cost and reliability incidents.

See also: Usage ingestion, Observability (AI context), Cost anomaly detection
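
A minimal propagation sketch: generate an ID at the edge and forward it on every downstream hop. The x-request-id header is a common convention, not a requirement.

```typescript
import { randomUUID } from "node:crypto";

function withRequestId(
  headers: Record<string, string> = {},
): Record<string, string> {
  // Reuse an inbound ID if present; otherwise mint one at the boundary.
  return {
    ...headers,
    "x-request-id": headers["x-request-id"] ?? randomUUID(),
  };
}

// Every service, queue, and worker forwards the same x-request-id, so
// application logs, provider usage, and ingested cost events share one key.
```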

Feature-level attribution

#feature-attribution

Feature-level attribution tags each usage event with the product feature that triggered the call. It separates feature-owned spend from shared platform spend within the same project.

Why it matters: Feature tagging creates clear ownership boundaries for cost controls and optimization work.

See also: Cost attribution, Cost per workflow, Showback

Project/environment segmentation

#project-environment-segmentation

Project/environment segmentation partitions usage into project and environment scopes, such as production, staging, and development. Segmentation allows different alert thresholds, budgets, and enforcement policies per scope.

Why it matters: Segmentation prevents non-production traffic from obscuring production cost behavior and risk.

See also: Feature-level attribution, Budgeting, Cost governance
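
An illustrative per-scope threshold table, with production alerting far tighter than staging or development; the project name and dollar values are placeholders.

```typescript
const alertThresholdsUsdPerDay: Record<string, number> = {
  "support-bot/production": 1000,
  "support-bot/staging": 100,
  "support-bot/development": 25,
};

function scopeKey(project: string, environment: string): string {
  return `${project}/${environment}`;
}
```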

Chargeback

#chargeback

Chargeback allocates inference cost to consuming teams' budgets using predefined rules and audit trails. Successful chargeback depends on consistently high attribution coverage and controlled exception handling.

Why it matters: Chargeback aligns technical consumption decisions with financial accountability at scale.

See also: Showback, Cost attribution, Project/environment segmentation

Showback

#showback

Showback provides transparent spend reporting to consuming teams without direct budget transfer. It is commonly used to establish ownership behavior and data quality before introducing chargeback.

Why it matters: Showback drives cost-aware engineering decisions with lower organizational friction.

See also: Chargeback, Cost visibility, Feature-level attribution

Pricing catalog normalization

#pricing-catalog-normalization

Pricing catalog normalization maps provider-specific model names and pricing entries into a canonical internal catalog. It ensures ingestion and reporting systems evaluate costs with the same model and provider taxonomy.

Why it matters: Normalized pricing is required for accurate multi-provider comparisons and policy enforcement.

See also: Model pricing, Usage ingestion, Multi-provider architecture
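
A minimal sketch of the mapping; the provider and model identifiers are invented aliases, not a complete catalog.

```typescript
const canonicalModel: Record<string, string> = {
  "providerA:large-2025-01": "example-large",
  "providerB:big-v3": "example-large",
  "providerA:mini-2025-01": "example-small",
};

function normalize(provider: string, model: string): string {
  const key = `${provider}:${model}`;
  // Surface unmapped identifiers explicitly instead of silently dropping them.
  return canonicalModel[key] ?? `unmapped:${key}`;
}
```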

Enterprise Governance and Operations

Control-plane concepts for resilient and financially governed AI platforms.

Multi-provider architecture

#multi-provider-architecture

Multi-provider architecture routes inference across multiple providers to balance cost, latency, capability, and resilience. It introduces heterogeneous APIs, rate limits, and pricing semantics that require centralized normalization.

Why it matters: Without unified governance, multi-provider scale quickly fragments cost control and accountability.

See also: Pricing catalog normalization, Vendor lock-in, Cost governance

Vendor lock-in

#vendor-lock-in

Vendor lock-in is the technical and commercial friction of moving workloads away from a provider due to API coupling, contract terms, or model-specific dependencies. Lock-in risk increases when abstraction and routing layers are weak.

Why it matters: High lock-in reduces negotiation leverage and increases long-term cost and availability risk.

See also: Multi-provider architecture, Cost governance

Cost anomaly detection

#cost-anomaly-detection

Cost anomaly detection identifies abnormal spend or token behavior against historical baselines at scoped levels such as feature, model, project, or environment. Effective rules combine statistical sensitivity with minimum-spend floors to reduce noise.

Why it matters: Early anomaly detection limits financial blast radius from retries, abuse, and routing failures.

See also: Cost visibility, Observability (AI context), Budgeting
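
A minimal sketch combining a z-score test against the historical baseline with a minimum-spend floor; the floor and threshold values are illustrative defaults.

```typescript
function isAnomalous(
  todayUsd: number,
  baselineMeanUsd: number,
  baselineStdUsd: number,
  minSpendFloorUsd = 50, // illustrative floor to suppress noise on tiny scopes
  zThreshold = 3,
): boolean {
  if (todayUsd < minSpendFloorUsd) return false;
  if (baselineStdUsd === 0) return todayUsd > baselineMeanUsd;
  return (todayUsd - baselineMeanUsd) / baselineStdUsd > zThreshold;
}
```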

Cost governance

#cost-governance

Cost governance defines enforceable policies for who can spend, where, and under what conditions, including model allowlists, budget caps, and escalation paths. Governance policies should be configurable without changing application business logic.

Why it matters: Governance prevents uncontrolled inference growth from becoming an enterprise financial incident.

See also: Budgeting, Cost visibility, Rate limiting
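
An illustrative policy check, evaluated as configuration outside application business logic; the policy shape and field names are assumptions.

```typescript
type Policy = { allowedModels: string[]; monthlyCapUsd: number };

function authorize(
  p: Policy,
  model: string,
  monthToDateUsd: number,
): { allowed: boolean; reason?: string } {
  if (!p.allowedModels.includes(model)) {
    return { allowed: false, reason: "model not on allowlist" };
  }
  if (monthToDateUsd >= p.monthlyCapUsd) {
    return { allowed: false, reason: "budget cap reached" };
  }
  return { allowed: true };
}
```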

Cost visibility

#cost-visibility

Cost visibility is timely, shared access to normalized spend metrics by provider, model, feature, project, and environment. It requires consistent definitions across engineering and finance reporting systems.

Why it matters: Shared visibility enables coordinated operational decisions and reduces cross-team cost disputes.

See also: Cost attribution, Showback, Observability (AI context)

Observability (AI context)

#ai-observability

AI observability correlates runtime metrics (latency, errors, retries, route decisions) with inference spend and token distributions at request granularity. It extends traditional observability to include model behavior as a cost driver.

Why it matters: Correlated observability shortens incident response when failures affect both reliability and spend.

See also: Request-level tracking, Cost anomaly detection, Inference

Rate limiting

#rate-limiting

Rate limiting enforces request, token, or concurrency ceilings at API and service boundaries by tenant, feature, or environment. It is usually paired with backoff and queue controls to stabilize traffic.

Why it matters: Rate limiting protects both system capacity and spend from traffic spikes and abuse.

See also: Throughput vs cost tradeoffs, Cost governance, Inference
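
A minimal token-bucket sketch for one ceiling; per-tenant, per-feature, or per-environment buckets would be keyed and stored elsewhere.

```typescript
class TokenBucket {
  private tokens: number;
  private last = Date.now();

  constructor(
    private capacity: number,
    private refillPerSec: number,
  ) {
    this.tokens = capacity;
  }

  tryAcquire(cost = 1): boolean {
    // Refill proportionally to elapsed time, capped at bucket capacity.
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.refillPerSec,
    );
    this.last = now;
    if (this.tokens < cost) return false; // caller should back off or queue
    this.tokens -= cost;
    return true;
  }
}
```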

Throughput vs cost tradeoffs

#throughput-vs-cost

Throughput vs cost tradeoffs describe how volume targets, latency requirements, and inference spend are balanced under constrained budgets. Decisions about batching, concurrency, model tiering, and routing directly change both capacity and unit cost.

Why it matters: Explicit tradeoff management is necessary to scale workloads without breaching financial guardrails.

See also: Rate limiting, Cost optimization, Context window