AI Cost Observability: Measuring and Justifying Token Spend
A recap of our recent webinar on applying FinOps practices to AI token costs - from provider data gaps to developer-level attribution and model switching tradeoffs.

We recently hosted a webinar on applying FinOps practices to AI token spend. Brooke, Vantage CTO and co-founder, covered the problem landscape, and Rem, Vantage Product Lead, walked through how we're solving it. The Q&A at the end drove some of the best discussion - it's clear that engineering and finance leaders are actively working through these challenges.
Here's a recap of what was covered.
The Landscape: Why AI Token Spend Needs FinOps Now
LLM token usage has grown rapidly across two fronts. On one side, developers are using tools like Cursor, Claude Code, and Codex as part of their daily workflows. On the other, production applications are increasingly powered by agentic AI - autonomous agents consuming tokens at scale with no human in the loop.
Both categories are usage-based and non-deterministic. Token costs vary based on model selection, context window depth, session length, and the behavior of the agent or developer driving the interaction. For engineering organizations that have spent years building predictability into their cloud spend, this is a familiar problem in an unfamiliar shape.
The webinar focused on four core areas: the non-deterministic nature of token costs, how existing FinOps practices apply to AI spend, the data gaps across providers, and what companies are doing today to measure and justify their AI investment. The sections below summarize each.
Token Costs Are Variable by Design
Brooke opened the webinar with a comparison that resonated: an EC2 instance running for 720 hours produces the same bill every month. A developer using Cursor or Claude Code for the same workload can generate a token bill that varies by an order of magnitude depending on model choice, context window depth, session length, and whether an agent goes down an unproductive path.
This variance shows up on both sides of the income statement. Developers using AI coding tools generate token costs on the R&D side. Deployed agents powering production features generate them on the COGS side. Neither is predictable, and the volume on both continues to grow.
That R&D distinction is worth paying attention to. Traditional cloud infrastructure costs map neatly to cost of goods sold - they power the product your customers use. But the bulk of new token spend is coming from engineering workflows - developer tooling, AI-assisted code generation, and model training for companies building their own models. It's a line item that didn't exist two years ago, and CTOs are increasingly being asked to justify it.
The ask is straightforward: your engineering leader needs to go to the finance team or the board and say, "yes, this spend is valid and here's why," backed by data. That requires the same measurement infrastructure that FinOps teams have built for cloud spend.
The FinOps Playbook Maps Directly to AI Spend
Budgeting, anomaly detection, cost allocation, forecasting - all of it applies to token spend. The data sources are different, but the questions are the same ones you're already asking about your AWS bill: who is spending, on what, and is it worth it?
One notable difference works in your favor: AI spend is more tractable than cloud infrastructure in one respect - you can drill down to individual developers. Cloud cost allocation typically stops at the team, product, or business unit level. Token usage providers like Anthropic and Cursor expose spend per API key or per email address. That granularity is available from day one, before you've done any tagging or allocation work.
Engineering leaders are using specific developers as bellwethers. Pick the team's heaviest token consumer, study their workflow, understand whether their usage is efficient, and use that as a baseline for what "healthy" looks like at scale. When you have a thousand developers with a huge range of token usage, you need some reference points. The individual-level data gives you that.
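That bellwether approach is easy to sketch once you have per-developer spend data. The snippet below is a minimal illustration - the email addresses, dollar amounts, and the 5x threshold are all hypothetical placeholders, not data from any real provider export:

```python
# Sketch: flag developers whose token spend sits far above a baseline,
# so you can study their workflow before deciding whether to cap it.
# All names, amounts, and the threshold factor are illustrative.
from statistics import median

# Hypothetical monthly spend per developer (USD), as a provider's
# per-API-key or per-email billing breakdown might expose it.
spend = {
    "dev-a@example.com": 120.0,
    "dev-b@example.com": 95.0,
    "dev-c@example.com": 1450.0,  # heavy user - candidate bellwether
    "dev-d@example.com": 80.0,
    "dev-e@example.com": 340.0,
}

baseline = median(spend.values())

def flag_outliers(spend_by_dev, baseline, factor=5.0):
    """Return developers spending more than `factor` times the baseline."""
    return {
        dev: amount
        for dev, amount in spend_by_dev.items()
        if amount > factor * baseline
    }

outliers = flag_outliers(spend, baseline)
# dev-c sits well above 5x the median - worth studying, not reflexively cutting
```

The point is not the statistics; it's that the individual-level data makes this kind of reference-point analysis possible at all.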
We're also seeing companies spin up dedicated developer experience teams focused on tracking and optimizing AI tool usage across engineering. Sometimes that sits within the existing FinOps org, sometimes it's a new function entirely - but the toolset is the same.
Provider Data Is All Over the Map
If the FinOps practices map well, the data does not. Every AI provider structures their billing data differently, and building a unified view across them is a real challenge.
Anthropic and Cursor give you the most to work with out of the box - both break down costs to the API key or developer level, which is immediately useful for attribution and anomaly detection. OpenAI provides some breakdown in billing and more in supplemental APIs, but it takes more work to get developer-specific visibility. AWS Bedrock is the least granular. Because you're proxying through AWS to models from Anthropic, OpenAI, and others, the costs come through as marketplace purchases - you get model and account, but you lose API key and developer-level attribution entirely.
And none of these providers offer the kind of flexible tagging you'd put on an AWS resource. Team, environment, cost center - that metadata doesn't exist natively in AI billing data.
The workaround is enrichment from other sources. AI gateways like OpenRouter and Cloudflare's AI Gateway sit between your application and the model providers, capturing request-level telemetry that billing data lacks. Bedrock's model invocation logging does something similar - writing detailed request data to CloudWatch that you can project onto billing costs to fill the attribution gaps.
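Mechanically, that enrichment is a proportional allocation: split each billed line item across callers according to their share of logged usage. The sketch below illustrates the idea with made-up schemas - the field names and numbers are assumptions for illustration, not any provider's actual billing or logging format:

```python
# Sketch: project request-level telemetry (from a gateway or invocation
# logging) onto billing line items to recover caller-level attribution.
# Schemas, field names, and amounts are illustrative assumptions.

billing = [  # what the bill gives you: model and cost, no caller identity
    {"model": "claude-sonnet", "cost": 90.0},
]

telemetry = [  # what a gateway or invocation log captures per request
    {"model": "claude-sonnet", "caller": "team-search", "input_tokens": 600_000},
    {"model": "claude-sonnet", "caller": "team-checkout", "input_tokens": 300_000},
]

def allocate(billing, telemetry):
    """Split each billed cost across callers by their token share."""
    allocated = {}
    for item in billing:
        rows = [t for t in telemetry if t["model"] == item["model"]]
        total = sum(t["input_tokens"] for t in rows)
        for t in rows:
            share = item["cost"] * t["input_tokens"] / total
            allocated[t["caller"]] = allocated.get(t["caller"], 0.0) + share
    return allocated

# team-search carried two-thirds of the tokens, so it carries $60 of the $90
```

Real pipelines have to handle cached tokens, output-token pricing, and time-window mismatches between logs and bills, but the core join looks like this.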
We've been building Vantage's LLM token allocation around this exact problem. Pull in telemetry from gateways and logging pipelines, enrich billing data with team and environment metadata, normalize it all into a single view across providers. The goal is parity regardless of whether you're buying tokens from Anthropic directly or routing through Bedrock.
Measuring ROI, Not Just Cost
You can track every dollar and still miss the point. The real question isn't what you're spending on tokens - it's what those tokens are producing.
Tokens per feature is starting to show up as a real metric in R&D planning. If a team burns $3,000 in tokens shipping a feature, that's a meaningful input to the budget conversation. Cost per experiment matters too - how much did a model training run cost, and how did the resulting model actually perform in production? That connects research spend to business outcomes in a way most teams haven't been tracking. And when you combine token costs with headcount for a cost-per-PR or cost-per-release view, the numbers start telling you something useful about what engineering output actually costs all-in.
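The cost-per-PR math is simple enough to sketch in a few lines. Every number below is an illustrative placeholder (the loaded headcount cost especially), but it shows why the combined view is more informative than the token bill alone:

```python
# Sketch: all-in cost per PR, combining token spend with loaded headcount.
# All figures are illustrative placeholders, not benchmarks.

token_spend = 3_000.0          # monthly token bill for the team, USD
loaded_headcount = 8 * 20_000  # 8 engineers at an assumed loaded monthly cost
prs_merged = 160               # PRs shipped that month

cost_per_pr = (token_spend + loaded_headcount) / prs_merged
token_share = token_spend / (token_spend + loaded_headcount)
# tokens are under 2% of all-in engineering cost in this example -
# context that a raw $3,000 line item never gives you
```

Framed this way, the budget conversation shifts from "the token bill grew" to "what did each unit of output cost, and is the token share of that cost buying leverage?"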
Then there's model comparison, which is getting more complicated every quarter. Earlier this spring, both Anthropic and OpenAI released models with million-token context windows. Deeper context means better results for complex tasks, but the cost scales with depth - a developer working in a million-token context window is spending meaningfully more per session than one working in a 200K window. Most teams haven't fully accounted for how much that difference affects their bill.
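The cost scaling with context depth is straightforward to model. The sketch below assumes a flat per-million-token input price (the rate is a placeholder, not any provider's actual pricing, and it ignores caching and output tokens):

```python
# Sketch: how context depth changes per-session input cost.
# The $/million-token rate is an illustrative placeholder.

INPUT_PRICE_PER_M = 3.0  # assumed USD per million input tokens

def session_input_cost(context_tokens, turns, price_per_m=INPUT_PRICE_PER_M):
    """Rough input cost for a session that resends full context each turn."""
    return context_tokens * turns * price_per_m / 1_000_000

small = session_input_cost(200_000, turns=20)
large = session_input_cost(1_000_000, turns=20)
# same workflow, 5x the input cost purely from context depth
```

Prompt caching and tiered long-context pricing change the exact numbers, but the direction holds: context depth is a first-order cost driver, not a rounding error.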
During the webinar Q&A, someone asked directly: are teams actually switching models based on cost? Six months ago, the answer was no - you just used the best available model. Today, yes. The cost difference between Anthropic's Opus and OpenAI's flagship models, or between the same reasoning model run at medium versus high reasoning effort, is real enough that developers are making these tradeoffs actively. Some teams are picking specific models for specific workloads, which would have seemed excessive a year ago.
Spend Controls That Don't Kill Productivity
Most LLM providers, along with tools like Cursor, offer spend limits - per account and per individual user or API key. Every major platform has been forced to build these controls because it's so easy to consume large token volumes quickly.
The pattern we're seeing in practice is generous limits. Companies set caps high enough that normal work never hits them, but low enough that a misconfigured agent can't generate a five-figure bill overnight. Then they adjust over time as they build a baseline for normal usage. The alternative - tight per-developer caps - sounds responsible but creates a different problem: developers hitting their limits daily and losing productive time waiting for resets.
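One simple way to implement the "generous cap" pattern is to peg the limit to a multiple of observed baseline usage, so the cap grows with legitimate adoption instead of requiring constant manual resets. The sketch below is a hypothetical policy - the 10x multiple and $50 floor are illustrative, not recommendations:

```python
# Sketch: a generous spend cap pegged to observed baseline usage.
# The multiple and floor are illustrative policy choices.

def effective_cap(baseline_daily_spend, multiple=10.0, floor=50.0):
    """Set the per-user daily cap well above observed normal usage."""
    return max(multiple * baseline_daily_spend, floor)

def allow_request(spend_today, baseline_daily_spend):
    """Block only when today's spend exceeds the generous cap."""
    return spend_today < effective_cap(baseline_daily_spend)

# a dev averaging $8/day gets an $80 cap: room for heavy days,
# but a runaway agent burning $95 before noon gets stopped
ok = allow_request(spend_today=40.0, baseline_daily_spend=8.0)
blocked = not allow_request(spend_today=95.0, baseline_daily_spend=8.0)
```

The key design choice is that the cap is a circuit breaker, not a budget: it should trip on runaways, not on productive heavy use.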
On the Vantage side, we're layering cross-provider visibility with the standard FinOps toolkit. LLM token allocation to normalize and enrich provider data, virtual tagging to add the business context that billing data lacks. Budgets and anomaly detection do what they've always done - just applied to a new cost category. And for providers or gateways we haven't built a native integration for yet, you can upload data through the custom provider and FOCUS spec path and still get LLM enrichment on top of it.
What Companies Are Actually Doing
Someone in the webinar Q&A asked this directly, and Brooke's answer was candid: most companies are still in the "measure and understand" phase. The developers burning the most tokens might be the most effective on the team. High spend isn't inherently bad, and cutting it prematurely could kill the productivity gains that justified the tools in the first place.
But the urgency is shifting. The spend numbers are getting large enough that deferring the problem is no longer viable. CTOs are building measurement infrastructure now - connecting token costs to engineering output and establishing baselines - so they can make informed decisions instead of reactive ones.
The FinOps discipline already has the vocabulary and the organizational patterns for this. The AI providers are starting to surface the data. The gap between those two things is where most of the work is right now, and it's closing faster than the cloud cost management cycle ever did.
Sign up for a free trial.
Get started with tracking your cloud costs.

