Real-time AI billing: latency, architecture, and trade-offs

In real-time AI billing, the platform authorizes a request against the customer's current balance; the usage is then recorded, priced, and debited in one atomic step the moment it happens. Invoice-based billing records usage now and settles it at the end of a cycle, days later. The gap between usage and the debit is where margin leaks under agentic load; even Google's mostly real-time Gemini API billing carries a roughly 10-minute window where prepaid spend can overshoot a cap. This article argues that billing-path latency is an architectural decision, not an implementation detail.

Why billing latency became a financial problem

Billing latency became a financial problem the moment inference cost started landing before the invoice did. For most of SaaS history the lag was harmless: a customer ran a report on Tuesday, saw the charge next month, and the vendor earned more than it spent regardless. AI broke that arrangement. Every inference call spends real compute the instant it runs, and the vendor's compute bill arrives whether or not the customer ever pays. Closing the gap between cost incurred and cost recorded is the entire job of real-time billing infrastructure, and it is why a question that used to be back-office is now an architecture decision.

The 2026 numbers make the point better than any argument. By April, Uber had burned through its entire 2026 AI coding budget after rolling Claude Code out to roughly 5,000 engineers; Microsoft revoked Claude Code licenses over consumption it called unsustainable; one engineer reportedly spent $40,000 on tokens in a single month (TechCrunch, June 2026). The same reporting cites the FinOps Foundation finding companies running 3x over their full-year 2026 token budgets in April and May, with per-developer token consumption up 18.6x in nine months. When spend moves that fast, a billing cycle measured in days is not a control surface. It is a delay between a decision and its bill.

What real-time AI billing actually means in a pipeline

Real-time billing means the platform authorizes a request against the customer's current balance, the action runs, and then the usage is recorded, priced, and debited in a single atomic step. Authorization comes first; billing follows the moment the usage happens, not at the end of a cycle. Invoice-based billing records usage now and turns it into money later. These are two different architectures, not two settings on the same one, and the difference is the whole story.

The invoice-based path has four stages separated by time: event emission, metering, reconciliation, and invoicing. Events fire from the product; a metering layer captures and aggregates them across the period; at cycle end a reconciliation job reads the meters and feeds totals into a billing engine; and the engine produces an invoice. Payment follows days or weeks later.
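A minimal sketch of that cycle-end path may make the delay concrete. Everything here is hypothetical and for illustration only: the event tuple layout, the `UNIT_PRICE` value, and the function names `reconcile` and `event_to_billed_latency` are assumptions, not any vendor's API.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from decimal import Decimal

UNIT_PRICE = Decimal("0.0004")  # hypothetical price per metered unit


def reconcile(events, cycle_end):
    """Cycle-end job: sum metered units per customer, then price the totals."""
    totals = defaultdict(int)
    for customer, units, occurred_at in events:
        if occurred_at <= cycle_end:
            totals[customer] += units
    return {customer: units * UNIT_PRICE for customer, units in totals.items()}


def event_to_billed_latency(occurred_at, cycle_end):
    """Time between usage happening and the system turning it into a charge."""
    return cycle_end - occurred_at


# Usage events accumulate through the period as (customer, units, timestamp).
events = [
    ("acme", 1000, datetime(2026, 3, 3)),
    ("acme", 500, datetime(2026, 3, 20)),
    ("globex", 200, datetime(2026, 3, 28)),
]
cycle_end = datetime(2026, 3, 31)

invoices = reconcile(events, cycle_end)
latency = event_to_billed_latency(events[0][2], cycle_end)
print(invoices["acme"], latency.days)
```

In this sketch nothing is priced until `reconcile` runs, so the first event waits 28 days before it becomes a charge.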

Real-time monetization closes that gap. It authorizes against the customer's current balance, lets the action run, then records, prices, and debits the usage in one atomic step as it arrives. It is the same authorize-then-record sequence we traced through the Salesforce m3ter acquisition. There is no reconciliation job because there is nothing to reconcile. The balance already moved.
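The authorize-then-record sequence can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any platform's implementation: the `Wallet` class, its method names, and the prices are hypothetical, and an in-memory lock stands in for whatever transactional store a real system would use.

```python
from dataclasses import dataclass, field
from decimal import Decimal
import threading


@dataclass
class Wallet:
    """Hypothetical in-memory balance; a real system needs a durable store."""
    balance: Decimal
    ledger: list = field(default_factory=list)
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def authorize(self) -> bool:
        # Control decision: may the next request run at all?
        with self._lock:
            return self.balance > 0

    def record_price_debit(self, units: int, unit_price: Decimal) -> Decimal:
        # Record, price, and debit under one lock, so no reader ever
        # sees an event that was recorded but not yet debited.
        with self._lock:
            cost = units * unit_price
            self.ledger.append((units, unit_price, cost))
            self.balance -= cost
            return self.balance


wallet = Wallet(balance=Decimal("1.00"))
if wallet.authorize():
    # The model call would run here; its output size is known only afterwards.
    remaining = wallet.record_price_debit(units=1500, unit_price=Decimal("0.0004"))
```

Note that the sketch authorizes on a positive balance rather than reserving cost up front, so a request can take the balance slightly negative; the next `authorize` call then sees the current figure and refuses.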

The practical difference is latency from event to billed. Invoice-based systems measure that latency in days. Real-time systems measure it in milliseconds. For a flat SaaS product, days is fine. For a product whose cost lands the instant a request runs, days means cost keeps accruing for the whole period before the system can act on it.

Where latency enters invoice-based pipelines

Latency in an invoice-based pipeline is structural, not a bug to be tuned away. The pipeline is designed to aggregate first and settle later, which means there is always a window between usage happening and the system being able to act on it.

Google states this plainly in its own documentation. The Gemini API billing docs note that prepay billing is subject to overages during an approximately 10-minute billing-pipeline latency. Long-running tasks like batch jobs and agent sessions can incur cost beyond a project's spend cap. That is a vendor describing the limit of its own enforcement. Even a prepaid model, which is about as close to real-time as an asynchronous billing layer gets, leaves a window where usage outruns the meter, whether the lag comes from cycle-end reconciliation or from an async pipeline straining at scale.
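A back-of-envelope bound shows why the window matters. The sketch below is a simplification, not Google's formula: it assumes a constant spend rate and a cap that can only be enforced once the pipeline catches up. The function name and the $2-per-minute figure are hypothetical.

```python
from decimal import Decimal


def worst_case_overshoot(spend_per_minute: Decimal, latency_minutes: Decimal) -> Decimal:
    """Spend that can land after the cap is hit but before enforcement sees it."""
    return spend_per_minute * latency_minutes


# Hypothetical agent session burning $2.00 per minute against a 10-minute lag.
overshoot = worst_case_overshoot(Decimal("2.00"), Decimal("10"))
print(overshoot)
```

Under these assumptions the session can run $20 past its cap. The bound scales linearly with both the spend rate and the latency, which is why shrinking the latency is the only lever that works regardless of how fast a customer spends.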

The same structural ceiling shows up elsewhere. Stripe's metering accepts usage backdated up to 35 days, which is useful for late-arriving events and is also a precise statement of how loosely coupled the meter is from the moment of use (Stripe Billing docs, January 2025).

None of this is a flaw at low or predictable usage. It becomes a problem when a single customer, or a single agent, can spend a month's worth of budget in an afternoon.

The real risk is the time-to-bill gap, not race conditions

The failure that actually drains AI margins is rarely two requests racing the same balance. It is the gap between usage happening and the debit landing, plus what goes wrong inside that gap: events dropped, double-counted, or mispriced. In invoice-based pipelines, usage is metered through the period and reconciled into charges later, and reconciliation is where customers get billed twice or not at all.

The defense is correctness at ingestion. Each usage event should be recorded, priced, and debited once, atomically, with retries that are safe to repeat and never double-charge. When billing is immediate and idempotent, the balance stays current, so the next authorization and any fraud or abuse rule act on real numbers. That is what makes real-time authorization practical as a way to contain fraud and abuse: the check that stops a shared credential reads a balance that already reflects every prior request.
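One common way to make retries safe is to key each usage event by a unique ID and ignore IDs already applied. The sketch below assumes the caller supplies that ID; the `IdempotentLedger` class and its `ingest` method are hypothetical names, and the in-memory dictionary stands in for a durable store with a uniqueness constraint.

```python
from decimal import Decimal
import threading


class IdempotentLedger:
    """Hypothetical ledger that debits each event ID at most once."""

    def __init__(self, balance: Decimal):
        self.balance = balance
        self._applied = {}  # event_id -> cost already debited
        self._lock = threading.Lock()

    def ingest(self, event_id: str, units: int, unit_price: Decimal) -> Decimal:
        with self._lock:
            if event_id in self._applied:
                # A retry of an event we already debited: do nothing.
                return self.balance
            cost = units * unit_price
            self._applied[event_id] = cost
            self.balance -= cost
            return self.balance


ledger = IdempotentLedger(Decimal("5.00"))
first = ledger.ingest("evt-001", 1000, Decimal("0.001"))
retry = ledger.ingest("evt-001", 1000, Decimal("0.001"))  # network retry
```

Here the retry returns the same balance as the first call, so a client that never saw the first response can resend without risk of a second debit.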

Hard concurrency control is usually the wrong place to spend the budget. Locking a balance against every simultaneous request is expensive and rarely necessary; if you know your unit economics, a customer slipping a few cents into the red on parallel calls is not the threat. Reserving cost before usage is harder still for AI, because the expensive part, the output, is unknown until after the model runs. The rare jobs that genuinely need a hard ceiling, like ten concurrent ten-minute video renders, are usually best capped in the application layer.
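An application-layer cap can be as simple as a counter of in-flight jobs. The sketch below uses Python's standard `threading.BoundedSemaphore`; the `ConcurrencyCap` wrapper and the limit of two are hypothetical, and a multi-process deployment would need a shared counter instead of an in-process one.

```python
import threading


class ConcurrencyCap:
    """Hypothetical per-customer limit on expensive jobs running at once."""

    def __init__(self, max_jobs: int):
        self._slots = threading.BoundedSemaphore(max_jobs)

    def try_start(self) -> bool:
        # Non-blocking: refuse immediately instead of queueing the job.
        return self._slots.acquire(blocking=False)

    def finish(self) -> None:
        # Free the slot when the job ends, whether it succeeded or failed.
        self._slots.release()


cap = ConcurrencyCap(max_jobs=2)
started = [cap.try_start() for _ in range(3)]  # third render is refused
cap.finish()                                   # one render completes
started_after_finish = cap.try_start()         # a slot is free again
```

The cap bounds worst-case exposure to the limit times the cost of one job, without putting a lock on the balance itself.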

Cursor's June 2025 pricing change is the public version of the time-to-bill gap, not a concurrency story. The company replaced request caps with a monthly credit pool priced near frontier-model rates and charged overages automatically. Users ran through credits in days, were billed without a current balance to watch, and the company apologized and opened a refund window for affected users (TechCrunch, June 2026). The lesson is not that usage pricing is dangerous. It is that opaque meters, invisible balances, and overages discovered after the fact are all symptoms of latency between usage and the debit.

Throughput or latency: the trade-off every platform makes

Every billing platform optimizes for either throughput or per-action latency, and the choice shows up in its architecture. Invoice-based platforms optimize for ingesting enormous event volume and settling it accurately at cycle end. Real-time platforms optimize for per-action latency: authorize against a current balance before the action, then record and debit the usage atomically the moment it happens. These are different targets serving different workloads, not a ranking.

The named platforms sort cleanly along this axis. Orb, Metronome, and Lago are invoice-based: usage accrues across the period and is billed at cycle end, with no real-time balance debit. Metronome is built for throughput, with a streaming aggregation architecture designed to ingest enormous event volume for customers like OpenAI and Anthropic. Lago is precise about where billing settles: its authoritative balance moves only when an invoice is finalized, and the faster ongoing_balance figure is a premium estimate, not the point where enforcement happens.

The rest sit at different points on the same axis. Stripe Billing is subscription-first, with credits modeled as invoice adjustments rather than a live balance. Flexprice is a hybrid: it has a balance primitive, but its automatic usage debit fires on invoice payment, not on the event itself. Stigg sits in the middle, with a real-time entitlement check that gates access before usage, while the actual billing settles downstream through a connected billing system after consumption.

Platform	Architecture	When enforcement can act	Source of truth
Orb, Metronome, Lago	Invoice-based	After the action, at cycle end	Invoice computed at reconciliation
Stripe Billing	Subscription-first	After the action	Subscription and invoice
Flexprice	Hybrid	Automatic debit at invoice payment	Invoice-driven balance
Stigg	Real-time entitlement gate, billing downstream	Access checked before; billing after	Entitlement state, then downstream invoice
Credyt	Real-time, end-to-end	Authorize before the action; debit atomically as usage happens	Balance, updated on every event

In our conversations with early-stage AI teams through the first half of 2026, the same story kept coming back: nobody had designed for the gap between a usage event firing and the balance actually moving. Teams find it the hard way, usually after a single customer or one runaway agent has already spent the money. The choice between throughput and latency is fine to make deliberately. The trouble starts when it is made by accident. In the engineering of usage-based billing under low latency, the constraint is always the same: the balance has to be current at the moment of the check.

Billing is settlement; authorization is control

Billing is settlement: the accounting of what was used and what is owed. Authorization is control: the decision about whether the next unit of usage should happen at all. Invoice-based architectures are settlement systems; they report what happened, they do not decide whether it should happen. Real-time architectures make that decision first, at authorization, and let settlement follow automatically.

That reframing changes the axis people argue about. The real question is not whether usage is billed before or after the action, because in real-time systems there is no separate billing decision; there is a usage decision the platform makes, and the billing follows. The honest framing is simple: in real-time billing, usage is authorized before it runs and billed the moment it does; in invoice-based billing, usage is billed after the action, at cycle end.

The shift the AI market is living through is the move from recording usage to deciding whether usage should happen. Once you see billing as control rather than bookkeeping, the latency in the path is no longer a performance metric. It is the size of the window in which you have no control at all.

When invoice-based billing is the right call

Invoice-based billing is the correct architecture for a large set of real workloads, and pretending otherwise would be dishonest. Where throughput matters more than per-action latency, the invoice model is not a compromise. It is the right tool.

Enterprise contracts with quarterly true-ups and negotiated terms are the clearest case. Metronome serves OpenAI, Anthropic, and NVIDIA precisely because high-volume event ingestion and accurate cycle-end settlement are exactly what those contracts need. The cost lands inside a committed contract, so the enforcement window is not an exposure.

Predictable, low-variance products are the second case. Snowflake runs the bulk of its revenue on consumption-based pricing through ordinary invoice cycles without trouble, because usage is steady and the per-unit cost is well understood. Bain put that share at about 93% in 2022, the most recent broad benchmark of its kind (Bain & Company, 2022). There is even a forward case for patience: Gartner projects inference costs on large models falling more than 90% by 2030, which would soften the urgency of real-time control for commodity models even as frontier models and multi-step agents keep their high per-interaction variance (Gartner, March 2026). If your usage is predictable, your margins are comfortable, and your customers expect a monthly invoice, real-time authorization adds machinery you do not need.

Real-time AI billing is an architecture decision, not a feature

Real-time AI billing is not a feature you toggle on; it is a decision about how much time can pass between a customer using your product and your system being able to act on it. For a flat SaaS tool, the answer can be a month and nothing breaks. For a product whose cost is real-time, concurrent, and driven by agents, the answer has to be measured in milliseconds, because the window between usage and enforcement is the window in which your unit economics are undefended. That is the practical content of real-time economic control: not a dashboard that shows yesterday's overage, but a balance that is current at the moment a request asks to run.

The systems that authorize a request first, then price and settle the usage the moment it happens, are becoming the default for products whose costs are real-time. Credyt is built for that path: it debits each customer's balance atomically as usage arrives, so the platform can decide whether the next request should proceed against a balance that is actually current. It does not replace a payment processor or a finance stack; it adds the real-time layer those systems were never designed to provide. For teams whose unit economics depend on what happens in the seconds around a request, that layer is no longer optional.

See Credyt for AI companies.