Diosh Lequiron
AI & Digital Transformation | 12 min read

The True Cost of AI in Production: A TCO Framework

The license fee is the smallest line item in running AI in production. A total cost of ownership framework for the inference, review, monitoring, and failure costs that surface only at scale.

A pilot priced an AI feature at the API fee and nothing else. The model call cost a fraction of a cent, the demo ran on a developer's laptop, and the projected monthly bill fit inside a rounding error. Eleven months later the same feature carried a review team, an evaluation harness, two monitoring dashboards, an on-call rotation, and a quarterly prompt-maintenance cycle. The model API was still a rounding error. Everything around it was not.

This is the most consistent pattern I have seen in AI adoption across program-delivery work: the cost that gets quoted is the cost that scales least, and the costs that scale most are the ones nobody put in the spreadsheet. Total cost of ownership for AI is not the price of the model. It is the price of running the model responsibly, at volume, for as long as the feature lives. That number is structurally larger, structurally later, and structurally harder to see during a pilot.

This article decomposes that cost stack and gives you a model you can apply before you commit. It deliberately stays on the cost side. The question of whether the spend earns its return is a separate discipline with its own integrity problems; here the only question is what the spend actually is.

Why the Pilot Lies About the Cost

A pilot is a controlled environment designed to prove that something works. That design is exactly what makes it a poor cost estimator. The pilot runs on a small, clean dataset, serves a handful of friendly users, tolerates a human watching every output, and runs for weeks rather than years. Every one of those conditions suppresses a cost category that production will reintroduce.

The pilot also tends to bundle one-time effort into the running picture. The team that built the pilot absorbed the integration work, the prompt design, and the first round of evaluation as part of the build, and none of it appears as a recurring line. Production inherits all of it as recurring maintenance, because models change, data drifts, and the surrounding systems keep moving.

The structural error is treating the API fee as the unit of cost. The API fee is a single line in a stack of seven, and it is usually the smallest one — the floor, with the multiplier sitting in the operating model around it. A TCO framework exists to make that multiplier visible before it becomes a surprise.

Inference Cost, and Why It Scales Non-Linearly

Start with the line everyone does count, because even this one is usually estimated wrong. Inference cost is the per-call charge for running the model, and the mistake is to model it linearly against users. It rarely behaves that way.

Suppose inference runs a hypothetical one cent per thousand tokens, and a pilot with one hundred users averages a handful of short calls each per day. The bill is trivial, and the spreadsheet extrapolates straight to ten thousand users and concludes the cost is still small. The extrapolation breaks for three structural reasons.

First, adoption changes behavior. Light pilot users send short, single-shot prompts. Engaged production users send longer context, paste in documents, ask follow-ups, and retry when the first answer disappoints. The tokens per user climb as users get more capable, so cost per user rises with engagement rather than holding flat.

Second, the architecture inflates the token count invisibly. A production system rarely sends the user's words alone. It prepends a system instruction, retrieved context, prior conversation, and formatting scaffolding. A user-visible request of fifty tokens can become a billed request of several thousand once the surrounding context is attached, and that ratio is set by your design, not by the user.

Third, quality work multiplies calls. Reliable systems re-ask, run a verification pass, or call a larger model when a cheaper one returns low confidence. Each of those patterns is a defensible engineering choice, and each one turns one billed call into several. The result is that inference cost compounds along two axes at once — more tokens per call and more calls per task — which is why a feature that costs a rounding error at a hundred users can cost a real budget line at ten thousand. Model the cost against tokens and call patterns, not against user headcount, or the number will be wrong by an order of magnitude in the direction that hurts.
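The two axes can be made concrete with a minimal sketch. Every number below is a hypothetical assumption in the spirit of the one-cent-per-thousand-tokens example above, and the names (InferenceProfile, monthly_inference_cost) are illustrative only. Real pricing usually separates input and output tokens; a single flat rate is used here to keep the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class InferenceProfile:
    """Illustrative usage profile; every field is an assumption, not a measurement."""
    visible_tokens: int            # tokens the user actually typed
    context_tokens: int            # system prompt, retrieved context, history
    output_tokens: int             # tokens the model generates
    calls_per_task: float          # retries, verification passes, escalations
    tasks_per_user_per_day: float
    price_per_1k_tokens: float     # hypothetical flat rate, in dollars


def monthly_inference_cost(profile: InferenceProfile, users: int, days: int = 30) -> float:
    """Cost driven by billed tokens and call patterns, not by headcount alone."""
    billed_tokens_per_call = (
        profile.visible_tokens + profile.context_tokens + profile.output_tokens
    )
    tokens_per_task = billed_tokens_per_call * profile.calls_per_task
    total_tokens = tokens_per_task * profile.tasks_per_user_per_day * users * days
    return total_tokens / 1000 * profile.price_per_1k_tokens


# Pilot: short single-shot prompts, no attached context, one call per task.
pilot = InferenceProfile(
    visible_tokens=50, context_tokens=0, output_tokens=200,
    calls_per_task=1.0, tasks_per_user_per_day=3, price_per_1k_tokens=0.01,
)
# Production: attached context, longer answers, a verification pass, heavier use.
production = InferenceProfile(
    visible_tokens=50, context_tokens=2000, output_tokens=300,
    calls_per_task=2.0, tasks_per_user_per_day=5, price_per_1k_tokens=0.01,
)

pilot_cost = monthly_inference_cost(pilot, users=100)
naive_extrapolation = pilot_cost * 100   # linear scaling from 100 to 10,000 users
modeled_cost = monthly_inference_cost(production, users=10_000)

print(f"pilot: ${pilot_cost:,.2f}")
print(f"linear extrapolation: ${naive_extrapolation:,.2f}")
print(f"modeled production: ${modeled_cost:,.2f}")
```

Under these assumed inputs the modeled production figure lands far above the linear extrapolation. The gap comes entirely from the token and call-pattern fields, which is the point: those are the fields to estimate carefully.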

The Human Review Layer Nobody Budgets

This is the line item most teams forget, and it is frequently the largest one. AI that touches anything consequential needs a human in the loop, and that human is a recurring operating cost, not a one-time setup.

The logic is unavoidable. Models produce output that is plausible whether or not it is correct, and they produce it confidently in both cases. Any output that can cause harm — a wrong financial figure, a misclassified support ticket, a contract clause, a medical-adjacent summary — needs a reviewer between generation and consequence. The pilot hid this cost because a developer was already watching every output for free. Production cannot hide it, because the whole premise of scaling is that there is far too much output for incidental human attention to cover.

Review cost scales with volume and with stakes. Consider a hypothetical operation generating ten thousand AI outputs a day where policy requires a human to check every high-stakes one. If a tenth are high-stakes and a reviewer clears one in two minutes, that is a thousand reviews and over thirty hours of review daily: several full-time roles whose combined salary is many times the inference bill. The instinct is to push the review rate down with sampling, and sampling is legitimate, but it is a risk decision dressed as a cost decision: every output you stop checking is an output you have chosen to ship unverified.
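The arithmetic in that example is easy to reproduce. The sketch below uses the same hypothetical figures; the six productive review hours per person per day and the 25 percent sampling rate are added assumptions, not numbers from any real operation.

```python
def daily_review_hours(outputs_per_day: int, high_stakes_share: float,
                       minutes_per_review: float, review_rate: float = 1.0) -> float:
    """Hours of human review needed per day for the high-stakes slice."""
    reviewed = outputs_per_day * high_stakes_share * review_rate
    return reviewed * minutes_per_review / 60


# Full review of every high-stakes output: 1,000 reviews at 2 minutes each.
full_review_hours = daily_review_hours(10_000, 0.10, 2)

# Assumption: a reviewer sustains about 6 productive review hours per day.
reviewers_needed = full_review_hours / 6

# Sampling 25% of high-stakes outputs cuts the hours, and leaves the rest unchecked.
sampled_hours = daily_review_hours(10_000, 0.10, 2, review_rate=0.25)
unverified_per_day = 10_000 * 0.10 * (1 - 0.25)

print(f"full review: {full_review_hours:.1f} h/day, about {reviewers_needed:.1f} reviewers")
print(f"sampled: {sampled_hours:.1f} h/day, {unverified_per_day:.0f} outputs shipped unverified")
```

Printing the unverified count next to the saved hours keeps the trade visible: the sampling rate is the risk decision, and the hours are only its consequence.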

The defensible move is to design the review layer as a deliberate part of the operating model rather than letting it accrete. Decide which output classes require full review, which tolerate sampling, and which can run unattended, and price each tier honestly. The mistake is assuming AI removes the human cost. In most consequential workflows it relocates the human from doing the work to verifying it — and verification at volume is its own staffed function with its own line in the budget.

Monitoring, Evaluation, and Observability

A deterministic system either returns the right answer or throws an error you can alert on. An AI system can return a fluent, well-formed, completely wrong answer and emit no error at all. That single property is why AI carries a monitoring cost ordinary software does not, and why the cost is recurring rather than one-time.

Three distinct functions live in this category, and conflating them understates the spend. Observability is the plumbing that records what went in, what came out, how many tokens it cost, and how long it took — the same operational telemetry any production service needs, plus the token and latency dimensions specific to models. Evaluation is the harder, AI-specific function: a maintained set of test cases that measures whether output quality is holding as the model, the prompts, and the input distribution shift underneath you. Drift detection watches the live input stream for the slow divergence between the data the system was tuned on and the data it now receives.

Each of these costs engineering time to build and, more importantly, ongoing time to maintain. An evaluation harness is not a fixture you write once; it is a living asset that decays the moment the system's behavior changes, which is constantly. Teams routinely budget the build and forget the upkeep, then discover months later that the quality dashboard has been green while reflecting a test set that no longer resembles production. The honest TCO treats evaluation as a standing maintenance commitment, because an unmaintained evaluation layer is worse than none — it manufactures false confidence, which is the most expensive failure mode of all.
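One way to keep a green dashboard honest is to report the age of the test set alongside the pass rate. The sketch below is a deliberately crude illustration under stated assumptions: the substring check stands in for a real grader, fake_model stands in for a real model call so the example runs offline, and the 90-day staleness threshold is arbitrary.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_substring: str   # crude check; real graders are richer
    last_reviewed: date       # when a human last confirmed the case reflects production


def run_eval(model: Callable[[str], str], cases: list[EvalCase],
             today: date, max_age_days: int = 90) -> dict:
    """Return the pass rate together with the share of cases nobody has revisited."""
    passed = sum(
        1 for c in cases
        if c.expected_substring.lower() in model(c.prompt).lower()
    )
    stale = sum(1 for c in cases if (today - c.last_reviewed).days > max_age_days)
    return {"pass_rate": passed / len(cases), "stale_share": stale / len(cases)}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call."""
    if "refund" in prompt.lower():
        return "Refunds are processed within 14 days."
    return "I am not sure."


cases = [
    EvalCase("What is the refund window?", "14 days", date(2024, 1, 10)),
    EvalCase("How do I reset my password?", "reset link", date(2024, 6, 1)),
]
report = run_eval(fake_model, cases, today=date(2024, 7, 1))
print(report)
```

A high pass rate with a high stale share is the false-confidence case described above. Reviewing cases to bring the stale share down is the recurring labor the budget needs to carry.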

Drift, Retraining, and Prompt Maintenance

AI systems do not hold still, and the cost of keeping them current is a recurring line that pilots never see because pilots do not run long enough to drift.

Drift arrives from several directions at once. The input distribution shifts as the world the system observes changes. The model itself shifts when the vendor deprecates a version or silently updates one, so a prompt that produced reliable output one quarter degrades the next with no change on your side. And the surrounding system shifts as features get added and the assumptions baked into your prompts quietly go stale.

The upkeep response depends on the architecture, and each carries a different cost shape. A prompt-and-API system carries prompt maintenance — the ongoing work of re-tuning instructions as models and inputs move, which is cheaper than retraining but is real, skilled, recurring labor that needs the evaluation harness above to know whether a change helped or hurt. A fine-tuned or custom-model system carries retraining: the periodic compute spend plus the far larger cost of assembling, cleaning, and labeling fresh training data, where the data work usually dominates the compute.

The structural point is that there is no maintenance-free AI system in production. The only choice is which maintenance shape you are signing up for, and that choice should be made deliberately at design time, with the recurring cost named, rather than discovered later when output quality has already eroded enough for someone to notice.

Integration, Plumbing, and the Cost of Failure

Two more categories complete the recurring picture, and both are easy to omit because neither shows up on a model invoice.

Integration and plumbing maintenance. The model is a small component inside a larger system of data pipelines, retrieval layers, queues, caches, retry logic, and the connective code between them. That surrounding apparatus is often the majority of the engineering and it does not stop needing maintenance once it ships. APIs change, data sources move, the retrieval index needs rebuilding, the queue needs tuning under load. None of this is exotic — it is ordinary software maintenance — but it is consistently left out of AI cost estimates because attention fixates on the model and treats the plumbing as already paid for. It is not. It is a standing line for as long as the feature lives.

The cost of failure. This is the category teams resist pricing because it feels speculative, but it is the one with the widest range and the longest tail. A wrong output is not free even when it is rare. It has a direct remediation cost: the effort to detect, correct, and clean up after a bad result. It has an incident cost when a failure is large enough to demand investigation, communication, and a fix. And it carries a trust cost that compounds: users who get burned by a confidently wrong answer stop trusting the correct ones, and rebuilding that trust is slow and expensive in ways no invoice captures. You cannot forecast failure cost precisely, but you can reason about it structurally, as failure probability multiplied by the consequence of each failure class. A system whose failures are cheap can run with lighter governance, while a system whose failures are expensive must carry heavier review, and that governance weight is itself a cost the failure exposure justifies.
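The structural reasoning about failure can be written as a small expected-cost calculation. The failure classes, probabilities, and dollar figures below are invented for illustration, and the sketch covers only remediation and incident cost; the compounding trust cost is not modeled.

```python
from dataclasses import dataclass


@dataclass
class FailureClass:
    name: str
    probability_per_output: float   # assumed, not measured
    cost_per_failure: float         # remediation plus incident cost, in dollars


def expected_annual_failure_cost(classes: list[FailureClass],
                                 outputs_per_year: int) -> dict[str, float]:
    """Expected cost per class: probability times consequence times volume."""
    return {
        c.name: c.probability_per_output * c.cost_per_failure * outputs_per_year
        for c in classes
    }


classes = [
    FailureClass("misrouted support ticket", probability_per_output=0.01, cost_per_failure=5.0),
    FailureClass("wrong financial figure", probability_per_output=0.0005, cost_per_failure=2000.0),
]
exposure = expected_annual_failure_cost(classes, outputs_per_year=1_000_000)

for name, cost in sorted(exposure.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${cost:,.0f} expected per year")
```

With these assumed inputs the rare, expensive class dominates the frequent, cheap one. That ordering, more than any single dollar figure, is what should set where the heavier review goes.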

Opportunity and Switching Costs

The last category pairs two costs that are not cash outflows, which is precisely why they get ignored, and ignoring them is how organizations end up locked in.

Opportunity cost is what the same engineering effort would have produced elsewhere. Every hour spent building and maintaining the AI stack is an hour not spent on something else, and when AI is treated as obviously worthwhile, that comparison never gets made. The discipline is to name it: the cost of an AI capability includes the value of the next-best use of the same scarce engineering attention.

Switching cost is the price of changing your mind later, and it accrues quietly with every dependency. Prompts get tuned to one model's quirks. Workflows assume one provider's output format. Data and infrastructure settle around one vendor's ecosystem. Each of these is a reasonable local decision and each one raises the cost of moving, until a provider's price increase or quality regression that should be a routine vendor swap becomes a re-engineering project. Switching cost does not show up until you try to switch, which is the worst possible time to discover its size. The defensible posture is to track which decisions are creating lock-in while they are still cheap to reverse, and to keep the abstraction boundary between your system and any single provider deliberate rather than accidental — because an architecture that assumes one model will be the model forever is making a bet it has not priced.
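A deliberate abstraction boundary can be as small as one interface that application code depends on, with one adapter per provider behind it. The sketch below is a minimal illustration: TextModel, VendorAModel, VendorBModel, and summarize_ticket are hypothetical names, and the adapters return placeholder strings where a real implementation would call a vendor SDK.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, system: str, user: str) -> str: ...


class VendorAModel:
    """Hypothetical adapter; a real one would call vendor A's SDK here."""
    def complete(self, system: str, user: str) -> str:
        return f"[vendor-a] {user}"


class VendorBModel:
    """Hypothetical adapter for a second provider with the same surface."""
    def complete(self, system: str, user: str) -> str:
        return f"[vendor-b] {user}"


def summarize_ticket(model: TextModel, ticket_text: str) -> str:
    """Application code: knows the task, not the provider."""
    return model.complete(
        system="Summarize the support ticket in one sentence.",
        user=ticket_text,
    )


# Swapping providers is a change at the call site, not inside the feature.
print(summarize_ticket(VendorAModel(), "Customer cannot log in."))
print(summarize_ticket(VendorBModel(), "Customer cannot log in."))
```

The interface does not remove switching cost, since prompts tuned to one model's quirks still need re-evaluation after a swap. What it does is confine the provider-specific code to the adapters, so the lock-in you are accumulating stays visible in one place.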

Building a TCO Model You Can Actually Apply

A framework is only useful if it survives contact with a budget meeting, so reduce the stack to something you can fill in before committing. Sum the seven categories over a realistic operating horizon — not the pilot window, but the period the feature will actually live, which for anything meaningful is years.

Run it as a one-page model. For inference, estimate full billed tokens per task — including system context and verification calls — times realistic task volume at target adoption, not pilot adoption. For human review, define output tiers, set a review rate per tier, and convert reviewer minutes into staffed roles at loaded salary. For monitoring, count the standing engineering time to maintain observability, evaluation, and drift detection, not just to build them once. For drift and maintenance, name your architecture's upkeep shape — prompt maintenance or retraining plus its data work — and budget it as a recurring cycle. For integration, treat the surrounding plumbing as ordinary software maintenance and size it as a fraction of the build that recurs annually. For failure, reason structurally about failure probability times consequence per class, and let that set the governance weight. For opportunity and switching, name them explicitly even where you cannot price them precisely, because an unnamed cost is an unmanaged one.
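The one-page model can be sketched as a single record with one field per category. All figures below are placeholders invented for illustration, and AnnualTCO is a hypothetical name. Opportunity and switching are left at zero to show a cost that is named but not yet priced.

```python
from dataclasses import dataclass, fields


@dataclass
class AnnualTCO:
    """Annual cost per category, in dollars. Every value is an estimate you supply."""
    inference: float
    human_review: float
    monitoring: float
    drift_and_maintenance: float
    integration: float
    failure: float
    opportunity_and_switching: float   # may stay 0.0 if named but unpriced

    def total(self, years: int = 1) -> float:
        """Sum all seven categories over the operating horizon."""
        return years * sum(getattr(self, f.name) for f in fields(self))

    def share(self, category: str) -> float:
        """Fraction of the annual total attributable to one category."""
        return getattr(self, category) / self.total()


model = AnnualTCO(
    inference=60_000,
    human_review=400_000,
    monitoring=120_000,
    drift_and_maintenance=80_000,
    integration=90_000,
    failure=50_000,
    opportunity_and_switching=0.0,
)

print(f"annual total: ${model.total():,.0f}")
print(f"three-year total: ${model.total(years=3):,.0f}")
print(f"inference share: {model.share('inference'):.1%}")
```

The flat multiplication by years is a simplification that assumes costs hold steady. A model for a real budget would let each category grow with adoption over the horizon.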

Three budgeting mistakes make AI look cheap in the pilot and expensive in production, and naming them is half the defense. The first is quoting the API fee as the cost, when it is the smallest of seven lines. The second is extrapolating pilot volume linearly, when inference compounds along tokens and calls and review scales with stakes. The third is booking recurring costs as one-time, when integration, evaluation, prompts, and data are recurring maintenance for the life of the feature. A model that corrects all three will produce a number several times larger than the pilot estimate, and that larger number is the real one.

None of this argues against running AI in production. It argues against running it on a fictional budget. The discipline is the same one that governs every durable system: count the full cost of ownership before you commit, not the cheapest line after you already have. The license fee was never the cost. The operating model around it always was — and a system whose true cost you have named is a system you can actually afford to run.
