The AI Engineering Stack in Production: What Actually Breaks at 10 Million Requests a Day
A system that handles 100 requests per day during development will encounter failure modes that were statistically invisible at that scale when it reaches 10 million requests per day. An input that causes context budget exhaustion appears once every few thousand requests. At 100 requests per day, it may never surface during development. At 10 million requests per day, it occurs thousands of times daily. A prompt that triggers a model refusal 0.1% of the time generates 10,000 refusals per day at production scale, creating visible user-facing failure that a development team never tested for because the probability was too low to encounter in a test set.
Scale does not just increase the frequency of known failure modes. It reveals failure modes that are structurally impossible to observe at low volume: cache poisoning from repeated identical queries, embedding model drift as document distributions shift over weeks, GPU memory fragmentation from heterogeneous request lengths, and cost spikes from tail inputs with unusually long outputs. Building for production scale means anticipating these failures before they occur, not debugging them after the system is already serving millions of users.
The failure taxonomy
Each failure mode in a production AI system has a distinct root cause, a distinct scale threshold at which it becomes observable, and a distinct detection signal. Treating them as a single "reliability problem" leads to mitigations that address one failure mode while leaving others invisible.
GPU memory pressure is the most operationally acute failure mode. At sustained load, KV cache memory fills gradually as the active request count grows. Static batch size limits set during testing assume a fixed maximum context length. In production, the tail of the request length distribution includes inputs that are three to five times longer than the median. A single long-context request can consume as much KV cache as 30 short requests, because KV cache memory scales linearly with sequence length and batch size jointly. Without dynamic memory management (PagedAttention, as described in the Kwon et al. SOSP 2023 paper) and a request queue that accounts for current memory pressure rather than fixed batch slots, a surge of long-context requests causes OOM crashes that take the serving node offline. The failure is not gradual: the system serves requests normally until memory is exhausted, then crashes entirely, producing a step-function availability drop rather than a gradual latency increase. The correct mitigation combines dynamic request admission control, which estimates KV cache cost from input length before admission and rejects or queues requests when memory headroom is insufficient, with graceful degradation that returns a clear error rather than an OOM crash.
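The admission-control side of this mitigation can be sketched as a small reservation layer in front of the batch scheduler. The layer count, head dimensions, and memory budget below are illustrative assumptions, not measurements from any particular model or serving stack; the point is that admission is decided from estimated KV cache cost, not from a fixed batch-slot count.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Estimate KV cache for one sequence: K and V tensors per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

class AdmissionController:
    """Admit requests against a KV cache byte budget instead of fixed slots."""
    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.in_use = 0

    def try_admit(self, input_tokens: int, max_output_tokens: int) -> bool:
        """Reserve the worst-case KV footprint; reject when headroom is gone,
        so the caller can queue the request or return a clear error
        instead of letting the node OOM."""
        need = kv_cache_bytes(input_tokens + max_output_tokens)
        if self.in_use + need > self.budget:
            return False
        self.in_use += need
        return True

    def release(self, input_tokens: int, output_tokens: int) -> None:
        """Free the reservation when the request completes."""
        self.in_use -= kv_cache_bytes(input_tokens + output_tokens)
```

Under these assumptions a single 4,500-token request reserves roughly 0.6 GB, which is how one long-context request crowds out dozens of short ones.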
Embedding model drift is slow, invisible, and dangerous precisely because of those properties. The embedding model evaluated and deployed at launch was trained on text similar to the documents indexed at that time. As the document corpus grows and the query distribution shifts, the alignment between the embedding space and the retrieval task degrades. A model trained on a corpus dominated by product documentation and FAQ content will not represent newly added legal agreements or technical specifications with the same fidelity. This manifests as gradual retrieval quality degradation: a system that achieves 78% recall@5 at launch may be at 65% recall@5 six months later if the embedding model's representation of the current corpus has drifted relative to the document distribution it was optimized for. At 65% recall@5, one in three retrieval requests is returning a set of documents that does not contain the most relevant chunk, and the generation layer cannot compensate for what was not retrieved. The drift is undetectable without continuous measurement. It does not produce errors or exceptions. It produces subtly worse answers, which manifest as user satisfaction decline before any metric catches it.
Context budget exhaustion is a tail-input problem. Every system has a maximum context length, and inputs that exceed it must be truncated or rejected. In development, inputs rarely approach the limit because test sets are constructed by engineers who are mindful of the constraint. In production, 0.01% to 0.1% of inputs may exceed the limit due to unusually long user messages, large retrieved chunks, accumulated conversation history in multi-turn systems, or adversarial inputs designed to maximize context consumption. At 10 million requests per day, 0.01% is 1,000 requests per day hitting the context limit. Without explicit handling, the system either truncates silently (producing answers that lack key context), throws an error (producing user-facing failure), or passes a too-long request to the model (which may produce erratic behavior at the boundary). The correct mitigation is explicit context budget management: allocate token budgets to system prompt, retrieved context, conversation history, and user input separately, with overflow handling that truncates conversation history first (preserving the most recent turns), then retrieved context (preserving the highest-ranked chunks), and never truncates system prompt or the current user turn.
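The overflow policy above can be expressed as a small allocator. The sketch below works on per-component token counts rather than real text, and encodes only the priority order: the system prompt and current user turn are never truncated, retrieved chunks are kept highest-rank-first, and conversation history fills whatever budget remains, dropping oldest turns first.

```python
def fit_context(budget: int, system: int, user: int,
                history: list[int], chunks: list[int]) -> tuple[list[int], list[int]]:
    """Fit components into a token budget.

    history: per-turn token counts, oldest first.
    chunks: per-chunk token counts, highest-ranked first.
    Returns (kept_history, kept_chunks); raises if even the
    non-truncatable components exceed the budget.
    """
    if system + user > budget:
        raise ValueError("system prompt + current turn exceed the context budget")
    remaining = budget - system - user

    # Retrieved context survives truncation before history does:
    # keep the highest-ranked chunks that fit.
    kept_chunks, used = [], 0
    for c in chunks:
        if used + c <= remaining:
            kept_chunks.append(c)
            used += c

    # Fill what is left with the most recent history turns.
    kept_history = []
    for turn in reversed(history):          # newest first
        if used + turn <= remaining:
            kept_history.insert(0, turn)    # restore chronological order
            used += turn
        else:
            break
    return kept_history, kept_chunks
```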
Tool call storms appear in agent systems when a failure in one tool call causes the model to retry the call with slightly modified parameters. If the tool is temporarily unavailable due to rate limiting, network timeout, or service degradation, the model may generate 10 to 20 retry attempts before exhausting its step budget. The model is not misbehaving; it is following its instructions to complete a task that requires the tool's output. At production scale with concurrent agents, a tool that begins experiencing failures triggers simultaneous retry storms from all active agents. A tool handling 100 concurrent agent requests that begins failing at 5% error rate generates not 5 retry requests but 50 to 100, because each of the failing agents retries multiple times. This thundering herd effect amplifies the original failure: a transient tool degradation becomes a full tool outage as retry volume overwhelms the recovering service. The three-part mitigation is exponential backoff with jitter on tool retries (reducing correlated retry load), circuit breakers that stop calling a failing tool and return a clear error to the model after three consecutive failures (preventing retry storms from propagating), and explicit handling in the agent system prompt for tool unavailability scenarios (so the model gracefully acknowledges the limitation rather than spinning on retries).
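The first two parts of the mitigation can be sketched as follows, assuming the three-consecutive-failure threshold from the text and a fixed cooldown; the injectable clock exists only to make the breaker testable.

```python
import random
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast until
    `cooldown` seconds pass, then permit a single probe call."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Fast-path check before calling the tool."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None                   # half-open: one probe
            self.failures = self.threshold - 1      # reopen on probe failure
            return True
        return False

    def record(self, success: bool) -> None:
        """Report the outcome of a tool call."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, decorrelating concurrent retries."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Full jitter matters here: without it, all agents that failed in the same second retry in the same second, reproducing the thundering herd at each backoff step.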
Silent regressions are the failure mode that causes the most damage per incident, because they are the hardest to detect. A model version update, a prompt change, a new document corpus batch, or a configuration flag change can silently degrade output quality on a specific query category without producing any infrastructure alert. The system continues to serve requests; error rates remain stable; latency metrics are unchanged. The regression manifests only in the quality metrics that require model-driven evaluation: hallucination rate, instruction following rate, answer relevance. Teams without continuous quality monitoring may not detect a silent regression until user-reported errors accumulate over days or weeks. By that time, the change that caused the regression may no longer be visible in the deployment history.
Each of these five failure modes requires different instrumentation. A single aggregate error rate metric will not detect embedding drift, silent regression, or context budget exhaustion. A GPU utilization alert will not detect a silent regression. A cost anomaly alert will not detect a tool call storm until it has already amplified into a major incident. The failure taxonomy determines the instrumentation strategy, which is covered in the observability section below.
Cost-at-scale arithmetic
The economics of a production AI system are dominated almost entirely by a single line item. Understanding the arithmetic in advance determines which optimizations are worth engineering time.
Consider a production RAG plus agent system serving 10 million requests per day at a typical configuration. Embedding costs are minimal: each request embeds the query (one embedding call, approximately 500 tokens). At text-embedding-3-small pricing of $0.02 per million tokens, 10 million requests at 500 tokens each produces 5 billion tokens per day, costing $100 per day. If documents are re-embedded on a weekly index refresh of 10 million documents at 500 tokens each, the refresh costs $100 per run or approximately $15 per day amortized. Embedding total: roughly $115 per day.
Vector retrieval at 10 million queries per day against an index of 10 million documents, using a managed vector database, costs approximately $300 per day at current pricing for a configuration supporting this query volume. This covers storage, query compute, and index maintenance.
LLM generation is the line item that matters. Each request involves approximately 4,000 input tokens (system prompt 1,500 tokens, retrieved context 2,000 tokens, query 500 tokens) and approximately 500 output tokens. At GPT-4o pricing, input costs $2.50 per million tokens and output costs $10 per million tokens. Per request: input is $0.010, output is $0.005, total is $0.015. At 10 million requests: $150,000 per day.
Reranking with a 200-million-parameter cross-encoder that scores 50 candidates per query requires approximately 0.5 milliseconds of A100 compute per candidate. 10 million requests at 50 candidates at 0.5 milliseconds equals 250,000 GPU-seconds per day, approximately three A100 GPUs dedicated to reranking. At $2 per GPU-hour, reranking costs roughly $140 per day.
Total daily cost: $115 (embedding) plus $300 (vector retrieval) plus $150,000 (LLM generation) plus $140 (reranking) equals approximately $150,555 per day. LLM generation is 99.6% of total spend.
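As a sanity check, the arithmetic above can be recomputed directly. The prices and token counts are the illustrative figures from this section, not live provider quotes.

```python
REQUESTS_PER_DAY = 10_000_000

def embedding_cost_per_day() -> float:
    """Query embedding plus amortized weekly re-embedding of 10M documents."""
    query = REQUESTS_PER_DAY * 500 / 1e6 * 0.02        # ~$100/day
    refresh = 10_000_000 * 500 / 1e6 * 0.02 / 7        # ~$14/day amortized
    return query + refresh

def generation_cost_per_day(inp: int = 4000, out: int = 500,
                            in_price: float = 2.50,
                            out_price: float = 10.00) -> float:
    """Per-request cost at the stated token counts and per-million prices."""
    per_request = inp / 1e6 * in_price + out / 1e6 * out_price  # $0.015
    return REQUESTS_PER_DAY * per_request

def total_cost_per_day() -> float:
    # $300 vector retrieval and $140 reranking from the text
    return embedding_cost_per_day() + 300 + generation_cost_per_day() + 140
```

Running this confirms the conclusion: generation is more than 99% of total spend, so every other line item is noise by comparison.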
Every optimization must target token volume or price per token. Optimizing retrieval infrastructure, reranking throughput, or embedding costs has negligible impact on total spend compared to a 10% reduction in generation cost.
This arithmetic makes the priority ordering of cost optimizations unambiguous. Semantic caching at a 40% hit rate saves $60,000 per day. Routing 60% of simple queries to a 10x cheaper model saves roughly $81,000 per day (60% of the $150,000 generation spend at a 90% discount). Prompt compression reducing input tokens by 30% saves approximately $30,000 per day (input tokens account for $100,000 of daily spend). These savings overlap rather than stack, since caching and routing act on the same generation spend. Optimizing the vector database configuration saves a few hundred dollars per day. Teams that spend engineering time on retrieval infrastructure optimization while their generation cost runs unchecked are optimizing the wrong variable.
Semantic caching: cutting costs at the query layer
Semantic caching stores previously generated responses indexed by the embedding of the query that generated them. When a new query arrives, it is first compared against the cache: if the cosine similarity between the new query embedding and a cached query embedding exceeds a threshold (typically 0.95 to 0.98), the cached response is returned without calling the LLM. The cache is implemented as a vector database with TTL-based expiry, and the lookup is a single nearest-neighbor query against the cache index, completing in single-digit milliseconds.
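A minimal sketch of the lookup path, using a brute-force cosine scan in place of the vector index a production cache would use; the embedding function is assumed to exist upstream, and the injectable clock is only for testing TTL expiry.

```python
import math
import time

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Brute-force semantic cache with per-entry TTL. A production version
    replaces the linear scan with a nearest-neighbor index and partitions
    entries by query category."""
    def __init__(self, threshold: float = 0.96, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock
        self.entries = []                      # (embedding, response, expires_at)

    def get(self, embedding: list[float]):
        """Return a cached response on a similarity hit, else None."""
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]   # drop expired
        for vec, response, _ in self.entries:
            if _cosine(vec, embedding) >= self.threshold:
                return response                 # hit: skip the LLM call
        return None                             # miss: caller generates and puts

    def put(self, embedding: list[float], response: str, ttl_s: float) -> None:
        self.entries.append((embedding, response, self.clock() + ttl_s))
```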
For FAQ-style workloads where many users ask semantically equivalent questions, cache hit rates of 30 to 60% are achievable. At a 40% cache hit rate on a $150,000 per day LLM spend, the daily saving is $60,000. At $22 million per year in savings, the engineering cost of building and maintaining a semantic cache with a quality vector database is recovered in days. The operational cost of the cache itself (a vector database storing query embeddings and response pairs with TTL management) runs a few thousand dollars per month at this scale.
Threshold calibration is the critical engineering decision. At 0.95 cosine similarity, the cache returns responses for queries that are semantically very similar but not identical. This is correct for factual queries where two users asking the same question in different words should receive the same answer. It is wrong for personalized or context-dependent queries. Two users asking "what are my recent transactions" should not share a cached response; their transactions are different. Two users asking "what is the capital of France" should share a cached response; the answer is identical. The practical implementation partitions the cache by query category: semantic caching is applied only to query categories where the response is not user-specific, session-specific, or time-sensitive. The categorization is a routing decision made before the cache lookup, using the same lightweight classifier that routes queries to different model tiers.
TTL management is the other critical parameter. Cached responses become stale as the underlying data changes. A cached response about current pricing is accurate when cached and wrong six hours later when pricing updates. TTLs should be set per query category based on the expected freshness requirement: a cached response about historical facts may have a TTL of 30 days; a cached response about live inventory may have a TTL of 60 seconds. Using a single global TTL will either over-expire cache entries for stable facts (wasting compute by regenerating answers that have not changed) or under-expire cache entries for rapidly changing data (serving stale responses).
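One way to combine both decisions (cacheability by category and per-category TTL) is a small policy table consulted before the cache lookup. The category names and TTL values here are hypothetical placeholders, not recommendations.

```python
# Hypothetical per-category cache policy; each system defines its own
# categories and freshness requirements.
CACHE_POLICY = {
    "historical_fact":  {"cacheable": True,  "ttl_s": 30 * 24 * 3600},
    "product_pricing":  {"cacheable": True,  "ttl_s": 6 * 3600},
    "live_inventory":   {"cacheable": True,  "ttl_s": 60},
    "account_specific": {"cacheable": False, "ttl_s": 0},   # never share
}

def cache_decision(category: str) -> tuple[bool, int]:
    """Return (cacheable, ttl_s); unknown categories default to uncacheable,
    which fails safe for user-specific or time-sensitive queries."""
    policy = CACHE_POLICY.get(category, {"cacheable": False, "ttl_s": 0})
    return policy["cacheable"], policy["ttl_s"]
```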
Model versioning and silent regressions
Model providers version their APIs and periodically deprecate old versions. GPT-4-0613 behaved differently from GPT-4-0314 on structured output tasks: the 0613 version had stronger instruction following for function calling but produced different outputs for some ambiguous prompts. Teams that migrated without running their evaluation suite on the new version discovered silent regressions in structured extraction tasks weeks after the migration, when user-reported errors accumulated to the point of visibility. The model version update was a breaking change that passed the infrastructure tests (latency normal, error rates stable) while failing the quality tests that were not being run continuously.
The correct protocol for model version migration is to run the full evaluation suite on the new model version before changing the API call in production. Comparing per-failure-mode metrics, not just aggregate accuracy, is essential: a new model version may improve factual accuracy by 3% while degrading instruction following by 8%, producing a net apparent improvement on an aggregate metric that masks a regression on the more important dimension for the application. If the new version degrades on any failure mode by more than the acceptable threshold, treat the migration as a breaking change and investigate before proceeding. The evaluation suite should be designed to run within 30 minutes and produce a go/no-go recommendation to make this pre-migration check operationally cheap enough to run on every version change.
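The go/no-go check reduces to a per-dimension comparison rather than an aggregate one. In the sketch below, metrics are success rates in [0, 1] where higher is better, and the 2-percentage-point regression threshold is an assumed value that each team would set per failure mode.

```python
def migration_gate(baseline: dict[str, float], candidate: dict[str, float],
                   max_regression: float = 0.02) -> tuple[bool, list[str]]:
    """Go/no-go for a model version migration.

    Blocks the migration if ANY metric regresses beyond `max_regression`,
    even when the aggregate improves; returns (go, regressed_metrics).
    """
    regressed = [m for m in baseline
                 if candidate[m] < baseline[m] - max_regression]
    return (len(regressed) == 0), regressed
```

This is exactly the case from the text: a +3% accuracy gain with an -8% instruction-following drop passes an aggregate check but fails this gate.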
The same protocol applies to embedding model updates. If the embedding model changes, all indexed documents must be re-embedded with the new model and the index rebuilt. An index containing embeddings from two different model versions will produce inconsistent retrieval quality. The cosine similarity between a new-model query embedding and an old-model document embedding is not a meaningful similarity score because the two embeddings occupy different geometric spaces. The similarity value is computed but represents nothing semantically coherent. Teams that incrementally update their index by embedding new documents with a new model while leaving existing documents embedded with the old model create this inconsistency silently, and the retrieval quality degradation is distributed across all queries that touch documents from both embedding generations.
The operational lesson is that embedding model updates require coordinated dual-index management: maintain the old index for production traffic, build the new index in parallel, run evaluation against both, cut over when the new index passes the quality threshold, and decommission the old index only after a holding period confirms the new index is stable. This is more expensive than incremental updates, but it is the only protocol that avoids quality degradation during the transition.
Observability architecture
The observability stack for a production AI system requires four measurement layers that standard software observability tooling does not provide. Teams that deploy AI systems with only infrastructure metrics are operating without visibility into the metrics that predict user experience.
Infrastructure metrics are the foundation and are fully covered by standard APM tooling. GPU utilization, memory usage, request queue depth, TTFT (time to first token) at p50/p95/p99, and TPOT (time per output token) at p50/p95/p99 are the same metrics any distributed system collects. Datadog, Grafana, and equivalent tools instrument them without custom development. Monitor at one-minute granularity with alerting when p95 TTFT exceeds 2x the baseline. This layer catches GPU memory pressure, serving node failures, and latency regressions caused by infrastructure changes.
Token consumption metrics are the first layer that requires custom instrumentation. Standard APM tools do not capture tokens per request, cost per request, or total daily spend because these concepts do not exist in non-AI systems. Building this instrumentation requires wrapping every LLM API call to capture input token count, output token count, model version, request category, and timestamp, then aggregating into a time series that supports per-segment cost analysis. Alert on 20% day-over-day spend increase: this catches prompt regressions (a prompt change that produces verbose responses), traffic spikes (expected or anomalous), and cost anomalies from tail inputs consuming disproportionate output budget.
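A sketch of the wrapper, with hypothetical per-million-token prices; a production version would persist records to a time-series store rather than an in-memory list, and would pull prices from configuration rather than a literal table.

```python
import time
from dataclasses import dataclass

@dataclass
class CallRecord:
    model: str
    category: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    ts: float

# Illustrative (input, output) prices per million tokens; real values
# come from your provider's current price sheet.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def record_call(log: list, model: str, category: str,
                input_tokens: int, output_tokens: int) -> CallRecord:
    """Wrap every LLM API call with this to build the cost time series."""
    in_p, out_p = PRICES[model]
    cost = input_tokens / 1e6 * in_p + output_tokens / 1e6 * out_p
    rec = CallRecord(model, category, input_tokens, output_tokens,
                     cost, time.time())
    log.append(rec)
    return rec

def daily_spend(log: list) -> float:
    return sum(r.cost_usd for r in log)

def spend_alert(today: float, yesterday: float, threshold: float = 0.20) -> bool:
    """Fire on the 20% day-over-day increase described above."""
    return today > yesterday * (1 + threshold)
```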
Retrieval quality metrics run on a schedule rather than per-request. Recall@5 on a held-out evaluation set of (query, expected document) pairs runs every six hours, providing a measurement that detects embedding model drift before it becomes severe. The embedding similarity distribution of retrieved chunks is a secondary signal: a shift in the distribution indicates that the embedding model's representation of the current corpus has changed, either from a model update or a corpus distribution shift. Mean reranker score is the third metric: a sustained drop indicates degraded retrieval relevance across the query distribution, regardless of recall. Alert when recall drops below the production threshold defined at deployment. This layer catches embedding drift, index corruption, and retrieval quality regressions caused by corpus changes.
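Recall@k itself is a few lines; the scheduled job runs this over the held-out (query, expected document) pairs and compares the result against the deployment baseline. The sketch assumes one expected document per query.

```python
def recall_at_k(results: dict[str, list[str]],
                expected: dict[str, str], k: int = 5) -> float:
    """results: query -> ranked document ids from the retriever.
    expected: query -> the relevant document id from the eval set.
    Returns the fraction of queries whose relevant document
    appears in the top k."""
    hits = sum(1 for q, doc in expected.items() if doc in results[q][:k])
    return hits / len(expected)
```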
Answer quality metrics are the most expensive to compute and the most directly predictive of user experience. Hallucination rate on a sampled evaluation set (using automated fact-checking against source documents or LLM-as-judge), instruction following rate measured via programmatic output schema validation, and refusal rate on a set of known-legitimate queries are the three core metrics. These require model-driven evaluation and cannot run on every request economically; a 1% sample of production traffic evaluated every hour is the standard operational pattern. Alert on 10% relative degradation from the deployment baseline on any metric. This layer catches silent regressions from model version updates, prompt changes, and corpus quality changes.
Standard APM covers the infrastructure layer fully. The remaining three layers (token consumption, retrieval quality, answer quality) require custom instrumentation specific to AI systems. Teams that deploy without the retrieval quality and answer quality layers have no visibility into either until user complaints reach a volume that triggers manual investigation. By that point, the regression has typically been running for days.
The five runbooks before go-live
A runbook is a documented procedure for responding to a specific class of incident. Before a production AI system goes live at scale, five runbooks should exist and be tested against synthetic failure conditions. The runbook is not documentation after the fact; it is the operational agreement that defines what constitutes a valid response to each failure mode and who is responsible for executing each step.
The latency spike runbook responds when p95 TTFT exceeds the SLO. The decision tree has four branches. First, check GPU utilization: if above 90%, scale out by adding capacity and monitor whether p95 TTFT returns to baseline within five minutes. Second, if GPU utilization is normal, check request queue depth: if the queue is growing, reduce maximum batch size or add admission control that rejects or queues long-context requests. Third, if queue depth is normal, check for anomalously long inputs in the current traffic mix: if the p99 input length has increased significantly, apply input length filtering to cap requests at a maximum token count. Fourth, if input length is normal, check the model endpoint's status page for provider-side degradation: if degradation is confirmed, activate the fallback model or serve from cache for the affected query categories. Each branch has a named owner and a clear resolution criterion. On-call engineers should not be making judgment calls at 3am about which branch to take.
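The four branches can be encoded as a dispatch function so the on-call path is mechanical rather than a judgment call. The 90% GPU utilization threshold comes from the text; the input-length ratio threshold is an assumed value a team would calibrate against its own baseline.

```python
def latency_spike_branch(gpu_util: float, queue_growing: bool,
                         p99_input_vs_baseline: float,
                         provider_degraded: bool) -> str:
    """Walk the latency runbook's four branches in order, returning the action."""
    if gpu_util > 0.90:
        return "scale_out"                              # branch 1: add capacity
    if queue_growing:
        return "reduce_batch_or_add_admission_control"  # branch 2
    if p99_input_vs_baseline > 1.5:                     # assumed threshold
        return "cap_input_length"                       # branch 3: filter long inputs
    if provider_degraded:
        return "activate_fallback_or_cache"             # branch 4
    return "escalate"                                   # outside the decision tree
```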
The hallucination spike runbook responds when the hourly hallucination rate on the monitoring sample exceeds the alert threshold. The first step is to check whether retrieval recall has dropped in the same window: if recall has fallen, the hallucination increase is a retrieval quality problem upstream of generation. A generation mitigation applied to a retrieval failure will not resolve the incident. If recall is normal, the second step is to check whether a new document corpus batch was indexed recently: if so, check for an embedding model version mismatch between the new documents and the existing index. If the index is consistent, the third step is to check whether the system prompt or model version was changed within the alert window: if so, roll back to the previous configuration and run the evaluation suite against both. If none of these upstream causes are found, the fourth step is to apply self-consistency to the affected query category and escalate for investigation, because an unexplained hallucination spike with stable infrastructure and retrieval signals indicates a distribution shift in incoming queries that requires analysis rather than a runbook procedure.
The cost spike runbook responds when daily spend increases more than 20% day-over-day. The four diagnostic branches are: a traffic volume increase (expected from a marketing campaign or user growth, or anomalous from a traffic replay attack or scraping), an output length increase (a prompt regression that causes the model to generate verbose responses, detectable by comparing mean output token count against the prior day's baseline), a cache hit rate degradation (a TTL expiry event or cache invalidation that dropped the semantic cache hit rate, detectable by checking cache metrics directly), and a new query type that bypasses the cheap-model routing (a change in the query distribution that routes queries intended for GPT-4o-mini to GPT-4o, detectable by comparing the routing distribution against the prior day's baseline).
The tool call storm runbook responds when a tool's error rate exceeds 5% and the agent step count for requests using that tool is growing. The immediate step is to activate the circuit breaker: stop routing requests to the failing tool and return a clear error to the model. This prevents the retry storm from amplifying the original failure. The second step is to check the tool's error logs for the root cause: is it a rate limit, a network timeout, a downstream service failure, or a malformed request? The third step is to update the agent system prompt with an explicit instruction for handling this tool's unavailability: "if the [tool name] tool returns an unavailable error, inform the user that this capability is temporarily unavailable and offer the available alternatives." Without this prompt update, models will continue attempting workarounds that generate additional tool calls. The fourth step is to communicate estimated recovery time to users through an in-product message if the outage exceeds five minutes.
The index corruption runbook responds when retrieval recall drops precipitously, defined as more than 20 percentage points in a single six-hour measurement window. A gradual recall decline over weeks is an embedding drift problem; a sudden drop is a structural problem. The diagnostic sequence is to first check whether a recent index update completed successfully or was interrupted: a partial index rebuild leaves the index in an inconsistent state where some documents are indexed and others are not. Second, check whether the embedding model version is consistent between the current index and the query embedding function: a deployment that updated the query-side embedding model without rebuilding the index creates the version mismatch described in the model versioning section. Third, check whether document metadata filters are returning empty sets for previously valid filters: a metadata schema change or a data pipeline failure may have stripped filter fields from documents, causing filtered queries to retrieve nothing. If none of these structural causes are found, trigger a full index rebuild from the source documents as the definitive remediation. A full rebuild from source is expensive (hours of embedding compute) but is the only remediation that guarantees a consistent, correct index state.
The production reality
None of these failure modes are novel. GPU memory pressure at high request volume is a well-studied systems problem. Embedding drift is a form of dataset shift, a classical machine learning problem. Context budget management is a constraint satisfaction problem. Tool call storms are a specific form of retry storm, which every distributed systems engineer has encountered. Silent regressions are the same category of problem as silent failures in any stateful system.
What makes these failure modes specific to AI systems is the combination of their scale thresholds (invisible in development, critical in production), their detection mechanisms (requiring model-driven evaluation rather than infrastructure metrics), and their remediation complexity (fixing an embedding model drift incident requires a full index rebuild, not a service restart). The operational practices that work for traditional web services (deploy, monitor error rates, page on anomalies) are necessary but insufficient. The additional layers required for AI systems (continuous retrieval quality monitoring, token economics instrumentation, answer quality sampling) are not difficult to build, but they must be built before the system goes to production scale, not after the first incident demonstrates their absence.
The runbooks described here are starting points, not complete procedures. Every production system has idiosyncrasies that require customization: a retrieval architecture that uses sparse and dense retrieval jointly will have a different recall measurement protocol than a purely dense retrieval system. An agent system using a tool registry with dozens of tools will need a more granular circuit breaker configuration than a system with three tools. The structure (failure mode, scale threshold, detection signal, decision tree, named owner, resolution criterion) is the invariant. The content is specific to the system.
What is not optional is the decision to build this infrastructure before the system reaches scale. At 100 requests per day, a silent regression affecting 10% of traffic costs ten bad answers per day. At 10 million requests per day, the same regression costs roughly 40,000 bad answers per hour. The gap between those two numbers is not a reason to wait until the system reaches scale to build the observability. It is the reason to build it first.
References
Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP, 2023. https://arxiv.org/abs/2309.06180
Bang, Y., Cahyawijaya, S., Lee, N., et al. "A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity." AACL, 2023. https://arxiv.org/abs/2302.04023
Agrawal, A., Kedia, N., Panwar, A., et al. "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." OSDI, 2024. https://arxiv.org/abs/2403.02310
Chen, B., Zhang, Z., Langrené, N., and Zhu, S. "Unleashing the Potential of Prompt Engineering in Large Language Models: a Comprehensive Review." arXiv, 2023. https://arxiv.org/abs/2310.14735
Zheng, C., Zhou, H., Meng, F., et al. "Large Language Models Are Not Robust Multiple Choice Selectors." ICLR, 2024. https://arxiv.org/abs/2309.03882
Opsahl-Ong, K., Ryan, M. J., Hardy, J., et al. "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs." arXiv, 2024. https://arxiv.org/abs/2406.11695