KV Cache Is the Bottleneck You're Not Measuring
When a transformer generates text, it does not recompute attention over the full sequence from scratch at each step. That would be computationally catastrophic: generating a 1,000-token response would require 1,000 full forward passes, each more expensive than the last. Instead, the model caches the key and value tensors computed during each previous step and reuses them. This KV cache is what makes autoregressive generation tractable. It is also, at production scale, the primary constraint on how many requests you can serve concurrently, how long your context windows can be, and why your inference nodes run out of memory in ways that are hard to predict from weight size alone.
Most engineers sizing inference infrastructure stop at model weights. They see "LLaMA-2-70B requires 140 GB in fp16" and provision accordingly. What they miss is that the KV cache is a second, dynamic memory consumer that grows with every token generated, scales with both sequence length and batch size, and can exceed the weight footprint entirely at long context or high concurrency. The engineers who do understand this tend to be the ones who have already watched a serving node crash at 2 AM because a batch of unusually long requests filled the remaining HBM headroom.
KV cache mechanics and the memory formula
During the prefill phase, the model processes all input tokens in parallel, computing query, key, and value tensors for every token at every layer. The K and V tensors from this pass are stored in the cache. During decoding, each new token attends to all previous tokens using those cached K and V values, appending its own K and V to grow the cache one entry at a time. The Q tensor is computed fresh for each new token from its embedding, but there is no reason to recompute K and V for tokens the model has already seen.
This is not a trick or an approximation: caching is mathematically identical to full recomputation. Caching produces the same result because, in standard transformer attention, the K and V tensors for a given token depend only on that token's representation at that layer, not on any tokens that come after it. The causal (unidirectional) attention mask enforces this structure. If you break that assumption, as some architectures do with bidirectional cross-attention layers, caching requires more care.
The memory consumed by the KV cache follows a precise formula:
KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element
The factor of 2 accounts for both the K and V tensors. The num_kv_heads term is where Grouped Query Attention (GQA) and Multi-Query Attention (MQA) pay dividends: if the model uses 8 KV heads instead of 64, the cache is 8x smaller for the same model depth.
For LLaMA-2-70B specifically: 80 layers, 8 KV heads from GQA, head dimension 128, fp16 precision at 2 bytes per element. The KV cache per token per sequence is therefore 2 × 80 × 8 × 128 × 2 = 327,680 bytes, or 320 KB per token. At a sequence length of 4,096 tokens and batch size 1, that is 320 KB × 4,096 = 1.28 GB per sequence. Scale to batch size 16, and the total KV cache is 1.28 × 16 = 20.5 GB. Combined with the 140 GB of model weights, you need 160.5 GB of HBM, which lands at the absolute limit of two A100-80GB cards. There is no headroom for activation memory, no room for sequences that run longer than 4,096 tokens, and no margin for anything else the serving framework needs.
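The formula translates directly into a sizing helper. A minimal sketch (figures reported in binary GiB, so they round slightly differently from the decimal approximations used in the text):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """KV cache footprint: 2 (K and V) x layers x kv_heads x head_dim
    x tokens x sequences x element width."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# LLaMA-2-70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch_size=1)
per_seq_4k = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=1)
batch16_4k = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")          # 320 KiB
print(f"4k sequence: {per_seq_4k / 2**30:.2f} GiB")       # 1.25 GiB
print(f"batch 16 at 4k: {batch16_4k / 2**30:.1f} GiB")    # 20.0 GiB
```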
Now extend the context. At 8,192 tokens, the per-sequence KV cache doubles to 2.56 GB. At batch size 16, that is 41 GB of cache. Total HBM required: 181 GB, requiring three A100s. At 128k tokens, the per-sequence KV cache is 40.96 GB. A single request at 128k context on LLaMA-2-70B requires 41 GB of KV cache alone, on top of 140 GB of weights, for a total of 181 GB before considering anything else. At batch size 4, the 128k cache alone is 160 GB, and the combined requirement reaches 300 GB, demanding four A100s for a batch of just four requests.
These numbers make the practical constraint concrete: long context is not just slow, it is expensive in memory, and the cost scales linearly with context length. Every 2x increase in context length costs 2x in KV cache memory, directly displacing batch capacity. A serving cluster that handles 100 concurrent 4k-context requests can handle at most 50 concurrent 8k requests with the same hardware, assuming no other changes. The engineering decisions around context length limits, batch size caps, and hardware allocation all trace back to this formula.
Why this kills throughput: static batching vs. continuous batching
The original inference serving approach, common before 2022, used static batching. A batch is assembled from N requests, all requests are padded to the same sequence length, the batch runs to completion (all requests finish generating), and then a new batch is assembled. This is simple to implement and maps cleanly onto GPU batch execution, but it wastes memory in two distinct ways.
First, static batching pre-allocates KV cache for each request's maximum possible context length upfront, regardless of how many tokens that request has actually generated. If you configure a maximum context of 4,096 tokens and batch size 10, you reserve 10 × 1.28 GB = 12.8 GB of KV cache at initialization time, even when most requests in the batch have only generated 50 tokens. The reserved memory sits idle, unavailable for other requests.
Second, requests finish at different times, but the batch cannot release a slot until every request in it has completed. A request that generates a 20-token response occupies its reserved KV cache memory for the full duration of the batch, even if the other 9 requests are still generating 2,000 tokens. This creates a straggler problem: the batch's duration is set by its longest request, and GPU utilization is high only when all requests in a batch are actively generating, which is rarely the case when output lengths are heterogeneous.
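The scale of the pre-allocation waste is easy to quantify. A sketch using the 4,096-token, batch-10 configuration above, with a hypothetical snapshot of how many tokens each request actually holds (the counts are illustrative, not measured):

```python
MAX_CONTEXT = 4096
KV_PER_TOKEN = 327_680  # bytes per token for LLaMA-2-70B in fp16

# Hypothetical snapshot: tokens actually held (prompt + generated) per request.
tokens_in_use = [850, 210, 60, 1900, 4096, 330, 120, 75, 2600, 45]

# Static batching reserves MAX_CONTEXT for every slot up front.
reserved = len(tokens_in_use) * MAX_CONTEXT * KV_PER_TOKEN
used = sum(tokens_in_use) * KV_PER_TOKEN

print(f"reserved: {reserved / 2**30:.1f} GiB")            # 12.5 GiB
print(f"actually used: {used / 2**30:.1f} GiB")           # 3.1 GiB
print(f"wasted: {100 * (1 - used / reserved):.0f}%")      # 75%
```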
PagedAttention, the core innovation introduced in the vLLM paper (Kwon et al., 2023), applies the virtual memory paging concept to KV cache management. Instead of allocating one contiguous memory region per request, the entire KV cache pool is divided into fixed-size blocks, typically 16 or 32 tokens per block. Each request is assigned blocks on demand as it generates tokens. When a request completes, its blocks are immediately returned to the free pool and become available for any new request. This eliminates two failure modes simultaneously: there is no upfront pre-allocation waste, and there is no held-memory problem from finished requests in a running batch.
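The allocation discipline can be sketched as a toy block pool. This is an illustration of the idea only, not vLLM's actual implementation, which also tracks per-layer tensors, reference counts, and copy-on-write:

```python
class BlockPool:
    """Toy sketch of PagedAttention-style allocation: the KV cache is a pool
    of fixed-size blocks handed out on demand and returned on completion."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(total_blocks))  # indices of free physical blocks
        self.tables = {}   # request id -> list of physical block indices
        self.lengths = {}  # request id -> tokens currently cached

    def append(self, req_id, n_tokens=1):
        """Grow a request's block table on demand as tokens arrive."""
        table = self.tables.setdefault(req_id, [])
        self.lengths[req_id] = self.lengths.get(req_id, 0) + n_tokens
        while len(table) * self.block_size < self.lengths[req_id]:
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free.pop())

    def release(self, req_id):
        """A finished request returns its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

pool = BlockPool(total_blocks=8, block_size=16)
pool.append("req-a", 40)   # prefill: 40 tokens -> 3 blocks allocated
pool.append("req-a")       # decode one token: 41 tokens still fit in 3 blocks
pool.release("req-a")      # all 3 blocks return to the free pool
print(len(pool.free))      # 8
```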
PagedAttention also enables a second optimization: block sharing. If two requests have identical prefixes (a common pattern with system prompts), the K and V blocks for those shared tokens can be stored once and referenced by both requests without duplication. This prefix caching further reduces effective KV cache memory consumption at the cost of some bookkeeping complexity.
The throughput improvement is dramatic. The vLLM paper reports 2 to 4x higher throughput than prior state-of-the-art serving systems, and up to 24x over naive static batching with stock HuggingFace Transformers, primarily because PagedAttention enables much higher effective batch sizes by eliminating wasted cache memory. Continuous batching compounds this by allowing new requests to join the batch at each generation step rather than waiting for the entire current batch to complete. When a request finishes generating, the scheduler immediately schedules a waiting request to fill that slot. The batch never goes idle between requests; GPU utilization stays high by construction.
The combination of PagedAttention and continuous batching is now the standard in production serving frameworks. vLLM ships it by default. TensorRT-LLM implements an equivalent mechanism called in-flight batching. SGLang uses a similar approach with additional optimizations for programs that share prefix structure. If you are running a serving stack that predates these frameworks, or using a custom serving setup without these mechanisms, the static batching overhead is likely the largest single contributor to low GPU utilization on your inference fleet.
Quantized KV cache: trading precision for memory
KV cache quantization reduces the bytes per element in the cache from 2 (fp16) to 1 (int8) or 0.5 (int4 or fp4), cutting cache memory by 2x or 4x respectively. The arithmetic is appealing: 2x reduction in cache size at batch size 16 and 4k context takes the LLaMA-2-70B KV cache from 20.5 GB down to 10.25 GB, freeing 10 GB of HBM that can be reallocated to larger batches or longer context.
The accuracy impact deserves careful attention because it is not uniform across precision levels.
Int8 KV cache stores K and V tensors as 8-bit integers with a per-token scale factor. Accuracy degradation on standard benchmarks is typically 0.3 to 1.0 perplexity points. For most production tasks, including question answering, summarization, and chat applications, this is imperceptible in practice. The model's outputs are statistically indistinguishable from fp16 for a user who is not specifically running controlled A/B experiments. vLLM and TensorRT-LLM both support int8 KV cache quantization as a production-ready feature, and it is widely deployed in large-scale serving.
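The per-token scale-factor scheme can be sketched in a few lines of NumPy. This assumes symmetric quantization (one scale per token, no zero point); production kernels handle memory layout and fusion details omitted here:

```python
import numpy as np

def quantize_per_token(kv, num_bits=8):
    """Symmetric per-token quantization: one scale per token row, chosen so
    the token's max |value| maps to the int8 limit."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for int8
    scale = np.abs(kv).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 128)).astype(np.float32)  # 4 tokens, head_dim 128
q, scale = quantize_per_token(kv)
err = np.abs(dequantize(q, scale) - kv).max()
print(f"max abs reconstruction error: {err:.4f}")      # roughly scale / 2
```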
FP8 KV cache uses the E4M3 floating-point format: 4 exponent bits and 3 mantissa bits. The larger dynamic range of FP8 compared to int8 handles models with wide activation distributions more gracefully. In practice, FP8 KV cache achieves roughly 0.1 to 0.5 perplexity degradation, lower than int8 for architectures where activation scales vary significantly across tokens. NVIDIA H100 hardware provides native FP8 compute support at the tensor core level, which means FP8 KV cache operations do not incur the dequantization overhead that int8 requires on A100. On H100 clusters, FP8 KV cache is the preferred default.
Int4 KV cache is a different story. At 4-bit precision, the quantization error is large enough to affect model behavior on tasks that depend on fine-grained attention patterns, particularly multi-step reasoning, retrieval over long contexts, and tasks with high sensitivity to numerical precision. Int4 should not be deployed in production without per-task evaluation, and even then it warrants skepticism. The 4x memory reduction is tempting, but the accuracy risks are real enough that int8 or FP8 is almost always the right choice unless you have specific evidence that the tasks you are serving are tolerant of 4-bit precision.
The practical decision tree: for most 7B to 70B models on general-purpose tasks, use int8 KV cache as the production default. On H100 hardware, use FP8 instead. Validate on task-specific metrics before deploying int4 for any production workload, and treat it as an optimization of last resort when memory constraints are extreme.
Speculative decoding: hiding latency with a draft model
The per-token generation latency is determined by the memory bandwidth required to load model weights and KV cache values for each decoding step, not by arithmetic compute. Modern GPUs are compute-bound during prefill, when they process many tokens in parallel and the matrix multiplications are large enough to saturate the tensor cores. During decoding, however, the model processes exactly one token at a time. The matrix operations are tiny relative to the memory reads needed to load 140 GB of weights and the accumulated KV cache. The GPU spends most of each decoding step waiting for data to arrive from HBM rather than doing arithmetic.
This memory-bandwidth bottleneck has a specific implication: generating two tokens takes almost exactly twice as long as generating one, because both steps do the same amount of memory loading regardless of how many useful computations happen. The arithmetic work for a second token is negligible compared to the memory traffic. If you could somehow process multiple tokens in a single decoding step, you would get those extra tokens nearly for free.
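A back-of-the-envelope roofline estimate makes the bound concrete. The sketch below assumes the weights are split across two A100-80GB cards with roughly 2 TB/s of HBM bandwidth each, treated as additive under tensor parallelism, which is a simplification:

```python
def decode_step_ms(weight_bytes, kv_bytes, hbm_bandwidth_bytes_per_s):
    """Lower-bound decode-step latency: every step must stream the weights
    plus the accumulated KV cache from HBM, regardless of batch size."""
    return 1000 * (weight_bytes + kv_bytes) / hbm_bandwidth_bytes_per_s

# LLaMA-2-70B in fp16, batch 16 at 4k context, two A100s (~2 TB/s each,
# assumed additive -- a simplification).
weights = 140e9
kv_batch16_4k = 16 * 4096 * 327_680
step = decode_step_ms(weights, kv_batch16_4k, 2 * 2.0e12)
print(f"~{step:.0f} ms per decoding step, shared by all 16 sequences")  # ~40 ms
```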
Speculative decoding achieves this by running two models in sequence. A draft model, small and fast, generates k candidate tokens. The target model, the large production model, then verifies all k tokens in a single parallel forward pass. Verification is a prefill-like operation: the target model processes k tokens simultaneously, which saturates the tensor cores and amortizes the weight-loading cost across all k tokens at once. Tokens where the target model's distribution agrees with the draft model are accepted; the first token where they disagree triggers rejection of that token and all subsequent speculative tokens from that step. On average, the number of accepted tokens per step ranges from 2 to 4 for well-matched draft and target models, yielding 2 to 4x end-to-end latency reduction.
The theoretical basis for why this is correct, not just an approximation, comes from the rejection sampling formulation. The accepted tokens are sampled exactly from the target model's distribution, not the draft model's. The algorithm guarantees that the distribution of the output sequence is identical to what the target model would have produced if run autoregressively without a draft. This is a strong guarantee and distinguishes speculative decoding from approximations that produce a distribution slightly different from the original model.
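The acceptance rule itself is short. A toy sketch of the per-token accept/reject step, using hypothetical three-token distributions (the resample-from-residual step that follows a rejection is omitted for brevity):

```python
import random

def speculative_accept(draft_probs, target_probs, drafted_tokens, rng=random):
    """Rejection-sampling acceptance rule from speculative decoding: accept
    drafted token t with probability min(1, p_target(t) / p_draft(t));
    the first rejection discards that token and all later drafted tokens."""
    accepted = []
    for t, p, q in zip(drafted_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, q[t] / p[t]):
            accepted.append(t)
        else:
            break
    return accepted

# Hypothetical 3-token vocabulary; one distribution per drafted position.
draft  = [{0: 0.7, 1: 0.2, 2: 0.1}, {0: 0.5, 1: 0.4, 2: 0.1}]
target = [{0: 0.9, 1: 0.05, 2: 0.05}, {0: 0.1, 1: 0.8, 2: 0.1}]
random.seed(0)
print(speculative_accept(draft, target, drafted_tokens=[0, 0]))
```

When draft and target agree exactly, every token is accepted; the further the target probability falls below the draft probability, the more often the chain is cut short.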
The efficiency depends entirely on the acceptance rate, and the acceptance rate depends on how well the draft model approximates the target. For code generation, a small code-specialized draft model can predict common syntactic patterns, function signatures, and idiomatic constructs with high accuracy. Acceptance rates of 70 to 80% are achievable in practice, giving close to 3x speedup. For open-ended creative writing or adversarial prompts where the target distribution is diverse and unpredictable, acceptance rates drop to 40 to 60%, and the speedup is correspondingly lower.
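Under the simplifying assumption that each drafted token is accepted independently with probability alpha, the speculative decoding paper (Leviathan et al., 2023) gives a closed form for the expected tokens emitted per verification step:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens per verification step with draft length k, assuming
    i.i.d. per-token acceptance probability alpha (Leviathan et al., 2023):
    E = (1 - alpha**(k + 1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8):
    e = expected_tokens_per_step(alpha, k=4)
    print(f"alpha={alpha}: {e:.2f} tokens/step")
# alpha=0.5: 1.94, alpha=0.7: 2.77, alpha=0.8: 3.36 -- matching the
# 2-to-4-tokens-per-step range quoted above.
```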
The draft model must be small enough that running it adds negligible overhead relative to the target model. The standard sizing heuristic is 1/10 to 1/20 of the target model's parameter count. A 70B model typically pairs with a 1.3B or 7B draft. An 8B model pairs with a 160M to 500M draft. The draft model also accumulates its own KV cache during speculative generation, adding a small but nonzero memory cost. For a 1.3B draft with 4 KV heads and 24 layers, the additional cache at 100 tokens is roughly 4.8 MB, negligible compared to the target model's cache.
One practical consideration: speculative decoding requires the draft and target models to have compatible tokenizers. If the target model switches tokenizer versions or vocabulary sizes between model updates, the draft model needs to be retrained or replaced. In high-cadence model update environments, this coupling introduces operational friction. Some teams handle this by using a smaller version of the same model family as the draft (e.g., LLaMA-3-8B as draft for LLaMA-3-70B), which ensures tokenizer compatibility and generally achieves better acceptance rates than a cross-family draft.
The three metrics that actually predict user experience
Infrastructure teams often optimize aggregate throughput, measured as total tokens per second across all requests, while users experience something quite different. Total throughput and user-perceived latency are not the same metric and do not move in the same direction under the same interventions. Measuring only throughput while ignoring per-request latency produces systems that are efficient for the operator but frustrating for users.
The three metrics that actually predict user experience are distinct, require separate instrumentation, and have different hardware drivers.
Time to first token (TTFT) is the latency from request submission to the delivery of the first output token. For a user at a chat interface, this is the perceived "thinking time," the gap between pressing Enter and seeing the first word appear. TTFT is dominated by prefill latency, which scales with input length. A 128k-token context request has a TTFT of 2 to 5 seconds on A100 hardware with Flash Attention. A 4k-token request finishes prefill in under 200ms. The right metric to track is TTFT at the 95th percentile. The average hides tail latency that causes visible stalls for the unlucky 5% of users, and it is exactly those stalls that drive abandonment and complaints. Chunked prefill, which splits long prefill operations across multiple steps so they do not block shorter requests from starting, is the primary technique for improving TTFT without reducing throughput.
Time per output token (TPOT) is the latency between consecutive output tokens, which a user experiences as the streaming speed of text appearing. TPOT is dominated by memory bandwidth. At each decoding step, the model must load its full weight matrix and the accumulated KV cache from HBM. For LLaMA-2-70B in fp16 on an A100 with a well-sized batch, TPOT is typically 20 to 50ms per token. At TPOT below 40ms, streaming text feels fluid and continuous, indistinguishable from a human typing fast. At TPOT above 100ms, users perceive the output as choppy and the experience degrades noticeably. At TPOT above 200ms, the streaming metaphor breaks down entirely and users perceive it as the model "thinking" between each word. Quantization, GQA, and speculative decoding all improve TPOT directly because they reduce the memory bandwidth cost of each decoding step.
Tokens per second per GPU (TPS/GPU) is the infrastructure efficiency metric. It measures how many output tokens each GPU produces per second in aggregate across all concurrent requests. TPS/GPU determines the cost per token: higher TPS/GPU means more tokens produced per dollar of GPU time. Optimizing TPS/GPU reduces operating cost and is the right metric for infrastructure sizing decisions and hardware procurement. It is orthogonal to per-request latency: very high TPS/GPU is achievable by running enormous batch sizes, which increases queuing latency and makes TTFT worse for individual requests.
The tension between these three metrics is fundamental and cannot be engineered away.
Large batch sizes improve TPS/GPU by spreading fixed memory-loading costs across more requests, but they increase queuing latency because new requests must wait for a slot, raising TTFT. Short-context requests have excellent TTFT but leave most KV cache capacity idle, wasting HBM that could serve long-context requests. Long-context requests improve KV cache utilization but consume so much HBM per request that they reduce the maximum achievable batch size, hurting TPS/GPU.
The right operating point is not universal. A consumer chat application with tight latency SLOs needs low TTFT and TPOT, which means limiting batch sizes and accepting lower TPS/GPU. A batch inference pipeline processing millions of documents overnight cares only about TPS/GPU and can tolerate arbitrarily high per-request latency. A retrieval augmented generation pipeline with long prompts and short outputs is primarily prefill-bound and needs chunked prefill and fast TTFT on long inputs. Each deployment context has a different optimum, and finding it requires measuring all three metrics separately rather than relying on aggregate throughput as a proxy for user experience.
The engineering implication is concrete: instrument TTFT p95, TPOT p50 and p95, and TPS/GPU separately in your serving infrastructure. Set SLOs on each independently. Tune batch size, prefill chunking, and quantization against those SLOs rather than against aggregate throughput. The serving framework configurations that maximize TPS/GPU are rarely the configurations that minimize TTFT, and treating them as equivalent is a common source of production serving systems that look efficient on dashboards while delivering poor user experience.
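Computing TTFT and TPOT from raw timestamps is straightforward. A sketch using a simple nearest-rank percentile; a production system would use a streaming quantile estimator and real request logs, and the sample data below is hypothetical:

```python
import statistics

def request_metrics(submit_ts, token_ts):
    """Per-request latency metrics from timestamps in seconds:
    TTFT = first token minus submission; TPOT = mean inter-token gap."""
    ttft = token_ts[0] - submit_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    tpot = statistics.mean(gaps) if gaps else 0.0
    return ttft, tpot

def p95(values):
    """Nearest-rank 95th percentile (sketch; use a proper quantile
    library at scale)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

# Hypothetical fleet sample: (submit time, per-token delivery times).
requests = [
    (0.00, [0.15, 0.19, 0.23, 0.27]),
    (0.00, [0.90, 0.96, 1.02]),
    (0.00, [0.18, 0.25, 0.32]),
]
ttfts, tpots = zip(*(request_metrics(s, t) for s, t in requests))
print(f"TTFT p95: {p95(ttfts) * 1000:.0f} ms, "
      f"TPOT p95: {p95(tpots) * 1000:.0f} ms")
```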
References
Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP, 2023. https://arxiv.org/abs/2309.06180
Leviathan, Y., Kalman, M., and Matias, Y. "Fast Inference from Transformers via Speculative Decoding." ICML, 2023. https://arxiv.org/abs/2211.17192
Chen, C., Borgeaud, S., Irving, G., et al. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv, 2023. https://arxiv.org/abs/2302.01318
Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv, 2019. https://arxiv.org/abs/1911.02150
Ainslie, J., Lee-Thorp, J., de Jong, M., et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP, 2023. https://arxiv.org/abs/2305.13245
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS, 2022. https://arxiv.org/abs/2205.14135