Why Your 1M-Token Context Window Is Mostly Wasted
In 2024, model providers raced to announce context window sizes as if they were horsepower ratings on a sports car. Gemini 1.5 Pro shipped with 1 million tokens. Claude 3 followed with 200k. GPT-4 Turbo landed at 128k. The implicit promise was that you could feed an entire codebase, a legal document corpus, or a research library to a model and ask questions across all of it seamlessly. Engineers started reaching for long context as a way to avoid the complexity of building retrieval systems.
The reality is more complicated. A large context window is a real and useful capability, but it is not a uniform one. The model does not attend equally to everything in its context. Information placed in the middle of a long sequence is processed less reliably than information placed at the beginning or the end. As established in the previous post, the cost of long context grows with the square of sequence length, making naive use prohibitively expensive at scale. And for most production applications, a well-built retrieval system beats a long context window on both quality and cost.
This post goes through the research on why long context behaves the way it does, what the actual numbers look like for cost and quality, and how to make engineering decisions that account for these realities rather than the marketing.
The lost-in-the-middle problem: a U-shaped performance curve
The most important empirical result about long-context model behavior comes from a 2023 paper by Liu et al., titled "Lost in the Middle: How Language Models Use Long Contexts." The experiment was straightforward. The researchers presented models with a multi-document question answering task where the relevant document was placed at different positions within the context window. Position 1 placed it at the very beginning. Position 20 placed it at the very end. Positions 2 through 19 buried it progressively deeper in the middle.
The results were stark. For GPT-3.5-Turbo with a 16k context, accuracy on the retrieval task was approximately 71% when the relevant document appeared at position 1 (the start of the context). It was 69% when the document appeared at position 20 (the end). When the document appeared at positions 9 through 11, accuracy dropped to below 50%, lower than the model's closed-book accuracy on the same questions: the extra context actively hurt. The performance curve plotted as a function of position looks like a U: strong at both ends, degraded in the middle.
This U-shape was not specific to GPT-3.5-Turbo. The authors tested GPT-4, Claude 1.3, and several open-source models and found the same qualitative pattern across all of them, with varying severity. Longer context windows tended to exhibit more pronounced degradation in the middle, not less. Instruction-tuned models showed the effect more strongly than base models, suggesting that RLHF training may have reinforced a tendency to focus on the beginning (where system prompts live) and the end (where the most recent user message appears).
The model doesn't read your context the way you read a document. It has a primacy bias toward the beginning, a recency bias toward the end, and a dead zone in between.
Understanding why this happens requires going back to what attention actually computes. As established in the first post in this series, attention weights are determined by the softmax of scaled dot products between queries and keys. For very long sequences, the attention distribution becomes increasingly diffuse. A query token has to "choose" how to distribute its attention across tens of thousands of key tokens. The softmax distribution over a long sequence tends to concentrate weight on a few highly similar tokens and distribute the remainder thinly across everything else. Tokens in the middle of a long document, which have no special positional advantage, get low weight unless the query is specifically looking for something very similar to them.
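The dilution effect is easy to see numerically. The following sketch (NumPy; the head dimension and similarity noise are arbitrary choices for illustration) measures how much softmax weight a query can place on a single well-matched key as the number of competing random keys grows:

```python
import numpy as np

def attention_weight_on_target(seq_len: int, d: int = 64) -> float:
    """Softmax weight a query places on a single well-matched key when it
    competes with seq_len - 1 random distractor keys."""
    rng = np.random.default_rng(0)  # fixed seed: same query in every call
    q = rng.normal(size=d)
    target_key = q + 0.1 * rng.normal(size=d)      # similar to the query
    distractors = rng.normal(size=(seq_len - 1, d))
    keys = np.vstack([target_key, distractors])
    scores = keys @ q / np.sqrt(d)                 # scaled dot products
    weights = np.exp(scores - scores.max())        # stable softmax
    return float(weights[0] / weights.sum())

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} keys -> weight on target {attention_weight_on_target(n):.3f}")
```

The target's score does not change as the sequence grows, but the denominator of the softmax does, so the weight on the relevant token is steadily eroded by sheer volume of competing keys.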
There is also an important training distribution effect. Most pre-training text and instruction-following data is not structured to require integration of information buried in the middle of a 100k-token context. The model has learned patterns from documents where critical information tends to appear at the beginning or end of a passage because that is where human writers put it. The model's attention patterns reflect this learned prior, not a principled capability to retrieve from any position.
What positional encoding actually does, and where it breaks
To understand when long context works and when it fails, you need to understand how the model knows where in the sequence a token is. Without positional information, attention is permutation-invariant: shuffling all the tokens would produce the same output. The position encoding is what turns an unordered set of vectors into a sequence with a defined order.
The original transformer used absolute sinusoidal position embeddings: for each position i and each dimension d, the embedding was a sine or cosine function of i/10000^(2d/d_model). These embeddings were fixed (not learned) and had a specific mathematical property: the dot product between position embeddings at positions i and j depends only on the distance |i - j|, which is useful. But they had a critical limitation: the model was trained on sequences up to length N, and at inference time, positions beyond N produced embedding values the model had never seen during training. Performance degraded sharply beyond the training length.
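The relative-distance property of sinusoidal embeddings can be verified directly. A minimal sketch (NumPy; sequence length and model dimension chosen arbitrarily):

```python
import numpy as np

def sinusoidal_pe(max_pos: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position embeddings from the original transformer:
    sin/cos of pos / 10000^(2k/d_model) in alternating dimensions."""
    pos = np.arange(max_pos)[:, None]
    k = np.arange(d_model // 2)[None, :]
    angles = pos / 10_000 ** (2 * k / d_model)
    pe = np.empty((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(512, 128)
# sin(i)sin(j) + cos(i)cos(j) = cos(i - j), so each dimension pair's
# contribution to the dot product depends only on the distance:
print(pe[10] @ pe[15], pe[400] @ pe[405])  # both are distance 5
```

The two printed values are identical (up to floating-point error) even though the absolute positions differ by hundreds, which is exactly the property the text describes.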
Later models switched to learned absolute position embeddings, which solved some issues but made extrapolation even worse. A learned embedding for position 4,096 was trained with gradients. A position 8,192 embedding, requested for a sequence twice the training length, simply didn't exist.
Rotary Position Embeddings (RoPE; Su et al., 2021), used in LLaMA 2, LLaMA 3, Mistral, Qwen, and most modern open-source models, take a fundamentally different approach. Instead of adding a position-dependent vector to the token embedding, RoPE rotates the query and key vectors by a position-dependent angle before the dot product. The rotation for position m is a 2D rotation in each pair of dimensions of the vector, with the rotation angle equal to m times a fixed frequency θ_d = 1/10000^(2d/d_model).
The key insight is what this does to the dot product Q_m · K_n. Because both Q and K are rotated by their respective positions, the dot product computes as a function of (m - n), the relative position, rather than the absolute positions m and n separately. This means attention naturally encodes relative position rather than absolute position, and the model can generalize more smoothly to sequences longer than those seen in training because it is learning relationships between relative positions rather than memorizing patterns at specific absolute positions.
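This relative-position property can also be checked numerically. A small sketch (NumPy, rotating even/odd dimension pairs; the head size is arbitrary):

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate each (even, odd) dimension pair of x by pos * theta_k,
    using the standard RoPE frequency schedule."""
    d = x.shape[0]
    theta = 1.0 / 10_000 ** (2 * np.arange(d // 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset m - n = 4 at very different absolute positions:
s_near = rope(q, 7) @ rope(k, 3)
s_far = rope(q, 104) @ rope(k, 100)
print(s_near, s_far)  # equal up to floating-point error
```

Because both vectors are rotated, the two rotations compose into a single rotation by (m − n), so the attention score is unchanged when the pair is translated along the sequence.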
RoPE is not infinitely extensible. The rotation frequencies are fixed during training, and at very long extrapolation lengths the positions begin to alias: the rotation angles wrap around and positions that are very far apart in the sequence become indistinguishable from positions that are close together. Various techniques like Position Interpolation (PI) and YaRN (Yet another RoPE extensioN) address this by rescaling the rotation frequencies, effectively squeezing a longer sequence into the same angular range the model was trained on. This is how LLaMA models trained at 4k context were extended to 32k and 128k without full retraining from scratch.
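Position Interpolation itself is essentially a one-line change: rescale every position by L_train / L_target before the rotation, so even the fastest-rotating dimension stays inside the angular range seen in training. A sketch of the arithmetic (the lengths are illustrative; the fastest RoPE frequency is θ_0 = 1, so the angle at position m is simply m radians):

```python
L_train, L_target = 4_096, 32_768
scale = L_train / L_target  # Position Interpolation's rescaling factor

# Rotation angle of the fastest dimension at the last position,
# with and without interpolation:
raw_angle = (L_target - 1) * 1.0         # far beyond the trained range
pi_angle = (L_target - 1) * scale * 1.0  # squeezed back inside it
print(raw_angle, pi_angle, "trained max:", (L_train - 1) * 1.0)
```

The cost of this squeezing is reduced positional resolution: nearby tokens become angularly closer together, which is why interpolated models typically need a short fine-tuning run at the extended length.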
ALiBi (Attention with Linear Biases; Press et al., 2022), used in MPT and some BLOOM variants, takes a different approach that is arguably more elegant. Instead of modifying the position representations, ALiBi adds a fixed, non-learned bias to the attention scores before softmax. The bias for token i attending to token j is simply -m * |i - j|, where m is a small slope that varies across attention heads. Tokens that are farther away get a larger negative bias, making the model naturally discount distant tokens in proportion to their distance.
The practical implication of ALiBi is that it is by construction a relative position encoding with a smooth distance penalty built in. It extrapolates to longer sequences more gracefully than absolute embeddings because the bias formula is defined for any distance, not just distances seen in training. The cost is that ALiBi encodes a prior that nearer tokens are more relevant, which is often but not always true. For tasks requiring global document-level understanding rather than local context, this prior can be a limitation.
The cost arithmetic that changes the calculation
Long context is not just a capability question. It is a cost question, and the costs compound in ways that make large-scale use prohibitively expensive unless the capability is genuinely necessary.
At the time of writing, GPT-4o is priced at $2.50 per million input tokens and $10.00 per million output tokens. A single request with a 128k-token context costs $0.32 in input tokens alone before the model generates a single output token. If you are summarizing a 50-page document at roughly 750 words per page and 1.3 tokens per word, that is about 49,000 input tokens, which costs $0.12 per summarization call. Run this 100,000 times per month (a modest production workload for a document processing API) and the input token cost alone reaches $12,000 per month for a single feature.
Now consider the same task handled with a retrieval approach: embed the document (a one-time cost), retrieve the 5 most relevant chunks for each query (approximately 2,000 tokens of context), and generate from those. The input token cost per call drops from $0.12 to roughly $0.005, a 24x reduction. The retrieval infrastructure (a vector database and an embedding model) adds cost, but at any reasonable query volume the economics are decisively in favor of retrieval.
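The comparison is simple enough to keep in a spreadsheet, but a few lines make it explicit (prices as quoted above; token counts are the estimates from the example):

```python
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # GPT-4o input pricing

def monthly_input_cost(tokens_per_call: int, calls_per_month: int) -> float:
    """Input-token spend per month for one feature."""
    return tokens_per_call * calls_per_month * PRICE_PER_INPUT_TOKEN

full_doc = monthly_input_cost(49_000, 100_000)  # whole 50-page document
rag = monthly_input_cost(2_000, 100_000)        # top-5 retrieved chunks
print(f"full document: ${full_doc:>9,.2f}/month")
print(f"retrieval:     ${rag:>9,.2f}/month")
print(f"ratio:         {full_doc / rag:.1f}x")
```

Note that this counts input tokens only; output tokens, embedding costs, and vector-database hosting shift the totals somewhat but not the conclusion.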
The memory cost on the serving side is equally significant, as established in the previous post. A 128k-token context on LLaMA-3-70B with GQA requires approximately 26 GB of KV cache for a single request. On a two-A100 node with 160 GB of HBM, you can serve at most 5 concurrent requests at 128k context before running out of memory. Reduce context to 4k tokens and the same hardware serves roughly 160 concurrent requests. The cost per served request at the same throughput target is 32x higher for long context. This is why production deployments of long-context models require significantly larger infrastructure than short-context deployments of equivalent quality.
The latency picture is also unfavorable. The time to first token (TTFT) for a 128k-context request on A100 hardware with Flash Attention is on the order of 2–5 seconds for a 70B model, compared to under 200ms for a 4k-context request. This is primarily the cost of the prefill pass, where all 128k input tokens are processed in parallel. For user-facing applications with latency requirements below 1 second, long context is often simply not viable, regardless of quality.
When long context genuinely wins
The costs and quality limitations above do not mean long context is a bad idea. They mean it is a specialized tool that is valuable in specific situations where retrieval-based approaches genuinely cannot substitute for it.
The clearest win for long context is tasks that require coherent reasoning across a document that cannot be decomposed into independent chunks. Consider analyzing a 50,000-token legal contract to identify all clauses that interact with each other in ways that create risk. A retrieval system would embed chunks independently and retrieve based on semantic similarity to a query. But the risk emerges from the interaction between clause 3.2 and clause 17.4 and appendix B, none of which may be semantically similar to a query about risk. The model needs to hold the entire document in working memory and reason across it globally. Long context is the right tool here.
Code repository analysis has a similar structure. Tracing how a change to a data model propagates through a service boundary and affects downstream consumers requires understanding the full call graph across multiple files. A retrieval system retrieves individual files or functions based on semantic similarity but may miss the structural dependency chain unless the retrieval is specifically designed to follow call graphs. Long context allows the model to see the entire relevant code path at once.
Multi-document synthesis, where the output requires integrating information from several sources that partially contradict each other, also benefits from long context. A model synthesizing five research papers with conflicting findings about a drug interaction needs to see all five simultaneously to produce a coherent, qualified summary. Retrieving chunks from each paper independently loses the comparative structure.
The common thread across these winning cases is that the task requires global reasoning: relationships between parts of the context that cannot be identified without seeing all parts simultaneously. For tasks where the answer is localized to a specific part of the context (answer a question given a document, find a specific fact, classify a passage), retrieval is almost always better on both cost and quality.
The decision framework
Before reaching for long context on a new task, three questions are worth working through explicitly.
The first is whether the task requires global reasoning or local retrieval. If you could theoretically answer the question given the right 5 paragraphs from the source material, retrieval is the right architecture. If the answer requires synthesizing relationships across the entire source, long context may be justified.
The second is whether the relevant information is at risk of landing in the middle. If you are placing retrieved documents into a long context and cannot control where the most relevant document appears, the lost-in-the-middle problem will hurt you in proportion to the number of documents and the length of each. A practical mitigation, backed by the Liu et al. findings, is to place the most likely relevant document first in the context or last, not in the middle. If you are not willing to engineer your context placement carefully, retrieval with a short context is a more reliable choice.
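One way to implement that placement is to reorder retrieved documents so relevance decreases from both ends toward the middle, keeping the strongest candidates out of the dead zone. A sketch (`order_for_context` is an illustrative helper, not a library function):

```python
def order_for_context(docs_by_relevance: list[str]) -> list[str]:
    """Interleave documents so the most relevant sit at the edges of
    the context and the least relevant end up in the middle."""
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Input ranked best-first; after reordering, the top document leads the
# context and the runner-up closes it:
print(order_for_context(["doc1", "doc2", "doc3", "doc4", "doc5"]))
```

This is the same idea as placing the single most relevant document first or last, generalized to a ranked list.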
The third is whether the cost is justified by the value. At $0.32 per request for 128k context on GPT-4o, a million requests per month costs $320,000 in input tokens alone. If the task can be accomplished at comparable quality with a 4k context and retrieval, the long-context version needs to provide substantial value to justify the 30x cost multiplier. This calculation should happen explicitly, not after the infrastructure is built and the invoices arrive.
The practical engineering pattern
The pattern that consistently works well in production is a hybrid. Build a retrieval system that surfaces the top 3 to 5 chunks most relevant to each query. Place those chunks at the beginning of the context, not the middle, to take advantage of the primacy effect. Use a short context window (4k to 8k tokens) for the vast majority of requests where the answer is localized. Reserve long context for the small fraction of requests that genuinely require global reasoning, and make that determination programmatically based on task type, not as a default.
For the long-context requests, two specific practices improve reliability. First, ask the model to explicitly cite which part of the provided context supports its answer. This forces attention to the relevant portion and surfaces cases where the model is generating from prior knowledge rather than the provided context. Second, if the context is very long (above 32k tokens), consider breaking the reasoning into stages: use the model to identify which sections of the context are relevant to the question in a first pass, then use a second pass with only those sections plus the original question to generate the final answer. This is effectively doing retrieval at inference time and sidesteps the lost-in-the-middle problem by ensuring the relevant content is always at a known, favorable position.
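The staged approach can be sketched as a small skeleton. Here `call_model` is a stub standing in for whatever LLM client you use, and the prompts and parsing are illustrative, not a fixed recipe:

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call, stubbed so the
    skeleton runs end to end."""
    if prompt.startswith("Which numbered sections"):
        return "Sections 0 and 2 look relevant."
    return "Answer grounded in the provided sections, with citations."

def answer_long_document(sections: list[str], question: str) -> str:
    # Pass 1: cheap relevance triage over short previews of each section.
    index = "\n".join(f"[{i}] {s[:200]}" for i, s in enumerate(sections))
    reply = call_model(
        f"Which numbered sections are relevant to: {question}\n{index}"
    )
    ids = [int(m) for m in re.findall(r"\d+", reply) if int(m) < len(sections)]
    # Pass 2: answer with only the chosen sections, placed at the start
    # of the context (the favorable primacy position), citing sources.
    context = "\n\n".join(sections[i] for i in ids)
    return call_model(f"{context}\n\nQuestion: {question}\n"
                      "Cite which section supports each claim.")

print(answer_long_document(["intro ...", "methods ...", "results ..."],
                           "What was found?"))
```

The first pass can run on a smaller, cheaper model than the second, since triage is an easier task than synthesis.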
The engineers who get the most value from long context windows are not the ones who use them for everything. They are the ones who understand exactly what the capability is, where it degrades, what it costs, and which tasks genuinely require it. A 1M-token context window is an impressive engineering achievement. It is also a tool that is easy to misuse and expensive to misuse at scale.
References
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL, 2024. https://arxiv.org/abs/2307.03172
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing, 2024. https://arxiv.org/abs/2104.09864
Press, O., Smith, N. A., and Lewis, M. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR, 2022. https://arxiv.org/abs/2108.12409
Chen, S., Wong, S., Chen, L., and Tian, Y. "Extending Context Window of Large Language Models via Positional Interpolation." arXiv, 2023. https://arxiv.org/abs/2306.15595
Peng, B., Quesnelle, J., Fan, H., and Shippole, E. "YaRN: Efficient Context Window Extension of Large Language Models." ICLR, 2024. https://arxiv.org/abs/2309.00071
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. "Attention Is All You Need." NeurIPS, 2017. https://arxiv.org/abs/1706.03762
Gemini Team, Anil, R., Borgeaud, S., et al. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv, 2024. https://arxiv.org/abs/2403.05530
Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv, 2024. https://arxiv.org/abs/2404.06654