What the FFN Layers Are Actually Storing (The Transformer as a Key-Value Memory)

· 19 min read
transformers · architecture · ffn · hallucination · rag

Most engineers who work with language models have an implicit mental model of what the feed-forward network layers do: they apply a non-linearity that allows the transformer to compute functions that pure attention cannot express. The FFN is the "MLP part" of the transformer, sitting between the residual connections, doing something vaguely useful. This mental model is not wrong, but it is radically incomplete. The FFN layers are not just a computational primitive. They are the place where the model stores factual knowledge acquired during pre-training. They are the model's long-term memory, encoded in weights.

Understanding this at a mechanistic level changes how you reason about hallucination, about the limits of scaling, and about why retrieval-augmented generation is not just an engineering convenience but an architectural solution to a structural problem.

The Geva et al. result: FFN layers are key-value memories

In 2021, Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy published a paper titled "Transformer Feed-Forward Layers Are Key-Value Memories" in EMNLP. The argument is direct, and once you see it, the interpretation is difficult to unsee.

The FFN sublayer in a standard transformer takes the output of the attention sublayer, passes it through two linear projections with a non-linearity in between, and adds the result back to the residual stream. Written out (with bias terms omitted for clarity), this is FFN(x) = W2 * GELU(W1 * x), where x is the d_model-dimensional vector for a given token position, W1 has shape [d_ffn x d_model], W2 has shape [d_model x d_ffn], and d_ffn is typically four times d_model (though LLaMA uses a different ratio, discussed below).

The key insight is in the intermediate computation. When you compute W1 * x, you are computing a d_ffn-dimensional vector where each element i equals the dot product of the i-th row of W1 with the input x. After applying GELU, you get a d_ffn-dimensional activation vector where many entries are near zero (GELU suppresses small values) and some are strongly positive. Call this activation vector a. Then W2 * a is a weighted sum: for each neuron i, you are adding W2[:,i] (the i-th column of W2) to the output, scaled by a[i].

This structure is exactly a soft key-value retrieval. The rows of W1 are the keys. Each row W1[i] is a d_model-dimensional vector. When you compute W1[i] · x, you are asking how well the input x matches this key. The GELU activation is the gating function: if the match is strong (large positive dot product), the gate opens; if the match is weak or negative, the gate closes. The columns of W2 are the values. When neuron i fires, column W2[:,i] is added to the output with weight proportional to how strongly the key matched.

The entire FFN computation reduces to: which keys match this input? Retrieve their associated values and sum them. This is not a metaphor. It is exactly the arithmetic the weights perform.
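The equivalence of the two readings is easy to verify numerically. A minimal sketch with random weights, using the common tanh approximation of GELU:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 16, 64
W1 = rng.normal(size=(d_ffn, d_model))   # rows of W1 are the keys
W2 = rng.normal(size=(d_model, d_ffn))   # columns of W2 are the values
x = rng.normal(size=d_model)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Reading 1: the usual two-matmul form.
ffn_out = W2 @ gelu(W1 @ x)

# Reading 2: soft key-value retrieval. a[i] measures how well key i
# matched the input; each value column is added in, scaled by its match.
a = gelu(W1 @ x)
retrieved = sum(a[i] * W2[:, i] for i in range(d_ffn))

assert np.allclose(ffn_out, retrieved)   # identical arithmetic
```

The assertion passes because the two forms are literally the same matrix product, expanded differently.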

Geva et al. verified this interpretation empirically by asking what natural language text maximally activates each neuron. Working with a trained autoregressive transformer language model, they retrieved, for each neuron, the training examples whose representations matched its key most strongly. What they found was that neurons had semantically coherent triggers. A neuron would fire maximally on inputs related to "months of the year," another on inputs involving "military ranks," another on sequences containing possessive constructions in English. The keys were not random vectors. They had learned to detect meaningful patterns in language.

The value vectors, W2[:,i], were then analyzed by projecting them into the vocabulary space. Because the residual stream is ultimately read out through the output embedding matrix to produce token logits, projecting a value vector through that same matrix reveals which tokens it promotes when added to the residual stream. For a neuron that fired on "months of the year" inputs, the value vector promoted tokens like "January," "February," "April" in the vocabulary space. The key matched a category; the value promoted members of that category.

The FFN layer as a key-value memory: W1 rows match patterns, W2 columns output associated values

This was followed up by the same group in 2022, in a paper titled "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space." That paper traced how factual associations are promoted across layers, showing that value vectors in upper layers push the final token distribution toward specific factual completions. When the model processes the prompt "The capital of France is," specific neurons in the upper layers fire and their value vectors promote "Paris" in the vocabulary space. The FFN layers are doing fact recall, not abstract computation.
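The projection itself is a single matrix multiply. A toy sketch with an eight-token vocabulary and a random unembedding matrix, where the "value vector" is constructed by hand to overlap with the month embeddings (everything here is synthetic, not weights from a real model):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["January", "February", "April", "Paris", "London", "dog", "run", "blue"]
d_model = 256
E = rng.normal(size=(len(vocab), d_model))   # toy unembedding matrix

# Hand-build a "months" value vector as the mean of the month embeddings,
# standing in for a value column W2[:, i] learned during training.
month_ids = [0, 1, 2]
value = E[month_ids].mean(axis=0)

# Adding this vector to the residual stream shifts the logits of exactly
# the tokens whose embeddings it overlaps with.
logits = E @ value
top = [vocab[i] for i in np.argsort(logits)[::-1][:3]]
print(top)   # the three month tokens rank highest
```

The same projection, applied to real model weights, is what lets the Geva papers read off which tokens each neuron promotes.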

How knowledge distributes across layers

The neurons that fire on months of the year are not randomly distributed across the 32 layers of a model. There is a systematic pattern that has been observed consistently across GPT-2, GPT-Neo, and LLaMA-family models: the type of pattern that activates a neuron varies with its depth in the network.

Lower layers, roughly the first third of the network in a 32-layer model (layers 0 through 10), contain neurons that fire on surface-level linguistic patterns. These neurons respond to morphological features like past-tense verb endings, plural noun suffixes, capitalization patterns, punctuation contexts, and part-of-speech patterns. A neuron in layer 3 might fire strongly whenever the input token is a past-tense verb form. A neuron in layer 7 might fire on commas followed by conjunctions. The value vectors associated with these neurons output syntactically coherent continuations, not factual assertions. They handle the grammar of text.

Middle layers (roughly layers 11 through 22 in a 32-layer model) shift toward semantic and topical patterns. Neurons at this depth fire on semantic co-occurrence and category membership rather than surface form. A neuron at layer 15 might fire on vocabulary associated with sports, regardless of whether the specific tokens are "quarterback," "penalty," or "tournament." Another fires on medical terminology. Another on financial language. The value vectors at this depth promote semantically related tokens, allowing the model to maintain topical coherence across a passage. This is where the model tracks what kind of text it is in.

Upper layers (roughly layers 23 through 31) are where factual associations live. This is the claim that follows from the Geva et al. probing experiments: neurons at this depth fire on specific factual configurations. A neuron in layer 28 might fire when the input is a sentence like "The Eiffel Tower is located in" and the preceding context establishes we are discussing geography. Its value vector promotes "Paris." Another neuron fires on prompts like "Shakespeare was born in" and its value vector promotes "Stratford." The keys have been trained to recognize factual cue patterns; the values have been trained to supply the associated completion.

Meng et al. (NeurIPS 2022) confirmed and extended this picture from a causal intervention perspective in "Locating and Editing Factual Associations in GPT." Their causal tracing localized the decisive computation for fact recall to FFN sublayers at the subject's final token (in their experiments, concentrated in the middle layers of the network), and their ROME method worked by patching the corresponding FFN weight vectors to change which fact the model retrieves. If you want the model to believe that the Eiffel Tower is in Rome rather than Paris, you edit the value vectors in the FFN layers that fire on the relevant key patterns. The fact that surgical edits to FFN weights reliably change specific factual outputs, without disrupting unrelated knowledge, is strong evidence that factual associations are locally encoded in exactly the way Geva et al. described.

Knowledge distributed across transformer layers: syntax in lower layers, factual associations in upper layers

The practical significance of this layer-wise distribution is that the model's factual recall is not a diffuse emergent property spread uniformly across all parameters. It is concentrated in the upper-layer FFN weights. This is why fine-tuning on a small dataset can shift factual associations without requiring a complete re-learning of linguistic competencies: the morphological and syntactic knowledge in lower layers is largely intact, while the factual associations in upper layers are updated.

The parameter count that puts the stakes in perspective

When people talk about "a 7 billion parameter model," the number sounds like a single measure of capacity. Breaking it down by component reveals where those parameters actually live, which changes what you expect the model to be good and bad at.

Take LLaMA-2-7B as a concrete example. Its architecture uses d_model = 4096 and d_ffn = 11008. LLaMA replaces the two-matrix FFN with SwiGLU, which has three projection matrices (gate, up, and down), so the FFN parameter count per layer is 3 x 4096 x 11008, approximately 135M. With 32 transformer layers, the FFN layers contain roughly 32 x 135M = 4.3 billion parameters. Against a total of roughly 6.7 billion, that is about 64 percent of the entire model devoted to the FFN layers.

LLaMA-3-8B pushed d_ffn further to 14336. With d_model = 4096, 32 layers, and the same three-matrix SwiGLU, the FFN layers hold approximately 3 x 4096 x 14336 x 32 = 5.6 billion parameters, which is roughly 70 percent of the 8B total. About two-thirds of the model's parameters sit inside FFN matrices.

The attention layers (Q, K, V, and output projections across all heads, across all layers) hold most of the remaining parameters. Embeddings are a smaller fraction. The rough breakdown for LLaMA-2-7B: embeddings around 260M (input plus output, with a 32000-token vocabulary), attention matrices around 2.1B, FFN layers around 4.3B, and normalization weights a negligible remainder. The FFN layers dominate.
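The breakdown can be recomputed from the published LLaMA-2-7B dimensions (32000-token vocabulary, untied input and output embeddings, four full-rank attention projections per layer, SwiGLU's three FFN matrices, RMSNorm):

```python
# Parameter breakdown for LLaMA-2-7B from its published dimensions.
d_model, d_ffn, n_layers, vocab = 4096, 11008, 32, 32000

embeddings = 2 * vocab * d_model               # input + output embedding
attention  = n_layers * 4 * d_model * d_model  # Q, K, V, O projections
ffn        = n_layers * 3 * d_model * d_ffn    # SwiGLU: gate, up, down
norms      = n_layers * 2 * d_model + d_model  # RMSNorm weights

total = embeddings + attention + ffn + norms
print(f"total: {total:,}")              # 6,738,415,616 -- the "7B"
print(f"FFN share: {ffn / total:.0%}")  # 64%
```

The total matches the model's published parameter count exactly, which is a useful sanity check that no component has been forgotten.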

This matters because each neuron in each layer is a candidate memory slot for a key-value association. LLaMA-2-7B has 32 x 11008 = 352,256 neurons. Each neuron is a (key, value) pair. The model's factual knowledge is distributed across these 352,256 slots, with slots in different layers specializing in different types of patterns as described above. More model capacity means more memory slots, which in principle allows more facts to be stored, but the storage mechanism has properties that scale alone cannot fix.

Knowledge conflicts and the mechanism of hallucination

With the key-value memory picture in place, the mechanism by which hallucination occurs becomes precise rather than mysterious.

Consider the query "Who is the CEO of OpenAI?" Sam Altman was removed from the role in November 2023 and reinstated a few days later. During the interval, Emmett Shear briefly served as CEO. A model trained on a corpus that captures this period has seen text associating "CEO of OpenAI" with Altman, text associating it with Shear during the brief transition, and text associating it with Altman again after reinstatement. Multiple key-value associations have been trained into the upper-layer FFN weights for this factual query pattern.

When the model processes the input tokens encoding "Who is the CEO of OpenAI?", the upper-layer FFN neurons that fire on CEO-of-company and OpenAI patterns activate. Some of these neurons have value vectors that promote "Altman." Others have value vectors that promote "Shear." The FFN output is the weighted sum of all activated value vectors. If the training corpus had significantly more text associating Altman with the role than Shear, the sum of value vectors pointing toward "Altman" will dominate. The model gives the correct answer not because it retrieved a single clean fact, but because the weighted sum happened to land on the right answer.
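The weighted-sum arbitration can be made concrete with a toy example. The vocabulary, embeddings, and activation strengths below are all invented for illustration; the point is only that the heavier side of the sum determines the output token:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["Altman", "Shear", "Paris", "the"]
d = 64
E = rng.normal(size=(len(vocab), d))   # toy unembedding

# Two neurons fired on the "CEO of OpenAI" key pattern. Their value
# vectors (taken here to align with the token embeddings) promote
# conflicting completions learned from different slices of the corpus.
v_altman, v_shear = E[0], E[1]

# Activation strength tracks training frequency: text associating the
# role with Altman vastly outweighed the brief Shear interval.
a_altman, a_shear = 0.8, 0.3
ffn_out = a_altman * v_altman + a_shear * v_shear

logits = E @ ffn_out
print(vocab[int(np.argmax(logits))])   # the heavier side of the sum wins
```

If the activation strengths were closer together, small perturbations from other activated neurons could flip the argmax, which is the arithmetic behind unstable answers to contested facts.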

Now consider asking about a CEO transition that happened three months after the model's training cutoff. The relevant key vectors fire because the model recognizes the pattern of a CEO query about a specific company. But none of the value vectors associated with those keys have been trained on the post-cutoff transition. The weighted sum of the activated value vectors reflects the most recent CEO the model was trained on. The model states this confidently, because confident predictions are what it was trained to produce. The error is not a reasoning failure. It is an absence of training signal for the specific key pattern.

Worse, the model has no reliable way to know that its value vectors are stale. There is no "timestamp" on a stored key-value pair. The model does not have access to when a fact was learned. It only knows that certain key patterns activate certain value configurations, and it reports those activations as facts.

Temperature adjustment does not fix this. Setting temperature to zero does not make the stored values more accurate. It only makes the model more committed to whatever values happen to dominate the weighted sum. A confident wrong answer at temperature 0 is still a wrong answer. The underlying cause is absent or conflicting training signal, and no decoding parameter can remedy that.

Memorization, generalization, and the double descent curve

Not all facts are stored with equal reliability. The strength with which a key-value association is encoded in the FFN weights depends heavily on how frequently the associated pattern appeared during pre-training.

A fact like "Paris is the capital of France" appears in training text millions of times, across encyclopedias, geography textbooks, travel guides, news articles, and conversational web text. The corresponding key vectors in the upper layers have been trained to recognize the relevant patterns across an enormous variety of phrasings. The value vectors have been pushed toward "France" by millions of gradient updates from millions of examples. The association is encoded so redundantly across so many neurons that it is extremely robust: even if 90 percent of the relevant neurons were ablated, the remaining 10 percent would still point the model toward the correct answer.

A fact like "The current mayor of Middletown, Connecticut is Bob Smith" appears perhaps a handful of times in the pre-training corpus. The key vectors for this pattern may barely exist as coherent representations. If they do exist, the value vectors have been shaped by only a small number of training examples. The association is fragile: a few neurons with weak key-value alignment, easily overwhelmed by other activations. The model will either refuse to answer, confabulate something plausible-sounding, or produce a confident wrong answer depending on what nearby patterns dominate at inference time.

This is the memorization-generalization tension at the level of individual facts. Frequent facts are stored reliably through redundant encoding. Rare facts are stored weakly or not at all.

Double descent describes the broader relationship between model capacity and generalization error. The classical bias-variance tradeoff predicts that as model size increases, training error decreases while test error first decreases (as capacity allows the model to capture more of the underlying pattern) and then increases (as the model begins to memorize training examples rather than generalize). This gives the familiar U-shaped test error curve as a function of model capacity.

Belkin et al. (PNAS, 2019) showed that the curve does not end at the minimum. As capacity continues to increase past the point where the model can exactly memorize the training data (the "interpolation threshold"), test error begins decreasing again, often to values below the classical optimum. The mechanism is that models large enough to interpolate through all training data have many ways to do so, and gradient descent tends to find the "simplest" interpolating solution, which generalizes better than the overfit solutions found by smaller models near the classical optimum.
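The interpolation threshold is easy to reproduce with random-features least-squares regression, a standard minimal setting for double descent. This sketch only demonstrates the threshold behavior (train error hitting zero once features outnumber samples); the exact shape of the test error curve depends on noise level and feature distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 10
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
w = rng.normal(size=d)
y_tr = X_tr @ w + 0.5 * rng.normal(size=n_train)   # noisy linear target
y_te = X_te @ w + 0.5 * rng.normal(size=n_test)

def errors(n_features):
    # Random ReLU features; np.linalg.lstsq returns the minimum-norm
    # solution once the model is wide enough to interpolate.
    P = rng.normal(size=(d, n_features)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ P, 0), np.maximum(X_te @ P, 0)
    coef, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    tr = np.mean((F_tr @ coef - y_tr) ** 2)
    te = np.mean((F_te @ coef - y_te) ** 2)
    return tr, te

for k in [5, 20, 40, 80, 400]:   # capacity sweep past the threshold at 40
    tr, te = errors(k)
    print(f"{k:4d} features: train {tr:.3f}  test {te:.3f}")
```

Test error typically spikes near the threshold (here, around 40 features) and falls again in the over-parameterized regime, which is the second descent.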

Modern LLMs sit deep in the second descent phase. They are vastly over-parameterized relative to the number of training examples, and they generalize well on language because they have found smooth, generalizable representations of linguistic structure. But for specific facts, this picture is more complicated. A fact that appeared once in training contributes one data point. The model can memorize that specific point, but the key vectors it trains are not robust because there is only one gradient signal shaping them. A fact that appeared ten thousand times has ten thousand gradient signals from diverse phrasings training the same underlying key-value structure. Scale expands the number of memory slots, but it cannot substitute for the training signal that was never present.

Double descent: test error vs. model size, showing the interpolation threshold and second descent

This is why increasing model size from 7B to 70B does not reliably improve factual recall on obscure or recent knowledge. The 70B model has more key-value slots and generally better representations, but if a specific fact appeared rarely in training, the relevant slots in the 70B model are as weakly trained as the corresponding slots in the 7B model. You got more generalization capacity, more reasoning capability, better language understanding. You did not get more accurate recall of facts that were never reinforced.

Why retrieval beats scale for knowledge-intensive tasks

The accumulated picture explains why retrieval-augmented generation is not merely a workaround for knowledge cutoffs. It is the structurally correct architecture for tasks where factual accuracy on specific, potentially obscure, or potentially recent facts is required.

RAG bypasses the FFN key-value retrieval mechanism entirely for the information that needs to be accurate. Instead of asking the model's W2 matrices to recall a specific fact (which may be encoded weakly, encoded with conflicts from contradictory training examples, or not encoded at all because the fact postdates training), you retrieve the relevant passage from an external index and place it in the model's context window. The model then reads this text through its attention layers, which is a fundamentally different operation from FFN-mediated fact recall.

Reading text through attention is reliable in a way that FFN recall is not. The model does not need to have seen a specific fact during pre-training to read it correctly from provided context. The attention mechanism reads the current tokens and their relationships. The FFN mechanism retrieves from weights that encode patterns from past training. Providing relevant context moves the burden from weight-stored recall (unreliable for sparse or absent training signal) to in-context reading (reliable, because the fact is literally present in the token sequence).
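The retrieval step can be sketched in a few lines. This toy version uses bag-of-words cosine similarity over a three-passage corpus; a production system would use a learned dense embedder and an approximate-nearest-neighbor index, but the structural point is the same: the fact reaches the model as tokens in context rather than being recalled from weights.

```python
import re
import numpy as np
from collections import Counter

corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Stratford-upon-Avon is the birthplace of Shakespeare.",
    "SwiGLU replaces the two-matrix FFN with three projection matrices.",
]

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def bow(text, vocab):
    counts = Counter(tokens(text))
    return np.array([counts[w] for w in vocab], dtype=float)

def retrieve(query, passages):
    # Cosine similarity between bag-of-words vectors; return best passage.
    vocab = sorted({w for p in passages + [query] for w in tokens(p)})
    q = bow(query, vocab)
    sims = [q @ bow(p, vocab)
            / (np.linalg.norm(q) * np.linalg.norm(bow(p, vocab)) + 1e-9)
            for p in passages]
    return passages[int(np.argmax(sims))]

context = retrieve("Where is the Eiffel Tower?", corpus)
prompt = f"Context: {context}\nQuestion: Where is the Eiffel Tower?\nAnswer:"
print(context)
```

Whatever the retriever surfaces becomes part of the token sequence the attention layers read, which is why retrieval quality, not generator size, usually bounds factual accuracy in these systems.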

This is why a small model with retrieval frequently outperforms a much larger model without retrieval on knowledge-intensive question answering benchmarks. Lewis et al. (NeurIPS, 2020) demonstrated this in the original RAG paper: a retrieval-augmented model built on BART-large (roughly 400M parameters) beat far larger closed-book models on open-domain QA, because the bottleneck was not reasoning capability but factual recall, and retrieval solved the recall problem directly while the generator's language understanding was sufficient to extract the answer from the retrieved passage.

The 70B model's advantages over the 7B model are real: better reasoning, better instruction following, better handling of ambiguous or complex phrasings, better synthesis of information across multiple retrieved passages. For tasks where the challenge is reasoning over provided information rather than recalling information from weights, the larger model genuinely helps. But scale alone, without retrieval, cannot compensate for absent or stale training signal on specific factual queries. The architecture of FFN-as-key-value-memory simply does not allow it.

Practical implications for system design

The mechanistic picture of FFN layers as key-value memories has direct consequences for how you should build systems that rely on language models.

Do not use a model as a knowledge base. The FFN memory is not a reliable key-value store. It has no consistency guarantees, no versioning, no way to update individual facts without risking corruption of adjacent knowledge. Facts that were common in pre-training are recalled reliably; facts that were rare or absent are not. There is no way from the outside to distinguish which situation you are in for any given query. If your application requires accurate recall of specific facts, use retrieval.

Expect worse accuracy on obscure, recent, and contested facts. The model's factual recall is a direct function of training frequency for the relevant key-value pattern. Company-specific knowledge, recent events, domain-specific technical details, and any fact that was rare in the web corpus are all in the fragile zone. This is not a model quality failure. It is the predictable consequence of how the FFN memory mechanism works.

Understand that temperature does not change factual accuracy. Setting temperature to 0.0 makes the model more decisive about whatever its FFN layers retrieve, not more accurate. The confidence of a wrong answer is not evidence of correctness. If you need factual accuracy, you need the fact to be either in the training distribution with high frequency or in the provided context. No sampling parameter can substitute for either.

When you use a larger model for knowledge-intensive tasks, be explicit about what you are gaining. A 70B model retrieves the same weakly-stored facts as a 7B model with similar accuracy. What it does better is reason over retrieved context, handle complex multi-step queries, and maintain coherence across longer outputs. Pairing even a 7B model with a well-built retrieval system will outperform a 70B model without retrieval on most knowledge-intensive benchmarks, because the bottleneck is almost never model size; it is whether the relevant fact is accessible at all.

For applications where model-stored knowledge is unavoidable (conversational tasks where users may ask about anything, where pre-fetching context is not feasible), the right response is to build uncertainty handling into the output pipeline. Probe the model's own confidence through sampling (asking the same question multiple times at higher temperature and checking for consistency), route factual queries to retrieval where possible, and design the user experience to surface uncertainty rather than suppress it. The FFN layers will always have gaps. Systems that pretend otherwise will produce confident wrong answers at the worst possible moments.
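The sampling-based consistency probe can be sketched as follows. `generate` is a hypothetical stand-in for a model API call (here a random stub); the agreement logic is the point, not the stub:

```python
import random
from collections import Counter

def consistency_probe(generate, question, n_samples=5, threshold=0.8):
    """Ask the same question several times at nonzero temperature and
    measure agreement. Low agreement signals weakly stored knowledge:
    route to retrieval or surface the uncertainty to the user."""
    answers = [generate(question, temperature=0.7) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    return {"answer": top_answer, "agreement": agreement,
            "confident": agreement >= threshold}

# Random stub standing in for a real model call, for illustration only.
def stub_generate(question, temperature):
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

random.seed(0)
print(consistency_probe(stub_generate, "What is the capital of France?"))
```

The design choice worth noting: disagreement across samples is used as a routing signal, not as a voting scheme. Majority-voting a weakly stored fact still returns a weakly stored fact.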


References

  1. Geva, M., Schuster, R., Berant, J., and Levy, O. "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP, 2021. https://arxiv.org/abs/2012.14913

  2. Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space." EMNLP, 2022. https://arxiv.org/abs/2203.14980

  3. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. "Locating and Editing Factual Associations in GPT." NeurIPS, 2022. https://arxiv.org/abs/2202.05262

  4. Belkin, M., Hsu, D., Ma, S., and Mandal, S. "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." PNAS, 2019. https://arxiv.org/abs/1812.11118

  5. Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. https://arxiv.org/abs/2005.11401

  6. Touvron, H., Martin, L., Stone, K., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv, 2023. https://arxiv.org/abs/2307.09288