LoRA From First Principles: Why Low-Rank Adaptation Works and When It Breaks
Fine-tuning a 7B model updates 7 billion parameters. On an A100, that means storing 7B weights in fp16 (14 GB) plus gradients (14 GB) plus optimizer states (28 GB for Adam's first and second moments) for a total of 56 GB just for training state, before even considering activations. A 70B model pushes this to roughly 560 GB, requiring eight A100s just to hold the numbers, before any data flows through. LoRA changes the economics of this entirely by introducing a mathematical constraint on what the fine-tuning update is allowed to do. The constraint sounds limiting. The insight is that the constraint almost always has room to spare.
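The arithmetic is worth making explicit. This sketch uses the simplified fp16-everywhere accounting from above; mixed-precision training in practice often keeps fp32 master weights and fp32 Adam moments, which roughly doubles these figures:

```python
# Back-of-envelope training-state accounting for full fine-tuning,
# assuming weights, gradients, and both Adam moments are all stored
# in fp16 (2 bytes each) -- a simplification of real mixed-precision
# setups, which often keep fp32 master weights and moments.

def full_finetune_state_gb(n_params: float) -> dict:
    bytes_fp16 = 2
    weights = n_params * bytes_fp16
    grads = n_params * bytes_fp16
    adam = n_params * bytes_fp16 * 2  # first and second moments
    gb = 1e9
    return {
        "weights_gb": weights / gb,
        "grads_gb": grads / gb,
        "optimizer_gb": adam / gb,
        "total_gb": (weights + grads + adam) / gb,
    }

print(full_finetune_state_gb(7e9)["total_gb"])   # 56.0 GB for a 7B model
print(full_finetune_state_gb(70e9)["total_gb"])  # 560.0 GB for a 70B model
```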
The rank hypothesis: why fine-tuning updates are low-dimensional
When a large pre-trained model is fine-tuned on a downstream task, the update to each weight matrix is not arbitrary. Aghajanyan et al. (2021) measured the intrinsic dimensionality of fine-tuning by asking: what is the minimum number of free parameters needed to reach 90% of the performance achievable with full fine-tuning on a given task? The method uses a random projection: instead of learning all d × k parameters of a weight matrix, you learn a small vector in some low-dimensional subspace and project it up into the full parameter space. By varying the dimension of this subspace and measuring task performance, you can identify the smallest subspace that is sufficient.
The results were striking. For RoBERTa fine-tuned on MRPC, a sentence-pair matching task, the intrinsic dimension was roughly 200. For SST-2 sentiment classification, around 100. For larger models fine-tuned on more complex tasks, the intrinsic dimension grew, but it remained far below the nominal parameter count. GPT-2 medium has roughly 345 million parameters. Fine-tuning it on common NLP benchmarks required fewer than 1,000 free dimensions to match 90% of full fine-tuning performance. The effective dimension of the update is a tiny fraction of the total parameter space.
This is not a mathematical guarantee that every fine-tuning task will compress well. It is an empirical observation that tasks commonly encountered in NLP practice involve updates that live in a surprisingly low-dimensional subspace of the weight space. The implication is that you do not need to update all d × k parameters in a weight matrix to capture the fine-tuning signal. You need to update only the components of those parameters that correspond to the directions in weight space that matter for the task. If those directions span a small subspace, a low-rank approximation of the weight update will suffice.
This is the empirical foundation for LoRA. The question it answers first is not "how do we make fine-tuning efficient?" but "why would constrained fine-tuning work at all?" The intrinsic dimensionality result is the answer.
The LoRA math in detail
Instead of learning ΔW directly (which would be d × k parameters), LoRA learns two smaller matrices: A of shape (r × k) and B of shape (d × r), where r is the rank and r is much less than min(d, k). The full weight update is reconstructed as the product B × A.
For a concrete example: at rank 8 for a 4096 × 4096 attention weight matrix, A has 8 × 4096 = 32,768 parameters and B has 4096 × 8 = 32,768 parameters, for a total of 65,536 trainable parameters. The original matrix would require 4096 × 4096 = 16,777,216 parameters to update directly. That is a 256-fold reduction in trainable parameters for this single matrix.
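The parameter count is a one-line computation:

```python
# Trainable-parameter count for one LoRA-adapted matrix of shape (d x k).
d = k = 4096
r = 8
lora_params = r * k + d * r   # A is (r x k), B is (d x r)
full_params = d * k           # updating the matrix directly
print(lora_params, full_params, full_params // lora_params)
# 65536 16777216 256
```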
During training, the original weights W0 are frozen entirely. Only A and B receive gradient updates. The forward pass computes W0x + BAx, or equivalently (W0 + BA)x. The mathematical identity between these two forms is what enables zero-overhead inference: you can merge BA into W0 by simple addition before deployment, at which point the model runs as a standard dense linear layer with no adapter machinery in the critical path. There is no branching at inference time, no extra matrix multiplications, no memory overhead from maintaining separate adapter weights if you want to avoid it.
The initialization is intentional and consequential. A is initialized from a Gaussian distribution with small variance. B is initialized to zero. Because the forward pass adds BAx, and B starts at zero, BA = 0 at the beginning of training. This means the adapter contributes nothing to the output at initialization: the fine-tuning starts exactly from the pre-trained behavior. If you initialize B to something nonzero, you immediately shift the model away from pre-training before any task-specific gradient signal has arrived, which destabilizes early training. The zero-B initialization is a warm start that preserves the quality of the base model until the data has the chance to direct adaptation.
The scaling factor alpha/r controls the effective magnitude of the adapter's contribution. The adapter output is scaled by (alpha/r) before being added to W0x. By convention, alpha is set to 2 × r, making the scaling factor 2. The purpose of this scaling becomes clear when you vary r: without it, increasing the rank would increase the magnitude of updates (because more dimensions are contributing), which would effectively change the learning rate of the adapter as a function of rank. The alpha/r scaling decouples rank from update magnitude, so you can increase rank to add capacity without implicitly changing how aggressively the adapter learns.
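The forward pass, initialization, scaling, and weight merging all fit in a small class. This is a NumPy sketch, not a training framework; the 0.02 standard deviation for A is an illustrative "small Gaussian" choice, not a prescribed value:

```python
import numpy as np

# Minimal LoRA linear layer: W0 frozen, A ~ small Gaussian, B = 0 at
# init, adapter output scaled by alpha / r.
class LoRALinear:
    def __init__(self, W0: np.ndarray, r: int = 8, alpha: float = 16.0,
                 seed: int = 0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                                  # frozen base weights
        self.A = rng.normal(0.0, 0.02, size=(r, k))   # small Gaussian init
        self.B = np.zeros((d, r))                     # zero => BA = 0 at init
        self.scale = alpha / r                        # decouples rank from magnitude

    def forward(self, x: np.ndarray) -> np.ndarray:
        # W0 x + (alpha/r) B A x -- the adapter path is a rank-r correction
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merged_weight(self) -> np.ndarray:
        # Fold the adapter into the base weights for zero-overhead inference.
        return self.W0 + self.scale * (self.B @ self.A)

W0 = np.random.default_rng(1).standard_normal((64, 64))
layer = LoRALinear(W0, r=8, alpha=16.0)
x = np.ones(64)
# At initialization the adapter contributes nothing: pre-trained behavior.
assert np.allclose(layer.forward(x), W0 @ x)
# After (simulated) training, the merged weights reproduce the adapted
# forward pass exactly, with no adapter machinery at inference time.
layer.B = np.random.default_rng(2).standard_normal((64, 8))
assert np.allclose(layer.merged_weight() @ x, layer.forward(x))
```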
Rank selection in practice
Rank is the primary configuration choice in LoRA, and it is worth reasoning about rather than defaulting to library examples. The right rank depends on how different the target task distribution is from the pre-training distribution, which is a way of asking how many independent directions of change are required in the weight space.
Rank 4 to 8 is appropriate for style adaptation, tone adjustment, and output format enforcement. The model already knows how to write English prose, follow instructions, and produce structured outputs. The fine-tuning is adjusting how it writes, not what it knows. This update is genuinely low-dimensional: you are pushing the model along a small number of directions that correspond to stylistic preferences. Rank 8 gives you 8 basis directions in a 16-million-dimensional weight space for a 4096 × 4096 matrix. For style, that is enough.
Rank 16 to 32 is appropriate for domain specialization, such as fine-tuning a general model on legal contracts, clinical notes, or financial reports. The model needs to learn a moderate number of patterns that are underrepresented or absent in its pre-training distribution: specialist vocabulary, domain-specific reasoning conventions, preferred citation formats, terminology disambiguation. The update subspace is larger than for style adaptation because the model is acquiring new associations, not simply adjusting existing ones.
Rank 64 and above is appropriate when structural reasoning patterns need to change: teaching a model to follow a specific multi-step chain-of-thought format, adapting it to a task type with genuinely different input-output structure from anything in pre-training, or performing heavy domain adaptation on a model that was pre-trained on a very different distribution. Research has consistently found diminishing returns above rank 64 for most tasks. Zhang et al. (2023) in AdaLoRA showed that most of the information in the fine-tuning update concentrates in a small number of singular value directions, and increasing rank beyond the point where the singular value spectrum of the update flattens out adds parameters without proportional benefit. The optimal rank for a task is roughly the point where the update's effective dimensionality is saturated.
A practical approach when rank is uncertain: start at rank 16, evaluate task performance, then try rank 8 and rank 32. If rank 8 matches rank 16, the task fits comfortably in a small subspace. If rank 32 improves on rank 16, the task requires more capacity and you should check rank 64 before settling. This bracketing costs three fine-tuning runs but removes the guesswork.
Which layers to target
LoRA can be applied to any weight matrix in the network. The original paper focuses on the attention projection matrices: query (Wq), key (Wk), value (Wv), and output projection (Wo). The FFN matrices can also be targeted but are less commonly included in default configurations.
Hu et al. showed that adapting only Wq and Wv captures approximately 90% of the fine-tuning gains achievable by adapting all four attention projections, at roughly half the parameter count. The interpretation connects to what these matrices do. The query and value projections carry most of the task-relevant signal: Wq determines what the model is looking for at each position, and Wv determines what information gets extracted from matched positions. Wk influences what queries attend to, and Wo mixes the attended values back into the residual stream. The task-specific behavior is largely encoded in the query-value interaction, because that is where the model decides both what to look for and what to retrieve when it finds it. Adapting Wk and Wo adds capacity but with decreasing return.
Including FFN matrices in LoRA adds a different kind of adaptation capacity. Recall from the previous post in this series that FFN layers function as key-value memories storing factual associations, with upper layers encoding factual completions and lower layers encoding syntactic patterns. If the fine-tuning task requires updating factual associations, such as domain adaptation on specialized literature where the FFN's key-value memory needs new entries, including FFN LoRA is beneficial. For style and tone tasks, where the factual content of the FFN's memory is irrelevant to the adaptation goal, attention-only LoRA typically suffices and keeps the parameter count lower.
The practical heuristic: start with Wq and Wv only. If task performance plateaus below acceptable thresholds, add Wk and Wo. If still insufficient, add FFN LoRA, particularly in the upper layers where factual associations are concentrated.
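In the Hugging Face PEFT library, this targeting choice is a configuration option. A sketch of the starting point; note that the module names are model-dependent ("q_proj" and "v_proj" are Llama-style names, other model families use different names):

```python
from peft import LoraConfig, get_peft_model

# Starting configuration per the heuristic above: Wq and Wv only.
# Module names ("q_proj", "v_proj") follow Llama-style naming and
# vary by model family -- check the base model's named modules.
config = LoraConfig(
    r=16,                                 # mid-range rank to bracket from
    lora_alpha=32,                        # alpha = 2 * r, scaling factor 2
    target_modules=["q_proj", "v_proj"],  # add k_proj/o_proj/FFN if needed
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, config)  # base_model: a loaded HF model
```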
QLoRA: making large models trainable on a single GPU
QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization of the base model weights, enabling fine-tuning of models that would otherwise require hardware unavailable outside hyperscaler infrastructure.
The base model weights are quantized to NF4, Normal Float 4, a 4-bit format designed specifically for neural network weights. The design principle behind NF4 is that trained neural network weights follow approximately normal distributions. Rather than using a standard 4-bit integer format, NF4 places its 16 quantization levels at quantile boundaries of the standard normal distribution, so that each quantization bin contains roughly equal probability mass. This minimizes quantization error for normally distributed values compared to a uniform grid. The LoRA adapter matrices A and B remain in fp16, because they are small and actively trained, so the quantization overhead is irrelevant to their cost.
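The quantile-spacing idea can be demonstrated with the standard library alone. This is a simplified sketch of the principle; the actual NF4 construction differs in detail (it pins an exact zero code and normalizes its levels to the range [-1, 1]):

```python
from statistics import NormalDist

# Simplified quantile-based 4-bit format: place the 16 levels at the
# centers of 16 equal-probability bins of the standard normal, so each
# quantization bin captures roughly equal probability mass.
nd = NormalDist()
levels = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]

def quantize(w: float) -> int:
    # Index of the nearest level.
    return min(range(16), key=lambda i: abs(levels[i] - w))

def dequantize(code: int) -> float:
    return levels[code]

# Levels cluster densely near zero, where normal weights concentrate,
# and spread out in the tails -- unlike a uniform 4-bit integer grid.
print([round(l, 3) for l in levels])
```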
For a 70B model, the memory arithmetic becomes tractable. Full fp16 weights require 140 GB. With 4-bit quantization, the base model weights shrink to approximately 35 GB. A single A100-80GB node cannot hold full fp16 weights for a 70B model even for inference, let alone training. With QLoRA, it can hold the quantized base model, the fp16 adapters, and enough activation memory to train.
The training procedure is precise about what gets dequantized when. During the forward pass, QLoRA dequantizes each block of base model weights from NF4 to bf16 for the computation, performs the forward pass through that block, then discards the dequantized values. Only the adapter parameters ever exist in full precision for extended periods. The optimizer state covers only the adapter parameters: for a rank-16 LoRA adapter on a 70B model, the adapter has on the order of tens of millions of parameters, and Adam's optimizer states for those parameters occupy a few hundred megabytes. The dominant memory cost during training is activations, not optimizer state, and activations can be managed through gradient checkpointing.
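The adapter-state arithmetic can be checked directly. The dimensions below are illustrative assumptions (80 layers, hidden size 8192, LoRA on Wq and Wv); real 70B-class models use grouped-query attention, which shrinks Wv, so this slightly overestimates:

```python
# Illustrative optimizer-state accounting for a rank-16 adapter on a
# 70B-class model. Assumed dims: 80 layers, hidden size 8192, LoRA on
# Wq and Wv only. (Grouped-query attention in real 70B architectures
# makes Wv smaller, so treat this as an upper-end estimate.)
layers, d, r = 80, 8192, 16
params_per_matrix = r * d + d * r               # A (r x d) plus B (d x r)
adapter_params = layers * 2 * params_per_matrix # Wq and Wv in every layer
adam_bytes = adapter_params * 4 * 2             # two fp32 moments per param
print(adapter_params / 1e6, "M params;", adam_bytes / 1e6, "MB of Adam state")
# ~42M parameters; ~336 MB of optimizer state
```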
The accuracy cost of quantizing to 4 bits is smaller than intuition would suggest. The original QLoRA paper showed that Guanaco models fine-tuned with QLoRA on a single GPU were competitive with ChatGPT on the Vicuna benchmark, a chat evaluation judged by GPT-4. The reason is that NF4 quantization loses relatively little information from normally distributed weights, and LoRA's adapter parameters, which carry all of the task-specific adaptation, remain in full precision throughout. You are quantizing the frozen base that provides general language competence, while keeping the trainable task-specific components at full resolution.
When LoRA breaks
LoRA assumes that the fine-tuning update lives in a low-rank subspace. Three scenarios violate this assumption in ways that cannot be fixed by increasing rank.
The first is when the fine-tuning dataset is so far from the pre-training distribution that the required weight changes are fundamentally high-rank. LoRA constrains the structure of the update: because W0 is frozen and only BA is trained, the change to the weight matrix can span at most r independent directions. If the task requires moving far from the initialization in many directions simultaneously, LoRA at any reasonable rank will underfit. The model simply cannot adapt far enough. In this scenario, full fine-tuning, continued pre-training on the target domain before LoRA fine-tuning, or PEFT methods with higher intrinsic capacity such as prefix tuning or adapter layers are required.
The second failure mode involves high-diversity fine-tuning datasets. If you attempt to adapt a model to a large number of distinct task types simultaneously with a single LoRA, the combined update subspace across all tasks may not be low-dimensional. Each task type contributes its own set of required adaptation directions, and the union of many such sets is not compact. In this case, training separate LoRA adapters per task type and composing them using linear combinations of their weight deltas, as in task arithmetic (Ilharco et al., 2023), produces better results than a single high-rank monolithic LoRA. The architectural analogy is a single adapter trying to be all things to all tasks, versus a mixture of specialists composed at serving time.
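Composing per-task adapters reduces to arithmetic on weight deltas. A minimal sketch, assuming plain weighted addition of each adapter's BA product (the mixing weights here are hypothetical knobs you would tune on validation data):

```python
import numpy as np

# Composing task-specific LoRA adapters by weighted addition of their
# weight deltas B_i @ A_i. The lambda weights are hypothetical mixing
# coefficients, chosen on validation data in practice.
rng = np.random.default_rng(0)
d, k, r = 32, 32, 4

W0 = rng.standard_normal((d, k))          # frozen base weights
adapters = [(rng.standard_normal((d, r)), rng.standard_normal((r, k)))
            for _ in range(3)]            # three per-task (B, A) pairs
lams = [0.5, 0.3, 0.2]                    # hypothetical mixing weights

W_merged = W0 + sum(lam * (B @ A) for lam, (B, A) in zip(lams, adapters))
assert W_merged.shape == (d, k)           # still one dense matrix to serve
```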
The third failure mode is insufficient capacity in the base model itself. LoRA's parameter reduction assumes that the frozen base model has adequate representational capacity for the target task, and that only a small directional correction is needed. If the base model has insufficient capacity, LoRA cannot compensate. Fine-tuning a 1B model on complex multi-step reasoning with LoRA will not produce a model that reasons well if the 1B base model does not have the representational structure to support those reasoning patterns. LoRA adjusts direction in the weight space. It does not create new representational capacity that the base model geometry cannot support. The floor is the base model's own capability ceiling.
There is a subtler version of this failure that practitioners encounter in practice: using LoRA with a poorly trained or poorly instruction-aligned base model, then attributing downstream quality failures to LoRA configuration. The diagnosis is often confused because increasing rank or changing layer targeting appears to help slightly, encouraging further tuning of the wrong variable. If the base model quality is the limiting factor, no LoRA configuration will close the gap. The correct diagnosis is to evaluate whether the base model itself, prompted without any fine-tuning, demonstrates the capabilities you are trying to elicit. If it does not, fine-tuning of any kind is building on a weak foundation.
References
Hu, E., Shen, Y., Wallis, P., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. https://arxiv.org/abs/2106.09685
Aghajanyan, A., Gupta, S., and Zettlemoyer, L. "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL, 2021. https://arxiv.org/abs/2012.13255
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS, 2023. https://arxiv.org/abs/2305.14314
Zhang, Q., Chen, M., Bukharin, A., et al. "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning." ICLR, 2023. https://arxiv.org/abs/2303.10512
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022. https://arxiv.org/abs/2208.07339
Ilharco, G., Ribeiro, M. T., Wortsman, M., et al. "Editing Models with Task Arithmetic." ICLR, 2023. https://arxiv.org/abs/2212.04089