Quantization Without Regret: Picking the Right Precision for Your Model

17 min read
inference · quantization · precision · serving · hardware

Every byte in a model weight costs money twice: once when it is loaded from DRAM into GPU HBM during deployment, and once when it is transferred from HBM to compute units during each forward pass. A 70B model in float16 occupies 140 GB of HBM. The same model in int4 occupies 35 GB: it fits on a single A100-80GB with room for KV cache, and its memory bandwidth requirement is 4x lower, directly increasing generation throughput. The question is not whether to quantize but how much precision you can afford to give up on your specific task and model size.
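The footprint arithmetic is worth making concrete. A trivial sketch (the function name is ours; the 70B figures match the numbers above):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight storage for a dense model at a given precision, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_memory_gb(70e9, 16)   # 140.0 GB: needs multiple GPUs
int4 = model_memory_gb(70e9, 4)    # 35.0 GB: fits one A100-80GB
print(fp16, int4, fp16 / int4)     # 140.0 35.0 4.0
```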

Modern LLM serving is almost always memory-bandwidth-bound during autoregressive decoding. The GPU's compute units are faster than the rate at which weights can be streamed from HBM, so the token generation rate is determined by how fast weights move from memory to the matrix multiply units, not by how fast those units compute. Halving the bytes per weight roughly doubles the weight transfer rate, and with it the generation throughput. This is why quantization has become the dominant optimization lever in LLM serving, ahead of batching strategies, speculative decoding, and FlashAttention, all of which improve other parts of the pipeline. Nothing else cuts memory footprint and raises throughput simultaneously by 2 to 4x with a few hours of preprocessing work.

Float format fundamentals

A floating-point number uses three components: sign (1 bit), exponent (range), and mantissa (precision within that range). The total bit width and the allocation between exponent and mantissa bits determine the format's range and precision independently. This separability is what makes the floating-point family so flexible: you can trade range for precision by reallocating bits, and different training and inference workloads have different sensitivities to each.

FP32 (32-bit, full precision): 1 sign bit, 8 exponent bits, 23 mantissa bits. Range up to approximately 3.4 x 10^38, precision to about 7 decimal places. The standard format for training from scratch, where numerical stability over many gradient steps requires both the wide range and the precision.

FP16 (16-bit, half precision): 1 sign bit, 5 exponent bits, 10 mantissa bits. Range only up to 65,504. Precision to about 3 decimal places. The risk is going out of range: any value above 65,504 becomes infinity and corrupts gradients or forward pass computations, while small gradients underflow to zero. This is why mixed-precision training requires loss scaling: the loss is multiplied by a scale factor so that gradients stay representable in FP16, and the factor is adjusted dynamically to back off whenever overflow appears.
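The overflow failure and the loss-scaling fix are both easy to reproduce with NumPy's float16 (the scale of 1024 is an arbitrary illustrative choice, not a tuned value):

```python
import numpy as np

np.seterr(over="ignore")            # silence the overflow warning for the demo

x = np.float32(70_000.0)            # above FP16's maximum of 65,504
print(np.float16(x))                # inf: the value overflows

# Loss scaling: shrink into FP16 range before the cast, undo it in FP32
scale = np.float32(1024.0)
y = np.float16(x / scale)           # ~68.4, comfortably representable
print(np.float32(y) * scale)        # ~70,016: recovered to FP16 precision
```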

BF16 (bfloat16): 1 sign bit, 8 exponent bits, 7 mantissa bits. The exponent allocation matches FP32 exactly, so the dynamic range matches FP32 and overflow is not a concern. The mantissa is shorter than FP16, giving lower numerical precision, but training can compensate for that precision loss with additional optimizer steps. BF16 is the dominant training format for large models: it eliminates overflow risk while halving memory relative to FP32, at the cost of precision that the training process can absorb. For inference, BF16 and FP16 produce equivalent results for most tasks, with BF16 being safer for models that have large activation ranges.

FP8 (8-bit floating point): The E4M3 variant uses 1 sign bit, 4 exponent bits, and 3 mantissa bits. The E5M2 variant uses 1 sign bit, 5 exponent bits, and 2 mantissa bits. E4M3 is preferred for weights and activations during inference because the extra mantissa bit preserves more numerical precision; E5M2 has wider range and is used for gradient storage during training. On NVIDIA H100 hardware, FP8 matrix multiply is natively supported: the tensor cores run FP8 operations at up to twice the peak throughput of BF16.

INT8 (8-bit integer): A fixed-point representation with no exponent or mantissa split. The full 8 bits encode magnitude and sign as a scaled integer. Range adaptation requires a per-tensor or per-channel scale factor stored separately. INT8 is excellent for weights, where the scale factor can be calibrated once on a representative dataset. It is riskier for activations, whose distributions are input-dependent and can span a much wider range than can be reliably captured with a fixed scale calibrated offline.
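The per-channel scale mechanics for weights fit in a few lines. A minimal sketch of the symmetric scheme (shapes and the random weights are illustrative; the scales come from the weights themselves):

```python
import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric int8 quantization with one scale per output channel.
    w has shape (out_channels, in_channels)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = q.astype(np.float32) * scale        # dequantize

# Round-to-nearest error is at most half a quantization step per weight
print(np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6)   # True
```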

Float format bit layout: FP32, FP16, BF16, and FP8 exponent vs. mantissa tradeoffs

The key insight from the bit layout is that BF16 matches FP32's exponent width exactly, giving it identical dynamic range while halving memory. FP16 sacrifices exponent bits for mantissa bits: more precision but narrower range. FP8 E4M3 keeps 4 exponent bits, which is enough range for inference activations on most models, while fitting in a single byte. INT8 abandons the floating-point structure entirely and uses a per-channel scale factor to map its fixed integer grid to the data distribution.

Post-training quantization: GPTQ and AWQ

Post-training quantization (PTQ) applies quantization to a trained model without any retraining. The weight values are mapped from floating-point to integer representations using a scale and zero-point per group of weights. The core challenge is that naive quantization, rounding each weight to the nearest integer multiple of the scale, introduces reconstruction error that compounds as activations flow through successive layers. PTQ methods differ fundamentally in how they minimize this reconstruction error, and the difference in perplexity at int4 between a good PTQ method and a naive one is 2 to 4 perplexity points on a 7B model.
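The scale-and-zero-point mapping itself is straightforward; this sketch is the naive round-to-nearest baseline that the methods below improve on (group size 128 is a common choice, and the shapes are illustrative):

```python
import numpy as np

def quantize_int4_groups(w, group_size=128):
    """Asymmetric 4-bit quantization with one (scale, zero_point) per group."""
    g = w.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                     # 16 integer levels: 0..15
    zero = np.round(-lo / scale)
    q = np.clip(np.round(g / scale + zero), 0, 15)
    return q, scale, zero

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512)).astype(np.float32)
q, scale, zero = quantize_int4_groups(w)
w_hat = ((q - zero) * scale).reshape(w.shape)    # dequantize

# With only 16 levels the error is bounded by about one quantization step
print(np.abs(w - w_hat).max() <= scale.max())    # True
```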

GPTQ (Frantar et al., 2022) uses a second-order optimization approach based on the Optimal Brain Compression framework. For each transformer layer, it minimizes the quantization error with respect to the layer's output on a small calibration dataset by using the Hessian of the loss. The Hessian captures how sensitive the loss is to changes in each weight: weights in directions of high curvature need to be quantized more carefully than weights in flat directions. GPTQ quantizes weights column by column within each weight matrix, and after quantizing each column, it compensates for the introduced error by adjusting the remaining unquantized columns. This compensation step is what makes GPTQ substantially better than round-nearest: the error does not accumulate independently in each column but is continuously redistributed to preserve the layer's output. The calibration pass takes minutes to hours depending on model size, using a few hundred to a few thousand representative samples from the target domain.
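GPTQ's real update uses the inverse Hessian of the calibration loss; the toy below keeps only the compensation idea, replacing the Hessian machinery with an explicit least-squares re-fit after each weight is frozen to the grid, on two perfectly correlated input features (a deliberately extreme case where the effect is guaranteed to show):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=256)
X = np.stack([x0, x0], axis=1)        # two perfectly correlated inputs
w = np.array([0.3, 0.3])
y = X @ w                             # the layer output to preserve

scale = 0.5                           # coarse grid so errors are visible
quant = lambda v: np.round(v / scale) * scale

# Round-to-nearest: each weight rounds up to 0.5, overshooting the output
err_rtn = np.linalg.norm(X @ quant(w) - y)

# Compensated: freeze w[0] on the grid, re-fit w[1] to absorb the error,
# then quantize w[1] as well (GPTQ does this column by column via H^-1)
w0 = quant(w[0])                                        # 0.5
w1, *_ = np.linalg.lstsq(X[:, 1:], y - X[:, 0] * w0, rcond=None)
w1 = quant(w1[0])                                       # re-fit ~0.1, rounds to 0.0
err_comp = np.linalg.norm(X @ np.array([w0, w1]) - y)

print(err_comp < err_rtn)             # True: compensation shrinks the error
```

Here round-to-nearest quantizes both weights upward and the errors add, while the compensated pass lets the second weight soak up the first weight's rounding error before it is quantized in turn.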

AWQ (Activation-Aware Weight Quantization, Lin et al., 2023) takes a complementary approach based on an observation about which weights matter most. It finds that a small fraction of weight channels, specifically those corresponding to large activation magnitudes, contribute disproportionately to the model's output quality. When those channels are quantized aggressively, the resulting reconstruction error is large because their output is scaled by large activations that amplify the quantization noise. AWQ identifies these salient channels using a calibration set and applies a smoothing transformation: it scales the weight values in those channels upward before quantization (making them quantize to higher-resolution integer values) and scales the corresponding activations downward by the same factor (so the mathematical output is unchanged). The net effect is that approximately 1% of weights receive effectively higher precision at no memory cost, and the reconstruction error on the most important channels drops substantially. This is not a change in the quantization format itself but a preprocessing step that changes the distribution of values being quantized to better match the precision available.
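The scaling identity at the heart of this trick is two lines of algebra. A sketch (the channel indices and the factor of 4 are arbitrary illustrations, not AWQ's learned scales):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))          # activations: (tokens, channels)
w = rng.normal(size=(8, 4))           # weights: (channels, out_features)

s = np.ones(8)
s[[0, 3]] = 4.0                       # boost two "salient" channels

# Scale weights up and activations down by the same per-channel factor
out_ref    = x @ w
out_scaled = (x / s) @ (w * s[:, None])

print(np.allclose(out_ref, out_scaled))   # True: the layer output is unchanged
```

After the transform, the boosted weight channels span more of the int4 grid, so round-to-nearest loses less of their signal; in practice the activation division is fused into the preceding operation rather than applied at runtime.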

In practice, AWQ tends to produce slightly better perplexity than GPTQ at the same bit width for instruction-following models, particularly at int4. The margin is roughly 0.5 to 1.5 perplexity points on standard benchmarks. AWQ is also typically faster to apply, since it searches per-channel scales rather than running GPTQ's inherently sequential column-by-column weight updates. Both are available in widely used inference libraries including llama.cpp, AutoGPTQ, and llm-compressor, and both produce quantized model artifacts compatible with standard serving runtimes like vLLM and TensorRT-LLM. The choice between them is primarily driven by which inference runtime you are targeting and whether the calibration time matters for your deployment workflow.

One practical consideration: the calibration dataset matters. Both GPTQ and AWQ benefit from calibration data that is representative of the target inference distribution. A model quantized on generic web text calibration data and then deployed on medical or legal queries will have higher error than a model calibrated on domain-specific examples. For production systems where accuracy on a specific domain is critical, collecting 500 to 2,000 representative samples from that domain and using them as the calibration set is worth the effort.

The perplexity degradation curve

The accuracy cost of quantization follows a consistent pattern across model sizes and tasks. Understanding this pattern precisely, rather than in the abstract, is what separates deployments that work from deployments that fail quietly.

BF16 and FP16 are essentially lossless relative to FP32. Less than 0.1 perplexity point difference is typical in practice, well below the threshold for any detectable quality degradation on any task. These formats are the reference baseline from which all quantization costs are measured.

FP8 produces 0.1 to 0.5 perplexity degradation depending on the method and model. On NVIDIA H100 hardware, FP8 quantization is hardware-native and introduces no throughput penalty. The model computes at FP8 speed with near-FP16 accuracy. For new deployments on H100, FP8 is nearly always the right choice: the accuracy cost is negligible and the inference throughput improvement is real. The FP8 quantization support in major frameworks (TensorRT-LLM, vLLM with FP8 support, SGLang) has matured to the point where enabling it is a configuration flag rather than a custom kernel implementation.

Int8 weight-only (W8A16) produces 0.2 to 0.8 perplexity degradation. This is generally safe for production use on most tasks. The variance is higher than FP8 because int8 weight formats have limited dynamic range for models with large weight value distributions, and the per-channel scale calibration can miss edge cases in tail distributions. Still, for the vast majority of instruction-following and summarization tasks, int8 weight quantization is undetectable in human evaluation and produces meaningful memory savings on any hardware generation.

Int4 weight-only (W4A16) is where the decision becomes consequential, and model size is the dominant factor. A 70B model quantized to int4 with a good PTQ method loses roughly 2 to 3 perplexity points: the large number of parameters provides redundancy that absorbs quantization error, because the model's knowledge is distributed across more parameters than are strictly necessary to represent it. Quantization noise in any single parameter is compensated by other parameters encoding similar information. A 7B model at int4 loses 4 to 8 perplexity points, which is perceptible on reasoning and factual retrieval tasks. The model may produce fluent, confident text that is factually wrong, which is the worst failure mode for production systems because it is invisible without evaluation. A 1B model at int4 can lose 10 or more perplexity points and become essentially unusable on structured tasks like code generation or multi-step reasoning.

Perplexity degradation by quantization level and model size: how larger models tolerate lower precision

The underlying principle is that larger models tolerate lower precision better because their parameter count exceeds what is strictly necessary to represent the model's learned knowledge. Quantization error effectively adds noise to each parameter, and the larger the model, the more of that noise averages out across redundant representations. This is not a minor effect: a 70B model at int4 and a 7B model at int8 may use similar amounts of memory but have very different accuracy profiles. If you have the choice between a smaller model at higher precision and a larger model at lower precision for a given memory budget, the larger model at lower precision usually wins.

Weight-only vs. weight-and-activation quantization

The quantization target, weights alone or weights and activations together, determines both the accuracy risk and the categories of hardware speedup available.

Weight-only quantization (W4A16, W8A16) quantizes stored weights to low precision but dequantizes them back to fp16 before the matrix multiplication. The arithmetic happens in fp16; only the storage and memory bandwidth savings are realized. This is safe because weights are static: they can be quantized carefully offline using a calibration pass, and the same calibrated quantization applies to every inference request. Activations are dynamic: they change with every input, and their distributions can shift significantly across domains. Dequantizing weights before the matmul means the actual computation precision is unchanged, so weight-only quantization introduces only the quantization approximation error from the offline calibration, not runtime numerical instability from low-precision arithmetic on unpredictable activation values.

Weight-and-activation quantization (W8A8, W4A8, W4A4) quantizes both weights and activations for the matrix multiplication itself. This enables integer arithmetic on hardware with dedicated INT8 or INT4 matrix multiply units, which are available on NVIDIA Turing and later GPU architectures. INT8 tensor cores on an A100 provide roughly 2x the throughput of FP16 tensor cores for large matrix multiply operations (and INT4 up to 4x), because the hardware can pack more integer operations into the same silicon area. The accuracy risk is higher because activation distributions are input-dependent and can be difficult to quantize without loss, particularly for models with large activation outliers.

SmoothQuant (Xiao et al., 2022) addresses the activation quantization problem directly. The core observation is that LLM activations often have a small number of channels with very large magnitudes (outliers) while most channels have small magnitudes. Quantizing the full activation tensor with a single scale factor forces you to choose: set the scale to accommodate the outliers, which wastes precision on the typical channels, or set the scale to optimize precision for typical channels, which clips the outliers and introduces large errors. SmoothQuant avoids this tradeoff by migrating the quantization difficulty from activations to weights. It multiplies each activation channel by a per-channel smoothing factor that compresses the outlier channels, and divides the corresponding weight row by the same factor. The mathematical output of the layer is unchanged (the scale factors cancel), but the smoothed activations now have a much narrower distribution that fits cleanly into int8 quantization. The weights absorb the inverse scale factors without significant precision loss because weight distributions are typically more uniform than activation distributions to begin with. SmoothQuant enables W8A8 quantization at near-W8A16 accuracy, which is the key result: you get both the memory savings from quantized weights and the compute savings from integer arithmetic without the accuracy hit that would otherwise come from naively quantizing activations.
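A toy version of the migration makes the effect concrete (the outlier magnitude, shapes, and the fixed α = 0.5 are illustrative; SmoothQuant tunes α per model):

```python
import numpy as np

def q8(t):
    """Symmetric per-tensor int8 fake-quantization (quantize then dequantize)."""
    s = np.abs(t).max() / 127.0
    return np.clip(np.round(t / s), -127, 127) * s

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
x[:, 3] *= 500.0                       # one outlier activation channel
w = rng.normal(size=(8, 4))
w[3, :] *= 0.01                        # its weights are ordinary-to-small

ref = x @ w
rel = lambda y: np.linalg.norm(y - ref) / np.linalg.norm(ref)

# Naive W8A8: the outlier forces a huge per-tensor activation scale,
# flattening the seven well-behaved channels to almost nothing
naive = rel(q8(x) @ q8(w))

# SmoothQuant: per-channel factors migrate the difficulty into the weights;
# the layer output is mathematically unchanged before quantization
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
smooth = rel(q8(x / s) @ q8(w * s[:, None]))

print(naive, smooth)                   # smoothing cuts the error dramatically
```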

The practical implication for deployment is that W8A8 with SmoothQuant is the right path when you are targeting hardware with INT8 tensor cores (A100, H100, or consumer Ampere/Ada GPUs) and need compute throughput improvement beyond what bandwidth-limited weight-only quantization provides. W4A16 with GPTQ or AWQ is the right path when memory capacity is the binding constraint and you cannot afford the accuracy cost of activation quantization at int4.

The decision matrix

The choice of quantization format is determined by three inputs: the hardware available, the model size, and the acceptable accuracy degradation for the task. These interact in ways that make the decision non-obvious without working through each combination.

On NVIDIA H100 hardware, use FP8 for inference. It is hardware-native, adds negligible accuracy loss (typically under 0.3 perplexity points), and requires no custom kernel work. The FP8 support in TensorRT-LLM and vLLM is production-grade. For high-stakes tasks like mathematical reasoning or factual question answering, validate FP8 on a held-out task-specific evaluation set before declaring it safe, but the bar for it passing is low. This is the clear default for any H100 deployment.

On A100 hardware with large models (30B parameters and above), use int4 weight-only quantization with GPTQ or AWQ. The 4x memory reduction enables serving on fewer GPUs, reducing cost per request substantially. A 70B model that previously required 4 x A100-80GB can run on a single A100-80GB instance with int4 quantization, reducing the serving cost by roughly 4x. Validate perplexity on a task-specific held-out set before deploying: 2 to 3 perplexity points of degradation is usually acceptable for chat and summarization, but may not be for specialized reasoning tasks. AWQ is preferred over GPTQ on instruction-following models because it better preserves salient channels.

On A100 hardware with small models (7B to 13B parameters), int4 is risky for reasoning, coding, and factual tasks. Use int8 (W8A16) as the safe default, which gives 2x memory savings with 0.2 to 0.8 perplexity degradation. If memory capacity is genuinely the bottleneck and serving the model at all requires int4, use AWQ and test on your specific task before committing. A model that appears to work in demo conditions can fail on edge cases in the long tail of production inputs.

On hardware without native int8 matrix multiply (older GPU generations, edge devices, consumer hardware), weight-only quantization provides bandwidth savings but not compute savings: the dequantization to fp16 before each matmul means integer tensor cores are not involved. Use the minimum bit width your accuracy requirement supports, which is typically int4 for memory-constrained edge deployments and int8 for accuracy-sensitive server deployments on older hardware.

For the KV cache: 8-bit KV cache quantization is safe for most tasks and widely supported; vLLM exposes it as a KV cache dtype option, and on H100 the FP8 variant is preferable. Quantizing the KV cache and the model weights independently is the right approach: both contribute to memory reduction, the accuracy effects are approximately additive rather than multiplicative, and most modern inference frameworks support them as separate configuration flags. On a 70B model at int4 weights with int8 KV cache, the combined memory footprint is typically 40 to 50 GB on a single A100-80GB, leaving 30 to 40 GB for batch size scaling.
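The 40-to-50 GB figure can be sanity-checked with a back-of-envelope sketch, assuming Llama-2-70B-like dimensions (80 layers, 8 KV heads under grouped-query attention, head dim 128; these defaults are assumptions, not universal constants):

```python
def weights_gb(n_params, bits):
    return n_params * bits / 8 / 1e9

def kv_cache_gb(batch, seq_len, n_layers=80, n_kv_heads=8,
                head_dim=128, bytes_per_elem=1):
    """K and V: one vector per layer, per KV head, per token, per sequence."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem / 1e9)

w = weights_gb(70e9, 4)                      # 35.0 GB of int4 weights
kv = kv_cache_gb(batch=16, seq_len=4096)     # ~10.7 GB of 1-byte KV cache
print(w + kv)                                # ~45.7 GB: inside the 40-50 GB band
```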

Quantization decision matrix: hardware, model size, and acceptable degradation mapped to format choice

The underlying logic of the decision matrix is this: quantization error scales inversely with model size and directly with the precision of the target format. The two most dangerous combinations are small model plus aggressive quantization (7B at int4) and high-stakes task plus any quantization method that has not been validated on that specific task's evaluation set. Avoiding those two failure modes captures most of the practical risk. Everything else is a tradeoff between memory savings, throughput improvement, and calibration effort that can be navigated methodically.
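The matrix condenses into a few branches. The sketch below simply encodes this article's rules; the strings, the function name, and the 30B threshold are editorial conveniences, not any framework's API:

```python
def pick_quant(hardware: str, model_b_params: float,
               int8_tensor_cores: bool = True) -> str:
    """Map hardware and model size to the format recommended above."""
    if hardware == "h100":
        return "fp8"                  # hardware-native, negligible accuracy loss
    if not int8_tensor_cores:
        # Older hardware: weight-only formats give bandwidth savings only
        return "w4a16-or-w8a16-weight-only"
    if model_b_params >= 30:
        return "w4a16-awq"            # parameter redundancy absorbs int4 noise
    return "w8a16"                    # int4 is risky below ~30B parameters

print(pick_quant("h100", 70))         # fp8
print(pick_quant("a100", 70))         # w4a16-awq
print(pick_quant("a100", 7))          # w8a16
```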

One final consideration: quantization and model architecture interact. Models trained with quantization awareness (quantization-aware training, QAT) tolerate aggressive quantization far better than models quantized purely post-training, because the training process adapts weight distributions to minimize quantization error. Several recent open-weight models (Llama 3 quantized variants, Mistral quantized releases) include QAT checkpoints that produce substantially lower perplexity at int4 than PTQ applied to the corresponding fp16 checkpoint. If a QAT checkpoint is available for your target model, use it instead of post-training quantization. The accuracy difference at int4 is typically 2 to 4 perplexity points, which can be the difference between a model that works and one that does not.


References

  1. Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. https://arxiv.org/abs/2210.17323

  2. Lin, J., Tang, J., Tang, H., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024. https://arxiv.org/abs/2306.00978

  3. Xiao, G., Lin, J., Seznec, M., et al. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." ICML, 2023. https://arxiv.org/abs/2211.10438

  4. Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022. https://arxiv.org/abs/2208.07339

  5. Micikevicius, P., Narang, S., Alben, J., et al. "Mixed Precision Training." ICLR, 2018. https://arxiv.org/abs/1710.03740

  6. Peng, H., Wu, K., Wei, Y., et al. "FP8-LM: Training FP8 Large Language Models." arXiv, 2023. https://arxiv.org/abs/2310.18313