Quantization
Floating point formats, numerical stability, and the math behind INT8, INT4, and FP8 quantization. Essential for deploying efficient inference at scale.
- Explain the memory/accuracy tradeoff in quantization
- Convert between floating point formats (FP32, FP16, FP8)
- Implement symmetric and asymmetric quantization
- Choose appropriate quantization strategies for different workloads
- Identify when quantization will/won't help performance
This chapter covers floating point representation and numerical stability, plus quantization labs on FP8 conversion, INT8/INT4, and NVFP4.
Floating Point Representation
Understanding floating point formats is crucial for quantization and numerical stability. Every float has three fields: a sign bit, an exponent (which sets the dynamic range), and a mantissa (which sets the precision).
FP32: 1 sign + 8 exponent + 23 mantissa bits. Range: ±3.4×10^38
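As a concrete illustration, here is a minimal C++ sketch (assuming the IEEE 754 binary32 layout) that unpacks the three fields of an FP32 value:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float x = 3.14f;

    // Reinterpret the 32 bits of the float (IEEE 754 binary32 layout).
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    uint32_t sign     = bits >> 31;           // 1 bit
    uint32_t exponent = (bits >> 23) & 0xFF;  // 8 bits, stored with a bias of 127
    uint32_t mantissa = bits & 0x7FFFFF;      // 23 bits, implicit leading 1

    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                (unsigned)sign, (unsigned)exponent,
                (int)exponent - 127, (unsigned)mantissa);
    return 0;
}
```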
Machine Epsilon
Machine epsilon (ε) is the gap between 1.0 and the next representable floating point value, equal to 2^-(mantissa bits). Adding anything much smaller than ε to 1.0 leaves it unchanged, so ε defines the format's precision limit.
Epsilon by Format
| Format | Mantissa Bits | Epsilon (ε) | Decimal |
|---|---|---|---|
| FP32 | 23 | 2^-23 | ~1.19×10^-7 |
| FP16 | 10 | 2^-10 | ~9.77×10^-4 |
| FP8 E4M3 | 3 | 2^-3 | 0.125 |
| FP8 E5M2 | 2 | 2^-2 | 0.25 |
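These values are easy to confirm empirically; a minimal C++ sketch:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // The standard library reports machine epsilon directly...
    std::printf("FP32 epsilon: %.10e\n", std::numeric_limits<float>::epsilon());

    // ...and it equals the gap between 1.0f and the next representable float.
    float next = std::nextafter(1.0f, 2.0f);
    std::printf("nextafter(1.0f) - 1.0f: %.10e\n", next - 1.0f);
    return 0;
}
```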
IEEE 754 reserves bit patterns for special values: ±Infinity (exponent all 1s, mantissa 0), NaN (exponent all 1s, mantissa non-zero), Denormals (exponent all 0s, gradual underflow). Always check for these in numerical code.
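In C++, these checks are available through &lt;cmath&gt;; a minimal sketch (the classify helper is illustrative, not a standard function):

```cpp
#include <cmath>
#include <cstdio>

// Classify a value before feeding it into a numerical kernel.
static const char* classify(float x) {
    if (std::isnan(x))                  return "NaN";
    if (std::isinf(x))                  return "Infinity";
    if (x != 0.0f && !std::isnormal(x)) return "denormal";
    return "normal";
}

int main() {
    float zero = 0.0f;
    std::printf("%s %s %s %s\n",
                classify(zero / zero),  // NaN: exponent all 1s, mantissa non-zero
                classify(1.0f / zero),  // +Infinity: exponent all 1s, mantissa 0
                classify(1e-45f),       // denormal: exponent all 0s
                classify(1.0f));        // ordinary normal value
    return 0;
}
```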
Avoiding Numerical Disasters
Floating point has limited range and precision. Large intermediate values cause overflow; small differences between large numbers cause catastrophic cancellation.
The exp() Overflow Problem
In softmax, we compute exp(x). But exp(100) ≈ 2.7×10^43, which overflows FP16 (max ~65504), and even FP32 overflows once the argument reaches about 89.
Naive Softmax
exp(x) / Σexp(x)
x = [100, 101, 102]
exp(100) = OVERFLOW
exp(101) = OVERFLOW
exp(102) = OVERFLOW
Result: NaN
Stable Softmax
exp(x - max) / Σexp(x - max)
x = [100, 101, 102]
max = 102
exp(-2) = 0.135
exp(-1) = 0.368
exp(0) = 1.0
Result: [0.09, 0.24, 0.67]
Mathematically, exp(xᵢ − m) / Σⱼ exp(xⱼ − m) = exp(xᵢ) / Σⱼ exp(xⱼ) for any constant m, because the common factor exp(−m) cancels. Choosing m = max(x) ensures all exponents are ≤ 0, preventing overflow. This trick is used in FlashAttention and all production softmax implementations.
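A minimal C++ sketch of the stable version (find the max, then exponentiate the shifted values and normalize):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable softmax: subtract the max before exponentiating
// so every argument to exp() is <= 0 and cannot overflow.
std::vector<float> softmax(const std::vector<float>& x) {
    float m = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}

int main() {
    auto p = softmax({100.0f, 101.0f, 102.0f});
    std::printf("%.2f %.2f %.2f\n", p[0], p[1], p[2]);  // ~0.09 0.24 0.67
    return 0;
}
```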
Catastrophic Cancellation
When subtracting two nearly equal numbers, significant digits cancel and relative error explodes.
Example: Loss of Precision
// In FP32, with ~7 significant decimal digits
float a = 1.0000001f;
float b = 1.0000000f;
float c = a - b;  // Expected: 1.0e-7
// Actual: ~1.19e-7, because the result is dominated by the rounding of a.
// Only about 1 significant digit survives; in general the relative error can approach 100%.
This matters in numerical derivatives, residual computations, and anywhere you compute differences of large similar values.
Error Accumulation in Summation
Adding many small values to a large accumulator loses precision. Each addition rounds, and errors accumulate.
Kahan Summation
Track the running error and compensate in subsequent additions:
float sum = 0.0f;
float c = 0.0f; // Running compensation
for (int i = 0; i < n; i++) {
float y = x[i] - c; // Compensated input
float t = sum + y; // Tentative sum
c = (t - sum) - y; // Recover lost low bits
sum = t;
}
Kahan summation reduces error from O(n·ε) to O(ε). In practice, sorting values by magnitude before summing or using tree reduction also helps.
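A quick comparison sketch (exact output depends on the compiler; aggressive fast-math flags can optimize the compensation away):

```cpp
#include <cstdio>

int main() {
    const int n = 10000000;      // sum 0.1f ten million times; exact answer ~1e6
    float naive = 0.0f;
    float sum = 0.0f, c = 0.0f;  // Kahan accumulator and running compensation

    for (int i = 0; i < n; i++) {
        float x = 0.1f;
        naive += x;              // plain accumulation: rounds on every add
        float y = x - c;         // Kahan: feed back the previously lost bits
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    // The naive sum drifts visibly away from 1e6; the compensated sum stays close.
    std::printf("naive: %.2f  kahan: %.2f\n", naive, sum);
    return 0;
}
```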
Quantization Math
Quantization maps floating point values to lower-precision integers (or low-bit floats like FP8). This reduces memory bandwidth requirements but introduces quantization error.
Scale Factor Computation
Symmetric quantization maps the largest magnitude in the tensor onto the largest integer level. With max|x| = 8.0 and a signed 4-bit integer (largest positive level 2^3 − 1 = 7): scale = 8.0 / 7 ≈ 1.143, quantized_int = round(3.7 / 1.143) = 3, dequantized = 3 × 1.143 ≈ 3.43, for an absolute error of about 0.27.
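A minimal C++ sketch of per-value symmetric and asymmetric INT8 quantization (the helper names are illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric: zero maps to integer 0; one scale per tensor (or per block).
int8_t quantize_symmetric(float x, float max_abs) {
    float scale = max_abs / 127.0f;                   // qmax = 2^7 - 1
    float q = std::round(x / scale);
    return (int8_t)std::clamp(q, -127.0f, 127.0f);
}

// Asymmetric: an explicit zero point shifts the range [min, max] onto [0, 255].
uint8_t quantize_asymmetric(float x, float min, float max) {
    float scale = (max - min) / 255.0f;
    float zero_point = std::round(-min / scale);
    float q = std::round(x / scale) + zero_point;
    return (uint8_t)std::clamp(q, 0.0f, 255.0f);
}

int main() {
    // Dequantize by reversing the mapping: x ≈ (q - zero_point) * scale.
    int8_t qs = quantize_symmetric(3.7f, 8.0f);
    std::printf("symmetric: q=%d dequantized=%.3f\n", qs, qs * (8.0f / 127.0f));

    uint8_t qa = quantize_asymmetric(3.7f, -2.0f, 8.0f);
    float scale = 10.0f / 255.0f, zp = std::round(2.0f / scale);
    std::printf("asymmetric: q=%u dequantized=%.3f\n", qa, (qa - zp) * scale);
    return 0;
}
```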
Block Scaling
Computing one scale per tensor wastes precision. Block scaling computes a separate scale factor for each block of values (typically 16-128 elements).
Why Groups of 16 or 32?
- Matches GPU warp size (32 threads) for efficient computation
- Tensor Core MMA shapes (16×16, 8×8) align naturally
- Amortizes scale factor storage overhead
- Balance between precision and metadata cost
OCP Microscaling (MX) formats use 32-element blocks. NVIDIA's NVFP4 uses 16-element blocks with FP8 scale factors.
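A sketch of block-wise symmetric quantization (32-element blocks and an INT4 target chosen for illustration; real kernels pack two 4-bit values per byte and store the scales in a compact format such as FP8):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr size_t kBlockSize = 32;  // e.g., MX-style 32-element blocks

// Quantize each block to signed INT4 with its own scale factor.
// (Values are stored one per byte here for clarity.)
void quantize_blocks(const std::vector<float>& x,
                     std::vector<int8_t>& q, std::vector<float>& scales) {
    for (size_t start = 0; start < x.size(); start += kBlockSize) {
        size_t end = std::min(start + kBlockSize, x.size());

        // Per-block scale: the largest magnitude in the block maps to qmax = 7.
        float max_abs = 0.0f;
        for (size_t i = start; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(x[i]));
        float scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;
        scales.push_back(scale);

        for (size_t i = start; i < end; ++i)
            q.push_back((int8_t)std::clamp(std::round(x[i] / scale), -7.0f, 7.0f));
    }
}
```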
Quantization Error Bounds
| Format | Levels | Max Relative Error | Use Case |
|---|---|---|---|
| INT8 | 256 | ~0.4% | Weights, KV cache |
| INT4 | 16 | ~7% | Weights (with fine-tuning) |
| FP8 E4M3 | 256* | ~6.25% | Activations, general |
| FP4 E2M1 | 16* | ~25% | With block scaling |
* FP formats have non-uniform spacing: absolute error grows with magnitude while relative error stays roughly constant. The integer-format figures are relative to full scale, so small values see proportionally larger relative error.
Production Quantization
Modern inference systems combine multiple quantization strategies. Here's how the pieces fit together.
Common Patterns
Weight Quantization (W8A16, W4A16)
Weights stored in INT8 or INT4, dequantized to FP16 for computation. Memory-bound workloads benefit most. Used by GPTQ, AWQ, bitsandbytes.
Full INT8 (W8A8)
Both weights and activations in INT8. Requires calibration data to determine activation scales. Used by TensorRT-LLM, ONNX Runtime.
FP8 (E4M3/E5M2)
Native hardware support on H100/B100+. E4M3 for forward pass (precision), E5M2 for gradients (range). Near-FP16 quality with 2× throughput.
NVFP4 (Blackwell)
4-bit floating point with per-block FP8 scales. 16-element blocks. 2× throughput vs FP8 with acceptable quality for inference.
Weight-only quantization helps memory-bound inference (batch size 1, long sequences). Full quantization (W8A8, FP8) helps compute-bound workloads (large batches, short sequences). Profile your workload to choose the right strategy.
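As a rough back-of-the-envelope check (a sketch, not a profiler; the hardware FLOPs-per-byte ratio below is an assumed example and should be replaced with your GPU's actual specs):

```cpp
#include <cstdio>

// Estimate arithmetic intensity (FLOPs per byte) of a weight matrix multiply:
// output[B, N] = input[B, K] x weight[K, N]. If intensity is well below the
// hardware FLOPs/byte ratio, the layer is memory-bound and weight-only
// quantization pays off directly.
double arithmetic_intensity(long B, long K, long N,
                            double weight_bytes, double act_bytes) {
    double flops = 2.0 * B * K * N;
    double bytes = K * N * weight_bytes                      // weights read once
                 + B * K * act_bytes + B * N * act_bytes;    // activations in/out
    return flops / bytes;
}

int main() {
    // Assumed example: a 4096x4096 layer with FP16 activations.
    double hw_ratio = 295.0;  // assumption: rough FLOPs/byte for an H100-class GPU
    for (long B : {1, 16, 256}) {
        double fp16 = arithmetic_intensity(B, 4096, 4096, 2.0, 2.0);
        double int4 = arithmetic_intensity(B, 4096, 4096, 0.5, 2.0);
        std::printf("batch %4ld: FP16 weights %.0f FLOPs/B, INT4 weights %.0f FLOPs/B"
                    " (hw ratio ~%.0f)\n", B, fp16, int4, hw_ratio);
    }
    return 0;
}
```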
Citations
- FP8 Formats for Deep Learning - Micikevicius et al., 2022
- GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
- AWQ: Activation-aware Weight Quantization - Lin et al., 2023
- IEEE 754-2019 - Floating-Point Arithmetic Standard
- OCP Microscaling Formats - MX format specification