Chapter 6

Quantization

Floating point formats, numerical stability, and the math behind INT8, INT4, and FP8 quantization. Essential for deploying efficient inference at scale.

Building on Chapters 2-5
You've optimized compute (Ch4) and memory access patterns (Ch2-5). Quantization attacks the problem from a different angle: reducing the data itself. Every concept from earlier chapters applies—bandwidth limits, memory hierarchy, tiling—but now with 2x or 4x more effective bandwidth.
What You'll Learn
  1. Explain the memory/accuracy tradeoff in quantization
  2. Convert between floating point formats (FP32, FP16, FP8)
  3. Implement symmetric and asymmetric quantization
  4. Choose appropriate quantization strategies for different workloads
  5. Identify when quantization will/won't help performance
📚 Prerequisites

This chapter covers floating point representation and numerical concepts. For a refresher, see Floating Point Basics.

01 — FLOATING POINT

Floating Point Representation

Understanding floating point formats is crucial for quantization and numerical stability. Every float has three parts: Sign (1 bit), Exponent (range), Mantissa (precision).

Interactive: Floating Point Bits

[Interactive widget: toggle the 32 bits of an FP32 value and watch the decimal and hex representations update. Example shown: 1.0 = 0x3F800000.]

FP32 layout: 1 sign + 8 exponent + 23 mantissa bits. Range: ±3.4×10³⁸.
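Outside the widget, you can decompose the bit pattern yourself. The sketch below is my own illustration (the function name print_fp32_fields is not from the original text); it extracts the sign, exponent, and mantissa fields of an FP32 value in C:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Print the sign, exponent, and mantissa fields of an FP32 value. */
void print_fp32_fields(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* reinterpret the bit pattern */
    unsigned sign     = bits >> 31;              /* 1 bit */
    unsigned exponent = (bits >> 23) & 0xFF;     /* 8 bits, biased by 127 */
    unsigned mantissa = bits & 0x7FFFFF;         /* 23 bits */
    printf("%g = 0x%08X  sign=%u  exponent=%u (unbiased %d)  mantissa=0x%06X\n",
           f, (unsigned)bits, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void) {
    print_fp32_fields(1.0f);    /* prints 0x3F800000: sign 0, exponent 127, mantissa 0 */
    print_fp32_fields(-3.7f);
    return 0;
}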

Format Comparison
Format        Sign  Exponent  Mantissa  Max Value
FP32 (float)  1     8         23        ±3.4×10³⁸
FP16 (half)   1     5         10        ±65,504
FP8 E4M3      1     4         3         ±448
FP8 E5M2      1     5         2         ±57,344

Machine Epsilon

Machine epsilon is the gap between 1.0 and the next representable number: ε = 2⁻ᵖ for a format with p mantissa bits. Adding anything much smaller than ε to 1.0 has no effect, so ε defines the precision limit.

Epsilon by Format

Format     Mantissa Bits  Epsilon (ε)  Decimal
FP32       23             2⁻²³         ~1.19×10⁻⁷
FP16       10             2⁻¹⁰         ~9.77×10⁻⁴
FP8 E4M3   3              2⁻³          0.125
FP8 E5M2   2              2⁻²          0.25
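You can measure this directly: keep halving a candidate epsilon until adding it to 1.0 no longer changes the sum. A minimal sketch (my own illustration, not from the original text):

#include <stdio.h>

int main(void) {
    /* Halve eps until 1.0f + eps/2 rounds back to 1.0f. The last eps that
       still changed the sum is the FP32 machine epsilon, 2^-23. */
    float eps = 1.0f;
    volatile float sum = 1.0f + eps / 2.0f;   /* volatile forces FP32 rounding */
    while (sum != 1.0f) {
        eps /= 2.0f;
        sum = 1.0f + eps / 2.0f;
    }
    printf("FP32 epsilon ~ %.10e\n", eps);    /* ~1.1920928955e-07, i.e. 2^-23 */
    return 0;
}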
In FP16, what is the next representable number after 1.0?
  • 1.0001 (too small to represent)
  • 1.0009765625 (1 + 2⁻¹⁰)
  • 1.001
  • 1.5
Special Values

IEEE 754 reserves bit patterns for special values: ±Infinity (exponent all 1s, mantissa 0), NaN (exponent all 1s, mantissa non-zero), Denormals (exponent all 0s, gradual underflow). Always check for these in numerical code.
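Checking for these in C uses the standard <math.h> classification macros. A small sketch (the test values are my own, chosen to trigger each case):

#include <stdio.h>
#include <math.h>

/* Classify a value before trusting it in downstream arithmetic. */
void check_value(float x) {
    if (isnan(x)) {
        printf("%g is NaN\n", (double)x);
    } else if (isinf(x)) {
        printf("%g is infinite\n", (double)x);
    } else if (fpclassify(x) == FP_SUBNORMAL) {
        printf("%g is denormal (gradual underflow)\n", (double)x);
    } else {
        printf("%g is normal (or zero)\n", (double)x);
    }
}

int main(void) {
    float big = 1e30f;
    float zero = 0.0f;
    check_value(1.0f);
    check_value(big * big);      /* overflows FP32 to +Inf */
    check_value(zero / zero);    /* 0/0 produces NaN */
    check_value(1e-45f);         /* rounds to the smallest positive FP32 denormal */
    return 0;
}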

FP8 E4M3 vs E5M2: E4M3 has:
  • More range
  • More precision (3 mantissa bits)
  • Same as E5M2

02 — NUMERICAL STABILITY

Avoiding Numerical Disasters

Floating point has limited range and precision. Large intermediate values cause overflow; small differences between large numbers cause catastrophic cancellation.

The exp() Overflow Problem

In softmax, we compute exp(x). But exp(100) ≈ 2.7×10⁴³, which overflows FP16 (max ~65,504), and even FP32 overflows at around exp(89).

Naive Softmax

exp(x) / Σexp(x)

x = [100, 101, 102]
exp(100) = OVERFLOW
exp(101) = OVERFLOW
exp(102) = OVERFLOW
Result: NaN

Stable Softmax

exp(x - max) / Σexp(x - max)

x = [100, 101, 102]
max = 102
exp(-2) = 0.135
exp(-1) = 0.368
exp(0) = 1.0
Result: [0.09, 0.24, 0.67]
The Max-Subtraction Trick

Mathematically, for any constant m, exp(xᵢ - m) / Σⱼ exp(xⱼ - m) = exp(xᵢ) / Σⱼ exp(xⱼ): the common factor exp(-m) cancels between numerator and denominator. Choosing m = max(x) makes every exponent ≤ 0, so exp() never exceeds 1.0 and cannot overflow. This is used in FlashAttention and all production softmax implementations.
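In code, the trick is three passes over the row: find the max, exponentiate the shifted values while accumulating their sum, then normalize. A minimal sketch (the function name stable_softmax is my own, not from a particular library):

#include <math.h>
#include <stddef.h>

/* Numerically stable softmax: subtract the row max before exp(). */
void stable_softmax(const float *x, float *out, size_t n) {
    /* Pass 1: find the maximum value. */
    float m = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > m) m = x[i];
    }

    /* Pass 2: exponentiate shifted values and accumulate the sum.
       Every exponent is <= 0, so expf() never overflows. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        out[i] = expf(x[i] - m);
        sum += out[i];
    }

    /* Pass 3: normalize. */
    for (size_t i = 0; i < n; i++) {
        out[i] /= sum;
    }
}

With x = [100, 101, 102] this reproduces the [0.09, 0.24, 0.67] result from the worked example above.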

Catastrophic Cancellation

When subtracting two nearly equal numbers, significant digits cancel and relative error explodes.

Example: Loss of Precision

// In FP32 (~7 significant decimal digits)
float a = 1.0000001f;
float b = 1.0000000f;
float c = a - b;   // Expected: 1.0e-7

// Actual: c ≈ 1.19e-7, because the closest FP32 value to 1.0000001 is 1 + 2^-23.
// Only about one significant digit of the true difference survives;
// the relative error can approach 100%.

This matters in numerical derivatives, residual computations, and anywhere you compute differences of large similar values.

Error Accumulation in Summation

Adding many small values to a large accumulator loses precision. Each addition rounds, and errors accumulate.

Kahan Summation

Track the running error and compensate in subsequent additions:

float sum = 0.0f;
float c = 0.0f;  // Running compensation

for (int i = 0; i < n; i++) {
    float y = x[i] - c;       // Compensated input
    float t = sum + y;        // Tentative sum
    c = (t - sum) - y;        // Recover lost low bits
    sum = t;
}

Kahan summation reduces error from O(n·ε) to O(ε). In practice, sorting values by magnitude before summing or using tree reduction also helps.
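The tree reduction mentioned above can be written as a recursive pairwise sum; each value then passes through only about log₂ n additions, so the error grows like O(log n · ε) instead of O(n · ε). A sketch (my own illustration):

#include <stddef.h>

/* Pairwise (tree) summation: split the array, sum each half, combine. */
float pairwise_sum(const float *x, size_t n) {
    if (n <= 8) {                      // small base case: plain loop
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}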

Why does subtracting max(x) before exp() in softmax not change the result?
  • It does change the result, but the error is acceptable
  • exp(a-c)/exp(b-c) = exp(a)/exp(b) — the constant cancels
  • The subtraction is only applied to the numerator
  • max(x) is always 1.0
03 — QUANTIZATION

Quantization Math

Quantization maps floating point values to lower-precision integers (or low-bit floats like FP8). This reduces memory bandwidth requirements but introduces quantization error.

Scale Factor Computation

scale = max(|x|) / max_representable
quantized = round(x / scale)
dequantized = quantized × scale
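As a concrete sketch of these three formulas (symmetric INT8, function names of my own choosing, assuming the tensor is not all zeros):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric INT8 quantization: one scale per tensor, no zero point. */
float compute_scale(const float *x, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    return max_abs / 127.0f;           /* 127 = largest INT8 magnitude used */
}

int8_t quantize(float x, float scale) {
    float q = roundf(x / scale);
    if (q >  127.0f) q =  127.0f;      /* clamp in case x exceeds the calibrated range */
    if (q < -127.0f) q = -127.0f;
    return (int8_t)q;
}

float dequantize(int8_t q, float scale) {
    return (float)q * scale;
}

Each round-trip value lands within scale/2 of the original, which relative to the tensor's maximum is on the order of the ~0.4% figure listed for INT8 below.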
Interactive: Quantization Round-Trip (input range -8 to 8)

Example: original = 3.70 → quantized = 3.43, error = 0.27.

Computation

scale = 8.0 / 7 = 1.143, quantized_int = round(3.7 / 1.143) = 3, dequantized = 3 × 1.143 ≈ 3.43

Block Scaling

Computing one scale per tensor wastes precision. Block scaling computes a separate scale factor for each block of values (typically 16-128 elements).

Why Groups of 16 or 32?

  • Matches GPU warp size (32 threads) for efficient computation
  • Tensor Core MMA shapes (16×16, 8×8) align naturally
  • Amortizes scale factor storage overhead
  • Balance between precision and metadata cost

OCP Microscaling (MX) formats use 32-element blocks. NVIDIA's NVFP4 uses 16-element blocks with FP8 scale factors.
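A per-block variant of the earlier sketch looks like the following (32-element blocks to match the MX block size; the names and the divisibility assumption are mine):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32   /* one scale per 32 elements */

/* Block-scaled symmetric INT8 quantization: one scale per BLOCK values.
   Assumes n is a multiple of BLOCK for brevity. */
void quantize_blocked(const float *x, int8_t *q, float *scales, size_t n) {
    for (size_t b = 0; b < n / BLOCK; b++) {
        const float *blk = x + b * BLOCK;

        /* Per-block scale from the block's own max magnitude. */
        float max_abs = 0.0f;
        for (size_t i = 0; i < BLOCK; i++) {
            float a = fabsf(blk[i]);
            if (a > max_abs) max_abs = a;
        }
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        scales[b] = scale;

        for (size_t i = 0; i < BLOCK; i++) {
            q[b * BLOCK + i] = (int8_t)roundf(blk[i] / scale);
        }
    }
}

An outlier now only inflates the scale of its own block rather than the whole tensor, which is why block scaling preserves more precision.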

Quantization Error Bounds

Format     Levels  Max Relative Error  Use Case
INT8       256     ~0.4%               Weights, KV cache
INT4       16      ~7%                 Weights (with fine-tuning)
FP8 E4M3   256*    ~6.25%              Activations, general
FP4 E2M1   16*     ~25%                With block scaling

* FP formats have non-uniform spacing; error varies by magnitude.

A tensor has values in [-2.4, 3.1]. What scale factor maps this to INT8 symmetric range [-127, 127]?
  • 3.1 / 127 = 0.0244
  • max(2.4, 3.1) / 127 = 3.1 / 127 = 0.0244
  • 2.4 / 127 = 0.0189
  • (3.1 + 2.4) / 255 = 0.0216
04 — IN PRACTICE

Production Quantization

Modern inference systems combine multiple quantization strategies. Here's how the pieces fit together.

Common Patterns

Weight Quantization (W8A16, W4A16)

Weights stored in INT8 or INT4, dequantized to FP16 for computation. Memory-bound workloads benefit most. Used by GPTQ, AWQ, bitsandbytes.
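A minimal sketch of the weight-only idea (INT8 weights with float activations for simplicity, one scale per output row; the names are mine, not from any of the libraries above):

#include <stddef.h>
#include <stdint.h>

/* Weight-only quantized matrix-vector product: weights stay in INT8 in
   memory and each one is dequantized on the fly inside the dot product.
   w is row-major [rows x cols]; row_scale holds one scale per output row. */
void weight_only_matvec(const int8_t *w, const float *row_scale,
                        const float *x, float *y, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            float wf = (float)w[r * cols + c] * row_scale[r];  /* dequantize */
            acc += wf * x[c];
        }
        y[r] = acc;
    }
}

The arithmetic is still floating point; the win is that each weight crosses the memory bus as one byte instead of two (FP16) or four (FP32).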

Full INT8 (W8A8)

Both weights and activations in INT8. Requires calibration data to determine activation scales. Used by TensorRT-LLM, ONNX Runtime.
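In its simplest form, calibration just tracks the largest activation magnitude seen over a few representative batches; production toolchains use smarter statistics (percentile clipping, entropy-based methods), but a max-abs sketch (names mine) conveys the idea:

#include <math.h>
#include <stddef.h>

/* Max-abs activation calibration: observe activations from representative
   inputs, then derive an INT8 scale. Initialize max_abs to 0 before use. */
typedef struct {
    float max_abs;
} Calibrator;

void calibrator_observe(Calibrator *c, const float *act, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(act[i]);
        if (a > c->max_abs) c->max_abs = a;
    }
}

float calibrator_scale(const Calibrator *c) {
    return c->max_abs / 127.0f;   /* symmetric INT8 activation scale */
}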

FP8 (E4M3/E5M2)

Native hardware support on H100/B100+. E4M3 for forward pass (precision), E5M2 for gradients (range). Near-FP16 quality with 2× throughput.

NVFP4 (Blackwell)

4-bit floating point with per-block FP8 scales. 16-element blocks. 2× throughput vs FP8 with acceptable quality for inference.

Key Insight: Memory vs Compute

Weight-only quantization helps memory-bound inference (batch size 1, long sequences). Full quantization (W8A8, FP8) helps compute-bound workloads (large batches, short sequences). Profile your workload to choose the right strategy.
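A back-of-envelope way to see the difference (illustrative numbers of my own, assuming an FP16 baseline) is to compare bytes moved with FLOPs performed:

#include <stdio.h>

int main(void) {
    /* Single-token decode through a 4096x4096 layer (batch size 1). */
    double rows = 4096.0, cols = 4096.0;
    double flops = 2.0 * rows * cols;        /* one multiply-add per weight */

    double bytes_fp16 = rows * cols * 2.0;   /* FP16 weights */
    double bytes_int4 = rows * cols * 0.5;   /* INT4 weights (W4A16) */

    printf("FLOPs per byte, FP16 weights: %.1f\n", flops / bytes_fp16);  /* 1.0 */
    printf("FLOPs per byte, INT4 weights: %.1f\n", flops / bytes_int4);  /* 4.0 */

    /* GPUs sustain tens to hundreds of FLOPs per byte of memory bandwidth,
       so batch-1 decode is firmly memory-bound: shrinking the weights raises
       achievable throughput almost proportionally. */
    return 0;
}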

Quantization helps MOST when:
  • Compute-bound (large batches)
  • Memory-bound (bandwidth limited)
  • Neither - always helps equally
