Quantization
Floating point formats, numerical stability, and the math behind INT8, INT4, and FP8 quantization. Essential for deploying efficient inference at scale.
- Explain the memory/accuracy tradeoff in quantization
- Convert between floating point formats (FP32, FP16, FP8)
- Implement symmetric and asymmetric quantization
- Choose appropriate quantization strategies for different workloads
- Identify when quantization will/won't help performance
This chapter covers floating point representation and numerical stability, plus quantization labs on FP8 conversion, INT8/INT4, and NVFP4.
Floating Point Representation
Understanding floating point formats is crucial for quantization and numerical stability. Every float has three fields: a sign bit, an exponent (which sets the dynamic range), and a mantissa (which sets the precision).
FP32: 1 sign + 8 exponent + 23 mantissa bits. Range: ±3.4×10^38
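As a concrete illustration, here is a minimal C++ sketch (assuming the IEEE 754 binary32 layout) that unpacks the three fields of an FP32 value:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float x = 3.14f;

    // Reinterpret the 32 bits of the float (IEEE 754 binary32 layout).
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    uint32_t sign     = bits >> 31;           // 1 bit
    uint32_t exponent = (bits >> 23) & 0xFF;  // 8 bits, stored with a bias of 127
    uint32_t mantissa = bits & 0x7FFFFF;      // 23 bits, implicit leading 1

    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                (unsigned)sign, (unsigned)exponent,
                (int)exponent - 127, (unsigned)mantissa);
    return 0;
}
```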
Machine Epsilon
Machine epsilon (ε) is the gap between 1.0 and the next representable floating point value, equal to 2^-(mantissa bits). Adding anything much smaller than ε to 1.0 leaves it unchanged, so ε defines the format's precision limit.
Epsilon by Format
| Format | Mantissa Bits | Epsilon (ε) | Decimal |
|---|---|---|---|
| FP32 | 23 | 2^-23 | ~1.19×10^-7 |
| FP16 | 10 | 2^-10 | ~9.77×10^-4 |
| FP8 E4M3 | 3 | 2^-3 | 0.125 |
| FP8 E5M2 | 2 | 2^-2 | 0.25 |
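These values are easy to confirm empirically; a minimal C++ sketch:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // The standard library reports machine epsilon directly...
    std::printf("FP32 epsilon: %.10e\n", std::numeric_limits<float>::epsilon());

    // ...and it equals the gap between 1.0f and the next representable float.
    float next = std::nextafter(1.0f, 2.0f);
    std::printf("nextafter(1.0f) - 1.0f: %.10e\n", next - 1.0f);
    return 0;
}
```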
IEEE 754 reserves bit patterns for special values: ±Infinity (exponent all 1s, mantissa 0), NaN (exponent all 1s, mantissa non-zero), Denormals (exponent all 0s, gradual underflow). Always check for these in numerical code.
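In C++, these checks are available through &lt;cmath&gt;; a minimal sketch (the classify helper is illustrative, not a standard function):

```cpp
#include <cmath>
#include <cstdio>

// Classify a value before feeding it into a numerical kernel.
static const char* classify(float x) {
    if (std::isnan(x))                  return "NaN";
    if (std::isinf(x))                  return "Infinity";
    if (x != 0.0f && !std::isnormal(x)) return "denormal";
    return "normal";
}

int main() {
    float zero = 0.0f;
    std::printf("%s %s %s %s\n",
                classify(zero / zero),  // NaN: exponent all 1s, mantissa non-zero
                classify(1.0f / zero),  // +Infinity: exponent all 1s, mantissa 0
                classify(1e-45f),       // denormal: exponent all 0s
                classify(1.0f));        // ordinary normal value
    return 0;
}
```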
Avoiding Numerical Disasters
Floating point has limited range and precision. Large intermediate values cause overflow; small differences between large numbers cause catastrophic cancellation.
The exp() Overflow Problem
In softmax, we compute exp(x). But exp(100) ≈ 2.7×10^43, which overflows FP16 (max ~65504), and even FP32 overflows once the argument reaches about 89.
Naive Softmax
exp(x) / Σexp(x)
x = [100, 101, 102]
exp(100) = OVERFLOW
exp(101) = OVERFLOW
exp(102) = OVERFLOW
Result: NaN
Stable Softmax
exp(x - max) / Σexp(x - max)
x = [100, 101, 102]
max = 102
exp(-2) = 0.135
exp(-1) = 0.368
exp(0) = 1.0
Result: [0.09, 0.24, 0.67]
Mathematically, exp(xᵢ − m) / Σⱼ exp(xⱼ − m) = exp(xᵢ) / Σⱼ exp(xⱼ) for any constant m, because the common factor exp(−m) cancels. Choosing m = max(x) ensures all exponents are ≤ 0, preventing overflow. This trick is used in FlashAttention and all production softmax implementations.
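A minimal C++ sketch of the stable version (find the max, then exponentiate the shifted values and normalize):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable softmax: subtract the max before exponentiating
// so every argument to exp() is <= 0 and cannot overflow.
std::vector<float> softmax(const std::vector<float>& x) {
    float m = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}

int main() {
    auto p = softmax({100.0f, 101.0f, 102.0f});
    std::printf("%.2f %.2f %.2f\n", p[0], p[1], p[2]);  // ~0.09 0.24 0.67
    return 0;
}
```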
Catastrophic Cancellation
When subtracting two nearly equal numbers, significant digits cancel and relative error explodes.
Example: Loss of Precision
// In FP32, with ~7 significant decimal digits
float a = 1.0000001f;
float b = 1.0000000f;
float c = a - b;  // Expected: 1.0e-7
// Actual: ~1.19e-7, because the result is dominated by the rounding of a.
// Only about 1 significant digit survives; in general the relative error can approach 100%.
This matters in numerical derivatives, residual computations, and anywhere you compute differences of large similar values.
Error Accumulation in Summation
Adding many small values to a large accumulator loses precision. Each addition rounds, and errors accumulate.
Kahan Summation
Track the running error and compensate in subsequent additions:
float sum = 0.0f;
float c = 0.0f; // Running compensation
for (int i = 0; i < n; i++) {
float y = x[i] - c; // Compensated input
float t = sum + y; // Tentative sum
c = (t - sum) - y; // Recover lost low bits
sum = t;
}
Kahan summation reduces error from O(n·ε) to O(ε). In practice, sorting values by magnitude before summing or using tree reduction also helps.
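A quick comparison sketch (exact output depends on the compiler; aggressive fast-math flags can optimize the compensation away):

```cpp
#include <cstdio>

int main() {
    const int n = 10000000;      // sum 0.1f ten million times; exact answer ~1e6
    float naive = 0.0f;
    float sum = 0.0f, c = 0.0f;  // Kahan accumulator and running compensation

    for (int i = 0; i < n; i++) {
        float x = 0.1f;
        naive += x;              // plain accumulation: rounds on every add
        float y = x - c;         // Kahan: feed back the previously lost bits
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    // The naive sum drifts visibly away from 1e6; the compensated sum stays close.
    std::printf("naive: %.2f  kahan: %.2f\n", naive, sum);
    return 0;
}
```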
Quantization Math
Quantization maps floating point values to lower-precision integers (or low-bit floats like FP8). This reduces memory bandwidth requirements but introduces quantization error.
Scale Factor Computation
Symmetric quantization maps the largest magnitude in the tensor onto the largest integer level. With max|x| = 8.0 and a signed 4-bit integer (largest positive level 2^3 − 1 = 7): scale = 8.0 / 7 ≈ 1.143, quantized_int = round(3.7 / 1.143) = 3, dequantized = 3 × 1.143 ≈ 3.43, for an absolute error of about 0.27.
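A minimal C++ sketch of per-value symmetric and asymmetric INT8 quantization (the helper names are illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric: zero maps to integer 0; one scale per tensor (or per block).
int8_t quantize_symmetric(float x, float max_abs) {
    float scale = max_abs / 127.0f;                   // qmax = 2^7 - 1
    float q = std::round(x / scale);
    return (int8_t)std::clamp(q, -127.0f, 127.0f);
}

// Asymmetric: an explicit zero point shifts the range [min, max] onto [0, 255].
uint8_t quantize_asymmetric(float x, float min, float max) {
    float scale = (max - min) / 255.0f;
    float zero_point = std::round(-min / scale);
    float q = std::round(x / scale) + zero_point;
    return (uint8_t)std::clamp(q, 0.0f, 255.0f);
}

int main() {
    // Dequantize by reversing the mapping: x ≈ (q - zero_point) * scale.
    int8_t qs = quantize_symmetric(3.7f, 8.0f);
    std::printf("symmetric: q=%d dequantized=%.3f\n", qs, qs * (8.0f / 127.0f));

    uint8_t qa = quantize_asymmetric(3.7f, -2.0f, 8.0f);
    float scale = 10.0f / 255.0f, zp = std::round(2.0f / scale);
    std::printf("asymmetric: q=%u dequantized=%.3f\n", qa, (qa - zp) * scale);
    return 0;
}
```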
Block Scaling
Computing one scale per tensor wastes precision. Block scaling computes a separate scale factor for each block of values (typically 16-128 elements).
Why Groups of 16 or 32?
- Matches GPU warp size (32 threads) for efficient computation
- Tensor Core MMA shapes (16×16, 8×8) align naturally
- Amortizes scale factor storage overhead
- Balance between precision and metadata cost
OCP Microscaling (MX) formats use 32-element blocks. NVIDIA's NVFP4 uses 16-element blocks with FP8 scale factors.
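A sketch of block-wise symmetric quantization (32-element blocks and an INT4 target chosen for illustration; real kernels pack two 4-bit values per byte and store the scales in a compact format such as FP8):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr size_t kBlockSize = 32;  // e.g., MX-style 32-element blocks

// Quantize each block to signed INT4 with its own scale factor.
// (Values are stored one per byte here for clarity.)
void quantize_blocks(const std::vector<float>& x,
                     std::vector<int8_t>& q, std::vector<float>& scales) {
    for (size_t start = 0; start < x.size(); start += kBlockSize) {
        size_t end = std::min(start + kBlockSize, x.size());

        // Per-block scale: the largest magnitude in the block maps to qmax = 7.
        float max_abs = 0.0f;
        for (size_t i = start; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(x[i]));
        float scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;
        scales.push_back(scale);

        for (size_t i = start; i < end; ++i)
            q.push_back((int8_t)std::clamp(std::round(x[i] / scale), -7.0f, 7.0f));
    }
}
```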
Quantization Error Bounds
| Format | Levels | Max Relative Error | Use Case |
|---|---|---|---|
| INT8 | 256 | ~0.4% | Weights, KV cache |
| INT4 | 16 | ~7% | Weights (with fine-tuning) |
| FP8 E4M3 | 256* | ~6.25% | Activations, general |
| FP4 E2M1 | 16* | ~25% | With block scaling |
* FP formats have non-uniform spacing: absolute error grows with magnitude while relative error stays roughly constant. The integer-format figures are relative to full scale, so small values see proportionally larger relative error.
Production Quantization
Modern inference systems combine multiple quantization strategies. Here's how the pieces fit together.
Common Patterns
Weight Quantization (W8A16, W4A16)
Weights stored in INT8 or INT4, dequantized to FP16 for computation. Memory-bound workloads benefit most. Used by GPTQ, AWQ, bitsandbytes.
Full INT8 (W8A8)
Both weights and activations in INT8. Requires calibration data to determine activation scales. Used by TensorRT-LLM, ONNX Runtime.
FP8 (E4M3/E5M2)
Native hardware support on H100/B100+. E4M3 for forward pass (precision), E5M2 for gradients (range). Near-FP16 quality with 2× throughput.
NVFP4 (Blackwell)
4-bit floating point with per-block FP8 scales. 16-element blocks. 2× throughput vs FP8 with acceptable quality for inference.
Weight-only quantization helps memory-bound inference (batch size 1, long sequences). Full quantization (W8A8, FP8) helps compute-bound workloads (large batches, short sequences). Profile your workload to choose the right strategy.
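As a rough back-of-the-envelope check (a sketch, not a profiler; the hardware FLOPs-per-byte ratio below is an assumed example and should be replaced with your GPU's actual specs):

```cpp
#include <cstdio>

// Estimate arithmetic intensity (FLOPs per byte) of a weight matrix multiply:
// output[B, N] = input[B, K] x weight[K, N]. If intensity is well below the
// hardware FLOPs/byte ratio, the layer is memory-bound and weight-only
// quantization pays off directly.
double arithmetic_intensity(long B, long K, long N,
                            double weight_bytes, double act_bytes) {
    double flops = 2.0 * B * K * N;
    double bytes = K * N * weight_bytes                      // weights read once
                 + B * K * act_bytes + B * N * act_bytes;    // activations in/out
    return flops / bytes;
}

int main() {
    // Assumed example: a 4096x4096 layer with FP16 activations.
    double hw_ratio = 295.0;  // assumption: rough FLOPs/byte for an H100-class GPU
    for (long B : {1, 16, 256}) {
        double fp16 = arithmetic_intensity(B, 4096, 4096, 2.0, 2.0);
        double int4 = arithmetic_intensity(B, 4096, 4096, 0.5, 2.0);
        std::printf("batch %4ld: FP16 weights %.0f FLOPs/B, INT4 weights %.0f FLOPs/B"
                    " (hw ratio ~%.0f)\n", B, fp16, int4, hw_ratio);
    }
    return 0;
}
```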
Citations
- FP8 Formats for Deep Learning - Micikevicius et al., 2022
- GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
- AWQ: Activation-aware Weight Quantization - Lin et al., 2023
- IEEE 754-2019 - Floating-Point Arithmetic Standard
- OCP Microscaling Formats - MX format specification