Production
Deploy efficient LLM inference at scale. KV cache strategies, PagedAttention, TensorRT-LLM, and vLLM—the systems that power real-world AI applications.
- Explain continuous batching and its benefits
- Describe KV cache management strategies
- Identify bottlenecks in inference serving systems
- Choose between inference frameworks (vLLM, TensorRT-LLM)
- Design a system meeting latency and throughput requirements
This chapter builds on attention and quantization concepts. Chapter 5: Attention | Chapter 6: Quantization
Part 4: Production Labs - KV cache, quantized inference
Why KV Cache Matters
In autoregressive generation, each new token attends to all previous tokens. Without caching, every step would rerun the forward pass over the entire sequence, recomputing the K and V projections for all prior tokens: O(n²) work per generated token.
The KV cache stores computed key and value vectors, so each generation step only computes K/V for the new token. This reduces complexity to O(n) per step, but creates a new bottleneck: memory.
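To make the cached decode step concrete, here is a toy single-head sketch in NumPy; it ignores batching, multi-head layout, positional encodings, and masking, and every name in it is made up for illustration.

```python
# One decode step with a KV cache: compute K/V only for the new token,
# then attend over the whole cache. Single head, no batching (toy sketch).
import numpy as np

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
k_cache, v_cache = [], []          # grows by one row per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    q = x_new @ Wq                              # query for the new token only
    k_cache.append(x_new @ Wk)                  # K/V computed for the new token only...
    v_cache.append(x_new @ Wv)                  # ...older rows are reused from the cache
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)               # O(n) attention over cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # attention output for the new token

for _ in range(10):                             # generate 10 tokens
    out = decode_step(np.random.randn(d))
```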
KV Cache Memory Formula
Per sequence: KV bytes = 2 (K and V) × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element. For Llama 2 70B (80 layers, 8 KV heads of dimension 128 under grouped-query attention) with a 4K context in FP16, that works out to roughly 1.3 GB per sequence; at batch size 32, over 40 GB just for KV cache. The vLLM paper makes the same argument for OPT-13B, where the cache for a single 2K-token request reaches about 1.6 GB.
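A quick sanity check of that arithmetic (a sketch; the layer and head counts below are assumed to match Llama 2 70B's published grouped-query attention configuration):

```python
# Back-of-the-envelope KV cache sizing.

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size, bytes_per_elem=2):
    # 2x for keys and values, one vector per layer per token
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_elem

# Assumed Llama 2 70B shape: 80 layers, 8 KV heads of dim 128 (GQA), FP16
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128,
                         seq_len=4096, batch_size=1)
print(f"per sequence: {per_seq / 1e9:.2f} GB")        # ~1.34 GB
print(f"batch of 32:  {32 * per_seq / 1e9:.1f} GB")   # ~43 GB
```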
At long contexts and realistic batch sizes, the KV cache can rival or exceed the model weights in memory usage: a 70B model serving a batch of 32K-context sequences can need well over 100 GB for KV storage alone. This is why KV cache optimization is critical for production deployment.
Virtual Memory for KV Cache
PagedAttention (introduced in vLLM) applies virtual memory concepts to KV cache management. Instead of pre-allocating contiguous memory per sequence, it allocates fixed-size blocks on demand.
Traditional allocation reserves contiguous memory for each sequence's maximum possible length, so space is wasted on output that is never generated, and the allocator fragments as sequences of different lengths come and go. PagedAttention eliminates nearly all of this waste.
PagedAttention Benefits
| Aspect | Traditional | PagedAttention |
|---|---|---|
| Memory utilization | ~20-40% | ~95%+ |
| Max batch size | Limited by worst-case | Adapts dynamically |
| Memory fragmentation | Significant | Near zero |
| Prefix sharing | Duplicated | Copy-on-write |
Source: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention"
PagedAttention also enables efficient beam search and parallel sampling by sharing KV cache blocks across sequences. Only blocks that diverge are copied (copy-on-write), so a prefix common to all beams is stored once rather than once per beam.
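Below is a toy sketch of how a block table with reference-counted, copy-on-write blocks can support this kind of sharing. The class and method names are invented for illustration, and the logic is far simpler than vLLM's actual block manager (it does not, for example, copy block contents on write).

```python
# Toy paged KV allocator: fixed-size blocks, allocated on demand, shared
# copy-on-write across sequences. Illustrative only.

BLOCK_SIZE = 16  # tokens per physical block

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.alloc())    # new block on demand
        else:
            last = self.block_table[-1]
            if self.pool.refcount[last] > 1:              # shared block: copy-on-write
                self.pool.release(last)
                self.block_table[-1] = self.pool.alloc()  # real code would also copy the data
        self.num_tokens += 1

    def fork(self):
        # Parallel sampling / beam search: the child shares all parent blocks.
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for b in child.block_table:
            self.pool.share(b)
        return child

pool = BlockPool(num_blocks=64)
prompt = Sequence(pool)
for _ in range(40):                           # 40-token prompt -> 3 blocks
    prompt.append_token()
beams = [prompt.fork() for _ in range(4)]     # 4 beams share the 3 prompt blocks
for beam in beams:
    beam.append_token()                       # each beam copy-on-writes only its last block
```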
Iteration-Level Scheduling
Traditional batching waits for all sequences to complete before starting new ones. Continuous batching (also called iteration-level scheduling) allows new requests to join mid-batch as others finish.
Request flow: incoming prompts → per-iteration batching → PagedAttention → streaming tokens.
Static Batching
- Wait for batch to fill
- All sequences same length (padded)
- Batch completes together
- GPU idle during padding
Continuous Batching
- Add requests immediately
- Variable sequence lengths
- Sequences exit independently
- Higher GPU utilization
Continuous batching can improve throughput by 10-20× compared to static batching for variable-length workloads.
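A minimal sketch of the scheduling loop behind iteration-level batching; the request format, the decode_step stand-in, and the MAX_BATCH limit are all hypothetical, and real schedulers such as Orca or vLLM additionally handle prefill vs. decode phases, preemption, and KV block availability.

```python
# Toy continuous batching loop: requests join the running batch as soon as
# earlier requests finish, instead of waiting for the whole batch to drain.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 8

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def done(self) -> bool:
        # Real serving also checks for EOS tokens.
        return len(self.generated) >= self.max_new_tokens

def decode_step(batch):
    # Stand-in for one forward pass producing one token per running sequence.
    for req in batch:
        req.generated.append("<tok>")

def serve(waiting: deque):
    running = []
    while waiting or running:
        # Admit new requests into free batch slots (iteration-level scheduling).
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        # Finished sequences exit immediately, freeing slots for the next iteration.
        running = [r for r in running if not r.done()]

queue = deque(Request(f"prompt {i}", max_new_tokens=(i % 5) + 1) for i in range(20))
serve(queue)
```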
NVIDIA's Optimized Inference
TensorRT-LLM is NVIDIA's library for optimized LLM inference. It compiles models into optimized TensorRT engines with fused kernels, in-flight batching, and hardware-specific optimizations.
Key Features
| Feature | Description |
|---|---|
| Fused Attention | FlashAttention-2 + fused QKV projection + RoPE |
| In-flight Batching | Continuous batching with paged KV cache |
| Quantization | FP8, INT8, INT4 (AWQ, GPTQ), W4A8 |
| Speculative Decoding | Draft model + verification for faster generation |
| Tensor Parallelism | Multi-GPU inference with NVLink optimization |
```python
# TensorRT-LLM basic usage
from tensorrt_llm import LLM, SamplingParams

# Load optimized model
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate with sampling params
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=100),
)

for output in outputs:
    print(output.outputs[0].text)
```
Best for: NVIDIA GPUs (especially H100/A100), maximum throughput, production deployments with stable models. Requires model compilation step but delivers highest performance on NVIDIA hardware.
High-Throughput Serving
vLLM is an open-source library focused on high-throughput serving. It introduced PagedAttention and provides an easy-to-use API compatible with OpenAI's format.
vLLM Architecture
| Component | Implementation |
|---|---|
| KV Cache | PagedAttention with block-level management |
| Batching | Continuous batching with preemption support |
| Attention | FlashAttention / FlashInfer backends |
| Quantization | AWQ, GPTQ, FP8, bitsandbytes |
| Parallelism | Tensor parallel, pipeline parallel |
```python
# vLLM offline inference
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
```
```bash
# vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 2

# Client usage (standard OpenAI API)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 50}'
```
Reducing Memory Further
Even with PagedAttention, KV cache dominates memory for long contexts. Quantizing the KV cache to INT8 or FP8 can halve memory usage with minimal quality loss.
KV Cache Quantization Options
| Format | Memory Reduction | Quality Impact | Support |
|---|---|---|---|
| FP16 (baseline) | 1× | None | All frameworks |
| FP8 E4M3 | 2× | Minimal | TensorRT-LLM, vLLM |
| INT8 | 2× | Low | TensorRT-LLM, vLLM |
| INT4 | 4× | Moderate | Experimental |
FP8 KV cache is recommended for H100/B100 deployments. See KIVI for INT2 KV cache research.
Start with FP8 KV cache on Hopper+ GPUs—it's nearly lossless and halves memory. For extreme memory constraints, INT8 with per-head scaling works well. INT4 requires careful evaluation on your specific use case.
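As a conceptual illustration of the per-head INT8 scheme mentioned above, here is a small NumPy sketch of symmetric quantization of a cached K tensor. It is not any framework's implementation (in practice this is a configuration option, e.g. vLLM's kv_cache_dtype setting), and the tensor shape is chosen arbitrarily.

```python
# Per-head symmetric INT8 quantization of a KV cache tensor (conceptual sketch).
import numpy as np

def quantize_per_head(k: np.ndarray):
    # k: [num_heads, seq_len, head_dim] in FP16/FP32
    scale = np.abs(k).max(axis=(1, 2), keepdims=True) / 127.0   # one scale per head
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale                                             # int8 cache + FP scales

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

k = np.random.randn(8, 4096, 128).astype(np.float16)   # 8 KV heads, 4K tokens
q, scale = quantize_per_head(k)
print(f"fp16 cache: {k.nbytes / 1e6:.1f} MB, int8 cache: {q.nbytes / 1e6:.1f} MB")
err = np.abs(dequantize(q, scale) - k.astype(np.float32)).mean()
print(f"mean abs error: {err:.4f}")
```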
Framework Selection
Decision Matrix
| Priority | Recommendation |
|---|---|
| Maximum throughput on NVIDIA | TensorRT-LLM |
| Easy deployment, OpenAI compatibility | vLLM |
| Research / rapid iteration | TGI or vLLM |
| Multi-cloud / AMD GPUs | vLLM (ROCm support) |
| Edge deployment | llama.cpp |
References
- PagedAttention Paper - Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023
- Orca: Continuous Batching - Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models", OSDI 2022
- TensorRT-LLM Documentation - NVIDIA's official docs
- vLLM Documentation - Official vLLM docs
- KV Cache Quantization Survey - Methods and benchmarks
- KIVI: INT2 KV Cache - Extreme KV cache compression
- Continuous Batching Explained - Anyscale blog