Chapter 7

Production

Deploy efficient LLM inference at scale. KV cache strategies, PagedAttention, TensorRT-LLM, and vLLM—the systems that power real-world AI applications.

Everything Converges Here
GPU fundamentals (Ch1) determine how your server utilizes hardware. Memory hierarchy (Ch2) explains why KV cache is the bottleneck. Attention (Ch5) and quantization (Ch6) are the core computations you're serving. This chapter is about orchestrating all of it at scale.
What You'll Learn
  1. Explain continuous batching and its benefits
  2. Describe KV cache management strategies
  3. Identify bottlenecks in inference serving systems
  4. Choose between inference frameworks (vLLM, TensorRT-LLM)
  5. Design a system meeting latency and throughput requirements
📚 Prerequisites

This chapter builds on attention and quantization concepts. Chapter 5: Attention | Chapter 6: Quantization

01 — KV CACHE

Why KV Cache Matters

In autoregressive generation, each new token attends to all previous tokens. Without caching, every generation step would re-run attention over the entire prefix, recomputing K and V for tokens that have not changed: quadratic work per step.

The KV cache stores the key and value vectors already computed, so each step only computes K/V for the new token and attends to the cached history, reducing the cost to O(n) per step. The trade-off is a new bottleneck: memory.
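To make the mechanics concrete, here is a minimal single-head decode loop in PyTorch. The weights W_q, W_k, W_v and the dimensions are invented for illustration; a real model has many layers and heads, but the caching pattern is the same.

# Toy single-head decode loop with a KV cache (illustrative sketch, not a real model)
import torch

d_model, d_head = 64, 64
W_q = torch.randn(d_model, d_head) / d_model ** 0.5
W_k = torch.randn(d_model, d_head) / d_model ** 0.5
W_v = torch.randn(d_model, d_head) / d_model ** 0.5

k_cache, v_cache = [], []                    # grows by one entry per generated token

def decode_step(x_new):
    # Only the newest token's K/V are computed; earlier ones come from the cache.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    q = x_new @ W_q                          # (d_head,)
    K = torch.stack(k_cache)                 # (seq_len, d_head)
    V = torch.stack(v_cache)                 # (seq_len, d_head)
    scores = K @ q / d_head ** 0.5           # O(seq_len) work per step
    return torch.softmax(scores, dim=0) @ V  # attention output for the newest token

for _ in range(8):                           # generate 8 tokens
    out = decode_step(torch.randn(d_model))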

KV Cache Memory Formula

Memory = 2 × layers × heads × head_dim × seq_len × batch × bytes_per_element

For Llama 2 70B (80 layers, 8 KV heads via grouped-query attention, head_dim 128) with a 4K context in FP16, that works out to roughly 1.3 GB per sequence; at batch size 32, over 40 GB just for KV cache. Without GQA, the same 4K sequence would need about 10 GB. The vLLM paper presents a similar breakdown.
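A quick way to sanity-check a deployment is to script the formula. The helper below (kv_cache_bytes, a name invented here) is a direct translation of it; the Llama 2 70B numbers come from the published model configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_element=2):
    """Memory = 2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element

# Llama 2 70B with GQA: 80 layers, 8 KV heads, head_dim 128, FP16
per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=1)
print(f"{per_seq / 1e9:.2f} GB per sequence")                               # ~1.3 GB
print(f"{kv_cache_bytes(80, 8, 128, 4096, 32) / 1e9:.1f} GB at batch 32")   # ~43 GB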

The Memory Wall

For long contexts at realistic batch sizes, the KV cache can exceed the model weights themselves. A 70B model serving 32K-context requests needs roughly 10 GB of cache per sequence (with GQA), so a batch of ten requests already consumes 100+ GB just for KV storage. This is why KV cache optimization is critical for production deployment.

Batching improves throughput primarily through better GPU utilization: decode is memory-bandwidth-bound, and every forward pass must stream the full weight matrices from HBM regardless of how many sequences share that pass. Batching amortizes that fixed cost over many tokens per step, as the back-of-the-envelope sketch below shows.
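A rough estimate, assuming a 70B FP16 model (~140 GB of weights) and ~3 TB/s of HBM bandwidth (roughly H100 class). Real systems also read the KV cache and pay kernel overheads, so treat these numbers as optimistic upper bounds.

# Back-of-the-envelope: decode throughput when bound by weight reads (assumed figures)
weights_gb = 140            # 70B params x 2 bytes (FP16)
hbm_bw_gbps = 3000          # ~3 TB/s HBM bandwidth, H100-class (assumption)

# Each decode step must read every weight once, no matter the batch size.
steps_per_sec = hbm_bw_gbps / weights_gb        # ~21 forward passes per second

for batch in (1, 8, 32):
    print(f"batch {batch:>2}: ~{steps_per_sec * batch:5.0f} tokens/s (weight-read bound)")
# batch  1: ~   21 tokens/s
# batch  8: ~  171 tokens/s
# batch 32: ~  686 tokens/s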

02 — PAGEDATTENTION

Virtual Memory for KV Cache

PagedAttention (introduced in vLLM) applies virtual memory concepts to KV cache management. Instead of pre-allocating contiguous memory per sequence, it allocates fixed-size blocks on demand.

Traditional allocation reserves the worst-case context length up front and wastes the unused portion. PagedAttention wastes at most one partially filled block per sequence.

Traditional (contiguous, worst-case pre-allocation):
[ Seq1 | Seq1 | - | - | Seq2 | - | - | - ]   ("-" = reserved but unused)

Paged (fixed-size blocks allocated on demand):
[ S1:0 | S1:1 | S2:0 | free | free | free | free | free ]
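A toy block-table manager makes the bookkeeping concrete. This is a hypothetical sketch, not vLLM's implementation; the block size and method names are invented for illustration.

class ToyBlockManager:
    """Illustrative block-table bookkeeping in the spirit of PagedAttention (not vLLM's code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                  # tokens per block
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables = {}                        # seq_id -> list of physical blocks

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when the previous one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:  # last block full (or first token)
            table.append(self.free_blocks.pop())

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = ToyBlockManager(num_blocks=64)
for t in range(40):                          # a 40-token sequence uses ceil(40/16) = 3 blocks
    mgr.append_token("req-1", t)
print(len(mgr.block_tables["req-1"]))        # 3
mgr.free("req-1")                            # blocks become reusable by other sequences right away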

PagedAttention Benefits

Aspect | Traditional | PagedAttention
Memory utilization | ~50-60% | ~95%+
Max batch size | Limited by worst-case | Adapts dynamically
Memory fragmentation | Significant | Near zero
Prefix sharing | Duplicated | Copy-on-write

Source: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention"

Copy-on-Write for Beam Search

PagedAttention enables efficient beam search and parallel sampling by sharing KV cache blocks across sequences. Only modified blocks are copied, so a prefix shared by all beams is stored once instead of beam_width times.
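The sharing can be sketched with simple reference counts. Again, this is an illustrative toy, not vLLM's code; fork, write, and the allocator callback are invented names.

# Toy copy-on-write for shared prefix blocks (illustrative only)
ref_count = {}                              # physical block id -> number of sequences using it

def fork(parent_blocks):
    """A new beam shares all of the parent's prefix blocks; nothing is copied yet."""
    for block in parent_blocks:
        ref_count[block] = ref_count.get(block, 1) + 1
    return list(parent_blocks)              # child gets its own block *table*, not new blocks

def write(seq_blocks, index, allocate_block):
    """Copy a block only when a shared block is about to be modified."""
    block = seq_blocks[index]
    if ref_count.get(block, 1) > 1:         # shared -> copy-on-write
        ref_count[block] -= 1
        seq_blocks[index] = allocate_block()  # in practice the contents are copied too
    return seq_blocks[index]

parent = [0, 1, 2]
ref_count.update({0: 1, 1: 1, 2: 1})
child = fork(parent)                          # beams now share blocks 0-2
new_block_ids = iter(range(100, 200))
write(child, 2, lambda: next(new_block_ids))  # child diverges: only block 2 is copied
print(parent, child)                          # [0, 1, 2] [0, 1, 100]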

03 — CONTINUOUS BATCHING

Iteration-Level Scheduling

Traditional batching waits for all sequences to complete before starting new ones. Continuous batching (also called iteration-level scheduling) allows new requests to join mid-batch as others finish.

Request Queue (incoming prompts) → Scheduler (per-iteration batching) → KV Cache Manager (PagedAttention) → Output (streaming tokens)

Static Batching

  • Wait for batch to fill
  • All sequences same length (padded)
  • Batch completes together
  • GPU idle during padding

Continuous Batching

  • Add requests immediately
  • Variable sequence lengths
  • Sequences exit independently
  • Higher GPU utilization

Continuous batching can improve throughput by 10-20× compared to static batching for variable-length workloads.
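A minimal event-loop sketch of iteration-level scheduling. This is a toy model of the idea: a real scheduler also batches the decode step on the GPU, tracks KV block budgets, separates prefill from decode, and supports preemption.

from collections import deque

def serve(request_queue, max_batch=32):
    """Toy iteration-level scheduler: the batch is rebuilt every decode step."""
    running = []                                    # requests currently generating
    waiting = deque(request_queue)
    while running or waiting:
        # 1. Admit new requests whenever a slot (and, in practice, KV memory) is free.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # 2. One decode iteration for every running sequence.
        for req in running:
            req["generated"] += 1
        # 3. Finished sequences exit immediately; their slots free up next iteration.
        running = [r for r in running if r["generated"] < r["max_tokens"]]

requests = [{"generated": 0, "max_tokens": n} for n in (5, 50, 12, 300)]
serve(requests)   # short requests finish and leave without waiting for the long ones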

04 — TENSORRT-LLM

NVIDIA's Optimized Inference

TensorRT-LLM is NVIDIA's library for optimized LLM inference. It compiles models into optimized TensorRT engines with fused kernels, in-flight batching, and hardware-specific optimizations.

Key Features

Feature | Description
Fused Attention | FlashAttention-2 + fused QKV projection + RoPE
In-flight Batching | Continuous batching with paged KV cache
Quantization | FP8, INT8, INT4 (AWQ, GPTQ), W4A8
Speculative Decoding | Draft model + verification for faster generation
Tensor Parallelism | Multi-GPU inference with NVLink optimization
# TensorRT-LLM basic usage
from tensorrt_llm import LLM, SamplingParams

# Load optimized model
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate with sampling params
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=100),
)

for output in outputs:
    print(output.outputs[0].text)
When to Use TensorRT-LLM

Best for: NVIDIA GPUs (especially H100/A100), maximum throughput, production deployments with stable models. Requires model compilation step but delivers highest performance on NVIDIA hardware.

In short: choose TensorRT-LLM for maximum performance on NVIDIA GPUs; if you need flexibility, rapid iteration, or setup without a compilation step, vLLM is usually the better fit.

05 — VLLM

High-Throughput Serving

vLLM is an open-source library focused on high-throughput serving. It introduced PagedAttention and provides an easy-to-use API compatible with OpenAI's format.

vLLM Architecture

Component | Implementation
KV Cache | PagedAttention with block-level management
Batching | Continuous batching with preemption support
Attention | FlashAttention / FlashInfer backends
Quantization | AWQ, GPTQ, FP8, bitsandbytes
Parallelism | Tensor parallel, pipeline parallel
# vLLM offline inference
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
# vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 2

# Client usage (standard OpenAI API)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 50}'
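Because the server speaks the OpenAI API, the official openai Python client can talk to it directly. A sketch; the api_key value is a placeholder, since vLLM does not require one unless you configure it.

# Same request via the openai Python client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key unused by default
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="Hello",
    max_tokens=50,
)
print(completion.choices[0].text)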
06 — QUANTIZED KV CACHE

Reducing Memory Further

Even with PagedAttention, KV cache dominates memory for long contexts. Quantizing the KV cache to INT8 or FP8 can halve memory usage with minimal quality loss.

KV Cache Quantization Options

Format | Memory reduction | Quality impact | Support
FP16 (baseline) | 1× | None | All frameworks
FP8 (E4M3) | 2× | Minimal | TensorRT-LLM, vLLM
INT8 | 2× | Low | TensorRT-LLM, vLLM
INT4 | 4× | Moderate | Experimental

FP8 KV cache is recommended for H100/B100 deployments. See KIVI for INT2 KV cache research.

Practical Recommendation

Start with FP8 KV cache on Hopper+ GPUs—it's nearly lossless and halves memory. For extreme memory constraints, INT8 with per-head scaling works well. INT4 requires careful evaluation on your specific use case.
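In vLLM this is a single engine argument. A sketch assuming a recent vLLM version; check your version's docs for the exact accepted values (for example "fp8", "fp8_e4m3", or "fp8_e5m2").

# Enabling FP8 KV cache in vLLM (verify option name and values for your installed version)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_cache_dtype="fp8",      # halves KV cache memory vs. FP16
)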

07 — CHOOSING

Framework Selection

Decision Matrix

Priority | Recommendation
Maximum throughput on NVIDIA | TensorRT-LLM
Easy deployment, OpenAI compatibility | vLLM
Research / rapid iteration | TGI or vLLM
Multi-cloud / AMD GPUs | vLLM (ROCm support)
Edge deployment | llama.cpp

References