Production
Deploy efficient LLM inference at scale. KV cache strategies, PagedAttention, TensorRT-LLM, and vLLM—the systems that power real-world AI applications.
- Explain continuous batching and its benefits
- Describe KV cache management strategies
- Identify bottlenecks in inference serving systems
- Choose between inference frameworks (vLLM, TensorRT-LLM)
- Design a system meeting latency and throughput requirements
This chapter builds on attention and quantization concepts. Chapter 5: Attention | Chapter 6: Quantization
Part 4: Production Labs - KV cache, quantized inference
Why KV Cache Matters
In autoregressive generation, each new token attends to all previous tokens. Without caching, every step would rerun the forward pass over the entire sequence, recomputing the K and V projections for all prior tokens: O(n²) work per generated token.
The KV cache stores computed key and value vectors, so each generation step only computes K/V for the new token. This reduces complexity to O(n) per step, but creates a new bottleneck: memory.
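To make the cached decode step concrete, here is a toy single-head sketch in NumPy; it ignores batching, multi-head layout, positional encodings, and masking, and every name in it is made up for illustration.

```python
# One decode step with a KV cache: compute K/V only for the new token,
# then attend over the whole cache. Single head, no batching (toy sketch).
import numpy as np

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
k_cache, v_cache = [], []          # grows by one row per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    q = x_new @ Wq                              # query for the new token only
    k_cache.append(x_new @ Wk)                  # K/V computed for the new token only...
    v_cache.append(x_new @ Wv)                  # ...older rows are reused from the cache
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)               # O(n) attention over cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # attention output for the new token

for _ in range(10):                             # generate 10 tokens
    out = decode_step(np.random.randn(d))
```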
KV Cache Memory Formula
Per sequence: KV bytes = 2 (K and V) × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element. For Llama 2 70B (80 layers, 8 KV heads of dimension 128 under grouped-query attention) with a 4K context in FP16, that works out to roughly 1.3 GB per sequence; at batch size 32, over 40 GB just for KV cache. The vLLM paper makes the same argument for OPT-13B, where the cache for a single 2K-token request reaches about 1.6 GB.
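A quick sanity check of that arithmetic (a sketch; the layer and head counts below are assumed to match Llama 2 70B's published grouped-query attention configuration):

```python
# Back-of-the-envelope KV cache sizing.

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size, bytes_per_elem=2):
    # 2x for keys and values, one vector per layer per token
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_elem

# Assumed Llama 2 70B shape: 80 layers, 8 KV heads of dim 128 (GQA), FP16
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128,
                         seq_len=4096, batch_size=1)
print(f"per sequence: {per_seq / 1e9:.2f} GB")        # ~1.34 GB
print(f"batch of 32:  {32 * per_seq / 1e9:.1f} GB")   # ~43 GB
```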
At long contexts and realistic batch sizes, the KV cache can rival or exceed the model weights in memory usage: a 70B model serving a batch of 32K-context sequences can need well over 100 GB for KV storage alone. This is why KV cache optimization is critical for production deployment.
Virtual Memory for KV Cache
PagedAttention (introduced in vLLM) applies virtual memory concepts to KV cache management. Instead of pre-allocating contiguous memory per sequence, it allocates fixed-size blocks on demand.
Traditional allocation reserves contiguous memory for each sequence's maximum possible length, so space is wasted on output that is never generated, and the allocator fragments as sequences of different lengths come and go. PagedAttention eliminates nearly all of this waste.
PagedAttention Benefits
| Aspect | Traditional | PagedAttention |
|---|---|---|
| Memory utilization | ~20-40% | ~95%+ |
| Max batch size | Limited by worst-case | Adapts dynamically |
| Memory fragmentation | Significant | Near zero |
| Prefix sharing | Duplicated | Copy-on-write |
Source: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention"
PagedAttention also enables efficient beam search and parallel sampling by sharing KV cache blocks across sequences. Only blocks that diverge are copied (copy-on-write), so a prefix common to all beams is stored once rather than once per beam.
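Below is a toy sketch of how a block table with reference-counted, copy-on-write blocks can support this kind of sharing. The class and method names are invented for illustration, and the logic is far simpler than vLLM's actual block manager (it does not, for example, copy block contents on write).

```python
# Toy paged KV allocator: fixed-size blocks, allocated on demand, shared
# copy-on-write across sequences. Illustrative only.

BLOCK_SIZE = 16  # tokens per physical block

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.alloc())    # new block on demand
        else:
            last = self.block_table[-1]
            if self.pool.refcount[last] > 1:              # shared block: copy-on-write
                self.pool.release(last)
                self.block_table[-1] = self.pool.alloc()  # real code would also copy the data
        self.num_tokens += 1

    def fork(self):
        # Parallel sampling / beam search: the child shares all parent blocks.
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for b in child.block_table:
            self.pool.share(b)
        return child

pool = BlockPool(num_blocks=64)
prompt = Sequence(pool)
for _ in range(40):                           # 40-token prompt -> 3 blocks
    prompt.append_token()
beams = [prompt.fork() for _ in range(4)]     # 4 beams share the 3 prompt blocks
for beam in beams:
    beam.append_token()                       # each beam copy-on-writes only its last block
```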
Iteration-Level Scheduling
Traditional batching waits for all sequences to complete before starting new ones. Continuous batching (also called iteration-level scheduling) allows new requests to join mid-batch as others finish.
Request flow: incoming prompts → per-iteration batching → PagedAttention → streaming tokens.
Static Batching
- Wait for batch to fill
- All sequences same length (padded)
- Batch completes together
- GPU idle during padding
Continuous Batching
- Add requests immediately
- Variable sequence lengths
- Sequences exit independently
- Higher GPU utilization
Continuous batching can improve throughput by 10-20× compared to static batching for variable-length workloads.
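A minimal sketch of the scheduling loop behind iteration-level batching; the request format, the decode_step stand-in, and the MAX_BATCH limit are all hypothetical, and real schedulers such as Orca or vLLM additionally handle prefill vs. decode phases, preemption, and KV block availability.

```python
# Toy continuous batching loop: requests join the running batch as soon as
# earlier requests finish, instead of waiting for the whole batch to drain.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 8

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def done(self) -> bool:
        # Real serving also checks for EOS tokens.
        return len(self.generated) >= self.max_new_tokens

def decode_step(batch):
    # Stand-in for one forward pass producing one token per running sequence.
    for req in batch:
        req.generated.append("<tok>")

def serve(waiting: deque):
    running = []
    while waiting or running:
        # Admit new requests into free batch slots (iteration-level scheduling).
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        # Finished sequences exit immediately, freeing slots for the next iteration.
        running = [r for r in running if not r.done()]

queue = deque(Request(f"prompt {i}", max_new_tokens=(i % 5) + 1) for i in range(20))
serve(queue)
```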
NVIDIA's Optimized Inference
TensorRT-LLM is NVIDIA's library for optimized LLM inference. It compiles models into optimized TensorRT engines with fused kernels, in-flight batching, and hardware-specific optimizations.
Key Features
| Feature | Description |
|---|---|
| Fused Attention | FlashAttention-2 + fused QKV projection + RoPE |
| In-flight Batching | Continuous batching with paged KV cache |
| Quantization | FP8, INT8, INT4 (AWQ, GPTQ), W4A8 |
| Speculative Decoding | Draft model + verification for faster generation |
| Tensor Parallelism | Multi-GPU inference with NVLink optimization |
```python
# TensorRT-LLM basic usage
from tensorrt_llm import LLM, SamplingParams

# Load optimized model
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate with sampling params
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=100),
)

for output in outputs:
    print(output.outputs[0].text)
```
Best for: NVIDIA GPUs (especially H100/A100), maximum throughput, production deployments with stable models. Requires model compilation step but delivers highest performance on NVIDIA hardware.
High-Throughput Serving
vLLM is an open-source library focused on high-throughput serving. It introduced PagedAttention and provides an easy-to-use API compatible with OpenAI's format.
vLLM Architecture
| Component | Implementation |
|---|---|
| KV Cache | PagedAttention with block-level management |
| Batching | Continuous batching with preemption support |
| Attention | FlashAttention / FlashInfer backends |
| Quantization | AWQ, GPTQ, FP8, bitsandbytes |
| Parallelism | Tensor parallel, pipeline parallel |
```python
# vLLM offline inference
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
```
```bash
# vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 2

# Client usage (standard OpenAI API)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 50}'
```
Reducing Memory Further
Even with PagedAttention, KV cache dominates memory for long contexts. Quantizing the KV cache to INT8 or FP8 can halve memory usage with minimal quality loss.
KV Cache Quantization Options
| Format | Memory Reduction | Quality Impact | Support |
|---|---|---|---|
| FP16 (baseline) | 1× | None | All frameworks |
| FP8 E4M3 | 2× | Minimal | TensorRT-LLM, vLLM |
| INT8 | 2× | Low | TensorRT-LLM, vLLM |
| INT4 | 4× | Moderate | Experimental |
FP8 KV cache is recommended for H100/B100 deployments. See KIVI for INT2 KV cache research.
Start with FP8 KV cache on Hopper+ GPUs—it's nearly lossless and halves memory. For extreme memory constraints, INT8 with per-head scaling works well. INT4 requires careful evaluation on your specific use case.
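As a conceptual illustration of the per-head INT8 scheme mentioned above, here is a small NumPy sketch of symmetric quantization of a cached K tensor. It is not any framework's implementation (in practice this is a configuration option, e.g. vLLM's kv_cache_dtype setting), and the tensor shape is chosen arbitrarily.

```python
# Per-head symmetric INT8 quantization of a KV cache tensor (conceptual sketch).
import numpy as np

def quantize_per_head(k: np.ndarray):
    # k: [num_heads, seq_len, head_dim] in FP16/FP32
    scale = np.abs(k).max(axis=(1, 2), keepdims=True) / 127.0   # one scale per head
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale                                             # int8 cache + FP scales

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

k = np.random.randn(8, 4096, 128).astype(np.float16)   # 8 KV heads, 4K tokens
q, scale = quantize_per_head(k)
print(f"fp16 cache: {k.nbytes / 1e6:.1f} MB, int8 cache: {q.nbytes / 1e6:.1f} MB")
err = np.abs(dequantize(q, scale) - k.astype(np.float32)).mean()
print(f"mean abs error: {err:.4f}")
```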
Framework Selection
Decision Matrix
| Priority | Recommendation |
|---|---|
| Maximum throughput on NVIDIA | TensorRT-LLM |
| Easy deployment, OpenAI compatibility | vLLM |
| Research / rapid iteration | TGI or vLLM |
| Multi-cloud / AMD GPUs | vLLM (ROCm support) |
| Edge deployment | llama.cpp |
References
- PagedAttention Paper - Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023
- Orca: Continuous Batching - Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models", OSDI 2022
- TensorRT-LLM Documentation - NVIDIA's official docs
- vLLM Documentation - Official vLLM docs
- KV Cache Quantization Survey - Methods and benchmarks
- KIVI: INT2 KV Cache - Extreme KV cache compression
- Continuous Batching Explained - Anyscale blog