
In the rapidly evolving world of large language models (LLMs), the initial excitement of simply getting a model to generate text has matured into a sophisticated engineering challenge. Today, running LLMs in production isn't just a `model.generate()` loop; it's a complex systems problem. Your choice of inference stack directly dictates critical metrics like tokens per second, tail latency, and ultimately your cost per million tokens on a given GPU fleet. Getting it right can save a fortune and significantly improve user experience.
For engineering teams serious about deploying LLMs at scale, four names consistently surface in deep technical discussions: vLLM, NVIDIA TensorRT-LLM, Hugging Face Text Generation Inference (TGI v3), and LMDeploy. Each brings a unique philosophy and set of optimizations to the table. Let’s peel back the layers and understand where each truly shines.
The Foundations: PagedAttention and Hardware Synergy
vLLM: The PagedAttention Pioneer
If you’ve been anywhere near LLM inference discussions, vLLM is likely a familiar name. It’s become a de facto open baseline, and for good reason. Its core innovation lies in PagedAttention, an ingenious attention mechanism that treats the KV (Key-Value) cache like paged virtual memory. Instead of allocating a large, contiguous KV region for each request—which often leads to wasted GPU memory—vLLM divides the KV cache into fixed-size blocks.
This approach maintains a block table mapping logical tokens to physical blocks and, crucially, allows blocks to be shared between sequences with overlapping prefixes. The result? KV memory waste from fragmentation drops to near zero, letting the scheduler pack many more concurrent sequences into the same VRAM. This innovation alone often yields 2–4 times higher throughput than older systems like FasterTransformer or Orca, especially for longer sequences.
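To make the mechanism concrete, here is a minimal sketch of the bookkeeping a paged KV cache performs: each sequence holds a block table mapping logical block indices to physical block ids, and forking a sequence shares its prefix blocks by bumping reference counts. The class and constant names are illustrative, not vLLM's actual internals.

```python
# Illustrative bookkeeping for a paged KV cache (hypothetical names, not vLLM internals).
BLOCK_SIZE = 16  # tokens stored per physical KV block

class BlockAllocator:
    """Hands out physical block ids and tracks reference counts so prefixes can be shared."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1  # another sequence now points at this block

class Sequence:
    """Maps logical token positions to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # last block is full (or no block yet)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """New sequence sharing this one's prefix blocks (e.g. beam or parallel sampling)."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in self.block_table:
            self.allocator.share(block)
        return child

# Two requests that share a prefix end up pointing at the same physical blocks.
alloc = BlockAllocator(num_blocks=1024)
a = Sequence(alloc)
for _ in range(20):
    a.append_token()
b = a.fork()  # shares a's two blocks instead of copying 20 tokens' worth of KV
```

Because only full, fixed-size blocks are ever allocated, there are no awkward gaps between per-request regions, which is exactly what lets the scheduler pack VRAM so tightly.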
vLLM also embraces continuous batching (also called in-flight batching), an iteration-level scheduling technique introduced by Orca that merges new requests into running GPU batches rather than waiting for rigid batch windows. While its P50 latency remains low at moderate concurrency, P99 can degrade under heavy load or tight KV memory. With its OpenAI-compatible HTTP API and seamless integration with orchestrators like Ray Serve, vLLM offers a robust, open-source foundation for many production deployments.
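For a feel of the developer experience, here is a minimal offline-inference sketch, assuming a recent vLLM release; the model name is illustrative and any checkpoint vLLM supports would do.

```python
# Minimal vLLM offline-inference sketch (assumes a recent vLLM release;
# the model name is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds the paged KV cache on load
params = SamplingParams(temperature=0.7, max_tokens=128)

# Prompts submitted together are scheduled with continuous batching.
outputs = llm.generate(
    [
        "Explain PagedAttention in one sentence.",
        "Why does continuous batching improve GPU utilization?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

In recent releases, the same checkpoint can instead be launched with `vllm serve <model>`, which exposes the OpenAI-compatible HTTP endpoints described above.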
TensorRT-LLM: NVIDIA’s Performance Apex
When it comes to squeezing every last drop of performance out of NVIDIA GPUs, TensorRT-LLM is the undisputed champion. This isn’t just another inference library; it’s a deeply optimized powerhouse, leveraging custom attention kernels, inflight batching, paged KV caching, and aggressive quantization down to FP4 and INT4. It’s also tightly coupled to NVIDIA’s latest hardware, exploiting features like FP8 tensor cores on Hopper and Blackwell architectures.
The numbers speak for themselves: on an H100 with FP8, TensorRT-LLM can achieve over 10,000 output tokens/second at peak throughput for 64 concurrent requests, with a time to first token (TTFT) around 100 ms. For latency-sensitive scenarios, it can drive TTFT below 10 ms in batch 1 configurations, albeit at reduced overall throughput. NVIDIA’s own benchmarks show H100 FP8 achieving up to 4.6 times higher max throughput and 4.4 times faster first token latency than an A100 for the same models.
TensorRT-LLM meticulously optimizes both the prefill and decode phases. Prefill benefits from high-throughput FP8 attention kernels and tensor parallelism, while decode leverages CUDA graphs, speculative decoding, quantized weights and KV cache, and kernel fusion. This holistic optimization delivers consistently high tokens/second across a wide range of input and output lengths. For multi-tenant and multi-model setups, NVIDIA typically pairs it with orchestrators like Ray or Triton, ensuring maximum hardware utilization.
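Speculative decoding deserves a closer look, since it is one of the main levers behind those decode-phase gains. The sketch below is a generic greedy draft-and-verify loop, not TensorRT-LLM's API; `draft_next` and `target_verify` are hypothetical callables standing in for the small draft model and the large target model.

```python
# Generic greedy draft-and-verify loop; the idea behind speculative decoding,
# not TensorRT-LLM's API. `draft_next` and `target_verify` are hypothetical callables.

def speculative_step(draft_next, target_verify, prefix, k=4):
    """One speculative-decoding round; returns the newly accepted tokens.

    draft_next(ctx)           -> the small draft model's greedy next token for ctx.
    target_verify(prefix, dr) -> the large target model's greedy token at each of the
                                 k + 1 positions covered by one batched forward pass.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The expensive target model scores all proposals in a single forward pass.
    target_tokens = target_verify(prefix, draft)  # length k + 1

    # 3. Accept draft tokens while they match the target's own choices; on the first
    #    mismatch, keep the target's token and stop. The output is identical to plain
    #    target-model decoding, but costs roughly one large forward pass per k tokens.
    accepted = []
    for proposed, verified in zip(draft, target_tokens):
        if proposed == verified:
            accepted.append(proposed)
        else:
            accepted.append(verified)
            break
    else:
        accepted.append(target_tokens[-1])  # all k accepted: keep the bonus token too
    return accepted
```

The payoff depends entirely on how often the draft model agrees with the target, which is why the technique shines for predictable continuations and adds little when outputs are high-entropy.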
Targeted Brilliance: Long Prompts, Quantization, and Orchestration
Hugging Face TGI v3: Mastering the Marathon Prompt
Hugging Face’s Text Generation Inference (TGI) has long been a go-to for deploying models from the Hugging Face ecosystem. Version 3, however, introduces a crucial specialization: handling extremely long prompts. Built as a Rust and Python-based serving stack, TGI provides robust HTTP and gRPC APIs, a continuous batching scheduler, and vital observability hooks. What makes TGI v3 stand out is its focus on long prompt processing through intelligent chunking and prefix caching.
Imagine a RAG (Retrieval Augmented Generation) pipeline where users repeatedly query a massive document. With TGI v3, the initial, long context is stored in a prefix cache. Subsequent turns only “pay” for the incremental tokens, avoiding costly re-computation of the entire prompt. The results are striking: for prompts exceeding 200,000 tokens, TGI v3 can serve a conversation reply in about 2 seconds, a 13x speedup over vLLM’s 27.5 seconds on the same workload. It also achieves about 3x more token capacity in the same GPU memory.
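A quick back-of-the-envelope model shows why this matters so much. Every number below is an assumption chosen for easy arithmetic, not a benchmark, but the shape of the result holds: once the prefix is cached, per-turn prefill cost collapses.

```python
# Illustrative cost model for prefix caching on a long, reused RAG context.
# All inputs are assumptions, not measured figures.

PROMPT_TOKENS = 200_000           # shared document context
TURN_TOKENS = 300                 # fresh tokens per conversational turn
PREFILL_TOKENS_PER_SEC = 20_000   # assumed prefill throughput of the serving stack

def prefill_seconds(tokens: int) -> float:
    return tokens / PREFILL_TOKENS_PER_SEC

# Without prefix caching, every turn re-processes the entire prompt.
no_cache = prefill_seconds(PROMPT_TOKENS + TURN_TOKENS)

# With prefix caching, the document's KV cache is reused and only new tokens are prefilled.
with_cache = prefill_seconds(TURN_TOKENS)

print(f"per-turn prefill without cache: {no_cache:.1f} s")    # ~10.0 s
print(f"per-turn prefill with cache:    {with_cache:.3f} s")  # ~0.015 s
```

The gap widens as the shared context grows, which is exactly the regime TGI v3's chunking and prefix cache are built for.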
While short chat workloads see performance similar to vLLM, long, cacheable contexts drastically improve both P50 and P99 latency. TGI’s architecture also makes it a powerful router and model server, capable of directing requests across many models and replicas, and even targeting different backends like TensorRT-LLM for high-priority tasks and smaller GPUs for others. This positions TGI as an ideal central serving tier in complex multi-tenant environments.
LMDeploy: Maximizing Reach with Quantization
Hailing from the InternLM ecosystem, LMDeploy is a comprehensive toolkit for compressing and serving LLMs, centered around its TurboMind engine. LMDeploy’s philosophy is to achieve high throughput and enable larger models on more modest hardware through aggressive optimization and quantization. It features a blocked KV cache (similar to paged KV), persistent batching, and a strong emphasis on quantization for both weights and the KV cache.
LMDeploy claims up to 1.8 times higher request throughput than vLLM, a testament to its optimized CUDA kernels, dynamic split and fuse operations, and tensor parallelism. Its support for KV cache quantization (typically int8 or int4) significantly reduces KV memory footprint and bandwidth requirements, while weight-only quantization paths like 4-bit AWQ further compress models. This combination makes LMDeploy particularly attractive for running larger open models, such as InternLM or Qwen, on mid-range GPUs without compromising too much on tokens per second.
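The memory arithmetic behind KV-cache quantization is simple: int8 halves the FP16 cache footprint per token and int4 halves it again, at the cost of a small quantization error. The round trip below is a per-tensor illustration in NumPy, not LMDeploy's actual kernels, which quantize at finer granularity.

```python
import numpy as np

def quantize_kv_int8(kv_fp16: np.ndarray):
    """Per-tensor symmetric int8 quantization of one KV-cache block (illustrative only)."""
    scale = float(np.abs(kv_fp16).max()) / 127.0 + 1e-8  # map the observed range onto int8
    q = np.clip(np.round(kv_fp16 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return (q.astype(np.float32) * scale).astype(np.float16)

# One KV block for a single layer/head: [block_size, head_dim] in FP16.
kv = np.random.randn(16, 128).astype(np.float16)
q, scale = quantize_kv_int8(kv)

print("fp16 bytes:", kv.nbytes, "-> int8 bytes:", q.nbytes)  # 4096 -> 2048 per block
print("max abs error:", float(np.abs(dequantize_kv_int8(q, scale) - kv).max()))
```

Halving KV memory not only fits more concurrent sequences but also cuts the memory bandwidth each decode step consumes, which is often the real bottleneck at high batch sizes.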
Architecturally, LMDeploy provides a proxy server that handles multi-model, multi-machine, and multi-GPU deployments, complete with routing logic based on request metadata. This positions it closer to TGI in its ability to manage diverse serving needs beyond a single model instance.
Choosing Your Champion: Aligning Stack to Workload
With such a rich landscape of inference stacks, the “best” choice isn’t universal; it’s entirely dependent on your specific production workload and infrastructure. Think of it as choosing the right tool for the job – you wouldn’t use a sledgehammer to drive a nail, nor a tiny hammer to demolish a wall.
If your primary goal is **maximum throughput and extremely low time to first token (TTFT) on NVIDIA GPUs**, especially the latest Hopper or Blackwell architectures, then **TensorRT-LLM** is your primary choice. Its FP8 precision, custom kernels, and speculative decoding are engineered to push the boundaries of tokens/second and keep TTFT under 100 ms at high concurrency, even hitting sub-10 ms for single-request scenarios. It’s for those who want to wring every bit of performance out of their cutting-edge hardware.
For scenarios dominated by **long prompts with significant reuse**, such as advanced RAG pipelines over extensive documents, **Hugging Face TGI v3** emerges as a strong default. Its intelligent prefix caching and chunking mechanism deliver unparalleled efficiency, providing up to 3x token capacity and a reported 13x lower latency than vLLM in long-prompt benchmarks. If your users frequently engage in multi-turn conversations with vast contexts, TGI v3 will dramatically improve their experience and your operational costs.
Should you require an **open, simple engine with robust baseline performance and an OpenAI-style API**, **vLLM** remains the standard. Its PagedAttention and continuous batching offer a significant leap over older stacks, integrating cleanly with cloud-native orchestrators like Ray and Kubernetes. It’s an excellent choice for general-purpose LLM serving, especially if you prioritize an open-source solution that’s widely adopted and understood by the community.
Finally, if you’re targeting **open models like InternLM or Qwen and need aggressive quantization to run on mid-range GPUs**, while also valuing multi-model serving capabilities, then **LMDeploy** is a compelling fit. Its blocked KV cache, persistent batching, and int8/int4 KV quantization can deliver up to 1.8x higher request throughput than vLLM on supported models, all within a serving framework that includes routing logic.
In practice, many sophisticated deployment teams don’t pick just one; they orchestrate a mix. TensorRT-LLM might handle high-volume, proprietary chat, while TGI v3 manages long-context analytics, and vLLM or LMDeploy cover experimental or open-model workloads. The real key is to deeply understand your traffic’s token distributions, align the capabilities of each stack to those patterns, and diligently measure your cost per million tokens on your actual hardware fleet. Only then can you truly optimize for both performance and budget.
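That last measurement is worth writing down explicitly: cost per million tokens falls straight out of your GPU hourly price and the sustained tokens/second you observe under production traffic. The inputs below are placeholders to show the calculation, not representative pricing or throughput.

```python
# Back-of-the-envelope cost per million output tokens. All inputs are placeholders;
# substitute your own GPU pricing and the sustained throughput you measure in production.

def cost_per_million_tokens(gpu_hour_usd: float, num_gpus: int,
                            sustained_tokens_per_sec: float) -> float:
    tokens_per_hour = sustained_tokens_per_sec * 3600
    fleet_cost_per_hour = gpu_hour_usd * num_gpus
    return fleet_cost_per_hour / tokens_per_hour * 1_000_000

# Example: a single $3.50/hour GPU sustaining 5,000 output tokens/s across its batch.
print(f"${cost_per_million_tokens(3.50, 1, 5_000):.3f} per million tokens")  # ~$0.194
```

Run the same arithmetic per stack, per model, and per traffic class, and the right mix for your fleet usually becomes obvious.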




