Jul 1, 2025

Accelerating Inference

Authors

Dan Aloni, Senior Software Engineer | Matthew Rogers, Field CTO AI and Security | Alon Horev, Co-Founder and CTO | Subramanian Kartik, PhD, Global Systems Engineering Lead | Dave Graham, Technical Marketing Manager

The Inference Evolution

Over the past year, we’ve observed substantial growth in inference workloads and rapid innovation in inference frameworks. As GPUs and models evolve, they can take on more complex tasks. One key enabler of this evolution is the growth in context window sizes.

A larger context window lets applications pack more information into prompts or sustain extended conversations without summarizing or discarding history. More importantly, reasoning models can achieve better results by iteratively gathering information from external sources on their way to an answer.

The landscape of inference has transformed dramatically to support this growth, evolving from simple model serving to schedulers routing requests and tracking sessions across many environments.

A significant amount of innovation occurred in how inference is scheduled and executed, with techniques like Continuous Batching, Chunked Prefill, and Disaggregated Prefill and Decode emerging to maximize hardware utilization. One key opportunity for optimization is avoiding redundant computation altogether; this is where Key-Value (KV) Caching plays a crucial role.

What is KV Cache?

Understanding the KV cache requires understanding how the two phases of inference, prefill and decode, function:

Prefill phase:

When you submit a prompt, the model computes attention states for every token. This is a computationally intensive process that enables the model to determine which tokens should receive the most attention at each step of generating a response. The prefill stage concludes with this attention data stored in GPU memory as the KV cache (each token contributes a key state and a value state).

Decode phase:

The decode phase commences once prefill is complete and the KV cache is populated with the prompt’s context. During this phase, the model generates output one token at a time, attending over the cached context. The cache grows linearly as the key and value states of each newly generated token are appended to it.
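To make the two phases concrete, here is a minimal, framework-free sketch of single-head attention with a KV cache (purely illustrative Python; real engines like vLLM manage paged, multi-head, multi-layer caches with far more sophistication):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64                                        # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # q: (1, d); K, V: (t, d) -> attention over every cached token
    scores = q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

# Prefill: process the whole prompt at once and populate the KV cache.
prompt = rng.standard_normal((1000, d))       # embeddings of 1,000 prompt tokens
K_cache = prompt @ Wk                         # key state for every prompt token
V_cache = prompt @ Wv                         # value state for every prompt token

# Decode: one token at a time, reusing and appending to the cache.
x = rng.standard_normal((1, d))               # embedding of the latest token
for _ in range(10):
    q = x @ Wq
    out = attend(q, K_cache, V_cache)         # reuse the cached keys/values
    K_cache = np.vstack([K_cache, x @ Wk])    # cache grows by one row per token
    V_cache = np.vstack([V_cache, x @ Wv])
    x = out                                   # stand-in for the next token's embedding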

Can KV Cache be optimized?

Reusing a cached state is crucial, especially with long-context models and use cases. An example can be seen in the pricing of OpenAI’s o3-pro model, where costs range from $20 per million input tokens to $80 per million output tokens. Applications like RAG or multi-turn chat repeatedly send large contexts (documents, system prompts, chat history) as input; a persistent KV cache avoids re-processing them and eliminates the need to pay for computing the same input tokens on every turn. The cost is incurred only once to generate the initial cache, which can then be retrieved from the VAST AI OS for subsequent requests, resulting in substantial savings.
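As a back-of-envelope illustration of why reuse matters, the sketch below counts input tokens across a multi-turn chat with and without a persistent KV cache; the context size, turn count, and the use of the example input price quoted above are all illustrative assumptions:

# Illustrative only: token counts and prices are assumptions, not measurements.
context_tokens = 100_000          # shared documents, system prompt, history (assumed)
turns = 20                        # user turns in the session (assumed)
price_per_m_input = 20.0          # $/1M input tokens, per the example price above

tokens_without_reuse = context_tokens * turns    # context re-processed every turn
tokens_with_reuse = context_tokens               # prefilled once, reloaded afterwards

saved = tokens_without_reuse - tokens_with_reuse
print(f"input tokens avoided: {saved:,}")                                 # 1,900,000
print(f"prefill cost avoided: ${saved / 1e6 * price_per_m_input:.2f}")    # $38.00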

Time to First Token (TTFT) is primarily influenced by the compute-intensive prefill phase. By completely bypassing this computation and loading a pre-computed KV cache, the model can begin generating a response almost instantaneously. As a result, this improves the user experience from a noticeable initial lag to one that feels immediate and highly responsive.

How Does KV Cache… Get Cached?

The KV cache for an active session or conversation sits in GPU memory in anticipation of follow-up questions from a user, an agent, or a reasoning model iterating toward an objective.

Once a session has been inactive for a while, or when other prompts need GPU memory for their own KV cache, its cache can be offloaded to CPU memory, local storage, or remote storage. The inference engine or framework must decide when to offload and when to reload each KV cache.
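How that decision is made is a policy question. Below is a purely illustrative sketch of one simple approach, an LRU-style demotion of idle sessions, and not the actual policy of vLLM, LMCache, or Dynamo:

from collections import OrderedDict

class KVTierManager:
    """Toy LRU policy for demoting idle KV caches to a slower tier (illustrative)."""

    def __init__(self, gpu_capacity_sessions: int):
        self.gpu = OrderedDict()    # session_id -> KV blocks resident in GPU memory
        self.offloaded = {}         # session_id -> KV blocks parked in CPU/storage
        self.capacity = gpu_capacity_sessions

    def touch(self, session_id, kv_blocks=None):
        """Called on each request: reload an offloaded cache or admit a new one."""
        if session_id in self.gpu:
            self.gpu.move_to_end(session_id)                         # most recently used
        elif session_id in self.offloaded:
            self._admit(session_id, self.offloaded.pop(session_id))  # reload from slower tier
        else:
            self._admit(session_id, kv_blocks)                       # brand-new session
        return self.gpu[session_id]

    def _admit(self, session_id, kv_blocks):
        while len(self.gpu) >= self.capacity:
            victim, blocks = self.gpu.popitem(last=False)  # least recently used
            self.offloaded[victim] = blocks                # demote to CPU/storage
        self.gpu[session_id] = kv_blocks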

There are several great solutions out there that are worth mentioning:

1. vLLM is a popular inference engine that efficiently manages KV cache in GPU memory. It also supports extensions via plugins for offloading the KV cache to other locations.

2. LMCache works closely with vLLM and enables offloading the KV cache:

a. CPU memory is the first tier the KV cache is destaged to, since it is typically larger than GPU memory, and storing/loading to CPU memory is fast because it stays local to the machine.

b. The next tier is remote storage, which provides effectively unbounded capacity.

3. NVIDIA Dynamo is a low-latency inference serving framework for large-scale distributed environments. It is compatible with popular inference engines such as SGLang, TensorRT-LLM, and vLLM.

a. It enables transferring KV cache across GPUs, CPUs and storage using an accelerated data movement library called NVIDIA NIXL (NVIDIA Inference Xfer Library).

b. The distributed architecture behind Dynamo naturally supports disaggregated prefill and decode. This is another strategy for improving scheduling across accelerated compute resources to boost inference throughput and minimize latency: one set of GPUs is assigned to run prefill, and NIXL moves the resulting KV cache over RDMA to a different set of GPUs that perform decode, as seen in the diagram below.

[Diagram: disaggregated prefill and decode across the control and execution tiers]

Does it work?

To understand the implications of these solutions, we chose to test the impact of the KV cache in an environment that our customers use daily. We executed a multi-stage process to quantify the performance benefits of offloading LLM KV caches to the VAST AI OS. The methodology was designed to assess the entire stack, from the foundational I/O capabilities of the infrastructure to the application-level impact on inference latency within an LLM serving framework.

For validation purposes, we tested two layers independently:

1. I/O layer - using NVIDIA Magnum IO™ GPUDirect® Storage (GDS) to read/write remote storage while bypassing CPU memory, reducing CPU utilization and host-RAM bandwidth consumption.

2. Application layer - LMCache reading and writing KV caches to and from the VAST AI OS.

1. I/O and NVIDIA GPUDirect Storage (GDS)

The initial phase focused on establishing a verified performance baseline for the underlying hardware and data path to ensure the environment was functional and configured for high-throughput, low-latency storage access.

  • GDS Path Validation: We began by integrating and validating the NVIDIA NIXL GDS plugin with the VAST AI OS, utilizing the nixlbench benchmark tool configuration highlighted below. This step was critical to confirm the correct functioning of the GDS data path and to measure the raw, unimpeded I/O performance between the NVIDIA H100 GPU memory and the VAST AI OS.

CUDA_VISIBLE_DEVICES=0 nixlbench \
  -backend=GDS \
  -mode=SG \
  -worker_type=nixl \
  -op_type=READ \
  -gds_filepath=/mnt/vast/1 \
  -num_threads=32 \
  -total_buffer_size=$((32*1024*1024*1024)) \
  -start_block_size=$((1024*1024)) \
  -max_block_size=$((1024*1024)) \
  -target_seg_type=VRAM \
  -initiator_seg_type=VRAM \
  -storage_enable_direct=1

  • Single-GPU Saturation Test: A primary objective was to demonstrate that the storage platform could deliver data at a rate sufficient to saturate a single GPU, with and without GDS. Our tests achieved GPU saturation, driving a single H100 GPU at 35 GB/s using GDS without saturating the VAST AI OS’s available throughput. This confirmed that storage would not be a bottleneck for the subsequent stages and that the GPU’s processing capacity, not the storage, would dictate performance.

2. Application-Level Integration and Comparative Testing

With a high-performance I/O baseline established, we transitioned from synthetic benchmarks to a real-world LLM inference environment.

Framework Integration: We utilized a build of the vLLM inference engine, which incorporated the LMCache kv_connector. This connector enables vLLM to persist and retrieve KV caches from an external namespace, in this case, a file-based repository on the VAST AI OS.
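For reference, the wiring looks roughly like the sketch below. It assumes a recent vLLM with the LMCache v1 connector; class and field names vary across versions, and the config path, model, and prompt are hypothetical placeholders:

# Minimal sketch: vLLM offline inference with the LMCache KV connector.
# Assumes a recent vLLM with LMCache installed; paths and names are illustrative.
import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache picks up its own settings (e.g., a storage backend pointing at a
# file system path on the VAST AI OS) from a config file named here.
os.environ["LMCACHE_CONFIG_FILE"] = "/etc/lmcache/vast.yaml"   # hypothetical path

llm = LLM(
    model="Qwen/Qwen3-32B",
    max_model_len=131072,
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",   # persist/retrieve KV via LMCache
        kv_role="kv_both",                   # this worker both saves and loads KV
    ),
)

prompt = "<a long RAG context or chat history goes here>"
out = llm.generate([prompt], SamplingParams(max_tokens=64))
# A repeat request sharing the same prefix is served from the persisted cache
# instead of recomputing the prefill on the GPU.
print(out[0].outputs[0].text)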

Prefill Time Analysis: The core of our analysis was a comparative measurement of prefill duration. We benchmarked inference requests for a range of input token sequence lengths under two distinct conditions:

1. Compute-Bound Prefill (Baseline): Generating the KV cache from scratch using GPU compute resources.

2. Cache-Read Prefill (Optimized): Bypassing the initial computation by loading a pre-computed KV cache directly from the VAST platform. This comparison directly measures the reduction in TTFT achieved by leveraging a persistent KV cache.
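To make the comparison concrete, here is a minimal client-side way to time TTFT against a vLLM OpenAI-compatible endpoint; the server URL, model name, and prompt are placeholders, and the sketch simply measures the time until the first streamed token arrives, once for a cold prefill and once after the cache has been populated:

# Sketch: measure time-to-first-token via streaming (endpoint and prompt are placeholders).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=16,
        stream=True,
    )
    for _ in stream:                       # first event arrives once prefill is done
        return time.perf_counter() - start

long_context = "lorem ipsum " * 30_000     # stand-in for a very long shared context

cold = time_to_first_token(long_context)   # compute-bound prefill (baseline)
warm = time_to_first_token(long_context)   # cache-read prefill (optimized)
print(f"cold TTFT: {cold:.2f}s, warm TTFT: {warm:.2f}s")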

TTFT for Qwen3-32B

To understand the real-world implications of this performance, we measured the TTFT for a prefill task on a currently popular LLM. The parameters were as follows:

  • Model: Qwen/Qwen3-32B
  • Context Size: 130,858 tokens (max: 128,000)
  • KV Cache Size: 32 GB (for a 130K token context)
  • Conventional Prefill Rate: 500 Tokens Per Second (TPS)
  • System Overhead: 1 second (for initialization and other processes)

KV Cache Loading Calculation:

Using vLLM v0.9.0-170 and LMCache v0.3.0-1, the pre-computed 32 GB KV cache can be loaded directly into the GPU.
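As a sanity check, the stated cache size lines up with Qwen3-32B's published configuration (64 layers, 8 KV heads of dimension 128, 16-bit precision; these model-shape values are our assumptions taken from the public model card, not part of the measurement), and it frames cache loading as a bandwidth problem rather than a compute problem:

# Back-of-envelope check of the KV cache size for the ~130K-token context.
# Model-shape values are assumptions taken from the public Qwen3-32B config.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_value = 2                              # 16-bit K/V entries
tokens = 130_858

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # keys + values
total_bytes = per_token * tokens
print(per_token)              # 262,144 bytes = 256 KiB of KV state per token
print(total_bytes / 2**30)    # ~32 GiB, in line with the 32 GB figure above

# With the cache precomputed, TTFT is roughly
#   overhead + cache_size / effective_read_bandwidth
# rather than
#   overhead + prompt_tokens / prefill_throughput,
# which is why a storage path that can feed the GPU at tens of GB/s matters.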

The results of our testing show a dramatic reduction in the time required for the initial prefill stage. By loading the KV cache directly, vLLM+LMCache reduces the time to first token (TTFT) from over 11 seconds to just 1.5 seconds at the maximum context window size (128,000 tokens).

Reading the KV cache from storage rather than regenerating it represents a 10x improvement in request_prefill_time for large-context scenarios!

This environment adheres to the NVIDIA DGX SuperPOD™ and NVIDIA Cloud Partner (NCP) reference architectures. Storage traffic traverses the north-south fabric, and the VAST AI OS serves it over high-speed 400 Gbps connections through NVIDIA BlueField-3 DPUs.

While the platform supports NFSoRDMA with GPUDirect Storage (GDS) for optimized data paths, these features are optional. Our architecture is engineered to drive full line-rate performance using a standard NFSoTCP implementation, providing maximum deployment flexibility without compromising throughput.

The Way Forward

The landscape of LLM infrastructure is rapidly evolving toward open, standardized interfaces that decouple storage and compute, a critical step for enabling high-performance AI factories. NVIDIA Dynamo and the underlying NIXL library are helping drive this transformation. Dynamo enables disaggregated serving, separating prefill and decode onto distinct GPUs. NIXL provides a standardized API layer for low-latency data transfer, abstracting the complexities of data movement between GPU memory and various external storage tiers. It offers a GPU-native library for managing and accessing KV cache movement between data caching tiers, including external storage, which is what we validated with nixlbench in our testing.

Projects like LMCache, vLLM Production-Stack, and the broader LLM-d initiative, which advocate for disaggregated LLM components, fit perfectly within this paradigm. Using the NIXL GDS plugin and an LMCache-enabled version of vLLM, our methodology directly engages with this next-generation stack. It demonstrates how a high-performance, disaggregated, shared-everything data platform can serve as the persistent tier for GPU memory extension, leveraging the open, standardized I/O paths being championed by NVIDIA and the wider community to deliver scalable, stateful, and cost-effective LLM inference.

The broader ecosystem of model-serving frameworks continues to converge, and we’re ensuring that at every step along the way the VAST AI OS is available as a foundation for development, iteration, testing, and production.

Partnerships Make Possibilities

We want to thank our partners at Scaleway, a leading AI cloud service provider in France, for their invaluable assistance with our recent KV cache testing initiatives. Their team provided a truly cutting-edge environment that was instrumental in helping us validate and push the boundaries of our research. It's a privilege to collaborate with partners who are as committed to high-performance infrastructure as we are.

The modern technology stack provided by Scaleway represents a powerful blueprint for AI infrastructure, demonstrating how best-of-breed components can be integrated for maximum performance. Built on the NVIDIA Spectrum™-X networking platform, the setup combines NVIDIA Spectrum-4 switches, NVIDIA BlueField-3 DPUs, and RoCE (RDMA over Converged Ethernet) to enable advanced, low-latency protocols such as NFSoRDMA and GPUDirect Storage, all seamlessly integrated with the VAST AI OS.

Join the Technical Discussion

The evolution of inference optimization is just beginning, and the challenges surrounding KV cache management, memory efficiency, and scheduling algorithms are transforming how we create AI infrastructure at scale.

If you're facing similar optimization challenges or want to delve into the technical aspects behind these benchmarks, we would love for you to join our engineering discussions in the Cosmos community, the LMCache Slack community, Dynamo Discord Server, and contribute to our open-source projects. It's a space where we and other practitioners share insights on everything from storage architecture to AI workload optimization - the type of technical deep-dives that don't always make it into blog posts.
