Apr 24, 2025

Open and Undivided Attention

Authored by

Dave Graham, Technical Marketing Manager | Dan Aloni, Senior Software Engineer | Matthew Rogers, Field CTO AI and Security | Alon Horev, Co-Founder and VP of Technology

As the world seeks to deploy AI models for large-scale inference, the growing popularity of inference and the evolution of the models themselves are pushing the limits of our computational infrastructure. AI computers are equipped with a finite amount of CPU and GPU memory, and that memory is becoming an increasingly limited resource as three dynamics take shape:

  • AI models are evolving to hold ever-larger contexts, or knowledge, within the model. To put this into perspective, LLaMA 1 was released in 2023 with support for a context window of 2,048 tokens. Fast forward to LLaMA 4, announced just last week by Meta, which supports a context window of up to 10 million tokens. In two years, the context supported by an AI model has grown by a factor of roughly 5,000. Ten million tokens consume far more memory than GPU memory can accommodate, so larger storage and caching methods are required.

  • Long prompts and extended inference sessions will often push a model’s context beyond what can fit in the main GPU memory. Examples of this include system instructions, rich input data, or very long histories.

  • Multi-user infrastructures continuously transfer extended session data across GPU servers as machines periodically evict chat and agentic sessions from their caches to free space for more recent user interactions. When Sam Altman talks about OpenAI’s image generation “melting” their GPUs, he is describing exactly this kind of cloud service: one that struggles to keep only the essential data in GPU cache and evicts it as quickly as possible to service other user prompts.

Context is King

System caches are far too constrained to support the volume of data that large AI environments must handle. When GPU and CPU memory are exhausted and an inference model cannot access all the tokens it needs, AI engines resort to recomputing inference sessions, which wastes valuable GPU resources. We have already entered the terabyte era of model context, and this trend appears to be scaling exponentially. What’s needed is effectively infinite, shared context data that can be made available in real time across all the machinery in the AI Factory.

To help the entire industry achieve significantly greater efficiency from their AI processor investments, VAST Data is excited to announce the availability of our open-source, global exabyte-scale key-value service called VAST Undivided Attention (VUA). VUA integrates with popular AI inference workloads and expands cache space to a third tier of persistent, shared NVMe memory, providing infinite context scalability. VUA assists organizations in reducing time to first token while also saving significantly on GPU and CPU memory.

This release offers the AI community - researchers, data scientists, ML engineers, customers, and industry partners - the tools to build and deploy advanced AI applications up to four times faster than conventional KV cache approaches.

Why KV Cache Optimization is Critical

Large Language Model (LLM) inference consists of two stages: the prefill stage, where an input prompt is processed, and the decode stage, where output tokens are generated one by one. Two critical performance metrics in this process are Time To First Token (TTFT), which represents the latency to produce the first output token, and Time Per Output Token (TPOT), which represents the average time to generate each subsequent token.

Modern LLM serving systems strive for low TTFT and low TPOT. A key innovation that enables this is the Key-Value (KV) cache, which stores the intermediate attention “key” and “value” vectors for each token as they are computed. This allows the model to reuse them during subsequent token generation steps rather than recomputing them from scratch. By avoiding redundant computation, KV caching accelerates generation, but it also introduces new challenges in memory usage and GPU resource utilization.
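To make the mechanism concrete, the toy decode loop below caches the key and value vectors of every processed token and reuses them at each new step, so only the newest token’s projections are computed. This is a minimal single-head sketch with made-up weights, not code from vLLM or VUA:

```python
# Toy single-head attention decode loop illustrating KV caching.
# Conceptual sketch only: shapes and the random "model" weights are hypothetical.
import numpy as np

d = 64                      # head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

k_cache, v_cache = [], []   # grows by one entry per processed token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t):
    """Process one new token embedding x_t (shape [d]) using the cache."""
    q = x_t @ W_q
    # Key/value for the new token are computed once and cached;
    # earlier tokens' k/v are reused instead of being recomputed.
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)           # [t, d]
    V = np.stack(v_cache)           # [t, d]
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                 # attention output for this step

for _ in range(8):                  # 8 decode steps over random embeddings
    out = decode_step(rng.standard_normal(d))
print("cached tokens:", len(k_cache), "output shape:", out.shape)
```

Without the cache, each decode step would have to recompute keys and values for every prior token; that redundant work is exactly what the cache removes.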

KV caching primarily boosts the decode phase but can also indirectly reduce TTFT in scenarios with long or streaming prompts. TTFT is typically dominated by the prefill stage, which is compute-heavy. Without a cache, applications that try to generate and stream output from a very long prompt in segments must reprocess earlier prompt tokens repeatedly, delaying the first output token.

With KV caching, each prompt token is processed and cached only once. As a result, the model can emit the first generated token immediately after the prompt is processed, with no repeated computation. Caching also enables optimizations like prompt segmentation and prefix reuse. For instance, systems such as vLLM detect when new requests share a prompt prefix with previous ones and skip reprocessing the shared portion by reusing cached keys and values.
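As a concrete illustration of prefix reuse, the hedged sketch below enables vLLM’s automatic prefix caching so that two prompts sharing a long system prefix reuse the cached KV blocks for that prefix. The model name and prompts are illustrative, and a GPU with vLLM installed is assumed:

```python
# Hedged sketch: vLLM automatic prefix caching for requests that share a prefix.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64, temperature=0.0)

# In practice the shared prefix must be long enough to fill whole KV blocks.
shared_prefix = "You are a support assistant for ACME. Policy document: ..."
prompts = [
    shared_prefix + "\nQuestion: How do I reset my password?",
    shared_prefix + "\nQuestion: How do I change my billing plan?",
]

# The second request hits the cached prefix, so its prefill (and TTFT) is shorter.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```

On the second prompt, only the question suffix goes through prefill; VUA extends this same effect beyond a single server’s GPU memory by making the cached prefix globally accessible.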

Extending the KV cache beyond GPU memory yields several advantages (a minimal tiering sketch follows the list):

  • Extreme Context Lengths: Supports very long context windows (ranging from hundreds of thousands to potentially billions of tokens) that exceed GPU memory by utilizing CPU RAM or disk space as overflow. This is crucial for applications requiring lengthy document summarization or code analysis.

  • Cost Efficiency: Leverages cheaper memory tiers. Large models and contexts can be served on fewer or smaller GPUs, with CPU and SSDs augmenting memory capacity. This can significantly lower the cost per query.

  • State Persistence: Enables persistent conversational state across turns or sessions. KV caches representing prior dialogue can be stored in off-GPU memory between queries, freeing GPU resources while retaining the ability to resume context quickly.

  • Scalability for Throughput: Facilitates serving more concurrent requests or larger batches by utilizing host memory or disk as a safety valve, yielding very high aggregate throughput in non-interactive settings.
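The sketch below illustrates the tiering idea behind these benefits with a deliberately simple two-level store: a small “hot” in-memory tier spills its least-recently-used KV entries to disk and reloads them on demand. It is a conceptual illustration only; the names, sizes, and pickle-on-disk “cold” tier are assumptions, not VUA’s design:

```python
# Illustrative two-tier KV store: a hot in-memory tier that spills LRU entries
# to disk and transparently reloads them. Not VUA's implementation.
import os, pickle
from collections import OrderedDict
import numpy as np

class TieredKVStore:
    def __init__(self, hot_capacity, spill_dir="kv_spill"):
        self.hot = OrderedDict()          # key -> (K, V) arrays, LRU order
        self.hot_capacity = hot_capacity
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, kv):
        self.hot[key] = kv
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_kv = self.hot.popitem(last=False)   # evict LRU entry
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_kv, f)                       # spill to cold tier

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        with open(self._path(key), "rb") as f:               # reload from disk
            kv = pickle.load(f)
        self.put(key, kv)                                    # promote back to hot
        return kv

store = TieredKVStore(hot_capacity=2)
for i in range(4):
    store.put(f"session-{i}", (np.zeros((16, 64)), np.zeros((16, 64))))
print(store.get("session-0")[0].shape)   # transparently reloaded from disk
```

A production tier would of course use RDMA-attached shared NVMe rather than pickle files, but the capacity-versus-locality trade-off is the same.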

Introducing VAST Undivided Attention (VUA): A Global, Intelligent Cache

We previously introduced VAST Undivided Attention (VUA), which is designed from the ground up to address the challenges of scaling the KV cache. As an intelligent caching system, it functions as a prefix-search-based global KV cache accessible throughout a GPU cluster.

Built on the VAST Data Platform’s unique Disaggregated Shared-Everything (DASE) architecture, VUA operates as an agent within GPU servers, creating a new data presentation layer for AI frameworks within modern multi-tenant environments. It intelligently manages KV cache data across tiered memory hierarchies, encompassing a vast, shared pool of low-latency NVMe flash accessible via RDMA (Remote Direct Memory Access). This design enables a near-infinite scalable memory space for context data.

Key architectural advantages include:

  • Global Shared Access: VUA provides each GPU server with shared access to the extended KV cache. This removes the need to route subsequent prompts back to the same GPU that handled the initial request, improving load balancing and reducing redundant cache entries across the cluster.

  • Intelligent Prefix Caching: VUA goes beyond basic caching by breaking attention keys down into chunks that are stored in a nested structure. This enables partial context matching via longest-prefix identification, significantly improving cache hit rates in workloads such as Retrieval-Augmented Generation (RAG), where the same base documents appear across many distinct prompts. A minimal sketch of this kind of lookup appears after this list.

  • Ultra-Low Latency with RDMA: By leveraging NVIDIA GPUDirect Storage (GDS) and RDMA-based protocols (e.g., NFSoRDMA), VUA facilitates direct, high-speed data transfers between the shared NVMe storage tier and GPU memory, bypassing CPU bottlenecks. This reduces cache access latency, which is crucial for maintaining high inference throughput. Traditional object stores and key-value databases, by contrast, are not optimized for this workload.

  • Optimized Metadata: VAST’s underlying metadata structure is built for extreme scale, efficiently handling billions of cache entries with fast lookups, which is essential for rapid prefix resolution.
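To show what chunked, longest-prefix matching looks like in principle, the sketch below hashes fixed-size token chunks with a chained (prefix-covering) hash and walks a request’s chunks until the first miss. The chunk size, hash scheme, and in-memory store are hypothetical stand-ins, not VUA’s on-disk format:

```python
# Conceptual sketch of longest-prefix matching over chunked token sequences.
import hashlib

CHUNK = 4  # tokens per chunk (real systems use larger blocks)

def chunk_keys(tokens):
    """Yield a chained hash per chunk: each key covers the full prefix so far."""
    h = hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        h.update(str(tokens[i:i + CHUNK]).encode("utf-8"))
        yield h.hexdigest()

store = {}  # prefix-hash -> cached KV blob (here just a placeholder string)

def insert(tokens):
    for key in chunk_keys(tokens):
        store.setdefault(key, f"kv-block:{key[:8]}")

def longest_cached_prefix(tokens):
    """Return how many tokens of this prompt are already covered by the cache."""
    matched = 0
    for n, key in enumerate(chunk_keys(tokens), start=1):
        if key not in store:
            break
        matched = n * CHUNK
    return matched

doc = list(range(40))                  # shared RAG document tokens
insert(doc + [101, 102, 103, 104])     # first prompt: document + question A
hit = longest_cached_prefix(doc + [201, 202, 203, 204])  # document + question B
print(f"{hit} of {len(doc) + 4} tokens served from cache")
```

Because each key hashes the entire prefix up to that chunk, a match on chunk n guarantees all earlier chunks match too, which is what makes a single-pass longest-prefix walk sufficient.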

Supporting popular frameworks like vLLM, LMCache, and NVIDIA’s Dynamo, VUA significantly reduces Time-To-First-Token (TTFT) for context-sharing requests and maximizes GPU utilization by keeping the compute units fed with the necessary KV data.

Because the frameworks above are making KV caching a commodity, VAST layers data management services on top of its performance, scale, and uptime capabilities. Lifecycle policies help you manage capacity by automatically deleting stale KV caches, while auditing shows what is being used and how, for example which KV caches are most popular. These insights are crucial for understanding and profiling your AI-serving environment, built on the strong foundation of DASE.
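As a generic illustration of what such a lifecycle policy might look like (this is not VUA’s policy engine), the sketch below sweeps a hypothetical cache directory and deletes entries that have sat idle longer than a TTL:

```python
# Generic TTL sweep over a directory of cached KV files. The directory layout
# and TTL are hypothetical; this only illustrates automatic stale-cache deletion.
import os
import time

CACHE_DIR = "/var/cache/kv"      # hypothetical cache location
TTL_SECONDS = 24 * 3600          # evict entries idle for more than a day

def sweep(cache_dir=CACHE_DIR, ttl=TTL_SECONDS):
    if not os.path.isdir(cache_dir):
        return 0
    now = time.time()
    reclaimed = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        # st_atime approximates "last used"; a real system would track hits explicitly.
        if now - os.stat(path).st_atime > ttl:
            reclaimed += os.path.getsize(path)
            os.remove(path)
    return reclaimed

if __name__ == "__main__":
    print(f"reclaimed {sweep() / 1e9:.2f} GB of stale KV cache")
```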

A View into VUA

As seen in the graph below, data from early VUA integration testing with vLLM demonstrates its impact on reducing Time-to-First-Token (TTFT).

[Figure: Time-to-First-Token for standalone vLLM vs. vLLM with VUA as token count increases]

As a baseline, we used the Qwen2.5-1.5B-Instruct model and tested it under two configurations: standalone vLLM and vLLM with VUA. We then issued a series of increasingly complex questions designed to increase the token demand. The assumption was that KV cache “hits” would decrease over time as the system cycled through responses, because the disparity between answers would increase rapidly during prefill processing. (A minimal sketch of how this kind of TTFT measurement can be scripted appears after the list below.)

As seen in the testing series above, TTFT increases proportionally with token count, at times exceeding 2 seconds per response, particularly for vLLM without VUA. However, when VUA is used with vLLM to intelligently prefill and reduce token reprocessing, the results shift dramatically.

When using vLLM with VUA, TTFT decreases by over 70%, and the improvement scales as the token count increases. Response times remain relatively constant, only exceeding 0.5 seconds near the end of the testing process. These optimizations highlight how VUA is particularly valuable for applications requiring:

  • Common Question Prompts

  • Multi-round dialogues (faster context switching)

  • Long document Q&A (improved throughput)

  • High-concurrency scenarios (reduction in preemptions)
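For readers who want to reproduce this kind of measurement, the hedged sketch below times the arrival of the first streamed token from a vLLM OpenAI-compatible endpoint (for example one started with `vllm serve Qwen/Qwen2.5-1.5B-Instruct`). The URL, model name, and prompts are assumptions for illustration; it measures client-observed TTFT rather than the exact methodology behind the chart above:

```python
# Hedged sketch: measure client-observed TTFT against a vLLM OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def time_to_first_token(prompt):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start   # latency to the first token
    return None

context = "Background document: " + "lorem ipsum " * 500
for question in ("Summarize the document.", "List its key claims."):
    ttft = time_to_first_token(context + question)
    print(f"{question} TTFT={ttft:.3f}s")
```

Running the same prompts against a cache-enabled and a cache-free configuration gives a rough before-and-after comparison similar in spirit to the chart above.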

Get Involved with the VUA Project

We invite the AI community to explore, use, and contribute to the VAST Undivided Attention project. Source code, documentation, and initial usage examples are available at https://github.com/vast-data/vua.

Join our community forums at community.vastdata.com and our Discord server to ask questions, share your findings, and collaborate with fellow users, industry experts, and the VAST engineering team. We are excited to see the innovative ways the community will leverage VUA to advance AI infrastructure.

Moving Toward Limitless AI Inference

Open-sourcing VAST Undivided Attention represents a significant step in addressing the infrastructure challenges associated with large-scale AI inference. By delivering an intelligent, scalable, and low-latency global KV cache solution, VUA enables organizations to deploy larger models, handle longer contexts, and maximize the utility of their AI systems. We are committed to supporting the open-source community and collaborating to build the future of efficient, scalable AI infrastructure. Come build the next generation of AI with us.

