Thought Leadership
May 29, 2026

A New Memory Tier for AI at Scale

Authored by

Nicole Hemsoth Prickett, Head of Industry Relations

The conversation around large-scale AI tends to focus on scaling compute. In real-world deployments, though, that framing breaks the minute systems are expected to do what future workloads need most: serve real users continuously, hold context across interactions, and operate inside much longer workflows.

We’re rapidly moving from worrying about how fast a model can run to looking closely at how well systems can sustain useful interaction over time without running into memory limits.

At VAST Forward 2026, Anat Heilper, Director of AI Architecture at VAST Data, reminded the crowd that the bottleneck is no longer just compute; it’s how we manage memory. “We're seeing a shift where the context of the conversation is as important as the model itself,” she explained, adding that such a change forces a different kind of system design.

Production inference, unlike training, has to run continuously, serve many users at once, and carry forward the history of each interaction. That history builds up as context, and context drives memory demand. The focus of the system is far less on how many operations it can execute than on how efficiently it can store and retrieve the information generated during those interactions.

This is where the GPU memory wall becomes a real problem, according to Heilper. GPU memory is fast but limited, and once context no longer fits, it has to move into other layers of the system, with each step away from the GPU adding both latency and cost. At that point, she says, the limiting factor is whether the system can keep enough context available to make each token useful without slowing everything down.

Inference Carries Memory

Vikram Sharma Mailthody, Senior Research Scientist at NVIDIA, joined Heilper on stage to dig further into what happens when inference stops looking like a simple request-and-response loop and starts looking more like a continuous process.

Agentic systems don’t just answer prompts, Mailthody says, they run multi-step workflows, call tools, and build toward outcomes over time. That only works if context is carried forward instead of being reset after every response.

“Inference is no longer stateless. Agents must retain and reuse context across interactions, sessions, histories and even services,” he says. That, he adds, turns context into a long-lived part of the workload: it spans sessions and moves across services, all while remaining available as the system works through each step.

The result is steady growth in how much context the system has to manage. Longer sequences, tool outputs, and multiple users all add to it.

KV Cache Becomes the System

As context grows, KV cache becomes the main thing the system has to manage. That is quite the job: it holds the attention keys and values already computed for earlier tokens, which avoids recomputing that work, and it expands quickly with longer sequences and higher concurrency, which makes it one of the largest consumers of memory.
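To make that growth concrete, here is a back-of-the-envelope sketch of KV cache size for a standard transformer decoder. The accounting (one key vector and one value vector per layer per token) is the usual one; the model dimensions, sequence length, and concurrency below are hypothetical and were not given in the talk.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   concurrent_seqs=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector per layer,
    hence the leading factor of 2. bytes_per_value=2 assumes FP16/BF16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * concurrent_seqs


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128.
one_session = kv_cache_bytes(80, 8, 128, seq_len=128_000)
many_sessions = kv_cache_bytes(80, 8, 128, seq_len=128_000, concurrent_seqs=64)

print(f"one 128k-token session: {one_session / 2**30:.1f} GiB")
print(f"64 concurrent sessions: {many_sessions / 2**40:.2f} TiB")
```

Under these assumed numbers a single long session already takes tens of gibibytes, and size scales linearly with both sequence length and the number of concurrent sessions.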

At that point, KV cache becomes the working memory that keeps interactions coherent over time. As Mailthody explains, “in effect, KV cache is becoming a long lived memory, and performance now depends on how efficiently we manage the KV cache and reuse this context,” but “the farther the inference context goes away from the GPU, the more expensive and inefficient the inference becomes.”

From there, the system is no longer limited by compute speed but by how well it can keep that context close enough to be useful, which is where the partnership between VAST and NVIDIA comes in.

Orchestrating Inference With Dynamo

Once KV cache becomes central, the problem shifts to managing it across the system, which is where NVIDIA Dynamo comes in.

Dynamo acts as the orchestration layer for inference, deciding how requests are handled, where they run, and how context is reused. It does so by breaking inference into coordinated parts: an API layer that accepts requests, a router that sends them to the right place based on where relevant KV cache already exists, and a planner that adjusts capacity as demand changes.

Underneath all that, a KV cache manager tracks where context lives across memory tiers, and a transfer layer moves it between GPUs, memory, and storage.
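As a rough illustration of what tracking context across tiers involves, the sketch below keeps an index of which tier holds each KV block and demotes the least recently used block when a tier fills up. It is not Dynamo's KV cache manager or its API; the class, the tier names, and the LRU policy are all assumptions made for illustration.

```python
from collections import OrderedDict

TIERS = ["gpu", "host", "storage"]  # fastest to slowest; names are illustrative


class TieredKVIndex:
    """Tracks which tier holds each KV block; demotes least recently used blocks."""

    def __init__(self, capacity):
        # capacity: max blocks per tier, e.g. {"gpu": 2, "host": 2}.
        # A tier with no entry is treated as unbounded.
        self.capacity = capacity
        self.blocks = {tier: OrderedDict() for tier in TIERS}

    def locate(self, block_id):
        for tier in TIERS:
            if block_id in self.blocks[tier]:
                return tier
        return None  # not cached anywhere: the context must be recomputed

    def put(self, block_id, tier="gpu"):
        self.blocks[tier][block_id] = True
        self.blocks[tier].move_to_end(block_id)
        self._demote_overflow(tier)

    def touch(self, block_id):
        """Promote a block to the GPU tier on reuse; return where it was found."""
        tier = self.locate(block_id)
        if tier is not None:
            del self.blocks[tier][block_id]
            self.put(block_id, "gpu")
        return tier

    def _demote_overflow(self, tier):
        limit = self.capacity.get(tier)
        idx = TIERS.index(tier)
        while limit is not None and len(self.blocks[tier]) > limit:
            victim, _ = self.blocks[tier].popitem(last=False)  # oldest entry
            if idx + 1 < len(TIERS):
                # In a real system this step is a data transfer to a slower tier.
                self.put(victim, TIERS[idx + 1])


index = TieredKVIndex({"gpu": 2, "host": 2})
for block in ["a", "b", "c", "d", "e"]:
    index.put(block)

print({block: index.locate(block) for block in "abcde"})
```

After five inserts with room for two blocks in each of the first two tiers, the oldest block has been pushed down to storage, and a later `touch` on it would bring it back to the GPU tier while demoting something else.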

The prefill stage, which is compute-heavy, and the decode stage, which is memory-heavy, can run separately and be optimized on their own, which further refines memory management. This lets the system decide where each kind of work performs best instead of forcing everything through one path, so routing decisions are based on context location, not just load balancing.
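Routing on context location rather than load alone can be sketched as follows. This is a toy scoring rule, not Dynamo's router: the function names, the worker dictionary layout, and the `load_penalty` weight are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request_tokens, workers, load_penalty=1.0):
    """Choose the worker with the best trade-off between cached prefix and load.

    workers: dict of name -> {"cached": [token lists], "active": int}
    Score = reusable prefix tokens minus load_penalty * active requests.
    The scoring rule is an illustrative assumption.
    """
    def score(name):
        w = workers[name]
        reuse = max((shared_prefix_len(request_tokens, c) for c in w["cached"]),
                    default=0)
        return reuse - load_penalty * w["active"]

    # Sorting the names first makes ties resolve deterministically.
    return max(sorted(workers), key=score)


workers = {
    "gpu-0": {"cached": [[1, 2, 3, 4, 5, 6]], "active": 3},
    "gpu-1": {"cached": [[9, 9, 9]], "active": 0},
}
request = [1, 2, 3, 4, 5, 6, 7, 8]
print(pick_worker(request, workers))
```

A pure load balancer would send this request to the idle `gpu-1`. Here the busier `gpu-0` wins because it already holds six tokens of the request's prefix, while a request with no cached prefix anywhere falls back to the least loaded worker.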

The result is that inference behaves more like a distributed system. Requests, memory, and compute are constantly being coordinated. That makes context reuse possible at scale, but it also highlights the next issue. Even with orchestration, performance still depends on how fast that context can be stored and retrieved.

Storage Becomes a Compute Multiplier

Once orchestration is working, the next limit is how fast context can be read and reused. This is where VAST’s expertise is on full display.

KV cache is not like typical data. It is read-heavy, accessed in large chunks, and reused often. VAST is designed for that pattern, turning what would be an I/O bottleneck into something that scales with network speed.

As Heilper explains, “We are not making the GPU faster… we’re making it available more often, and turning the storage into a compute force multiplier.” By serving KV cache fast enough to avoid recomputation, the system keeps GPUs doing useful work instead of rebuilding context.

The effect is easy to see. “Instead of 65 seconds of waiting for the GPUs to calculate it, we fetch it in three seconds. That’s a fundamental change,” she says. That kind of improvement translates into less GPU time wasted, faster responses, and more output for the same hardware.

At that point, storage is not just holding data. It directly increases how much work the system can get out of its compute.

As Heilper shared, that improvement depends on how often the system can reuse context instead of rebuilding it. In real workloads, a 40 to 60 percent cache hit rate is common, and that alone changes the output of the system.

With that level of reuse, overall throughput increases significantly, translating to roughly 60 to 130 percent more tokens per dollar. The gain comes from keeping GPUs focused on new work instead of repeating what has already been done.
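One way to see how a hit rate turns into throughput is a simple expected-time model using the figures quoted above: 65 seconds to recompute context, three seconds to fetch it, and a 40 to 60 percent hit rate. The talk does not say how the tokens-per-dollar range was derived, so the model below is an assumption: it treats context preparation as the dominant cost and applies the same two timings to every request.

```python
def expected_context_seconds(hit_rate, fetch_s=3.0, recompute_s=65.0):
    """Average time to make a request's context ready for the GPU."""
    return hit_rate * fetch_s + (1.0 - hit_rate) * recompute_s


def throughput_gain(hit_rate, fetch_s=3.0, recompute_s=65.0):
    """Relative gain over always recomputing (0.5 means 50 percent more)."""
    return recompute_s / expected_context_seconds(hit_rate, fetch_s, recompute_s) - 1.0


for hit_rate in (0.4, 0.6):
    gain = throughput_gain(hit_rate)
    print(f"hit rate {hit_rate:.0%}: {gain:.0%} more work per GPU-second")
```

Under these assumptions the gain comes out at about 62 percent for a 40 percent hit rate and about 134 percent for a 60 percent hit rate, which is in line with the quoted range, though that agreement does not confirm this is how the figure was calculated.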

Context Becomes Enterprise Data

Once KV cache leaves the GPU, it stops being just a performance detail and starts looking like real data. It includes prompts, user inputs, and intermediate results tied to how the model is working through a task. In production systems, that makes it sensitive.

As Anat Heilper explains, “Once the KV cache leaves the GPU, it contains sensitive user data and is vulnerable to manipulation or reverse engineering by attackers.” This means context can no longer be treated as temporary. It has to be secured and managed like any other important data.

That brings in standard enterprise requirements. Data needs encryption, isolation between users, and clear access controls. Retention matters because context can persist beyond a single session. Multiple users and services may be sharing the same infrastructure, which raises the need for consistent governance.
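Isolation between users can start at the cache key itself. The sketch below derives each KV block's lookup key with an HMAC keyed by a per-tenant secret, using only Python's standard `hmac` and `hashlib` modules. The function name and the secrets are hypothetical, and this is not a description of how VAST or NVIDIA implement isolation; it covers lookups only, not encryption of the cached data, access control, or retention.

```python
import hashlib
import hmac


def cache_key(tenant_secret: bytes, token_ids) -> str:
    """Derive a lookup key for a KV block from a tenant secret and a token prefix.

    Keying the hash with a per-tenant secret means identical prompts from
    different tenants map to different keys, so one tenant cannot probe for
    or reuse another tenant's cached context.
    """
    payload = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hmac.new(tenant_secret, payload, hashlib.sha256).hexdigest()


prefix = [101, 2023, 2003, 1037]
key_a = cache_key(b"tenant-a-secret", prefix)
key_b = cache_key(b"tenant-b-secret", prefix)
print(key_a != key_b)
```

The trade-off in this design is that it gives up cache sharing across tenants, even for identical prompts, in exchange for isolation.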

This is where VAST plays a broader role beyond what we’ve already covered here. It applies enterprise data services directly to KV cache, not just for performance but for control. As inference moves into regulated environments, this becomes part of the core system design.

A New Memory Tier for AI at Scale

Even with orchestration and fast storage, KV cache keeps growing, and pushing more of it into traditional storage adds latency and cost. This is where NVIDIA introduces CMX, a new memory tier built specifically for inference.

CMX sits between GPU memory and storage and is shared across GPUs in a pod. Instead of duplicating context, it allows reuse, keeping more working memory close enough to the GPU to maintain performance while scaling capacity beyond what GPU memory can handle.

This layer is enabled by BlueField-4 DPUs and high-speed networking. BlueField-4 brings compute and data services closer to inference, allowing VAST to run directly on the DPU. That removes extra server layers and enables faster movement of KV cache between GPUs and storage.

At scale, this matters. Systems supporting thousands of users can require hundreds of terabytes to hold active context and multiple petabytes to retain it across sessions. With this architecture, that capacity can be added without slowing access, allowing long-running sessions and persistent agents to work reliably.

The result is a system organized around memory. Compute, storage, and networking are all aligned to keep context available and reusable, which becomes the main factor in how much work the system can deliver.
