Thought Leadership
Jan 26, 2026

Really Big Data: How Context Memory Is Reshaping AI Infrastructure

Authored by

Derrick Harris, Technology Storyteller

For at least the past decade, we’ve taken data storage and computer memory capacity for granted. It’s easy to mock the infamous quote often attributed to Bill Gates — that 640KB of memory ought to be enough for anybody — when $2,000 will get you a new iPhone with nearly 20,000 times as much RAM (12GB) and 2TB of SSD storage. That is, after all, about 100,000 times what passed as a large internal hard drive when those began shipping with PCs in the mid-1980s.

In fact, easy access to storage and memory, paired with ubiquitous high-speed internet, has arguably led to operational laziness. Popular data systems like Apache Kafka commonly store three copies of every record to ensure high availability. And even when optimization is easily achievable, application developers often make minimal effort to improve memory efficiency. Just buy more storage. Just switch to higher-memory cloud instances.

But those days are coming to an end. AI inference workloads are creating an entirely new data type that’s set to skyrocket in terms of volume: the context memory, or the working memory, of our interactions with AI models. 

It’s a big enough deal that, at CES last week, NVIDIA CEO Jensen Huang unveiled a forthcoming AI supercomputer that can use up to 18 petabytes of solid-state-drive capacity — or up to 16TB for each of its 1,152 GPUs — in a new tier of context memory storage designed to help manage all the inference context we’ll be generating. For AI factories scaling into the tens or hundreds of thousands of GPUs, that number could balloon into exabytes of capacity for inference alone. And this comes in the face of an industry-wide shortage of both high-speed memory and SSD capacity (the building blocks of context-memory storage), because manufacturers can’t keep up with demand.

Context memory lives in the KVcache

For LLMs and other transformer-based AI models, context memory lives in the key-value cache, or KVcache. Here’s a high-level explanation of how it works:

  • When a human or an agent calls a model, the model must first compute the initial context (such as the current prompt, document upload, retrieved content, or conversation history) before it can generate the first token (in the case of an LLM, a token usually equates to about one word). This is called the prefill stage; it’s compute-intensive and it’s why you might notice a lag before the model starts generating a response. 

  • During the decode stage, the model generates each subsequent token of the response by attending to the keys and values of all previous tokens, then appends the new token’s key and value to the KVcache. This step is memory bandwidth-intensive, consisting of continuous, but relatively small, interactions between GPU compute and memory — where all those keys and values reside.

  • Without the KVcache, the model would have to re-run the compute-heavy prefill stage for every new token. Because the context grows with every pass, each additional token would take more time to compute and consume more resources. (See the sizing sketch after this list for a rough sense of scale.)
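
To put numbers on that, here’s a minimal back-of-the-envelope sketch of how KVcache size scales with context length. It assumes the published attention configuration of Llama 3.1 405B (126 transformer layers, 8 key-value heads via grouped-query attention, 128-dimensional heads, FP16 values), the same model used in the capacity example later in this post; the helper function is purely illustrative, not a measured benchmark.

```python
# Back-of-the-envelope KVcache sizing (illustrative, not a measured benchmark).
# Defaults reflect Llama 3.1 405B's published attention config: 126 layers,
# 8 key-value heads (grouped-query attention), 128-dim heads, FP16 (2 bytes).

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 126,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """One key vector and one value vector per token, per layer, per KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

if __name__ == "__main__":
    for tokens in (8_000, 64_000, 128_000):
        print(f"{tokens:>7} tokens -> ~{kv_cache_bytes(tokens) / 1e9:.0f} GB of KVcache")
    # 128,000 tokens works out to roughly 66 GB (about 61 GiB), consistent
    # with the ~61GB-per-conversation figure cited below.
```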

When you hear about a model’s context window, this is what’s being described: how many tokens the model can technically handle in a single session. But even then, model performance can degrade as user sessions stretch beyond the context window, or simply beyond the available memory and GPU capacity. At that point, earlier context might be flushed to make room for new content, leading to a model’s “forgetfulness.”

Today, a large producer of context memory (and, therefore, consumer of KVcache capacity) might be a workload like code generation, where someone could upload an entire codebase into a context window as part of their prompt. Video models can also generate a large amount of context memory, as each frame is composed of multiple (and sometimes hundreds of) tokens.

Looking to the near future, agentic workflows will also be large producers of context memory. This is especially true of multi-agent systems, where the orchestration agent has to maintain context across any number of other agents. Looking a little further out, multimodal models for applications like robotics could also generate huge amounts of context memory. 

From 640 kilobytes of computer memory to exabytes of context memory

Let’s get back to the earlier point about how greatly memory and storage capacity have increased over the past 40ish years. While it’s fun to marvel at the specs of a modern smartphone, here’s a statistic that really matters today: the key-value cache (KVcache or KV$) of a large language model with a 128,000-token context window, like Llama 3.1 405B, will generate about 61GB of data per conversation.

Now, 61GB might not sound like a lot, but it adds up. Here’s a hypothetical situation our engineering team recently laid out, using the still relatively standard 128,000-token context window as a starting point:

How Much Capacity Do You Need?

The following is an example formula that can guide customers in sizing for KV$:

Total Capacity = Number of Users × Average KV$ per User × Retention Multiplier

The retention multiplier is the number of past conversations that are stored for a user. Let’s use our Llama 3.1 405B example, and plan for 100,000 concurrent users or agents:

  • Average KV$ per Prompt: Large context windows (e.g., 128,000 tokens) are becoming standard for tasks like document analysis and long-running conversations, so the cache size per user balloons. A 128K-token context is roughly 61GB; let’s assume a more typical average prompt of 64K tokens, or roughly 30GB.

  • Retention Multiplier: We’ll use 15 to provide a good user experience.

  • Calculation: 100,000 users × 30GB × 15 = 45,000,000GB = 45PB
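
For anyone who wants to plug in their own numbers, here is the same sizing formula as a small Python helper; the function name and decimal unit conversion are illustrative shorthand, and the defaults simply restate the example above.

```python
# Illustrative capacity-planning helper for the formula above:
# Total Capacity = Number of Users x Average KV$ per User x Retention Multiplier

def kv_capacity_pb(users: int, avg_kv_gb_per_user: float, retention_multiplier: int) -> float:
    """Return total KV$ capacity in petabytes (decimal units: 1 PB = 1,000,000 GB)."""
    total_gb = users * avg_kv_gb_per_user * retention_multiplier
    return total_gb / 1_000_000

# The worked example: 100,000 concurrent users, ~30GB per 64K-token prompt,
# 15 retained conversations per user.
print(kv_capacity_pb(100_000, 30, 15))  # -> 45.0 (PB)
```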

Both architecturally and cost-wise, maintaining petabytes or exabytes of KVcache solely in GPU memory or even system CPU memory is out of the question.

And while persistently storing that much context memory — or more — might seem a little crazy today (except to a handful of large model providers), it might seem much more reasonable tomorrow. Nobody wants their AI model to forget the last five questions they asked, and everyone wants to be able to resume their last conversation. Even if use cases somehow remained static, user growth, larger models, longer context windows, and the desire to store more context for longer periods would still cause context-memory storage to balloon in the years to come.

Storing all the world’s context on SSDs

The incredible amount of context memory about to be generated and stored is why Jensen Huang stood in front of that large supercomputer at CES and, with the world watching, spoke at length about KVcache and the lengths to which NVIDIA went to make sure its next-generation systems can manage it at production scale.

SSDs, obviously, cost far less per terabyte (or petabyte) than GPU memory. And although pulling a KVcache from external storage isn’t as fast as serving it from GPU memory for current or “hot” sessions, properly architected storage systems, such as the newly announced context memory storage architecture, can still feed context to GPUs at near real-time speed. In fact, because it mitigates cold starts, this setup will likely improve performance for workloads that require accessing past conversations or context.
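
To make the tiering idea concrete, here is a toy sketch of what a two-tier KVcache manager might look like: hot sessions stay in GPU memory, colder sessions spill to NVMe, and a reload from flash replaces a full prefill recomputation. The class and its methods are hypothetical and heavily simplified; this is not NVIDIA’s design or any particular vendor’s implementation.

```python
import os
import pickle  # stand-in serialization; real systems use purpose-built formats


class TieredKVCache:
    """Toy two-tier KVcache: hot sessions in (GPU) memory, cold sessions spilled to SSD."""

    def __init__(self, spill_dir: str, max_hot_sessions: int = 4):
        self.hot = {}                       # session_id -> KV tensors ("GPU memory" tier)
        self.spill_dir = spill_dir          # NVMe-backed directory for evicted sessions
        self.max_hot_sessions = max_hot_sessions
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, session_id: str, kv_state) -> None:
        # Evict the oldest hot session to SSD when the hot tier is full.
        if session_id not in self.hot and len(self.hot) >= self.max_hot_sessions:
            self._evict_oldest()
        self.hot[session_id] = kv_state

    def get(self, session_id: str):
        # Hot hit: serve from memory. Cold hit: reload from SSD, which is far
        # cheaper than re-running prefill over the entire context.
        if session_id in self.hot:
            return self.hot[session_id]
        path = os.path.join(self.spill_dir, f"{session_id}.kv")
        if os.path.exists(path):
            with open(path, "rb") as f:
                kv_state = pickle.load(f)
            self.put(session_id, kv_state)
            return kv_state
        return None  # true miss: the caller must run prefill from scratch

    def _evict_oldest(self) -> None:
        oldest_id, kv_state = next(iter(self.hot.items()))
        with open(os.path.join(self.spill_dir, f"{oldest_id}.kv"), "wb") as f:
            pickle.dump(kv_state, f)
        del self.hot[oldest_id]
```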

Looking forward, tiered KVcache storage points to a future where we can do even more by blowing past some of AI’s architectural hurdles:

  • Multi-agent workflows and robotics models get high-speed access to stateful context, allowing them to perform complex, multi-turn tasks, and learn from historical interactions, without constant prefill computation.

  • State-space models could offset their lack of attention by calling on externally stored context, thus opening their superior efficiency up to new classes of applications.

  • Smaller models could use historical context to overcome their inherently lower accuracy, the natural tradeoff of having fewer parameters.

The more context we can offload to SSDs (or other flash) with low-latency GPU connections, the more creative we can get around compute utilization, model and application architecture, and data center design.

Supply Chain, Meet Demand … Meet Opportunity

Of course, all of this AI investment does come at a cost: insatiable demand for DRAM and NAND flash capacity. For the first time in a long time, “just buy more storage” doesn’t look like a viable strategy. Costs are up. Some manufacturers are scrapping consumer product lines to focus on data center production. Hyperscale buyers are snatching up capacity as soon as it’s available.

The opportunity comes from figuring out how to squeeze more capacity out of existing gear. Step up data reduction. Share storage across systems, applications, and users. Find better, modern alternatives to legacy systems that waste capacity. Clever engineering today will pay dividends down the road, even after DRAM and NAND capacity are readily available again.

It’s also a good time to start thinking seriously about how to secure context memory. Consider the amount of sensitive personal data and intellectual property that might be contained within, to use our previous example, 45PB of AI interactions and model “thought processes” for 100,000 users. Anyone storing that much data unencrypted and/or weakly isolated, even ephemerally, is asking for leakage.
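
As a minimal sketch of what “not storing it unencrypted” could look like, the snippet below encrypts each spilled KVcache block with a per-user key before it touches shared storage. It uses the widely available cryptography library’s Fernet primitive; the key_for helper and the in-memory key dictionary are placeholders for a real key-management service, which is where most of the actual work lives.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Placeholder key store: a real deployment would use a KMS with rotation,
# auditing, and per-tenant isolation rather than an in-memory dict.
_per_user_keys: dict[str, bytes] = {}

def key_for(user_id: str) -> bytes:
    """Hypothetical helper returning (or lazily creating) a per-user key."""
    if user_id not in _per_user_keys:
        _per_user_keys[user_id] = Fernet.generate_key()
    return _per_user_keys[user_id]

def encrypt_kv_block(user_id: str, kv_bytes: bytes) -> bytes:
    """Encrypt a serialized KVcache block before it is written to shared storage."""
    return Fernet(key_for(user_id)).encrypt(kv_bytes)

def decrypt_kv_block(user_id: str, blob: bytes) -> bytes:
    """Decrypt a block when a user's session (and its context) is resumed."""
    return Fernet(key_for(user_id)).decrypt(blob)
```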

When all the proposed AI data centers ultimately fill up, they should be operating as optimally as possible. That means maximum energy efficiency and minimal wasted space, so operators have the flexibility to adapt to changing compute requirements and use cases as needed. Adding storage capacity to handle KVcache and context memory will be a critical piece of this puzzle, so long as we remember that storage infrastructure isn’t just an endlessly scalable supply of dumb bits.

In the AI era, storage, like GPUs, can’t be measured solely in terms of performance or capacity. What will matter far more is what we can do with all that capacity. Just as those measly 640 kilobytes of PC memory in 1981 helped drive the digital revolution, the exabytes of context memory about to come online will help drive the AI revolution. The more we optimize it, the bigger the results we’ll see.

Experience VAST’s industry-leading approach to AI and data infrastructure at VAST Forward, our inaugural user conference, February 24–26, 2026 in Salt Lake City, Utah. Engage with VAST leadership, customers, and partners through deep technical sessions, hands-on labs, and certification programs. Register here to join.
