The Lie At The Center Of Enterprise RAG

Authored by

Nicole Hemsoth Prickett, Head of Industry Relations

Let’s imagine a manufacturing company for a moment. An employee asks about a product manual and the GPUs go chew on the necessary documents. An hour later, another employee asks a similar question, but instead of reusing what it just surfaced, the system goes and rebuilds the same context all over again.

Or imagine a developer who loads the same repositories every time they open a session. Support teams repeatedly query the same manuals and ticket histories. Financial analysts reload the same filings and compliance documents into long-context systems.

More broadly, almost all agentic workflows revisit the same operational data over and over as tasks move between users and systems. There are countless examples of this in enterprise settings, and the pattern is clear: the big secret of how businesses use AI is a constant loop of rebuilding the same context time and again.

Across the enterprise, huge amounts of compute are spent reconstructing context the system already understood yesterday. The real cost goes far beyond retrieval as we would have thought of it in the before-times; it’s the expense of context (re)construction.

Now, as enterprises push AI toward long-context inference, multi-document reasoning, and agentic workflows, prefill has become one of the most expensive parts of inference. The system spends massive GPU resources ingesting documents, processing token relationships, and building KV cache before useful reasoning even begins, only to have much of that disappear when the session ends.

So the enterprise pays to rediscover the same information again and again. And that lie referred to in the title? It’s that the infrastructure acts like every prompt is brand new even when the organization has already built the same contextual understanding thousands of times before.

It goes a bit deeper than this too. The industry treated retrieval as the hard problem, so it built vector databases, embedding systems, and semantic layers around finding things fast. Those are important tools, but retrieval wasn’t the expensive part. Rebuilding understanding was.

Most inference infrastructure still hasn’t caught up to this. Every request is still treated as isolated, even when the same documents, repositories, policies, tickets, and reports are being queried constantly across the organization.

The model retrieves the same information again and again, then rebuilds nearly the same context from scratch each time. GPUs repeat the same prefill work, regenerate similar attention patterns, and reconstruct the same KV cache because the system has no durable memory of what it already processed earlier that day. This becomes especially wasteful at enterprise scale, and as context windows grow larger, the inefficiency grows with them.

Prefill Is Eating The Economics Of Enterprise AI

Speaking of enterprise scale, as you can imagine, this becomes a much larger problem when prefill starts dominating the economics of inference.

In long-context systems, the model might only generate a few hundred output tokens while spending enormous amounts of compute ingesting documents, processing token relationships, and building KV cache beforehand. To put this in perspective, one query can make the system process hundreds of thousands (even millions) of tokens just to prepare the context window needed for reasoning. Hence all the talk about token sprawl and teams hitting their token limits way faster than expected.
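To make that imbalance concrete, here is a back-of-envelope sketch. The function names and the example figures (a 200,000-token context, 300 output tokens, 50 repeat requests) are illustrative assumptions, not measurements from any real deployment. It counts tokens only, so it says nothing about wall-clock time or the per-token cost difference between prefill and decode.

```python
def prefill_token_share(prompt_tokens: int, output_tokens: int) -> float:
    """Fraction of all tokens the model touches that belong to the prompt."""
    total = prompt_tokens + output_tokens
    if total <= 0:
        raise ValueError("need at least one token")
    return prompt_tokens / total


def redundant_prefill_tokens(prompt_tokens: int, requests: int) -> int:
    """Prompt tokens re-processed when the same context is rebuilt per request.

    Assumes nothing is reused: the first request is necessary work,
    every later one repeats it.
    """
    if requests < 1:
        raise ValueError("requests must be >= 1")
    return prompt_tokens * (requests - 1)


# Hypothetical workload: one large manual, a short answer, asked 50 times.
share = prefill_token_share(200_000, 300)
wasted = redundant_prefill_tokens(200_000, 50)
print(f"prompt share of tokens: {share:.2%}")
print(f"prompt tokens re-processed: {wasted:,}")
```

Under these assumed numbers, nearly all of the tokens handled belong to the prompt, and every request after the first repeats that ingestion in full.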

All of this changes the cost structure of enterprise AI. The expensive part is no longer just generation speed or model size (though those things do matter too). Increasingly, it’s actually the repeated construction of context itself.

Multi-document reasoning, coding assistants, compliance review, enterprise search, and agentic workflows all depend on large contextual windows that have to be continuously rebuilt during inference. The larger those windows become, the more GPU resources are consumed during prefill, and the more expensive it becomes to repeatedly reconstruct nearly identical semantic state across users and sessions.

All of this waste compounds because the resulting KV cache usually remains trapped inside temporary runtime memory tied to a specific GPU allocation, inference engine, or session. Once the workload moves or the session ends, the context disappears with it, even though the enterprise may need that same understanding again minutes later.

What this means is that KV cache is no longer just a model optimization detail buried inside inference engines. At enterprise scale, it directly affects cost, latency, throughput, and GPU utilization, which means discarded context windows are becoming one of the largest hidden inefficiencies in modern AI infrastructure.

KV Cache Is Actually Enterprise Knowledge

If prefill is becoming one of the most expensive parts of enterprise AI, then the obvious question is why the resulting context is still treated as disposable.

Today, most infrastructure treats KV cache as temporary inference memory. The system builds it during runtime, keeps it briefly inside GPU or host memory, then throws it away when the session ends. But at enterprise scale, that cache increasingly represents expensive semantic work the organization already paid to create.

A support agent querying a product manual isn’t just retrieving documents; the system is building contextual understanding around those documents. The coding assistant loading a repository is constructing awareness of files, dependencies, APIs, and prior changes. Financial analysts querying the same filings and reports are likewise rebuilding nearly identical contextual understanding across users and sessions.

There is no real reason that contextual understanding has to remain temporary other than the fact that inference systems were originally designed that way. Once KV cache is treated as reusable data instead of short-lived memory, the architecture starts to change.

Context can persist across sessions. Similar requests can reuse already-prepared understanding instead of rebuilding it from scratch. Requests can be routed toward systems where relevant context already exists. Over time, the enterprise begins building a persistent layer of reusable understanding alongside documents, embeddings, and metadata.
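One way to picture reuse across sessions is a store keyed by hashes of token blocks, so a new request can find out how much of its context has already been prepared. The sketch below is a toy under stated assumptions: `ContextStore`, `block_keys`, and the 4-token block size are hypothetical names and values, the stored strings stand in for real KV blocks, and only exact shared prefixes count as reusable (merely "similar" text would not match here).

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK = 4  # tokens per cached block; tiny so the example is easy to follow


def block_keys(tokens: Sequence[int], block: int = BLOCK) -> List[str]:
    """Chain-hash full blocks so each key identifies the whole prefix up to it."""
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for start in range(0, full, block):
        chunk = ",".join(map(str, tokens[start:start + block]))
        prev = hashlib.sha256(f"{prev}|{chunk}".encode()).hexdigest()
        keys.append(prev)
    return keys


class ContextStore:
    """Hypothetical persistent store; values stand in for saved KV blocks."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def save(self, tokens: Sequence[int]) -> None:
        for i, key in enumerate(block_keys(tokens)):
            self._blocks.setdefault(key, f"kv-block-{i}")

    def reusable_tokens(self, tokens: Sequence[int]) -> int:
        """Length of the longest prefix whose blocks are already stored."""
        hits = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break
            hits += 1
        return hits * BLOCK


store = ContextStore()
manual = list(range(10))              # pretend these are the manual's tokens
store.save(manual + [100, 101])       # first session: manual plus a question
print(store.reusable_tokens(manual + [200, 201]))  # second session, new question
```

In this example the second session shares the first two full blocks with the first, so only the tail of its context would need fresh prefill. Chaining each hash to the previous one is a design choice that keeps a block from matching unless everything before it matches too.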

The system stops asking “what information is relevant?” and starts asking “what understanding already exists?”

The Infrastructure Stack Is Built For The Wrong Model

Once contextual understanding becomes persistent and reusable, the traditional AI stack starts looking a little…fragmented.

Most enterprise AI infra was built on the assumption that storage, retrieval, inference, and orchestration were separate problems. Documents live in one system, embeddings in another, vector indexes somewhere else, and GPU infra sits apart from storage, with orchestration layers trying to coordinate everything from above. That all made sense when inference was mostly short-lived and contextual memory disappeared as soon as a session ended.

Now, reusable KV cache changes those assumptions because context itself starts behaving like enterprise data: it needs to be stored, shared, routed, replicated, indexed, protected, and reused across users, agents, and workloads. That means the infrastructure can no longer treat inference memory as something isolated inside a GPU server, because the contextual understanding generated during inference now has lasting operational value.

As that happens, the boundaries between storage, retrieval, inference, and orchestration start breaking down too. A query engine might need awareness of where relevant context already exists before assigning workloads, for instance, and storage systems might need to manage persistent KV alongside source documents and embeddings. Orchestration layers might need to route requests based not just on available GPU capacity, but on where useful context is already warm and accessible.
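A context-aware routing decision of that kind could look roughly like the sketch below. Everything here is an assumption for illustration: the `Worker` class, its fields, and the policy of preferring the longest warm prefix and then the lighter load. A production scheduler would weigh cache hits against queue depth and memory pressure rather than rank them strictly.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    # Token sequences whose context this worker is assumed to hold warm.
    warm_prefixes: List[Tuple[int, ...]] = field(default_factory=list)

    def warm_tokens(self, tokens: Sequence[int]) -> int:
        """Longest shared prefix between the request and any warm context."""
        best = 0
        for prefix in self.warm_prefixes:
            n = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best


def route(tokens: Sequence[int], workers: Sequence[Worker]) -> Worker:
    """Prefer the most warm context; break ties by the lighter load."""
    return max(workers, key=lambda w: (w.warm_tokens(tokens), -w.active_requests))


busy_but_warm = Worker("gpu-a", active_requests=1, warm_prefixes=[(1, 2, 3, 4)])
idle_and_cold = Worker("gpu-b", active_requests=0)

print(route([1, 2, 3, 9], [busy_but_warm, idle_and_cold]).name)  # shares a prefix with gpu-a
print(route([7, 8], [busy_but_warm, idle_and_cold]).name)        # no warm context anywhere
```

The first request goes to the busier worker because three of its tokens are already warm there; the second falls back to plain load balancing because no worker holds anything useful.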

The result is infrastructure that behaves less like a chain of disconnected AI services and more like a unified platform where data, inference, and contextual understanding operate together as part of the same system, which always should have been the goal.

Enterprise AI Is Becoming A Continuous System

That new kind of data-driven platform can go way beyond retrieving data and serving inference requests. It can get enterprises closer to the AI efficiency grail: gradually building a shared layer of contextual understanding that sits alongside documents, embeddings, metadata, and broader workflows.

This means requests can move toward existing context, agents can inherit understanding generated earlier by other users or systems, and the enterprise can begin building an environment where useful semantic work compounds instead of disappearing after each session.

More broadly speaking, at that point the distinction between storage infrastructure, inference infrastructure, and orchestration infrastructure becomes much harder to draw cleanly, because all of them are participating in the continuous management of contextual understanding across the environment. It is almost the very definition of what VAST has built, in fact.
