As the line blurs between large language models and AI agents, it’s a great time for enterprises to start thinking seriously - and holistically - about their AI infrastructure. If the last few years were defined by large AI labs delivering huge models to the world, the next few years will be defined by how well everyone else can take those models and run with them.
It is, of course, difficult to predict exactly how things will play out, but one safe bet seems to be that AI agents and agentic workloads will take off. It’s already happening at the application level for certain workloads, such as coding, where agents can work from a simple prompt to execute increasingly complex tasks. And (assuming a cooperative ecosystem) agents are set to proliferate on the web for everything from booking reservations to buying products directly from AI applications like ChatGPT.
Although the project is still very early, the level of interest surrounding the Model Context Protocol (MCP) is another leading indicator of agentic activity to come. The easier it is to connect models with external tools and systems, the more workflows agents can execute. Think of MCP as the AI analog of open APIs, at least in the sense that it opens up a new path for software products to communicate (and for software vendors to partner).
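To make that concrete, here's a minimal sketch of what exposing an internal capability as an MCP tool can look like, assuming the official Python MCP SDK and its FastMCP helper; the order-lookup function and its data are hypothetical placeholders, not part of any real system.

```python
# Minimal MCP server exposing one internal capability as a tool.
# Assumes the official Python MCP SDK (pip install "mcp"); the order-lookup
# logic and data below are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-service")

# Hypothetical in-memory stand-in for an internal system of record.
_ORDERS = {"A-1001": {"status": "shipped", "eta_days": 2}}

@mcp.tool()
def get_order_status(order_id: str) -> dict:
    """Return status details for an order, or an error if it's unknown."""
    return _ORDERS.get(order_id, {"error": f"unknown order {order_id}"})

if __name__ == "__main__":
    # Serves the tool over stdio so any MCP-capable client or agent can call it.
    mcp.run()
```

Once a tool like this exists, any MCP-aware agent can discover and call it, which is exactly the open-API-style interoperability described above.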
Whereas today it’s largely human users writing prompts that kick off relatively simple agentic workflows, tomorrow those workflows will involve any number of agents working in tandem to complete increasingly complex tasks. Many workflows shouldn’t require a human catalyst at all. An event will trigger a function that triggers a team of agents that might do everything from parsing and analyzing a newly received document - RPA that actually works! - to warning public safety officials of potential threats caught on cameras or other sensors.
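As a rough illustration of that pattern, the sketch below shows an event handler fanning work out to a small chain of single-purpose agents; the event type, the agent functions, and the risk logic are all hypothetical stand-ins rather than any particular framework.

```python
# Hypothetical event-triggered agent workflow: an incoming document event
# fans out to a small team of single-purpose "agents" with no human catalyst.
# The agent functions stand in for real model and tool calls.
from dataclasses import dataclass

@dataclass
class DocumentEvent:
    doc_id: str
    text: str

def parsing_agent(event: DocumentEvent) -> dict:
    # Stand-in for a model call that extracts structured fields from the document.
    return {"doc_id": event.doc_id, "fields": {"length": len(event.text)}}

def analysis_agent(parsed: dict) -> dict:
    # Stand-in for a model call that scores or classifies the parsed output.
    risk = "low" if parsed["fields"]["length"] < 10_000 else "needs_review"
    return {**parsed, "risk": risk}

def on_document_received(event: DocumentEvent) -> dict:
    """Event handler: the trigger kicks off the agent chain end to end."""
    return analysis_agent(parsing_agent(event))

print(on_document_received(DocumentEvent("loan-42", "sample contract text")))
```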
But maximizing the potential of this new world requires a reinvestment in enterprise infrastructure and data architecture that’s optimized for running these agentic workloads. At some point, the assumptions we’ve been relying on for decades have to break: the current and near-term volume of AI inference workloads is already stretching them thin, and the large-scale adoption of AI agents looks like the breaking point.
From zero to agents, and from training to inference
Before digging into the ideal agentic infrastructure, though, let’s look at the recent history of AI to see how we got to this point. Here’s a very brief rundown of the AI capabilities that have become widely available over the past decade or so:

Beyond going from zero to agents in the course of a few years, the biggest shift in AI has been from a focus on training to a focus on inference. Training frontier AI models famously requires huge amounts of data and computing resources, but the process is also pretty predictable: AI labs know how much infrastructure they’ll need and for how long they’ll need it. Once they’ve established their methodology, they can tweak it as needed for future training runs.
From an infrastructure perspective, inference can be trickier: each inference task is relatively small in isolation, but the tasks add up fast. Inference requests are often batched together (that is, some number of jobs is sent for processing simultaneously) in order to increase GPU utilization and efficiency. Choosing a batch size is a tradeoff between throughput and latency, and users definitely notice when outputs are slow or, worse, they can’t even run their jobs.
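The tradeoff is easiest to see in a toy dynamic-batching loop: a batch is flushed either when it fills up (favoring throughput) or when the oldest request has waited too long (favoring latency). This is a hedged illustration of the idea, not how any particular serving framework implements it; the batch size and wait limit are arbitrary.

```python
# Toy dynamic batcher: flush when the batch is full (throughput) or when the
# oldest request has waited max_wait_s (latency). Illustrative only.
import time
from queue import Queue, Empty

def batch_requests(requests: Queue, max_batch: int = 8, max_wait_s: float = 0.05):
    batch, deadline = [], None
    while True:
        # Block for the first request; otherwise wait only until the deadline.
        timeout = None if deadline is None else max(0.0, deadline - time.monotonic())
        try:
            item = requests.get(timeout=timeout)
        except Empty:
            item = None
        if item is not None:
            batch.append(item)
            if deadline is None:
                deadline = time.monotonic() + max_wait_s
        if batch and (len(batch) >= max_batch or item is None):
            yield batch          # hand the batch to the GPU worker
            batch, deadline = [], None
```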
In fact, it’s inference that’s at the core of many of the massive compute deals the leading AI labs are striking with infrastructure providers around access to cloud resources and physical GPUs. If companies like OpenAI and Anthropic want to keep their customers happy and deliver on their internal goals, they need all the inference capacity they can get.
However, it’s not just an influx of users that strains inference infrastructure; it’s also the types of workloads they’re running. Today’s reasoning models rely on inference-time scaling to carry out complex requests, which means they spend more time (and compute resources) testing possible options and, often, generating more-detailed responses. This approach helps mitigate the diminishing returns of pre-training scaling laws, at the expense of consuming more compute and bandwidth at runtime.
Agentic workflows can look a lot like reasoning and include chain-of-thought processing, but they likely involve more calls to external services such as APIs, MCP servers, and other agents designed to carry out specific tasks or subtasks. Going forward, we should expect agents to persist for longer periods of time, maintaining context across sessions and learning with each run. As with pure reasoning, though, every additional step consumes more compute on the backend and adds latency for the human (or agentic) end user.
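Here's a schematic version of that kind of loop, in which each step may call an external tool and the running context is persisted between sessions. The model call, the tool registry, and the JSON session store are hypothetical stand-ins for whatever model API and state layer an organization actually uses.

```python
# Schematic agent loop: each step may call an external service (an API, an MCP
# server, or another agent), and the running context is persisted across runs.
# call_model, TOOLS, and the JSON session file are hypothetical stand-ins.
import json
from pathlib import Path

TOOLS = {"lookup_weather": lambda city: {"city": city, "forecast": "clear"}}

def call_model(context: list) -> dict:
    # Stand-in for an LLM call that decides the next action:
    # either a tool call or a final answer.
    if not any(m["role"] == "tool" for m in context):
        return {"action": "tool", "name": "lookup_weather", "args": {"city": "Paris"}}
    return {"action": "final", "answer": "It should be clear in Paris."}

def run_agent(session_file: str, user_msg: str, max_steps: int = 5) -> str:
    path = Path(session_file)
    context = json.loads(path.read_text()) if path.exists() else []
    context.append({"role": "user", "content": user_msg})
    for _ in range(max_steps):
        step = call_model(context)
        if step["action"] == "final":
            context.append({"role": "assistant", "content": step["answer"]})
            break
        result = TOOLS[step["name"]](**step["args"])   # the external call
        context.append({"role": "tool", "content": json.dumps(result)})
    path.write_text(json.dumps(context))               # state survives the session
    return context[-1]["content"]

print(run_agent("session_demo.json", "What's the weather in Paris?"))
```

Every pass through that loop is another inference call plus a round trip to an external system, which is where the extra compute and latency come from.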
All of this is compounded by longer native context windows for LLMs, which let them hold ever-larger token volumes in memory so they can reason across longer collections of text. This naturally affects bandwidth and GPU performance, although approaches to minimizing the impact via context engineering are becoming well understood and adopted.
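One common context-engineering tactic is simply to keep the most recent turns within a fixed token budget and collapse everything older; the sketch below is a crude version of that idea, using whitespace splitting as a stand-in for a real tokenizer and an arbitrary budget.

```python
# Crude context-window trimming: keep the newest messages under a token budget
# and collapse everything older into a placeholder. Whitespace splitting is a
# stand-in for a real tokenizer; the budget is arbitrary.
def trim_context(messages: list, budget_tokens: int = 2000) -> list:
    kept, used = [], 0
    for msg in reversed(messages):            # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > budget_tokens:
            kept.append(f"[{len(messages) - len(kept)} earlier messages omitted]")
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```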
Challenges for enterprise AI inference systems
With some exceptions, the majority of enterprise AI workloads will be inference tasks, from simple prompts to complex multi-agent workflows. And many of the pieces are already in place for ambitious organizations to start implementing their AI strategies: there are high-quality open-source models, fast-maturing ecosystems for agentic tooling, and a better understanding of what to do with these new resources. What most organizations lack, however, is an infrastructure foundation optimized for running AI workloads, much less for running AI agents at operational scale.
The problem is that AI agents were born into a world where much of the infrastructure required to run them was designed for previous eras of computing. Even the world’s best reasoning models, despite sometimes exhibiting agentic behavior themselves, are currently limited by infrastructural constraints. They’re amazingly capable models that, for example, can’t easily maintain state across instances or draw on data created after their last training run.
Even as cloud-hosted models and services continue to improve, though, many large enterprises and government agencies will also want to manage their own AI infrastructure. This is especially true for mission-critical applications and applications where access to fresh data and low latency is paramount. Think about any application involving highly sensitive data; regulated industries; real-time data streaming from edge devices; or multiple agents working in tandem to execute multimodal workflows (touching, say, language, video, and image models) or multi-part tasks (like ingesting, processing, and analyzing loan documents).
Beyond security and performance concerns, there’s also a cost element when it comes to choosing which models to call, and when. Calling the major models from large AI labs can be costly, and not every task requires that level of LLM performance. A true agentic system should be able to intelligently route jobs across a range of models of different sizes, modalities, and capabilities (a pattern often called model routing, not to be confused with the mixture-of-experts architecture used inside a single model), wherever those models are hosted.
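A routing layer can be as simple as matching a task's modality and estimated difficulty to the cheapest model that can handle it. The catalog, relative costs, and difficulty heuristic below are purely illustrative assumptions, not recommendations.

```python
# Hypothetical cost-aware model router: pick the cheapest model whose
# capabilities cover the task. Names, costs, and difficulty tiers are invented.
MODELS = [
    {"name": "small-local-llm", "modalities": {"text"},          "max_difficulty": 1, "cost": 1},
    {"name": "mid-hosted-llm",  "modalities": {"text"},          "max_difficulty": 2, "cost": 5},
    {"name": "frontier-vlm",    "modalities": {"text", "image"}, "max_difficulty": 3, "cost": 25},
]

def route(task: dict) -> str:
    candidates = [
        m for m in MODELS
        if task["modalities"] <= m["modalities"] and task["difficulty"] <= m["max_difficulty"]
    ]
    if not candidates:
        raise ValueError("no available model can handle this task")
    return min(candidates, key=lambda m: m["cost"])["name"]

print(route({"modalities": {"text"}, "difficulty": 1}))           # -> small-local-llm
print(route({"modalities": {"text", "image"}, "difficulty": 3}))  # -> frontier-vlm
```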
The advent of capable and readily available video models - especially video-language models (VLMs) - adds another wrinkle for high-stakes applications. With classic LLMs, for example, the retrieval pipeline focuses on chunking documents, embedding the text, and storing those embeddings in a vector database. Relevant chunks are then retrieved and fed to the LLM as needed, all with relatively lightweight infrastructure.
With video, however, each chunk may require multimodal embedding that extracts textual cues, image features, and sometimes direct video descriptors. Different sampling policies might also be necessary, for example, skipping static surveillance footage and only saving chunks when something changes. In addition to requiring more storage capacity because video files are large, the greater dimensionality of video data means VLMs also require more GPU resources.
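The "only keep chunks when something changes" policy can be approximated by comparing each frame against the last one kept and skipping near-duplicates. This is a rough sketch with an arbitrary threshold; a real pipeline would operate on decoded frames from a video reader and feed the kept chunks into multimodal embedding.

```python
# Rough change-detection sampling: keep a frame only when it differs enough
# from the last kept frame, so static surveillance footage is skipped.
# The threshold is an arbitrary tuning knob.
import numpy as np

def sample_changed_frames(frames: list, threshold: float = 12.0) -> list:
    kept, last = [], None
    for i, frame in enumerate(frames):
        current = frame.astype(np.float32)
        if last is None or np.abs(current - last).mean() > threshold:
            kept.append(i)      # this chunk moves on to embedding and storage
            last = current
        # otherwise: nothing changed, so skip the chunk entirely
    return kept
```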
Legacy data infrastructure can’t keep up
If expert predictions turn out to be correct, large organizations might be running millions of agents to augment their human workforces and automate as many processes as possible. That’s a whole lot of inference activity and, frankly, might be more traffic than most enterprise systems have ever had to handle. Unfortunately, the capabilities of AI models have outrun the legacy data and application infrastructure that was built for earlier generations of applications.
Using these components to power meaningful inference workloads, much less agentic ones, can result in complex pipelines like we’re used to seeing inside large web companies: disparate tools and systems to handle event processing, stream processing, function calls, storage, database calls, and more. A DIY system composed of legacy components might end up looking something like this:

That’s a lot to manage. The more complicated these systems become, composed of various components with different limitations and requirements, the more they risk dying by a thousand cuts. At a broad level, all that complexity leads to challenges with cost structure, governance, security/permissions, reproducibility (of errors), and traceability (of interactions among subsystems). On a per-task level, it slows down reasoning and agentic workflows by forcing what should be fluid access to multiple data sources through a collection of siloed systems.
The agentic operating system
So what does the ideal enterprise infrastructure for running AI agents look like? It’s a hair too early to say with certainty, but one absolute requirement will be the ability to maintain high performance as the number of users, agents, and inference tasks skyrockets. Additionally, any sizable team of agents will increase the storage capacity required, due to KV cache offloading (also a side effect of larger context windows), shared memory between agents, tool logging, and the introduction of complex RAG systems, among other things.
Other requirements for production-grade AI agent infrastructure will include:
Native support for multi-cloud architectures (so compute can access data wherever it’s housed)
Statefulness (across jobs, instances, and agents)
Multi-agent coordination (so multiple agents can collaborate, delegate, and communicate)
Adequate data security (so agents, models, and users can only access what they have permission to see)
Observability (to see who accessed what, who called whom, and what caused a particular behavior or result; a rough sketch of the security and observability pieces follows this list)
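As a loose illustration of the security and observability items above, the sketch below wraps every tool call in a permission check and an audit record. The policy table, agent identities, and log format are hypothetical; a production system would back these with real identity and logging infrastructure.

```python
# Hypothetical permission-scoped tool calls with an audit trail: an agent can
# only invoke tools it has been granted, and every attempt is logged.
import json
import time

PERMISSIONS = {"ops-agent": {"read_metrics"}, "finance-agent": {"read_ledger"}}
AUDIT_LOG = []

def call_tool(agent_id: str, tool: str, fn, **kwargs):
    allowed = tool in PERMISSIONS.get(agent_id, set())
    AUDIT_LOG.append({"ts": time.time(), "agent": agent_id, "tool": tool,
                      "args": kwargs, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{agent_id} may not call {tool}")
    return fn(**kwargs)

# The ops agent is allowed to read metrics; a disallowed call would raise
# PermissionError but still leave an audit record behind.
call_tool("ops-agent", "read_metrics",
          lambda service: {"service": service, "p99_ms": 42}, service="checkout")
print(json.dumps(AUDIT_LOG, indent=2))
```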
And the simpler agentic infrastructure is to manage and maintain, the better. Especially at scale, a complex web of siloed systems adds operational overhead and failure points, taking focus away from the primary goal: developing agentic pipelines that can execute reliably across a wide range of use cases.

At VAST, we’re working to solve this problem, and even future-proof the infrastructure stack for AI inference, with a single platform that we call the AI OS. Today, it spans every step of the data pipeline, from event processing to storage.
And it works. For example, a large international city is using VAST to improve public safety by monitoring cameras to identify dangerous activity as it unfolds. Its collection of agents extracts and summarizes frames, then acts accordingly based on pre-defined policies. Some behaviors might warrant an immediate flag to the police, while others require a human overseer to make the final call. Everything is logged so the city can troubleshoot, audit, and refine as needed.
Going forward, we’re developing capabilities specifically for building, managing, and orchestrating AI agents. Providing these features natively on the AI OS will allow users to deploy agents at massive scale, while ensuring they deliver on enterprise requirements around performance, governance, and security. These agents won’t just be performing one-off coding jobs or making restaurant reservations. They’ll be integral parts of every enterprise workflow, always learning, remembering, and improving.
The time to start investing in the agentic future isn’t five years from now, when it’s already too late, but today, when AI is the mandate and you can evolve along with all the technologies at play. If you want to learn more about how the VAST AI OS can help, reach out.



