In Part 1 of this two-part deep dive, we highlighted the new demands that AI places on data infrastructure and explained why the current stack of “modern” distributed data systems can’t meet those demands.
Here’s the very short version: Many of today’s popular data systems were developed discretely, are difficult and costly to manage, and must be assembled into complex pipelines in order to achieve the goal of web-scale analytics. However, AI inference workloads pose a novel set of requirements and problems that legacy architectures and best-of-breed systems can’t meet.
In part, this is because most data processing and analysis still happens outside of the model layer. Models are, of course, very important components of data pipelines for AI workloads. Model behavior will influence how pipelines are built. Model-serving infrastructure will help dictate how models interact with the data layer. But the model layer itself can’t deliver production-caliber applications.
The VAST AI OS was built to deliver the data-layer performance that AI will require. Many of the benefits come from consolidating disparate data systems and processes into a single AI-native platform, all taking advantage of the underlying “disaggregated, shared-everything” storage architecture. However, the platform actually extends down into the more traditional AI infrastructure layer. For example, local deployments running VAST-certified hardware with embedded GPUs can host models and execute GPU-accelerated analytics without leaving the VAST cluster. The AI OS can also handle KVcache offloading and long-term context memory without requiring yet another new storage layer for those purposes.

In this post, we dive a little deeper into five areas in which AI is reshaping data architectures, and that help explain some of the architectural decisions underpinning the VAST AI OS.
1. Collapsing the data architecture for AI workloads — and analytics
Reshaping architectures for AI inference often begins with collapsing the number of systems required to store, process, and analyze data. This serves many purposes, beginning with reducing operational complexity and latency, and extending to things like minimizing an organization's attack surface.
Think about a use case like real-time video intelligence, where the workflow might look something like this:
Raw video hits object store, which triggers a data-processing pipeline
Raw video is chunked into short segments and sent to a video-language model
The VLM output is sent to an embedding model that generates vector embeddings
These embeddings are inserted into a vector database
An automated query determines whether the newly embedded video constitutes an emergency and, if so, triggers an alert
A lag at any step defeats the real-time nature of the application, so keeping the workflow largely contained within a single data platform helps guard against that outcome.
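The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not VAST's implementation: every name here (`Segment`, `chunk_video`, `run_pipeline`, and the `toy_*` stand-ins for the VLM, the embedding model, and the emergency query) is hypothetical, and the vector database is just an in-memory list. The point is that each segment passes through four stages in sequence, so the time spent in any one of them adds directly to the delay before an alert can fire.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]


@dataclass
class Segment:
    video_id: str
    start_s: float
    end_s: float


def chunk_video(video_id: str, duration_s: float, segment_s: float = 10.0) -> List[Segment]:
    """Split a video into fixed-length segments; the last one may be shorter."""
    segments: List[Segment] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append(Segment(video_id, start, end))
        start = end
    return segments


def run_pipeline(
    video_id: str,
    duration_s: float,
    describe: Callable[[Segment], str],      # stand-in for the VLM call
    embed: Callable[[str], Vector],          # stand-in for the embedding model
    store: List[Tuple[Segment, Vector]],     # stand-in for the vector database
    is_emergency: Callable[[Vector], bool],  # stand-in for the automated query
) -> Tuple[List[Segment], Dict[str, float]]:
    """Run every segment through all stages and record seconds spent per stage."""
    alerts: List[Segment] = []
    timings = {"describe": 0.0, "embed": 0.0, "insert": 0.0, "query": 0.0}
    for segment in chunk_video(video_id, duration_s):
        t0 = time.perf_counter()
        description = describe(segment)
        t1 = time.perf_counter()
        vector = embed(description)
        t2 = time.perf_counter()
        store.append((segment, vector))
        t3 = time.perf_counter()
        if is_emergency(vector):
            alerts.append(segment)
        t4 = time.perf_counter()
        timings["describe"] += t1 - t0
        timings["embed"] += t2 - t1
        timings["insert"] += t3 - t2
        timings["query"] += t4 - t3
    return alerts, timings


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def toy_describe(segment: Segment) -> str:
    return "smoke in hallway" if segment.start_s >= 20 else "empty hallway"


def toy_embed(text: str) -> Vector:
    return [1.0, 0.0] if "smoke" in text else [0.0, 1.0]


def toy_is_emergency(vector: Vector) -> bool:
    reference = [1.0, 0.0]  # toy "emergency" direction
    return sum(a * b for a, b in zip(vector, reference)) > 0.5


store: List[Tuple[Segment, Vector]] = []
alerts, timings = run_pipeline("cam-1", 35.0, toy_describe, toy_embed, store, toy_is_emergency)
print([(s.start_s, s.end_s) for s in alerts])
print(timings)
```

With the toy stand-ins, a 35-second clip produces four segments, and the two that start at 20 s and 30 s are flagged. In a real deployment each stage would typically be a network hop to a separate system, which is where the per-stage timings start to matter.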

This concern applies to more traditional data workloads as well. Transforming and moving data from system to system is often a batch process that, at a minimum, adds time and failure points between new data being generated and that same data being queryable. It can also add operational overhead for users who need to understand the nuances of multiple systems and tools to perform seemingly simple tasks.
This is a big reason why large cloud providers have been working to add more capabilities to, and connections between, existing products (e.g., adding vectors and tables to object stores, or implementing RAG functionality that spans database services). These approaches do allow users (human or agent) to do more, and do it more easily, within a single system, but they are necessarily half-measures given cloud-platform architectures and business models. Users are still running multiple managed services for both core data systems (e.g., databases and data stores) and pipeline plumbing (e.g., message/event brokering, ETL, and serverless functions).
2. GPU availability, multi-cloud, and edge environments
Doing anything within the confines of a cloud platform typically means your data stays there. It doesn’t have to, of course, but egress fees make regularly moving large volumes (or ever moving huge volumes) unpalatable. While this might not be an issue if performance, cost, and reliability are best-in-class and GPU instances are readily available, it becomes a big issue when production workloads can’t execute as planned or costs rise too high.
Thus, over the past few years, we’ve witnessed the emergence of AI-focused “neoclouds” (many of which are VAST customers) and the push for multi-cloud environments. Because demand for AI computing resources is so high that no single cloud provider can meet it all, the industry has accepted that savvy customers will likely seek out the best available source of GPUs, TPUs, or other compute hardware at any given time for their particular needs. In theory, a rising tide should float all cloud boats and ensure even smaller customers have access to compute resources within their budgets.
In another interesting development, neocloud providers are able to buck the trend of massive datacenter buildouts by focusing on smaller edge locations. Edge computing will be critical to deployments of physical AI — including robots — that require the lowest possible latency between on-board and cloud-based resources. Or, in some extreme cases, between on-board and locally hosted resources that can house additional computing power and context without requiring an internet connection.
However, data gravity is the unstated variable in realizing these types of extended cloud architectures. It’s unrealistically time-intensive and expensive to move or replicate many terabytes or petabytes everywhere an organization might spin up compute resources. Large cloud platforms also offer a slew of managed services that can’t be replicated elsewhere, meaning there’s no 1:1 replacement when rebuilding data pipelines inside another cloud.
Achieving true multi-cloud and edge architectures, therefore, requires at least a global namespace at the storage layer, so that data is accessible to new compute resources without moving or copying it. Better yet would be to also have a collection of services and capabilities that move with the data to new sources of compute power. Stacks, pipelines, and architectures are all portable and all act in service of the thing that really matters: the data.
In the video below, Microsoft's Kanchan Mehrotra explains how AI inference workloads are driving a move toward multi-cloud and edge architectures, and making the data layer a strategic advantage:
3. Distributed and consistent context memory and KVcache
For transformer-based models like most LLMs, it’s becoming increasingly difficult to plan for AI-inference data volumes and architectures without accounting for a model’s context memory, which is stored in the KVcache so that the attention keys and values for tokens the model has already processed don’t have to be recomputed for every new token. And although the KVcache has traditionally resided ephemerally within the high-bandwidth memory (HBM) of GPUs, that default approach can cause performance issues as context length grows (including from system prompts) and the number of concurrent users scales.
Already today, for example, serving a relatively large number of users across multi-turn sessions can produce petabytes of context data that must be cached. This can overwhelm available memory pretty quickly, depending on the size of the system. A potentially less obvious concern is that as human users take time to read and assess LLM outputs, their HBM-stored KVcache reduces overall system efficiency by reducing available capacity for other users, leaving GPUs idle.
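To see how context data reaches petabyte scale, it helps to run the arithmetic. A standard transformer stores one key tensor and one value tensor per layer for every token, so the cache grows as 2 × layers × KV heads × head dimension × bytes per element, per token. The sketch below applies that formula to an assumed configuration: 80 layers, 8 KV heads, a head dimension of 128, and 16-bit values. The 32,768-token sessions and 100,000 retained sessions are likewise illustrative assumptions, not figures from VAST or any specific deployment.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(tokens_per_session: int, sessions: int, per_token_bytes: int) -> int:
    """Total cache size if every session keeps its full context."""
    return per_token_bytes * tokens_per_session * sessions


# Hypothetical model configuration and workload, chosen only for illustration.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_session = kv_cache_bytes(tokens_per_session=32_768, sessions=1, per_token_bytes=per_token)
total = kv_cache_bytes(tokens_per_session=32_768, sessions=100_000, per_token_bytes=per_token)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_session / 1024**3:.1f} GiB per session")
print(f"{total / 1e15:.2f} PB across all sessions")
```

Under these assumptions, each token costs 320 KiB, a single full-length session costs 10 GiB, and 100,000 retained sessions add up to roughly 1.07 PB. That is far more than the HBM of any GPU system, which is why tiering the cache onto external storage enters the picture.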
Looking forward, there are multiple other trends that could push more KVcache, and context in general, onto tiered storage, including:
Multi-agent systems that require shared context and will consume a lot of compute resources
Longer-term context retention to serve agents, ultra-personalized experiences, regulatory requirements, and any other reasons to recall more than just prompt-response history
An uptick in usage of video models, which inherently generate more context due to the density of video inputs and outputs
There is a steady stream of R&D around condensing KVcache and otherwise making it more efficient, and advances in model architecture, or increased uptake of alternative architectures such as state-space models, could also dull the need for KVcache specifically. However, the Jevons paradox has shown time and time again that efficiency often begets more consumption. If the current pace of datacenter, bandwidth, and power buildouts is any indication, we’ll need all the context capacity we can produce.
Like all potentially sensitive data, KVcaches and any type of context memory also need to be secured. It’s still too early to declare any bulletproof solutions for agentic RBAC or any sort of access management for agents, although current approaches to encryption and granular data permissions are a solid start toward keeping context data safe.
In the video below, NVIDIA's Vikram Sharma Mailthody and VAST Data's Anat Heilper discuss the benefits of offloading KVcache to external storage in order to overcome limited memory capacity at the GPU layer:
4. Adding a vector database to the mix
Although vector databases and vector embeddings are not entirely new concepts, they’ve experienced a revival thanks to generative AI. They’re critical for the RAG process by which LLMs can access out-of-training documents for additional context, as well as for semantic queries across a wide variety of data types. The semantic angle shouldn’t be overlooked: Agents able to query and search across various unstructured data formats in real time should be able to solve problems and carry out analyses that humans cannot.
But running a vector database for at-scale AI workloads introduces a set of issues that traditional approaches weren’t designed to meet. As with other types of database, these typically boil down to performance, scalability, and — importantly — performance at scale. Like most distributed databases, vector databases tend to handle scale via some type of sharding, and then mitigate any performance issues by storing the index in memory. This can work well up to a point, but it often breaks at extreme scale because the index can’t fit in memory and/or coordinating queries across too many shards introduces both operational and execution overhead.
While standard text-based embeddings might not push too many vector databases to their limits, video, robotics, and agentic workloads will. Video, for example, can produce a continuous stream of dense, high-dimension inputs that must be processed, inserted, and indexed. Agentic workloads, as mentioned multiple times already, can exert continuous query pressure on a vector database while also producing streams of artifacts to be ingested. A few years ago, a few hundred million vectors might have been a lot, but today it’s not uncommon to see organizations wanting to store and query tens of billions of vectors.
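A rough calculation shows why an in-memory, sharded index runs into trouble at this scale. The sketch below estimates the raw size of the vectors alone and the minimum number of shards needed to hold them in RAM. The inputs are assumptions chosen for illustration: 10 billion vectors, 1,024 float32 dimensions, and 512 GiB of usable memory per node. It ignores index overhead (such as graph links), replicas, and compression, so real numbers will differ.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_component: int = 4) -> int:
    """Size of the uncompressed vectors alone (float32 by default)."""
    return num_vectors * dimensions * bytes_per_component


def min_shards(total_bytes: int, usable_bytes_per_node: int) -> int:
    """Smallest shard count that fits the data in memory (ceiling division)."""
    return (total_bytes + usable_bytes_per_node - 1) // usable_bytes_per_node


# Hypothetical workload and node size, for illustration only.
vectors_bytes = raw_vector_bytes(num_vectors=10_000_000_000, dimensions=1024)
shards = min_shards(vectors_bytes, usable_bytes_per_node=512 * 1024**3)

print(f"{vectors_bytes / 1e12:.2f} TB of raw vectors")
print(f"at least {shards} in-memory shards")
```

With these inputs, the raw vectors come to about 41 TB and need at least 75 nodes just to sit in memory, before any index structures or replicas. In a scatter-gather design every query may then have to fan out to each shard and merge the partial results, which is where the coordination overhead described above comes from.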
And, of course, adding yet another distributed data system into the application stack introduces yet another failure point, more complexity, and another cost center.
The below image, based on recent benchmark testing, illustrates how the VAST AI OS's native vector store handled more than 1,000 queries per second at 50 billion vectors, compared with less than 100 queries per second for a popular standalone offering.

5. Kicking network traffic — and data — into overdrive
An ancillary effect of everything described in the first four sections is that network traffic volumes are going to skyrocket. Across the internet, yes, but also (and likely more dramatically) within corporate data centers and networks. The extreme requirements of training large models are one part of the story, evidenced by OpenAI’s work on the MRC networking protocol, although inference workloads (such as agents) will also push data center networks to their limits.
From a data perspective, this will create at least two major related issues that anyone serving AI inference will need to address:
Maintaining acceptable latency as data volumes (including context memory, vectorized data, and tabular data) grow and are distributed across the network.
Ensuring logging/observability infrastructure can keep up with the numbers of events now being executed by agents, GPU clusters, and other new components of production AI systems.
Here, too, data architecture matters a lot. Reducing the number of data stores and data-processing systems that need to touch data will go a long way toward solving the latency problem. Optimizing for throughput performance and scalability will help ensure that all events are accounted for in instances where real-time troubleshooting or agentic access are necessary. More fundamentally, eliminating potential failure points and bottlenecks helps ensure applications remain online and crash-free. Assuming AI is a mission-critical undertaking, keeping those systems operational will have a direct impact on business outcomes.
The image below, based on recent benchmark testing, illustrates how the VAST AI OS's native event broker significantly outperforms both Apache Kafka and a popular third-party distribution in terms of read-write throughput.

Running AI and data-intensive workloads on VAST
In short: The status quo would be to build AI pipelines using existing data systems as the foundation, inserting models, vector databases, and other related AI-specific infrastructure where necessary. While there are some pros to this approach, they might be largely psychological. The cons, however, could become inhibitors to maximizing production AI systems:
Many legacy data systems are designed for much different workloads.
Operational cost and complexity could balloon along with AI traffic and data volumes.
Scaling can also become a capital issue when supply-chain constraints arise.
The VAST AI OS is designed to handle all of these concerns, and more, in a single AI-native data platform that includes:
Object, file, and block storage
Event broker
Data lake
Tabular data
Vector data
Event-driven functions and pipelines
Automated syncing across data sources
GPU acceleration
KVcache offloading
Global namespace
Built-in data reduction, management, and security protocols
Tight partnerships with leading hyperscalers, neoclouds, and AI hardware providers
Customers and end-users that include CoreWeave, Cursor, Pixar, and more.
If you want to learn more about the VAST architecture and why it can power AI, analytics, HPC, and whatever data-intensive workloads you can throw at it, check out the following list of resources (roughly grouped by topic) or reach out:
| Data storage: VAST DataStore and the Case for True Shared-Everything Architecture | Structured data / SQL: Powering Enterprise AI with High-Velocity Vector Search and SQL |
|---|---|
| Event broker | Vector database |
| AI data pipelines: Running Lightning-Fast Functions and AI Workloads on Kubernetes; When the City Thinks: Real-Time Video Agents with VAST and NVIDIA Cosmos Reason 2; Solving the Last Mile Problem: Connecting Your Enterprise Data to Your AI Future | AI agents |
| Context memory / KVcache: Really Big Data: How Context Memory Is Reshaping AI Infrastructure | Multi-cloud |



