Rebuilding the Data Stack for AI: Web-Era Systems Can’t Keep Up

Authored by

Derrick Harris, Technology Storyteller

The data infrastructure that keeps many modern organizations running is built from bubble gum, duct tape, and a whole lot of reinvented wheels. This probably is not the ideal way to run large-scale web services or analytics jobs. It certainly is not the right way to run large-scale AI workloads, whose performance requirements are more akin to those of HPC simulations than to those of BI reports for SaaS companies.

What’s more, running complex architectures like this is expensive in terms of operations and opportunity. The pace of innovation in AI is still so fast that most organizations will benefit from focusing their attention and resources on things like data hygiene and context engineering, not on maintaining brittle data systems and pipelines.

This post explores how we ended up with today’s data-architecture status quo, what AI workloads require from their data layer, and why today’s systems aren’t the answer. In Part 2, we’ll explore how the industry is already reshaping itself around these requirements by collapsing data architectures and shrinking data pipelines in order to reduce complexity and latency while improving performance and scale.

AI and agents will crush brittle data pipelines

Examine the data stack of any reasonably large organization, and you’re likely to see some combination of event brokers, data lakes, databases, data-processing engines, data stores, and more. A simplified view of the core components might look like this:

The evolution of current data systems and architectures is highly pertinent to the world of AI applications. Although GPUs and models get much of the attention, the infrastructure stack for production AI inference workloads is multi-faceted. These workloads still require storage, networking, databases, observability, security, and everything else that all applications need. AI models can reason over data, but many of the operations they’ll run are similar to what modern data workloads do.

Still, there are many new considerations when it comes to building AI factories and deploying AI workloads at scale, including:

GPUs must be kept productive to avoid wasting money. The more time spent processing and moving data between systems, the less time spent generating tokens on expensive and power-hungry GPUs. In this context, all latency represents waste, while extended outages represent a triple-whammy of lost revenue, lower ROI, and displeased users.
Context memory is a net new data source that could reach a massive scale as organizations seek to share context across agents and/or geographies, or retain historical context rather than flushing it when a session ends. While this presently happens in GPU memory (in the form of KVcache), the data layer will soon need to shoulder much of the burden to help account for data scale and the cost of GPU operations.
AI security and governance are moving targets, with new innovations often introducing novel concerns around data management and privacy. For example, agentic workflows raise issues regarding data access both internally and across systems, and long-term storage of user context could provide a rich new target for cybercriminals.
At-scale agentic workflows could result in data systems experiencing far greater traffic and query activity, while simultaneously demanding faster performance to account for a shift to machine consumers. Any weak link in the pipeline will cause an outsized failure. Additionally, GPU-accelerated analytics are a promising approach to keeping up with agentic demands.
Certain AI use cases rely on minimal latency in order to be maximally useful. Real-time video intelligence pipelines, for example, need footage to be captured, chunked, embedded, and reasoned over almost immediately if the goal is to respond as fast as possible to emergencies.
GPU acceleration is coming for more data-processing and analytics workloads, in part to help execute agentic tasks faster. Startups building AI models that specialize in structured, tabular data are also raising a lot of money and being acquired by larger data companies. Systems that don’t support GPU acceleration risk becoming bottlenecks.
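The first point in the list above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the GPU price, the cluster size, and the split between compute time and data-wait time are hypothetical values, and the model assumes the GPU sits fully idle while waiting on the data layer.

```python
def gpu_waste(hourly_cost_usd: float, gpu_count: int,
              compute_ms: float, data_wait_ms: float) -> dict:
    """Estimate the share of GPU spend lost to waiting on data.

    All inputs are hypothetical. The model assumes no overlap between
    data movement and compute, which real systems may partly achieve.
    """
    total_ms = compute_ms + data_wait_ms
    if total_ms <= 0:
        raise ValueError("total time must be positive")
    utilization = compute_ms / total_ms
    hourly_spend = hourly_cost_usd * gpu_count
    return {
        "utilization": utilization,
        "wasted_usd_per_hour": hourly_spend * (1 - utilization),
    }


# Hypothetical cluster: 64 GPUs at $3 per GPU-hour; each request spends
# 80 ms computing and 20 ms waiting on the data layer.
estimate = gpu_waste(3.0, 64, compute_ms=80, data_wait_ms=20)
print(estimate)
```

Under these assumed numbers, a fifth of every GPU-hour is spent waiting rather than generating tokens, and that share scales directly with the size of the cluster.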

Beyond performance and maintenance concerns at the infrastructure level, organizational data also tends to live in silos, often varying from team to team. So, while giving agents or models access to data from systems like Salesforce might be simple enough via existing ETL tools, mapping and automating entire operational workflows (ranging from generating a sales quote to executing an online order) can be a hairy task. Agents need to know what data lives where and have proper read/write access to each system, and the API layer must be able to withstand all those requests.

The writing is on the wall: Agentic traffic, reasoning models, and spiking user growth already cause regular resource throttling for popular AI products, if not full-scale outages. And we haven’t yet begun deploying agents at the scale many people expect is coming.

The web data stack is slow and complex

Among the major challenges with traditional data architectures is that they frequently involve dozens of components sold, built, and managed by different companies and/or open source projects.

The complexity and dependency hell aren’t surprising given how all the various systems came about over the course of about two decades. The digitization/web wave broke traditional analytics workflows and systems by producing myriad new data sources and types (including from user behavior, SaaS applications, server farms, geodata, microservices, and more) and ramping up the pace at which they’re created. We needed inexpensive, scale-out infrastructure to deal with the capacity, as well as new types of systems to store and analyze unstructured data.

Hadoop was an early poster child for this era of big data, and also serves as a microcosm illustrating how the space expanded. Initially developed at Yahoo, Hadoop started out as a distributed file system (HDFS) and an implementation of the MapReduce programming model for data processing. However, MapReduce often proved too slow and difficult for many users and use cases, and people still really liked SQL. So the Hadoop community produced Hive, HBase, Avro, Flume, Drill, Sqoop, and more to account for how people actually wanted to interact with their data. ZooKeeper helped manage state and operations across Hadoop clusters, while YARN tackled resource management.

Anybody reading this probably knows what happened next: Spark; Kafka; Flink; Cassandra; Elasticsearch; Pulsar; Parquet; Iceberg; Ceph; and dozens of other popular projects, software products, cloud-provider services, and SaaS applications for moving, transforming, processing, and analyzing data. Although all of this activity served a real purpose, the disparate nature of its development resulted in a rat’s nest of interconnected and interdependent systems. And with each integration, one system or tool took on — or had to work around — the architectural compromises of the others.

Here’s how Netflix engineers, in announcing a custom-built solution for data movement, describe the legacy experience of moving data across more than a dozen distinct systems:

The key problems caused by this fragmentation included:

Cognitive Overload: Users needed to learn numerous different systems and interfaces to accomplish their data movement tasks.
Operational Overload: Multiple platform teams were required to support many disparate, often overlapping, data movement solutions.
Unreliable Governance: Security checks, lineage, and metadata gathering were implemented inconsistently across various tools, resulting in gaps in meeting our data governance standards.
Poor Discoverability: It was difficult for users to identify the right tool for their specific data movement among the numerous available options.
Poor Separation of Concerns: Often, users’ intent was mixed with the implementation details of the underlying tool. This meant a user often wasn’t just moving data from ‘A’ to ‘B’, they were running complex commands to ‘run a data movement job with these Spark parameters’, making it extremely challenging for the data platform team to upgrade the underlying engine without breaking user workflows.

Ultimately, answering the simple question, “How do I move data between X and Y?” was far from simple and often depended on the specific systems involved.
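The separation-of-concerns problem in that description can be illustrated with a small sketch. This is not Netflix’s system or API. The class names, URI schemes, and engine settings below are all hypothetical, and the point is only to show user intent ("move A to B") kept apart from the engine details the platform team owns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveRequest:
    """What the user wants: move a dataset from a source to a destination."""
    source: str
    destination: str


@dataclass
class ExecutionPlan:
    """How the platform chooses to do it; never written by the user."""
    engine: str
    settings: dict


def plan_move(request: MoveRequest) -> ExecutionPlan:
    """Hypothetical platform-side planner.

    Engine choice and tuning live here, so the platform team can change
    them without touching any user-facing request.
    """
    if request.source.startswith("kafka://"):
        return ExecutionPlan(engine="streaming", settings={"checkpoint_s": 60})
    return ExecutionPlan(engine="batch", settings={"parallelism": 8})


# The user states intent only; no engine parameters appear in the request.
request = MoveRequest(source="warehouse://sales.orders",
                      destination="lake://analytics/orders")
plan = plan_move(request)
print(request)
print(plan.engine, plan.settings)
```

Because the request carries no engine parameters, swapping the batch engine or retuning it would change only `plan_move`, which is the property the quoted passage says was missing.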

After all of this, ironically, a large portion of data still ends up in tables and SQL is still the gold standard for analytics.

The web data stack is expensive to manage

In addition to being licensed software products, or open source systems that must be maintained, many data systems and products also require significant capital expenditures in the form of compute and storage capacity. Today, even a traditional large enterprise running a “modern” data architecture could easily spend several million dollars per year in operating expenses and usage costs:

Open-source distributed systems like Ceph, ClickHouse, Elastic, and Kafka typically require at least a couple of dedicated engineers apiece (with more necessary as deployments scale), each likely costing at least $200,000 annually.
The cost of infrastructure can grow non-linearly in relation to data growth. This is often the result of systems running on shared-nothing architectures where compute and storage cannot scale independently of each other; high availability requires storing three copies of data; and performance degrades as partitioning schemes become more complex.
Enterprise licenses and cloud usage fees only compound with larger data volumes, increased data movement, and more sophisticated and frequent queries.

And whereas data storage has historically been relatively easy to scale, AI has upended that market. For starters, AI inference workloads incentivize hot data over cold data, meaning flash is king for anything that doesn’t fit into GPU memory. However, large-scale AI buyers have largely consumed available supplies of both flash and HDD storage, driving prices through the roof (currently ~$300/TB for flash and >$20/TB for HDD). Anyone deploying petabytes or exabytes of data today is doing so at an extreme mark-up, meaning every little operational inefficiency represents an outsized capital expenditure.
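To put the figures above together, here is a rough cost sketch. The per-terabyte prices, the three-copy replication factor, and the engineer costs come from the text. The 5 PB dataset size and the count of four self-managed systems are hypothetical, and the model ignores servers, networking, power, erasure coding, and data reduction.

```python
FLASH_USD_PER_TB = 300          # approximate flash price cited above
HDD_USD_PER_TB = 20             # lower-bound HDD price cited above
REPLICAS = 3                    # three copies for high availability
ENGINEER_USD_PER_YEAR = 200_000 # per-engineer cost cited above
ENGINEERS_PER_SYSTEM = 2        # "a couple of dedicated engineers apiece"


def raw_media_cost(logical_tb: float, usd_per_tb: float,
                   replicas: int = REPLICAS) -> float:
    """Media cost of storing `logical_tb` of data `replicas` times over."""
    return logical_tb * replicas * usd_per_tb


def staffing_cost(system_count: int) -> int:
    """Annual cost of the engineers dedicated to self-managed systems."""
    return system_count * ENGINEERS_PER_SYSTEM * ENGINEER_USD_PER_YEAR


logical_tb = 5_000   # hypothetical: 5 PB of logical data
systems = 4          # hypothetical: four self-managed open source systems

print(f"flash, 3 copies: ${raw_media_cost(logical_tb, FLASH_USD_PER_TB):,.0f}")
print(f"HDD, 3 copies:   ${raw_media_cost(logical_tb, HDD_USD_PER_TB):,.0f}")
print(f"staffing / year: ${staffing_cost(systems):,.0f}")
```

Under these assumptions, flash media alone comes to $4.5 million and staffing to $1.6 million per year, before any licenses or cloud usage fees.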

Again, Netflix offers some lessons worth heeding. As of 2024, it was spending more than $150 million per year on storage and compute resources for its data pipelines, not including personnel costs. Thankfully, the company is pretty open about reiterating that its variety of data architecture is not something to strive for, but actually is an artifact of many years of iteration where ripping and replacing existing systems wasn’t a viable option at its scale.

The path to AI-native data infrastructure

If we extrapolate a modern data architecture or data pipeline to account for AI workloads, it probably looks something like the diagram below. Beyond adding models into the mix, systems are now accessed by any number of new human or agentic users, with AI models acting as the intermediary. The results of those actions are fed back into source locations as context for future agentic behavior, reinforcement learning, and other approaches for achieving dynamic and self-learning AI environments.

AI inference also adds new infrastructure wrinkles in the form of GPUs and, potentially, a layer of flash storage targeting KVcache and overall context-memory data. KVcache sizes risk overwhelming GPU memory as they grow with longer contexts and higher concurrent user counts. Offloading KVcache to flash storage can help free up GPU resources and reduce time-to-first-token. More generally, maintaining a high-speed data store for long-term context memory can act as a central source of truth for agentic swarms, improve user personalization, and enable auditability of system interactions.
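A rough sizing sketch shows how quickly KVcache grows with context length and concurrency. It uses the common approximation for a standard transformer decoder: one key vector and one value vector per token, per layer, per KV head. The model shape below is hypothetical and not tied to any specific product, and the estimate ignores paging overhead and cache quantization.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, concurrent_sequences: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache footprint for a standard transformer decoder.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every token at every layer and KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_sequences


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
# 16-bit (2-byte) cache values.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens, users in [(8_192, 1), (131_072, 1), (131_072, 64)]:
    size = kv_cache_bytes(**shape, context_tokens=tokens,
                          concurrent_sequences=users)
    print(f"{tokens:>7} tokens x {users:>2} sequences: {size / 2**30:,.0f} GiB")
```

With these assumed dimensions, one 8,192-token sequence needs 1 GiB of cache, while 64 concurrent sequences at 131,072 tokens each need 1,024 GiB, which suggests why a flash tier for context memory becomes attractive.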

If there’s good news, it’s that most organizations neither operate at Netflix-scale when it comes to their data architectures, nor have they built out highly customized systems and pipelines over the course of nearly two decades. This makes it easier to invest in a future data stack that’s designed for the unique demands of AI workloads, and to reap the operational rewards that come with it.

In Part 2 of this deep dive, we’ll explain how AI is forcing data architectures to collapse and data pipelines to shrink, while simultaneously turning the desires for multi-cloud, edge, and chip-agnostic environments into reality. And we’ll dive into how VAST helps make this all possible.
