Jun 30, 2026

The End of Data Duplication as a Design Principle

Authored by

Jim Crook - Director, Corporate Communications

Organizations can provision thousands of GPUs on demand, fine-tune frontier models, and deploy applications across multiple cloud providers in a matter of hours.

Yet many of those same organizations still manage data by making copy after copy after copy.

During a recent talk on AI infrastructure, one statistic captured the contradiction: the average enterprise reportedly maintains 13 copies of every dataset.

Whether the number is exactly right is almost beside the point. The industry has become extraordinarily good at making compute mobile while continuing to treat data duplication as a fact of life.

That assumption is now being challenged.

The Cost of Data Movement

Most enterprise AI pipelines still rely heavily on replication. Data is copied into data lakes, analytics environments, training clusters, inference platforms, vector databases, and backup systems. Each copy introduces additional storage consumption, governance overhead, synchronization challenges, and operational complexity.

The common response has been to make those pipelines faster. VAST Cloud GM Jonsi Stefansson made a more provocative argument: organizations may be optimizing the wrong thing. As Stefansson put it:

"The answer is not building faster pipelines to move these copies. The answer is to stop the damn copies altogether."

That idea reflects a broader shift underway across AI infrastructure.

Moving data to compute has been, to date, the only practical option. Today, the economics are changing. GPU resources can be provisioned dynamically across multiple providers. Network connectivity continues to improve. Global-scale distributed systems are becoming more sophisticated.

As a result, the question is shifting: should organizations move data to compute, move compute to data, or avoid moving either whenever possible?

Stefansson’s view is that rather than treating every environment as a separate island requiring synchronization, the goal should be to create a shared view of data that spans clouds, datacenters, and edge locations.

Figure: A slide from Stefansson’s session illustrating Polaris, VAST’s control plane, which offers centralized intelligence with distributed execution.

If successful, VAST’s cloud approach could fundamentally change how AI infrastructure is planned.

A Different Architectural Assumption

The discussion wasn't ultimately about storage architecture. It was about collapsing the growing number of systems that have accumulated around AI pipelines.

In many environments, organizations still move data between object stores, processing engines, SQL databases, vector databases, and inference platforms. Every transition introduces additional orchestration, synchronization, and governance requirements.

The vision presented during the session was notably different: a unified data platform capable of serving as the system of record, processing engine, and retrieval layer simultaneously.

That includes the ability to contextualize and vectorize unstructured data in place rather than exporting data into separate vector stores and retrieval systems.

The underlying idea is simple: if AI pipelines increasingly depend on the same data, why maintain separate systems for storing, processing, indexing, and serving it?

Infrastructure Becomes an Optimization Layer

A second theme emerged as Stefansson shifted from data architecture to orchestration.

Many organizations still manage infrastructure through a series of manual decisions. Operators determine where workloads should run, where data should be replicated, when clusters should expand, and how resources should be allocated.

The roadmap Stefansson described during the session points toward a different model.

Rather than administrators making those decisions directly, infrastructure increasingly becomes a system that continuously optimizes itself against business objectives.

One example involved a scheduler that evaluates GPU availability, pricing, capacity requirements, data locality, and transfer costs across multiple cloud providers.

Until recently, the interesting part might have been the comparison of cloud pricing itself. Today, it’s the implication that workload placement could become a policy decision rather than an operational task.

A team might instruct infrastructure to optimize for performance this week and cost next week. The system would determine where workloads should run and how data should be made available.
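To make the idea of placement as a policy concrete, here is a minimal Python sketch. It is not VAST's scheduler or any real product API: the `Site` and `Job` types, the `place` function, and every price, bandwidth, and capacity figure are hypothetical and chosen only for illustration. It also assumes identical GPU types at every site, so run time depends only on GPU count. The sketch filters out sites that lack capacity, estimates cost and completion time for the rest (including the cost and time of moving data that is not already local), and picks a site according to the stated objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A hypothetical candidate location for a workload."""
    name: str
    gpus_available: int
    gpu_hour_price: float       # USD per GPU-hour (illustrative)
    data_is_local: bool         # True if the dataset already lives here
    transfer_price_per_gb: float  # USD per GB to bring data to this site
    link_gbps: float            # usable bandwidth into this site


@dataclass(frozen=True)
class Job:
    gpus_needed: int
    gpu_hours: float            # total GPU-hours of work
    dataset_gb: float


def estimate(site: Site, job: Job) -> tuple[float, float]:
    """Return (total cost in USD, completion time in hours) for a job at a site."""
    transfer_gb = 0.0 if site.data_is_local else job.dataset_gb
    cost = job.gpu_hours * site.gpu_hour_price + transfer_gb * site.transfer_price_per_gb
    # GB -> gigabits, divided by Gbps gives seconds, then convert to hours.
    transfer_hours = transfer_gb * 8 / site.link_gbps / 3600
    run_hours = job.gpu_hours / job.gpus_needed
    return cost, transfer_hours + run_hours


def place(job: Job, sites: list[Site], objective: str) -> Site:
    """Pick the feasible site that minimizes cost or completion time."""
    metric = {"cost": 0, "performance": 1}[objective]
    feasible = [s for s in sites if s.gpus_available >= job.gpus_needed]
    if not feasible:
        raise ValueError("no site has enough GPUs for this job")
    return min(feasible, key=lambda s: estimate(s, job)[metric])


# Illustrative inputs only; none of these figures come from the talk.
sites = [
    Site("data_home", 16, 4.00, True, 0.00, 100.0),
    Site("cheap_cloud", 64, 2.00, False, 0.08, 10.0),
    Site("edge", 4, 1.50, False, 0.02, 1.0),  # too few GPUs, filtered out
]
job = Job(gpus_needed=8, gpu_hours=800.0, dataset_gb=1000.0)

print(place(job, sites, "cost").name)         # cheap_cloud
print(place(job, sites, "performance").name)  # data_home
```

With these made-up numbers, the same job lands on the cheaper remote site when the policy is cost and stays next to its data when the policy is performance, because the transfer adds a small delay. The operator changes one word, not a deployment plan.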

That concept appeared several times throughout the discussion.

Stefansson described a future state in which orchestration systems automatically decide whether to route data to compute or compute to data based on workload requirements and economics. Later, when discussing future capabilities, he outlined a vision of "an autonomous AI agent that manages a global cache by predicting needs and optimizing data placement policy algorithms based on physics, economics, and supply chain constraints."

The specifics may still be evolving, but the direction is notable.

The Rise of the Infrastructure Control Plane

Perhaps the clearest sign of this shift was the emphasis on abstraction.

One of the recurring messages throughout Stefansson’s talk was that infrastructure should become less visible to application teams.

"It's about abstracting the control away from the infrastructure," he explained.

That philosophy is becoming common across modern platforms. Kubernetes abstracted servers. Public cloud abstracted datacenter hardware. AI infrastructure platforms are now attempting to abstract data placement, lifecycle management, governance, and orchestration.

The end goal is not simply operational efficiency. It is organizational scalability.

As AI initiatives expand across multiple clouds, regions, and teams, few organizations can afford to manage every deployment, replication policy, and optimization decision manually.

The control plane increasingly becomes the product, and that ultimately may be the most significant takeaway from this session.

The AI infrastructure conversation is often framed around faster GPUs, larger clusters, or bigger models. Those advances matter. But beneath them, a quieter shift is underway.

The next generation of infrastructure platforms is being designed around two assumptions: data should not need to move nearly as much as it does today, and humans should not need to manage every infrastructure decision directly. Whether the industry fully realizes that vision remains to be seen.

But if these assumptions prove correct, the future of AI infrastructure may depend less on where data lives and more on how intelligently systems make that question irrelevant.
