May 14, 2025

HPC’s Straight Lines Can't Cross AI’s Data Chaos

Nicole Hemsoth Prickett


For decades, HPC sites ran on a single, unshakeable assumption: data moves in straight lines. 

Massive files flowed through massive pipes into massive compute nodes, and parallel file systems were the arteries keeping the whole system pumping. 

Sequential IO was the gold standard, and everything was engineered around it. Big blocks, big transfers, big throughput.

But AI data doesn’t move in straight lines. It’s a relentless churn of randomness, a constant loop of ingestion, training, inference, retraining—each step generating more data, more chaos, more noise. 

What VAST Data’s Jan Heichler described at a recent HPC Advisory Council meeting is less a pipeline than it is a feedback loop. 

And HPC’s parallel file systems, built to move structured blocks of data from disk to node and back again? Heichler says they’re drowning in the flood.

As he explained to the crowd gathered at CSCS, the European supercomputing hub, in the old world of HPC, data was static, predictable, and finite. You knew what was coming in, how long it would take to process, and when it would be done. Climate simulations, genome sequencing, crash testing–all of these followed the same pattern: ingest data, run the model, spit out the result.

But AI factories don’t follow that pattern. They are relentless. Data doesn’t just flow in and out. It ricochets around the system, spawning new datasets, new training sets, new inferences, new transactions–each iteration generating more data, which in turn feeds back into the pipeline. 

The data lifecycle in AI is a hurricane. And parallel file systems? They were built for a gentle breeze.

As Heichler explained, HPC’s reliance on large-block, sequential IO made sense when data was neatly staged and stored in predictable chunks. But AI is random by design. Datasets are fragmented, data flows are unpredictable, and every data point is a potential input for the next training run. 

The very concept of “sequential IO” becomes meaningless when every data access is a scattershot read across hundreds of tiny files, each demanding immediate access.
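
To make that contrast concrete, here is a minimal sketch of the two access patterns, with purely hypothetical paths and sizes: one big file streamed in large blocks the way an HPC checkpoint would be, versus thousands of tiny sample files read in random order the way a training data loader does.

```python
import os
import random
import time

# Hypothetical paths, purely illustrative.
BIG_FILE = "/scratch/sim/checkpoint.dat"   # one large, contiguous file (classic HPC)
SHARD_DIR = "/scratch/train/shards"        # many small sample files (AI training data)

def sequential_read(path, block_size=8 * 1024 * 1024):
    """Classic HPC pattern: stream one big file in large, aligned blocks."""
    with open(path, "rb") as f:
        while f.read(block_size):
            pass

def random_small_reads(shard_dir, sample_count=10_000):
    """AI training pattern: scattershot reads of many tiny files in random order."""
    shards = os.listdir(shard_dir)
    for name in random.sample(shards, min(sample_count, len(shards))):
        with open(os.path.join(shard_dir, name), "rb") as f:
            f.read()  # each read is small, unpredictable, and latency-sensitive

if __name__ == "__main__":
    start = time.perf_counter()
    sequential_read(BIG_FILE)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    random_small_reads(SHARD_DIR)
    print(f"random small files: {time.perf_counter() - start:.2f}s")
```

A storage system tuned for the first pattern sees the second as a storm of metadata lookups and small reads, and that is exactly where the drowning starts.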

And the problem isn’t just that data is random. It’s that AI is greedy. 

In HPC, data sat in storage until it was needed. In AI, data is always in motion, always being processed, always being retrained. The same dataset that just finished training one model is immediately fed back into another. Every result generates more data, which in turn creates more training, which in turn creates more results. 

The pipeline isn’t just a pipeline—it’s a cycle. A factory. A machine that never shuts down.

HPC systems were never built to handle that kind of load. Parallel file systems were engineered to deliver massive, sequential data transfers as fast as possible. 

But AI data doesn’t arrive in massive blocks. It arrives in fragments, in tiny, random IO bursts that hit the storage layer in unpredictable waves, slamming the system with a barrage of asynchronous reads and writes. And if the data isn’t accessible instantly, the entire pipeline stalls.

In the AI world, latency isn’t a nuisance, it’s a killer. If data can’t be accessed immediately, training grinds to a halt, inference stalls, and the entire AI factory seizes up. 

Parallel file systems weren’t built for that kind of demand. They were built to handle big, predictable transfers, not the relentless, fragmented IOs that define AI pipelines.

So the AI builders moved on. They ditched tiering, shoved everything onto flash, and started treating every dataset—files, objects, vectors—as a single, unified pool of data that had to be accessible all the time. No staging. No waiting. Just immediate access across every layer of the pipeline. And every layer is generating more data, more IOs, more pressure on the system.

Meanwhile, as Heichler told the crowd of supercomputing practitioners, HPC keeps talking about optimizing sequential IO, as if AI workloads will someday magically conform to a single, predictable data flow. They won’t. AI pipelines are inherently chaotic. 

As Heichler said, data doesn’t just move in one direction—it’s constantly folding back on itself, reprocessing its own outputs, generating new datasets on the fly.

The AI Pipeline Isn’t a Flow—It’s a Factory That Never Shuts Down


Here is the pipeline for AI as Heichler described it for the audience. 

It all starts with petabytes of data—raw, unfiltered, unrefined. Unstructured chaos streaming in from every conceivable source: transaction logs, sensor feeds, clickstreams, event data. 

It’s a tidal wave of information, all of it waiting to be extracted, filtered, categorized, and compressed. 

This is the data capture phase, the bottomless pit where the data pipeline begins. And in the AI world, it’s not just a one-time process. Every transaction, every query, every inference generates more data, which gets fed right back into the pipeline. 

But raw data is useless without refinement, so enter data prep–a phase where CPUs grind through all that unstructured noise, transforming it into something coherent, something that can actually be fed into a model. 

Extract, transform, deduplicate. Strip out the junk, compress the essentials. This is where AI starts to diverge sharply from HPC.

In HPC, data arrives in neat, massive blocks, ready to be processed. In AI, data doesn’t arrive. It accumulates, constantly, in shards and fragments, each a potential training set or inference point. 

Data prep is the refinery where that noise gets distilled down to something GPUs can handle. And it doesn’t happen once. It’s perpetual.
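
What that refinery does is mundane but relentless. Here is a rough sketch of a single extract, transform, and deduplicate pass; the record fields are illustrative assumptions, not any particular tool’s schema.

```python
import hashlib
import json

def prep_records(raw_lines):
    """One extract/transform/deduplicate pass over raw event logs.
    The field names ('text', 'ts') are illustrative assumptions."""
    seen = set()
    for line in raw_lines:
        try:
            record = json.loads(line)                  # extract: parse the raw event
        except json.JSONDecodeError:
            continue                                   # strip out the junk
        text = record.get("text", "").strip()
        if not text:
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                             # deduplicate on content hash
            continue
        seen.add(digest)
        yield {"text": text, "ts": record.get("ts")}   # keep only the essentials
```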

Now, the refined data moves to model training, the place HPC people still think of as the heart of AI. And sure, it’s critical. This is where GPUs devour terabytes of structured datasets, smashing through training cycles, running model after model to extract patterns, insights, predictions.

But here’s the thing: in AI, the model is never finished. There’s no end point. Every model training run generates new data, new results, new feedback, which get fed right back into the pipeline. It’s a self-sustaining loop where data isn’t just consumed—it’s produced. And that production doesn’t stop once the model is trained.
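
Boiled down to a schematic, that loop looks something like this; the callables are placeholders for whatever framework a shop actually uses, not a real API.

```python
def training_factory(dataset, model, train, evaluate, prep):
    """Schematic only: each pass produces results that become new training data.
    The callables are placeholders, not any particular framework's API."""
    while True:                            # the factory never shuts down
        model = train(model, dataset)      # GPUs churn through the current dataset
        results = evaluate(model)          # inferences, evals, user interactions...
        dataset = dataset + prep(results)  # ...fold back in as new training data
```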

Because now we’re moving to quantization. The bloated, overfed and power-hungry trained model needs to be slimmed down for inference. 

The data pipeline doesn’t just compress raw data; it compresses models themselves, turning multi-terabyte monstrosities into quantized, inference-ready versions. These are still powerful, still packed with insight, but now they’re portable, scalable, ready to be deployed across CPUs, GPUs, whatever hardware is on hand.
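
As one illustration of what that slimming down can mean mechanically, here is a sketch using PyTorch’s post-training dynamic quantization. The model, its sizes, and the file name are stand-ins, and production pipelines often reach for other compression schemes entirely.

```python
import torch
import torch.nn as nn

# A stand-in model; a real trained model would be orders of magnitude larger.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Post-training dynamic quantization: weights are stored as int8 and
# activations are quantized on the fly, cutting the checkpoint size roughly 4x
# relative to fp32 while keeping the model usable for CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "model-int8.pt")  # the inference-ready artifact
```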

But the pipeline isn’t done yet. The quantized model enters model serving, where the actual AI work happens. Every query, every inference request, every transaction spins off more data. A single request might generate a few kilobytes of inference output, but multiply that by millions of queries per second and suddenly you’re staring at another petabyte-scale data mountain.

And that data? It doesn’t just vanish. It gets logged, audited, traced, and fed right back into the pipeline. 

Every response is a potential training set, every result a new dataset to be refined, quantized, and deployed again. The pipeline doesn’t flow—it loops. And it’s here that the division between HPC and AI sharpens further. 

And now to the transaction log store. Every data point, every response, every inference gets logged here, a relentless stream, all of it potential training data.
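
A minimal sketch of that loop, with a flat JSONL file standing in for whatever log store, message queue, or object bucket a real deployment would use:

```python
import json
import time
import uuid

LOG_PATH = "/data/txlog/inference.jsonl"   # hypothetical append-only log location

def log_inference(prompt, response, model_version):
    """Append one inference transaction; every record is a candidate training example."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model_version,
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def harvest_for_retraining(log_path=LOG_PATH):
    """Later, the same log is read back as raw input for the next prep and training cycle."""
    with open(log_path) as f:
        for line in f:
            yield json.loads(line)
```

The point isn’t the plumbing; it’s that every record written here eventually shows up again at the top of the pipeline as raw input.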

And through it all, a Zero Trust security layer weaves itself around every step. Because in AI, the data isn’t just flowing through a single pipeline. It’s looping through multiple systems, crossing hardware boundaries, hitting object stores, databases, training clusters, and inference engines. 

As one might imagine, much of this does not look familiar to HPC shops. 

In the kind of world described above, the old HPC storage model isn’t just outdated or short of being “AI ready”, it’s an active bottleneck. 

A system built to handle big blocks of data moving predictably through a pipe can’t handle hundreds of thousands of tiny IOs hitting the storage layer all at once, demanding immediate access. It’s like trying to funnel a waterfall through a garden hose.

So what happens to HPC’s parallel file systems? They’re still there and will be for a long time to come, pushing massive, sequential blocks of data through pipes that AI builders aren’t using anymore.

But in the AI world, parallel file systems have been pushed into a corner: an outdated, static backwater where big files sit and wait for compute nodes that no one is using.

And the AI builders? They’re building factories. 

They’re treating storage as the center of the architecture, not the afterthought. 
