Your organization has invested millions in state-of-the-art GPUs. You have brilliant data scientists ready to build the next breakthrough model. But your training jobs are slow, expensive, and frustratingly inefficient. Sound familiar?
The uncomfortable truth is that your GPUs are likely spending most of their time idle, waiting for data. They are starving. This isn't a GPU problem; it's a data problem. The architectures that powered yesterday’s data analytics are fundamentally unsuited for the intense, random I/O demands of today’s AI workloads.
The Random-Read Gauntlet of AI Training
At the heart of training most deep neural networks (DNNs) is an algorithm called Stochastic Gradient Descent (SGD). To train a model, the algorithm feeds it a small, randomly selected batch of data, calculates the error, adjusts the model, and repeats.
In machine learning, training involves multiple passes over a dataset, known as “epochs.” During each epoch, the data is randomly shuffled and split into new batches. This randomization is a critical best practice that forces the model to learn a dataset's underlying features instead of just memorizing the data's order, thereby improving how it generalizes to new, unseen data. While modern LLMs might only train for an epoch or two, this technique remains essential for models like Convolutional Neural Networks (CNNs), which often require many epochs.
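The loop described above — random batches, an error calculation, a parameter update, and a fresh shuffle every epoch — can be sketched in a few lines. This is a toy NumPy example with a made-up linear model, not production training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1,000 samples, 5 features, linear target plus a little noise.
X = rng.standard_normal((1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(5)              # model parameters
lr, batch_size = 0.05, 32

for epoch in range(5):
    # Reshuffle every epoch so each pass sees the data in a new order.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # a *random* batch of rows
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(idx)
        w -= lr * grad                           # adjust the model, repeat
```

Note the access pattern: every batch indexes a random subset of rows, and the subsets change on every epoch. That randomness is exactly what storage systems see as scattered small reads.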
However, this constant shuffling creates a brutal I/O pattern for the underlying storage system. To circumvent this bottleneck, some benchmarks (often with encouragement from GPU manufacturers) blatantly cheat by serializing complex datasets into a single large file using formats like TFRecords or RecordIO. This workaround makes it easier for legacy parallel file systems, such as Lustre, to read the data sequentially. The approach has become common because these older systems struggle to handle the raw, randomly accessed small files that are native to the datasets.
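The packing trick itself is simple. The sketch below shows the core idea in plain Python — many small blobs serialized into one length-prefixed stream that can be read sequentially. It is a simplified stand-in; real TFRecord/RecordIO formats add checksums and structured payloads:

```python
import struct, io

def pack_records(blobs):
    """Serialize many small byte blobs into one length-prefixed stream,
    so storage sees a single large sequential file instead of millions
    of tiny random reads. (Simplified stand-in for TFRecord/RecordIO.)"""
    buf = io.BytesIO()
    for blob in blobs:
        buf.write(struct.pack("<Q", len(blob)))  # 8-byte length header
        buf.write(blob)
    return buf.getvalue()

def unpack_records(data):
    """Read the packed stream back, front to back, one record at a time."""
    offset, out = 0, []
    while offset < len(data):
        (length,) = struct.unpack_from("<Q", data, offset)
        offset += 8
        out.append(data[offset:offset + length])
        offset += length
    return out

samples = [b"image-0", b"image-1", b"image-2"]
assert unpack_records(pack_records(samples)) == samples
```

The catch, of course, is that this sequential layout fights the random shuffling that SGD wants: shuffling now means either re-serializing the file or shuffling only within whatever window fits in memory.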
The data loader isn't reading a large file from start to finish; it's frantically “plucking” thousands of random rows or files from all over your dataset. This is the polar opposite of the sequential, full-column scans that traditional data lake formats were designed for.
Where Cloud Data Lakes Fall Short
Modern data lake architectures, often using table formats like Apache Iceberg over Parquet files, are powerful for SQL-based analytics. They store data in large, columnar files, perfect for calculating the average of one column across billions of rows.
However, for AI training, this is a disastrously inefficient model, as we examined in a previous blog. To retrieve a single random row for an SGD batch, the system might have to download and read a massive multi-megabyte Parquet file chunk that contains thousands of other rows it doesn't need. This is known as read amplification, and it's a primary cause of I/O bottlenecks. You end up reading terabytes from storage just to use gigabytes, all while your expensive GPUs sit idle.
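Some illustrative arithmetic makes the problem concrete. The sizes below are hypothetical, but the shape of the math holds for any columnar format with large readable chunks:

```python
# Illustrative arithmetic (hypothetical sizes): how much data a training
# job reads from storage versus how much it actually uses per batch.
row_bytes   = 1_000        # one training sample ≈ 1 KB
chunk_bytes = 8_000_000    # smallest readable Parquet chunk ≈ 8 MB
batch_size  = 256          # random rows per SGD batch

# Worst case: each random row lands in a different chunk, so the loader
# must fetch a full multi-megabyte chunk to use a single kilobyte of it.
bytes_used    = batch_size * row_bytes
bytes_read    = batch_size * chunk_bytes
amplification = bytes_read / bytes_used
print(f"read amplification: {amplification:,.0f}x")   # 8,000x in this example
```

Multiply that by millions of batches per training run and the "terabytes read to use gigabytes" outcome follows directly.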
The VAST Advantage: Built for AI, Not Just Analytics
This is precisely the problem the VAST AI Operating System was engineered to solve. Our revolutionary Disaggregated Shared-Everything (DASE) architecture eliminates the trade-offs of legacy systems.
It all comes down to our fine-grained, symmetrical architecture. Instead of large, monolithic files, VAST breaks data down into much smaller, granular chunks, known as Elements, and understands the relationships between them. When your data loader asks for a random set of rows or images for a batch, VAST can retrieve just the specific, tiny pieces of data required, no matter where they live on our all-flash media.
For Unstructured Images/Video: Here the challenge is reading millions of small files at random. On VAST, accessing that data via our high-performance NFS or S3 protocols delivers the same blistering performance; they are simply different “front doors” to the same unified, all-flash data store.
For Tabular/Time-Series DNNs: VAST dramatically reduces read amplification for random reads, meaning your data loader gets the exact data it needs with minimal overhead, allowing it to keep up with the GPU's insatiable appetite.
Supercharging Tabular Workflows with VAST DataBase 📊
For organizations working with massive tabular datasets, we've taken this a step further with VAST DataBase, a database built for the AI era and designed to fuse your data lake and data warehouse into a single, high-performance platform.
At petabyte scale, moving data from a data lake to a specialized database for analysis and then to a file system for AI training is impossibly slow and expensive.
VAST DataBase eliminates this entire process.
No More ETL: You can run fast SQL queries and complex AI/ML training on the same data, in the same place. VAST DataBase brings database functionality like ACID transactions, time travel, and schema enforcement directly to your data lake. This means you can run interactive SQL queries with Trino or large-scale data processing with Spark on the exact same datasets, at the same time that your data scientists are doing random-read model training. This concurrent, no-compromise access eliminates data silos and the need to move or reformat a single byte.
Performance for Both Worlds: Because VAST DataBase is built on the DASE architecture, it's uniquely capable of excelling at both workloads simultaneously. It can service the wide, columnar scans of a complex SQL query and, in the next moment, deliver the narrow, random row selections needed for SGD training with unparalleled efficiency. This dual-workload optimization makes VAST DataBase the ideal foundation for any large-scale tabular data strategy.
With VAST, Do You Still Need Tools Like MosaicML?
This leads to a critical question. If tools like MosaicML's StreamingDataset were designed to work around the limitations of slow cloud storage, do you still need them with a high-performance platform like VAST?
For the most part, no. A primary motivation for these tools was creating a software workaround for high latency and poor random-read performance. Think of StreamingDataset as a sophisticated off-road suspension system designed to make a bumpy, slow road feel smooth.
The VAST AI OS, however, is like paving that bumpy road into a smooth, multi-lane highway.
Because VAST delivers data with extremely high speed and low latency, your data loader doesn't need a complex suspension system. It can simply ask for the data it needs and get it immediately, removing the core performance problem that these software workarounds were built to solve.
That said, these libraries can still offer value at the software level:
Deterministic Resumption: They are excellent at tracking training progress, allowing a failed job to restart from the exact sample it left off on.
Workflow Portability: They provide a consistent abstraction layer for teams working in hybrid environments across on-prem VAST systems and public cloud buckets.
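Deterministic resumption, for instance, is a pure software feature: if the shuffled order is reproducible from a seed, a restarted job can skip straight to the sample it left off on. The sketch below illustrates the idea; it is not MosaicML StreamingDataset's actual implementation, and all names are hypothetical:

```python
import random

class ResumableSampler:
    """Toy sketch of deterministic resumption: the same seed and epoch
    always yield the same shuffled order, so a restarted job can resume
    at the exact sample where it failed. (Illustrative only.)"""

    def __init__(self, num_samples, seed=0):
        self.num_samples, self.seed = num_samples, seed

    def order(self, epoch):
        # Mix seed and epoch so every epoch gets a distinct but
        # reproducible shuffle.
        rng = random.Random(self.seed * 100_003 + epoch)
        idx = list(range(self.num_samples))
        rng.shuffle(idx)
        return idx

    def resume(self, epoch, samples_seen):
        """Yield the remaining indices of `epoch` after a restart."""
        yield from self.order(epoch)[samples_seen:]

sampler = ResumableSampler(num_samples=10, seed=42)
full = sampler.order(epoch=3)
resumed = list(sampler.resume(epoch=3, samples_seen=4))
assert resumed == full[4:]   # the restart picks up exactly where it stopped
```

Nothing in this logic depends on storage performance, which is why it remains useful even when the underlying I/O bottleneck is gone.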
The crucial difference is that with VAST, using these tools shifts from being a performance necessity to being a workflow choice. You are free to use them for their software features, not because you're forced to by slow infrastructure.
From Workarounds to True AI Infrastructure
Stop starving your AI. The era of repurposing analytics infrastructure for deep learning is over. To win, you need a foundational data platform that delivers performance without compromise. Key strengths such as 100% online operations, an extraordinarily sophisticated security architecture, and full multi-protocol access for file, object, and structured tabular data give you the freedom to build the best, most resilient AI pipeline for your team.
