The End of “Bad I/O”

Why AI workloads are forcing a fundamental rethink of storage architecture, GPU utilization, and data movement.

Authored by

Jim Crook - Director, Corporate Communications

For decades, the HPC community has operated under a stern, almost puritanical doctrine: if your application exhibits "bad I/O," it is a failure of the researcher. The data infrastructure was a rigid temple. Users were expected to optimize their code (aligning blocks, smoothing out randomness, and buffering small writes) before they were allowed to touch the storage fabric. But as we move into the era of massive-scale AI, this paradigm is collapsing.

Workloads that load millions of small images and issue unpredictable metadata requests are now the standard. As VAST CTO Alon Horev and Field CTO Sven Breuner discussed during a recent conference session, the industry is shifting from a world where we optimize the application for the storage to a world where we must architect the storage to gracefully handle the chaos of the application.

According to Breuner (who is also the creator of BeeGFS and knows a thing or two about architecting high-performance file systems), it’s about accommodating the reality of modern research. "We cannot assume that the users are super eager to spend a lot of time on optimizing their I/Os. It’s unavoidable that there are bad I/O patterns," he said.

This shift has profound architectural implications. When "bad" patterns are the norm, the infrastructure must stop relying on the user’s ability to create sequential, large-block transfers. It requires a system that treats a four-kilobyte metadata request with the same urgency and efficiency as a multi-gigabyte stream.

VAST’s Disaggregated and Shared Everything (DASE) architecture, as Horev and Breuner explained, enables such systems by allowing any protocol server to handle any request without the cross-talk or metadata locking that traditionally turned small I/O into a system-wide crawl.

The Linux Page Cache: The Invisible Performance Engine

The modern data platform engineer might not be familiar with the Linux Page Cache, but as Horev explained, it functions as a high-speed application buffer that abstracts away the physics of the network. Every time an application opens a file for a standard read or write (that is, without requesting direct I/O), it is performing buffered I/O, which means it interacts with local RAM first rather than directly with the storage media.

Understanding these Page Cache mechanics is critical because they create a massive performance gap between "synthetic benchmarks" and "application reality," a gap that transforms the user experience. While a benchmark using direct I/O might be capped at 12,000 IOPS by the round-trip latency of the network, an application using buffered I/O can achieve over 600,000 IOPS because its data is already waiting in local memory.

Direct I/O is network-limited; buffered I/O serves data at memory speed.
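
The two IOPS figures above can be turned into per-operation latencies with some simple arithmetic. The sketch below assumes requests are issued one at a time (queue depth 1), which the session summary does not state explicitly; the function names are illustrative only, and "over 600,000" is treated as exactly 600,000 for the calculation.

```python
# Back-of-the-envelope model: with a fixed number of outstanding requests,
# IOPS is the queue depth divided by the per-operation latency.
# ASSUMPTION: queue depth 1 (one request in flight at a time).

def latency_us_from_iops(iops: float, queue_depth: int = 1) -> float:
    """Per-operation latency in microseconds implied by an IOPS figure."""
    return queue_depth / iops * 1_000_000


def iops_from_latency_us(latency_us: float, queue_depth: int = 1) -> float:
    """IOPS achievable at a given per-operation latency in microseconds."""
    return queue_depth / (latency_us / 1_000_000)


direct_iops = 12_000      # direct I/O figure quoted in the article
buffered_iops = 600_000   # buffered I/O figure quoted in the article (lower bound)

direct_latency = latency_us_from_iops(direct_iops)
buffered_latency = latency_us_from_iops(buffered_iops)
speedup = buffered_iops / direct_iops

print(f"direct I/O:   ~{direct_latency:.1f} us per operation")
print(f"buffered I/O: ~{buffered_latency:.2f} us per operation")
print(f"ratio:        {speedup:.0f}x")
```

Under that assumption, 12,000 IOPS corresponds to roughly 83 microseconds per operation, which is in the range of a network round trip, while 600,000 IOPS corresponds to under 2 microseconds, which is only plausible when the data is served from local memory.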

While the Page Cache is "always on," it isn't always optimized for high-performance scale-out architectures. Modern distributions often ship with a default read-ahead value of 128 KB, a setting better suited for a Raspberry Pi than an AI training node, Horev suggested. VAST and the broader Linux community recommend cranking this value up to several megabytes (e.g., 16 MB).

"We have customers who saw over a 10x improvement just by increasing the read-ahead value," Breuner said. "It allows the kernel to be much more aggressive in saturating the high-bandwidth links we have today."
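
To see why the read-ahead size matters, consider a deliberately simplified model in which each fetch from storage covers exactly one read-ahead window. The real kernel grows and shrinks its window adaptively, so treat the numbers below as an illustration rather than a prediction. On Linux the setting is exposed as a `read_ahead_kb` file in sysfs (under `/sys/block/<device>/queue/` for block devices and `/sys/class/bdi/<major:minor>/` for many network mounts); which path applies to a given system is an assumption to verify locally, so the sketch parses a stand-in temporary file instead of a real one.

```python
import math
import os
import tempfile

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def requests_needed(file_bytes: int, read_ahead_bytes: int) -> int:
    """Fetches needed to stream a file if each fetch covers one window.

    Simplified model: the real kernel adjusts the window dynamically.
    """
    return math.ceil(file_bytes / read_ahead_bytes)


def read_ahead_kb(sysfs_file: str) -> int:
    """Parse a read_ahead_kb-style file, which holds one integer (KiB)."""
    with open(sysfs_file) as f:
        return int(f.read().strip())


small = requests_needed(1 * GIB, 128 * KIB)  # 128 KB default from the article
large = requests_needed(1 * GIB, 16 * MIB)   # 16 MB example from the article
print(f"128 KiB window: {small} fetches per GiB")
print(f"16 MiB window:  {large} fetches per GiB")

# Exercise the parser on a stand-in file so the sketch runs on any machine.
with tempfile.NamedTemporaryFile("w", suffix="_read_ahead_kb", delete=False) as tmp:
    tmp.write("128\n")
current_kb = read_ahead_kb(tmp.name)
os.unlink(tmp.name)

target_kb = 16 * MIB // KIB  # value to write for a 16 MiB window
print(f"current: {current_kb} KiB, target: {target_kb} KiB")
```

In this model a 1 GiB sequential read drops from 8,192 network fetches to 64, which gives the kernel far more data in flight per round trip. Writing a new value to the real sysfs file normally requires root privileges.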

Too Much GPU, Not Enough Data Path

In the hierarchy of performance, GPUDirect Storage (GDS) represents an architectural "shortcut" on the client: it uses RDMA to move data from the NIC directly into GPU memory. While standard buffered I/O is the workhorse for most applications, GDS is the high-performance bypass used when the traditional CPU and system memory (SYSMEM) path becomes a hard bottleneck.

In other words, GDS becomes highly useful when your GPU can consume data faster than the CPU can deliver it. Perhaps the most compelling reason to use it is the recovery of computational overhead. Without GDS, the GPU’s Streaming Multiprocessors, the core engines of AI compute, can stall simply managing the data transfer handshake.

"If we're using the classic staging approach," says Breuner, "then the streaming multiprocessors are basically stalled 18 percent of the time. With GDS, they are blocked 0 percent of the time because the network card is doing all of the work."

By offloading the data movement to the NIC, GDS ensures the GPU remains a pure compute engine rather than a data-moving clerk, effectively "buying back" nearly a fifth of your cluster's processing power.
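
The "nearly a fifth" figure can be worked through with a short calculation. The sketch below assumes that stall time translates linearly into lost compute, and it uses a hypothetical 1,000-GPU cluster purely for illustration; neither the cluster size nor the function names come from the session.

```python
# ASSUMPTION: time the streaming multiprocessors spend stalled is time
# lost to compute, and the relationship is linear.

def useful_fraction(stall_fraction: float) -> float:
    """Share of time the GPU spends on actual compute."""
    return 1.0 - stall_fraction


def throughput_gain(stall_before: float, stall_after: float) -> float:
    """Relative throughput after reducing the stall fraction."""
    return useful_fraction(stall_after) / useful_fraction(stall_before)


staged_stall = 0.18   # classic staging path, as quoted by Breuner
gds_stall = 0.0       # GDS path, as quoted by Breuner
cluster_gpus = 1_000  # hypothetical cluster size, for illustration only

idle_equivalent = cluster_gpus * staged_stall
gain = throughput_gain(staged_stall, gds_stall)

print(f"stalled capacity: ~{idle_equivalent:.0f} GPU-equivalents of {cluster_gpus}")
print(f"throughput gain:  {gain:.2f}x")
```

Under these assumptions, an 18 percent stall on 1,000 GPUs is equivalent to about 180 GPUs sitting idle, and removing it raises useful throughput by roughly 22 percent relative to the staged baseline (1 / 0.82).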

A New Architectural Mandate

For senior leaders, the question is no longer "how fast is the storage?" but rather "how much compute can we unblock?" Horev and Breuner believe organizations have the tools to build more efficient engines of discovery.

Learn more about VAST's philosophy and technical approach to ensuring sustained, real-world performance for the most demanding data-intensive workloads:
