Apr 17, 2025

What Breaks at Scale: The S3 Illusion and the Physics of Infrastructure

Authored by

Nicole Hemsoth Prickett, Head of Industry Relations

The funny thing about object storage is how easy it looks at first.

It’s got that IKEA effect—you assemble it yourself, it holds stuff, and you get this weird glow of competence because it feels like infrastructure without making you bleed like infrastructure.

You point an application at your shiny new S3 endpoint—open source, friendly, stateless, seemingly infinite—and everything just works.

Until it doesn’t.

VAST has had a front-row seat to this unfolding drama more times than can be counted. It starts with ambition and clean YAML, and ends with white-knuckled ops teams whispering “never again” into their coffee mugs.

At around five petabytes, things look fine. At 20, quirks surface—little flutters of latency, odd behavior under load, a few inexplicable restarts.

And at 50 or 100 petabytes, the architecture begins to reveal its actual shape, which is often a layer cake of compromises held together by good intentions and best-effort consistency.

Then, at, say, 200PB—if you’re brave or unlucky enough to push that far—it all starts to collapse under its own (philosophical) weight.

One favorite anecdote, if you can call near-disaster an anecdote, came from a customer who had bet heavily on one of these platforms. They believed in the dream. They tuned, they optimized, they scaled, they threw really smart engineers at it. Yet still, when the cluster grew past a certain threshold, performance just flatlined. Job completion times went non-linear. Simple object operations became dangerous. Engineers began disabling features to keep things online.

You know it’s bad when a system that was supposed to “just work” starts demanding operational rituals like an angry deity.

This customer now runs on VAST. They’re still rebuilding trust, not in us, but in the idea that infrastructure can be both extremely high-performance and, well, boring. And that’s the bar, by the way. Not bells and whistles, not open source cred, not even cost efficiency in the spreadsheet sense.

The bar is that data infrastructure should never make your GPUs idle.

Because that’s the real issue with these not-to-be-named-but-very-much-known platforms: they’re good at storage until you ask them to serve the machine.

As we all know, GPUs do not forgive I/O bottlenecks. AI training workloads do not wait politely. When you’re checkpointing terabytes every few minutes, or pulling model state into memory at terabit speeds, your object store can’t just be S3-compatible; it can never be allowed to become the bottleneck. It has to be built for flash. Not tolerant of flash, not passably performant on SSDs. Actually, deliberately, from-the-ground-up optimized for NVMe-scale IOPS, not to mention concurrency.
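To put rough numbers on that (every figure below is an illustrative assumption, not a benchmark), here’s the back-of-envelope math a platform team ends up doing:

```python
# Back-of-envelope: what "checkpointing terabytes every few minutes"
# demands of storage. Every figure below is an illustrative assumption.

CHECKPOINT_SIZE_TB = 2.0       # assumed checkpoint size (model + optimizer state)
CHECKPOINT_INTERVAL_S = 300.0  # assumed cadence: one checkpoint every 5 minutes
WRITE_BUDGET_FRACTION = 0.10   # assume only 10% of the interval may be spent
                               # writing; GPUs stall while the checkpoint drains

budget_s = CHECKPOINT_INTERVAL_S * WRITE_BUDGET_FRACTION
required_gb_per_s = (CHECKPOINT_SIZE_TB * 1000.0) / budget_s

print(f"write window per checkpoint: {budget_s:.0f} s")
print(f"required sustained write throughput: {required_gb_per_s:.1f} GB/s")
# Roughly 67 GB/s under these assumptions. And that's one job's burst,
# before concurrent reads, replication traffic, or other tenants.
```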

The vendors that miss this end up reinventing the wheel in their customer support queues. And the ones that try to fake it with scale-out hacks or proxy-layer metadata schemes always run into the same wall: you can’t outsmart physics with abstraction.

But okay, fine, maybe performance isn’t your problem. Maybe you’re just trying to run a modern multi-tenant platform with shared infrastructure and some basic sanity around user isolation.

What happens then?

Well, in a lot of these same open S3 platforms, it turns out “multi-tenant” means “good luck.”

In these scenarios, usage visibility is an afterthought. Throttling is primitive, if it exists at all. And in the real world, there’s zero way to know who’s slamming your drives at 3am until someone from Finance comes asking why invoices jumped 30% last month.
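What “afterthought” looks like in practice is usually a script run the morning after, grinding through access logs. A minimal sketch follows; the JSON-lines schema and the “requester” and “bytes_sent” field names are hypothetical, since every platform emits its own variant, if any:

```python
# After-the-fact usage attribution. The log format here (JSON lines with
# "requester" and "bytes_sent" fields) is hypothetical; real S3-compatible
# platforms each emit their own variant, if they emit one at all.

import json
from collections import defaultdict

def bytes_by_tenant(log_path: str) -> dict[str, int]:
    """Sum bytes transferred per requester from a JSON-lines access log."""
    totals: dict[str, int] = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            totals[event.get("requester", "unknown")] += event.get("bytes_sent", 0)
    return dict(totals)

# The 3am forensics: run it the next morning, long after the damage is done.
# That's the problem in a nutshell: this is forensics, not enforcement.
for tenant, total in sorted(bytes_by_tenant("access.log").items(),
                            key=lambda kv: -kv[1]):
    print(f"{tenant}: {total / 1e9:.1f} GB")
```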

Oh, and forget cross-site replication. That’s usually DIY too—some mix of scripting, third-party tools, and…I don’t know, vibes.
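For the curious, that DIY scripting tends to look like the sketch below: a cron-driven, one-way, best-effort copy between two endpoints (the endpoints and bucket name are placeholders). Note everything it doesn’t handle.

```python
# The usual shape of DIY cross-site "replication": a cron-driven, one-way,
# best-effort copy. Endpoints and the bucket name are placeholders.

import boto3
from botocore.exceptions import ClientError

src = boto3.client("s3", endpoint_url="https://site-a.example.com")
dst = boto3.client("s3", endpoint_url="https://site-b.example.com")

def naive_replicate(bucket: str) -> None:
    """Copy anything missing or changed from src to dst. No versioning,
    no delete propagation, no conflict handling, no retry policy; it
    resembles replication right up until the first bad day."""
    for page in src.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            try:
                if dst.head_object(Bucket=bucket, Key=key)["ETag"] == obj["ETag"]:
                    continue  # present and (probably) identical
            except ClientError:
                pass  # missing at the destination; fall through and copy
            body = src.get_object(Bucket=bucket, Key=key)["Body"]
            dst.upload_fileobj(body, bucket, key)  # streams; doesn't buffer in RAM

naive_replicate("training-data")
```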

And yet customers routinely enter these environments thinking, “it’s S3, how hard can it be?”

Here’s the thing: Amazon made S3 look easy, but it’s the most battle-tested infrastructure surface on the planet. Mimicking the API isn’t the same as building a system that can actually do the job. It’s the difference between drawing a picture of a submarine and getting one to survive 4,000 meters below sea level.

So yes, we’ve seen customers come to us after pushing these platforms to a breaking point. And the pattern is always the same.

What worked yesterday doesn’t work at 200 petabytes.

What started as a storage project turns into a cost spiral, a stability nightmare, or an outright incident.

And that’s when they call us.

They’re not looking for flashy. They’re looking for stable. For quiet. For a system that doesn’t require belief or workaround culture or a new Slack channel for escalating metadata lock bugs.

They want something that just… works.

And if they’ve come from one of those open S3 platforms, they already know what not working looks like. They’ve lived it.

Seriously though. We’re not here to dunk. Everyone’s trying to solve hard problems. And many of these open platforms are richly engineered—at a certain scale, in a certain context. But real-world infrastructure is messy. Workloads sprawl. Data grows unpredictably.

You don’t get to hit pause and redesign at 80PB.

Which is why VAST doesn’t retrofit enterprise features onto a minimalist design.

VAST started from a completely different premise: that storage isn’t just storage anymore. It’s first-class infrastructure.

It’s a real-time, always-on, multi-tenant performance surface that needs to serve the fastest systems in the world without blinking.

We’re flash-native, because latency and throughput aren’t optional when you’re feeding GPUs. We replicate across protocols—S3, NFS, SMB, and beyond—because failure domains don’t respect interface boundaries.

We built multitenancy and workload isolation as primitives, not features. And why do we keep banging this drum? Because in modern infrastructure, noisy neighbors are a fact of life, not a corner case.
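To be concrete about what “primitive” means, the textbook building block is a per-tenant token bucket sitting in the data path. The sketch below is a generic illustration of that idea, not a description of VAST’s internals:

```python
# A generic illustration of per-tenant isolation as a data-path primitive:
# the textbook token bucket. Not a description of VAST's internals.

import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s   # steady-state budget
        self.capacity = burst_bytes    # how big a burst we forgive
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_consume(self, nbytes: int) -> bool:
        """Admit the request if this tenant has budget; refuse otherwise."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

# One bucket per tenant: tenant A's 3am bulk job drains A's budget, not B's.
buckets = {
    "tenant-a": TokenBucket(rate_bytes_per_s=1e9, burst_bytes=4e9),  # 1 GB/s
    "tenant-b": TokenBucket(rate_bytes_per_s=1e9, burst_bytes=4e9),
}

def admit(tenant: str, request_bytes: int) -> bool:
    return buckets[tenant].try_consume(request_bytes)
```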

More importantly, we’re not just claiming theoretical scale—we’re living it.

Our largest deployments are already in the hundreds of petabytes. They are active, hot, production workloads. All the hard stuff: AI training pipelines. Vehicle telemetry and video. Live inference and simulation.

And they run because the system doesn’t break. It adapts. It expands. It replicates. It balances. It hangs right the hell in there.

So no, we’re not here to out-market anyone. We’re not trying to out-minimalist the open source crowd. We’re just building the thing that has to exist when ambition becomes infrastructure and infrastructure meets the edge of what’s real.

So… call it object storage. Call it S3.

Just don’t call it finished until it survives at scale. Because in this business, everything works—until it doesn’t.

This is why customers like CoreWeave, Lambda, Jump Trading, and others have turned to VAST, not because they gave up on S3, but because they learned exactly where it shines… and where it falls apart.

They’ve felt the pain of chasing performance through layers of abstraction, of watching platforms buckle under the real-world pressure of AI-scale workloads. These are teams that push the edge by default, and they’ve realized that compatibility isn’t the same as capability. With VAST, they’ve found a foundation that scales without surprise, performs without ceremony, and stays quiet even when everything else gets loud.
