DeepSeek showed that in sparse models, the real delay isn’t compute but waiting on the right weights. In the age of expert routing, storage becomes the scheduler.
Somewhere in a datacenter, a high-end GPU sits idle.
Not because it lacks input or instructions or because it’s bandwidth starved, but because it is, quite literally, just waiting for a file.
And not even a very big file, perhaps just a few shards of a specific layer from a specific expert in a sprawling MoE model. It might be the first time that GPU has ever had to pause for something as simple as a fetch.
This situation isn’t a fluke; it’s a symptom that the arrival of DeepSeek made visible.
Storage for AI infrastructure has long been treated as important but peripheral.
You need it, of course. You can't train or serve a model without access to weights, datasets and checkpoints. And you need somewhere to write logs or store tokens or archive attention matrices that no one will ever actually look at again.
But storage was always the thing you provisioned after you picked your compute. You calculated what you needed based on model size and data volume, maybe ran some benchmarks, and then assumed the bottleneck would always be the same: the GPU pipeline.
And for the last few years, it was. But then came dramatic sparsity.
DeepSeek didn’t invent the MoE model, but it helped normalize it.
It pushed the idea of conditional computation into the center of practical deployment. With MoE, not every part of the model fires on every pass. Tokens are routed to specific subnetworks that are trained to specialize in different semantic or syntactic regions of language. The model, instead of being a nice neat grid of neurons, becomes a branching structure of possibilities. You ask a question, and only the experts deemed relevant by the router respond.
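To make the routing concrete, here is a minimal sketch of top-k expert selection, the basic mechanism MoE layers use to decide which experts respond. The expert count, hidden size, and k value are illustrative placeholders, not any particular model’s configuration.

```python
import numpy as np

def route_tokens(token_embeddings, gate_weights, k=2):
    """Pick the top-k experts for each token from the router's logits.

    token_embeddings: (num_tokens, hidden_dim)
    gate_weights:     (hidden_dim, num_experts) -- the router's projection
    Returns each token's chosen expert ids and their normalized weights.
    """
    logits = token_embeddings @ gate_weights                # (num_tokens, num_experts)
    topk_ids = np.argsort(logits, axis=-1)[:, -k:]          # the k highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk_ids, axis=-1)
    # Softmax over only the selected experts so their weights sum to 1.
    probs = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return topk_ids, probs

# Example: 4 tokens, hidden size 16, 8 experts -- only 2 of the 8 fire per token.
rng = np.random.default_rng(0)
ids, weights = route_tokens(rng.normal(size=(4, 16)), rng.normal(size=(16, 8)))
print(ids)   # a different subset of experts per token
```

The detail that matters for storage is that which experts fire, and therefore which weights must be resident, changes token by token.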
This is clever and efficient and helps models scale up without requiring linearly increasing compute. Ah. But it also introduces a subtle kind of chaos. Because now, instead of loading a single static block of model weights and running inference over it repeatedly, the system has to fetch different slices of the model at different times, depending on what’s being asked and who needs to answer.
These slices, or little weight shards, are often too large to keep all in memory across a fleet, and too rarely accessed to justify pinning in cache. So they are retrieved. Over and over again. And as DeepSeek began to demonstrate at scale, the real performance cost wasn’t in how big these weights were, it was in how fast they could (or couldn’t) be fetched.
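A toy model of that retrieval loop, assuming a hypothetical fetch_shard() that pulls an expert’s weights from remote storage and an in-memory cache too small to hold every expert:

```python
from collections import OrderedDict
import time

class ExpertShardCache:
    """LRU cache for expert weight shards; misses fall through to storage."""

    def __init__(self, fetch_shard, capacity=4):
        self.fetch_shard = fetch_shard    # hypothetical hook: read a shard from the store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)      # mark as recently used
            return self.cache[expert_id]
        self.misses += 1
        weights = self.fetch_shard(expert_id)      # the slow path: a real fetch
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)         # evict the least recently used shard
        return weights

# Simulated fetch: ~20 ms of stand-in "storage latency" per miss.
def slow_fetch(expert_id):
    time.sleep(0.02)
    return f"weights-for-expert-{expert_id}"

cache = ExpertShardCache(slow_fetch, capacity=4)
for expert_id in [0, 1, 2, 3, 4, 0, 5, 1]:         # a routing pattern wider than the cache
    cache.get(expert_id)
print(f"misses: {cache.misses} of 8 lookups")      # every miss paid the fetch cost
```

When the routing pattern is wider than the cache, almost every lookup turns into a fetch, and the fetch time, not the tensor math, sets the pace.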
What this means, in practice, is that storage has been dragged onto center stage.
In this world, storage is no longer just a passive repository, a back-end system that can be swapped or tiered or abstracted without consequence.
Deployed the way it historically has been, it becomes a bottleneck, a source of latency, a variable in the scheduling of inference.
In the DeepSeek deployment model, it is not uncommon for the delay in retrieving expert weights to exceed the time it takes to actually perform the compute. The GPU is ready. The memory is available. The token stream is live. But the expert weights haven’t arrived.
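A rough back-of-envelope comparison shows how that happens. Every number below is a placeholder assumption, not a measurement from DeepSeek or any specific deployment:

```python
# Illustrative assumptions only, not benchmarks.
shard_size_gb   = 0.4      # one expert's weights for a layer
storage_gbps    = 1.5      # effective per-stream read throughput from the store
gpu_tflops      = 400.0    # sustained accelerator throughput
flops_per_shard = 8e12     # compute that actually touches those weights per batch

fetch_ms   = shard_size_gb / storage_gbps * 1000
compute_ms = flops_per_shard / (gpu_tflops * 1e12) * 1000

print(f"fetch  : {fetch_ms:6.1f} ms")    # ~267 ms waiting on the store
print(f"compute: {compute_ms:6.1f} ms")  # ~20 ms of actual math
```

Under assumptions like these, the GPU spends an order of magnitude longer waiting for the shard than computing over it, unless the shard was staged before the router asked for it.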
And so, for the first time in modern infrastructure, storage is acting like a scheduler.
This shift has been slow to register, in part because it doesn’t show up where people are looking. GPU utilization stays high. Network congestion doesn’t spike. But as teams here at VAST have seen, if you profile inference across a DeepSeek-style MoE deployment and isolate token latency, you start to see strange patterns. You’ll see stalling, jitter, unexpected wait states.
These aren’t due to system overload or routing inefficiencies, they’re due to missed fetch windows. The right expert wasn’t in memory when the router called it. And so the token paused.
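One way to see it is to line the per-token latencies up against whether the routed experts were resident when the router called them. The trace below is hypothetical, a sketch of the kind of correlation that exposes missed fetch windows rather than output from any real profiler:

```python
import statistics

# Hypothetical per-token trace: latency in ms, and whether every routed
# expert's weights were already in memory at routing time.
trace = [
    {"latency_ms": 11.8, "experts_resident": True},
    {"latency_ms": 12.1, "experts_resident": True},
    {"latency_ms": 93.4, "experts_resident": False},   # stall: shard still in flight
    {"latency_ms": 12.0, "experts_resident": True},
    {"latency_ms": 78.9, "experts_resident": False},
    {"latency_ms": 11.7, "experts_resident": True},
]

hits   = [t["latency_ms"] for t in trace if t["experts_resident"]]
misses = [t["latency_ms"] for t in trace if not t["experts_resident"]]

print(f"experts resident : median {statistics.median(hits):.1f} ms")
print(f"missed fetch     : median {statistics.median(misses):.1f} ms")
```

The aggregate utilization graphs look healthy; the jitter only shows up once latency is split by residency.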
This might seem like a niche problem, but we look at it more as an early warning. Because as models continue to grow and MoE architectures become more common, this becomes the default.
You don’t just load a model once and run it. You load it in pieces, in real time, under pressure, in an unpredictable order. And every piece you don’t have ready is a performance hit.
Object stores are optimized for throughput and capacity, not sub-millisecond fetch precision.
Cloud storage is tuned for redundancy and resilience, not inference-time responsiveness.
Even fast block storage struggles when asked to repeatedly retrieve and serve discontiguous fragments of large tensor arrays without advance notice.
MoE models aren’t reading full checkpoints, they’re reading slices of slices, often under high fan-out.
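Mechanically, that access pattern looks less like streaming a checkpoint and more like issuing many small positioned reads against a large file. A minimal sketch, assuming a hypothetical shard index that maps each (layer, expert) pair to a byte offset and length inside a checkpoint file:

```python
import os

# Hypothetical index: (layer, expert) -> (byte offset, byte length) in the checkpoint.
SHARD_INDEX = {
    (3, 17): (1_204_000_768, 47_185_920),
    (3, 42): (2_551_873_536, 47_185_920),
}

def read_expert_slice(checkpoint_path, layer, expert):
    """Fetch one expert's bytes with a positioned read instead of a full load."""
    offset, length = SHARD_INDEX[(layer, expert)]
    fd = os.open(checkpoint_path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)    # one discontiguous range per call
    finally:
        os.close(fd)

# Each routed token can trigger several of these small, scattered reads, e.g.:
# shard_bytes = read_expert_slice("experts.ckpt", layer=3, expert=17)
```

Systems tuned for large sequential streams handle this pattern far less gracefully than their capacity and throughput numbers suggest.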
So picture this: a modern, multimillion-dollar AI cluster is sitting on top of an I/O system that is quite literally slowing down the model by being too general-purpose.
It’s no longer enough to have a fast GPU and a clever router. You need a storage layer that behaves like a predictive cache, that knows what experts are likely to be needed next and stages them accordingly.
You need prefetch logic not just for data, but for model components. You need your object store to understand your scheduler, not just serve it.
Let’s put this in better context. Inference runtimes are being coupled with telemetry systems that track expert activation frequency and use that data to warm caches ahead of time. Weights are being co-located with flash or tiered across high-performance storage fabrics to reduce fetch time. Some setups even treat expert shards as ephemeral memory objects, loading and unloading them in a continuous cycle that mimics virtual memory paging.
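A sketch of the first of those patterns, with hypothetical hooks: count how often each expert is activated and ask the storage layer to stage the hottest ones before the router asks for them.

```python
from collections import Counter

class ExpertTelemetry:
    """Track expert activation frequency and warm the cache ahead of demand."""

    def __init__(self, stage_expert, warm_top_n=8):
        self.stage_expert = stage_expert   # hypothetical hook: prefetch into flash or RAM
        self.warm_top_n = warm_top_n
        self.activations = Counter()

    def record(self, routed_expert_ids):
        """Called by the inference runtime after each routing decision."""
        self.activations.update(routed_expert_ids)

    def warm_cache(self):
        """Periodically stage the most frequently activated experts."""
        for expert_id, _count in self.activations.most_common(self.warm_top_n):
            self.stage_expert(expert_id)

telemetry = ExpertTelemetry(stage_expert=lambda e: print(f"staging expert {e}"))
for routed in [(3, 17), (3, 42), (3, 17), (9, 17)]:   # routing decisions over a few tokens
    telemetry.record(routed)
telemetry.warm_cache()   # experts 3 and 17, the most active, are staged first
```

The same bookkeeping, run per layer and fed into a tiering policy, is essentially the virtual-memory-style paging described above.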
These are not stopgaps. They are early signs of a new paradigm, in which storage is no longer just about holding models, it’s running them.
If your storage layer can’t keep pace with your model's routing logic, your scheduler must either wait or re-route.
That turns a storage miss into a compute penalty, which is exactly what MoE was meant to avoid.
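That trade-off can be written down directly. The sketch below is hypothetical scheduler logic, not any particular runtime’s code: on a miss, either stall the token until the fetch lands or fall back to a lower-scoring expert that is already resident.

```python
def resolve_expert(expert_ranking, is_resident, wait_for_fetch, max_stall_ms=5.0):
    """Decide which expert serves the token when the top choice may not be loaded.

    expert_ranking: expert ids from the router, best first.
    is_resident:    hypothetical callback -- are this expert's weights in memory?
    wait_for_fetch: hypothetical callback -- block up to a deadline, return success.
    """
    best = expert_ranking[0]
    if is_resident(best):
        return best                          # happy path: no penalty at all
    if wait_for_fetch(best, max_stall_ms):
        return best                          # paid a bounded stall instead of a miss
    for fallback in expert_ranking[1:]:
        if is_resident(fallback):
            return fallback                  # re-route: no stall, but a quality risk
    return best                              # nothing resident; stalling is the only option
```

Either branch costs something the dense-model world never had to price in: latency on one side, routing fidelity on the other.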
At that point, the infrastructure starts to work against the model instead of with it. And as DeepSeek showed, the more sparsity that gets introduced, the more you depend on the agility of storage. This isn’t about volume anymore.
This also suggests a future where the roles inside AI infrastructure start to blur. Where schedulers read I/O logs. Or where storage systems carry activation maps. Where model graphs are optimized not just for accuracy or FLOPs, but for prefetch probability.
Which brings us back to that idle GPU, blinking patiently, waiting not for power or work, but for a sliver of a tensor to make its way out of a store that was never designed to be fast.
The lesson here isn’t just about DeepSeek. It’s about what happens when models evolve faster than the systems built to serve them.