May 8, 2025

OpenAI’s Obsession with Bigger Models Exposes AI’s Dataflow Weakness

Nicole Hemsoth Prickett

Scaling is more than a play for bigger models—it’s a stress test for every data pipeline, exposing the cracks in AI’s infrastructure as OpenAI’s Jason Wei reveals.

Scaling isn’t a choice. It’s a gravitational pull, an inevitability that drags AI deeper into the compute abyss, burning terawatts and investor dollars in the process. 

But the real mystery isn’t the cost or the heat or even the sheer mass of data we’re hurling at these models. It’s how and why scaling works so reliably, so relentlessly, so damn predictably.

Jason Wei, a researcher at OpenAI, leans into that question with the kind of focus that suggests he’s been circling it for years.

Scaling as Law: The Predictable Mechanics of More

Here’s the thing about scaling laws: they don’t just predict performance—they practically mandate it. Add more compute, and the loss curve dips. Pour in more data, and the model’s predictive accuracy tightens like a noose. It’s simple, almost offensively so, and that’s the part that gets under the skin.

"If you increase the compute by a certain amount, you can predict how much the loss will go down," Wei says, his delivery casual, like he’s explaining how gravity works. 

The uncanny regularity of this law isn’t just surprising—it’s downright eerie. For seven orders of magnitude, the scaling curve holds, inching downward with the reliable precision of a metronome.
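If you want to see how mundane that predictability looks in practice, here’s a minimal sketch. The power-law form is the standard one; the constants are illustrative stand-ins for the sake of the example, not OpenAI’s actual fitted values.

# A minimal sketch of what "predict how much the loss will go down" means:
# a single power law in training compute. The constants are illustrative
# placeholders, not OpenAI's actual numbers.

def predicted_loss(compute_pf_days, c_critical=3.1e8, alpha=0.050):
    """Power-law scaling: loss falls as a fixed power of training compute."""
    return (c_critical / compute_pf_days) ** alpha

# Sweep seven orders of magnitude of compute and watch the curve inch down.
for exp in range(0, 7):
    compute = 10.0 ** exp
    print(f"{compute:>12,.0f} PF-days -> predicted loss {predicted_loss(compute):.2f}")

Run it and the loss drops by roughly the same factor every time compute grows tenfold, which is the metronome-like behavior Wei is describing.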

It’s as if scaling has become AI’s universal solvent—the answer to every difficult question. The model can’t translate effectively? Scale it. The model can’t debug code? Scale it. The model can’t pass an eighth-grade math test? Scale it. 

Scaling isn’t just a tactic; it’s a reflex, a knee-jerk response to any capability gap, a way to brute-force competence.

But here’s the catch: scaling is also a hammer that only a few players can swing. 

It’s a shift from the scrappy, single-researcher projects of five years ago to sprawling engineering behemoths staffed with data scientists, model trainers, and infrastructure engineers.

"If you're a small language model, you don't have that many parameters. Memorization is super costly," Wei says. But a large model can afford to be “extremely generous in memorizing tail knowledge.” 

That generosity, of course, is not free. It’s paid for in GPU hours, power consumption, and the kind of data acquisition that makes surveillance capitalism look quaint.

The shift isn’t just quantitative—it’s cultural. Gone are the days of bottom-up research culture, where a lone coder could hack their way to a breakthrough. Now, it’s all about compute orchestration, data wrangling, and scale-first infrastructure. 

You’re not building a model; you’re staging a logistical war, with Kubernetes clusters as infantry and power grids as artillery.

Scaling’s Uncanny Thresholds

So yes, scaling laws predict performance, but what they don’t predict is what happens when a model crosses a threshold and suddenly learns how to do something it couldn’t do yesterday. Translation, for example. Debugging code. Solving math problems that would stump a college sophomore, that kind of thing.

Wei calls these “emergent abilities,” the kind of leap where the model doesn’t just get marginally better–it goes from not getting it at all to nailing it with eerie competence. The transition isn’t gradual, it’s seismic.

Here’s the setup Wei described to make the point: OpenAI ran next-word prediction tasks across its models, from Ada to Babbage to Curie. Ada and Babbage failed at translation. Curie, scaled just a wee bit larger, suddenly just gets it. 

To emphasize: this is a model that was spitting out nonsense in one iteration but suddenly locks in, translating fluently in the next. That’s not linear or incremental; it’s a phase change, the kind of leap that should be unpredictable but isn’t. Because, as Wei puts it, “When the trend continues over time, you see this emergent behavior.”

The Unsettling "Why" of Scaling

This is the part that sticks in the throat: nobody really knows why scaling works. 

Why a larger model suddenly gets the joke. Why it can debug code or compose a sonnet or pass the bar exam when it couldn’t yesterday. 

Wei doesn’t have the answer, and he’s refreshingly blunt about it: it’s “a question that is sort of a glaring question, but we don't have a great answer to in the AI research community.”

Theories abound, of course. One is that large models can afford to be promiscuous with memory: they can hoover up every rare fact, every niche skill, every long-tail bit of knowledge that smaller models would ignore. They can afford to be sloppy, to overlearn, to absorb everything, you know, just in case. And somewhere in that glut of data, they accidentally learn to do things that weren’t part of the plan.

Another theory is that scaling simply gives models more room to play or experiment or chase down connections smaller models would discard as noise. The model can afford to “memorize tail knowledge,” as Wei puts it. And in this theory, sometimes that tail knowledge is the missing piece in a multi-step reasoning task that suddenly clicks into place.
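One way to picture that second theory in miniature (a toy illustration, not anything Wei claimed): if a task chains together several sub-steps and each step’s reliability improves smoothly with scale, the end-to-end success rate still looks like a sudden jump. Every number below is assumed.

# Toy model: smooth per-step improvement, abrupt-looking task-level emergence.
# All values are assumptions chosen for illustration.

STEPS = 8  # assumed number of sub-skills the task chains together

def step_reliability(scale):
    """Smooth, unremarkable per-step improvement with model scale (capped at 1.0)."""
    return min(1.0, 0.3 + 0.1 * scale)

for scale in range(0, 8):
    p = step_reliability(scale) ** STEPS   # every step must succeed for the task to succeed
    print(f"scale {scale}: per-step {step_reliability(scale):.1f} -> task success {p:.3f}")

Each component improves gently, but because all the steps have to land together, the task score sits near zero for most of the sweep and then snaps upward, which is roughly what an “emergent ability” looks like from the outside.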

But I digress because this whole “we don’t really know how this works” rabbit hole is deep and should come with a stiff cocktail at the bottom. 

But What Happens When Scaling Hits a Wall?

Scaling has been the winning formula, but what happens when the losses stop decreasing and the emergent abilities dry up? 

What happens when the model is as big as it can get, and the power grid is straining, and the training costs have gone from obscene to unsustainable?

If scaling has taught us anything, it’s that compute is the new oil, and those who can extract it efficiently are the new superpowers. But the infrastructure needed to sustain scaling, let alone optimize it, isn’t built yet. 

We’re still in the phase where the industry is learning to crawl, and scaling is the blunt-force solution to every problem. But when brute force hits its limit, what do we do then?

Wei’s talk is a harbinger, a reminder that scaling works, at least for now. But the real breakthroughs will come when we learn to do more with less, when we stop throwing teraflops at the problem and start asking why scaling worked in the first place. Until then, we’re stuck in the same loop: more data, more compute, more capabilities, more questions. And the questions are only getting louder.

And on that note, scaling laws have made one thing painfully clear: AI doesn’t run on algorithms—it runs on infrastructure. And as models balloon from billions to trillions of parameters, the data pipelines that fed them start to look like leaky garden hoses trying to fill an Olympic swimming pool.

The problem isn’t just compute; it’s orchestration. It’s memory bandwidth. It’s latency. It’s the chasm between where data lives and where it needs to be when a model needs it.
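A back-of-envelope sketch makes one of those chasms tangible. Take checkpointing alone; every figure below is an assumption chosen for the arithmetic, not a measurement from any real deployment.

# Rough sketch: how fast storage has to be just to checkpoint a
# trillion-parameter model without stalling training. All numbers assumed.

PARAMS = 1e12                   # assumed model size: one trillion parameters
BYTES_PER_PARAM = 16            # assumed: fp16 weights + fp32 master copy + optimizer moments
CHECKPOINT_WINDOW_SEC = 120     # assumed time budget before the cluster sits idle

checkpoint_tb = PARAMS * BYTES_PER_PARAM / 1e12
required_write_gb_s = PARAMS * BYTES_PER_PARAM / CHECKPOINT_WINDOW_SEC / 1e9

print(f"checkpoint size: ~{checkpoint_tb:.0f} TB")
print(f"write bandwidth to fit the window: ~{required_write_gb_s:.0f} GB/s")

That’s on the order of a hundred-plus gigabytes per second of sustained writes just so the cluster isn’t sitting idle while it saves its own state, before a single training token has moved anywhere.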

Scaling’s dirty secret is that it’s not just about bigger models—it’s about making sure every byte of data is exactly where it needs to be, exactly when it’s needed. 

It’s about turning petabytes of raw input into something more than a storage problem. 

It’s about building a substrate that moves data like a nervous system moves signals—a continuous, synchronous flow of context and content, with nothing lost in transit.

Imagine some kind of magical infrastructure that doesn’t just store data but treats it like an operating system treats memory. A shared pool, infinitely accessible, always on, always aware. 

That kind of infrastructure, the kind Wei’s talk opens the door to thinking about, points to a future where systems aren’t just a place to put data but the very framework through which data flows: a single, sprawling namespace that acts as a unified neural layer across the entire stack.

Scaling laws dictate that the bigger the model, the more data it can gobble, the more context it craves, the more nuance it requires. And (deep breath) the more that data must move seamlessly between compute nodes, storage layers, and memory caches without a single dropped packet or wasted millisecond. 

In a world where milliseconds matter, data isn’t just a resource—it’s a living, breathing entity, and the infrastructure that manages it must evolve from static storage to dynamic orchestration.
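Here’s what “milliseconds matter” looks like as plain arithmetic, with every figure assumed purely for illustration.

# Toy arithmetic: the cost of small per-step stalls at cluster scale.
# All values are placeholder assumptions, not measurements.

GPUS = 10_000            # assumed cluster size
STEP_TIME_SEC = 1.0      # assumed time per training step when data arrives on time
STALL_SEC = 0.005        # assumed extra wait per step for data that isn't where it should be

waste_fraction = STALL_SEC / (STEP_TIME_SEC + STALL_SEC)
idle_gpu_hours_per_day = GPUS * 24 * waste_fraction

print(f"stall overhead: {waste_fraction:.2%} of every step")
print(f"equivalent idle capacity: ~{idle_gpu_hours_per_day:,.0f} GPU-hours per day")

A half-percent stall sounds like rounding error until you multiply it across ten thousand accelerators running around the clock.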

And in that future, scaling isn’t just a matter of more—it’s a matter of moving faster, thinking faster, and making every single byte count.
