Of all the metaphors AI likes to drape itself in (brains, rockets, evolution), the most overlooked is probably “space.”
Not outer space, although that’s flashy enough, but the real, boring, profoundly complicated three-dimensional space we all occupy.
For AI, space isn’t just another capability or benchmark to hit. Language is linear, flat, and digital, while the world is not. And this is what makes Fei-Fei Li’s new obsession so fascinating: building AI systems that finally understand and navigate real, spatial, 3D environments.
At a fireside chat in San Francisco, Li, creator of ImageNet, early driver of vision-language systems, and one of AI’s genuine north-star figures, made no attempt to sugarcoat the difficulty of the spatial-intelligence problem. In fact, she seemed delighted by it.
“My entire career,” she said, grinning, “is going after problems that are just so hard, bordering on delusional.”
The room laughed, but Li didn’t. She knows the spatial problem sits at the heart of a conceptual leap AI has to make if it’s ever going to resemble something like real intelligence. Language, after all, is just words, an abstraction we invent. But space is irreducible. Space is physics, geometry, occlusion, gravity.
And that, Li argues, is the difference that makes spatial intelligence not just harder than language, but more important for AGI. "Language literally comes out of everybody’s head," she reminded the audience, flatly. "There’s no language in nature. But the world is far more complex."
The challenge, she explains, isn’t the dimensionality alone but the nature of the signal itself. Language is one-dimensional, sequential, digital. The spatial signal, by contrast, is projected, lossy, and, when you think about it, mathematically ill-posed. Cameras collapse three dimensions into two, leaving AI the unhappy task of reconstructing an inherently incomplete view. Humans solve it with binocular vision, priors, and lots of inference, but AI doesn’t have that luxury, at least not yet.
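To make the ill-posedness concrete, here is a minimal sketch (my illustration, not from the talk) of the pinhole projection she is alluding to: two distinct 3D points that land on exactly the same pixel.

```python
# Illustrative only: an ideal pinhole camera maps a 3D point (X, Y, Z) to the
# image point (f*X/Z, f*Y/Z). Depth Z is divided away -- and lost.

def project(point, f=1.0):
    """Project a 3D point onto the image plane of a pinhole camera."""
    x, y, z = point
    return (f * x / z, f * y / z)

near = (1.0, 2.0, 4.0)  # a point 4 units in front of the camera
far = (2.0, 4.0, 8.0)   # a different point, twice as far away

# Both land on the same pixel: the image alone cannot tell them apart.
assert project(near) == project(far) == (0.25, 0.5)
```

Every point along the ray through a pixel produces the same measurement, which is why recovering 3D from a single image is a matter of priors and inference rather than simple inversion.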
This is the reason Li created World Labs, her new startup explicitly designed to build models whose output isn’t text, or even images, but structured, physics-aware, generative worlds. “AGI,” she said plainly, “will not be complete without spatial intelligence.”
What that means in practice is still somewhat cryptic and she’s careful not to reveal too much about World Labs’ internal progress. But her description implies an approach that looks something like a hybrid of generative AI, robotics, differentiable rendering, and neural-field representations like NeRFs.
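For readers unfamiliar with neural fields, here is a rough sketch of the idea behind NeRF-style representations (an untrained toy of my own, purely illustrative, and not World Labs code): the scene is stored not as meshes or voxels but as a small network that maps any 3D coordinate to what is there, a density and a color.

```python
# Toy neural field: query a 3D point, get back (density, rgb).
# Weights are random and untrained -- this shows the representation, not a result.
import math
import random

random.seed(0)
HIDDEN = 8
W1 = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(3)]
W2 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(HIDDEN)]

def field(x, y, z):
    """Return (density, [r, g, b]) at a 3D point in the 'scene'."""
    h = [math.tanh(x * W1[0][j] + y * W1[1][j] + z * W1[2][j])
         for j in range(HIDDEN)]
    out = [sum(h[i] * W2[i][k] for i in range(HIDDEN)) for k in range(4)]
    density = math.log1p(math.exp(out[0]))        # softplus: density >= 0
    rgb = [1 / (1 + math.exp(-v)) for v in out[1:]]  # sigmoid: colors in [0, 1]
    return density, rgb
```

The whole scene is implicit in the weights; rendering means querying this function along camera rays and compositing the answers, which is what makes the representation differentiable and trainable from ordinary photographs.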
Spatial intelligence models will need to seamlessly combine real-world reconstruction (for robotics and embodied tasks) with generative imagination (for simulations, design, and virtual worlds). They’ll have to simultaneously solve perception, reasoning, and physics-constrained generation, an integrated computational cocktail that no existing system has mastered.
And as if all that weren’t enough, Li pointed out an even more subtle complication: spatial data scarcity.
Language models rely on an internet flooded with text, trillions of tokens, all neatly digitized. Spatial intelligence, though, has no such conveniently scrapeable corpus.
"Where is the data for spatial intelligence?" Li asked rhetorically, shrugging. "It’s all in our heads. It’s not accessible like language."
Without an ImageNet-style dataset of 3D environments, training these models becomes another task that ranges from hard to borderline impossible. She confirmed World Labs is tackling this through some hybrid of synthetic data generation and careful curation, but specifics are sparse.
But really, it’s not just a product she’s after, it’s a paradigm shift, a necessary evolution for AI. Her previous projects suggest a pattern: first build the infrastructure (ImageNet), then let the field discover how essential it is (AlexNet).
With spatial intelligence, she’s aiming for something similarly infrastructural and potentially field-defining. A platform on which everything else (robotics, AR/VR, embodied cognition) could be built. And she’s assembled a seriously strong team to make it happen: Justin Johnson (PyTorch3D, neural style transfer), Ben Mildenhall (NeRF), and Christoph Lassner (Pulsar).
She’s pulling together distinct threads of spatial modeling, rendering, and world-representation into a single unified narrative, namely that the next evolution of AI isn’t just better LLMs or diffusion models, it’s systems capable of fully understanding and generating structured spatial realities.
The first application areas to benefit from spatial intelligence embedded in AI are still taking shape, but she sees creative design, robotics, and gaming as first movers.
In creative design she says spatial models could empower architects and digital artists to generate worlds where structures are both visually convincing and physically coherent. The difference matters. Walls hold ceilings, chairs slide under tables, every object respects gravity.
Then there’s robotics, where spatial reasoning separates lab demos from real-world embodiment. Li describes spatial intelligence as crucial for robots that don’t just recognize tools, but understand precisely how to use them (calculating balance, friction, and force intuitively).
Li also highlights gaming and the metaverse as areas still awaiting spatial intelligence breakthroughs. “It’s still not working,” she admits plainly. The missing piece is physics-aware world modeling: generating virtual spaces you don’t merely observe, but genuinely inhabit. Objects respond naturally, actions feel intuitively correct, interactions become seamless. For Li, spatially intelligent models aren’t about hype; they’re about transforming virtual worlds into authentically immersive environments.
Ultimately, adding spatial intelligence to the mix will mean hybrid architectures that straddle reconstruction and generation, carefully synthesized datasets, and physics constraints baked directly into the modeling itself.
She doesn’t hedge on the complexity: projection ambiguity, representational gaps, data scarcity. But overcoming them, Li insists, is precisely what moves AI from merely describing the world to genuinely inhabiting it.