Reinforcement Learning Is How We Raise Agents, Not Just Build Them

AI agents won’t be built; they’ll be raised, trial by feedback loop, learning to crawl, stumble, and eventually run toward something that looks like real autonomy.

At a conference full of shipped systems and best practices, Will Brown offered something else entirely: a glimpse into what might be coming next.

A machine learning researcher at Morgan Stanley and a theorist by background, Brown’s lens is unusual. Before working on production LLM systems, he spent his graduate years at Columbia studying the theory of multi-agent reinforcement learning–the mathematics of how independent learners can evolve inside shared environments. It shows.

Where others talk about models as tools, Brown talks about them as creatures, as things that can interact, experiment, and--if given the right structures--learn.

Today’s AI systems, he argues, aren’t agents in any meaningful sense. They are pipelines: clever assemblies of chatbots, reasoners, and tool users, all tightly wrapped in human-made workflows.

They can answer questions, solve small tasks, search the web. But they can’t truly act. They can’t pursue goals over hours or days. They can’t independently find new paths forward when a plan fails.

And, crucially, they don’t get better at doing these things over time.

Now, we’ve gotten very good at building pipelines in the last several years in particular. Cursor IDE, WindSurf, Replit’s Cody, every flavor of research assistant—but they’re pipelines under the hood. Chains of prompts, tool invocations, and decision trees carefully managed by humans.

The systems that seem more autonomous (Devin Operator, OpenAI’s DeepResearch project, for instance) are notable precisely because they are so rare. And even there, Brown notes, "there’s not that many agents nowadays that will go off and like do stuff for more than 10 minutes at a time."

The common wisdom says to wait: Let better models arrive, with bigger context windows, sharper memory, deeper world models. Surely, better base models will solve autonomy for us, right?

Well, Brown’s view is colder and sharper: size alone won’t get us there. To have agents, we need systems that learn, not just respond. And for that, we will need good old fashioned reinforcement learning.

True reinforcement learning isn’t just tuning a model’s output to match human preferences. It’s a framework where a system interacts with an environment, pursues goals, gets feedback, and improves based on its own trial and error. Not because a human labeled more data, but because success itself becomes the teacher.

And despite all the progress, in today’s LLM world, that notion has barely begun. Most large models are still shaped by pretraining, synthetic data, and Reinforcement Learning from Human Feedback (RLHF).

Pretraining makes them knowledgeable. Synthetic data makes them scalable. RLHF makes them...polite. But none of these make them autonomous.

As Brown puts it, pretraining is running into diminishing returns. Synthetic data compresses capabilities, but doesn't create new ones. RLHF creates friendly chatbots, not evolving thinkers.

The glimpse of something new came recently for Brown with the release of DeepSeek’s R1 model and its accompanying paper. For the first time, someone spelled it out: how models like OpenAI’s O1 had crossed a threshold. Not by throwing more data at them, but by turning the crank of reinforcement learning: setting up challenges, letting the model try, scoring the outputs, and teaching it to lean into what worked.

And on that note, chain-of-thought reasoning–those deep, multi-hop inferences we admire–didn’t emerge from prompt templates. It emerged because the models discovered it was an effective strategy under reinforcement.

This is the quiet revolution underway: better agents won’t be built by bigger models alone. They’ll be raised through interaction.

Yet even here, the frontier is only half-breached. DeepSeek’s models, OpenAI’s DeepResearch project — they operate at single turns or narrow task windows. Multi-step, multi-tool agents who adapt across hours, or days, or open-ended problems? Still mostly a dream.

And it’s here that Brown’s work turns personal and practical:

In the weekend following DeepSeek’s release, he set out to test something small: could reinforcement learning be made simple enough to democratize?

He took a basic LLaMA 1B model, a Hugging Face trainer, and a few lines of code. He defined a scoring rubric: not just "right answer or wrong," but also intermediate behaviors like following structure, formatting answers correctly, making progress even when not fully correct.

The result wasn’t state-of-the-art. But it learned, by gum. It corrected itself. It started reasoning longer when longer chains helped solve problems.

It wasn’t about the sophistication of the model — it was about the simplicity of the feedback.

He tweeted it out, expecting little. What happened instead was a quiet viral explosion: forks, blog posts, modifications, replications. Not because the model was miraculous, but because the code was accessible. It invited people to tinker.

It made it obvious that rubric engineering–the design of reward signals for learning–could be as fundamental a skill as prompt engineering had been a few years earlier.

Rubric engineering is an act of architecture, not patchwork. Instead of hand-evaluating models after they perform, you define the very metrics that guide their growth. You reward not only success, but traits that lead toward success. You penalize drift. You build invisible structures that teach the model how to behave.

Done well, it becomes a path to raising real agents.

Done badly, it invites reward hacking: models that optimize for the letter of the task, not its spirit.

Brown has since begun shaping this work into a more formal open-source framework — a way to plug reinforcement learning loops into existing agent environments without having to retrain from scratch or reimplement massive libraries.

The hope is to make the infrastructure of learning–not just inference–accessible to the broader engineering community.

Because if there’s a message under all this, it’s this:

The future of agents won’t just be about bigger context windows or smarter prompts. It will be about feedback loops. About teaching. About evolution inside environments we design.

Today’s AI engineering disciplines — prompt crafting, tool orchestration, eval design — have built the pipelines we needed. But the next era will ask for a different craft.

The agents of tomorrow won’t be shipped. They’ll be raised.

And it won’t be the biggest models that win — it’ll be the ones who learned how to learn.