Perspectives
Jun 24, 2026

When Files Stop Being the Unit of Management: Lessons from ServiceNow’s AI Research

Authored by

Jim Crook - Director, Corporate Communications

When people talk about AI infrastructure, the conversation usually starts with GPUs.

How many GPUs are available? How quickly can they train a model? How efficiently can they serve inference?

But listening to ServiceNow’s Christian Hudon describe the realities of operating one of the industry's larger enterprise AI research environments, a different challenge emerges. The hard part is increasingly not generating intelligence. It's managing the growing ecosystem of models, datasets, checkpoints, evaluations, and governance processes that surround it.

That shift is forcing organizations to manage a growing inventory of AI assets.

“We care about models, datasets, and checkpoint directories,” said Hudon, Applied Research Scientist and Architect at ServiceNow's AI Research group. “My boss doesn't care about individual files.”

That distinction might sound subtle. It isn't.

When Files Stop Being the Unit of Management

ServiceNow's AI Research organization operates nearly 500 NVIDIA H100 GPUs and manages roughly 15 petabytes of data. Like most advanced AI teams, one of its earliest requirements was straightforward: support large-scale training.

“The first key challenge as soon as you're doing large-scale training workloads is checkpointing,” said Hudon.

Modern training jobs routinely span hundreds of GPUs. Every few minutes, those systems pause to save their state. The longer those checkpoints take to write, the longer expensive GPUs sit idle waiting to resume work.
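To make that cost concrete, here is a minimal back-of-the-envelope sketch. It assumes fully synchronous checkpointing, where every GPU waits for the write to finish; the function names and all numbers are hypothetical illustrations, not ServiceNow's figures.

```python
def idle_fraction(write_seconds: float, compute_seconds: float) -> float:
    """Share of wall-clock time spent waiting on checkpoint writes.

    Assumes each cycle is `compute_seconds` of training followed by a
    blocking write of `write_seconds` (no asynchronous checkpointing).
    """
    return write_seconds / (compute_seconds + write_seconds)


def gpu_hours_idle(num_gpus: int, run_hours: float,
                   write_seconds: float, compute_seconds: float) -> float:
    """GPU-hours lost to checkpoint stalls over a run of `run_hours`."""
    return num_gpus * run_hours * idle_fraction(write_seconds, compute_seconds)


if __name__ == "__main__":
    # Hypothetical job: 256 GPUs, 24 hours, a 60 s write after every 300 s of work.
    print(f"idle fraction: {idle_fraction(60, 300):.1%}")
    print(f"GPU-hours idle: {gpu_hours_idle(256, 24, 60, 300):.0f}")
```

Under these assumed numbers, one sixth of the run is spent waiting. Asynchronous or tiered checkpointing changes the picture, which is part of why the bottleneck is debated.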

While some researchers have argued that checkpoint bandwidth is not always the bottleneck it is assumed to be, checkpointing still creates significant operational challenges at scale.

As organizations accumulate thousands of experiments, model versions, and training runs, storage consumption can become difficult to understand. Hudon shared one example where disk utilization suddenly approached capacity. After investigating, the team discovered that a well-intentioned intern had configured experiments to save massive checkpoints every five minutes.

The result was multiple petabytes of storage consumed by experiments that no longer needed to exist.
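The arithmetic behind that kind of surprise is simple. The sketch below uses hypothetical numbers (the article does not give checkpoint sizes or run lengths) and assumes every checkpoint is kept, with no retention policy.

```python
def checkpoint_storage_tb(checkpoint_size_gb: float,
                          interval_minutes: float,
                          run_hours: float,
                          concurrent_runs: int = 1) -> float:
    """Total storage in TB (1 TB = 1000 GB) if no checkpoint is ever deleted."""
    saves_per_run = int(run_hours * 60 // interval_minutes)
    total_gb = checkpoint_size_gb * saves_per_run * concurrent_runs
    return total_gb / 1000


if __name__ == "__main__":
    # Hypothetical: a 500 GB checkpoint saved every 5 minutes for one week.
    print(checkpoint_storage_tb(500, 5, 168), "TB")
    # The same run saving once an hour instead.
    print(checkpoint_storage_tb(500, 60, 168), "TB")
```

With these assumed inputs, a single week-long run crosses a petabyte at a five-minute cadence and stays under 100 TB at an hourly one.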

The interesting part of the story isn't the storage itself, but what happened next.

Rather than focusing solely on finding large files, the team began asking a more important question: How can infrastructure understand what these files actually represent?

Moving Beyond Files

Traditional storage systems are remarkably good at managing files. AI organizations increasingly need systems that can reason about higher-level objects.

A model is not simply a file. A checkpoint directory is not simply a collection of files. A dataset contains business-relevant characteristics that matter to researchers, security teams, compliance officers, and platform operators.

The challenge is that most of this context remains invisible to infrastructure, and Hudon's vision is to change that.

Using audit events generated when files are created, ServiceNow is exploring workflows that automatically classify newly created artifacts, identify whether they are models, datasets, or checkpoints, extract metadata, and make that information searchable.
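A workflow like this could start with something as simple as the sketch below. It is an illustration only: the event shape (a dictionary with a "path" key), the file suffixes, and the directory-naming pattern are all assumptions, not ServiceNow's or any storage vendor's actual schema.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed heuristics; a real deployment would tune these to its own conventions.
MODEL_SUFFIXES = {".safetensors", ".gguf", ".onnx"}
DATASET_SUFFIXES = {".parquet", ".arrow", ".jsonl", ".tfrecord"}
CHECKPOINT_DIR = re.compile(r"^(checkpoint|ckpt|step|global_step)[-_]?\d+$")


@dataclass(frozen=True)
class Artifact:
    kind: str  # "checkpoint", "model", "dataset", or "unknown"
    root: str  # the path that should be managed as one unit


def classify(event: dict) -> Artifact:
    """Map a file-creation audit event to a higher-level AI artifact."""
    path = PurePosixPath(event["path"])
    # A file inside a checkpoint-style directory belongs to that directory.
    for parent in path.parents:
        if CHECKPOINT_DIR.match(parent.name):
            return Artifact("checkpoint", str(parent))
    suffix = path.suffix.lower()
    if suffix in MODEL_SUFFIXES:
        return Artifact("model", str(path))
    if suffix in DATASET_SUFFIXES:
        return Artifact("dataset", str(path))
    return Artifact("unknown", str(path))


if __name__ == "__main__":
    for p in ("/research/run42/checkpoint-1000/model.safetensors",
              "/models/assistant.safetensors",
              "/data/train.parquet"):
        print(classify({"path": p}))
```

Note that the first example is reported as a checkpoint directory rather than a model file: the unit being tracked is the directory, not the individual file inside it.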

The payoff is more than better storage administration. It is better organizational visibility, the ability to ask questions like these:

“Show me all the models on storage.”

“Show me all checkpoint directories and how frequently they're saving.”

“Show me which teams are using storage inefficiently.”
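Once checkpoint directories are catalogued, the second of those questions reduces to a small query. The sketch below assumes a hypothetical catalog of (checkpoint directory, creation time in seconds) pairs and treats the parent directory as the training run; the 30-minute threshold is an arbitrary example.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from statistics import median


def save_intervals(checkpoints):
    """Median seconds between checkpoint saves, keyed by run directory.

    `checkpoints` is an iterable of (checkpoint_dir, created_at_seconds).
    Runs with fewer than two checkpoints are omitted.
    """
    by_run = defaultdict(list)
    for ckpt_dir, created_at in checkpoints:
        by_run[str(PurePosixPath(ckpt_dir).parent)].append(created_at)
    result = {}
    for run, times in by_run.items():
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if gaps:
            result[run] = median(gaps)
    return result


def too_frequent(checkpoints, min_interval_seconds=1800):
    """Runs whose median save interval falls below the threshold."""
    intervals = save_intervals(checkpoints)
    return sorted(run for run, gap in intervals.items()
                  if gap < min_interval_seconds)


if __name__ == "__main__":
    catalog = [
        ("/research/a/checkpoint-1", 0),
        ("/research/a/checkpoint-2", 300),
        ("/research/a/checkpoint-3", 600),
        ("/research/b/checkpoint-1", 0),
        ("/research/b/checkpoint-2", 3600),
    ]
    print(save_intervals(catalog))
    print(too_frequent(catalog))
```

In this toy catalog, run "a" saves every five minutes and is flagged, while run "b" saves hourly and is not.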

Those are governance questions, not infrastructure questions. Yet answering them increasingly requires infrastructure to become an active participant in AI operations.

This is one of the more significant shifts happening inside enterprise AI today. Organizations are moving beyond simply storing AI assets toward automatically understanding and governing them.

The Return of an Old Machine Learning Lesson

Hudon offered another observation that feels increasingly relevant as enterprises race to deploy generative AI.

Despite the extraordinary advances in model capability, many of the lessons from earlier generations of machine learning remain unchanged.

“The mindset relevant in 2005, during the earlier days of ML, is still very much relevant right now,” he said.

The biggest lesson is that AI systems require statistical thinking.

Traditional software development relies on deterministic behavior. Engineers can test a handful of edge cases and gain confidence that a system will behave predictably.

AI systems don't work that way.

A model can perform perfectly on a carefully selected set of examples while failing on inputs that better represent real-world usage. As a result, evaluation requires a fundamentally different approach.

Hudon's recommendation is intentionally provocative.

“My benchmark is that you should be testing things with at least a thousand examples.”

That is a real mindset shift. Testing on a thousand or more examples forces teams to think about distributions rather than individual cases. It requires them to understand how users actually interact with systems instead of relying on intuition or spot checks.
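One way to see why sample size matters is to put a confidence interval around a measured pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is offered as an illustration of the statistical point, not as the method Hudon's team uses, and the pass counts are invented.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z=1.96 gives about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed 90% pass rate, very different certainty.
    for passed, total in ((9, 10), (900, 1000)):
        low, high = wilson_interval(passed, total)
        print(f"{passed}/{total}: {low:.3f} to {high:.3f}")
```

With ten examples, an observed 90% pass rate is consistent with a true rate anywhere from roughly 60% to 98%. With a thousand, the range narrows to roughly 88% to 92%, which is tight enough to compare two versions of a system.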

For organizations deploying agents and generative AI applications, that may be one of the most important operational lessons to emerge from the current wave of adoption.

Infrastructure Becomes a Control Plane

Taken together, these ideas point toward a broader industry trend.

The next generation of AI infrastructure won't be defined solely by throughput, capacity, or latency. Those remain necessary requirements, but they are no longer sufficient.

As enterprises accumulate growing inventories of models, datasets, checkpoints, evaluations, and agents, infrastructure is being asked to play a larger role. It must help organizations understand what they have, automate governance processes, trigger security workflows, and provide visibility into how AI assets are being used.

In other words, infrastructure is evolving from a passive repository into an operational control plane for AI.

The organizations that succeed with AI at scale may ultimately be the ones that recognize this shift earliest. Building and serving models remains important. But operating them safely, efficiently, and with enough visibility to satisfy researchers, security teams, and business leaders is increasingly becoming the harder problem.
