Perspectives

Jun 19, 2026

The Hidden Challenge of AI Infrastructure: Operations at Scale

Authored by

Jim Crook, Director, Corporate Communications

One organization grew its AI-focused staff from 70 specialists to nearly 200 in four years.

Another found itself in the unusual position of having funding for AI infrastructure before researchers knew how to use it.

Neither challenge involved GPUs.

At a recent industry event, leaders from the University of Utah and Italy's national supercomputing consortium, CINECA, offered a candid look at what happens after an institution decides to invest in AI. While much of the industry conversation focuses on accelerators, model performance, and infrastructure scale, their experiences pointed to a different reality: the hardest part of building an AI factory is operating it.

The transition from traditional HPC to AI is often described as a technology shift. In practice, it looks more like an organizational transformation. Teams expand. Support models change. Infrastructure becomes more interconnected. Users arrive with different expectations. And systems that were once optimized around jobs must evolve to support end-to-end workflows spanning cloud platforms, data services, orchestration frameworks, and GPU clusters.

The technology is challenging. The operational change may be even harder.

AI Is Forcing Research Organizations to Reinvent Themselves

For research computing centers, AI introduces a new class of user.

Historically, HPC teams supported researchers running well-understood applications with established workflows. Success was largely measured by how efficiently systems could schedule and execute computational jobs.

Today's AI users arrive with different requirements. They expect access to frameworks, containers, cloud environments, orchestration platforms, data pipelines, and massive GPU resources. Many are still learning how those pieces fit together.

Sam Liston, Senior IT Architect for the University of Utah's Center for High Performance Computing, described the challenge bluntly. After the university launched its Responsible AI initiative and invested heavily in AI infrastructure, researchers were excited about the new capabilities but often lacked the experience to use them effectively.

"Our researchers were going, 'Those are pretty cool, but I have no idea how to use that.'"

The challenge quickly became one of enablement rather than infrastructure.

At CINECA, the organizational response has been equally significant. As AI became a growing part of the organization's mission, traditional HPC operational models were no longer sufficient.

"We were like 70 people," said Daniele Cesarini, CINECA's head of AI/HPC architecture. "Now we are closer to 200 who work only on AI."

The growth was not simply a matter of adding headcount. Teams were reorganized around specialized domains including AI storage, AI networking, cloud infrastructure, Kubernetes, GPUs, and user support. What had once been a relatively straightforward supercomputing operation evolved into something much more complex.

The lesson is becoming increasingly common across research institutions: deploying AI infrastructure is often easier than building the expertise required to support it.

The Unit of Work Is No Longer a Job

The organizational changes are being driven by a deeper shift in how researchers consume infrastructure.

Traditional HPC environments were designed around jobs. Researchers submitted work to a scheduler. Data was stored in files. Systems were optimized to maximize throughput and utilization.

AI workflows stretch far beyond that model.

Training, fine-tuning, inference, data preparation, retrieval systems, cloud services, and orchestration frameworks must increasingly work together as part of a larger pipeline.

"The point today is not to focus just on what is happening to my job on Slurm," Cesarini said. "It is to help our users realize the end-to-end workflow."

That requirement is forcing institutions to rethink how infrastructure is assembled and operated.

CINECA's environment now spans multiple data centers, cloud regions, AI-specific systems, traditional HPC resources, and a distributed data lake architecture. Researchers increasingly expect the ability to move seamlessly across those environments without needing to understand the underlying complexity.

As Cesarini observed, today's AI and cloud users "are not classical HPC users." They expect access to a complete platform rather than an isolated system.

The result is a growing emphasis on service-oriented architectures that connect storage, compute, orchestration, and data services into a cohesive environment.

In many ways, research institutions are becoming platform operators.

Why Reliability Is Becoming a Strategic Requirement

Perhaps the most surprising part of the discussion was how little attention was given to performance.

For organizations operating at this scale, reliability and operational simplicity emerged as recurring themes.

Liston described how the University of Utah originally adopted VAST to simplify storage operations and consolidate multiple storage tiers. As the environment expanded to support AI workloads, those operational benefits became increasingly important.

One capability he highlighted was analytics and visibility into user behavior and data movement. Understanding who is consuming resources, where bottlenecks originate, and how workloads interact has become critical in increasingly complex environments.

More revealing, however, was Liston's observation about changing user expectations.

"I think as people we become less tolerant of downtime. People just don't tolerate it."

That shift extends beyond AI. Cloud services have fundamentally altered expectations around availability. Researchers increasingly expect infrastructure to behave like the services they use every day: always available, always responsive, and largely invisible.

Liston referenced a comment from another HPC leader that resonated with him: "I like to forget that I have a file system."

For infrastructure teams, that may be the ultimate measure of success.

As environments become more complex, administrators have less time to manage individual systems and more responsibility to support users, workflows, and services. The systems that create the most value are often the ones that disappear into the background.

Operational consistency becomes every bit as important as raw performance.

The Next Phase of AI Infrastructure

The organizations represented on stage differed dramatically in scale. One supports a regional research community. The other operates national infrastructure spanning multiple data centers and hundreds of petabytes of storage.

Yet both described remarkably similar challenges.

The conversation wasn't ultimately about GPUs, storage architectures, or AI frameworks. It was about what happens to an organization when AI becomes a core service rather than a specialized workload.

Research computing centers are discovering that the next phase of AI infrastructure is not simply a question of scale. It is a question of operations: how to build teams, platforms, and services capable of supporting an increasingly complex ecosystem of users, workflows, and data.

That may prove to be the hardest AI problem of all.
