The Future Of Supercomputing Is Still a Data Problem

Authored by

Nicole Hemsoth Prickett

More processors, faster interconnects, and ever-larger GPU clusters have defined each new generation of supercomputing systems over the last two decades. But that equation is changing as AI becomes a permanent part of scientific computing. The problem is shifting (again) toward something far more complicated: building infrastructure that can simultaneously support traditional HPC, large-scale AI training, inference, and increasingly sensitive data.

That shift is evident in the design of Alps, the flagship supercomputer at the Swiss National Supercomputing Centre (CSCS).

During a recent episode of the Shared Everything podcast, Maxime Martinasso, head of engineering at CSCS, described how the center deliberately abandoned the traditional idea of a single supercomputer environment in favor of infrastructure that can be assembled differently for each scientific community.

“We started to look at reversing this equation,” Martinasso said. “Not having the user adapt to the machine, but having the machine adapt to the user and science.”

That philosophy touches nearly every layer of Alps. Instead of treating compute nodes, storage systems and networking as fixed infrastructure, CSCS designed the machine so resources can be composed into dedicated platforms with their own software stacks, schedulers and services.

Materials science researchers receive an environment optimized for high-throughput computing. Climate scientists use different scheduling policies for massive simulation campaigns. AI researchers have separate environments for training large models and serving inference workloads, including Kubernetes deployed directly onto compute nodes.
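
As a rough illustration of what composing resources into dedicated platforms could look like, the sketch below carves named platforms out of a shared node inventory. Everything in it is an assumption made for illustration: the `Platform` class, the `compose` function, the scheduler labels, the node counts, and the pairing of hardware types with research communities are hypothetical and do not describe how CSCS actually implements Alps.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """A hypothetical per-community platform definition."""
    name: str
    scheduler: str                 # illustrative label, not a real product setting
    node_request: dict[str, int]   # node type -> number of nodes wanted


def compose(inventory: dict[str, int], platforms: list[Platform]) -> dict[str, int]:
    """Reserve each platform's nodes from the shared inventory.

    Returns the nodes left unallocated. Raises ValueError if any platform
    asks for more nodes of a type than are still available.
    """
    remaining = dict(inventory)  # work on a copy so the caller's inventory is untouched
    for platform in platforms:
        for node_type, count in platform.node_request.items():
            if remaining.get(node_type, 0) < count:
                raise ValueError(
                    f"{platform.name}: not enough {node_type} nodes "
                    f"(wanted {count}, have {remaining.get(node_type, 0)})"
                )
            remaining[node_type] -= count
    return remaining


# Hypothetical inventory and platform definitions (all numbers invented).
inventory = {"grace-hopper": 8, "mi300a": 4, "cpu-only": 6}
platforms = [
    Platform("materials", "high-throughput", {"cpu-only": 4}),
    Platform("climate", "campaign", {"grace-hopper": 4}),
    Platform("ai-training", "batch", {"grace-hopper": 3, "mi300a": 2}),
    Platform("ai-inference", "kubernetes", {"grace-hopper": 1}),
]

remaining = compose(inventory, platforms)
print(remaining)
```

The point of the sketch is only that the same pool of hardware can back several differently configured environments, each with its own scheduler and software stack, rather than one fixed machine that every user must adapt to.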

The hardware itself reflects the same philosophy. Alps combines NVIDIA Grace Hopper systems, AMD MI300A processors, earlier-generation GPUs, CPU-only nodes and multiple storage technologies, allowing CSCS to match infrastructure to applications instead of forcing every workload into the same configuration.

“We can combine both the hardware you need and the software for science,” Martinasso explained.

Martinasso emphasized that this flexibility extends well beyond compute. Like most large HPC centers, CSCS spent years relying on Lustre as its primary parallel file system. Performance was never the only consideration, however. As researchers began bringing new AI workloads and increasingly diverse requirements, the storage layer itself became a limiting factor.

“Lustre has these issues,” Martinasso said. “For us the particularity is a lack of features.” Those features had less to do with raw bandwidth than operational capabilities. Different research communities required different approaches to data management, multi-tenancy, encryption and security. Rather than searching only for a faster file system, CSCS was looking for infrastructure that could support fundamentally different operating models.

“We were looking for a solution with a richer set of features,” he said. Replacing Lustre was not the original goal. CSCS simply wanted additional capabilities while maintaining performance. Early benchmarking looked promising, but real scientific applications at first fell short of Lustre’s performance.

Rather than abandoning the effort, engineers from both organizations spent months profiling workloads, tuning software and adjusting I/O behavior until application performance matched what researchers expected.

“What was very surprising,” Martinasso said, “is that by doing deep technical work and finding the right combination of parameters, we ended up having actually the same performance.”

That work eventually allowed CSCS to begin evaluating the platform as a scratch file system for production scientific workloads while retaining capabilities that traditional HPC storage had not previously provided.

Those capabilities became particularly important in a second project that Martinasso believes represents one of the most important directions for national supercomputing centers.

Healthcare organizations increasingly want access to AI infrastructure for personalized medicine, but they cannot simply upload sensitive patient information into a shared supercomputing environment. Regulatory requirements, liability and public trust make that impossible.

“The potential is so high,” Martinasso said. “But you need the data, and this data is confidential.”

Instead of becoming the controller of hospital data, CSCS designed a trusted research environment that effectively hands a secure portion of Alps over to participating institutions. Individual partitions are isolated from the rest of the system. Hospital administrators control encryption keys. CSCS provides compute capacity while deliberately preventing itself from accessing protected information.

“We should not be the data controller,” Martinasso explained. “We just enable the capability to process the data.”

That distinction fundamentally changes responsibility. If hospitals control encryption keys stored outside the supercomputing center, they retain ownership of their information while still accessing world-class AI infrastructure. Should those keys ever be destroyed, the stored information becomes unreadable. “If something happens, they just destroy the keys,” Martinasso said. “What we have is encrypted data that cannot be decrypted anymore.”
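
The key-destruction idea described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CSCS design: `HospitalKeyStore`, `CenterStorage`, `read_record` and `toy_cipher` are hypothetical names, and `toy_cipher` is a deliberately simple stand-in (an HMAC-SHA256 keystream XORed with the data) for the authenticated encryption a real deployment would use. It should not be used to protect real data.

```python
import hashlib
import hmac
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 counter-mode keystream.

    Illustration only: a stand-in for a real authenticated cipher.
    Applying it twice with the same key and nonce restores the input.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream += block.digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class HospitalKeyStore:
    """Hypothetical key store run by the hospital, outside the compute center."""

    def __init__(self) -> None:
        self._keys = {}

    def create_key(self, key_id: str) -> None:
        self._keys[key_id] = secrets.token_bytes(32)

    def get_key(self, key_id: str) -> bytes:
        return self._keys[key_id]  # raises KeyError once the key is destroyed

    def destroy_key(self, key_id: str) -> None:
        del self._keys[key_id]


class CenterStorage:
    """Hypothetical storage at the compute center: it only ever holds ciphertext."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, name: str, key_id: str, nonce: bytes, ciphertext: bytes) -> None:
        self._blobs[name] = (key_id, nonce, ciphertext)

    def get(self, name: str):
        return self._blobs[name]


def read_record(keys: HospitalKeyStore, storage: CenterStorage, name: str) -> bytes:
    """Decrypt a stored record; works only while the hospital still holds the key."""
    key_id, nonce, ciphertext = storage.get(name)
    return toy_cipher(keys.get_key(key_id), nonce, ciphertext)


keys = HospitalKeyStore()
storage = CenterStorage()
keys.create_key("hospital-a")

record = b"example patient record"
nonce = secrets.token_bytes(16)
storage.put("record-1", "hospital-a", nonce,
            toy_cipher(keys.get_key("hospital-a"), nonce, record))

assert read_record(keys, storage, "record-1") == record  # readable while the key exists

keys.destroy_key("hospital-a")
try:
    read_record(keys, storage, "record-1")
except KeyError:
    print("key destroyed: ciphertext remains but can no longer be decrypted")
```

Because the storage side never holds the key, deleting it at the hospital is enough to render every copy of the ciphertext useless, wherever those copies live.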

Without capabilities such as multi-tenancy and external key management, Martinasso believes the alternative would have required physically separate infrastructure for every participating institution.

“If we didn’t have multi-tenancy, then we would have to procure systems per hospital,” he said. “There is no way around it.”

The discussion ultimately returned to what may become the next major infrastructure challenge facing HPC.

Training foundation models consumes enormous computational resources, but scientific inference may ultimately generate even larger storage problems. Weather forecasting, scientific simulations and other AI-assisted research can produce terabytes of output in minutes. Managing, classifying and protecting those datasets could become as significant an engineering challenge as building larger GPU clusters.

“When you start to do science,” Martinasso said, “suddenly you can generate huge amounts of data… 10 terabytes of data per minute.”
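
To put the quoted figure in perspective, here is a back-of-the-envelope conversion. It assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB) and, purely for illustration, that the rate were held constant; the quote does not claim a sustained rate, and the function names are hypothetical.

```python
def bandwidth_gb_per_s(tb_per_minute: float) -> float:
    """Convert a data rate in TB per minute to GB per second (decimal units)."""
    return tb_per_minute * 1000 / 60


def volume_pb(tb_per_minute: float, minutes: float) -> float:
    """Total volume in PB if the rate were held for the given number of minutes."""
    return tb_per_minute * minutes / 1000


print(f"{bandwidth_gb_per_s(10):.1f} GB/s")       # 166.7 GB/s
print(f"{volume_pb(10, 60):.1f} PB per hour")     # 0.6 PB per hour
print(f"{volume_pb(10, 24 * 60):.1f} PB per day")  # 14.4 PB per day
```

Even a short burst at that rate means the storage layer has to absorb well over a hundred gigabytes every second.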

That explosion creates demands not only for storage capacity but for resilience, security, lifecycle management and efficient sharing across research communities. AI may have increased demand for compute, but its long-term effect may be to elevate data infrastructure from supporting technology to primary infrastructure.

For much of HPC’s history, storage was expected to keep up with compute. Increasingly, compute may become the easy part. The harder challenge is building infrastructure flexible enough to manage, secure and serve the unprecedented volumes of data that modern AI-driven science will continue to generate.
