Jul 18, 2025

Q&A: Glenn Lockwood Joins VAST With Eye on Next-Gen AI Infrastructure Strategy

Author: Nicole Hemsoth Prickett, Head of Industry Relations

We’re excited to welcome Glenn K. Lockwood as Principal Technical Strategist at VAST Data, where he will bring his deep expertise in large-scale infrastructure to help organizations address the demands of extreme-scale workloads.

His career has focused on applied R&D in parallel systems, with particular expertise in scalable architectures, performance modeling, and emerging technologies for I/O and storage.

Glenn comes to VAST from Microsoft, where he supported the design and operations of the Azure supercomputers that trained many leading LLMs. Prior to that, he led the design of several large-scale storage systems at NERSC, including the world’s first 30+ PB all-NVMe Lustre file system for the Perlmutter supercomputer. He holds a Ph.D. in Materials Science.

Nicole: You’ve worked across research, large supercomputing centers, and industry; what have these different roles taught you about designing systems to scale with unpredictable AI demands?

Glenn: Going really fast in a straight line isn’t as valuable as it used to be. Flexibility and agility in infrastructure are becoming equally important, since the workload is evolving faster than technology can keep up.

It wasn’t long ago that designing for performance along a few key dimensions got you most of the way to a solid system that could solve most problems. Maximize FLOPS, buy as much memory bandwidth as you could afford, slap a bandwidth-optimized parallel file system down, and interconnect it all with a low-latency network: this was always the best way to design a system. As a result, system design really meant spending a lot of time fine-tuning the details of this recipe to get the best possible value.

The rise of AI has changed all of this, and it’s common to see a system that was designed for training suddenly convert to being used for inference after only a year.

While the GPUs used for training and inferencing might be the same, the data demands of training (checkpointing, batch loading) and inferencing (RAG, key/value caching) are wildly different. As a result, performance optimized exclusively for synchronous model training isn’t worth much unless it is accompanied by the flexibility to also deliver performance and efficiency for, say, vector search and whatever is right around the corner.

Listen to Nicole and Glenn continue the conversation in a new podcast on Shared Everything.

Nicole: You’ve spent your career solving large-scale infrastructure problems, what do you consider the most technically challenging shift happening now in AI infrastructure?

Glenn: New technologies are always hard, and the rapid pace of innovation being driven by AI isn’t making things easier. However, the biggest challenge doesn’t lie in any specific piece of hardware or software in large-scale infrastructure; it’s in all the connective tissue that turns a bucket of parts into a system. It can take months for developers and operators to build all the automation and integrations necessary to make new compute or data platforms run efficiently at scale, but the companies developing and deploying models can’t wait that long. The end result is that many large-scale infrastructure providers are building the airplanes as they fly them, and this is a brutal environment to work in.

I’m not sure there’s an end in sight since the rate of innovation in AI isn’t slowing down, either.

The only ray of hope is that some of the people making the airplane parts are paying more attention to the rough edges of what they’re providing.

If we’re all going to continue building planes in mid-air, working with parts that fit together easily–via open standards, API-driven interfaces, and transparency–will become the minimum bar as infrastructure providers continue to scale.

Nicole: Architecturally, what’s the most interesting development you’ve seen on the data side of systems for AI over the past year or two?

Glenn: Understanding how to deliver high-performance, cost-efficient inferencing went from 0 to 100 almost overnight as really strong open-weight models began hitting the market and people had to start figuring out complexities that were otherwise hidden behind a simple inferencing REST API.

The hardest data challenges at scale went from “read a big file really fast” to “search a vector index, rank and filter documents, load cached KV activations, cache new KV activations, and repeat.” Not only are access sizes and patterns different across memory and storage, but application developers are expecting data access modalities that are much richer than the simplistic bit streams offered by files.
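To make that shift concrete, here is a minimal, self-contained toy sketch of the kind of inference-time data path Glenn describes: vector search, ranking, and key/value cache reuse. Every name in it (the stand-in embedding, the in-memory index, the KV cache dictionary) is an illustrative assumption, not any particular product's API.

```python
# Toy sketch of an inference-time data path: vector search -> rank/filter ->
# reuse cached KV state -> generate -> cache new KV state for the next turn.
# All names here (embed, INDEX, KV_CACHE, answer) are illustrative assumptions,
# not any particular vendor's API.
import hashlib
import math

DOCS = {
    "doc1": "Checkpointing strategies for large-scale model training.",
    "doc2": "Key/value caching to speed up transformer inference.",
    "doc3": "Vector search for retrieval-augmented generation (RAG).",
}

def embed(text: str, dim: int = 16) -> list[float]:
    """Stand-in embedding: hash the text into a fixed-length vector."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

INDEX = {doc_id: embed(text) for doc_id, text in DOCS.items()}  # in-memory vector index
KV_CACHE: dict[str, str] = {}  # prompt fingerprint -> cached "KV state"

def answer(prompt: str, top_k: int = 2) -> str:
    # 1. Search the vector index and rank candidate documents.
    query = embed(prompt)
    ranked = sorted(INDEX, key=lambda d: cosine(query, INDEX[d]), reverse=True)[:top_k]
    # 2. Try to load previously cached KV state for this prompt + context.
    key = hashlib.sha256((prompt + "|".join(ranked)).encode()).hexdigest()
    kv_state = KV_CACHE.get(key)
    if kv_state is None:
        # 3. Cache miss: "prefill" and store the new KV state for reuse.
        kv_state = f"kv-state-{key[:8]}"
        KV_CACHE[key] = kv_state
    # 4. Generate a response from the retrieved documents and KV state (stubbed).
    context = "; ".join(DOCS[d] for d in ranked)
    return f"[{kv_state}] answer grounded in: {context}"

print(answer("How do I speed up transformer inference?"))
print(answer("How do I speed up transformer inference?"))  # second call reuses the cached KV state
```

Even this toy makes the point: each response touches an index lookup, ranking and filtering, and a cache, so the data layer sits in the critical path of every token served rather than only at checkpoint time.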

This isn’t to say that accessing structured data is new, but I don’t think it’s ever been in the critical path of a high-performance application like it is with inferencing.

And since inferencing is how model developers ultimately realize the return on their investment in AI, this gnarly new data challenge is in the critical path of both performance engineers and bean counters.

We are now in a world where the CFO who blocked GPU purchases over cluster utilization may soon start questioning vector query latency as a key driver of overall cost efficiency.

Nicole: What technical assumptions or accepted wisdom about AI infrastructure do you find yourself increasingly skeptical about?

Glenn: People love to speculate that a pressure relief valve is right around the corner, and that the demand for AI infrastructure will downshift as soon as there’s a new DeepSeek-like moment. What they don’t appreciate is Jevons paradox–the observation that as a resource gets cheaper, its consumption goes up.

The growth of AI is probably going to be a one-way street, and as it gets better, there will be more GPUs, more networks, more datacenters, and more power. And even if AI stops getting better, society has only begun to scratch the surface of the ways in which it can improve productivity and the quality of life for people around the world.

As a result of this, the systems and technologies we have today are undoubtedly insufficient for the future regardless of what new models are released. Absolute performance is good, but flexibility will be better in every aspect of AI infrastructure.

This is why GPU demand is so high despite the existence of faster bespoke accelerators, and it will be why every other part of the stack–from networks to storage and even to racks, cooling systems, and power distribution–will shift from “tried and true” to “fast and flexible.”

Nicole: With your systems-level perspective, what’s the single most critical thing infrastructure architects must get right in the next five years to keep pace with AI?

Glenn: “Nobody got fired for buying XYZ” does not apply in this new world of AI, because it reflects designing systems for a workload that evolved over years or decades, not weeks or months.

All of the companies at the forefront of AI today are there because of big bets they continue to make year over year. Nobody designing AI infrastructure really knows what they’re doing; nobody knows what the ideal system for training or inferencing will look like in five years. The only certainty is that buying XYZ for the next five years solely because it worked for the last five years is not a big bet, and it will likely constrain the ways in which you can innovate.

So, the single most critical thing is to get comfortable embracing the unknown: accept that not every decision will be the right one, and maintain agility as much as possible.

In practice, this means selecting technologies that have the strongest foundations to be repurposed for wildly different workloads and that can be expanded quickly when the next breakthrough happens.

It also means working with people who want to learn it all, not those who already know it all, because the learners are the ones most likely to be ready for the future when it arrives.

Nicole: I have to ask, in light of all of these answers, Glenn: why VAST, and why this moment?

Glenn: When I decided it was time to try something new, I wanted my next step to build on what I’ve accomplished in my career. Technically, this meant working with a technology stack that was not only ready for the extreme demands of today’s hyperscale AI workloads, but also headed down the right path to meet whatever comes around the corner.

But personally, it also meant building on the experience I developed in both designing systems to enable complex scientific workflows at NERSC and supporting hyperscale AI infrastructure at Microsoft. And I wanted to work with people who shared the same values as me: leading with facts and clarity over bold claims, learning and sharing in equal parts, and racing towards hard problems.

I have worked with VAST continuously since early 2018, in capacities ranging from customer to cloud partner, because I have been continually impressed with the strength of the DASE architecture and the innovation it enables.

Over the last seven years, I have also gotten to know many of the people at VAST–ranging from founders to engineers to sales directors–and I have been universally impressed by the ethos of its people.

When it came time to decide on my next steps, VAST was the clear choice: not only does it deliver zero compromises to its customers, but it is a place where I wouldn’t have to compromise with myself.

Follow Glenn and VAST to stay up to date on the latest in AI, data infrastructure, and next-gen computing.

Stay current with Nicole on Shared Everything as she explores emerging AI trends and data insights to help tech leaders build smarter, faster, and more sustainable digital futures.

