As someone who’s been going to GTC since its inception (back when the goal was to convince supercomputing shops they could offload some science to a gaming graphics card), watching the progression from a handful of accelerated systems to countless AI factories is striking. Over the years there have always been a few themes that stood out, usually tracking where users were heading (porting of early applications, industrial and scientific use cases, and so on). What’s different about GTC 2026 is how fine-grained the focal points have become and how deep they run, at least from this view.
Last week’s GTC sessions exposed a clear shift in how AI systems are actually being built. Inference now drives system behavior. Context persists and has to be reused. Recompute shows up as a real cost, pipelines are splitting into prefill and decode, work is routed dynamically based on where data and context live, and power limits are shaping batching and reuse decisions.
The thematic takeaway from GTC 2026? The stack itself is compressing as layers collapse under the cost of moving and duplicating data. Taken together, this is a system that is continuous, stateful, coordinated, and constrained by efficiency rather than raw compute.
The parallels between these GTC themes and what VAST has been saying are abundantly clear. What the themes imply is the very system VAST CEO and founder Renen Hallak and cofounder Jeff Denworth described at VAST FWD. What showed up last week is that same architecture, now visible across the industry.
Here are some of the most noteworthy themes and observations from GTC 2026.
Inference Is Now the Dominant Systems Problem
Inference isn’t suddenly important out of nowhere, of course, but if one thing was clear last week it’s that everything now bends around it. That pressure shows up in the constraints that define the largest and fastest-growing AI systems.
Latency and throughput per watt define how far a system can scale, and request orchestration becomes constant work as workloads arrive unevenly, with different context lengths, different timing expectations, and different cost profiles. The interaction pattern itself has shifted toward multi-turn flows that accumulate state over time.
The underlying shift here (again, not new but increasingly important) is that training is finite and inference is continuous.
All this means we are no longer designing for peak events that can be easily scheduled and completed. We’re now designing for persistent load with variability, where every request arrives with different requirements and the system has to respond in real time. Scheduling becomes routing under constraint, memory hierarchy becomes a live decision about what must stay close to compute, and data placement becomes part of performance, not an implementation detail. And above all (as we’ll cover in a minute), caching becomes a requirement.
The KV Cache Is the New Bottleneck Layer
There’s a pattern emerging: context windows are expanding, recompute costs are rising with them, and GPUs are spending cycles regenerating state the system has already processed. So what looks like a compute problem is actually the cost of failing to retain and reuse context.
As workloads shift toward multi-turn interactions, it’s far less about the single request and more about the accumulated state behind it. Every additional token compounds, and every failure to reuse it forces the system to pay the full compute cost again. At scale, this shows up as utilization loss, where expensive resources are tied up repeating prior work instead of advancing the current task.
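To make the economics concrete, here’s a minimal sketch of prefix reuse in a multi-turn flow. The names and the cache itself are hypothetical stand-ins, not any particular inference engine’s API; the point is just the cost model: tokens whose state is already cached don’t pay the prefill cost again.

```python
import hashlib

# Toy prefix cache, purely illustrative: maps a conversation prefix to the
# (pretend) KV state produced when that prefix was last processed.
kv_cache = {}

def prefix_key(tokens):
    """Stable key for a token prefix."""
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

def serve_turn(history_tokens, new_tokens):
    """Serve one turn; return (tokens reused from cache, tokens recomputed)."""
    if prefix_key(history_tokens) in kv_cache:
        reused, recomputed = len(history_tokens), len(new_tokens)
    else:
        reused, recomputed = 0, len(history_tokens) + len(new_tokens)
    # Persist the extended prefix so the next turn can pick up where this one left off.
    full_prefix = history_tokens + new_tokens
    kv_cache[prefix_key(full_prefix)] = {"prefix_len": len(full_prefix)}
    return reused, recomputed

history = list(range(32_000))          # a 32k-token conversation so far
print(serve_turn(history, [1, 2, 3]))  # cold: (0, 32003) -- the whole context is recomputed
print(serve_turn(history + [1, 2, 3], [4, 5]))  # warm: (32003, 2) -- only the new tokens pay
```

The second call is the difference the sessions kept circling back to: with the prior turn’s state retained, the GPU touches two new tokens instead of thirty-two thousand.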
This is where the system breaks from the model-centric view. As VAST CEO Renen Hallak said on stage at VAST FWD a month before GTC, context is part of the workload and has to be persistent, indexed, and retrieved with the same expectations applied to any other critical data structure. If it is treated as ephemeral, performance degrades in proportion to context length and efficiency collapses under redundancy.
The constraint shifts accordingly. It’s way less about how fast you can compute and more about how effectively you can retain and reuse what you have already computed. That moves the problem into the all-important data layer, and once that becomes visible, the rest of the system can reorganize around it.
When you think about it this way (and the industry is) you’d think VAST had a crystal ball…
Disaggregation Is Not a Design Preference Anymore
Disaggregation is a response to conditions the system cannot absorb, which means that at scale it’s not much of a choice. This shows up in the separation of prefill and decode, the independent scaling of compute, memory, and storage, and in the network becoming a scheduling constraint rather than a passive layer.
The system is being forced to separate concerns because the workload no longer fits inside a single, uniform shape.
As mentioned before, monolithic GPU clusters break under these conditions. Uneven request sizes create imbalance that brute force cannot smooth. Dynamic workloads shift faster than static allocations can adapt. Real-time inference requirements leave no margin for inefficiency. What looks like a scaling problem turns into a coordination problem, and if one thing is certain, it’s that monoliths aren’t good at coordination under pressure.
So the natural response is to pull the system apart along functional lines. Compute is isolated so it can be placed precisely where it is needed. Memory is treated as its own layer, tied directly to the workload that depends on it. Storage extends that memory hierarchy rather than sitting at a distance. The network becomes central to how work is routed and how state is accessed.
The goal is not to create more pieces but to remove the couplings that introduce latency and inflexibility under continuous load.
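A rough sketch of what that separation looks like for the prefill/decode split mentioned above. Everything here is a hypothetical stand-in (the queue, the state dicts), not a real serving framework; what matters is that the two stages become independent pools connected by a handoff, so each can be scaled, placed, and power-capped on its own.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class PrefillResult:
    request_id: str
    kv_state: dict          # stand-in for the attention state produced by prefill

handoff: Queue = Queue()    # in a real system this hop is a network or memory-tier transfer

def prefill(request_id: str, prompt_tokens: list) -> None:
    """Compute-bound pass over the whole prompt; runs on the prefill pool."""
    kv_state = {"prefix_len": len(prompt_tokens)}   # pretend this is real KV state
    handoff.put(PrefillResult(request_id, kv_state))

def decode(max_new_tokens: int = 4) -> list:
    """Latency-sensitive generation loop; runs on the decode pool."""
    item = handoff.get()
    tokens = []
    for step in range(max_new_tokens):
        item.kv_state["prefix_len"] += 1            # each step extends the cached state
        tokens.append(step)                         # stand-in for a sampled token
    return tokens

prefill("req-1", prompt_tokens=list(range(32_000)))
print(decode())   # generation proceeds without re-running the 32k-token prefill
```

Because the handoff carries state rather than work, a burst of long prompts lands on the prefill pool without stalling token generation for conversations already in flight.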
Orchestration Is Becoming the Control Plane of AI
Here again it feels like VAST’s talented team of engineers had a crystal ball a few years ago. They saw the orchestration monster coming and built a stronghold.
First, they saw that the limiting factor is no longer the speed of a single GPU. It is how the system decides what happens next. Work has to be routed across uneven resources. Context has to live somewhere predictable and accessible. Compute has to be invoked at the right moment, and all that context has to persist across requests that are no longer independent.
At scale, every request becomes a decision: where it runs, what context it requires, what can be reused, what must be recomputed. These decisions happen continuously under latency and resource constraints. Once workloads become continuous and stateful, static scheduling breaks down. The system has to respond in real time, adjusting to changing demand and evolving state. Orchestration becomes the mechanism that governs that behavior. Not as a background layer, but as the core logic of the system.
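As a minimal sketch of that decision logic, here’s a toy router that weighs where a request’s cached context already lives against current load. The node names, weights, and scoring are assumptions chosen for illustration, not any vendor’s scheduler.

```python
# Hypothetical cluster state: how busy each node is, and which conversations
# already have cached context resident there.
nodes = {
    "node-a": {"load": 0.6, "cached_contexts": {"chat-42"}},
    "node-b": {"load": 0.2, "cached_contexts": set()},
}

def route(context_id: str, recompute_cost: float = 5.0, load_cost: float = 1.0) -> str:
    """Pick the node with the lowest estimated cost for this request.

    A cache miss is priced as a recompute penalty; load is priced linearly.
    The weights are placeholders for real latency and energy models.
    """
    def cost(name: str) -> float:
        node = nodes[name]
        miss = 0.0 if context_id in node["cached_contexts"] else recompute_cost
        return miss + load_cost * node["load"]
    return min(nodes, key=cost)

print(route("chat-42"))   # -> node-a: reusing cached state beats node-b's lighter load
print(route("chat-99"))   # -> node-b: no cached state anywhere, so load decides
```

Trivial as it is, this is the shape of the control plane the sessions described: placement follows state, not just capacity.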
This no longer looks like infrastructure composed of independent components. It looks like an operating layer that determines how data, memory, and compute interact. The primary function is no longer to run code efficiently, but to decide how and where that code should run in the first place.
Efficiency Per Watt Is Everything
This is not a sustainability pitch or vague green computing talk in 2026. This is the hard limit that shapes what systems can do.
You see it in inference batching strategies designed to extract more work from the same energy envelope, in workload consolidation to avoid idle capacity, in aggressive memory reuse to eliminate recompute. Every redundant operation now carries a direct energy cost.
Power ceilings cap deployment, cooling constraints limit density, and at the end of the day, the cost of inference becomes inseparable from the cost of sustaining the infrastructure itself.
This pushes efficiency into the core of the architecture. Recompute is no longer just wasteful, it is expensive. Moving data unnecessarily is no longer tolerable, it is limiting. Systems that can reuse state, minimize movement, and operate within tight energy budgets are not just faster, they are viable.
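A back-of-envelope way to see why. Every constant below is an assumed round number chosen to make the arithmetic easy to follow, not a measurement for any particular GPU or model.

```python
# All figures are assumptions for illustration, not measurements.
flops_per_token_prefill = 2e12   # assumed prefill cost per token of context
joules_per_flop = 1e-12          # assumed effective energy per FLOP
context_tokens = 32_000          # accumulated conversation context
turns = 10                       # turns in the conversation

# Without reuse, every turn re-pays prefill over the context (held constant here for simplicity).
recompute_joules = turns * context_tokens * flops_per_token_prefill * joules_per_flop

# With reuse, the context is prefilled once and extended incrementally.
reuse_joules = context_tokens * flops_per_token_prefill * joules_per_flop

print(f"recompute: {recompute_joules / 1000:.0f} kJ, reuse: {reuse_joules / 1000:.0f} kJ")
# Under these assumptions: roughly 640 kJ per conversation vs roughly 64 kJ with reuse.
```

The exact numbers don’t matter; the ratio does. Multiply a roughly 10x difference per conversation across a facility running at its power ceiling and reuse stops being an optimization and becomes the budget.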
The Stack Is Collapsing Into Fewer Layers
This shift is subtle and might get overlooked in favor of some of the more specific concerns, but it is consistent.
What used to exist as separate systems (storage, database, cache, streaming, orchestration) is converging into platforms that handle multiple roles simultaneously. The boundaries still exist conceptually, but they are being pulled inward or eliminated entirely.
Each boundary carries cost. Data movement introduces latency. State duplication creates inconsistency. Operational overhead increases as more systems have to be coordinated and maintained. Under continuous inference load, those costs accumulate quickly and show up directly in performance.
The response is integration. Systems are being designed to handle data persistence, access, and execution within a tighter loop. The distance between where data lives and where it is used is shrinking. The goal is not consolidation for simplicity, but the removal of penalties imposed by unnecessary separation.
What emerges is a more unified platform, cohesive but not a monolith in the traditional sense. What VAST has built is a system where functions once distributed across layers operate together by default, an architecture reshaped to reflect the realities of continuous AI.



