Perspectives
Dec 30, 2025

2026: The Year of AI Inference


Authored by

Derrick Harris, Technology Storyteller

Drafting off 2025 as a year of massive investment in AI infrastructure, 2026 is poised to be the year that the business of AI inference kicks into high gear. Not just with simple LLM prompts and image generation, but with mission-critical workloads that take advantage of generative AI models as a key component of the enterprise and scientific application stacks.

We’re confident this shift will happen because we’re seeing the types of things our customers are doing and hearing about what they’d like to do. We’re also working closely with our partners to enable leading-edge inference capabilities across multiple platforms — the products of demand from their customers. It’s a natural evolution as model builders, infrastructure vendors, and users have had a few years to talk, collaborate, and mature alongside each other.

So with that, here are five trends to keep an eye on in 2026, based on VAST’s work with some of the world’s leading AI users and developers.

If you want to learn more about the future of AI and network with a who’s who of practitioners and experts, attend our VAST Forward user conference — Feb. 24–26 in Salt Lake City! Register here.

1. The emergence of AI agents — for real and at scale

Although some large language models have exhibited agentic behavior for a while now, particularly as it relates to reasoning capabilities and user-defined skills, we’ve yet to see many AI agents operating at their full potential. In this case, that means looking beyond the LLM boundaries and executing complex tasks (1) in real time; (2) across a range of models and tools; and (3) autonomously, without human prompting. But that’s changing fast.

Probably the easiest places to find evidence are ecommerce, where agentic protocols are facilitating frictionless purchases directly from chat sessions, and the Linux Foundation’s Agentic AI Foundation, which includes popular developer-focused projects and tools such as MCP (from Anthropic) and AGENTS.md (from OpenAI). Where there’s developer interest paired with open source tools and money to be made, innovation tends to follow.

However, VAST’s customers are also beginning to take agentic deployments to another level in terms of scale — something made possible by our unique data architecture. For example, we’ve detailed a large city government that’s improving public safety by using AI agents for video analysis. Each new video file triggers a pipeline that chunks, vectorizes, and analyzes surveillance videos in real time, before ultimately suggesting a course of action or even acting autonomously to alert authorities or emergency services.
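The trigger-per-file pipeline described above can be sketched in miniature. Everything here is illustrative: the chunking size, the hash-based "embedding," and the alert threshold are stand-ins for a real video chunker, embedding model, and analysis model, not any actual VAST or customer implementation.

```python
import hashlib

def chunk(frames, size=4):
    """Split a stream of frames into fixed-size chunks."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def vectorize(chunk):
    """Stand-in embedding: hash each frame label into a score in [0, 1]."""
    return [int(hashlib.md5(f.encode()).hexdigest(), 16) % 100 / 100 for f in chunk]

def analyze(vector, threshold=0.9):
    """Flag a chunk whose peak score exceeds the alert threshold."""
    score = max(vector)
    return {"score": score, "alert": score > threshold}

def on_new_video(frames):
    """Pipeline triggered per video file: chunk -> vectorize -> analyze."""
    results = [analyze(vectorize(c)) for c in chunk(frames)]
    return [r for r in results if r["alert"]]

frames = [f"frame-{i}" for i in range(12)]
alerts = on_new_video(frames)
print(f"{len(alerts)} chunk(s) flagged for review")
```

In a production system, the `on_new_video` step would be wired to an event notification on file creation, and the flagged results would feed the downstream "suggest or act" stage.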

VAST is also working with partners like Leidos and NVIDIA on cybersecurity, analyzing and acting on network traffic in real time. Whereas human security professionals can’t possibly keep up with the pace of suspicious network activity, AI agents can. By triaging threats as they arise, these agents free up human experts to focus on resolving issues, assessing incidents, and building safer systems.

2. The fluid cost of AI operations

The cost of actually running AI applications is always a moving target, and there’s no sign that will change in 2026. Everybody involved — model builders, cloud providers / inference platforms, vendors, and end-users — continues adapting to new models, hardware, and usage patterns that alter the economics of their particular workloads. Even as per-token costs fall, for example, larger context windows and reasoning models might result in more tokens (and, thus, more compute usage) for each task.
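The falling-price, rising-usage dynamic is easy to see with arithmetic. The prices and token counts below are made-up illustrative numbers, not published rates for any model.

```python
# Illustrative numbers only: prices and token counts are assumptions,
# not real published rates.
old_price_per_1k = 0.010   # $ per 1K output tokens, last year's model
new_price_per_1k = 0.004   # $ per 1K output tokens, cheaper new model

old_tokens_per_task = 2_000    # short answer, no visible reasoning
new_tokens_per_task = 12_000   # long reasoning chain plus larger context

old_cost = old_tokens_per_task / 1_000 * old_price_per_1k
new_cost = new_tokens_per_task / 1_000 * new_price_per_1k

print(f"old: ${old_cost:.3f}/task, new: ${new_cost:.3f}/task")
# Per-token price fell 60%, yet per-task cost rose 140%.
```

This is why budgeting by per-token price alone misleads: the unit that matters economically is cost per completed task.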

One solution for stabilizing the cost of AI operations is to offload as much work as possible onto a platform like VAST. For example, although GPUs are necessary for running most AI inference tasks, they have finite memory that fills up fast when generating tokens. Offloading the KV cache to SSD storage helps free up GPU resources for the work that needs them, while maintaining high performance at a much lower price point.
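Conceptually, KV cache offload is a tiered cache: a small, fast tier (GPU memory) backed by a larger, cheaper tier (SSD). The toy sketch below uses a Python dict and local files as stand-ins for those tiers; the class name, slot count, and eviction policy are all assumptions for illustration, not how any real inference engine implements it.

```python
import os
import pickle
import tempfile

class TieredKVCache:
    """Toy two-tier cache: a small in-memory dict that spills the
    oldest entries to local disk, standing in for GPU memory vs. SSD."""

    def __init__(self, mem_slots=2):
        self.mem_slots = mem_slots
        self.mem = {}                       # "GPU memory" tier
        self.disk_dir = tempfile.mkdtemp()  # "SSD" tier

    def put(self, key, value):
        if len(self.mem) >= self.mem_slots:
            # Evict the oldest in-memory entry to disk (dicts preserve
            # insertion order, so the first key is the oldest).
            old_key, old_val = next(iter(self.mem.items()))
            del self.mem[old_key]
            with open(os.path.join(self.disk_dir, old_key), "wb") as f:
                pickle.dump(old_val, f)
        self.mem[key] = value

    def get(self, key):
        if key in self.mem:
            return self.mem[key]            # fast path
        path = os.path.join(self.disk_dir, key)
        with open(path, "rb") as f:         # slower "SSD" path
            return pickle.load(f)

cache = TieredKVCache(mem_slots=2)
for i in range(4):
    cache.put(f"seq{i}", [i] * 3)           # pretend these are KV tensors
print(cache.get("seq0"), cache.get("seq3"))
```

The economic argument is the same at real scale: the expensive tier stays small and hot, while the bulk of cached state lives on storage that costs far less per gigabyte.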

VAST enables cost stabilization across other parts of the AI stack, as well. Users can switch cloud GPU providers or use several at once — to balance price, performance, and workload requirements — without incurring data egress fees or suffering performance degradation. And they can utilize our integrated event broker, vector database, and data lake functionality to achieve massive scale and throughput without the complexity of standalone, shared-nothing data systems.

3. The evolution of model and hardware architectures

If there have been two ground truths in AI over the past several years, they’re that transformer models are king and that the highest-tier GPUs are always the best. Both are fading before our eyes. On the model front, in fact, Hugging Face CEO Clem Delangue recently suggested that although the industry is not presently experiencing an AI bubble, it might be experiencing an LLM bubble.

That’s not to say LLMs aren’t useful, just that they (and huge transformer-based foundation models, in particular) are not the right lens through which to view all problems. It’s difficult to argue with this assessment, especially when considering the still relatively untapped potential of video-language models and other multimodal models, as well as other approaches that might be better suited to challenging scientific workloads.

Model architecture and use case have everything to do with hardware architecture. Take NVIDIA’s recently announced Nemotron 3 models as an example: It’s a family of hybrid Mamba-transformer LLMs that combine the strengths of each architecture to maximize compute efficiency while still delivering world-class accuracy. And the Nemotron 3 Nano model can run on a wide range of NVIDIA GPU architectures beyond the high-end gear typically required for large-scale training.

More broadly, though, AI inference workloads simply don’t require the same performance (in terms of FLOPS) as do training runs. For this reason, you’ll often see hyperscalers and AI labs reserve the latest-generation GPUs for training foundation models, while recommissioning older generations to run inference jobs and even train smaller models. This is also why VAST is working with a wide range of hardware partners to ensure maximum compatibility and performance with their various platforms.

4. The crystallizing data stack for AI applications

Data architecture is an underappreciated aspect of running AI, but this should change as production inference workloads begin emerging en masse across a wider range of industries. The issue isn’t that we don’t have the right component pieces — databases, data lakes, and event streaming aren’t going anywhere. Rather, it’s that many current incarnations of these systems were built for batch processing and web applications, not real-time AI applications.

In practice, this results in data architectures consisting of multiple independently managed systems (e.g., Kafka, Spark, a data warehouse, and multiple databases), each with its own pitfalls around operations (including cost), scalability, and performance. Production AI inference workloads will require a data platform intentionally designed for massive (and simple) scalability, high throughput, and low latency.

The difference is more than just adding model calls into the pipeline. Agentic workloads, for example, might need fast access to multiple sources of historical data, in addition to fresh data just vectorized and available for RAG. What’s more, they can produce an incredible volume of logs and other internal communications, all of which must be processed, stored, and analyzed in real time if you want to identify issues before they waste resources or cause other headaches.
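The "fresh data available for RAG" step boils down to one retrieval path that works over both historical and just-ingested records. The sketch below is a deliberately tiny stand-in: the character-frequency "embedding" and the document names are assumptions for illustration, in place of a real embedding model and index.

```python
import math

def embed(text):
    """Toy embedding: normalized bag-of-letters frequency vector
    (stand-in for a real embedding model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity of two unit-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

# One index holding both historical documents and a just-ingested record,
# queried the same way an agent would before calling the model.
index = [
    ("runbook",  embed("restart the ingest service after deploys")),
    ("incident", embed("ingest service crashed during last deploy")),
    ("fresh",    embed("new alert: ingest latency spiking right now")),
]

query = embed("why is the ingest service slow")
ranked = sorted(index, key=lambda d: cosine(query, d[1]), reverse=True)
print("top match:", ranked[0][0])
```

The point of the sketch is the shape of the problem: if fresh data lands in a separate system from historical data, the agent needs two retrieval paths and a merge step; a unified index keeps it to one.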

VAST has been building out just this type of platform for a while with our DataEngine, SyncEngine, and InsightEngine features, and we expect it’ll drive serious wins for our customers over the next year. Generative AI represents a rare opportunity to capture new business opportunities; it shouldn’t be weighed down by data infrastructure from the Hadoop era.

5. “Boring” work takes center stage

Certain things need to happen when any application moves from project to production, and that doesn’t change with AI. Production AI workloads must also be secure and highly available, although achieving these things might prove trickier than some folks assume. Observability and overall data governance could be more important than ever.

Where AI differs from traditional web applications is in the types of things that agents and LLMs can do, and how they communicate with other parts of the system. In addition to concerns around latency and throughput, for example, a fleet of agents communicating with each other, with data stores, and with external tools or models will also create a mountain of events to log, store, and analyze (ideally in real time).

AI models and agents also might need differing access levels and permissions for different types of data. Enforcing these types of rules has proven difficult enough for human users in the face of data sprawl; protecting sensitive data will be markedly more difficult with agentic workflows.
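Enforcing per-agent access rules comes down to a policy check on every data fetch, with deny-by-default for anything unlisted. The roles, data classes, and store below are hypothetical names for illustration only.

```python
# Hypothetical policy table: agent roles mapped to the data classes
# each is allowed to read. Names are illustrative, not a real schema.
POLICY = {
    "support-agent": {"tickets", "kb"},
    "finance-agent": {"tickets", "invoices"},
}

def can_read(agent_role, data_class):
    """Deny by default: unknown roles or data classes get no access."""
    return data_class in POLICY.get(agent_role, set())

def fetch(agent_role, data_class, store):
    """Gate every read through the policy check before touching data."""
    if not can_read(agent_role, data_class):
        raise PermissionError(f"{agent_role} may not read {data_class}")
    return store[data_class]

store = {"tickets": ["t1"], "invoices": ["inv-9"], "kb": ["faq"]}
print(fetch("support-agent", "kb", store))
```

The hard part at scale isn't the check itself but keeping one policy table authoritative across every store an agent can reach — which is the consolidation argument in the next paragraph.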

The natural response to any and all of these concerns might be to add more shards, more tools, and more processes. Of course, doing so also means introducing more failure points and higher latency. Given the dynamic nature of AI workloads, consolidation looks like a much better choice. A single data platform designed for multitenancy and data security at AI scale means less infrastructure babysitting and more focus on implementing the right policies.

Learn more at VAST Forward

But because we can only go so deep in a blog post, we want to invite you to attend our inaugural user conference — VAST Forward — Feb. 24–26 in Salt Lake City. There, you can kick off the new year by hearing about all these topics and more, straight from our customers, engineers, and industry experts. It’s a great opportunity to network, learn about the future of AI infrastructure, and set up your AI strategy to succeed.
