Thought Leadership
May 11, 2026

The Scale Up vs. Scale Out Dilemma: Rethinking Architecture for Next-Gen AI Video Intelligence

Authored by

Nicole Hemsoth Prickett, Head of Industry Relations

TwelveLabs is rooted in video, a category of data that has always been ahead of the systems built to work with it: meaning is spread across visual, audio, and temporal signals that only resolve when processed together over time.

That has always been a challenge, and the sheer growth in video volume (to say nothing of its complexity) demands next-level thinking about the systems needed to handle large-scale AI video as a service, which TwelveLabs already does, as well as to serve ultra-secure, regulated, and air-gapped on-prem environments.

Multimodal data (video, audio, temporal signals) has long been central to analysis but difficult to work with at scale. Rather than treating that as a limitation, the company’s Head of Growth Maninder Saini tells us it became the starting point: the goal was to design models and systems around how this data actually behaves, instead of forcing it into structures that were never built for it.

In these environments, interpretation is the hard part. Video, audio, and related signals only make sense when viewed together and over time. While many have tried to simplify the problem by breaking data into smaller pieces, tagging it, or sampling it, those approaches strip away the relationships that define what is actually happening.

As data grows in volume and complexity, the gap between what is captured and what can be understood widens.

Saini says the founders learned early on that this problem appears across industries, anywhere there is high-volume or high-value video data, such as media, entertainment, sports, and defense. Data is stored and moved, but not fully used, because systems lack the context to interpret it.

Video brought this challenge into sharper focus at a much larger scale, and the TwelveLabs team took a different approach to solving it.

Video Breaks the System Assumptions

“Video is a very unique type of data format. Within it you have audio information, visual information, and also spatial, temporal information, things happening over space and time, and that structure creates more complexity than systems designed for simpler data types can handle,” Saini explains.

The contrast becomes clear when compared to text and images. Those formats can be broken into units and indexed without losing most of their meaning, but video can’t. As Saini puts it, “compare that to text data or image data. Those are lower dimension data formats, the information density isn’t as high. So when video is forced into the same patterns, systems can either strip out context or fail to capture how events unfold.”

That is why large video archives are often searchable at a surface level but difficult to use for more complex tasks. It’s also why TwelveLabs has been drawing so much attention. Their view is that models built for text or images can’t handle data that is continuous, multimodal, and time-based. Instead of adapting those models, they built their own.

“We own the whole video intelligence stack, and since we build our own models, we have our own orchestration,” Saini says. The shift is simple: they no longer treat video as an extension of other data types, but as its own category.

“The limitation of general-purpose models isn’t that they can’t process video at all, but that they simplify it in ways that remove important context,” Saini says, adding that models tend to break video into frames or rely on derived signals, which loses the relationships that develop over time.

A video-native model changes how the data is handled, which shifts the constraint to whether the data layer itself can support that continuity at scale.

Instead of reducing the input, TwelveLabs processes visual, audio, and temporal signals together, meaning they can preserve continuity across events and capture how elements relate over time. But as Saini explains, even then that capability depends on whether the system can access and process the full video corpus consistently, which is not guaranteed in most environments.

The First Layer of Value: Making Video Queryable

With this approach, video becomes something you can query. Instead of relying on manual tags or limited metadata, the system can index large archives and return results based on intent. As Saini explains, “you could use our models to index all of that content, then do search and retrieval and metadata generation, for example.” This shifts video from something you store to something you can access, he adds.
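To make that workflow concrete, here is a minimal sketch of what “index, then search and retrieve” can look like against a video-understanding API. The endpoint, helper functions, and response fields below are hypothetical placeholders rather than the actual TwelveLabs SDK; the point is the shape of the flow: submit footage for multimodal indexing, then query the archive by intent instead of by tag.

```python
# Illustrative sketch only: the endpoint, field names, and helpers here are
# hypothetical stand-ins, not the real TwelveLabs API surface.
import requests

API_URL = "https://api.example.com/v1"   # hypothetical endpoint
HEADERS = {"x-api-key": "YOUR_API_KEY"}

def index_video(index_id: str, video_url: str) -> str:
    """Submit a video for multimodal indexing (visual + audio + temporal)."""
    resp = requests.post(
        f"{API_URL}/indexes/{index_id}/videos",
        headers=HEADERS,
        json={"url": video_url},
    )
    resp.raise_for_status()
    return resp.json()["task_id"]

def search_moments(index_id: str, query: str) -> list[dict]:
    """Natural-language search over the indexed archive; returns scored clips."""
    resp = requests.post(
        f"{API_URL}/indexes/{index_id}/search",
        headers=HEADERS,
        json={"query": query, "options": ["visual", "audio"]},
    )
    resp.raise_for_status()
    return resp.json()["results"]  # e.g. [{"video_id": ..., "start": ..., "end": ..., "score": ...}]

# Usage: index an archive, then retrieve moments by intent rather than manual tags.
task = index_video("broadcast-archive", "https://example.com/match_finals.mp4")
for clip in search_moments("broadcast-archive", "goal celebration followed by crowd reaction"):
    print(clip["video_id"], clip["start"], clip["end"], clip["score"])
```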

The approach works because it fits how people already think about accessing information. You can search across archives, find specific moments, then generate structured outputs without needing to understand the model. It makes large datasets usable in a way that was not practical before. Retrieval works when the query maps to clear elements in the video, but it starts to fall apart when the question depends on interpretation.

Finding a clip is one thing. Understanding what matters in that clip is another. That requires data to be accessible and consistent, which breaks down when archives span multiple systems and formats. That’s where a system like VAST is needed to maintain that continuity across data.

Context Across Time Breaks Retrieval

As Saini tells us, “if you are frame sampling one frame per second, it would be the equivalent of blinking every second. You would lose a lot of really important context of what is actually going on.” That is the power of continuity, and when it is broken, the system loses how events connect. Search can return clips that match elements, but it can’t explain how those relate across a sequence, not because of TwelveLabs models but because of siloed data. Further, actions and outcomes depend on timing and interaction, not just what appears in a frame.
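To put the blinking analogy in rough numbers, here is a quick calculation assuming a typical 30 fps source (an assumption for illustration, not a TwelveLabs figure):

```python
# Rough arithmetic on frame sampling, assuming a 30 fps source.
fps_source = 30          # assumed capture rate
fps_sampled = 1          # one frame per second, as in the "blinking" analogy
duration_s = 90 * 60     # e.g. a 90-minute broadcast

frames_total = fps_source * duration_s
frames_kept = fps_sampled * duration_s
discarded = 1 - frames_kept / frames_total

print(f"{frames_total:,} frames captured, {frames_kept:,} kept")
print(f"{discarded:.1%} of frames (and everything that happens between them) discarded")
# -> 162,000 frames captured, 5,400 kept; ~96.7% discarded
```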

That is where the data layer starts to define the outcome. Once video is fragmented across systems and formats, maintaining continuity becomes a system problem, and something like VAST is required to unify that data so the model can operate against it coherently.

A model can only be as effective as the data it can consistently access, and that makes a unified data layer a requirement.

When TwelveLabs thinks about the future of on-prem or tightly controlled environments, “having a unified data layer is a huge unlock for us,” Saini says. “If you don’t have a unified data layer, the value we’re able to provide is fragmented across all those different places that the data may live.”

When data is consolidated and accessible through a single system, the model can maintain context across sources and time. Without that, each query is limited to whatever subset of data happens to be reachable.

And as TwelveLabs moves into environments where data can’t be centralized (regulated, air-gapped, ultra-secure, etc.), the infrastructure has to support running models where the data already exists.

AI Moves to the Data and Runs Continuously

TwelveLabs sees the future of AI video analysis clearly: once deployment becomes part of the design, the direction follows. We need to think less about models operating on curated datasets in controlled environments and more about models running against large, distributed, and often sensitive data that can’t be moved easily, or at all. This applies across the public sector and regulated industries, where data control is enforced at the infrastructure level.

Even where movement is possible, it is often impractical, Saini says. “Video carries a high cost in storage and transfer. There’s a material cost of ingress and egress and it’s not always trivial to take a multi-petabyte archive and put it on the cloud.”

Moving data becomes a bottleneck, reinforcing the need to bring models to where the data already exists, a problem that only gets bigger as volume grows (which it inevitably does).

“We now regularly see data footprints in the hundreds of thousands of hours, if not millions of hours,” Saini says. “In these conditions, the system isn’t processing a fixed dataset like it used to; it’s operating over a constant flow of data, where understanding depends on maintaining access to everything that came before it.”
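A back-of-the-envelope sketch makes that scale, and the cost of moving it, concrete; the bitrate and link speed here are illustrative assumptions, not figures from TwelveLabs or VAST:

```python
# Rough sizing for a large video archive. Bitrate and link speed are
# illustrative assumptions, not figures from TwelveLabs or VAST.
hours_of_video = 1_000_000        # "millions of hours" scale
avg_bitrate_mbps = 8              # assumed average encoding bitrate (Mbit/s)
link_gbps = 10                    # assumed sustained network throughput (Gbit/s)

bytes_total = hours_of_video * 3600 * avg_bitrate_mbps * 1e6 / 8
petabytes = bytes_total / 1e15

transfer_seconds = bytes_total * 8 / (link_gbps * 1e9)
transfer_days = transfer_seconds / 86_400

print(f"~{petabytes:.1f} PB of footage")
print(f"~{transfer_days:.0f} days to move over a sustained {link_gbps} Gbit/s link")
# -> ~3.6 PB; ~33 days of continuous transfer, before any egress fees
```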

What this requires is a system where data access is part of how the model runs, not something separate. As TwelveLabs sees it, VAST provides that by keeping video, metadata, and derived data in one place, continuously accessible.

Instead of moving data between systems or working across fragmented stores, the model operates on a complete, consistent dataset. That’s what allows context to persist across time and queries, which is necessary for video to be understood as a sequence rather than a set of clips.

Video is not static either; it is ingested, indexed, and queried continuously, often across large archives that keep growing. VAST handles that in a single system, so the model can move from indexing to retrieval to interpretation without switching environments. That removes the friction that normally shows up when systems scale, where access slows down or context gets lost.

TwelveLabs brings the model capability to understand video, but that only works if the system can provide full, consistent access to the data behind it. VAST is what makes that possible. It turns a set of disconnected steps into a continuous system where video can be processed, interpreted, and queried at scale without losing context.

“VAST affords us the ability to more gracefully and efficiently deploy our models and agents into these on-prem environments where it would otherwise be more challenging to get things working because we don’t provide the surrounding infrastructure,” Saini concludes.
