Reasoning with multimodal data at scale is unlocking the next wave of AI applications.
Video search powered by Vision Language Models (VLMs) enables systems to identify and act on relevant events in real time, transforming legacy video infrastructure into intelligent, agentic applications for use cases like public incident detection.
The VAST AI OS powers real-time multimodal inference applications at unprecedented scale and performance. The platform is built for data-intensive applications such as video processing to power agentic workflows.

Let’s walk through the key steps to building a real-time video search and summarization application to create the foundation for agentic, multimodal workflows.
Why Developers Build on DataEngine
The VAST DataEngine runs on the Disaggregated Shared-Everything (DASE) architecture, leveraging the VAST Event Broker to ingest hundreds of millions of messages per second. The VAST AI OS is built from the ground up for AI-native applications to enable serverless functions powered by event streams and vector storage. VAST’s decoupling of compute and storage creates new opportunities for developers to run massive, real-time AI workloads without the bottlenecks of legacy infrastructure.
In this blog we’ll walk through how to use the VAST DataEngine to build a real-time video search and summarization application. The DataEngine orchestrates video pipelines composed of serverless functions that operate directly on video stored in VAST’s S3-compatible object store. End users can search and explore video content through a client-facing Angular application, backed by a Python-based API. For a deeper overview of the VAST DataEngine, see our previous post.
Prerequisites
Before we dive in, you’ll want to complete the following setup steps:
- Source code:
- The entire source code is available on GitHub
- All functionality is available as images via a public Docker registry
- Access to NVIDIA Models:
- NVIDIA Cosmos Reason VLM
- NVIDIA NIM API endpoints for the embedding (nvidia/nv-embedqa-e5-v5) and reasoning (meta/llama-3.1-8b-instruct) models
- VAST Management System (VMS) set up by a VMS admin:
- Ensure DataEngine is enabled for the specific tenant
- Create and share S3 and VastDB credentials for users building pipelines on DataEngine
- VAST DataEngine user:
- Access to create and deploy DataEngine pipelines (pipelines = triggers + functions)
- Credentials, secrets, and endpoint configuration details for VastDB, S3 buckets, and the VLM are required for both the pipelines and the backend/frontend application
- A remote Kubernetes cluster to deploy the client application frontend/backend
- Network access to VastDB and S3-compatible object storage
The Architecture
The frontend/backend applications are deployed on a remote Kubernetes cluster that’s accessible from your local development environment. Serverless functions are deployed to the DataEngine to orchestrate the end-to-end pipeline, from video ingestion and retrieval to multimodal inference and embedding generation.
The diagram below provides an overview of the system architecture:

Serverless Functions
Let’s walk through the serverless functions that make up the video pipelines powering search and summarization. Here we focus on the two primary pipeline flows (stage 1 and stage 2):

Flow 1:
Trigger 1: user uploads video to S3 → Function: break video into segments
Flow 2:
Trigger 2: video segment added to S3 → Function: generate video summary w/ VLM → Function: generate summary embedding → Function: store embeddings in VastDB
Segmenting Videos
In the first flow, once a video lands in the S3 bucket, video-segments, it is split into reasonably sized segments. These segments are stored in a separate S3 bucket, video-chunks-segments, which triggers the subsequent flow responsible for generating and storing video summaries.
By default, videos are segmented into 5-second slices, but this can be adjusted via the segment_duration environment variable.
The code snippet below handles storing the video segments:
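As a sketch of this step (the segment_duration variable and the video-chunks-segments bucket come from the post; the function and key names are illustrative, and ffmpeg is assumed to be available in the function image), the segmentation logic might look like:

```python
import os
import subprocess

# 5-second slices by default, overridable via the segment_duration env var
SEGMENT_DURATION = float(os.environ.get("segment_duration", "5"))


def segment_key(source_key: str, index: int) -> str:
    # Derive a deterministic object key for each segment, e.g.
    # "clips/cam1.mp4" -> "cam1/segment_0003.mp4"
    base = os.path.splitext(os.path.basename(source_key))[0]
    return f"{base}/segment_{index:04d}.mp4"


def split_and_store(local_path: str, source_key: str, s3_client,
                    bucket: str = "video-chunks-segments"):
    """Split a video into fixed-length segments with ffmpeg and upload each
    one to the bucket whose events trigger the summarization flow."""
    out_pattern = "/tmp/segment_%04d.mp4"
    subprocess.run(
        ["ffmpeg", "-i", local_path, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(SEGMENT_DURATION),
         "-reset_timestamps", "1", out_pattern],
        check=True,
    )
    # Upload every segment ffmpeg produced
    index = 0
    while os.path.exists(f"/tmp/segment_{index:04d}.mp4"):
        s3_client.upload_file(f"/tmp/segment_{index:04d}.mp4",
                              bucket, segment_key(source_key, index))
        index += 1
```

Using stream copy (-c copy) keeps segmentation fast since no re-encoding is needed, at the cost of segment boundaries snapping to keyframes.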

That completes the first flow: a DataEngine function splits the video into segments and stores the results in a separate S3 bucket, which triggers the next stage.
Generating Video Summaries
In the first flow, we retrieve videos and store segments of the video in an S3 bucket, video-chunks-segments. This triggers the second flow, where we generate summaries of each video segment and store them as vectors in the VAST DataBase.
We use the VLM to generate a summary for each 5-second segment. The code snippet below demonstrates how to generate a video summary, which is then returned from the function and passed to the next step in the pipeline:
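A minimal sketch of this call is shown below. The endpoint URL and exact model identifier are assumptions — substitute the values for your Cosmos Reason deployment — and the payload follows the OpenAI-style chat format that NIM endpoints expose, with the segment attached as a base64 data URL:

```python
import base64
import json
import os
import urllib.request

# Illustrative defaults; point these at your Cosmos Reason VLM deployment.
VLM_URL = os.environ.get("VLM_URL", "http://localhost:8000/v1/chat/completions")
VLM_MODEL = os.environ.get("VLM_MODEL", "nvidia/cosmos-reason")


def build_messages(video_b64: str,
                   prompt: str = "Describe the events in this video segment "
                                 "in two sentences."):
    # OpenAI-style chat message with the segment inlined as base64 video.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video_url",
             "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
        ],
    }]


def summarize_segment(video_bytes: bytes) -> str:
    """Send one video segment to the VLM and return its text summary."""
    payload = {
        "model": VLM_MODEL,
        "messages": build_messages(base64.b64encode(video_bytes).decode()),
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        VLM_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```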

Data returned from one function can be consumed by other functions in the DataEngine, with each function’s output triggering the next in the pipeline. At this point, we’ve covered how to generate descriptive summaries for video segments.
Generating Summary Embeddings
Once a video summary is generated, we use an embedding model to convert it into vectors. The pipeline produces summary embeddings for all video segments, enabling fast search and retrieval through the client-facing application. The code snippet below demonstrates how to generate these vector embeddings:
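A sketch of this call is below, using the nv-embedqa-e5-v5 NIM endpoint named in the prerequisites. The endpoint URL default and helper names are illustrative; note that this model distinguishes "passage" embeddings (for indexing) from "query" embeddings (for search):

```python
import json
import os
import urllib.request

# Illustrative default; point this at your NIM embedding endpoint.
NIM_URL = os.environ.get("EMBED_URL",
                         "https://integrate.api.nvidia.com/v1/embeddings")
EMBED_MODEL = "nvidia/nv-embedqa-e5-v5"


def embed_payload(texts, input_type="passage"):
    # "passage" when indexing summaries, "query" when embedding a search.
    return {"model": EMBED_MODEL, "input": texts, "input_type": input_type}


def embed_summaries(texts, api_key=os.environ.get("NVIDIA_API_KEY", "")):
    """Return one embedding vector per input summary, in order."""
    req = urllib.request.Request(
        NIM_URL,
        data=json.dumps(embed_payload(texts)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```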

The embedding is returned from the function and consumed by the next step in the pipeline. At this point, vector embeddings have been generated for all video summaries; all that remains is to store them.
Storing Embeddings
The final function in the pipeline stores all generated embeddings. VastDB enables AI workloads and applications with native support for storing and searching vectors. Using the VastDB SDK, we can store vectors efficiently with PyArrow:
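A sketch of this step is below, assuming the VastDB Python SDK's transaction API. The bucket, schema, table, and column names are illustrative placeholders for your own layout:

```python
def embedding_row(segment_key: str, summary: str, vector: list) -> dict:
    # One record per video segment: where it lives, what it shows, its vector.
    return {"segment_key": segment_key, "summary": summary, "embedding": vector}


def store_embeddings(rows, endpoint, access_key, secret_key,
                     bucket="video-db", schema="video", table="segments"):
    """Append embedding records to a VastDB table inside a transaction.
    Bucket/schema/table names are illustrative."""
    import pyarrow as pa
    import vastdb  # VastDB Python SDK

    session = vastdb.connect(endpoint=endpoint,
                             access=access_key, secret=secret_key)
    with session.transaction() as tx:
        t = tx.bucket(bucket).schema(schema).table(table)
        # Columnar insert: one PyArrow table covering all rows at once.
        t.insert(pa.table({
            "segment_key": [r["segment_key"] for r in rows],
            "summary": [r["summary"] for r in rows],
            "embedding": [r["embedding"] for r in rows],
        }))
```

Batching rows into a single PyArrow table keeps inserts columnar and efficient rather than writing one record at a time.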

Note that this is a reference architecture and that pipeline functions can be adapted to meet specific application requirements. For example, functions can be consolidated or expanded as needed.
With that, we’ve completed the second flow, generating and storing vector embeddings for all video segments.
Deploy the Client Application
In this section, we’ll review the frontend and backend components (Stage 3: Retrieve and Search). The full-stack code is available on GitHub.
Note that after deploying the client application, the DataEngine pipelines must also be deployed for users to upload and search videos.
To deploy the frontend and backend applications to the remote Kubernetes cluster, run the deployment script as follows:
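A sketch of the deployment steps is below. The script name, namespace, and ingress IP are illustrative; consult the repo's deployment docs for the exact commands:

```shell
# Script name and namespace are illustrative; see the repo README for specifics.
export KUBECONFIG=~/.kube/config          # credentials for the remote cluster
./deploy.sh --namespace video-search      # build and apply the k8s manifests

# Verify the frontend/backend pods came up
kubectl get pods -n video-search

# Map the ingress hostname locally (e.g. video-lab.v209.vastdata.com)
echo "<ingress-ip> video-lab.v209.vastdata.com" | sudo tee -a /etc/hosts
```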

After following the deployment instructions and updating your local /etc/hosts file, open the application in your browser (e.g., http://video-lab.v209.vastdata.com) to test authentication. Users are first presented with a screen to log in using their S3 credentials:

After authenticating, the user is able to access the video search and summarization UI:

In this flow, the user authenticates and gains access to the video search interface. They can upload videos to be processed by the DataEngine, enabling search once the video has passed through the processing pipeline.
When a user enters a search query and clicks “Search,” the backend performs a similarity search. The query is converted into an embedding, which is used to query the VastDB vector field and return matching video segments. The code snippet below demonstrates how the backend performs this similarity search:
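As a simplified sketch of the ranking step — the production backend pushes the vector search down to VastDB, whereas here the cosine similarity is computed in pure Python over already-fetched records, and the function names are illustrative:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search_segments(query_embedding, records, top_k=5):
    """Rank stored segment records (dicts with 'segment_key' and 'embedding')
    by similarity to the query embedding and return the best matches."""
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r)
              for r in records]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [r for _, r in scored[:top_k]]
```

The query embedding itself comes from the same embedding model used at indexing time (with input_type "query" rather than "passage"), so query and passage vectors live in the same space.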

The user can dig deeper into a set of search results by toggling Enable LLM Response, which calls the functionality below to prompt a reasoning model about the videos returned in the search results:
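A sketch of that call is below, using the meta/llama-3.1-8b-instruct NIM endpoint named in the prerequisites. The endpoint default and prompt template are assumptions; the idea is to ground the LLM in the retrieved segment summaries:

```python
import json
import os
import urllib.request

# Illustrative default; point this at your NIM chat-completions endpoint.
LLM_URL = os.environ.get("LLM_URL",
                         "https://integrate.api.nvidia.com/v1/chat/completions")
LLM_MODEL = "meta/llama-3.1-8b-instruct"


def build_reasoning_prompt(question, summaries):
    """Ground the LLM in the retrieved segment summaries."""
    context = "\n".join(f"- {s}" for s in summaries)
    return (f"Video segment summaries:\n{context}\n\n"
            f"Answer the question using only these summaries: {question}")


def ask_llm(question, summaries, api_key=os.environ.get("NVIDIA_API_KEY", "")):
    payload = {
        "model": LLM_MODEL,
        "messages": [{"role": "user",
                      "content": build_reasoning_prompt(question, summaries)}],
        "max_tokens": 512,
    }
    req = urllib.request.Request(
        LLM_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```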

Build on VAST
Together, these steps demonstrate how to build a complete real-time video search and summarization pipeline using VAST DataEngine, serverless functions, VLMs, and VastDB to enable fast, scalable, and searchable video workflows.



