Apr 21, 2026

Why the World's Leading Autonomous Vehicle Companies Run on VAST

Authored by

Aaron Chaisson, VP Product and Solutions Marketing

The automotive industry is becoming a software industry. As advanced driver assistance systems (ADAS) evolve from basic lane-keeping and adaptive cruise control toward full autonomous decision-making, automakers and their technology partners are transforming into software-defined enterprises. The AI models powering this shift are getting dramatically more complex, and the data required to train, validate, and continuously improve them is growing at a pace that legacy infrastructure simply cannot sustain.

The companies winning this race aren't just building better algorithms. They're rethinking the data platform behind their networks of connected, autonomous vehicles and the ecosystem that surrounds them.

The Scale Problem Is Compounding

A single sensor-equipped test vehicle generates terabytes of data per day across cameras, LiDAR, radar, and ultrasonics. Multiply that across a global fleet of data collection vehicles, then factor in synthetic data generation and simulation workloads, and you're looking at hundreds of petabytes under management.
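A quick back-of-envelope calculation makes the scale concrete. The per-vehicle rate, fleet size, and collection window below are illustrative assumptions, not figures from any specific program:

```python
# Back-of-envelope fleet data volume; all inputs are illustrative assumptions.
TB_PER_VEHICLE_PER_DAY = 4      # cameras + LiDAR + radar + ultrasonics
FLEET_SIZE = 200                # data-collection vehicles worldwide
COLLECTION_DAYS = 365           # one year of driving

raw_tb = TB_PER_VEHICLE_PER_DAY * FLEET_SIZE * COLLECTION_DAYS
raw_pb = raw_tb / 1024          # binary petabytes

print(f"Raw sensor data per year: {raw_pb:.0f} PB")  # ~285 PB
```

And that total is before synthetic data generation and simulation outputs are layered on top, which is how fleets land in the hundreds of petabytes under management.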

But volume alone isn't what makes ADAS data infrastructure so challenging. It's the diversity of access patterns across the pipeline. Ingesting raw sensor data looks nothing like serving randomized video clips to a GPU cluster for training. Running validation workloads against safety-critical edge cases requires different performance characteristics than syncing refined model outputs back to vehicles through over-the-air updates. And increasingly, the platform also needs to handle real-time telematics from connected vehicles already on the road, streaming telemetry data that feeds both analytics dashboards and the next generation of model training.

Every stage of this pipeline has different throughput, latency, and governance requirements. Traditional architectures force teams to build and maintain separate systems for each one, creating deep data fragmentation across global development centers. When data is scattered across incompatible silos, the sheer scale of the datasets makes consolidation impractical. Data becomes too massive and too dispersed to move efficiently between sites or from on-prem to cloud. This fragmentation prevents teams from establishing a single source of truth and slows collaboration between developers and data scientists worldwide.

The result is a fragile patchwork of HPC file systems, self-managed object stores, and cloud staging environments held together by custom ETL pipelines. Teams end up spending more engineering cycles managing data movement than improving models.

Legacy Infrastructure Breaks Down at ADAS Scale

The pattern is consistent across the industry. Companies that started with traditional HPC file systems like GPFS found them powerful for early workloads but operationally complex and brittle as data volumes grew into the hundreds of petabytes. These shared-nothing parallel file systems were designed for the sequential I/O patterns of traditional HPC simulation, not the random I/O demands of deep learning. The result is I/O starvation, where expensive GPU clusters sit idle because the storage layer simply can't supply data fast enough to keep them saturated. Bolting on separate S3-compatible object storage introduced protocol silos. Scaling metadata performance to handle billions of small files (think individual labeled video frames) required over-provisioning capacity just to avoid bottlenecks.

One global autonomous vehicle software company, now running hundreds of NVIDIA H100-class GPUs and over 100,000 CPU cores across on-premises and cloud environments, hit exactly this wall. Their legacy GPFS and self-managed S3 infrastructure couldn't scale without multiplying operational complexity alongside it. Every expansion meant more systems to manage, more data copies to synchronize, and more failure modes to troubleshoot. GPU utilization suffered because the storage architecture couldn't keep pace with the compute investments the company was making.

A second company, a safety-focused ADAS software developer, faced a different but related challenge. Their team was making the leap from training perception models on still images to training on motion video, a shift that dramatically increased throughput demands and data volumes simultaneously. With hundreds of GPUs shared across a research team of hundreds of data scientists, their existing infrastructure couldn't independently scale performance to keep pace. Metadata constraints were causing outages. And legacy compression was delivering just 1.1:1 reduction on their JPEG and PNG training datasets, meaning raw storage costs scaled nearly linearly with data growth.

What ADAS Pipelines Actually Require

When you strip away vendor positioning and look at what autonomous driving teams actually need from their data platform, the requirements converge:

- Sustained, predictable throughput in the tens of gigabytes per second to keep GPU clusters fully utilized during training, eliminating the I/O starvation that plagues legacy architectures.
- Native streaming data ingest that can handle terabytes per day of sensor and telematics data in real time, with Kafka-compatible interfaces so the same platform that stores training data can also serve as the event backbone for live vehicle telemetry.
- Multi-protocol access spanning NFS and high-performance S3, so the same data can serve HPC workloads and cloud-native pipelines without copies or migrations.
- The ability to scale performance, capacity, and metadata independently, so teams aren't forced to over-provision one dimension just to get more of another.
- Data reduction that works on already-compressed formats like JPEG, PNG, and encoded video.
- A global namespace that extends across on-prem clusters and into public cloud environments for burst compute, solving data fragmentation by bringing compute to the data rather than the other way around.
- Cross-site data sharing that doesn't require replicating entire petabyte-scale datasets between locations.

No single legacy system checks all of these boxes. That's why leading ADAS companies are consolidating onto a unified data platform built for AI workloads from the ground up.
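The streaming-ingest requirement is less exotic than it sounds: a telemetry payload published to a Kafka-compatible endpoint is just an ordinary serialized message. A minimal sketch, with a hypothetical event schema (field names and units are assumptions):

```python
import json
import time

def telemetry_event(vehicle_id: str, speed_mps: float,
                    lat: float, lon: float) -> bytes:
    """Serialize one vehicle telemetry reading as a JSON-encoded message.

    The schema here is a hypothetical example; in practice this payload
    would be published through any standard Kafka client to a
    Kafka-compatible ingest endpoint.
    """
    record = {
        "vehicle_id": vehicle_id,
        "timestamp_ms": int(time.time() * 1000),
        "speed_mps": speed_mps,
        "position": {"lat": lat, "lon": lon},
    }
    return json.dumps(record).encode("utf-8")

msg = telemetry_event("veh-0042", 13.4, 48.1374, 11.5755)
```

Any standard Kafka client could then publish this payload (for example, kafka-python's `KafkaProducer(...).send("vehicle-telemetry", msg)`, with a hypothetical topic name); the point is that nothing about the message format itself requires a separate streaming stack.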

How Leading AV Companies Are Solving It

The global AV software company consolidated their GPFS and self-managed S3 infrastructure onto the VAST Data AI Operating System and NVIDIA DGX SuperPOD for ADAS and LLM development. The results were immediate and measurable. A unified global namespace spanning NFS and high-performance S3 eliminated the protocol silos that had been driving data duplication. VAST's disaggregated, shared-everything (DASE) architecture, where every CPU and GPU can access all data across all SSDs simultaneously, replaced the east-west traffic bottlenecks of their legacy shared-nothing systems with truly parallel data access. Data reduction delivered up to 12:1 efficiency on code repositories. And the platform's cloud bursting capabilities allowed the team to dynamically push compute workloads into public cloud environments and sync results back to on-prem, all through a single namespace.
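To make "one namespace, two protocols" concrete, here is a minimal sketch of how a single object can be addressed by both an NFS path and an S3 URI without a copy existing at either name. The mount point and bucket name are hypothetical, and real path-to-bucket mappings are configuration-dependent:

```python
from pathlib import PurePosixPath

NFS_MOUNT = "/mnt/vast"       # hypothetical NFS mount of the namespace
S3_BUCKET = "training-data"   # hypothetical bucket exported over S3

def nfs_to_s3(nfs_path: str) -> str:
    """Map an NFS path to the S3 URI addressing the same object.

    Illustrates the idea that both names resolve to one copy of the
    data; the mapping itself is an assumption for this sketch.
    """
    rel = PurePosixPath(nfs_path).relative_to(NFS_MOUNT)
    return f"s3://{S3_BUCKET}/{rel}"

uri = nfs_to_s3("/mnt/vast/run42/frame_000123.jpg")
```

An HPC training job reads the file over NFS while a cloud-native pipeline fetches the same bytes by S3 key, which is what eliminates the duplicate-copy problem described above.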

Today, this company has more than 10 petabytes of effective capacity on VAST and is on track to double that within two years. Their real-time mapping systems, which detect and share road hazards between vehicles at centimeter-level precision, now run on infrastructure that can actually keep pace with the data volumes involved.

The safety-focused ADAS software developer saw equally significant gains. VAST sustains 50GB/s of throughput for their training workloads, keeping their GPU fleet utilized rather than starved for data. The platform's similarity-based data reduction achieves 2.5:1 on JPEG and PNG datasets, more than doubling effective capacity compared to legacy systems limited to 1.1:1. And the DASE architecture let them add compute and storage nodes independently, as needed and without disruption, decoupled from capacity licenses. That proved critical for scaling metadata performance and for resolving the outage-causing constraints of their previous infrastructure.
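As a quick check on those numbers, the effective-capacity gain from the two reduction ratios works out as follows (the 1 PB physical figure is illustrative):

```python
# Effective capacity from data reduction, using the ratios cited above.
physical_pb = 1.0                       # illustrative physical flash capacity

legacy_effective = physical_pb * 1.1    # legacy compression on JPEG/PNG
vast_effective = physical_pb * 2.5      # similarity-based reduction

gain = vast_effective / legacy_effective
print(f"{vast_effective:.1f} PB vs {legacy_effective:.1f} PB "
      f"effective ({gain:.2f}x)")
```

At roughly 2.3x more usable capacity per physical terabyte, raw storage cost no longer scales nearly linearly with dataset growth.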

With multiple VAST clusters spanning separate data centers, the team now uses global snapshots through VAST DataSpace to share complete datasets across locations without replication, directly addressing the data fragmentation that had previously limited cross-site collaboration. Researchers can access what they need from any site, scramble data for model variation, and collaborate across geographies without data bloat. For a team of hundreds of data scientists competing for GPU time in a shared scheduler, this kind of frictionless data access is a strategic advantage.

From Training to Real-Time Inference

The ADAS conversation often centers on training, but training is only half the story. The models that power scene understanding, prediction, and path planning need to run in production, processing continuous streams of sensor data and making decisions in real time. The infrastructure that trains the model needs to also serve the inference pipeline, and the data flowing between vehicles, edge systems, and the data center needs to be ingested, enriched, and acted upon as a continuous stream, not as a batch job that runs overnight.

This is where VAST's architecture extends beyond what legacy storage platforms can offer. The VAST DataBase functions as a real-time event broker with Kafka-compatible and Apache Arrow-native interfaces, capable of ingesting and querying streaming telematics and sensor data at scale. The same platform that stores petabytes of training data can simultaneously ingest live vehicle telemetry, run real-time analytics, and feed inference workloads without requiring a separate streaming infrastructure stack.
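As a toy illustration of the kind of rolling analytics such a streaming layer runs on live telemetry, here is a hard-braking detector over a stream of per-second speed readings. The field semantics and threshold are assumptions for the sketch, not anything VAST-specific:

```python
from typing import Iterable, Iterator

def hard_braking_events(speeds_mps: Iterable[float],
                        interval_s: float = 1.0,
                        threshold_mps2: float = 4.0) -> Iterator[int]:
    """Yield the index of each reading where deceleration exceeds the
    threshold. A stand-in for streaming analytics on live telemetry;
    the 4 m/s^2 threshold is an illustrative assumption."""
    prev = None
    for i, speed in enumerate(speeds_mps):
        if prev is not None:
            decel = (prev - speed) / interval_s
            if decel > threshold_mps2:
                yield i
        prev = speed

# Speeds in m/s, one reading per second; two hard-braking drops appear.
stream = [20.0, 19.5, 19.0, 12.0, 11.5, 5.0, 4.8]
events = list(hard_braking_events(stream))  # -> [3, 5]
```

In production the input would be the live event stream rather than a list, and flagged events would feed both a fleet dashboard and the pool of edge cases mined for the next training run.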

For ADAS teams, this means the collect-to-train-to-validate loop doesn't end at model export. The trained perception model that identifies a construction zone, the planning model that reroutes around it, and the prediction model that anticipates the behavior of surrounding vehicles all depend on the same underlying data platform to deliver context at speed. OTA update validation, digital twin simulation, and real-time dashboards on fleet behavior are not future use cases that require a separate architecture. They run on the same VAST AI OS that powers training today.

This end-to-end capability is what separates a data platform, or an operating system, from a storage system. ADAS development doesn't stop at training a model that works in simulation. It requires infrastructure that can carry that model through validation, deployment, and continuous improvement in the real world, all on a single platform.

The Bigger Picture

ADAS is the leading edge of a much larger shift. Beyond automotive, robotics companies, industrial automation platforms, and drone fleet operators are converging on the same fundamental challenge: building AI systems that perceive, reason about, and act in the physical world generates data at a scale and complexity that legacy infrastructure was never designed to handle. The architectural requirements that ADAS demands today (massive streaming ingest, real-time inference, global data access, continuous model improvement) are becoming table stakes for any organization building physical AI.

The companies that treat data infrastructure as a strategic investment rather than a cost center are the ones shipping safer, smarter systems faster. The autonomous driving industry is proving that the AI operating system underneath the models matters just as much as the models themselves.
