AI factories require infrastructure, software, and people working together as a single system, one fundamentally different from legacy or even general-purpose IT. As cloud service providers (CSPs) and neoclouds race to stand up AI factories powered by NVIDIA HGX™ and NVIDIA GB200/GB300 NVL72 systems, they are finding that legacy systems simply cannot keep pace. An AI factory demands a production line, powered by high-performance GPUs and high-performance storage, that never stops, where data is the raw material and intelligence is the output.
But here is the friction point for every CSP: traditional tiered storage architectures across the AI pipeline - with their separate silos for ingest, training, and archive - are collapsing under the weight of trillion-parameter models. Compounding the problem, many storage vendors still ship separate products for file (NFS) and object (S3).
For cloud customers, this results in significant fragmentation. Workloads using NFS versus S3 require different endpoints, tools, security controls, performance expectations, and lifecycle practices. Rather than a unified, governed data plane, organizations are left with non-standard architectures and siloed datasets, frequently forcing expensive data copies and ETL to shuttle data between high-performance training storage and data lake storage (usually S3 buckets) used for ingest, sharing, and analytics.
In collaboration with NVIDIA, VAST has released two new validated NVIDIA Cloud Partner (NCP) high-performance storage reference designs that solve these exact bottlenecks. Whether you are deploying thousands of GPUs or scaling beyond 40,000, this architecture rewrites the rules of engagement for AI clouds.
Designing the Data Plane for GB200/GB300 NVL72 at Cloud Scale
AI infrastructure is rapidly shifting from clusters of GPUs to AI factories (often hosted by AI / GPU cloud providers). AI factories are specialized systems that turn data into intelligence across the entire lifecycle of ingest → train/fine-tune → deploy high-volume inference. In essence, NVIDIA views the product of an AI factory as intelligence, a result typically measured by its token throughput.
At the same time, from mainstream HGX GPU servers (H200, B200, and B300) to rack-scale systems like NVIDIA GB200 NVL72 and GB300 NVL72, the bar is rising for what cloud providers must deliver: higher GPU density and bandwidth demands, larger NVIDIA NVLink™ domains (up to 72 GPUs), liquid-cooled rack-scale design, and soaring FP4 performance, all of which amplify one key aspect: the data plane is integral to driving AI performance for the accelerated computing platform.
This blog breaks down what GPU cloud customers require in the AI factory era and maps those needs to the architectural advantages of the VAST AI OS, aligned with the NVIDIA Cloud Partner high-performance storage reference design for HGX H200, HGX B200, and HGX B300 deployments, as well as rack-scale GB200/GB300 NVL72 systems.
What GPU Cloud Customers Require and How VAST Delivers
When customers consume GPU cloud services, they’re not purchasing raw FLOPS; they’re purchasing fast, predictable time-to-results at a cost that meets business requirements. In practice, that expectation translates into a clear, consistent set of platform requirements:
1. Flexible Building Blocks for Different GPU Cloud Scenarios
GPU cloud deployments present data center operators with new challenges in power, rack space, and failure-domain (availability zone) design.
Providers need one data platform that can adapt to all of these, not a different product per topology. The VAST AI OS, built on the DASE (Disaggregated, Shared-Everything) architecture, solves this by decoupling software from hardware. DASE microservices can run on separate controller and flash media nodes or co-reside on the same server, giving two interchangeable deployment options with full feature parity:
C + D model (CNodes + DBoxes): Diskless x86-based CNodes act as the clustered I/O and management layer, with separate DBoxes for dense NVMe capacity (ideal when you want to scale performance and capacity independently).
EBox model (integrated C+D): The EBox packs the controller and NVMe drives into a single x86 appliance, delivering high performance density for high-performance storage designs (e.g., NVIDIA HGX and NVIDIA GB200/GB300 NVL72) while reducing rack space, power, and overall cost; available broadly from tier-1 OEMs.
Because DASE is a software architecture, cloud providers can choose C+D or EBox per site or use case while presenting a single, consistent VAST AI OS experience to every GPU cloud customer.
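To make the trade-off concrete, here is a minimal Python sketch of why C+D scales performance and capacity independently while EBox scales both together; the per-node figures are hypothetical placeholders, not VAST or NVIDIA sizing guidance:

```python
# Illustrative sketch only: per-node figures below are hypothetical
# placeholders, not vendor sizing guidance.
from dataclasses import dataclass

@dataclass
class ClusterPlan:
    read_gbps: float    # aggregate read throughput (GB/s)
    capacity_pb: float  # usable capacity (PB)

def plan_cd(cnodes: int, dboxes: int,
            gbps_per_cnode: float = 50.0,    # hypothetical
            pb_per_dbox: float = 1.0) -> ClusterPlan:  # hypothetical
    # C+D: throughput follows CNode count, capacity follows DBox count,
    # so each dimension can be scaled on its own.
    return ClusterPlan(cnodes * gbps_per_cnode, dboxes * pb_per_dbox)

def plan_ebox(eboxes: int,
              gbps_per_ebox: float = 50.0,   # hypothetical
              pb_per_ebox: float = 1.0) -> ClusterPlan:  # hypothetical
    # EBox: throughput and capacity scale together with each appliance.
    return ClusterPlan(eboxes * gbps_per_ebox, eboxes * pb_per_ebox)

print(plan_cd(cnodes=16, dboxes=8))  # add CNodes for performance alone
print(plan_ebox(eboxes=12))          # denser, coupled scaling per rack unit
```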


2. Predictable Performance for Unpredictable Workloads
AI workloads are notoriously bursty. A robust high-performance storage design must prioritize massive, sustained read throughput for training while efficiently absorbing periodic checkpoint writes without allowing any I/O contention to starve GPUs of data.
The DASE architecture is engineered to fully exploit NVMe flash, using NVMe-oF and RDMA to provide the high-speed data transfer that keeps GPUs highly utilized. By supporting NFS multipathing, RDMA access over NFS (v3/v4), and NVIDIA Magnum IO™ GPUDirect™, the platform can deliver multiple terabytes per second of throughput from a single mount point, ensuring that training, fine-tuning, and inference pipelines never stall waiting for data.
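As an illustration of that data path, here is a minimal sketch of a GPUDirect Storage read straight into GPU memory using NVIDIA's kvikio Python bindings for cuFile; the mount path and buffer size are hypothetical:

```python
# A minimal GDS read sketch using kvikio (NVIDIA's cuFile bindings).
# The mount point and file name are illustrative; actual throughput
# depends on the fabric and storage configuration.
import cupy
import kvikio

# Shard of a training dataset on an RDMA-capable mount (path illustrative).
path = "/mnt/vast/datasets/shard-0000.bin"

# Destination buffer in GPU memory: when GDS is available, data moves
# storage -> GPU without bouncing through host memory.
buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)  # 256 MiB

with kvikio.CuFile(path, "r") as f:
    future = f.pread(buf)   # asynchronous read into GPU memory
    nbytes = future.get()   # wait for completion
print(f"read {nbytes} bytes directly into GPU memory")
```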
3. Fast Checkpointing + Resilient Recovery
Modern training pipelines rely on frequent checkpointing - triggered by steps/iterations, epoch boundaries, or time-based policies - using both synchronous and asynchronous checkpoint I/O. Customers need checkpoints that are fast, durable, and minimally disruptive, plus they need rapid restore so jobs can resume in minutes (not hours) after faults or maintenance events.
To support the intensive I/O demands of modern training, VAST leverages NFS multipathing, RDMA, and NVIDIA Magnum IO GPUDirect to achieve terabyte-per-second throughput, ensuring frequent checkpoints never throttle GPU utilization. This performance is paired with intelligent data management that treats checkpoints as governed assets, allowing MLOps teams to track lineage and rapidly resume training flows from precise points in time.
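As a simple illustration of the trigger policies described above, here is a minimal Python sketch combining step-based and time-based checkpointing with asynchronous checkpoint I/O; the intervals, paths, and save logic are purely illustrative, not VAST's implementation:

```python
# Sketch of a checkpoint trigger policy with the write moved off the
# training thread so GPUs keep working. All values are illustrative.
import pickle
import threading
import time

class CheckpointPolicy:
    def __init__(self, every_n_steps: int = 500, every_secs: float = 900.0):
        self.every_n_steps = every_n_steps
        self.every_secs = every_secs
        self.last_time = time.monotonic()

    def should_checkpoint(self, step: int) -> bool:
        # Trigger on a step boundary OR when the time budget expires.
        due = (step % self.every_n_steps == 0 or
               time.monotonic() - self.last_time >= self.every_secs)
        if due:
            self.last_time = time.monotonic()
        return due

def write_checkpoint(state: dict, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(state, f)

def async_save(state: dict, path: str) -> threading.Thread:
    # Asynchronous checkpoint I/O: snapshot the state, then write in the
    # background while the next training steps proceed.
    snapshot = dict(state)  # shallow copy; real jobs copy tensors to host first
    t = threading.Thread(target=write_checkpoint, args=(snapshot, path))
    t.start()
    return t

# Usage inside a training loop (checkpoint directory is illustrative):
policy = CheckpointPolicy()
for step in range(1, 10_001):
    # ... forward/backward/optimizer step ...
    if policy.should_checkpoint(step):
        async_save({"step": step}, f"/mnt/vast/ckpt/step-{step:06d}.pkl")
```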
4. Secure Multi-Tenancy without the Noisy Neighbor Penalty
To meet enterprise demands for strict isolation and predictable performance, the VAST AI OS implements a robust Zero Trust framework that eliminates the traditional trade-off between security and tenant density. The solution supports hard network isolation by assigning per-tenant VIP pools paired with VLAN/VRF segmentation, ensuring that data traffic remains completely separate across the fabric.
Furthermore, granular QoS policies and tenant-specific quota settings actively prevent I/O contention, ensuring that one workload’s bursty activity never degrades the performance of neighbors. This allows CSPs to maximize infrastructure utilization while delivering secure, isolated environments with unique encryption keys and dedicated directory hierarchies for every tenant.
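As a purely hypothetical sketch of how such controls might be scripted against a management REST API - the endpoint paths, field names, and limits below are invented for illustration, not the actual VMS schema - per-tenant provisioning could look like this:

```python
# Hypothetical provisioning sketch. Endpoint paths, fields, and limits
# are invented for illustration; consult the VMS API docs for the real schema.
import requests

VMS = "https://vms.example.internal/api"       # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

tenant = {
    "name": "tenant-a",
    "vip_pool": "pool-tenant-a",       # per-tenant VIP pool
    "vlan": 1201,                      # hard network isolation
    "encryption_key": "tenant-a-kms",  # unique key per tenant
}
qos = {"max_read_mbps": 50_000, "max_write_mbps": 20_000}   # illustrative caps
quota = {"path": "/tenants/tenant-a", "hard_limit_tb": 500}  # illustrative

for resource, body in [("tenants", tenant), ("qospolicies", qos), ("quotas", quota)]:
    r = requests.post(f"{VMS}/{resource}/", json=body, headers=HEADERS, timeout=30)
    r.raise_for_status()
```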
5. Breaking the Protocol Barrier: One Dataset, Many Consumers
Real-world AI pipelines are often fragmented by storage silos, forcing data engineers to copy data between object stores for ingest and file systems for training. The VAST AI OS eliminates this inefficiency by supporting multi-protocol access - NFS, SMB, S3, and NVIDIA Magnum IO GPUDirect Storage (GDS) - to a unified namespace. This allows data to be ingested via S3, pre-processed using high-performance file access, and analyzed by SQL engines like Spark or Trino, all without ever moving the data.
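Here is a minimal sketch of that pattern, assuming a hypothetical S3 endpoint, bucket, and NFS mount of the same namespace:

```python
# "One dataset, many consumers": ingest an object over S3, then read the
# same bytes through a file path. Endpoint, bucket, and mount point are
# hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.vast.example.internal",  # hypothetical endpoint
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# 1) Ingest via S3 (e.g., from an upstream data pipeline).
s3.put_object(Bucket="training-data", Key="raw/sample.jsonl",
              Body=b'{"text": "hello"}\n')

# 2) Consume the same object as a file for preprocessing or training,
#    with no copy or ETL step in between (mount point is illustrative).
with open("/mnt/vast/training-data/raw/sample.jsonl", "rb") as f:
    print(f.read())
```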
Underpinned by the VAST Catalog, every file and object is automatically indexed, ensuring consistent governance, lineage tracking, high-speed metadata queries, and auditability across all consumers and stages of the pipeline.
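To illustrate what catalog-driven metadata queries enable, here is a hypothetical sketch using the Trino Python client; the catalog, schema, table, and column names are invented for illustration and are not the VAST Catalog's actual schema:

```python
# Hypothetical metadata query via the Trino Python client. Catalog,
# schema, table, and column names are invented for illustration.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal", port=8080, user="mlops",
    catalog="vast", schema="catalog",  # hypothetical names
)
cur = conn.cursor()
# Find recent checkpoint artifacts without walking the filesystem.
cur.execute("""
    SELECT name, size, creation_time
    FROM files  -- hypothetical table
    WHERE name LIKE '%.ckpt'
      AND creation_time > current_timestamp - INTERVAL '1' DAY
    ORDER BY creation_time DESC
""")
for row in cur.fetchall():
    print(row)
```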
6. Deep Observability
In a multi-tenant AI cloud, slow performance is often a mystery. To eliminate the finger-pointing between network, compute, and storage teams, the VAST Management System (VMS) provides deep, continuous observability that acts as a performance flight recorder for cloud infrastructure.
Instead of aggregate metrics that mask the root cause, VMS provides a real-time User Activity view that pinpoints exactly which job, user, or client is generating load, identifying not just noisy tenants, but specific consumers saturating I/O. This allows platform teams to correlate storage behavior (latency, IOPS, throughput) directly with machine learning events like checkpoint bursts, evaluation sweeps, or inference surges. By making model artifacts and intermediate checkpoints first-class governed assets, operators can instantly identify if a stall is due to a bulk feature extraction or an unexpected data export, enabling rapid remediation and strict enforcement of fair-use policies.
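Conceptually, that correlation amounts to joining per-tenant I/O telemetry with the ML job timeline. A minimal sketch, with hypothetical data formats and thresholds:

```python
# Sketch of correlating storage telemetry with ML events. The metrics
# feed and event-log formats are hypothetical; the point is joining
# per-tenant I/O samples with job timeline events to explain load spikes.
from datetime import datetime, timedelta

io_samples = [  # (timestamp, tenant, write_gbps) from a monitoring feed
    (datetime(2025, 1, 1, 10, 0), "tenant-a", 4.0),
    (datetime(2025, 1, 1, 10, 5), "tenant-a", 42.0),  # spike
]
ml_events = [  # (timestamp, job, event) from the training scheduler
    (datetime(2025, 1, 1, 10, 4), "llm-pretrain-7", "checkpoint_start"),
]

def explain_spikes(samples, events,
                   window=timedelta(minutes=5), threshold=20.0):
    # For each heavy I/O sample, list ML events in the surrounding window.
    for ts, tenant, gbps in samples:
        if gbps < threshold:
            continue
        causes = [e for e in events if abs(e[0] - ts) <= window]
        print(f"{ts} {tenant}: {gbps} GB/s <- {causes or 'unexplained'}")

explain_spikes(io_samples, ml_events)
```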
7. Converged Network Fabric (SN5600/SN5610)
The VAST AI OS adheres to the NVIDIA Converged Network Design, utilizing NVIDIA Spectrum™ SN5600/SN5610 leaf switches to eliminate the need for separate in-band management and storage fabrics. In this architecture, both front-end (client access) and back-end (internal storage) traffic share the same high-bandwidth, RDMA-capable Ethernet leaf layer. This converged approach significantly reduces cabling complexity and switch sprawl while maintaining strict performance isolation through VLAN segmentation (e.g., internal communication versus client access) and Priority Flow Control (PFC) using DSCP to ensure lossless behavior.
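For illustration, the converged fabric plan can be expressed as data; the VLAN IDs and DSCP/PFC values below are hypothetical placeholders rather than the reference design's actual assignments:

```python
# Illustrative model of the converged fabric: one leaf layer carrying both
# traffic classes, separated by VLAN, with RDMA traffic marked for lossless
# handling via DSCP/PFC. All IDs and values are hypothetical placeholders.
fabric_plan = {
    "leaf_switches": ["SN5600-leaf-01", "SN5600-leaf-02"],
    "vlans": {
        "client_access": {"id": 100, "traffic": "front-end NFS/S3"},
        "storage_internal": {"id": 200, "traffic": "back-end NVMe-oF/RDMA"},
    },
    # DSCP value mapped to a PFC-enabled (lossless) priority for RDMA traffic.
    "qos": {"rdma_dscp": 26, "pfc_priority": 3, "lossless": True},
}

def validate(plan: dict) -> None:
    # The two traffic classes must stay isolated on distinct VLANs, and the
    # RDMA class must land on a lossless, PFC-enabled priority.
    ids = [v["id"] for v in plan["vlans"].values()]
    assert len(ids) == len(set(ids)), "traffic classes must not share a VLAN"
    assert plan["qos"]["lossless"], "RDMA class must map to a PFC priority"

validate(fabric_plan)
```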
The NVIDIA + VAST Advantage for CSPs
AI factories need specialized data infrastructure. Generic cloud storage can’t keep NVIDIA GB200/GB300 NVL72 racks fully utilized or sustain the SLAs customers expect. The VAST Data + NVIDIA Reference Design gives CSPs three clear advantages:
Proven at Scale: Scale from ~1K to 40K+ GPUs with pre-validated sizing. Ratios of EBoxes (or C+D) to GPUs are tested and validated with NVIDIA, so CSPs can deploy VAST for high-performance storage with confidence.
Enterprise-Grade SLAs: With a platform architected for six-nines availability and hard multi-tenant isolation, CSPs can confidently offer premium SLAs to their top enterprise and AI-native customers.
Operational Simplification: Replace fragmented tiers (data lakes, archives, and fast flash-based scratch space) with one unified data platform. No more brittle data-movement scripts, no more copy tax; just one governed dataset serving all AI pipelines.
Ready to build your AI factory? For a deeper technical breakdown of the VLAN configurations, cabling diagrams, and specific sizing tables for NVIDIA GB200, GB300, and HGX servers, refer to the full VAST Data NCP High-Performance Storage Reference Design documents, which cover our EBox design with OEM partners.