Jul 9, 2024

When Worlds Collide: The VAST Data Platform Is Now Certified for Cloud Partners in the NVIDIA Partner Network

Jeff Denworth, VAST Data Co-Founder

Today, we’re excited to announce that the VAST Data Platform has achieved certification for NVIDIA Partner Network cloud partners. This designation validates solutions that can operate at massive scale – with up to 16,000 NVIDIA Tensor Core GPUs.

The NVIDIA cloud partner reference architecture is designed to meet NVIDIA’s exacting performance standards across all AI cloud providers who adopt it. The result is a consistent and high-quality user experience across all NVIDIA cloud partner platforms. The reference architecture also aligns with VAST’s vision of zero-trust, confidential computing for scalable AI cloud tenants.

In 2024, it’s evident that we’re entering a new frontier of bespoke GPU clouds. As business-critical AI use cases expand in these AI factories, we see a number of strategic factors aligning between our vision and that of our customers.

AI Research and AI Production worlds are colliding. As AI models are growing in size and complexity, the combination of training, inference and fine-tuning is creating a need for more robust infrastructure. Customers need closed-loop systems that can interactively and iteratively address the entirety of the data pipeline, and these solutions must have an answer to enterprise data management that is not often found in single-tenant HPC research centers. The data protection and governance mandate for AI inference is far more stringent than is often found in training environments, and uptime is paramount when decision systems and robotics (virtual, physical) depend on real-time infrastructure.

This is where we come in. VAST systems, unlike parallel file systems, were not designed for single-tenant HPC centers. Instead, we built a system that delivers supercomputer-level scale with the operational robustness of a true enterprise appliance. Parallel file system customers previously believed they needed to trade off uptime for scale and performance, but those days are over with our new Disaggregated, Shared-Everything (DASE) architecture.

SLAs matter. VAST has been selected by some of the world’s largest GPU clouds; organizations such as CoreWeave, Lambda, Taiga Cloud, and more are building AI factories with VAST. Across these environments, which power hundreds of thousands of GPUs, we see over 99.9999% uptime. This is not only unheard of with most HPC technology – it’s essential to cloud vendors who are building training-as-a-service platforms and is critical for inference-as-a-service.
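
To put that uptime figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the 365-day year are assumptions made for the example, not anything taken from VAST tooling.

```python
# Illustrative arithmetic only: convert an availability percentage into
# the downtime it permits over one (assumed) 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year allowed at this availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("three nines", 99.9), ("six nines", 99.9999)]:
        print(f"{label} ({pct}%): {allowed_downtime_seconds(pct):,.1f} s/year")
```

At 99.9999%, the budget works out to roughly 31.5 seconds per year, versus nearly nine hours (about 31,536 seconds) at 99.9%.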

How is it that the VAST Data Platform is more robust?

1. Architecture: Architecturally, we’ve removed many of the layers of complexity that are common in HPC systems.

a. Because VAST systems have no need for cache (and therefore no need for cache coherency), they don’t experience data loss during power outages and don’t require batteries to hold data in DRAM, as conventional storage arrays do. With legacy systems, power outages can contribute to data loss.

b. Our containerized, stateless controller architecture survives far more failures than systems where machines own a portion of the namespace. With partitioned parallel file systems, if you lose a controller or a controller pair, you’ve lost access to all of your file system. With VAST, you can still get full access to all of your data in real time without the need for rehydration or any other data recovery gymnastics, so long as you have one surviving system in the cluster.

c. VAST requires no parallel file system client. VAST’s preference for standard client protocols (like NFS) means you never need to install our software on your hosts. We’ve resolved bottlenecks at the server level so customers can use kernel-native data access protocols. This advances operational uptime, since you a) never need to worry about your host operating system when upgrading your file system, b) never need to worry about your file system when you need to do OS upgrades and c) never experience server-side failures when a client hiccups.

Having kernel-level file system client support also makes it easy to support new systems. We’re already working with Arm-based NVIDIA Grace CPU systems at several deployments, and there was no development or QA required to begin working on these new platforms. Linux is Linux. NFS is NFS.
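
Returning to point (b) above, a toy model can make the failure-domain argument concrete. The sketch below is a deliberate simplification built on assumptions made only for illustration: controllers are numbered integers, the partitioned design assigns each namespace shard to one controller pair, "one surviving system" is read as one surviving controller, and the function names are hypothetical. It is not a model of any specific product.

```python
from typing import Iterable, Sequence, Set, Tuple


def reachable_partitioned(pairs: Sequence[Tuple[int, int]], failed: Set[int]) -> float:
    """Fraction of shards still reachable when each shard is owned by one pair.

    A shard stays reachable only while at least one controller in its
    owning pair is alive.
    """
    alive = sum(1 for a, b in pairs if a not in failed or b not in failed)
    return alive / len(pairs)


def reachable_stateless(controllers: Iterable[int], failed: Set[int]) -> float:
    """Fraction of data reachable when any surviving controller can serve all of it."""
    return 1.0 if any(c not in failed for c in controllers) else 0.0


if __name__ == "__main__":
    pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]  # hypothetical 8-controller cluster
    failed = {0, 1}                            # lose one whole controller pair
    print(reachable_partitioned(pairs, failed))
    print(reachable_stateless(range(8), failed))
```

In this model, losing one pair strands a quarter of the shards in the partitioned layout, and any file striped across every shard would be affected. The stateless layout keeps everything reachable until the last controller fails.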

2. Product Design: If you look under the covers of many popular parallel file systems, you’ll find a complicated combination of independently developed software packages. VAST is a monolithic stack designed 100% in house. No external failover managers. No external file system gateways. No external disk file systems. No random RAID controllers. It’s 100% VAST. 100PB clusters are upgraded at the push of a button, without the need for individual controller management or any of the legacy HPC systems management concepts.

The Texas Advanced Computing Center (TACC) is a top-10 global HPC site. Its considerable uptime requirements led the site’s team to VAST: they selected us for their Stampede3 system and are collaborating with us on exascale computing. The overarching observation about our system: “pleasantly surprised.”

3. Quality Assurance: Our massive product development effort is overshadowed only by our QA investments. Our partners have over $100M of equipment running in labs, all to make sure we can recreate any issue at any scale. Half of our R&D team is dedicated to QA. We care deeply about quality and have been able to strike a good balance between quality and innovation. To put this in perspective, we’re taking the parallel computing market by storm: operating at six nines of availability while delivering over 100 features per year – out-innovating conventional technologies that are steeped in legacy concepts and legacy architecture.

AI and Big Data are also colliding. The NVIDIA cloud partner reference architecture focuses on high-performance storage for training and fine-tuning, and we find that AI pipelines need a broad set of tools to address everything from data prep to model deployment. To build an end-to-end AI pipeline, you also need structured data management tools for data preparation, anonymization, tagging, and inference response logging for both regulatory compliance and model fine-tuning. Data warehouses are increasingly critical to the process of feature engineering and data analysis.


This is why VAST has built a data platform that combines support for unstructured data and structured data into a common system that is designed to accelerate every aspect of the AI data pipeline. VAST’s systems integrate a distributed file system with a massively scalable, NVMe-optimized transactional data warehouse and SQL query engine. When combined, these technologies give customers a platform for:

  • data ingest
  • data preparation and data anonymization
  • AI model training
  • AI model inference and logging.

Our computational capability is why we call the system a Data Platform and not just storage. While other storage systems may adopt the badge of a data platform, VAST Data enables querying on structured and unstructured data using SQL tools.

AI and Enterprise Data Management are colliding. Globally, regulatory pressures are bearing down on model builders and development teams to meet compliance requirements that cover everything from data privacy to model tampering avoidance to model reproducibility. Developers of HPC storage products have often prioritized scale and speed over data management requirements, and their architectures can also constrain their ability to support enterprise features. To illustrate this: after decades of development, several of these systems still don’t have quality, non-disruptive snapshot managers, even though snapshots are a relatively easy problem to solve.

VAST is not only supporting large cloud providers; we’re also engaged with the world’s largest enterprise IT organizations to build the tools needed to apply proper data management to petabytes to exabytes of data. The data lifecycle tools we use to support the data foundations for Tier-1 banks, healthcare systems and intelligence agencies are becoming extremely relevant to model builders and inference services. These tools include:

  • WORM-lock (write-once, read-many) tools help prevent model tampering by making every file/object/table immutable. These tools have been developed to the standards of the US Securities and Exchange Commission.

  • Ransomware-proof snapshots can prevent your critical datasets from being compromised. Our new anomaly detection features use internal machine learning tools to observe significant changes in user behavior and data composition, so users can proactively determine whether their data (or systems) have been compromised.

  • Audit trails index every admin and user action, and these actions can be queried instantly in our database, as the VAST Data Platform comes configured with its own security information and event management (SIEM) tools.
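
As a rough illustration of what querying an audit trail with SQL can look like, the sketch below uses Python's built-in sqlite3 module purely as a stand-in. The table name, columns, sample rows, and function names are all hypothetical, and nothing here reflects the VAST Data Platform's actual schema or query interface.

```python
import sqlite3


def build_audit_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical audit-log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_log "
        "(ts TEXT, tenant TEXT, actor TEXT, action TEXT, path TEXT)"
    )
    conn.executemany(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
        [
            ("2024-07-01T10:00:00", "tenant-a", "alice", "READ", "/models/v1.bin"),
            ("2024-07-01T10:05:00", "tenant-a", "bob", "DELETE", "/datasets/raw.csv"),
            ("2024-07-01T10:06:00", "tenant-b", "carol", "WRITE", "/logs/infer.json"),
            ("2024-07-01T10:07:00", "tenant-a", "bob", "DELETE", "/datasets/tmp.csv"),
        ],
    )
    return conn


def deletes_per_actor(conn: sqlite3.Connection, tenant: str) -> list:
    """Count DELETE actions per actor within one tenant, busiest first."""
    return conn.execute(
        "SELECT actor, COUNT(*) FROM audit_log "
        "WHERE tenant = ? AND action = 'DELETE' "
        "GROUP BY actor ORDER BY COUNT(*) DESC",
        (tenant,),
    ).fetchall()


if __name__ == "__main__":
    print(deletes_per_actor(build_audit_db(), "tenant-a"))
```

The point of the sketch is the shape of the workflow: once every action lands in a table, a compliance or security question becomes an ordinary parameterized SQL query scoped to a tenant.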

AI and cloud worlds are colliding. HPC predates cloud by decades, and the systems to support HPC weren’t originally envisioned for scalable, zero-trust service delivery. VAST was born in the age of cloud, and we’ve always understood that we have to build the infrastructure for the next generation of cloud. This compelled us to develop:

  • Multi-tenancy that supports tens of thousands of isolated tenants

  • Support for tenant-based enterprise key management

  • Support for tenant-based quality of service, chargeback and billback

  • Support for network-based data isolation (underlay, overlay networks and VXLAN)

  • Support for host-based I/O termination and data isolation using DPUs

The path to high GPU utilization is secure and flexible multi-tenancy. VAST has designed an offering that makes sure you never have to thick-provision GPUs or storage or databases to isolate services for tenants.

Clouds also need object storage. While other companies force you to create buckets in isolated containers or to deploy wholly independent object storage systems, we believe multiple solutions just compound data management complexity. VAST has been building a single unified data management system that allows data to be accessed through file, object and table protocols simultaneously, with an enterprise multi-protocol authentication and access control system. VAST allows customers to process locally at high speed using files, and to share and ingest data globally using S3 protocols.

VAST has proven itself at scale.

Our first customers were all companies that had very poor experiences with parallel file systems and wanted something scalable, affordable and simple. We now call hundreds of organizations in the supercomputing space our customers.

  • In 2023, VAST became the first NFS storage offering certified for NVIDIA DGX SuperPOD.

  • Parallel file systems were once the only choice for 10,000+ GPU systems; then VAST started deploying in systems that range from 20,000 GPUs to over 100,000 GPUs.

  • And today, we are proud to be certified for the NVIDIA cloud partner reference architecture.

There’s a trend here…

Today’s CSPs face the tough realities of enterprise data management and uptime requirements when all of these worlds collide. This is why many of them advocated for us to be included in the NVIDIA Partner Network’s cloud partner reference architecture program. And now, we have met that burden of proof as we build one of the most successful data infrastructure companies in history.

We look forward to showing you all we can do to advance the state of the art in AI clouds and to further evolve the VAST Data Platform to meet the new needs of the world’s most ambitious and intelligent cloud end users. We design with our customers and partners, and we love to be challenged to create a better future.

If you’ve gotten this far, thanks for taking the time to read our NVIDIA Cloud Partner (NCP) story 😊. Happy to dive very deep into any of the above topics… as you wish.


