Introducing: Rack-Scale Resilience

Authored by

Howard Marks, VAST Technologist Extraordinary and Plenipotentiary

This blog post was written in 2020 and reflects product capabilities at that time. Some information may be outdated.

Since this blog post was written, we’ve adjusted enclosure resilience (renamed "DBox-HA") to write a maximum of two strips from each erasure code stripe to any enclosure (DBox), instead of the three-strip limit in VAST version 3. That means the eleven-DBox cluster described in this post will write erasure code stripes of 18D+4P, not the 29D+4P described below.

This shift to a two-strip-per-DBox limit allowed us to further increase cluster availability, so a cluster can survive two DBox or EBox failures while remaining online, and to implement failure domains, which enable racks with multiple EBoxes or DBoxes to fail without taking the cluster down. For a more recent and deeper discussion of DBox-HA and failure domains, see: Bringing Rack-Level Resilience to VAST - Part 2: Multi-Failure Resilience in AI-Scale Data Systems.

The VAST Data Platform has always delivered significantly higher levels of resiliency than conventional storage systems. Where most vendors consider RAID6-like N+2 erasure coding good enough, we recognize that multi-petabyte storage systems have more components to fail and therefore require a higher level of protection. VAST’s locally decodable, fail-in-place erasure coding provides N+4 data protection across drives in highly available NVMe flash enclosures that have no single point of failure. Every element in the data path, from Ethernet ports to fabric modules and SSDs, is redundant in some way. No single device failure can take the enclosure offline. The combination of these safeguards provides a system Mean Time to Data Loss (MTTDL) calculated in the millions of years.

Although VAST Enclosures do not have any single point of failure, they can still be taken offline by external failures, such as tripped circuit breakers or leaky sprinklers, that take a whole rack or data center row offline. While your data center may always have sufficient redundant power and your staff never accidentally disconnects one more power cord than they should during maintenance, not everyone runs their data center as well as you. All the work VAST does to keep running through component failures would be for naught if a failure upstream of the VAST system took an enclosure offline. Recently, we were asked to address the question of “what happens when a whole rack fails?”

Until VAST Version 3, the answer to that question was “the system goes temporarily offline while the rack is down”. This behavior was the result of one of those classic tradeoffs we’re proud of breaking – the tradeoff between resilience and overhead. VAST systems use very wide erasure-code stripes, ranging from 36+4 to 146+4, in order to minimize data protection overhead. Wide striping is great for efficiency and cost, but it doesn’t inherently provide protection against a rack or whole HA Enclosure failure.

The downside to very wide erasure-code stripes is that while they provide protection against up to four concurrent device failures in a single write stripe, the system writes many more than four strips to each enclosure. In a five-enclosure cluster writing 146+4 stripes, for example, each enclosure holds an average of 30 of the 150 strips in each locally decodable stripe, so losing one enclosure removes far more strips than four parity strips can cover.

With VAST 3, we’re breaking our own tradeoff and allowing customers to choose the additional resiliency of continued operations and rebuilds after an enclosure failure, at the cost of slightly higher overhead. An enclosure-resilient VAST system distributes data and parity strips across all the VAST enclosures in the cluster, ensuring that no enclosure holds more strips than the cluster can afford to lose.

Enclosure resilience is more than just a snappy comeback to the “what happens when a whole rack fails?” question. This increased standard of availability also protects customers from compounding hardware failures, human error, and Murphy’s Law in general. When a site reliability engineer’s job is to ensure 99.999% uptime for a many-petabyte namespace, protecting against the loss of a full rack or full HA Enclosure helps them sleep at night.

How Enclosure Resilience Works

Large storage systems can now distribute the constituent strips of each locally decodable erasure code stripe across the SSDs in the cluster so that no enclosure holds more than three strips from any erasure code stripe. Since all VAST systems write four parity strips to each stripe, the system can recover from the loss of an enclosure with sufficient remaining data protection to ensure your data is safe, even if there’s an additional problem during the rebuild.
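To make the three-strip rule concrete, here is a minimal Python sketch of the idea. It is not VAST's placement algorithm: it assumes a simple round-robin assignment of strips to enclosures, and the function names (place_strips, spare_parity_after_loss) are hypothetical. It only illustrates why capping strips per enclosure below the parity count leaves protection in reserve after an enclosure is lost.

```python
from collections import Counter


def place_strips(num_strips, enclosures, max_per_enclosure):
    """Assign each strip of one stripe to an enclosure, round-robin.

    Hypothetical illustration only; real placement works at the SSD level.
    Returns a list where entry i is the enclosure index holding strip i.
    """
    if num_strips > enclosures * max_per_enclosure:
        raise ValueError("stripe is too wide for this cluster under the cap")
    return [i % enclosures for i in range(num_strips)]


def spare_parity_after_loss(placement, parity=4):
    """Parity strips still in reserve after the worst single-enclosure loss.

    A negative result means one enclosure failure loses more strips than
    the stripe can reconstruct.
    """
    worst_case_lost = max(Counter(placement).values())
    return parity - worst_case_lost


# Eleven enclosures, 29 data + 4 parity strips, at most three per enclosure.
resilient = place_strips(33, enclosures=11, max_per_enclosure=3)
print(spare_parity_after_loss(resilient))   # one strip of protection remains

# A wide 146+4 stripe over five enclosures puts 30 strips in each one.
wide = place_strips(150, enclosures=5, max_per_enclosure=30)
print(spare_parity_after_loss(wide))        # negative: stripe is unrecoverable
```

With three strips lost out of four parity strips, one more device can still fail during the rebuild without data loss, which is the margin the paragraph above describes.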

In an eleven-enclosure cluster, the system would write 29+4 erasure-coded stripes, where each enclosure would hold three of the 33 combined data and parity strips. VAST systems with Enclosure Resiliency carry somewhat more than the roughly 3% erasure-coding overhead of standard VAST clusters: this eleven-enclosure example adds only about 9%, for roughly 12% in total, to deliver rack-scale resilience.
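The arithmetic behind those figures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a sizing tool: the function name stripe_geometry is hypothetical, overhead is taken to mean parity strips divided by total strips (which reproduces the roughly 3% and 12% figures quoted in this post), and no upper bound on stripe width is modeled.

```python
def stripe_geometry(enclosures, max_per_enclosure=3, parity=4):
    """Return (data_strips, parity_strips, overhead) for one stripe.

    Assumes the stripe is as wide as the per-enclosure cap allows and
    that overhead = parity / (data + parity).
    """
    total = enclosures * max_per_enclosure
    data = total - parity
    if data <= 0:
        raise ValueError("cluster is too small for this parity count")
    return data, parity, parity / total


# Eleven enclosures with the three-strip limit described above: 29+4.
data, parity, overhead = stripe_geometry(11)
print(f"{data}+{parity}, overhead {overhead:.1%}")

# The same cluster under the later two-strip limit (DBox-HA): 18+4.
data, parity, overhead = stripe_geometry(11, max_per_enclosure=2)
print(f"{data}+{parity}, overhead {overhead:.1%}")

# For comparison, a standard 146+4 stripe.
print(f"146+4, overhead {4 / 150:.1%}")
```

Each additional enclosure widens the stripe by max_per_enclosure strips while parity stays at four, so the overhead fraction shrinks as the cluster grows.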

This level of efficiency is much more cost-effective than replication systems, which typically have 200% overhead, or even the 35% overhead of erasure-coded object stores. As with our standard erasure coding, stripes grow wider on larger clusters, adding three strips per stripe for each additional enclosure, which makes larger systems even more efficient.

After an enclosure failure, the system rebuilds the erasure code stripes across the remaining enclosures to restore N+4 protection.

VAST started out by breaking the traditional tradeoff between resiliency and efficiency with very wide erasure code stripes, empowered by locally decodable codes. With Enclosure Resiliency, we break that same tradeoff once again, at a higher level of resiliency. Where previous solutions had 33-200% overhead, VAST adds Enclosure Resiliency with as little as 12% overhead.
