product
Jun 27, 2025

Bringing Rack-Level Resilience to VAST - Part 2: Multi-Failure Resilience in AI-Scale Data Systems


Author

Howard Marks, Technologist Extraordinary and Plenipotentiary

As we saw in part one of this two-part series, DBox-HA answered the question: “What happens when a whole DBox blows up?” DBox-HA has proved popular enough to become the default data layout for larger VAST clusters.

But DBox-HA is designed to protect against a single DBox going offline at a time, and it provides rack-level resilience in deployments with one DBox per rack. VAST’s rack-level resilience expands on the level of protection provided by DBox-HA in two dimensions:

  • Supports multi-DBox/EBox failure domains, including configurations with multiple boxes per rack
  • Maintains system and data availability over two simultaneous DBox/EBox failures, not just one

As clusters grew and data center space didn’t, we set out to support higher density with multiple DBoxes per rack. Since EBoxes are just x86 servers with all the typical single points of failure, we needed to ensure the system could tolerate any two EBoxes going offline simultaneously, not just one. Rack-level resilience extends DBox-HA by supporting multiple box failures across two dimensions, both within a single failure domain and across multiple failure domains.

Defining Failure Domains

The first step in setting up rack-level resilience is to define how many failure domains are in the cluster and to which failure domain each DBox, EBox, or VAST on Cloud instance in the cluster belongs. If failure domains are not properly configured, clusters with DBox-HA enabled will act as one failure domain.

While failure domains and racks are frequently congruent, failure domains can also refer to rack pairs, cages, data halls, or cloud provider failure domains. That means any reference in these blog posts to racks, or to rack-level resilience, applies equally to non-rack failure domain definitions.

In clusters using failure domains, each erasure code stripe is built using the two SSDs within each failure domain that have the most available capacity and endurance, selected from across all DBoxes or EBoxes in that domain.
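
As an illustration of that selection policy, here is a minimal sketch in Python. The class, function names, and the capacity-times-endurance score are assumptions for the example, not VAST's actual implementation; the point is simply that strips are drawn from the best-scoring drives in every failure domain.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Ssd:
    ssd_id: str
    failure_domain: str        # e.g. a rack, rack pair, cage, or data hall
    free_capacity_gb: float
    endurance_left_pct: float  # remaining rated write endurance

def pick_stripe_targets(ssds: list[Ssd], strips_per_domain: int = 2) -> list[Ssd]:
    """Pick the drives that will hold one erasure code stripe's strips.

    From every failure domain, take the `strips_per_domain` SSDs with the
    most available capacity and endurance. The score below (a simple
    product) is an assumed weighting, not VAST's actual selection logic.
    """
    by_domain: dict[str, list[Ssd]] = defaultdict(list)
    for ssd in ssds:
        by_domain[ssd.failure_domain].append(ssd)

    targets: list[Ssd] = []
    for drives in by_domain.values():
        drives.sort(key=lambda d: d.free_capacity_gb * d.endurance_left_pct,
                    reverse=True)
        targets.extend(drives[:strips_per_domain])
    return targets
```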

Upping SCM Protection

As we’ve seen, the resilience of locally decodable erasure codes is high enough to support multiple box failures if the data is properly distributed. We could have achieved rack-level resilience by similarly fine-tuning VAST Classic to mirror write buffer pages and metadata blocks to SCM SSDs in separate failure domains/racks instead of just different DBoxes.

But while failure domain anti-affinity addresses the requirement to support multiple DBoxes/EBoxes in a failure domain, mirroring isn't sufficient to maintain availability when any two EBoxes across failure domains go offline. When any two DBoxes/EBoxes go offline, there is always some risk that those two boxes hold both copies of some write buffer pages or metadata blocks. That would, of course, cause the system to go offline as well. Because the I/O patterns to the write buffer and system metadata are very different, we solved the problem of N+2 protection for each in a distinct way.

Write Buffer Erasure Coding

Our first step toward N+2 protection for SCM was transitioning from mirroring to erasure coding data in the write buffer. Under VAST Classic, the CNode receiving a write request immediately writes the data from that request as the same size I/O into two write buffer pages. As the name implies, write buffer erasure coding replaces that mirroring process with erasure codes - specifically, very CPU-efficient double parity erasure codes.

With write buffer erasure coding, the CNode coalesces data from multiple write requests, potentially from different clients and targeting different elements, into erasure code stripes in memory. These stripes are then written across the SCM SSDs before any write is acknowledged to the client. During periods of low write bandwidth, the system times out after a few hundred microseconds to keep write latency low, filling any empty space in the stripe with zeros before writing and acknowledging it.
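
The sketch below, a simplification with assumed names, strip sizes, and timeout values, illustrates the behavior just described: client writes are coalesced into a stripe buffer, a stripe that fills is encoded immediately, and a stripe that doesn't fill within a few hundred microseconds is zero-padded so the acknowledgment isn't delayed. The double parity here is a standard RAID-6 style P/Q calculation standing in for whatever codes VAST actually uses.

```python
import time

STRIP_SIZE = 64 * 1024    # illustrative strip size, not a VAST value
DATA_STRIPS = 10          # 10 data + 2 parity, the widest layout mentioned
FILL_TIMEOUT_S = 300e-6   # "a few hundred microseconds"

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by the 0x11D polynomial."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return product

def double_parity(strips: list[bytes]) -> list[bytes]:
    """RAID-6 style P (XOR) and Q (GF(2^8)-weighted XOR) parity strips."""
    p, q = bytearray(STRIP_SIZE), bytearray(STRIP_SIZE)
    coefficient = 1
    for strip in strips:
        for i, byte in enumerate(strip):
            p[i] ^= byte
            q[i] ^= gf_mul(coefficient, byte)
        coefficient = gf_mul(coefficient, 2)   # next power of the generator
    return [bytes(p), bytes(q)]

class StripeAssembler:
    """Coalesce client writes into one erasure code stripe at a time."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.deadline = 0.0

    def add_write(self, payload: bytes):
        if not self.buffer:
            self.deadline = time.monotonic() + FILL_TIMEOUT_S
        self.buffer += payload
        if len(self.buffer) >= DATA_STRIPS * STRIP_SIZE:
            return self.flush()     # full stripe: encode and write immediately
        return None                 # keep buffering, ack still pending

    def poll(self):
        # Under light load the stripe never fills. Rather than hold the
        # client ack, time out, zero-pad the remainder, and flush.
        if self.buffer and time.monotonic() >= self.deadline:
            return self.flush()
        return None

    def flush(self) -> list[bytes]:
        data = bytes(self.buffer[:DATA_STRIPS * STRIP_SIZE])
        data += bytes(DATA_STRIPS * STRIP_SIZE - len(data))   # zero fill
        strips = [data[i * STRIP_SIZE:(i + 1) * STRIP_SIZE]
                  for i in range(DATA_STRIPS)]
        self.buffer = bytearray()   # overflow beyond one stripe is dropped in this sketch
        return strips + double_parity(strips)   # 12 strips to spread over SCM SSDs
```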


Write buffer erasure coding not only increases the data protection level – allowing the cluster to continue operating even with any two SCM SSDs, DBoxes, or EBoxes offline - but also boosts write performance by as much as 100%. This performance boost comes from the higher efficiency of erasure codes. When a CNode mirrors a 1 MB write, it has to write 2 MB of data and consume 2 MB of bandwidth between the CNode and the SCM SSDs. When that CNode writes erasure code stripes with 10 data and 2 parity strips, that overhead drops from 100% for mirroring down to 20%, and the CNode only has to write 1.2 MB. Coalescing multiple writes, along with the greater parallelism of writing to more SCM SSDs, also contributes to additional performance gains.
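
As a quick check of that arithmetic, the snippet below computes the bytes a CNode must push to SCM for a 1 MB client write under 2-way mirroring versus a 10D+2P stripe (ignoring padding and coalescing effects).

```python
def scm_bytes_per_write(client_bytes: int, data_strips: int, parity_strips: int) -> float:
    """Bytes a CNode sends to SCM per client write, assuming full stripes."""
    return client_bytes * (data_strips + parity_strips) / data_strips

one_mb = 1024 ** 2
mirrored = scm_bytes_per_write(one_mb, data_strips=1, parity_strips=1)    # 2.0 MB
striped = scm_bytes_per_write(one_mb, data_strips=10, parity_strips=2)    # 1.2 MB

print(f"mirroring overhead: {mirrored / one_mb - 1:.0%}")   # 100%
print(f"10D+2P overhead:    {striped / one_mb - 1:.0%}")    # 20%
```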

How the cluster distributes erasure code strips depends on the cluster’s availability level:

  • With VAST Classic: Stripes of 6D, 8D, or 10D+2P with no more than 2 strips per DBox
    • Survives 1 DBox or any 2 SCM SSDs offline
  • With DBox-HA: Stripes of 6D, 8D, or 10D+2P with no more than 2 strips per DBox
    • Survives 1 DBox or any 2 SCM SSDs offline
  • With Metadata Triplication (RLR or EBox): Stripes of 6D, 8D, or 10D+2P with no more than 1 strip per failure domain
    • Survives 2 failure domains, 2 DBoxes, or any 2 SSDs offline

Metadata Triplication

The erasure coding methods VAST clusters use for the write buffer strike a balance between the greater efficiency of wider stripes and the impact those stripes may have on latency for typical storage write I/O patterns. Under heavy write demand, CNodes can assemble full erasure code stripes and maximize bandwidth. When demand is lower, they pad stripes with zeros, trading unused bandwidth for reduced latency. Any wasted SCM space will be quickly and inexpensively recovered when the write buffer is migrated to QLC.

As efficient as those erasure codes are for newly written data, they don't fit the I/O patterns for metadata updates. Every write or S3 PUT creates multiple new pointers, which means new 4 KB metadata blocks and new versions of existing blocks. All those small I/Os landing in the middle of erasure code stripes would trigger significant read-modify-write I/O amplification, the last thing you want for your metadata.


VAST clusters protect the system’s metadata against two EBoxes going offline simultaneously by triplicating it, that is, writing each metadata update to three SCM SSDs in different DBoxes, EBoxes, or failure domains. The CNode processing the write I/O from a client writes the new metadata block(s) to three SCM SSDs in different failure domains simultaneously, and sends an acknowledgment to the client only when the NVMe over Fabrics (NVMe-oF) acknowledgment comes back from all three.
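
The sketch below models that ordering, using asyncio in place of real NVMe-oF queues and hypothetical helper names: the metadata block goes to SSDs in three distinct failure domains, and the client acknowledgment is held until all three writes have completed.

```python
import asyncio
import random

async def nvmeof_write(ssd_id: str, block: bytes) -> None:
    """Stand-in for an NVMe-oF write to one SCM SSD (simulated latency only)."""
    await asyncio.sleep(random.uniform(0.0001, 0.0005))

def pick_three_domains(ssds_by_domain: dict[str, list[str]]) -> list[str]:
    """Choose one SCM SSD from each of three different failure domains."""
    domains = random.sample(sorted(ssds_by_domain), 3)
    return [random.choice(ssds_by_domain[d]) for d in domains]

async def write_metadata_block(block: bytes,
                               ssds_by_domain: dict[str, list[str]]) -> None:
    targets = pick_three_domains(ssds_by_domain)
    # The client ack is held until every one of the three replica writes completes.
    await asyncio.gather(*(nvmeof_write(t, block) for t in targets))

# Example: three failure domains, each with one or two SCM SSDs.
layout = {"fd1": ["scm-1a", "scm-1b"], "fd2": ["scm-2a"], "fd3": ["scm-3a", "scm-3b"]}
asyncio.run(write_metadata_block(b"\x00" * 4096, layout))   # a 4 KB metadata block
```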

Redistributing Erasure Codes for Failure Domains

The DBox-HA data layout writes two strips from each erasure code stripe to each DBox. With locally decodable codes, the system can rebuild from as many as four missing strips per stripe. This combination has led some people to assume that a DBox-HA cluster can continue if two DBoxes go offline simultaneously. There are problems with that assumption for both data and metadata. The metadata problems can be solved by enabling metadata triplication, as described above. The data problem is a bit more subtle.

In a brand spanking new DBox-HA cluster, every DBox holds two strips of each erasure code stripe. If two of those DBoxes were to fail simultaneously in that brand new cluster, the system should be able to reconstruct all of its data. The key word in that sentence is should. If any DBox in that cluster has already failed, the rebuild process may cause some DBoxes in the cluster to hold three strips of certain erasure code stripes. This is illustrated in the figure below, which should look familiar to readers of Part 1.

[Figure: A DBox-HA cluster after rebuild]

Once a cluster has even a small number of erasure code stripes with three strips placed on the same DBox, it introduces a risk: if that DBox fails, along with any other DBox holding two additional strips from the same stripe, the system loses five strips from that stripe, making it unreadable. That exceeds the recovery limit of the locally decodable codes. As a result, some data would be unavailable, taking the entire cluster offline.

To support our goal of keeping the cluster available through the failure of any two DBoxes or EBoxes, we had to stop leaving three strips of erasure code stripes on a box or in a failure domain after every failure.

[Figure: Erasure code stripes laid out to exclude one failure domain per stripe]

As shown in the figure above, the solution is to write slightly narrower erasure code stripes that exclude one failure domain from each stripe. Whenever an EBox or DBox in the cluster goes offline, the two erasure code strips that box holds are reconstructed onto a box in the failure domain that was excluded from that stripe, rather than creating a third strip in a failure domain that already holds two.
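
A minimal sketch of that placement rule, with illustrative names and geometry: each stripe spans all failure domains except one, and when a box fails, its strips are rebuilt into the domain that stripe had excluded, so no domain ever accumulates a third strip.

```python
def place_stripes(failure_domains: list[str], n_stripes: int,
                  strips_per_domain: int = 2) -> list[dict]:
    """Lay each stripe across all failure domains except one, rotating the gap."""
    placements = []
    for stripe_id in range(n_stripes):
        excluded = failure_domains[stripe_id % len(failure_domains)]
        members = [fd for fd in failure_domains if fd != excluded]
        placements.append({
            "stripe": stripe_id,
            "excluded": excluded,                      # this stripe's rebuild target
            "strips": {fd: strips_per_domain for fd in members},
        })
    return placements

def rebuild_after_failure(placement: dict, failed_domain: str) -> dict:
    """Reconstruct a failed domain's strips into the domain this stripe excluded."""
    if failed_domain not in placement["strips"]:
        return placement                               # stripe untouched by the failure
    moved = placement["strips"].pop(failed_domain)
    placement["strips"][placement["excluded"]] = moved
    placement["excluded"] = failed_domain              # the failed domain is now the gap
    return placement

domains = ["rack1", "rack2", "rack3", "rack4", "rack5"]
stripes = [rebuild_after_failure(p, "rack2") for p in place_stripes(domains, n_stripes=10)]
# No failure domain ever ends up holding a third strip of any stripe.
assert all(max(p["strips"].values()) <= 2 for p in stripes)
```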

Clusters with failure domains and metadata triplication can remain available through multiple simultaneous failures:

  • Two full failure domains
  • One full failure domain plus one EBox/DBox from any other failure domain
  • One full failure domain plus any SCM SSD from any other failure domain
  • One full failure domain plus any two QLC SSDs from any other failure domain
  • Any two EBoxes in the cluster
  • One EBox and any one SCM SSD
  • One EBox and any two QLC SSDs in the cluster
  • Any two SCM SSDs
  • Any four QLC SSDs

Thanks for reading about how VAST builds on DBox-HA with rack-level resilience and advanced protection strategies to keep clusters online even through multiple simultaneous failures. By combining write buffer erasure coding, metadata triplication, and failure-domain-aware striping, VAST ensures both performance and availability at scale. These innovations allow VAST clusters to meet the high-density, high-reliability demands of modern AI and data-driven workloads.
