A key design tenet of VAST’s DASE architecture is resilience. Much of the platform’s unique value comes from the efficiency and exceptional data availability – achieving 11 nines – provided by locally decodable erasure codes. As VAST clusters now operate across varied hardware environments, from EBoxes with single points of failure to cloud data centers, we’ve evolved our data protection strategies. The latest VAST Data Platform release introduces the ability to safeguard against the failure of a DBox, an EBox, or a full rack, maintaining both data availability and durability.
This blog series provides an in-depth look at the data protection mechanisms within the VAST Data Platform, covering resilience at multiple levels. If you’re primarily interested in the practical outcomes - without all the technical jargon - the summary at the end of Part 2 reviews the range of device and appliance failures a VAST cluster can survive without interrupting data access for your applications.
Locally-Decodable Codes
Before we get into the details of how VAST clusters implement rack-level resilience, let’s review how VAST clusters have traditionally placed data across their SSDs:
Locally decodable erasure codes protect data from as many as four simultaneous SSD failures with four parity strips per data stripe
Stripe width (8-146 data strips per stripe) is determined by the number of SSDs in the cluster
Four SSDs are excluded from each stripe to allow rebuild in place
The SSDs with the most endurance and capacity remaining are selected for each stripe
DBox placement is not a factor in SSD selection
The figure below shows a hypothetical small VAST cluster with 12 QLC SSDs. Each color represents an erasure code stripe of six data strips, shown as solid blocks, and four parity strips, shown as blocks with diagonal stripes.
This data layout, which I’ll refer to hereafter as “VAST Classic” for simplicity, is optimized for efficiency, reaching the maximum stripe width, and therefore the minimum 2.7% overhead, on clusters with as few as 150 SSDs. It does, however, rely on highly available DBoxes to ensure that SSDs remain connected to the NVMe fabric.
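To make those placement rules concrete, here’s a minimal sketch in Python. The data structures, the endurance/capacity scoring, and the exact width formula are illustrative assumptions, not the platform’s actual implementation.

```python
# Illustrative sketch of the "VAST Classic" placement rules described above.
from dataclasses import dataclass

PARITY_STRIPS = 4       # locally decodable codes carry four parity strips per stripe
SPARE_SSDS = 4          # SSDs left out of each stripe to allow rebuild in place
MIN_DATA_STRIPS = 8     # stripe width ranges from 8 to 146 data strips
MAX_DATA_STRIPS = 146

@dataclass
class Ssd:
    ssd_id: int
    endurance_left: float   # fraction of rated write endurance remaining
    capacity_left: float    # fraction of usable capacity remaining

def classic_data_strips(num_ssds: int) -> int:
    """One plausible way to derive data strips per stripe from cluster size."""
    usable = num_ssds - SPARE_SSDS - PARITY_STRIPS
    return max(MIN_DATA_STRIPS, min(MAX_DATA_STRIPS, usable))

def pick_ssds_for_stripe(ssds: list[Ssd]) -> list[Ssd]:
    """Pick the healthiest SSDs for the next stripe; DBox placement is ignored."""
    width = classic_data_strips(len(ssds)) + PARITY_STRIPS
    ranked = sorted(ssds, key=lambda s: (s.endurance_left, s.capacity_left),
                    reverse=True)
    return ranked[:width]   # SSDs left over include the rebuild-in-place spares
```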
We started shipping highly available DBoxes in 2019, and while we no longer sell hardware, our software still powers them. So a data layout that relied on highly available DBoxes worked just fine - until we met the cloud providers.
Cloud Datacenters Are Messy
Most early VAST customers were happy with the availability DBoxes provided, but you can’t please all of the people all of the time, and one cloud operator presented us with a problem.
This CSP’s internally developed applications were built at massive scale with a fail-in-place philosophy. Since their software treated each rack as a failure domain, and the applications could, therefore, continue even when a full rack failed, someone decided to save money in the data center buildout by only equipping each rack with a single PDU. This meant that racks regularly lost power.
VAST Classic provides a high level of protection against SSD failures. However, because it relies on the redundancy built into each DBox for fault tolerance, it does not protect overall cluster availability if an entire rack - and thus its DBoxes - goes offline due to a power failure. Each DBox holds more strips of every stripe than the four the erasure codes can reconstruct, so the system can’t read data while a DBox is offline.
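A quick back-of-the-envelope calculation shows why (the DBox count here is just an example): spread a full-width Classic stripe evenly across a rack’s worth of DBoxes and each box holds far more than the four strips the code can rebuild.

```python
# Strips per DBox under the Classic layout - illustrative numbers only.
stripe_strips = 146 + 4            # 146 data + 4 parity strips
dboxes = 8
print(stripe_strips / dboxes)      # ~18.75 strips per DBox, far more than the 4 the code can lose
```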
As an old-timer enterprise architect, this humble reporter assumed that an infrastructure with a single PDU per rack was a one-off. Surprisingly, over the ensuing years we’ve built clusters in enough data centers with questionable power, a single top-of-rack switch per rack, and other threats to rack-level availability that we started building protections against such things into the VAST Data Platform.
Enter DBox-HA
Our cloud provider customer with questionable power availability wanted their cluster to stay highly available even when the racks that housed its highly available DBoxes weren’t themselves available. Our solution was a new data layout we call DBox-HA, which adds DBox awareness to the data placement algorithm.
A cluster running DBox-HA limits each DBox to two strips of each erasure code stripe - each of which can hold data or parity - written to two different SSDs in the DBox. In the event of a DBox failure or circuit breaker trip, each erasure code stripe in the cluster will be missing two strips, well within the ability of locally decodable codes to correct.
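As a rough sketch (with assumed data structures and selection order), the placement constraint looks something like this: every stripe takes exactly two strips from each DBox, each strip on a different SSD in that box.

```python
# Sketch of the DBox-HA placement constraint. Structures are assumptions.
from typing import Dict, List, Tuple

STRIPS_PER_DBOX = 2

def dbox_ha_placement(dbox_ssds: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (dbox, ssd) pairs for one stripe: two distinct SSDs per DBox."""
    placement = []
    for dbox, ssds in dbox_ssds.items():
        chosen = ssds[:STRIPS_PER_DBOX]   # real placement would rank SSDs by health
        placement.extend((dbox, ssd) for ssd in chosen)
    return placement

# An 8-DBox cluster yields 16 strips per stripe: 12 data + 4 parity.
cluster = {f"dbox-{i}": [f"dbox-{i}/ssd-{j}" for j in range(20)] for i in range(8)}
assert len(dbox_ha_placement(cluster)) == 16   # arbitrary example SSD count per box
```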
The figure below shows a VAST cluster running the DBox-HA data layout across eight DBoxes. Note how each DBox holds two strips of every stripe.
Since the number of strips per DBox is limited, erasure code stripe width in a DBox-HA cluster is determined by the number of DBoxes in the cluster - not the number of SSDs - and those stripes are narrower than the system would write across those same DBoxes in the VAST Classic layout. Our 8 DBox cluster above would have over 154 total SSDs and, therefore, would use erasure code stripes of 146D+4P with just 2.7% overhead.
With DBox-HA the erasure code stripes are 12D+4P for 25% overhead, which isn’t that bad. The overhead, of course, decreases as the number of DBoxes in the cluster grows, so a 20 DBox cluster has only 10% overhead, and clusters of 75 DBoxes or more reach the minimum 2.7% overhead.
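Those overhead figures fall straight out of the two-strips-per-DBox rule: four of the stripe’s strips are parity, and the stripe is capped at 146 data strips just as in the Classic layout. A quick calculation reproduces the numbers above:

```python
# Parity overhead under DBox-HA as a function of DBox count.
def dbox_ha_overhead(num_dboxes: int) -> float:
    total_strips = min(2 * num_dboxes, 146 + 4)   # 2 strips per DBox, capped at 146D+4P
    return 4 / total_strips                        # 4 parity strips per stripe

for n in (8, 20, 75):
    print(n, f"{dbox_ha_overhead(n):.1%}")
# 8  -> 25.0%  (12D+4P)
# 20 -> 10.0%  (36D+4P)
# 75 ->  2.7%  (146D+4P)
```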
DBox-HA is designed to keep a VAST cluster available through the failure of a single DBox and up to two more QLC SSD failures. In the event of a DBox failure, the system initiates a rebuild process. Each CNode in the cluster is assigned erasure code stripes to reconstruct the missing data. This continues until either the DBox is brought back online (e.g., by resetting a circuit breaker) or the cluster’s data is once again protected against up to four SSD failures.
During a rebuild, the cluster’s CNodes write the rebuilt data from the failed DBox’s SSDs to the surviving DBoxes in the cluster. The erasure code stripe width - for both newly written data and stripes formed during garbage collection - is equal to twice the number of active DBoxes at the time the stripe is created. This width automatically narrows as DBoxes go offline. As a result, some DBoxes may hold three strips of a given erasure code stripe, as shown in the figure below. This remains within the erasure code’s ability to tolerate up to four failures, allowing the system to continue operating even if those DBoxes go offline.
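One plausible way to picture that redistribution (the bookkeeping below is an assumption for illustration): strips that lived on the failed DBox are reconstructed and written to whichever surviving DBoxes hold the fewest strips of that stripe, which is how a surviving DBox can end up holding three.

```python
# Sketch of redistributing a stripe's strips after a DBox failure.
from collections import Counter
from typing import Dict

def redistribute(stripe_layout: Dict[str, str], failed_dbox: str) -> Dict[str, str]:
    """Map each strip to a DBox, moving strips off the failed DBox onto the
    surviving DBoxes that hold the fewest strips of this stripe."""
    survivors = Counter(d for d in stripe_layout.values() if d != failed_dbox)
    new_layout = {}
    for strip, dbox in stripe_layout.items():
        if dbox == failed_dbox:
            dbox = min(survivors, key=survivors.get)   # least-loaded surviving DBox
            survivors[dbox] += 1
        new_layout[strip] = dbox
    return new_layout

# A 12D+4P stripe spread two strips per DBox across 8 DBoxes; dbox-0 fails.
layout = {f"strip-{i}": f"dbox-{i // 2}" for i in range(16)}
counts = Counter(redistribute(layout, "dbox-0").values())
print(counts)   # two surviving DBoxes now hold 3 strips of this stripe, still within the 4-failure tolerance
```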
If a DBox returns during the rebuild process, the system halts the rebuild. At that point, each erasure code stripe will either have been rewritten or will once again be intact with the return of the prodigal DBox. Once the rebuild is completed, the system is fully restored to its design resilience: it can once again suffer one DBox failure and two QLC SSD failures while maintaining data availability.
DBox-HA solved our cloud provider customer’s rack-level resilience problem. They deployed 11-DBox clusters with one DBox per rack and could ride through the inevitable circuit breaker trips just the way they wanted. So many other VAST customers appreciated the additional resilience and simplified maintenance DBox-HA provided that we made it the default data layout for large clusters. One customer even used DBox-HA to plan a move from one cage to another in their CoLo, taking one DBox offline at a time as they relocated the cluster.
In our next installment, we’ll explore how we expanded DBox-HA into full rack-level resilience, ensuring that VAST clusters can, as John Cameron Swayze used to say on the old Timex commercials, “take a licking and keep on ticking” - through not only single DBox failures but multiple failures across multiple failure domains.