Beyond Better Codes
In our last installment, we saw how locally decodable codes allow a VAST Universal Storage System to use very wide stripes, typically 150D+4P, without having to read all 149 remaining data strips to rebuild from a device failure.
Some other vendor may have taken locally decodable codes and applied them to a traditional RAID system where users had to specify the number of data strips and protection strips for multiple RAIDsets, and the system reconstructed the whole contents of each failed SSD. But at VAST Data, we aren’t so easily satisfied.
No Nerd Knobs
Other storage systems make administrators choose between the efficiency of wide stripes and the lower write latency of replication or narrow stripes. The large 3D XPoint write buffer in a VAST system decouples write latency from the back-end data layout: since the system ACKs from the 3D XPoint and accumulates data there, back-end stripe width doesn’t affect write latency.
VAST systems determine stripe width automatically based on the number of SSDs in the cluster, the relative wear of those SSDs, and other more esoteric inputs, freeing storage administrators from choosing the right striping and protection level for each workload. Our coding is also dynamic, so when one of 608 SSDs in a system that’s using 600D+8P encoding fails, we can rebuild onto 599D+8P stripes.
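To make the 608-SSD example concrete, here is a minimal sketch that derives a stripe geometry from the count of healthy SSDs alone. The function name `stripe_geometry` and the rule "one strip per healthy SSD, fixed parity count" are assumptions for illustration only; the real system also weighs SSD wear and other inputs that aren't modeled here.

```python
def stripe_geometry(healthy_ssds: int, parity_strips: int) -> str:
    """Return an 'xD+yP' label, assuming one strip per healthy SSD (hypothetical rule)."""
    if healthy_ssds <= parity_strips:
        raise ValueError("need more healthy SSDs than parity strips")
    data_strips = healthy_ssds - parity_strips
    return f"{data_strips}D+{parity_strips}P"


print(stripe_geometry(608, 8))  # 600D+8P
print(stripe_geometry(607, 8))  # 599D+8P after one SSD fails
```

The point of the sketch is only that the geometry is computed, not configured: nobody has to pick a new width when a device drops out.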
Intelligent, Data Only Rebuilds
Since the VAST Element Store (our hybrid file system and object store) manages files and objects, not just arbitrary blocks of data, we know which blocks on each SSD are used, which are empty, and even which are used but are holding deleted data. When an SSD fails, we only have to rebuild the live data and can ignore the deleted data and empty space.
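A data-only rebuild boils down to filtering by block state before doing any reconstruction work. The sketch below shows that idea under stated assumptions: the `BlockState` enum, the `blocks_to_rebuild` helper, and the per-SSD dictionary are all hypothetical stand-ins, not the Element Store's actual metadata format.

```python
from enum import Enum


class BlockState(Enum):
    LIVE = "live"        # referenced by a file or object
    DELETED = "deleted"  # still on flash, but holding deleted data
    EMPTY = "empty"      # never written, or already reclaimed


def blocks_to_rebuild(block_states: dict[int, BlockState]) -> list[int]:
    """Return only the block numbers that hold live data and need reconstruction."""
    return sorted(b for b, state in block_states.items() if state is BlockState.LIVE)


# Hypothetical view of a failed SSD: four blocks, two of them live.
failed_ssd = {
    0: BlockState.LIVE,
    1: BlockState.EMPTY,
    2: BlockState.DELETED,
    3: BlockState.LIVE,
}
print(blocks_to_rebuild(failed_ssd))  # [0, 3]
```

A block-level RAID controller has no such state to consult, which is why it has to reconstruct the whole device.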
Rather than clustering SSDs to hold data strips, parity strips, or act as spares, the VAST Element Store distributes erasure-coded stripes across all the SSDs in the system. The system selects which SSDs to write each stripe of data to based on SSD wear, not the locations being written.
In a conventional RAID system, the rebuild after a device failure has to read from the remaining drives and write the reconstructed data to a replacement, or spare, device. This many-to-one rebuild can only proceed as fast as the spare drive can save data. It also creates an I/O hotspot, stressing the small number of drives involved in the rebuild and impacting application performance.
As illustrated above, when a drive in a distributed parity system fails, the system reads data from all the remaining drives and then writes the reconstructed data not to a dedicated spare drive, but to free space across the whole system (the blocks outlined in red). This many-to-many rebuild completes much faster, and since the load is spread across the whole system, there’s less impact on performance.
Eliminating the Controller Bottleneck
Declustering RAID from designated spare disks parallelized the rebuild process and eliminated the bottleneck of having to write all the rebuilt data to one spare device. This is a good thing, but there’s an old saying among system architects: “You never eliminate a bottleneck, you just shift it somewhere else.” For most storage systems, declustering RAID shifts the bottleneck to the two controllers that are SAS-connected to the devices being rebuilt.
VAST’s DASE (Disaggregated Shared Everything) architecture parallelizes the rebuild across not only all the SSDs in the system but, just as importantly, all the VAST Servers in the system. As you can see in the diagram below, each VAST Server reads three surviving data strips (1/4th of 10 data strips, rounded up to 3) and the four parity strips from a stripe, rebuilds the data strip from the failed SSD, and writes the rebuilt data to a strip on an SSD that’s not already participating in the stripe being rebuilt.
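The arithmetic behind "three data strips plus four parity strips" can be sketched as follows. The helper name `strips_to_read` is hypothetical, and the sketch assumes the read fraction is always one quarter of the data strips, rounded up, plus every parity strip; the article gives that rule only for the 10D+4P example, so the 150D+4P figure is an extrapolation.

```python
import math


def strips_to_read(data_strips: int, parity_strips: int) -> int:
    """Strips read to rebuild one lost data strip, assuming a fixed 1/4 read fraction."""
    surviving_data = math.ceil(data_strips / 4)  # 1/4th of the data strips, rounded up
    return surviving_data + parity_strips


print(strips_to_read(10, 4))   # 3 data + 4 parity = 7
print(strips_to_read(150, 4))  # 38 data + 4 parity = 42, versus 149 for a full-stripe read
```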
With the VAST Servers working independently to rebuild each stripe, there’s none of the east-west internode chatter that can clog the networks of shared-nothing systems and a lot more horsepower available than a dual controller system could spare.
A Server sees in the system metadata that there’s data to rebuild, claims a stripe, rebuilds it, writes the result to flash, and updates the metadata pointer to that strip. A large Universal Storage System will have dozens of VAST Servers sharing the load and shortening rebuild time.
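The claim-rebuild-update loop can be sketched with a few threads pulling from a shared work list. Everything here is an assumption made for illustration: the `queue.Queue` stands in for the shared system metadata, `rebuild_stripe` is a placeholder that does no real erasure decoding, and the thread count and stripe count are arbitrary. The article doesn't describe the actual claiming mechanism.

```python
import queue
import threading


def rebuild_stripe(stripe_id: int) -> str:
    """Placeholder for erasure decoding; returns a made-up location for the rebuilt strip."""
    return f"free-space/strip-{stripe_id}"


def server_loop(pending: "queue.Queue[int]", metadata: dict, lock: threading.Lock) -> None:
    """One stateless server: claim a stripe, rebuild it, update the pointer, repeat."""
    while True:
        try:
            stripe_id = pending.get_nowait()  # claim: each stripe goes to exactly one server
        except queue.Empty:
            return  # nothing left to rebuild
        location = rebuild_stripe(stripe_id)
        with lock:
            metadata[stripe_id] = location  # update the pointer to the rebuilt strip


pending: "queue.Queue[int]" = queue.Queue()
for stripe in range(100):
    pending.put(stripe)

metadata: dict = {}
lock = threading.Lock()
servers = [
    threading.Thread(target=server_loop, args=(pending, metadata, lock))
    for _ in range(8)
]
for t in servers:
    t.start()
for t in servers:
    t.join()

print(len(metadata))  # 100
```

Note that the workers never talk to each other; they coordinate only through the shared work list, which is the property the article attributes to servers working independently.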
Since a system with D+nP protection will only lose data when the (n+1)st device fails before the first device that failed has been rebuilt, rebuild time is an important factor in any data loss probability calculation.
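A deliberately simplified model shows why rebuild time matters so much. The sketch below is not VAST's calculation: it assumes independent device failures with a constant failure rate, treats the rebuild window as fixed, and ignores that intermediate failures are themselves being rebuilt. The function name, the device count, and the MTBF value are illustrative assumptions.

```python
import math


def loss_probability(devices: int, parity: int, rebuild_hours: float, mtbf_hours: float) -> float:
    """P(at least `parity` more devices fail during one rebuild window), simplified model."""
    # Chance that any one surviving device fails within the window (exponential lifetime).
    p = 1.0 - math.exp(-rebuild_hours / mtbf_hours)
    survivors = devices - 1
    # Sum the binomial tail directly to avoid cancellation with very small probabilities.
    return sum(
        math.comb(survivors, k) * p**k * (1.0 - p) ** (survivors - k)
        for k in range(parity, survivors + 1)
    )


# Hypothetical 150D+4P system with an assumed 1,000,000-hour device MTBF.
slow = loss_probability(devices=154, parity=4, rebuild_hours=10.0, mtbf_hours=1e6)
fast = loss_probability(devices=154, parity=4, rebuild_hours=1.0, mtbf_hours=1e6)
print(fast < slow)  # True
```

Because the dominant term scales with the window length raised to the power n, shortening the rebuild pays off far more than linearly at higher protection levels.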
1+1=3 for Erasure Codes Plus Architecture
As we saw in part 1, locally decodable erasure codes allow VAST Universal Storage Systems to provide a much higher level of data protection (typical Mean Time to Data Loss of 44 million years) while requiring only a fraction, typically 1/4th, of the data to recover from a device failure. VAST’s unique Disaggregated Shared Everything architecture (DASE) further accelerates rebuilds by distributing the rebuild load across all the VAST Servers and all the SSDs in the cluster.
The result is that VAST Universal Storage Systems, like many cloud environments, manage failures in place. A device failure in a VAST Cluster reduces available capacity a little, but with N+4 or higher levels of protection, a device failure is just a low-level event in your support ticket tracking system, not the crisis-inducing moment that a failure is in conventional storage systems.