Ever since the introduction of RAID, and the storage array, in the late ’80s, storage architects have had to trade storage capacity for the resiliency they need to live up to the storage administrator’s first commandment: “Thou shalt not lose thy user’s data.” VAST Universal Storage systems break those trade-offs with new math: locally decodable erasure codes that can protect against multiple device failures without the overhead and rebuild complexity conventional solutions impose.
Reed-Solomon protection at a price
The vast majority (no pun intended) of storage systems that use erasure coding today, including most RAID 6 implementations, use a class of error-correcting codes developed in 1960 by the mathematicians Irving S. Reed and Gustave Solomon. A system using Reed-Solomon encoding can calculate two, three, or even more ECC strips to protect a set of data strips.
Since parity is the simplest form of erasure code, the generalized notation for describing erasure-coded data stripes is nD+xP, for n data strips and x ECC or parity strips. A RAID 6, or similar, system using 7 data strips and 2 ECC strips would therefore be 7D+2P. Reed-Solomon codes aren’t limited to one or two ECC strips; in theory they support an arbitrary number. Some scale-out storage systems use Reed-Solomon codes with three or more protection strips.
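To make the notation concrete, here is a small hypothetical helper (illustrative only, not from any product) that computes a stripe’s protection overhead as the ratio of ECC strips to data strips:

```python
def stripe_overhead(n_data: int, x_parity: int) -> float:
    """Protection overhead of an nD+xP stripe: ECC strips relative to data strips."""
    return x_parity / n_data

# A 7D+2P RAID 6-style stripe, a wide 16D+4P stripe, and a very wide 150D+4P stripe:
for n, x in [(7, 2), (16, 4), (150, 4)]:
    print(f"{n}D+{x}P -> {stripe_overhead(n, x):.1%} overhead")
```

Wider stripes spread the same number of protection strips over more data, which is why overhead falls so sharply as n grows.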
Wide stripes create problems
The problem with Reed-Solomon codes, beyond the complex Galois-field math they use, is that reconstructing an unreadable data strip requires reading all the surviving data strips plus at least one of the ECC strips. In a 7D+2P system like our example above, one 4KB read request from a user application can be amplified into reads of every surviving strip in the stripe, and rebuilding a failed drive will keep the remaining 8 drives in the set busy for a long while.
As a result, storage architects have to trade off the greater efficiency and read performance of wide data-protection stripes against the I/O amplification wide stripes can create. Should a conventional storage array receive a request to perform a 4KB write in the middle of a 16D+4P erasure-code stripe (the widest I know of in common use), the system would have to read all 16 strips of data, recalculate the four protection strips, and then write all 20 strips back.
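The amplification arithmetic can be sketched in a few lines, assuming the naive full-stripe read-modify-write just described (the function name is illustrative):

```python
def rmw_ios(n_data: int, x_parity: int) -> tuple[int, int]:
    """Backend I/Os for one small in-place write under naive full-stripe
    read-modify-write: re-read the whole data stripe, recompute all ECC
    strips, then rewrite every strip."""
    reads = n_data
    writes = n_data + x_parity
    return reads, writes

reads, writes = rmw_ios(16, 4)
print(f"16D+4P: 1 user write -> {reads} reads + {writes} writes")
```

Real arrays often optimize this down to reading only the old data and old parity, but the amplification still grows with stripe width.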
While log structures ameliorate this read-modify-write problem to some extent, rebuilding that 16D+4P system after a device failure will still require reading the entire contents of at least 16 of the surviving devices. As a result, most storage architects use narrower stripes with lower resiliency, accepting 33% overhead vs. the 25% overhead of a 16+4 stripe, sacrificing both resilience and efficiency to the downsides of Reed-Solomon coding.
Locally Decodable Codes Break the Trade-Offs
VAST Data has invented a new class of error-correcting codes that borrow from the principles mathematicians define as locally decodable codes to break the long-standing trade-offs between write-stripe resilience and the cost of data protection. The locally decodable aspect of our approach means that a VAST Cluster (the set of VAST Servers and VAST Enclosures that make up a single management domain and namespace) can reconstruct unreadable data from only a “local” fraction of the total number of data strips in a protection stripe. For our codes that fraction is 1/Xth of the data strips, where X is the number of protection strips in the set.
Higher Resiliency, Lower Rebuild Cost
Using locally decodable codes allows us to use wide stripes with very high levels of resiliency. The codes scale dynamically as the cluster grows, and both N (the number of data strips) and K (the number of protection strips) can scale independently. Using Markov calculations, we’ve determined that an example 150+4 encoded stripe has an MTTDL (Mean Time To Data Loss) of over 42 million years with just 2.66% overhead. Data is protected against four or more failures – far more than the failure domains traditional systems protect against – while the system can reconstruct data by reading just 1/4th of the remaining strips in an encoded write stripe.
While 150+4 is a typical example of the scale of our codes on a petabyte-sized system, these codes can provide greater levels of protection on smaller systems and can scale even wider on larger systems for higher levels of data protection – such as 500+8 – resulting in a mean time to data loss of over 1 quadrillion years.
How Locally Decodable Codes Work
Reed-Solomon codes generate P protection strips from a single set of data strips. Since each Reed-Solomon protection strip is generated from all of the protected data strips, regenerating a lost strip requires all the surviving data strips plus one or more protection strips.
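The global dependence is easiest to see with the simplest erasure code of all, single XOR parity (the P in RAID 5); Reed-Solomon generalizes this to multiple protection strips with Galois-field arithmetic, but the rebuild still needs every surviving strip. A minimal sketch:

```python
from functools import reduce

def xor_strips(strips):
    """XOR corresponding bytes across a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [bytes([i] * 4) for i in range(1, 8)]   # 7 data strips
parity = xor_strips(data)                       # 1 global parity strip

lost = 3                                        # strip 3 becomes unreadable
survivors = [s for i, s in enumerate(data) if i != lost] + [parity]
rebuilt = xor_strips(survivors)                 # must read all 7 survivors
assert rebuilt == data[lost]
```

Because the parity was computed over every data strip, every surviving strip has to be read back to cancel its contribution and recover the lost one.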
Our locally decodable codes, on the other hand, calculate each protection strip from slightly different sets of data strips, spanning more than just one stripe of data. This allows protection strips to “stand in” for groups of data strips in the rebuild calculations. A VAST system using P protection strips can reconstruct data by reading just 1/Pth of the data strips. For our typical nD+4P encoding, we need to read just 1/4th of the data during a reconstruction, where a Reed-Solomon-based system would need to read it all.
For our typical 150D+4P protection level, that means the system has to read about 1/4th of the surviving data strips – roughly 38 strips. The four protection strips contain enough information about the remaining ~111 data strips that the contents of the missing strip can be calculated without reading them. This allows us to break the trade-off between storage efficiency and rebuild complexity that systems using Reed-Solomon codes are forced into.
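VAST has not published its exact construction, but the “local” property can be illustrated with a toy locally repairable code: give each protection strip its own subset (“group”) of the data strips, so a single lost strip is rebuilt from only its group, roughly 1/Pth of the data. (Unlike VAST’s codes, this toy tolerates only one failure per group; it is a sketch of the read-locality idea, nothing more.)

```python
from functools import reduce

def xor_strips(strips):
    """XOR corresponding bytes across a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

# Toy sketch, NOT VAST's actual construction: 8 data strips split into
# P=2 local groups, each group protected by its own XOR parity strip.
data = [bytes([i] * 4) for i in range(1, 9)]
group_size = 4
groups = [data[0:4], data[4:8]]
parities = [xor_strips(g) for g in groups]

lost = 5                       # a strip in the second group fails
g = lost // group_size
survivors = [data[i] for i in range(g * group_size, (g + 1) * group_size)
             if i != lost]
rebuilt = xor_strips(survivors + [parities[g]])

assert rebuilt == data[lost]
print(f"rebuilt from {len(survivors) + 1} reads instead of {len(data)}")
```

The rebuild touches 4 strips (3 local survivors plus the group’s parity) rather than all 8, which is the 1/P read fraction the text describes, in miniature.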
Empowered by DASE
The math of locally decodable codes is only part of the story. Conventional storage platforms would struggle to assemble 150 data strips in memory and then write that data to 154 backend SSDs without bottlenecking on their SAS links or inter-node networks. Our Disaggregated Shared Everything (DASE) architecture (see previous blog post) allows our system to accumulate write data in 3D XPoint (Intel Optane) SSDs and destage it to flash SSDs in strips optimized to the SSDs’ internal erase blocks, minimizing wear at the same time. More on that next time.
Locally decodable codes and the DASE architecture allow us to break the traditional trade-off between efficiency, protection, and rebuild complexity. This gives you, the user, several benefits: