Breaking Data Reduction Trade-offs with Global Compression
Storage system architects have used data compression and data deduplication to reduce the storage footprint of data for decades. While these techniques have made all-flash arrays more affordable, the trade-offs implicit in them have limited their effectiveness. At VAST Data we’ve developed a new data reduction technology based on data similarity that breaks many of those trade-offs, reducing data further than compression and deduplication can.
While there are many compression methods and algorithms, general-purpose lossless compression methods all work in basically the same way: the compressor scans a set of data looking for repeating patterns, replaces those patterns with smaller symbols, and builds a dictionary so the decompressor can reverse the process later.
So if we start with the opening of Hamlet’s most famous speech:

“To be, or not to be, that is the question”

A compressed version might replace each occurrence of the repeating pattern “be” with the single symbol ~, storing the mapping ~ = “be” in the dictionary, for compressed text of:

“To ~, or not to ~, that is the question”
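The substitute-and-record idea can be sketched in a few lines of Python. This is a toy illustration, not a production codec (real compressors like DEFLATE build their dictionaries automatically and work on bit-level symbols), but it shows why the decompressor needs the dictionary:

```python
# Toy dictionary compression: replace a repeating pattern with a
# one-character symbol and keep the mapping so it can be reversed.
text = "To be, or not to be, that is the question"

dictionary = {"~": "be"}                # symbol -> original pattern
compressed = text.replace("be", "~")    # "To ~, or not to ~, ..."

def decompress(data: str, dictionary: dict) -> str:
    # Reverse every substitution recorded in the dictionary.
    for symbol, pattern in dictionary.items():
        data = data.replace(symbol, pattern)
    return data

assert decompress(compressed, dictionary) == text
assert len(compressed) < len(text)
```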
Limits on Compression
For various reasons, including limits on dictionary size, compression reduces small repeating patterns in data, like the null padding in fixed-length-record datasets, but not larger repeating patterns.
The other problem with compression is that decompression is a sequential process: any block of compressed data must be decompressed starting from its first byte. Since storage systems never know which data may be accessed randomly, and since reading 5 MB of data to satisfy an 8 KB read from the middle of a file would be a bad idea, storage vendors only compress data in smaller blocks, limiting compression’s effectiveness.
Since most storage systems use block-oriented metadata, they can’t take full advantage of compression: once they’ve compressed 64 KB of data down to 29,319 bytes, they still have to store those 29,319 bytes in eight 4 KB blocks, consuming 32,768 bytes. Our metadata operates at the byte level, so we store 29,319 bytes of data in 29,319 bytes of flash and start the next block of data at byte 29,320.
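The arithmetic behind that padding loss is easy to check:

```python
BLOCK = 4096                 # 4 KB flash block
compressed_size = 29_319     # 64 KB of data after compression

# Block-aligned metadata must round up to whole 4 KB blocks.
blocks_needed = -(-compressed_size // BLOCK)   # ceiling division
block_aligned = blocks_needed * BLOCK
padding = block_aligned - compressed_size

assert blocks_needed == 8
assert block_aligned == 32_768   # bytes actually consumed on flash
assert padding == 3_449          # ~10% of the space is pure padding

# Byte-granular metadata consumes exactly compressed_size bytes,
# and the next chunk can begin at the very next byte offset.
```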
Where compression reduces small repeating patterns in your data to even smaller symbols, data deduplication identifies repeating data in larger blocks and stores only one copy.
As data is written to the system, it uses hash functions to identify each data block and determine whether that block’s contents have already been stored. If they have, it replaces the new data with a pointer to the old data. Deduplication works well on datasets with a lot of duplicates to eliminate, like VDI installations with hundreds of copies of Windows and the patches to those systems Microsoft sends on Tuesdays.
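A minimal sketch of that write path, using SHA-1 as the content hash (the in-memory `dict` stands in for the system’s hash index):

```python
import hashlib

store = {}      # hash -> the single stored copy of a block
pointers = []   # the logical "file": a list of block references

def write_block(block: bytes) -> None:
    # A collision-resistant hash identifies the block's contents.
    digest = hashlib.sha1(block).hexdigest()
    if digest not in store:      # first time this content is seen
        store[digest] = block
    pointers.append(digest)      # duplicates become mere pointers

# Three writes, two unique blocks: only two copies are stored.
for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096]:
    write_block(block)

assert len(store) == 2 and len(pointers) == 3
```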
Data Deduplication Trade-offs
Since deduplication only reduces data that’s an exact match at the block level, smaller block sizes result in greater data reduction; even the smallest difference between two data blocks forces the system to store a whole new block.
While small block sizes boost deduplication efficiency, they increase the amount of metadata the storage system has to manage. A storage system has to look up hashes and decide what to do with a new data block quickly, so it has to keep that metadata in memory.
Storage systems, therefore, have to trade off the efficiency of small deduplication blocks against the amount of controller memory that metadata consumes and the capacity they must support. Some systems use block sizes as large as 128 KB to minimize metadata handling.
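A back-of-the-envelope calculation shows how fast that index memory grows. The 32-byte entry size here is a hypothetical figure (hash plus pointer), not any vendor’s actual number:

```python
def dedupe_metadata_bytes(capacity_bytes: int,
                          block_size: int,
                          entry_bytes: int = 32) -> int:
    # Assume one in-memory index entry (hash + pointer, ~32 bytes,
    # a hypothetical figure) per stored block.
    return (capacity_bytes // block_size) * entry_bytes

PiB = 2 ** 50
small = dedupe_metadata_bytes(PiB, 4 * 1024)     # 4 KB blocks
large = dedupe_metadata_bytes(PiB, 128 * 1024)   # 128 KB blocks

# 4 KB blocks need 32x the index memory of 128 KB blocks:
# roughly 8 TiB vs. 256 GiB of metadata per PiB of capacity.
assert small == 32 * large
```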
Breaking data reduction trade-offs with similarity
While compression and deduplication have been useful, we at VAST Data weren’t satisfied with traditional data reduction techniques. Instead of deduplicating data blocks and then compressing each one locally, we’ve found a way to apply compression globally by compressing similar blocks of data together.
How Similarity Compression Works
Similarity compression, like data deduplication, starts by breaking data into blocks and using a hash function to characterize those blocks. The first difference is that similarity compression uses a very different hash function than deduplication.
Deduplication systems use hashes, like SHA-1, that are designed to be collision-resistant: small changes in a block’s data generate large changes in its hash value. Our hash functions, on the other hand, are designed to generate the same hash from similar blocks of data.
It’s a bit of an oversimplification, but you can think of similarity hashes as dictionary suggestions. If two blocks generate similar hashes, they’re likely to generate very similar dictionaries when compressed with the same compression engine.
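The principle can be illustrated with a toy sketch. This is emphatically not VAST’s actual hash function; it just shows why content-derived features make similar blocks look alike while a collision-resistant hash would not:

```python
import hashlib

def sketch(block: bytes, window: int = 8) -> set:
    # Toy similarity sketch: digest every overlapping 8-byte window.
    # Blocks sharing most of their content share most windows, so
    # their sketches mostly overlap.
    return {hashlib.md5(block[i:i + window]).digest()
            for i in range(len(block) - window + 1)}

def similarity(x: set, y: set) -> float:
    return len(x & y) / len(x | y)   # Jaccard similarity

a = b"the quick brown fox jumps over the lazy dog " * 20
b = a.replace(b"lazy", b"calm", 1)   # one small edit
c = bytes(reversed(a))               # unrelated content

assert similarity(sketch(a), sketch(b)) > 0.7   # similar blocks collide
assert similarity(sketch(a), sketch(c)) < 0.1   # dissimilar blocks don't
```

A real similarity hash condenses features like these into a single value so that near-identical blocks land in the same bucket with one lookup.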
We compress the first block that generates a specific hash value and declare that block the reference block for that hash. When another block generates the same hash, we run the new block and the reference block through the compression engine to create a difference, or delta, block. To oversimplify just a little once again: the dictionary is stored as part of the compressed reference block, while the delta block is a very small collection of reduced symbols without the overhead of a dictionary.
Advantages of Similarity Reduction
Similarity compression reduces duplicate blocks to metadata, like deduplication, and since we use one of the latest and greatest compression algorithms, it also reduces the small repetitions that compression is good at. Unlike deduplication and compression, it additionally gets the opportunity to reduce mid-size repetitions between blocks, so a change to the quantity on hand in your inventory between snapshots gets stored as a few bytes of delta block instead of 27 KB of compressed dedupe block.
When your application writes to a Universal Storage system’s NFS mount point or an S3 bucket, the compute node receiving the request writes it to multiple 3D XPoint SSDs in one or more NVMe-oF enclosures and acknowledges the write. The data is reduced and erasure-coded when it is asynchronously destaged from 3D XPoint to QLC flash SSDs. That means data reduction has no impact on write latency, because it happens after the data is safe in non-volatile 3D XPoint.
That 3D XPoint is also key to breaking the trade-off between granularity and memory requirements that deduplication systems have to make. All the metadata created by similarity compression is stored in 3D XPoint in one or more NVMe-oF enclosures, and any compute node can access or update any of the metadata directly at any time. Since our system was designed from inception to store reduced data, all that metadata points to bytes, not blocks, in flash, so we avoid the packing inefficiencies that plague so many storage systems.
As with any data reduction method, the degree to which similarity compression can reduce a given dataset depends on the data; some data is simply more reducible than other data. To see whether similarity reduction really could reduce data more than standard deduplication and compression, we configured a popular backup application to deduplicate, or deduplicate and compress, the backups of a set of virtual machines, using a VAST Universal Storage system as the backup target.
The VAST system reduced the deduplicated-and-compressed backup set by an additional 6:1; when we turned the backup compression off, the deduplicated set reduced 22:1. Your mileage will vary, but we took this as a pretty good proof point.
A few additional workloads from our early customers:

- Deep learning bitmap files
- Cryo-electron microscopy images
VAST’s similarity reduction can reduce data more than the combination of data deduplication and compression used by other all-flash storage systems, bringing the cost per GB of a VAST Universal Storage system down to that of disk-based object stores.
(Figure: data reduction results when using a VAST storage system as a backup target.)