Oct 5, 2023

Understanding Commvault Dedupe and How WORM Affects Capacity on VAST

Authored by

Rob Girard, Principal Technical Marketing Engineer

This blog post was written in 2023 and reflects product capabilities at that time. Some information may be outdated.

In recognition of Cybersecurity Awareness month, VAST is focusing a series of blog posts on helping organizations keep data secure and protect their businesses. Here Rob explains how VAST’s advanced data reduction optimizes capacity and enables customers to take full advantage of Commvault’s advanced ransomware protection measures.

Today I’m excited to share some key characteristics about VAST’s data reduction working seamlessly with Commvault to deliver capacity relief for ransomware protection efforts. It’s worth noting that traditionally, most backup target vendors have recommended users disable the data movers’ software deduplication and compression for optimal results with the target capacity. Not so with VAST; we encourage customers to leave their SW dedupe and compression on and let the VAST target reduce their data even further. A best of both worlds approach, without compromise.

When Commvault deduplication and compression are used (recommended), there's a perception that dedupe on the storage target is no longer important or relevant. In the context of a single backup job iteration, this might be true, and storage technologies appear to be on a level playing field. In reality, however, there are many scenarios where strong data reduction at the storage layer complements Commvault's. Even if there were no further reductions to be had on the first landing (plus subsequent incremental & full backups), storage-layer dedupe acts as a capacity insurance policy: if a whole new set of backup data needed to be written, your backend storage would be able to handle it. Without advanced deduplication in the storage target, customers are forced to hold large amounts of unused capacity in reserve to handle these scenarios safely and avoid running out of space.

That said, with VAST, not only do we provide that capacity safeguard, our comprehensive set of data reduction technologies will further reduce the capacity needed to store Commvault compressed & deduplicated backup data by an average of approximately 35 percent. These technologies in VAST include multiple types of compression, byte-granular VAST Similarity reduction, and dedupe that uses "adaptive chunking" instead of fixed-size offsets. Adaptive chunking aligns block boundaries to the data itself, which increases the probability that exact block matches can be found and duplicates eliminated.
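To make the difference between fixed offsets and content-aligned boundaries concrete, here is a minimal sketch of generic content-defined chunking. It is not VAST's implementation: the window size, the CRC32-based boundary test, and the 64-byte block size are all assumptions chosen to keep the example small.

```python
import random
import zlib

WINDOW = 16       # assumed: bytes of trailing context that decide a boundary
MASK = 0x3F       # assumed: cut when the low 6 bits of the hash are zero
FIXED_SIZE = 64   # assumed: fixed block size used for comparison


def fixed_chunks(data: bytes, size: int = FIXED_SIZE) -> list[bytes]:
    """Cut data at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Cut data wherever the hash of the trailing WINDOW bytes matches a pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunks that appear in both chunk lists."""
    return len(set(a) & set(b))


rng = random.Random(2023)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"X" + original  # one inserted byte shifts every later offset

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
cdc_shared = shared(content_defined_chunks(original), content_defined_chunks(shifted))
print(f"fixed-offset chunks shared:    {fixed_shared}")
print(f"content-defined chunks shared: {cdc_shared}")
```

Because the boundaries depend only on the bytes around them, a one-byte insertion disturbs only the first chunk, while every fixed-offset block after the insertion point changes.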

Commvault has already taken all the opportunity for space savings via compression, so it's unlikely VAST’s compression would reduce the data any further. The same is true for Pure's FlashBlade, whose data reduction relies solely on compression. But can VAST reduce this Commvault deduplicated data any further...? Absolutely!

Commvault's dedupe is a fixed-block dedupe that stores hashes in a proprietary Commvault "DDB" (DeDupe Database), and does so at block size granularities of 32KB, 128KB (default for disk targets), 512KB (default for S3 targets), and 1024KB. A smaller, more granular block size enables more matches to be found across blocks because the probability of finding identical data in smaller blocks (e.g. 32KB) is higher than finding larger blocks (e.g. 1024KB) that are completely identical. A single bit difference between blocks means that they are not identical, and both blocks need to be stored instead of storing a common block and making a reference pointer back to it.
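The effect of block size is easy to see in a toy model. The sketch below is an illustration only: a Python set stands in for the DDB, SHA-256 is an assumed hash (the hash Commvault actually uses isn't covered here), and the data is 1024 KiB of random bytes with a single bit flipped before the second backup.

```python
import hashlib
import random


def new_bytes_written(index: set[bytes], data: bytes, block_size: int) -> int:
    """Dedupe data against index at a fixed block size; return bytes that had to be stored."""
    written = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in index:   # never seen: store the block and remember its hash
            index.add(digest)
            written += len(block)
        # otherwise a reference pointer to the existing block is enough
    return written


KIB = 1024
rng = random.Random(7)
first_backup = rng.randbytes(1024 * KIB)

# Second backup: identical except for one flipped bit in the middle.
changed = bytearray(first_backup)
changed[len(changed) // 2] ^= 0x01
second_backup = bytes(changed)

results = {}
for block_kib in (32, 128, 512, 1024):
    index: set[bytes] = set()
    new_bytes_written(index, first_backup, block_kib * KIB)
    results[block_kib] = new_bytes_written(index, second_backup, block_kib * KIB)
    print(f"{block_kib:>4} KiB blocks: {results[block_kib] // KIB} KiB rewritten")
```

In this model the single flipped bit costs exactly one block at every granularity, so the amount rewritten grows in step with the block size.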

Granular dedupe is compute-intensive as more and more hashes of each block need to be generated and stored to determine whether future blocks have already been seen. A pointer then references the previous block instead of storing additional copies of identical blocks. Larger block sizes = faster performance, but the trade off is that more data needs to be written to storage when fewer exact matches are found. Below we show screenshots illustrating the reported and actual storage consumption during a Commvault backup.

VAST’s Capacity UI gives us granular per-path statistics. In the example above, Commvault wrote 1.4 TiB of data into a media library stored on VAST. Note that additional data reduction savings can be seen, even though Commvault had already reduced the data with compression and dedupe.

The Commvault GUI shows us capacity stats after a second FULL backup was taken while the same DDB is used. We see a total application size of 2.88 TiB (i.e. 2 FULL backups) of protected data, reduced down to the 1.43 TiB of data we can see is written into a Commvault media library on VAST. When the DDB is not sealed, every subsequent FULL backup of the same data results in no data having to be written to storage, since it already exists from previous backups.

Without sealing the DDB, Commvault is deduping all of the data from the previous full backup, as expected. The corresponding VAST view after the second full backup shows there was no additional data written to VAST (i.e. Logical capacity has not increased).

For Pure's FlashBlade, whose data reduction techniques include compression but not dedupe, there are no further data reduction opportunities to be had after Commvault’s reduction. VAST’s advanced dedupe techniques, however, find additional savings by further reducing the amount of physical storage capacity required to store the logical blocks Commvault writes. In tests and across customers, we see the average additional data reduction rate (DRR) for Commvault data on VAST to be around 1.6:1. In other words, VAST reduced the data an additional 37.5% after Commvault’s data reduction. Put another way, 1.6 TB of Commvault backend data (post-compression and dedupe, which is logical data from a VAST perspective) consumes 1 TB of usable capacity on VAST, but would consume the full 1.6 TB of usable capacity on FlashBlade.
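For anyone who wants to check the conversion between a reduction ratio and a percentage, the arithmetic is short. The function names here are just for illustration.

```python
def savings_from_ratio(drr: float) -> float:
    """Fraction of logical capacity saved at a data reduction ratio of drr:1."""
    return 1 - 1 / drr


def usable_needed(logical_tb: float, drr: float) -> float:
    """Usable capacity consumed when logical_tb is reduced at drr:1."""
    return logical_tb / drr


print(f"savings at 1.6:1 = {savings_from_ratio(1.6):.1%}")
print(f"usable TB for 1.6 TB logical at 1.6:1 = {usable_needed(1.6, 1.6)}")
print(f"usable TB for 1.6 TB logical at 1:1   = {usable_needed(1.6, 1.0)}")
```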

But wait! There's more!

In the age of ransomware, there are constant threats to backed-up data. A popular method to protect against this is through strict policies that enforce immutability of data so that it can't be deleted or tampered with until after some defined retention period has expired. In Commvault, this is done with WORM, which relies on S3 Object Lock when the storage libraries use S3 buckets instead of more traditional NFS, SMB or local file systems.

This presents a new challenge... how do you lock an object to prevent deletion, yet still delete the data eventually, once it's older than the desired retention period and considered obsolete? If backups were never pruned, they'd grow indefinitely, and include all the versions of data at all points in time. The older data gets, the less relevant it is to find an exact version, and usually any copy of the data within a given month, or even year, is sufficient.

Add dedupe to this equation, and the challenge is greater. Not all data from a backup can be pruned when the retention period is up if there are newer backups that require those underlying blocks. Blocks that are needed longer than the original retention period must stay protected from being deleted or modified. And once there are no more references to those older blocks and they can be deleted, locks could interfere with deleting them and reclaiming that space.

Commvault has come up with a solution to this problem. Periodically, it intentionally doesn't dedupe and starts storing a whole new set of unique blocks. Some data lives on longer than the intended retention period, but at least it can safely be deleted at some known point in time in the future. There are some technical nuances (such as micro pruning vs. macro pruning) related to cleanup of data older than its required retention period, but I don’t want to get into those details in this article. For the context of this article and specific WORM-locking, macro pruning is the method of cleanup.

Unfortunately, the trade-off here is a requirement of AT LEAST DOUBLE the storage capacity needed for a base backup. Great news: VAST dedupes the second set against the first set, and very little extra net physical capacity is required. The second copy is nearly free from a capacity perspective, even though there are two or more logical copies. If your storage doesn't have dedupe, such as FlashBlade, you're stuck having to use double the physical capacity (or more!), and that gets expensive!

The method in which this is achieved is referred to as "sealing the DDB". This action means Commvault stops using the existing database of hashes to find duplicate blocks in future data and starts a new database of hashes to be compiled and used instead. At this point, all incoming data is seen as new and unique, and subsequently all of these unique blocks are written to storage as Commvault continues reading and backing up data. In other words, a completely deduped full backup is written to disk, and from a VAST perspective, this is a second logical copy containing content nearly identical to the content prior to sealing the Commvault DDB (plus/minus incremental changes). Future backups where the same data blocks have been previously written don't need to be written again, and reference pointers are used instead, pointing to those unique blocks that already exist in the backups.
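Here is a toy model of that behavior, assuming a plain Python set as the "DDB", SHA-256 as the block hash, and a small random dataset. It only illustrates the bookkeeping and is not how Commvault is implemented.

```python
import hashlib
import random


class MediaLibrary:
    """Toy model: a hash index (the "DDB") in front of an append-only store."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.ddb: set[bytes] = set()   # hashes known to the active DDB
        self.bytes_on_disk = 0         # what the storage target sees as logical data

    def backup(self, data: bytes) -> int:
        """Run a full backup; return how many bytes actually reached storage."""
        written = 0
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in self.ddb:
                self.ddb.add(digest)
                written += len(block)
        self.bytes_on_disk += written
        return written

    def seal(self) -> None:
        """Stop consulting the old hashes; old blocks stay on disk until pruned."""
        self.ddb = set()


rng = random.Random(42)
dataset = rng.randbytes(256 * 1024)
library = MediaLibrary(block_size=4096)

first = library.backup(dataset)    # everything is new
second = library.backup(dataset)   # same DDB: nothing new to write
library.seal()
third = library.backup(dataset)    # new DDB: everything looks new again
print(first, second, third, library.bytes_on_disk)
```

The second full backup writes nothing, and the first full backup after sealing writes the whole dataset again, which is the second logical copy the storage target has to absorb.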

As we get into how Commvault’s WORM protection has been implemented, the Commvault DDB gets sealed in a way that everything inside is lumped into the same retention period (usually 2x the retention configured in Commvault). That leaves sufficient overlap for newer backups to be lumped into another, later retention period so the old ones can be bulk-deleted and their space freed up. Deleting this older data is only possible after the object versions’ `Retain Until Date` expires. But if it expired too early, some of the data in there that future backups need would be at risk.

Real-World Application

I had a question from one of our SEs about just how much extra capacity would be required. Their ask was "after sealing the DDB once per week for 90 days, how much space is used?" This isn't really the right question to be asking and is a bit out of context, so I wanted to take this opportunity to add more context and clear up any misunderstandings. The DDB wouldn't be sealed every week; it would be sealed after 90 days, and then a new one started. And the object lock on the first stuff that lands in there would be set for 180 days out.

Let's dive into what this looks like:

In the 90 day example... Today is October 6, and 90 days from now is January 4. The commitment to the business is to “have 90 days of backups". Fast forward to January 30.... and the backup admin is asked to restore something from "60 days ago" (Dec 1). It's very important that the data backed up at that time and any previous data required to make a complete point in time recovery is protected.

Suppose that on January 6th, a compromised privileged account attempts to delete all previous backups. This is the kind of incident that needs to be protected against. Object Lock prevents object versions from being deleted prior to the date specified in `x-amz-object-lock-retain-until-date` that is applied to every locked object. And with the object lock set in compliance mode (`x-amz-object-lock-mode=COMPLIANCE`, as is the case with Commvault’s WORM lock), there are no administrative overrides that can bypass controls and force a deletion, even for fully-privileged users.

So by doubling the Object Lock retention, basically all objects backed up by Commvault from October 6 -> January 4 all have the same "retain-until" object lock flag set to: April 3, 2024. That way, on March 14th, restoring data from "90 days ago" would ensure that December 15th is still around. Technically, December 14th and prior all could have been deleted, BUT, with the DDB (dedupe), some of the data backed up for December 15th already existed, and references to previous objects were created instead of writing a duplicate copy of the data (this is what dedupe at the Commvault layer is all about).
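As a rough illustration of what that lock looks like on the wire, the sketch below builds the two S3 Object Lock request headers for an object written during this cycle. The helper name and the midnight-UTC time of day are assumptions; the exact request Commvault sends isn't shown in this post.

```python
from datetime import datetime, timezone


def object_lock_headers(retain_until: datetime) -> dict[str, str]:
    """Headers for an S3 PutObject request that locks the new object version."""
    # S3 expects an ISO 8601 timestamp; midnight UTC is an assumption here.
    stamp = retain_until.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "x-amz-object-lock-mode": "COMPLIANCE",
        "x-amz-object-lock-retain-until-date": stamp,
    }


headers = object_lock_headers(datetime(2024, 4, 3, tzinfo=timezone.utc))
for name, value in headers.items():
    print(f"{name}: {value}")
```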

Now you can imagine that there would always be older data references going back a long time, and backed up data would grow and grow forever. At the time of backing it up, there's no crystal ball for Commvault to know whether or not future data being backed up will contain duplicates. Therefore, the lock applied ("retain-until-date") has to be appropriately long enough, but if you flagged everything as "retain forever", storage growth would be out of control.

The solution then is to "seal the DDB" periodically, and start a whole new set of objects. And eventually, be confident that the previous objects backed up in the previous cycle before the prior DDB was sealed would no longer be needed for restores.

After sealing the DDB, Commvault now shows a doubling of data written after another FULL backup.

This screenshot gives us another view of the same thing from the Commvault perspective. The total application size (aka Front-End Terabytes) is 4.42 TB. After compression and dedupe, Commvault stored 2.83 TB of data (aka Backend Terabytes). This represents three FULL backups, two of which shared a DDB, and then one more after that DDB was sealed and a new one started.

When looking at the VAST UI, we see the logical capacity matches Commvault’s reported “Size on Disk”: 2.835 TiB. More importantly, the usable capacity only increased by 92 GiB (a tiny fraction of the additional 1.43 TiB of compressed and deduped data that was just written).

So in this 90 day example, we seal the DDB on January 4th and start a whole new set. This new set needs the "retain-until" to all be co-termed to a future date where nothing in the set would be required when it comes time to clean up. So the "retain-until" is set TWO TIMES THE RETENTION PERIOD (180 days)... which would be July 2nd, 2024. On April 3rd, that DDB would be sealed and a new one started, and all the objects written that relate to the new cycle include the same "retain-until-date" 180 days in the future from that time. In other words, the “retain-until-date” is not a rolling date relative to the date each object is created; all objects to be created in the subsequent 90 days all contain the identical “retain-until-date”.
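The dates in this example can be reproduced with a few lines of date arithmetic. This sketch assumes the simplified model described here: each DDB cycle lasts one retention period, and every object in a cycle shares a retain-until date of cycle start plus two retention periods. The function name is made up for illustration.

```python
from collections.abc import Iterator
from datetime import date, timedelta

RETENTION_DAYS = 90


def cycle_schedule(
    first_backup: date, cycles: int, retention_days: int = RETENTION_DAYS
) -> Iterator[tuple[date, date, date]]:
    """Yield (cycle start, seal date, shared retain-until date) per DDB cycle."""
    start = first_backup
    for _ in range(cycles):
        seal = start + timedelta(days=retention_days)
        retain_until = start + timedelta(days=2 * retention_days)
        yield start, seal, retain_until
        start = seal  # the next cycle begins when the previous DDB is sealed


schedule = list(cycle_schedule(date(2023, 10, 6), cycles=2))
for start, seal, retain_until in schedule:
    print(f"cycle from {start}: seal on {seal}, objects locked until {retain_until}")
```

Note that the retain-until date is computed once per cycle from the cycle start, not per object from its creation date.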

Looking at total stored data on January 5th (after a new FULL backup of everything is taken, using a new DDB), the data stored is 2X. And at any point in time looking forward there will always be at least 2X (or more). On FlashBlade, that is real, additional physical space consumed. On VAST, that second logical copy is a nearly perfect overlap of the base backups, so it effectively consumes the same capacity that a single copy consumed (plus unique incremental deltas).

If you’ve made it this far, you likely have a greater appreciation for the win-win partnership that VAST and Commvault provide our customers. Together, we are helping customers protect data against ransomware threats without forcing them to double (or triple!) their required capacity because they’ve chosen to make their backups immutable. Just another way that VAST and Commvault are breaking the tradeoffs of legacy data protection.

Until next time,

-Rob
