Introduction to Data Deduplication


Long gone are the days when software installers came on 3.5 inch disks and CDs could be considered a corporate backup medium. Storage has been firmly in the realm of commodities for years now, and as a result the amount of data within businesses is somewhere between staggering and “are you serious?” There are many, many problems associated with having massive amounts of data to steward. Two of the most obvious and painful are 1) how to make room for the continued growth of data, and 2) how to back up all of the data.

Indeed, how does one handle a terabyte or more of growth per year? It’s not trivial and many corporations are facing vastly larger growth rates than a mere 1 terabyte. Even if you can keep up with the data growth, you haven’t even begun the hard part yet. Backing up your data – ah! – now that is the true gauntlet to be won! Vast landscapes of storage are not easily backed up in under 24 hours and fitting your archives into a business-day-sized backup window can leave you with just minutes to spare before the next backup is kicked off.

Enter the miracle of deduplication!

In essence, there are two broad types of deduplication: single instance storage (SIS) and byte or block-level deduplication. Typically, the more granular a deduplication scheme, the greater the savings in storage space can be.

Single instance storage has existed for many years. If two files exist that are absolutely identical, one is discarded and replaced by a link to the remaining file. This type of system has been implemented most successfully in databases and backup systems. For example, Microsoft Exchange for many years used a SIS scheme to manage attachments. If one person emailed an attachment to ten people, the attachment was only stored once in the database. As another example, if a backup program archives the data in an office of fifty Windows PCs, there will likely be a supernumerary collection of common system utilities and DLL files. Many backup programs will simply store one of the duplicate files and then replace any identical files that they find with a link to the single instance of it.
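
As a rough illustration, here is a minimal Python sketch of single instance storage. It assumes an in-memory catalog of file names and contents rather than a real filesystem, and it uses SHA-256 to decide whether two files are identical. The names single_instance_store, blobs and links are invented for this example; this is not how Exchange or any particular backup product implements SIS.

```python
import hashlib


def single_instance_store(files):
    """Store each distinct file body once; map every name to its digest.

    `files` is a dict of {name: bytes}. Returns (blobs, links), where
    `blobs` maps a SHA-256 digest to the one stored copy and `links`
    maps each file name to the digest of its content.
    """
    blobs = {}
    links = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        # Keep only the first copy of any given content.
        blobs.setdefault(digest, content)
        links[name] = digest
    return blobs, links


# Hypothetical office backup: the same DLL turns up on three PCs.
files = {
    "pc01/system32/common.dll": b"\x4d\x5a" + b"\x00" * 1000,
    "pc02/system32/common.dll": b"\x4d\x5a" + b"\x00" * 1000,
    "pc03/system32/common.dll": b"\x4d\x5a" + b"\x00" * 1000,
    "pc01/docs/report.txt": b"Q3 numbers",
}
blobs, links = single_instance_store(files)
print(len(files), "files,", len(blobs), "stored copies")
```

Restoring a file is just a lookup: blobs[links[name]] returns the single stored copy, no matter how many names point at it.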

That sounds great! Except, there are some limitations with single instance storage that you need to be aware of. If someone makes the tiniest change to a file then a whole new file is saved. Adding one slide to a 20-slide presentation or even just modifying one pixel in a large graphics file will completely fluster SIS and cause your storage space to be eaten up.

Byte or block-level deduplication disregards the concept of files and instead looks deeper at the stream of data that is written to a storage system, regardless of its association with a file. As it looks at the storage system, it splits the data into segments or “chunks” (a process referred to as “chunking”) and performs a hash calculation on each one. That hash value is then stored in a hash table where new chunk hashes are constantly added. As the deduplication system continues to analyze data, it will start to find chunks that have identical hash values. In those cases the deduplication system assumes that the data is identical, discards the duplicate chunk and puts in its place a pointer to the existing entry in the hash table. As a general rule, the larger your hash table is (that is, the more data it covers), the greater your space savings will be. Thus you want to make sure that you’re comparing chunks from many different computers, rather than just chunking data on one computer at a time.
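
The chunk, hash and pointer cycle described above can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: fixed-size 4096-byte chunks, SHA-256 as the hash, and a plain dictionary as the hash table. Real products differ on all three, and the function names here are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


def dedup_stream(data, store, chunk_size=CHUNK_SIZE):
    """Split `data` into fixed-size chunks and add new ones to `store`.

    `store` is the hash table: {digest: chunk bytes}. The return value is
    the "recipe" for the stream, a list of digests that point into it.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            store[digest] = chunk  # first time this chunk has been seen
        recipe.append(digest)      # duplicates cost only a pointer
    return recipe


def restore_stream(recipe, store):
    """Rebuild the original bytes by following the pointers."""
    return b"".join(store[digest] for digest in recipe)


# Deterministic, non-repeating sample data: 40,960 bytes, ten chunks.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1280))
# A second "file" that differs from the first by a single byte.
edited = bytes([original[0] ^ 0xFF]) + original[1:]

store = {}
recipe_a = dedup_stream(original, store)
recipe_b = dedup_stream(edited, store)
stored = sum(len(chunk) for chunk in store.values())
print(len(original) + len(edited), "bytes written,", stored, "bytes stored")
```

In this sample the two streams share nine of their ten chunks, so only one extra chunk is stored for the edited copy. Whole-file SIS would have kept both copies in full.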

You might be wondering why I use the ambiguous phrase “byte or block level deduplication.” The finer points of how data is handled will have to be saved for a later article. However, let it suffice to know that, within reason, greater savings in space will be seen as the “chunks” get smaller. Since blocks are larger than bytes, a byte-level deduplication system will generally have greater savings in space. There are variables at play, of course, but that’s a good standard measure to go by.
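
To see why smaller chunks tend to save more, consider two versions of a 64 KiB file that differ by one byte. The Python sketch below assumes fixed-size chunks and SHA-256, and the helper name unique_bytes is made up for this example. It counts how many bytes survive deduplication at three chunk sizes.

```python
import hashlib


def unique_bytes(streams, chunk_size):
    """Return the bytes left after fixed-size chunk deduplication."""
    seen = {}
    for data in streams:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            seen.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(seen.values())


# Deterministic filler with no internal repetition (64 KiB).
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(2048))
# A second version with a single byte flipped in the middle.
middle = len(base) // 2
changed = base[:middle] + bytes([base[middle] ^ 0xFF]) + base[middle + 1:]

for size in (16384, 4096, 512):
    print(size, unique_bytes([base, changed], size))
```

The one-byte change costs one whole extra chunk each time, so the stored total falls from 81,920 bytes at 16 KiB chunks to 66,048 bytes at 512-byte chunks. The price is a hash table with many more entries to hold and search, which is one of the variables at play.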

What is it Good For?

It won’t take long for you to discover that most vendors of deduplication products focus on the backup market. Deduplication in backups is a hot topic. It makes sense; data that is backed up is important, but you’re increasing your business’s storage needs two, three or many more times if you use simple file copy methods to create backups. File-level deduplication has been around for years in backup systems and most of us should be familiar with the savings that differential and incremental backups give us. However, that’s not good enough in many instances. Adding deduplication to a backup system can result in tremendous returns on your storage budget and allow you to more easily meet the retention requirements of certain government and industry regulations.

The other way of using deduplication, which I believe is the most obvious in spite of vendors’ preferences, is to deduplicate live file servers. The deduplication can be part of an operating system, a third party software package or a fancy storage appliance. Once again we have to be careful of how the products are marketed. Windows Storage Server 2008, for example, touts itself as being able to perform data deduplication. However, its specific variety is file deduplication, otherwise known as Single Instance Storage (SIS), and not block or byte level.

Lower-level deduplication appliances exist that can perform block or byte level deduplication on their contents. They will then present their storage space as a NAS device (most commonly via NFS or SMB) or as a SAN (accessed via iSCSI or Fibre Channel among other possibilities) or both if the appliance has gone down the road of unified storage.

While block level deduplication on a filer is helpful, know that it might not help your backup storage needs. Remember that the deduplicated filer will present all of its files as being truly there. Let’s say that you have four terabytes of files that were deduplicated down to require only two terabytes of physical storage. If you then point a backup tool at the storage array it may pull all files off and end up with 4TB of files to back up. It should be noted that some deduplication storage appliances can emulate a tape library to allow for integration with already existing backup and recovery tools. Make sure you evaluate the vendor’s product to see how it handles backups and if deduplication-aware backup tools can be used. You may simply need to get a backup suite that performs its own deduplication methods in addition to the filer’s deduplication system.

Reduction Percentages and High Hopes

Percentages in overall storage space savings are the most touted snippets of information that a vendor will want you to focus on. You may hear vendors talk about 95% decreases in the size of your backups, but before you get your hopes up you must read the fine print carefully. Very carefully. Many factors are at play in storage deduplication rates. Surprisingly, the algorithm used to perform deduplication might be the most insignificant part of the ratio.

The first variable that exists revolves around the type of files that will be deduplicated. Don’t dedupe compressed data. That includes things like ZIP and cabinet files. Don’t forget that many media file formats like JPG, MP3 and AVI are a stream of compressed bytes and thus by nature don’t have a lot of repeated symbols for a deduplication algorithm to chunk and hash. In essence, they’ve already had that procedure performed on them.

Of equal importance is how often the data to be deduplicated changes and how much archival data you keep around. If you’re simply using a product that deduplicates existing data on a file server then you don’t have to worry about archival. If you’re using deduplication as part of your backup system, archival data plays a big part in the equation. Does data change frequently? Are those changes drastic or just a few blocks within each file? How far back in time will you need to retain data? How often are backups performed and how many are full versus incremental and/or differential?

You can find deduplication calculators on the internet, but those don’t tell the whole story. In fact, they can’t tell the whole story. As you can see, the percentage of space that you will save by deduplicating data depends on many factors. It’s theoretically possible to have a 95% or more reduction in storage needs, or you might only reduce existing data by 2%. It’s highly likely that your rates will be somewhere in the middle, but don’t assume that. Ascertaining your precise reduction rates will be left as an exercise for the reader and his or her preferred vendors. However, just know that you might be able to cut your space needs in half, or you might spend a lot of money and time to shave off just a few megabytes.
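
If you want a feel for how much the data itself drives the number, a small Python sketch like the one below can be pointed at sample data. It assumes fixed-size 4096-byte chunks and SHA-256. The function name reduction_percent and both samples are invented for illustration, so the results say nothing about any vendor’s product or about your own data.

```python
import hashlib


def reduction_percent(data, chunk_size=4096):
    """Percentage of bytes removed by fixed-size chunk deduplication."""
    if not data:
        return 0.0
    unique = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return 100.0 * (1 - sum(unique.values()) / len(data))


# Hypothetical sample 1: twenty identical nightly full backups.
one_backup = b"".join(
    hashlib.sha256(str(i).encode()).digest() for i in range(1280)
)
nightly_fulls = one_backup * 20

# Hypothetical sample 2: high-entropy bytes, standing in for compressed media.
high_entropy = b"".join(
    hashlib.sha256(b"media" + str(i).encode()).digest() for i in range(25600)
)

print(round(reduction_percent(nightly_fulls), 1))
print(round(reduction_percent(high_entropy), 1))
```

Both samples are the same size, yet the first reduces by 95% and the second by nothing at all. The algorithm is identical in both runs; only the data changed.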

Timing is Everything

When talking about deduplication in backup systems, where and when the data is deduplicated is a source of vendor contention. The details of timing in deduplication can make a big difference to the overall result. The three main types of data deduplication are source, target (or “post-process deduplication”) and in-line (or “transit deduplication”).

Source deduplication is performed on the device that is being backed up. Whatever data you are targeting for the backup is chunked and then hashed. As the index of hashes is built, deduplication occurs. You might be able to see two potential pitfalls already.

The first pitfall is that the source machine’s resources are used in this process. Thus you need to ensure that you have the CPU and RAM headroom beforehand. It’s no fun to have an already stressed email server now have to perform its own deduplication as well. Certainly, some vendors promote their product as being light-touch, but the server is still being touched, which might not be okay for some scenarios.

The second pitfall concerns the location of the hash table. In some cases the hash table is confined to each individual computer that is performing deduplication. If that’s the case, your hash table will only have the data on that one server to compare hashes to. The benefits of deduplication are restricted to one computer at a time. If identical data chunks exist on other servers, there’s no way for the current server to know that. In some cases the computers involved might forward their hashes to a central repository for storage and comparison and wait for a reply back. This would be something of a source/target hybrid and could be prohibitively chatty on your network.
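
The cost of keeping a separate hash table on each machine can be illustrated with a short Python sketch. Everything in it is an assumption for the sake of the example: three hypothetical servers that share an identical operating system image, fixed-size 4096-byte chunks, SHA-256, and invented function names.

```python
import hashlib


def chunk_digests(data, chunk_size=4096):
    """Yield (digest, length) for each fixed-size chunk of `data`."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).digest(), len(chunk)


def stored_with_local_tables(servers):
    """Each server keeps its own hash table and sees only its own data."""
    total = 0
    for data in servers.values():
        table = dict(chunk_digests(data))
        total += sum(table.values())
    return total


def stored_with_shared_table(servers):
    """All servers compare hashes against one central table."""
    table = {}
    for data in servers.values():
        table.update(chunk_digests(data))
    return sum(table.values())


def filler(label, chunks):
    """Deterministic, non-repeating test data: `chunks` * 4096 bytes."""
    return b"".join(
        hashlib.sha256(f"{label}-{i}".encode()).digest()
        for i in range(chunks * 128)
    )


os_image = filler("os", 8)  # identical on every hypothetical server
servers = {
    "mail": os_image + filler("mailboxes", 4),
    "web": os_image + filler("site", 2),
    "files": os_image + filler("shares", 6),
}
print(stored_with_local_tables(servers), stored_with_shared_table(servers))
```

With local tables the shared image is stored three times (147,456 bytes in total). With one shared table it is stored once (81,920 bytes), at the price of every server having to consult that table over the network.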

Nevertheless, source deduplication has its place, especially in smaller IT departments that have few servers and perhaps can’t afford a central storage area with which to perform global deduplication. “What’s global deduplication?” you ask. That conveniently brings me to the next point.

Target or post-process deduplication requires that all of your PCs and servers send their data to a central repository. Once the data has arrived, the repository has a “global view” of all of the data in your organization. Or at least all of the data that you chose to back up. With all that data in view, the deduplication utility can create a definitive hash table of the data chunks.

The first advantage of post-process deduplication is that deduplication potential is maximized. The larger the total pool of data is, the larger the hash table will be. The larger the hash table becomes, the greater the potential for duplicate hashes to be found.

The second notable advantage is that all processing is offloaded from the source machines to the central deduplication system. That’s less stress on the already taxed resources of your servers and PCs.

However, target deduplication is not all lotus eating. There are some downsides to be cautious of. The first downside is the storage space required. If you have a large enterprise, the repository required could be in the range of petabytes. You might be frothing “The whole point of deduplication is to save on storage requirements, and yet target deduplication requires a ton of storage space?!” Not to worry, though. The staging area for deduplication can be so-called “cheap storage.” A few SuperMicro chassis, several dozen consumer-grade Samsung SpinPoint drives and FreeNAS or OpenFiler as the filer OS can gain you dozens of terabytes rather cheaply. Once the data is deduplicated, you can then shuffle it off to more reliable (and therefore expensive) disk-based storage or to tape.

However, the second downside to target deduplication is that you can’t skimp too much on your storage repository. Target deduplication requires that data be written to the repository disks before chunking and hashing can occur. This makes the disk subsystem a new bottleneck in the process. Make sure that your storage controller can write at the disk’s full speed and choose an array type that isn’t terribly slow.

The third downside is that if your hash algorithm isn’t strong enough, there’s a chance that two chunks of data that aren’t exactly identical will still end up with the same hash value. This is known as a “hash collision”. When a collision occurs, the second chunk is wrongly discarded as a duplicate and the data it belonged to is silently corrupted. To make hash collisions vanishingly unlikely, you need to use a stronger hash algorithm. Of course, stronger hashing algorithms require more computational power. That’s typically not an issue on target deduplication systems since they’re likely to be running on dedicated hardware that can handle the workload.
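
A toy Python example shows how a collision turns into corruption. The weak_hash function below is deliberately terrible (an 8-bit byte sum that I made up so a collision is easy to produce) and is compared against SHA-256. No real product hashes this way; the point is only what happens when a matching hash is trusted blindly.

```python
import hashlib


def weak_hash(chunk):
    """Deliberately poor 8-bit 'hash': the byte sum modulo 256."""
    return sum(chunk) % 256


def strong_hash(chunk):
    """SHA-256 digest of the chunk."""
    return hashlib.sha256(chunk).digest()


def dedup_and_restore(chunks, hash_fn):
    """Deduplicate `chunks` by hash alone, then rebuild them from the table."""
    table = {}
    recipe = []
    for chunk in chunks:
        key = hash_fn(chunk)
        table.setdefault(key, chunk)  # a matching hash is assumed to be a match
        recipe.append(key)
    return [table[key] for key in recipe]


chunks = [b"ab", b"ba", b"ab"]  # "ab" and "ba" have the same byte sum
print(dedup_and_restore(chunks, weak_hash))
print(dedup_and_restore(chunks, strong_hash))
```

With the weak hash the middle chunk comes back as b"ab" instead of b"ba", and nothing reports an error. With SHA-256 all three chunks are restored correctly. A system can also compare the actual bytes whenever two hashes match, at the cost of an extra read.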

A fourth potential downside is that the full size of the source computers’ backed up data must flow out over the network causing congestion. You’ll have to solve that with either timing (spooling the data to the target server during off-hours) or a separate backup network (which isn’t uncommon in medium to large companies anyway).

In-line or transit deduplication is sometimes explained as being done as the data travels from the source to the target. This is slightly misleading. The data isn’t magically deduped “on the wire.” In reality it means that the data is collected in the RAM of the target device and deduplicated there, before being finally written to disk. This takes the seek time of disks out of the speed equation.
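
The difference in disk traffic between the two target-side approaches can be sketched in Python. This is a simplified model with assumptions of my own: lists and counters stand in for disks, chunks are a fixed 4096 bytes, SHA-256 is the hash, and the function names are hypothetical.

```python
import hashlib


def chunks_of(data, chunk_size=4096):
    """Yield fixed-size chunks of `data`."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def post_process(streams):
    """Stage every stream on disk in full, then deduplicate the staged copy.

    Returns (bytes written to staging, bytes kept after deduplication).
    """
    staged = list(streams)                       # stands in for the staging disk
    staged_bytes = sum(len(data) for data in staged)
    table = {}
    for data in staged:
        for chunk in chunks_of(data):
            table.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return staged_bytes, sum(table.values())


def in_line(streams):
    """Hash each chunk in memory; only unseen chunks reach the disk.

    Returns the number of bytes written to disk.
    """
    table = set()
    written = 0
    for data in streams:
        for chunk in chunks_of(data):
            digest = hashlib.sha256(chunk).digest()
            if digest not in table:
                table.add(digest)
                written += len(chunk)            # stands in for the disk write
    return written


# Hypothetical workload: three clients send the same 40,960-byte backup.
backup = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1280))
streams = [backup, backup, backup]
print(post_process(streams), in_line(streams))
```

Both approaches end up keeping the same 40,960 bytes, but the post-process model first writes all 122,880 incoming bytes to its staging area, while the in-line model writes only the unique chunks.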

In-line deduplication can be looked at in many ways as a better form of target deduplication. It has all of the advantages of a global view of your data along with the offloading of the hashing process but none of the disadvantages of slow disk I/O. However, you’re still stuck with large amounts of network traffic and a potential for hash collisions. An issue intrinsic to in-line deduplication is that it typically requires the most CPU power of all the different variations of deduplication.

Wrapping up

Deduplication technology can help you avoid the costs associated with building massive storage arrays. You can choose to use deduplication to keep your filers’ hard drives from breaking your budget, to keep your backup sets from requiring a forklift to transport, or both! Choose the type of deduplication wisely (source, target or in-line) and then implement it carefully (carefully choosing chunk sizes and hashing algorithms among other concerns). If you do, you’ll allow your company to smoothly grow with its storage demands and retain important information for longer periods of time.