A few years ago, deduplication was one of the industry's biggest buzzwords. That hype actually panned out, and dedup is now a vital way to save disk space. In a nutshell, deduplication is the process of eliminating duplicate copies of data.
Now, instead of having an appliance dedup the data, Sun's Zettabyte File System (ZFS) will do it for you.
From what we've found, ZFS deduplication works at three levels: files, blocks, or bytes. The data is checksummed using a hash function that identifies duplicates with very high probability.
With file-level deduplication, a hash signature is computed over the entire file. This level has the least overhead in terms of system resources, but keep in mind that if even a single byte of the file changes, the dedup process has to recompute the signature for the whole file.
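The file-level behavior can be sketched with ordinary shell tools (this is just an illustration of the idea, not how ZFS implements it internally; filenames are made up):

```shell
# Hypothetical sketch: file-level dedup keys on a hash of the entire file,
# so identical files collapse to one stored copy, while changing a single
# byte forces the whole file to be re-hashed.
printf 'hello world\n' > a.txt
printf 'hello world\n' > b.txt
h1=$(sha256sum a.txt | cut -d' ' -f1)
h2=$(sha256sum b.txt | cut -d' ' -f1)
[ "$h1" = "$h2" ] && echo "identical hash: store one copy"
printf 'Hello world\n' > b.txt   # change one byte
h3=$(sha256sum b.txt | cut -d' ' -f1)
[ "$h1" != "$h3" ] && echo "one byte changed: whole file re-hashed"
```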
Block-level deduplication has somewhat higher overhead and works with blocks of data. VM images, for example, tend to dedup fairly well because of the multiple copies of the operating system across guest virtual machines. When a guest writes data, only the blocks that changed get updated.
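A rough sketch of the block-level idea, again with plain shell tools rather than ZFS itself (block size and filenames are arbitrary for the demo):

```shell
# Hypothetical sketch: block-level dedup hashes each block separately, so
# duplicate blocks collapse even within a single file.
# Build a file of four 4-byte "blocks", two of which are identical.
printf 'AAAABBBBAAAACCCC' > image.bin
split -b 4 image.bin blk_        # one file per 4-byte block: blk_aa .. blk_ad
unique=$(sha256sum blk_* | awk '{print $1}' | sort -u | wc -l)
echo "unique blocks: $unique"    # the two AAAA blocks share one hash
```

Only three unique blocks need to be stored out of the four written, and a later edit to one block would require re-hashing only that block.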
Byte-level deduplication may be a good fit for mail servers, for example: if multiple copies of a message are stored, only certain portions, such as the headers, differ between them. You must really understand your environment before using this level, because the dedup process has to compute 'anchor points' to determine where the regions of duplicated vs. unique data begin and end.
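The mail example can be made concrete with a quick sketch (illustrative only; the message contents are invented):

```shell
# Hypothetical sketch: byte-level dedup has to find where two near-identical
# streams diverge. Here two mail messages differ only in one header byte;
# cmp -l lists each differing byte position.
printf 'Message-ID: 1\nSame body text here.\n' > msg1
printf 'Message-ID: 2\nSame body text here.\n' > msg2
diffs=$(cmp -l msg1 msg2 | wc -l)
echo "differing bytes: $diffs"   # everything else could be shared
```

Finding those boundaries between shared and unique regions is exactly what makes byte-level dedup the most expensive of the three.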
Turning on ZFS deduplication is fairly simple. Just tell ZFS to dedup a named storage pool, such as proddata, and its datasets:
zfs set dedup=on proddata
zfs set dedup=on proddata/satimages
zfs set dedup=off proddata/statlogs
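Since dedup is an inheritable dataset property, you can check how the settings above propagate with `zfs get` (using the same example pool name):

```shell
# Show the dedup property, and whether each dataset's value is set locally
# or inherited from the pool, for proddata and everything beneath it
zfs get -r dedup proddata
```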
To tell ZFS to do a full byte-for-byte comparison of every incoming block against any alleged duplicate, to ensure they really are the same, set it like this:
zfs set dedup=verify proddata
If you want ZFS to compute checksums faster, at the cost of using a weaker hash function, combine the fletcher4 checksum with verification:
zfs set dedup=fletcher4,verify proddata
Worth noting: ZFS deduplication holds its mapping tables in resident memory. If ZFS has to spill the tables over into the L2ARC, performance will be slower, and even more so if the tables have to be read from disk.
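Because of that memory cost, it's worth estimating the dedup table before committing. `zdb -S` simulates deduplication on an existing pool without changing anything, and `zpool list` reports the achieved ratio once dedup is live (pool name as in the example above):

```shell
# Read-only simulation: build a would-be dedup table for the pool and print
# a histogram plus an estimated dedup ratio (this can take a while)
zdb -S proddata

# Once dedup is enabled, the DEDUP column shows the pool-wide ratio
zpool list proddata
```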
Read more about the new features of ZFS and deduplication on Sun.com.