Data deduplication is a fancy term for reducing redundant data. In a way, it is similar to data compression technology you might be familiar with, such as Zip files, which look for recurring byte patterns to reduce a file's size. Duplicate file attachments are one example of redundant data. You may have 50 to 100 copies of the same file attachment in your email. When deduplication is performed, only one copy of a redundant file is kept. This saves space because 100 copies of a 1 MB file take up 100 MB, while a single copy takes up only 1 MB. If you have many duplicated files, you are using more hard drive or network space than is necessary.
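To put a number on the attachment example above, here is a small Python sketch of the arithmetic. The function name `space_saved_mb` is made up for this illustration, and it assumes every copy is exactly the same size.

```python
def space_saved_mb(copies: int, size_mb: float) -> float:
    """Space freed (in MB) when `copies` identical files shrink to one copy.

    Hypothetical helper for illustration; assumes all copies are the same size.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # One copy is kept, so every copy beyond the first is reclaimed.
    return (copies - 1) * size_mb


before_mb = 100 * 1.0                  # 100 copies of a 1 MB attachment
saved_mb = space_saved_mb(100, 1.0)    # space reclaimed by deduplication
print(f"before: {before_mb} MB, after: {before_mb - saved_mb} MB")
```

In the 100-copy example, 99 of the 100 MB are reclaimed.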
Using a deduplication program or service can save storage space and, because there is less data to copy, make backups quicker and more efficient, which in turn helps your data protection. Businesses will be able to spend their storage budget more practically and more cost effectively. Network bandwidth use also drops after deduplication, since less duplicate data has to be sent.
Deduplication is achieved by one of two methods: Source Deduplication and Target Deduplication.
In Source Deduplication, data is deduplicated at the source. Usually this is done within a file system, and is referred to as Single Instance Storage. The file system periodically scans new files as they are created and computes a hash for each one. (A file hash is a short identifier computed from the file's contents, so identical files produce identical hashes.) When files with the same hash are found, the deduplication process removes the extra copies and keeps one.
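The scan-and-hash idea can be sketched in a few lines of Python. This is only an illustration of finding files with matching hashes, not how any particular file system implements Single Instance Storage. The helper names `file_sha256` and `find_duplicates` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; keep only groups of 2+."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[file_sha256(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


# Small demonstration in a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "report.doc").write_bytes(b"quarterly numbers")
    (demo / "report_copy.doc").write_bytes(b"quarterly numbers")
    (demo / "notes.txt").write_bytes(b"something else")
    for digest, paths in find_duplicates(demo).items():
        print(digest[:12], [p.name for p in paths])
```

A real deduplication process would then keep one file from each group and replace the others with references to it. The sketch stops at reporting the groups so nothing is deleted by accident.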
Target Deduplication removes duplicate files at the secondary storage area. This might be a backup store, like a virtual tape library or a data repository.
Target Deduplication can be carried out in three main ways: Post Process, In Line, and Client Backup.
In Post Process Deduplication, your data is first written to a storage device, and deduplication runs later to find any duplicate files. This method saves time during the write, because the hash calculations needed to discover duplicate files are deferred until afterward. One of the marks against this method is that duplicate files have to be stored, even if only for a short time.
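Here is a minimal in-memory sketch of the store-first, deduplicate-later idea. The class `PostProcessStore` and its methods are invented for this illustration and do not reflect any vendor's product. A real system works on disk, usually on blocks rather than whole files.

```python
import hashlib
from itertools import count


class PostProcessStore:
    """Hypothetical store: writes land as-is, a later pass removes duplicates."""

    def __init__(self) -> None:
        self._ids = count()
        self.blocks: dict[int, bytes] = {}   # block id -> stored data
        self.names: dict[str, int] = {}      # file name -> block id

    def write(self, name: str, data: bytes) -> None:
        # No hashing on the write path, so saving is fast,
        # but duplicates take up space until the next pass.
        if name in self.names:
            raise ValueError(f"{name!r} already exists in this simple sketch")
        block_id = next(self._ids)
        self.blocks[block_id] = data
        self.names[name] = block_id

    def read(self, name: str) -> bytes:
        return self.blocks[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.blocks.values())

    def deduplicate(self) -> int:
        """Run the deferred pass; return the number of bytes reclaimed."""
        before = self.stored_bytes()
        first_seen: dict[str, int] = {}      # content hash -> block to keep
        for name, block_id in list(self.names.items()):
            digest = hashlib.sha256(self.blocks[block_id]).hexdigest()
            keeper = first_seen.setdefault(digest, block_id)
            if keeper != block_id:
                self.names[name] = keeper    # point the name at the kept copy
                del self.blocks[block_id]    # drop the redundant copy
        return before - self.stored_bytes()


store = PostProcessStore()
store.write("mail1/attachment.pdf", b"x" * 1000)
store.write("mail2/attachment.pdf", b"x" * 1000)
print("before pass:", store.stored_bytes(), "bytes")
print("reclaimed:", store.deduplicate(), "bytes")
```

Both copies sit on the device until `deduplicate()` runs, which is the temporary storage cost mentioned above.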
In Line Deduplication performs the hash calculations on the target device in real time, as files are saved. This prevents duplicate files from being saved in the first place. On the plus side, storage is saved because the duplicates are never written. However, the time taken to perform the hash calculations while a file is being saved can slow down writes.
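As a rough sketch, In Line Deduplication can be modeled as a store that hashes every file at the moment it is saved and keeps the data only if that hash has not been seen before. The class `InlineStore` is a made-up name for this example, and using SHA-256 on whole files is an assumption.

```python
import hashlib


class InlineStore:
    """Hypothetical store that deduplicates at write time."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}   # content hash -> stored data
        self.names: dict[str, str] = {}      # file name -> content hash

    def write(self, name: str, data: bytes) -> bool:
        """Save a file; return True only if new data was actually stored."""
        # The hash is computed on the write path, which is the cost
        # of never storing a duplicate.
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blocks
        if is_new:
            self.blocks[digest] = data
        self.names[name] = digest
        return is_new

    def read(self, name: str) -> bytes:
        return self.blocks[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.blocks.values())


store = InlineStore()
print(store.write("mail1/attachment.pdf", b"x" * 1000))  # first copy is stored
print(store.write("mail2/attachment.pdf", b"x" * 1000))  # duplicate is skipped
print(store.stored_bytes(), "bytes stored")
```

Every call to `write` pays for a hash calculation, so the trade-off is slower saves in exchange for never holding a duplicate.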
The Client Backup method performs the hash calculations on the client, or source, machine. If a file's hash matches one that has already been backed up, the file is not sent, which prevents the duplicate. This keeps duplicate data off the network and keeps the traffic load down.
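The exchange between client and backup target might look something like the following sketch. The names `BackupServer`, `has`, `upload`, and `backup` are hypothetical, and the "network" here is just a function call. The point is only to show that the client sends a hash first and the file contents only when needed.

```python
import hashlib


class BackupServer:
    """Hypothetical backup target that stores data keyed by content hash."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}   # content hash -> data
        self.catalog: dict[str, str] = {}    # file name -> content hash

    def has(self, digest: str) -> bool:
        return digest in self.blocks

    def upload(self, digest: str, data: bytes) -> None:
        # Re-check the hash so a corrupted transfer is not stored.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("data does not match its hash")
        self.blocks[digest] = data

    def record(self, name: str, digest: str) -> None:
        self.catalog[name] = digest


def backup(files: dict[str, bytes], server: BackupServer) -> int:
    """Back up `files`; return how many bytes of file data were sent."""
    sent = 0
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()   # computed on the client
        if not server.has(digest):                  # only the hash is sent
            server.upload(digest, data)             # contents sent only if new
            sent += len(data)
        server.record(name, digest)
    return sent


server = BackupServer()
files = {"a.pdf": b"x" * 1000, "copy_of_a.pdf": b"x" * 1000, "b.txt": b"y" * 10}
print("first backup sent", backup(files, server), "bytes")
print("second backup sent", backup(files, server), "bytes")
```

The second run sends no file data at all, because every hash is already known to the server.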
One of the downsides of data deduplication is the possibility of a hash collision. A collision happens when two different pieces of data produce the same hash value, so one could be mistaken for a duplicate of the other. Hash values are usually made large enough that the hardware is more likely to fail than a collision is to occur.
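To get a feel for how unlikely a collision is, the well-known "birthday" approximation can be used. The sketch below assumes an ideal hash function whose outputs are spread evenly, which is a simplification, and the function name `collision_probability` is made up for the example.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that any two of `num_items` share a hash value.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming an ideal, evenly distributed hash (an assumption, not a guarantee).
    """
    pairs = num_items * (num_items - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# A trillion files with a 256-bit hash versus 100,000 files with a 32-bit hash.
print(collision_probability(10**12, 256))
print(collision_probability(100_000, 32))
```

With a 256-bit hash the probability is vanishingly small even for a trillion files, while a 32-bit hash would make a collision more likely than not with only 100,000 files. This is why deduplication systems rely on large hash values.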
Data deduplication is one of the best ways to avoid data redundancy. Symantec and Quantum are two companies that offer data deduplication solutions.

