Deduplication Performance Analysis Based on the FSL Dataset

Abstract: Deduplication is a data reduction technique that compresses highly redundant datasets and can effectively reduce the cost of wasted space in storage systems. Most previous studies examined small-scale static snapshots or snapshots covering a short period. This paper instead uses large-scale snapshots covering a long period, selected from a shared user file system. It studies the characteristics of the backup datasets from the perspectives of files, data chunks, and users, compares the deduplication performance of different chunking methods and strategies, reports the highest deduplication ratio obtained, and offers suggestions for the design of future deduplication systems.
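
To make the notion of deduplication ratio under different chunking methods concrete, the following minimal sketch compares fixed-size chunking with a toy content-defined chunking scheme. It is an illustration only: the function names (`fixed_chunks`, `content_defined_chunks`, `dedup_ratio`), the chunk sizes, the byte-mask boundary rule, and the use of SHA-1 fingerprints are all assumptions made for this example and are not taken from the paper or from the FSL trace format.

```python
import hashlib


def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, mask: int = 0x07, min_size: int = 2):
    """Toy content-defined chunking (hypothetical rule, not a real Rabin
    fingerprint): cut after any byte whose masked bits are all zero,
    provided the current chunk has reached min_size bytes."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if i + 1 - start >= min_size and (byte & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def dedup_ratio(snapshots, chunker):
    """Logical bytes divided by the bytes of unique chunks across snapshots."""
    logical = 0
    unique = {}  # chunk fingerprint -> chunk length
    for snap in snapshots:
        for chunk in chunker(snap):
            logical += len(chunk)
            unique.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


if __name__ == "__main__":
    base = bytes(range(64))
    shifted = b"\xff" + base  # same content with one byte inserted in front
    print("fixed-size     :", dedup_ratio([base, shifted], fixed_chunks))
    print("content-defined:", dedup_ratio([base, shifted], content_defined_chunks))
```

In this toy input, inserting a single byte shifts every fixed-size chunk boundary, so no chunk of the second snapshot matches the first, whereas the content-defined boundaries realign after the insertion and most chunks are shared. Real systems use rolling hashes and much larger chunk sizes; the sketch only shows why the choice of chunking method affects the measured ratio.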