Multiple-Loads Deduplication Method Based on Improved Sparse Indexing

    Abstract: Existing Sparse Indexing cannot effectively deduplicate backup loads that consist of small files. To address this problem, a min-feature sampling algorithm grounded in Broder's extension theorem is proposed; it can sample features effectively from backup loads of different forms. Building on this algorithm, a deduplication method for multiple backup loads is designed. By sampling features from the backup load, the method keeps only a very small subset of the full index in RAM, and by reading chunk IDs in batches it amortizes the cost of disk accesses and improves throughput. Experimental results show that on mixed backup loads the method achieves a compression ratio 2.04 times that of Sparse Indexing, with throughput comparable to that of Sparse Indexing. The method is suited to high-performance deduplication systems that must process backup loads of multiple types.
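
    The idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes each backup unit is a list of chunks, that chunk IDs are SHA-1 digests, and that the sampled features are simply the k smallest chunk IDs (min-wise sampling in the spirit of Broder's theorem, under which similar chunk sets are likely to share their smallest hashes). The names `chunk_id`, `min_features`, and `SampledIndex` are hypothetical, and the on-disk chunk-ID lists are simulated with an in-memory dictionary.

```python
import hashlib


def chunk_id(data: bytes) -> bytes:
    """Chunk identifier; SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(data).digest()


def min_features(ids, k):
    """Return the k smallest distinct chunk IDs as the sampled features."""
    return sorted(set(ids))[:k]


class SampledIndex:
    """Keeps only sampled features in memory (hypothetical structure)."""

    def __init__(self, k=2):
        self.k = k
        self.features = {}   # feature -> name of the unit that owns it (in RAM)
        self.manifests = {}  # unit name -> set of chunk IDs (stands in for disk)

    def backup(self, name, chunks):
        """Deduplicate one backup unit; return the number of new chunks stored."""
        ids = [chunk_id(c) for c in chunks]
        feats = min_features(ids, self.k)

        # Each feature hit loads a whole chunk-ID list in one batch,
        # which is what amortizes the (simulated) disk access.
        known = set()
        for f in feats:
            owner = self.features.get(f)
            if owner is not None:
                known |= self.manifests[owner]

        # dict.fromkeys removes duplicates inside the unit, keeping order.
        new_ids = [i for i in dict.fromkeys(ids) if i not in known]

        self.manifests[name] = set(ids)
        for f in feats:
            self.features.setdefault(f, name)
        return len(new_ids)


index = SampledIndex(k=2)
first = index.backup("file1", [b"a", b"b", b"c"])
second = index.backup("file2", [b"a", b"b", b"c"])
third = index.backup("file3", [b"a", b"b", b"c", b"d"])
```

    In this toy run the second unit is fully deduplicated, and the third stores only its one new chunk, because at least one of its two smallest chunk IDs is already indexed. Because only sampled features are kept in memory, a sketch like this can miss duplicates whose features were never sampled; that is the trade-off between memory use and compression ratio that the abstract refers to.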

     
