A Method for Detecting Approximately Duplicate Records in a Data Warehouse

Abstract: Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data mining and data quality improvement. To address this problem in data warehouses, a detection method based on rank-based weighting and grouping is presented. First, a weight is assigned to each field using the rank-based weighting method. Second, following the grouping idea, a key field, or certain character positions of a field, is selected to partition the large data set into many disjoint small data sets. Finally, approximately duplicate records are detected and eliminated within each small data set; to avoid missed duplicates, detection is repeated several times using other key fields or other field positions. Theoretical analysis and experiments show that the method achieves both good detection precision and good time efficiency, and that it can effectively handle approximately duplicate record detection on large data volumes.
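
The three steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the rank-sum formula used for the field weights, `difflib.SequenceMatcher` as the per-field similarity measure, the 0.85 threshold, and all function names, field names, and sample records.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_sum_weights(ranks):
    """Map field ranks (1 = most important) to weights that sum to 1.

    Assumption: the rank-sum formula w = (n + 1 - r) / (n (n + 1) / 2).
    """
    n = len(ranks)
    total = n * (n + 1) / 2
    return {field: (n + 1 - r) / total for field, r in ranks.items()}


def similarity(a, b, weights):
    """Weighted sum of per-field string similarities, in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f, w in weights.items()
    )


def detect_duplicates(records, weights, group_keys, threshold=0.85):
    """Return sorted index pairs judged to be approximate duplicates.

    group_keys holds one function per pass; each maps a record to the
    key of the group (small data set) that the record falls into.
    """
    pairs = set()
    for key in group_keys:                      # one pass per grouping key
        groups = defaultdict(list)
        for i, rec in enumerate(records):
            groups[key(rec)].append(i)          # groups are disjoint
        for members in groups.values():
            # Compare records only within the same group.
            for i, j in combinations(members, 2):
                if (i, j) in pairs:
                    continue                    # found in an earlier pass
                if similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((i, j))
    return sorted(pairs)


# Hypothetical sample data; record 3 has a typo in the city field.
records = [
    {"name": "John Smith", "phone": "13800001111", "city": "Beijing"},
    {"name": "Jon Smith", "phone": "13800001111", "city": "Beijing"},
    {"name": "Mary Jones", "phone": "13900002222", "city": "Shanghai"},
    {"name": "John Smith", "phone": "13800001111", "city": "Bejing"},
]
weights = rank_sum_weights({"name": 1, "phone": 2, "city": 3})

by_city = lambda r: r["city"][:3].lower()       # first 3 characters of city
by_phone = lambda r: r["phone"][:4]             # first 4 digits of phone

print(detect_duplicates(records, weights, [by_city]))
# -> [(0, 1)]
print(detect_duplicates(records, weights, [by_city, by_phone]))
# -> [(0, 1), (0, 3), (1, 3)]
```

With the city prefix alone, the typo puts record 3 in a group by itself and its duplicates are missed. The second pass, grouped on the phone prefix, recovers them, which is the reason the method repeats detection with other key fields.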