A Method for Detecting Approximately Duplicate Database Records in Data Warehouse

LI Xing-yi, BAO Cong-jian, SHI Hua-ji

Citation: LI Xing-yi, BAO Cong-jian, SHI Hua-ji. A Method for Detecting Approximately Duplicate Database Records in Data Warehouse[J]. Journal of University of Electronic Science and Technology of China, 2007, 36(6): 1273-1277.


Funding:

National Torch Program Project (2004EB33006[0]); Jiangsu Provincial Universities Natural Science Guidance Program Project (05JKD520050)

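Details
    Author biography:

    LI Xing-yi (b. 1969), male, Ph.D., associate professor. His research covers data mining, spatial databases, and traffic information systems and control.

  • CLC number: TP311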

A Method for Detecting Approximately Duplicate Database Records in Data Warehouse

  • Abstract: Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data mining and data quality improvement. This paper presents a rank- and group-based method for detecting such records in a data warehouse. First, a weight is computed for each field using a rank-based weighting method. Second, following a grouping strategy, a key field (or selected character positions of a field) is used to partition the large data set into many disjoint small data sets. Finally, approximately duplicate records are detected and eliminated within each small data set; to avoid missed duplicates, the detection is repeated several times with other key fields or field positions. Theoretical analysis and experiments show that the method achieves both good detection accuracy and good time efficiency, making it an effective approach to approximately-duplicate-record detection on large data volumes.
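The three steps in the abstract (field weights, grouping into disjoint subsets, repeated detection with different keys) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the rank-sum weighting, the `difflib`-based field similarity, the 0.85 threshold, and the sample fields and blocking keys below are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_sum_weights(ranks):
    """Turn field ranks (1 = most important) into weights that sum to 1.

    Assumption: rank-sum weighting, one common rank-based scheme.
    """
    n = len(ranks)
    total = n * (n + 1) / 2
    return {field: (n - r + 1) / total for field, r in ranks.items()}


def similarity(a, b, weights):
    """Weighted sum of per-field string similarities, in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[f], b[f]).ratio()
        for f, w in weights.items()
    )


def find_duplicates(records, weights, block_keys, threshold=0.85):
    """Return index pairs (i, j), i < j, judged approximately duplicate.

    One pass per blocking key: records are split into disjoint groups,
    compared only within their group, and pairs from all passes are merged.
    """
    pairs = set()
    for key in block_keys:
        groups = defaultdict(list)
        for i, rec in enumerate(records):
            groups[key(rec)].append(i)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((i, j))
    return pairs


# Hypothetical sample data and blocking keys, for illustration only.
records = [
    {"name": "Zhang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Zhang Wie", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Li Na", "city": "Nanjing", "phone": "025-8330002"},
    {"name": "Zang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
]
weights = rank_sum_weights({"name": 1, "phone": 2, "city": 3})
block_keys = [
    lambda r: r["name"][:2],    # pass 1: first two characters of the name
    lambda r: r["phone"][-4:],  # pass 2: last four digits of the phone
]
print(sorted(find_duplicates(records, weights, block_keys)))
```

In this sample, grouping on the name prefix alone puts "Zang Wei" in a different group from the two "Zhang" records, so that pass cannot compare them. The second pass, keyed on the phone suffix, brings all three together, which is the reason the abstract gives for repeating detection with other keys.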
Publication history
  • Received: 2007-09-06
  • Published: 2007-12-14
