LI Xing-yi, BAO Cong-jian, SHI Hua-ji. A Method for Detecting Approximately Duplicate Database Records in Data Warehouse[J]. Journal of University of Electronic Science and Technology of China, 2007, 36(6): 1273-1277.

A Method for Detecting Approximately Duplicate Database Records in Data Warehouse

  • Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data mining and data quality improvement. An algorithm for detecting approximately duplicate database records is presented, based on rank-based grouping. First, each attribute of the data is assigned a weight according to a rank-based weighting method. Second, following a grouping strategy, the large data set is divided into many disjoint small data sets. Finally, approximately duplicate records are detected and eliminated within each small data set. To avoid missed duplicates, these steps can be repeated. Theoretical analysis and experiments show that the algorithm achieves good detection precision and better time efficiency, and is therefore an effective approach to handling approximately duplicate records in massive data sets.
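
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every specific in it is an assumption made for illustration: the abstract does not say which rank-based weighting is used, so rank-sum weights stand in for it; the grouping key (the first character of a field), the string similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, and all function names are likewise hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_sum_weights(ranks):
    """Turn field ranks (1 = most important) into weights that sum to 1.

    Assumption: rank-sum weighting, w = (n + 1 - rank) / (n * (n + 1) / 2).
    """
    n = len(ranks)
    total = n * (n + 1) / 2
    return {field: (n + 1 - rank) / total for field, rank in ranks.items()}


def similarity(a, b, weights):
    """Weighted sum of per-field string similarities, each in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field, w in weights.items()
    )


def find_duplicates(records, weights, key_field, threshold=0.85):
    """Split records into disjoint groups, then compare pairs within each group.

    Assumption: the group key is the first character of key_field.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[key_field][:1].lower()].append(idx)

    pairs = set()
    for members in groups.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j], weights) >= threshold:
                pairs.add((i, j))
    return pairs


def detect(records, ranks, threshold=0.85):
    """Repeat the grouping pass once per field, most important field first,
    so that a pair split apart by one grouping key can be caught by another."""
    weights = rank_sum_weights(ranks)
    pairs = set()
    for key_field in sorted(ranks, key=ranks.get):
        pairs |= find_duplicates(records, weights, key_field, threshold)
    return pairs
```

Calling `detect(records, {"name": 1, "city": 2, "phone": 3})` on a list of dictionaries with those three string fields returns a set of index pairs judged to be approximate duplicates. Comparing only within groups is what keeps the cost well below an all-pairs comparison, and the repeated passes correspond to the abstract's remark about repeating the steps to avoid misses.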
