A Method for Detecting Approximately Duplicate Database Records in Data Warehouse
-
Abstract
Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data mining and data quality improvement. This paper presents an algorithm for detecting approximately duplicate database records based on a rank-and-group approach. First, each attribute of the data is assigned a weight using a rank-based weighting method. Second, following a grouping strategy, a large data set is divided into many small, disjoint data sets. Finally, approximately duplicate records are detected and eliminated within each small data set. To avoid missed duplicates, these steps can be repeated. Theoretical analysis and experiments show that the algorithm achieves good detection precision and better time efficiency, and is therefore an effective approach to handling approximately duplicate records in massive data.
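The three steps can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions made here because the abstract does not specify them: the rank-sum formula used for the attribute weights, the use of `difflib.SequenceMatcher` as the per-attribute string similarity, the grouping keys supplied by the caller, and the threshold of 0.85. The function names (`rank_sum_weights`, `partition`, `find_duplicates`) are likewise hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_sum_weights(ranks):
    """Turn attribute ranks (1 = most important) into weights that sum to 1.

    Assumption: the rank-sum formula w = 2(n + 1 - r) / (n(n + 1)).
    """
    n = len(ranks)
    return {attr: 2 * (n + 1 - r) / (n * (n + 1)) for attr, r in ranks.items()}


def similarity(a, b, weights):
    """Weighted sum of per-attribute string similarities, in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[attr].lower(), b[attr].lower()).ratio()
        for attr, w in weights.items()
    )


def partition(records, key):
    """Split record indices into disjoint groups that share the same key value."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[key(rec)].append(idx)
    return groups.values()


def find_duplicates(records, ranks, keys, threshold=0.85):
    """Return index pairs (i, j), i < j, judged to be approximate duplicates.

    One pass is made per grouping key; repeating with a different key
    catches pairs that an earlier partition placed in different groups.
    """
    weights = rank_sum_weights(ranks)
    pairs = set()
    for key in keys:
        for group in partition(records, key):
            # Pairwise comparison is only done inside each small group.
            for i, j in combinations(group, 2):
                if similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((i, j))
    return pairs
```

Comparing records only within groups is what reduces the cost: the quadratic pairwise comparison is applied to each small group rather than to the whole data set, while the repeated passes trade some extra work for fewer missed duplicates.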