• Original Articles

An adaptive similarity learning approach to record linkage

HUANG Jian-bin (1,2); JI Hong-bing (1); SUN He-li (3)
  

  1. School of Electronic Engineering, Xidian Univ., Xi'an 710071, China; 2. School of Computer Science and Technology, Xidian Univ., Xi'an 710071, China; 3. Dept. of Computer Science & Technology, Xi'an Jiaotong Univ., Xi'an 710049, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-04-20 Published:2007-04-20

Abstract: To identify duplicate records in data extracted from multiple Web sites, a record-pair similarity learning approach based on adaptive string distance metrics is presented. First, a maximum entropy classifier labels the relationship between each pair of corresponding fields in the two records. Next, an appropriate distance function is selected for each field pair to measure its similarity. Finally, a Support Vector Machine trained on selected samples classifies the record pair as duplicate or non-duplicate. Experimental results on a range of datasets show that the approach significantly improves duplicate detection accuracy over traditional techniques and is robust to noisy data.
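
The middle step of this pipeline, choosing a distance function per field pair and collecting the scores into a feature vector, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the relationship labels ("typo", "reordered", "identical"), the three metrics, and the label-to-metric table are hypothetical. The labels are supplied by hand instead of by a maximum entropy classifier, and the resulting vector is only the kind of input a Support Vector Machine could be trained on; no classifier is trained here.

```python
from difflib import SequenceMatcher


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], using difflib's matching-block ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens; ignores token order."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def exact_match(a: str, b: str) -> float:
    """1.0 if the fields are equal after trimming and lowercasing, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Hypothetical mapping from a field-pair relationship label to a metric.
# In the approach described above, the label would come from a classifier.
METRIC_FOR_LABEL = {
    "typo": edit_similarity,
    "reordered": token_jaccard,
    "identical": exact_match,
}


def similarity_vector(rec_a: dict, rec_b: dict, labels: dict) -> list:
    """Score each labeled field pair with its selected metric.

    Fields are visited in sorted name order so the vector layout is stable.
    """
    return [
        METRIC_FOR_LABEL[labels[field]](rec_a[field], rec_b[field])
        for field in sorted(labels)
    ]


if __name__ == "__main__":
    rec_a = {"name": "John Smith", "title": "Data Mining Concepts"}
    rec_b = {"name": "Smith John", "title": "Data Mining Concepts"}
    labels = {"name": "reordered", "title": "identical"}  # hand-assigned
    print(similarity_vector(rec_a, rec_b, labels))
```

Selecting the metric per field pair lets a reordered name score as a full match under token overlap, where a single character-level metric applied to every field would penalize it.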

Key words: approximate duplicate record detection, record linkage, entity matching, data integration

CLC Number: TP311