J4
• Original Articles
HUANG Jian-bin 1,2; JI Hong-bing 1; SUN He-li 3
Abstract: To identify duplicate records in data extracted from multiple Web sites, a record-pair similarity learning approach based on adaptive string distance metrics is presented. First, the approach uses a maximum entropy classifier to label the relationship between each pair of corresponding fields in the two records. Then, a suitable distance function is selected for each field pair to measure its similarity. Finally, a Support Vector Machine trained on selected samples classifies the record pair as duplicate or non-duplicate. Experimental results on a range of datasets show that the approach improves duplicate detection accuracy significantly over traditional techniques and tolerates noisy data well.
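The three stages described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the relation labels (`token_level`, `char_level`), the rule inside `label_field_pair` (a stand-in for the maximum entropy classifier), and the hand-set weights and bias in `is_duplicate` (a stand-in for the trained Support Vector Machine) are all hypothetical, and the two string metrics are common choices rather than the ones the paper necessarily uses.

```python
# Illustrative sketch only. Labels, rules, weights and bias are assumptions
# made for this example; they are not taken from the paper.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] derived from edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaccard_similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; insensitive to word order."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def label_field_pair(a: str, b: str) -> str:
    """Hypothetical rule standing in for the maximum entropy classifier."""
    multi_token = len(a.split()) > 1 or len(b.split()) > 1
    return "token_level" if multi_token else "char_level"

# Stage 2: the predicted label selects the distance function for the field pair.
METRICS = {"token_level": jaccard_similarity, "char_level": edit_similarity}

def similarity_vector(r1, r2):
    """One similarity score per pair of corresponding fields."""
    scores = []
    for a, b in zip(r1, r2):
        a, b = a.lower(), b.lower()
        scores.append(METRICS[label_field_pair(a, b)](a, b))
    return scores

def is_duplicate(r1, r2, weights, bias):
    """Linear decision function standing in for a trained SVM."""
    score = sum(w * s for w, s in zip(weights, similarity_vector(r1, r2)))
    return score + bias > 0
```

With the assumed weights `[1.0, 1.0]` and bias `-1.2`, the records `("John Smith", "Xian")` and `("Smith John", "Xian")` are classified as duplicates, because the word-order change in the name field is handled by the token-level metric. In a real system both the labelling rule and the decision function would be learned from training data.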
Key words: approximate duplicate record detection, record linkage, entity matching, data integration
HUANG Jian-bin; JI Hong-bing; SUN He-li. An adaptive similarity learning approach to record linkage [J]. J4, 2007, 34(2): 331-336.
URL: https://journal.xidian.edu.cn/xdxb/EN/
https://journal.xidian.edu.cn/xdxb/EN/Y2007/V34/I2/331