
• Original Articles

Integrating Web objects extracted from multiple sites into relational database

HUANG Jian-bin (1,2); JI Hong-bing (1); SUN He-li (3)

  1. School of Electronic Engineering, Xidian Univ., Xi'an 710071, China
  2. School of Computer Science and Technology, Xidian Univ., Xi'an 710071, China
  3. Dept. of Computer Science and Technology, Xi'an Jiaotong Univ., Xi'an 710049, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-02-20 Published:2007-02-25

Abstract: This paper studies the problem of integrating heterogeneous, semi-structured Web objects into a relational database. A generalized sequential learning model, the Combined Conditional Random Fields, is presented for schema matching between pairs of heterogeneous Web data sources. The model learns from both manually labeled training data and unlabeled database records, which reduces its dependence on laboriously labeled samples. It also provides a novel way to incorporate the two-dimensional neighborhood dependencies between Web data elements. In addition, a constrained Viterbi algorithm is implemented to infer labels under the imposed constraints and so obtain an optimal data integration. Experiments on a large number of Web pages from diverse domains show that the proposed method significantly improves matching accuracy.
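
The abstract does not give the details of the constrained Viterbi algorithm, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes a linear-chain model with log-scores, and it assumes that a constraint takes the form "the element at position t must receive label y". The function name `constrained_viterbi`, the `constraints` dictionary format, and the toy scores are all hypothetical.

```python
NEG_INF = float("-inf")


def constrained_viterbi(emission, transition, constraints=None):
    """Return the highest-scoring label sequence that satisfies the constraints.

    emission[t][y]   : log-score of label y at position t
    transition[a][b] : log-score of moving from label a to label b
    constraints      : {position: required_label} (hypothetical format)
    """
    constraints = constraints or {}
    n_steps, n_labels = len(emission), len(emission[0])

    def allowed(t, y):
        # A position without a constraint accepts any label.
        return t not in constraints or constraints[t] == y

    # score[t][y] is the best log-score of any valid path ending in y at t.
    score = [[NEG_INF] * n_labels for _ in range(n_steps)]
    back = [[0] * n_labels for _ in range(n_steps)]

    for y in range(n_labels):
        if allowed(0, y):
            score[0][y] = emission[0][y]

    for t in range(1, n_steps):
        for y in range(n_labels):
            if not allowed(t, y):
                continue  # forbidden labels keep a score of -inf
            best_prev = max(
                range(n_labels),
                key=lambda p: score[t - 1][p] + transition[p][y],
            )
            score[t][y] = (
                score[t - 1][best_prev] + transition[best_prev][y] + emission[t][y]
            )
            back[t][y] = best_prev

    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda y: score[-1][y])
    path = [last]
    for t in range(n_steps - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: two labels, three positions, switching labels is penalized.
emission = [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]]
transition = [[0.0, -5.0], [-5.0, 0.0]]

unconstrained = constrained_viterbi(emission, transition)
constrained = constrained_viterbi(emission, transition, {1: 1})
```

In this toy setting, forcing label 1 at the middle position changes the labels chosen for its neighbors as well, because the switching penalty makes a consistent sequence cheaper than switching twice. That propagation is the reason to apply constraints during decoding rather than patching the output afterwards.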

Key words: Web data integration, schema matching, conditional random fields

CLC Number: TP311