Electronic Science and Technology, 2019, Vol. 32, Issue (4): 29-33. doi: 10.16180/j.cnki.issn1007-7820.2019.04.007

Chinese Word Segmentation Based on Joint Training of Heterogeneous Data

JIANG Meng1, WANG Ziniu2, GAO Jianling1

  1. School of Big Data & Information Engineering, Guizhou University, Guiyang 550025, China;
  2. Network and Information Management Center, Guizhou University, Guiyang 550025, China
  • Received: 2018-03-18 Online: 2019-04-15 Published: 2019-03-27
  • Supported by:
    Guizhou Science and Technology Fund (Guizhou Science and Technology Agency J [2015]2045); Guizhou University Graduate Innovation Fund (Graduate Science and Technology 2017016)

Abstract:

Chinese word segmentation is one of the key foundational technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. Deep learning models need large-scale training data to perform well, but annotated Chinese word segmentation data is currently scarce, and the available corpora follow different segmentation standards. This paper proposes a simple and effective method for processing such heterogeneous data. First, two manually defined identifiers are added to the data from the different corpora; the processed data is then used to jointly train a Bi-LSTM-CRF Chinese word segmentation model. Experimental results show that the Bi-LSTM-CRF model jointly trained on heterogeneous data achieves better segmentation performance than a model trained on a single corpus.
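The data-processing step described in the abstract might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say where the two identifiers are placed, so the sketch assumes they wrap each sentence as an opening and a closing token; the corpus names `pku` and `msr`, the B/M/E/S tagging scheme, and the helper names `to_bmes`, `mark_corpus`, and `build_joint_training_set` are all hypothetical.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def to_bmes(words: List[str]) -> TaggedSentence:
    """Turn a segmented sentence into (character, tag) pairs.

    Assumption: the common B/M/E/S scheme (Begin, Middle, End, Single);
    the abstract does not state which tag set is used.
    """
    pairs: TaggedSentence = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def mark_corpus(words: List[str], corpus_id: str) -> TaggedSentence:
    """Wrap one sentence in two corpus identifiers.

    Assumption: the identifiers are an opening and a closing token,
    each treated as a single-unit "word" (tag S).
    """
    open_token = f"<{corpus_id}>"
    close_token = f"</{corpus_id}>"
    return [(open_token, "S")] + to_bmes(words) + [(close_token, "S")]


def build_joint_training_set(
    corpora: Dict[str, List[List[str]]]
) -> List[TaggedSentence]:
    """Merge several corpora into one training set, marking each sentence
    with the identifiers of the corpus it came from."""
    joint: List[TaggedSentence] = []
    for corpus_id, sentences in corpora.items():
        for words in sentences:
            joint.append(mark_corpus(words, corpus_id))
    return joint


if __name__ == "__main__":
    # Hypothetical toy corpora with different segmentation standards.
    corpora = {
        "pku": [["我", "喜欢", "北京", "大学"]],
        "msr": [["我", "喜欢", "北京大学"]],
    }
    for sentence in build_joint_training_set(corpora):
        print(sentence)
```

With the sentences marked this way, a single sequence-labeling model can see all corpora during training while the identifiers signal which segmentation standard applies to each sentence.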

Key words: Chinese word segmentation, deep learning, Bi-LSTM-CRF, heterogeneous data, joint training, corpus

CLC Number: 

  • TP391