Electronic Science and Technology (电子科技) ›› 2019, Vol. 32 ›› Issue (4): 29-33. doi: 10.16180/j.cnki.issn1007-7820.2019.04.007

Chinese Word Segmentation Based on Joint Training of Heterogeneous Data

JIANG Meng1, WANG Ziniu2, GAO Jianling1

  1. School of Big Data & Information Engineering, Guizhou University, Guiyang 550025, China
  2. Network and Information Management Center, Guizhou University, Guiyang 550025, China
  • Received: 2018-03-18 Published: 2019-04-15 Released online: 2019-03-27
  • About the authors: JIANG Meng (姜猛, b. 1994), male, master's student. Research interests: machine learning, data mining. | WANG Ziniu (王子牛, b. 1961), male, associate professor. Research interests: information and signal processing, data mining, computer simulation technology. | GAO Jianling (高建瓴, b. 1969), female, associate professor. Research interests: cloud computing, database applications.
  • Supported by:
    Guizhou Science and Technology Fund (黔科合J字[2015]2045); Guizhou University Graduate Innovation Fund (研理工2017016)

Abstract:

Chinese word segmentation is one of the key foundational technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. However, deep learning models need large-scale training data to perform well, while existing Chinese word segmentation corpora are relatively scarce and follow inconsistent annotation standards. This paper proposes a simple and effective method for handling heterogeneous data: two manually defined identifiers are added to the data from the different corpora, and the processed data are then used to jointly train a Chinese word segmentation model that combines a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). Experimental results show that the Bi-LSTM-CRF model jointly trained on heterogeneous data achieves better segmentation performance than models trained on a single corpus.

Key words: Chinese word segmentation, deep learning, Bi-LSTM-CRF, heterogeneous data, joint training, corpus
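
The abstract does not say what the two identifiers look like or where they are placed. As a minimal sketch only, assume each sentence is wrapped in an opening and a closing marker token that names its source corpus, and that both markers receive the single-character label `S` in a BMES tagging scheme. The marker format, the corpus names `corpus_a` and `corpus_b`, the helper names, and the toy sentences below are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A sentence is a pair: (characters, BMES labels), one label per character.
Sentence = Tuple[List[str], List[str]]


def add_corpus_markers(sentence: Sentence, corpus_id: str) -> Sentence:
    """Wrap a sentence in two corpus markers (assumed format, not the paper's)."""
    chars, labels = sentence
    if len(chars) != len(labels):
        raise ValueError("each character needs exactly one label")
    open_tok, close_tok = f"<{corpus_id}>", f"</{corpus_id}>"
    # Assumption: marker tokens are labelled "S" (single-character word).
    return [open_tok, *chars, close_tok], ["S", *labels, "S"]


def build_joint_training_set(corpora: Dict[str, List[Sentence]]) -> List[Sentence]:
    """Tag every sentence with its corpus markers and pool all corpora."""
    joint: List[Sentence] = []
    for corpus_id, sentences in corpora.items():
        for sentence in sentences:
            joint.append(add_corpus_markers(sentence, corpus_id))
    return joint


# Toy data: the same string segmented under two invented standards.
toy_corpora = {
    "corpus_a": [(list("北京大学"), ["B", "M", "M", "E"])],  # one word
    "corpus_b": [(list("北京大学"), ["B", "E", "B", "E"])],  # two words
}

for chars, labels in build_joint_training_set(toy_corpora):
    print(" ".join(f"{c}/{t}" for c, t in zip(chars, labels)))
```

Under this assumption the pooled sentences can be fed to a single sequence tagger, and the marker tokens give the model a signal about which annotation standard applies to each sentence.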

CLC Number: 

  • TP391