Journal of Xidian University ›› 2021, Vol. 48 ›› Issue (6): 161-171. doi: 10.19665/j.issn1001-2400.2021.06.020

• Computer Science and Technology •

Semi-supervised word sense disambiguation by combining k-means clustering and the LSTM network

ZHANG Chunxiang, ZHOU Xuesong, GAO Xueyao, LIU Huan

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
  • Received: 2020-03-11 Online: 2021-12-20 Published: 2022-02-24
  • Contact: Xueyao GAO
  • About the authors: ZHANG Chunxiang (b. 1974), male, professor, Ph.D., E-mail: z6c6x666@163.com | ZHOU Xuesong (b. 1996), female, master's student at Harbin University of Science and Technology, E-mail: 1583829471@qq.com | LIU Huan (b. 1981), male, associate professor, Ph.D., E-mail: 18473681@qq.com
  • Supported by:
    National Natural Science Foundation of China (61502124); National Natural Science Foundation of China (60903082); China Postdoctoral Science Foundation (2014M560249); Fundamental Research Funds for Universities in Heilongjiang Province (LGYC2018JC014); Natural Science Foundation of Heilongjiang Province (F2015041); Natural Science Foundation of Heilongjiang Province (F201420); Harbin Science and Technology Innovation Talent Research Special Fund (2017RALXJ016)

Abstract:

Polysemy is an inherent characteristic of natural language. Word sense disambiguation (WSD) determines the meaning of an ambiguous word from its context and is a key technology in natural language processing. WSD is now widely applied in machine translation, information retrieval, and text classification. To improve WSD accuracy, a semi-supervised WSD method is proposed that combines k-means clustering with a long short-term memory (LSTM) network. With the ambiguous word at the center, the two adjacent lexical units on its left and the two on its right are selected to form a word window of size 4. Word forms and semantic classes are extracted from the word window as clustering features, and k-means clustering is applied to the unlabeled corpus. The clustered corpus is added to the SemEval-2007: Task#5 training corpus to expand it. Word form, part of speech, semantic class, English translation, and disambiguation distance are extracted from the word window as disambiguation features, and an LSTM network is used to determine the semantic category of the ambiguous word. The expanded training corpus is used to optimize the LSTM parameters, and the SemEval-2007: Task#5 test corpus is used to test the WSD classifier. Experiments analyze the influence of the number of hidden layers and the size of the training corpus on WSD. The experimental results show that the proposed method improves WSD accuracy compared with a Bayesian classifier and a deep belief network.

Key words: word sense disambiguation, k-means clustering, long short-term memory (LSTM), clustering features, disambiguation features
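The window construction and clustering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the padding token `PAD`, the bag-of-words encoding in `bag_of_words`, and the plain Lloyd-style `k_means` loop are all assumptions made for illustration, since the abstract does not say how the clustering features are encoded or how the clusters are initialized.

```python
import random
from typing import List, Optional, Sequence, Tuple

PAD = "<PAD>"  # hypothetical padding token for windows that cross a sentence edge


def word_window(tokens: Sequence[str], target_index: int, radius: int = 2) -> List[str]:
    """Return `radius` tokens on each side of the target word (target excluded).

    With radius=2 this gives the size-4 window described in the abstract.
    """
    window = []
    for offset in range(-radius, radius + 1):
        if offset == 0:
            continue  # skip the ambiguous word itself
        i = target_index + offset
        window.append(tokens[i] if 0 <= i < len(tokens) else PAD)
    return window


def bag_of_words(window: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Assumed feature encoding: count of each vocabulary item in the window."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in window:
        if token in index:
            vector[index[token]] += 1.0
    return vector


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[List[float]], k: int, iterations: int = 20,
            seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k distinct starting points
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In a pipeline of the kind the abstract describes, each unlabeled sentence containing the ambiguous word would be reduced to a window with `word_window`, encoded as a vector, and clustered; the resulting clusters would then supply additional training instances for the LSTM classifier. How clusters are mapped to sense labels is not stated in the abstract and is left out of this sketch.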

CLC number:

  • TP391.2