Journal of Xidian University ›› 2021, Vol. 48 ›› Issue (6): 161-171. doi: 10.19665/j.issn1001-2400.2021.06.020

• Computer Science and Technology •

Semi-supervised word sense disambiguation by combining k-means clustering and the LSTM network

ZHANG Chunxiang, ZHOU Xuesong, GAO Xueyao, LIU Huan

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
  • Received: 2020-03-11 Online: 2021-12-20 Published: 2022-02-24
  • Contact: Xueyao GAO E-mail: z6c6x666@163.com; 1583829471@qq.com; xueyao_gao@163.com; 18473681@qq.com

Abstract:

Polysemy is an inherent characteristic of natural language. Word sense disambiguation (WSD) determines the meaning of an ambiguous word from its context and is a key technology in natural language processing. WSD is now widely applied in machine translation, information retrieval and text classification. To improve WSD accuracy, a semi-supervised WSD method based on k-means clustering and the Long Short Term Memory (LSTM) network is proposed. With the ambiguous word at the center, the two adjacent lexical units on its left and the two on its right are selected to form a word window of size 4. Morphology and semantic classes are extracted from the word window as clustering features. The k-means method is used to cluster the unlabeled corpus, and the clustered corpus is added to the SemEval-2007: Task#5 training corpus to enlarge it. Morphology, part-of-speech, semantic category, English translation and disambiguation distance are extracted from the word window as disambiguation features. The LSTM network is used to determine the semantic categories of ambiguous words, and its parameters are optimized on the expanded corpus. The SemEval-2007: Task#5 test corpus is used to test the WSD classifier. Experiments are conducted to analyze the influence of the number of hidden layers and the scale of the training corpus on WSD. Experimental results show that the proposed method improves WSD accuracy compared with Bayesian classifiers and deep belief networks.
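
The window construction and clustering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`word_window`, `vectorize`, `kmeans`), the `<PAD>` token and the bag-of-words vectors are assumptions made for illustration: surface forms stand in for the morphology and semantic-class features, and the abstract does not state how clusters are mapped to sense labels, so that step is left out.

```python
import random


def word_window(tokens, target_index, radius=2, pad="<PAD>"):
    """Return `radius` tokens on each side of the ambiguous word.

    With radius=2 this gives a window of size 4; positions that fall
    outside the sentence are filled with a hypothetical padding token.
    """
    left = [tokens[i] if i >= 0 else pad
            for i in range(target_index - radius, target_index)]
    right = [tokens[i] if i < len(tokens) else pad
             for i in range(target_index + 1, target_index + 1 + radius)]
    return left + right


def vectorize(windows):
    """Turn each window into a bag-of-words count vector (illustrative features)."""
    vocab = sorted({w for win in windows for w in win})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for win in windows:
        v = [0.0] * len(vocab)
        for w in win:
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors, vocab


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


if __name__ == "__main__":
    # Invented toy sentences; index 2 is the ambiguous word "bank" in each.
    sentences = [
        ["the", "river", "bank", "was", "muddy"],
        ["a", "river", "bank", "was", "flooded"],
        ["the", "central", "bank", "raised", "rates"],
        ["a", "central", "bank", "raised", "funds"],
    ]
    windows = [word_window(s, 2) for s in sentences]
    vectors, _ = vectorize(windows)
    clusters, _ = kmeans(vectors, k=2)
    print(list(zip(windows, clusters)))
```

In the method described above, the clustered unlabeled instances would then be added to the labeled training corpus before the LSTM is trained. A library implementation of k-means would normally replace the hand-written loop.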

Key words: word sense disambiguation, k-means clustering, Long Short Term Memory, clustering features, disambiguation features

CLC Number: 

  • TP391.2