Electronic Science and Technology, 2023, Vol. 36, Issue (3): 7-13. doi: 10.16180/j.cnki.issn1007-7820.2023.03.002

Text Keyword Extraction Method Based on BERT and LightGBM

HE Chuanpeng,YIN Ling,HUANG Bo,WANG Mingsheng,GUO Ruyan,ZHANG Shuai,JU Jiaji   

  1. School of Electronic and Electrical Engineering,Shanghai University of Engineering Science,Shanghai 201620,China
  • Received:2021-08-21 Online:2023-03-15 Published:2023-03-16
  • Supported by:
    National Natural Science Foundation of China(61802251)

Abstract:

Traditional text keyword extraction methods ignore contextual semantic information and cannot resolve word ambiguity, so their extraction performance is unsatisfactory. Building on the LDA and BERT models, this study proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The LDA topic model is used to obtain the topic of each review and its word distribution, and candidate keywords are selected according to a threshold. The selected words are concatenated with the original review text and fed into the BERT model, which is trained to produce word vectors that incorporate the topic of the text. Keyword extraction is then treated as a binary classification problem and solved with the LightGBM algorithm. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm and the proposed LB-LightGBM model in terms of precision P, recall R and F1 for text keyword extraction. The results show that when TopN is between 3 and 6, the average F1 of the proposed model is 3.5% higher than that of the best-performing comparison method, indicating that the proposed method generally outperforms the comparison methods selected in the experiment and finds text keywords more accurately.
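
The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of how P, R and F1 at a given TopN are commonly computed for keyword extraction, using the standard definitions. The function name `prf_at_top_n`, the ranked list and the gold keywords are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def prf_at_top_n(
    ranked: Sequence[str], gold: Iterable[str], top_n: int
) -> Tuple[float, float, float]:
    """Precision, recall and F1 of the top_n ranked keywords against gold keywords."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    gold_set = set(gold)
    # Drop duplicate predictions while keeping rank order, then cut to top_n.
    predicted = list(dict.fromkeys(ranked))[:top_n]
    hits = sum(1 for word in predicted if word in gold_set)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of P and R; it is defined as 0 when nothing matches.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical model output (best first) and hypothetical annotated keywords.
ranked_keywords = ["battery", "screen", "price", "delivery", "color"]
gold_keywords = {"battery", "price", "camera"}

for n in range(3, 7):  # TopN = 3..6, the range reported in the abstract
    p, r, f1 = prf_at_top_n(ranked_keywords, gold_keywords, n)
    print(f"TopN={n}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Averaging the F1 values over TopN = 3 to 6 for each method would give the kind of summary figure the abstract reports. Whether the paper averages per document or over the whole corpus is not stated here.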

Key words: topic model, word vector, BERT, LightGBM, candidate, extraction, text theme

CLC Number: 

  • TP391.1