Electronic Science and Technology ›› 2023, Vol. 36 ›› Issue (3): 7-13. doi: 10.16180/j.cnki.issn1007-7820.2023.03.002


Text Keyword Extraction Method Based on BERT and LightGBM

HE Chuanpeng, YIN Ling, HUANG Bo, WANG Mingsheng, GUO Ruyan, ZHANG Shuai, JU Jiaji

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2021-08-21 Online: 2023-03-15 Published: 2023-03-16
  • About the authors: HE Chuanpeng (b. 1996), male, master's student. Research interest: text sentiment analysis. | YIN Ling (b. 1987), female, PhD, lecturer. Research interests: time series analysis and forecasting, deep learning.
  • Supported by:
    National Natural Science Foundation of China (61802251)

Abstract:

Traditional text keyword extraction methods ignore contextual semantic information and cannot handle polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this study proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The method uses the LDA topic model to obtain the topics of each review and their word distributions, and selects candidate keywords according to a threshold. The selected words are concatenated with the original review text and fed into the BERT model for word vector training, yielding word vectors that incorporate the text's topic information. Keyword extraction is thereby recast as a binary classification problem, which is solved with the LightGBM algorithm. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm and the proposed LB-LightGBM model in terms of precision P, recall R and F1 for text keyword extraction. The results show that when TopN ranges from 3 to 6, the average F1 is 3.5% higher than that of the best comparison method; overall, the proposed method outperforms the selected comparison methods and identifies text keywords more accurately.

Key words: topic model, word vector, BERT, LightGBM, candidate, extraction, text theme
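
Two steps named in the abstract, threshold-based candidate selection and TopN scoring with P, R and F1, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold of 0.10, the sample word probabilities and the gold keyword set are all hypothetical, and the metrics follow the conventional definitions, since the abstract does not state how they are computed. The LDA, BERT and LightGBM stages are omitted.

```python
from typing import Dict, List, Sequence, Set, Tuple


def filter_candidates(word_probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep words whose topic-word probability is at least `threshold`,
    ordered from most to least probable (ties broken alphabetically)."""
    kept = [(word, prob) for word, prob in word_probs.items() if prob >= threshold]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept]


def precision_recall_f1(
    predicted: Sequence[str], gold: Set[str], top_n: int
) -> Tuple[float, float, float]:
    """Score the first `top_n` predicted keywords against the gold set.

    Assumes `predicted` contains no duplicate words.
    """
    top = list(predicted)[:top_n]
    hits = sum(1 for word in top if word in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical topic-word distribution for one review (illustrative values only).
word_probs = {
    "battery": 0.30,
    "screen": 0.25,
    "price": 0.20,
    "delivery": 0.15,
    "box": 0.05,
    "ok": 0.05,
}
candidates = filter_candidates(word_probs, threshold=0.10)

# Hypothetical manually labeled keywords for the same review.
gold = {"battery", "price", "camera"}
p, r, f1 = precision_recall_f1(candidates, gold, top_n=3)
print(candidates, round(p, 3), round(r, 3), round(f1, 3))
```

In the pipeline the abstract describes, the filtered candidates would next be concatenated with the review text and encoded by BERT, and a LightGBM classifier would label each candidate as keyword or non-keyword; the scoring function applies to whatever ranked list that classifier produces.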

CLC number:

  • TP391.1