Electronic Science and Technology (电子科技) ›› 2020, Vol. 33 ›› Issue (10): 51-56. doi: 10.16180/j.cnki.issn1007-7820.2020.10.009

Chinese Short Text Similarity Calculation Based on TextRank Algorithm
(融合TextRank算法的中文短文本相似度计算)

LU Jiawei (卢佳伟), CHEN Wei (陈玮), YIN Zhong (尹钟)

  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2019-07-21 Online: 2020-10-15 Published: 2020-10-20
  • About the authors: LU Jiawei (1995-), male, Master's student. Research interest: natural language understanding. | CHEN Wei (1964-), female, associate professor. Research interests: image processing and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61703277)

Abstract:

The traditional vector space model (VSM) ignores text semantics, and the text feature matrix it constructs is sparse. Building on deep-learning word vector technology, this paper proposes a similarity calculation method that integrates an improved TextRank algorithm. The method uses word vector embedding to build the text vector space, so that the resulting vector space model captures semantic relatedness. At the same time, the improved TextRank algorithm extracts text keywords, which strengthens the expression of text features and eliminates a large amount of redundant information. This reduces the sparsity of the text feature matrix and makes text similarity calculation more efficient. Simulation experiments with different models show that combining the improved TextRank algorithm with BERT word vector technology yields better text similarity calculation performance.

Key words: text similarity, extraction, TextRank algorithm, BERT, word vector technique, vector space model
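
The pipeline summarized in the abstract (keyword extraction, word-vector text representation, similarity calculation) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: it uses plain TextRank on a word co-occurrence graph, because the abstract does not describe the improved variant; it takes a small hand-made embedding table in place of BERT word vectors; it averages keyword vectors and compares texts by cosine similarity; and it expects input that is already segmented into tokens. All function and parameter names (`textrank_keywords`, `text_vector`, `similarity`, `window`, `top_k`) are hypothetical.

```python
import math
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank tokens with plain TextRank (assumption: not the paper's improved variant)."""
    # Link each token to the distinct tokens among the next `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        # No co-occurrence edges: fall back to the first distinct tokens.
        return list(dict.fromkeys(tokens))[:top_k]

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each node receives an equal share of the score of every neighbor.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def text_vector(keywords, embeddings):
    """Average the vectors of keywords found in the (assumed) embedding table."""
    known = [embeddings[w] for w in keywords if w in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(tokens_a, tokens_b, embeddings, top_k=3):
    """Compare two tokenized texts through the averaged vectors of their keywords."""
    vec_a = text_vector(textrank_keywords(tokens_a, top_k), embeddings)
    vec_b = text_vector(textrank_keywords(tokens_b, top_k), embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)
```

Restricting the text vector to the top-ranked keywords mirrors the idea stated in the abstract: words with low graph centrality are dropped before the vectors are combined, so fewer features enter the similarity calculation. In practice the toy embedding table would be replaced by vectors from a trained model, and the tokens would come from a Chinese word segmenter.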

CLC Number:

  • TP391