Electronic Science and Technology, 2020, Vol. 33, Issue (10): 51-56. doi: 10.16180/j.cnki.issn1007-7820.2020.10.009
LU Jiawei, CHEN Wei, YIN Zhong
Received: 2019-07-21
Online: 2020-10-15
Published: 2020-10-20
LU Jiawei, CHEN Wei, YIN Zhong. Chinese Short Text Similarity Calculation Based on TextRank Algorithm[J]. Electronic Science and Technology, 2020, 33(10): 51-56.
Table 1
F1 values of each algorithm model by number of extracted keywords
Number of keywords | TI+cos | TI+em | TI+mean | TR+cos | TR+em | TR+mean | Imp-TR+cos | Imp-TR+em | Imp-TR+mean |
---|---|---|---|---|---|---|---|---|---|
6 | 0.589 | 0.378 | 0.571 | 0.720 | 0.347 | 0.728 | 0.821 | 0.491 | 0.824 |
7 | 0.607 | 0.438 | 0.619 | 0.795 | 0.444 | 0.810 | 0.857 | 0.596 | 0.863 |
8 | 0.630 | 0.498 | 0.634 | 0.848 | 0.545 | 0.846 | 0.871 | 0.682 | 0.877 |
9 | 0.638 | 0.552 | 0.647 | 0.886 | 0.671 | 0.879 | 0.891 | 0.775 | 0.889 |
10 | 0.649 | 0.588 | 0.652 | 0.888 | 0.792 | 0.894 | 0.904 | 0.836 | 0.903 |
11 | 0.661 | 0.600 | 0.661 | 0.885 | 0.835 | 0.882 | 0.892 | 0.868 | 0.883 |
12 | 0.656 | 0.619 | 0.661 | 0.885 | 0.871 | 0.864 | 0.893 | 0.903 | 0.879 |
13 | 0.662 | 0.643 | 0.661 | 0.881 | 0.877 | 0.853 | 0.878 | 0.902 | 0.845 |
14 | 0.660 | 0.648 | 0.662 | 0.878 | 0.887 | 0.849 | 0.865 | 0.900 | 0.838 |
15 | 0.661 | 0.659 | 0.661 | 0.86 | 0.881 | 0.826 | 0.860 | 0.897 | 0.829 |
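
Reading Table 1 column by column, each method peaks at a particular keyword count, and those peaks appear to match the numbers in parentheses after the method names in Table 2 (for example, TR+cos peaks at 10 keywords with F1 = 0.888). The following is a minimal Python sketch of that lookup, not the authors' code; the variable and function names are hypothetical, and only four of the nine columns are copied in to keep it short.

```python
# F1 values copied from four columns of Table 1.
# List position i corresponds to keyword count 6 + i (rows 6 through 15).
keyword_counts = list(range(6, 16))
f1_by_method = {
    "TI+cos": [0.589, 0.607, 0.630, 0.638, 0.649, 0.661, 0.656, 0.662, 0.660, 0.661],
    "TR+cos": [0.720, 0.795, 0.848, 0.886, 0.888, 0.885, 0.885, 0.881, 0.878, 0.86],
    "Imp-TR+cos": [0.821, 0.857, 0.871, 0.891, 0.904, 0.892, 0.893, 0.878, 0.865, 0.860],
    "Imp-TR+em": [0.491, 0.596, 0.682, 0.775, 0.836, 0.868, 0.903, 0.902, 0.900, 0.897],
}


def best_keyword_count(scores):
    """Return (keyword count, F1) for the highest F1 in one column.

    Ties resolve to the smallest keyword count, because max() keeps
    the first maximal index it encounters.
    """
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return keyword_counts[best_index], scores[best_index]


best = {method: best_keyword_count(scores) for method, scores in f1_by_method.items()}

for method, (count, f1) in best.items():
    print(f"{method}: best F1 {f1:.3f} at {count} keywords")
```

For these four columns the selected counts are 13, 10, 10, and 12, which agree with the TI(13), TR(10), Imp-TR(10), and Imp-TR(12) labels in Table 2.
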
Table 2
Test results on the artificial data set
Method | P | R | F1 |
---|---|---|---|
TF-IDF+cos | 1.000 | 0.557 | 0.716 |
TF-IDF+em | 0.660 | 1.000 | 0.795 |
W2v+cos | 1.000 | 0.549 | 0.709 |
W2v+em | 1.000 | 0.543 | 0.704 |
W2v+mean | 1.000 | 0.509 | 0.675 |
W2v+cos+TR(10) | 0.895 | 0.882 | 0.888 |
W2v+em+TR(14) | 0.920 | 0.856 | 0.887 |
W2v+mean+TR(10) | 0.930 | 0.861 | 0.894 |
W2v+cos+TI(13) | 0.975 | 0.501 | 0.662 |
W2v+em+TI(15) | 0.965 | 0.500 | 0.659 |
W2v+mean+TI(14) | 0.980 | 0.500 | 0.662 |
W2v+cos+Imp-TR(10) | 0.960 | 0.853 | 0.904 |
W2v+em+Imp-TR(12) | 0.885 | 0.922 | 0.903 |
W2v+mean+Imp-TR(10) | 0.975 | 0.841 | 0.903 |
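
The P, R, and F1 columns are related by the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below, assuming that standard definition is the one used here, recomputes F1 for three rows of Table 2; the function name `f1_score` and the row selection are illustrative choices, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, P, R, reported F1) copied from three rows of Table 2.
rows = [
    ("TF-IDF+em", 0.660, 1.000, 0.795),
    ("W2v+cos+TR(10)", 0.895, 0.882, 0.888),
    ("W2v+mean+Imp-TR(10)", 0.975, 0.841, 0.903),
]

for method, p, r, reported in rows:
    computed = f1_score(p, r)
    print(f"{method}: computed {computed:.3f}, reported {reported:.3f}")
```

For these three rows the recomputed values round to the reported ones. Some other rows (TF-IDF+cos, for instance) differ by 0.001 when recomputed this way, which would be consistent with P and R having been rounded to three decimals before publication.
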
Table 3
Test results on the THUCNews data set (Word2vec)
Method | P | R | F1 |
---|---|---|---|
W2v+cos+TR | 0.873 | 0.867 | 0.870 |
W2v+em+TR | 0.795 | 0.774 | 0.785 |
W2v+mean+TR | 0.634 | 0.986 | 0.772 |
W2v+cos+TI | 0.824 | 0.917 | 0.868 |
W2v+em+TI | 0.683 | 0.889 | 0.772 |
W2v+mean+TI | 0.584 | 0.989 | 0.735 |
W2v+cos+Imp-TR | 0.878 | 0.881 | 0.880 |
W2v+em+Imp-TR | 0.784 | 0.853 | 0.817 |
W2v+mean+Imp-TR | 0.650 | 0.995 | 0.787 |
Table 4
Test results on the THUCNews data set (BERT)
Method | P | R | F1 |
---|---|---|---|
BERT+cos+TR | 0.921 | 0.917 | 0.919 |
BERT+em+TR | 0.944 | 0.785 | 0.857 |
BERT+mean+TR | 0.967 | 0.793 | 0.871 |
BERT+cos+TI | 0.930 | 0.852 | 0.889 |
BERT+em+TI | 0.689 | 0.995 | 0.814 |
BERT+mean+TI | 0.974 | 0.824 | 0.893 |
BERT+cos+Imp-TR | 0.891 | 0.978 | 0.932 |
BERT+em+Imp-TR | 0.936 | 0.867 | 0.901 |
BERT+mean+Imp-TR | 0.847 | 0.947 | 0.894 |