Electronic Science and Technology (电子科技) ›› 2023, Vol. 36 ›› Issue (12): 72-78. doi: 10.16180/j.cnki.issn1007-7820.2023.12.010


Research on Generating News Text Summarization Based on Improved T5 PEGASUS Model

ZHANG Qi, FAN Yongsheng

  1. School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Received: 2022-08-10 Online: 2023-12-15 Published: 2023-12-05
  • About the authors: ZHANG Qi (1997-), female, M.S. candidate. Research interest: natural language processing. | FAN Yongsheng (1970-), male, Ph.D., associate professor. Research interests: big data and natural language processing.
  • Supported by:
    Humanities and Social Science Research Project of the Ministry of Education (18XJC880002); Science and Technology Project of Chongqing Education Commission (KJQN201800539); Chongqing Normal University (Talent Introduction/Doctoral Program) Foundation (17XCB008)


Abstract:

The task of news text summarization aims to address the time loss and reading fatigue caused by users' inability to quickly grasp the key points when reading news. At present, the best-performing text summarization model for Chinese is T5 PEGASUS, but little research has been conducted on this model. This study improves the Chinese word segmentation of the T5 PEGASUS model by adopting the Pkuseg segmentation method, which is better suited to the news domain, and verifies its effectiveness on three public datasets with different news lengths: NLPCC2017, LCSTS and SogouCS. The results show that Pkuseg segmentation is more suitable for the T5 PEGASUS model, that the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores of the generated summaries are positively correlated with news text length, and that the training loss and its rate of decrease are negatively correlated with news text length. The model also achieves high ROUGE scores with only a small amount of training data, demonstrating a strong few-shot learning ability.
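As background for the evaluation metric cited above, the following is a minimal sketch of ROUGE-1 recall, the unigram-overlap variant of ROUGE, in plain Python. The `rouge1_recall` helper and the toy token lists are illustrative assumptions, not the authors' evaluation code; for Chinese news text the tokens would come from a word segmenter such as Pkuseg rather than whitespace splitting.

```python
from collections import Counter

def rouge1_recall(reference: list[str], candidate: list[str]) -> float:
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Clipped overlap: each reference token is matched at most as many
    # times as it appears in the candidate summary.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / max(len(reference), 1)

# Toy example with pre-segmented tokens (hypothetical summaries).
reference = ["the", "model", "generates", "a", "short", "summary"]
candidate = ["the", "model", "writes", "a", "summary"]
print(round(rouge1_recall(reference, candidate), 3))  # 4 of 6 reference tokens matched
```

Reported ROUGE scores typically also include ROUGE-2 (bigram overlap) and ROUGE-L (longest common subsequence), which follow the same recall-oriented idea.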

Key words: text summarization, generative model, T5 PEGASUS, news text, Chinese word segmentation, Pkuseg, few-shot learning, ROUGE

CLC number: TP391.1