电子科技 (Electronic Science and Technology), 2022, Vol. 35, Issue (12): 72-77. doi: 10.16180/j.cnki.issn1007-7820.2022.12.010


Visual Question Answer Transmission Attention Network Based on Multi-Modal Fusion

WANG Mao, PENG Yaxiong, LU Anjiang

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Received: 2021-05-10  Online: 2022-12-15  Published: 2022-12-13
  • About the authors: WANG Mao (1998-), female, master's degree candidate. Research interests: signal and information systems. | PENG Yaxiong (1963-), male, associate professor. Research interests: digital communication technology, audio and video processing. | LU Anjiang (1978-), male, PhD, associate professor. Research interests: embedded systems and integration technology, Internet of Things security, micro-sensing technology.
  • Supported by:
    Major Science and Technology Project of Guizhou ([2016]3022); Guizhou Province Science and Technology Achievement Transformation Project ([2017]4856)

Abstract:

To address the inability of traditional visual question answering methods to fully capture the complex correlations between multi-modal features, this study proposes a visual question answering transmission attention network based on multi-modal fusion. In the feature extraction stage, GloVe word embeddings followed by an LSTM are used to extract question features, and a ResNet-152 network is adopted to extract image features. Multi-modal fusion is performed through a three-layer transmission attention network to learn a global multi-modal embedding, which is then used to recalibrate the input features. In addition, a multi-modal transmission attention learning architecture is designed: through overlapping computations in the transmission network, the combined features focus on the fine-grained parts of the image and the question, which improves the accuracy of the predicted answers. Experimental results on the VQA v1.0 dataset show that the overall accuracy of the model reaches 69.92%, significantly outperforming five mainstream visual question answering models and demonstrating the effectiveness and robustness of the proposed model.
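The abstract only outlines the architecture, so the following is a minimal, hypothetical PyTorch sketch of the pipeline it describes: GloVe + LSTM question encoding, ResNet-152 region features, and a stack of three attention-based fusion layers whose joint embedding recalibrates the input features. The layer sizes, the gating-style recalibration, and the accumulation of the three layers' outputs are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the described pipeline; details are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionEncoder(nn.Module):
    """GloVe-initialized embedding followed by an LSTM (as in the abstract)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if glove_weights is not None:                      # pre-trained GloVe vectors, if available
            self.embed.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                             # tokens: (B, T)
        emb = self.embed(tokens)
        _, (h, _) = self.lstm(emb)
        return h[-1]                                       # (B, hidden_dim) question feature


class TransmissionAttentionLayer(nn.Module):
    """One fusion layer: attends over image regions with the question, builds a joint
    embedding, and uses it to recalibrate both inputs (a hypothetical reading of
    'recalibrate the input features')."""
    def __init__(self, img_dim=2048, q_dim=1024, joint_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.att = nn.Linear(joint_dim, 1)
        self.gate_img = nn.Linear(joint_dim, img_dim)
        self.gate_q = nn.Linear(joint_dim, q_dim)

    def forward(self, img_feats, q_feat):                  # img: (B, R, img_dim), q: (B, q_dim)
        v = self.img_proj(img_feats)                       # (B, R, J)
        q = self.q_proj(q_feat).unsqueeze(1)               # (B, 1, J)
        scores = self.att(torch.tanh(v + q)).squeeze(-1)   # (B, R) region relevance
        alpha = F.softmax(scores, dim=-1)
        attended = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)   # (B, J)
        joint = attended * q.squeeze(1)                    # global multi-modal embedding
        # Recalibrate (re-weight) the original features with the joint embedding.
        img_feats = img_feats * torch.sigmoid(self.gate_img(joint)).unsqueeze(1)
        q_feat = q_feat * torch.sigmoid(self.gate_q(joint))
        return img_feats, q_feat, joint


class TransmissionAttentionVQA(nn.Module):
    """Stacks three fusion layers (the abstract's three-layer network) and
    classifies over a fixed answer set."""
    def __init__(self, vocab_size, num_answers, num_layers=3):
        super().__init__()
        self.q_enc = QuestionEncoder(vocab_size)
        self.layers = nn.ModuleList([TransmissionAttentionLayer() for _ in range(num_layers)])
        self.classifier = nn.Linear(1024, num_answers)

    def forward(self, img_feats, tokens):
        # img_feats are assumed to be region features from a frozen ResNet-152
        # (e.g. the 7x7x2048 final convolutional map flattened into 49 regions).
        q_feat = self.q_enc(tokens)
        fused = 0
        for layer in self.layers:
            img_feats, q_feat, joint = layer(img_feats, q_feat)
            fused = fused + joint                          # accumulate the layers' joint embeddings
        return self.classifier(fused)                      # answer logits


if __name__ == "__main__":
    model = TransmissionAttentionVQA(vocab_size=10000, num_answers=3000)
    img = torch.randn(2, 49, 2048)                         # 2 images, 49 regions each
    qst = torch.randint(0, 10000, (2, 14))                 # 2 questions, 14 tokens each
    print(model(img, qst).shape)                           # torch.Size([2, 3000])
```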

Key words: visual question answering, multi-modal features, combined features, multi-modal embedding, attention, transmission network, fine-grained, multi-modal fusion

CLC Number:

  • TP391