Electronic Science and Technology ›› 2022, Vol. 35 ›› Issue (12): 72-77. doi: 10.16180/j.cnki.issn1007-7820.2022.12.010

Visual Question Answer Transmission Attention Network Based on Multi-Modal Fusion

WANG Mao, PENG Yaxiong, LU Anjiang

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Received: 2021-05-10  Online: 2022-12-15  Published: 2022-12-13
  • Supported by:
    Major Science and Technology Project of Guizhou ([2016]3022); Guizhou Province Science and Technology Achievement Transformation Project ([2017]4856)

Abstract:

To address the inability of traditional visual question answering models to fully capture the complex correlations between multi-modal features, this study proposes a visual question answering transmission attention network based on multi-modal fusion. In the feature extraction stage, GloVe word embeddings combined with an LSTM are used to extract question features, and a ResNet-152 network is adopted to extract image features. Multi-modal fusion is performed through a three-layer transmission attention network to learn a global multi-modal embedding, which is then used to recalibrate the input features. In addition, a multi-modal transmission attention learning architecture is designed: through stacked calculations over the transmission network, the combined features focus on fine-grained parts of the image and the question, which improves the accuracy of the predicted answer. Experimental results on the VQA v1.0 dataset show that the overall accuracy of the model reaches 69.92%, an improvement to varying degrees over five mainstream visual question answering models, demonstrating the effectiveness and robustness of the proposed model.
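The sketch below illustrates the pipeline described in the abstract in PyTorch: a GloVe + LSTM question encoder, a ResNet-152 image encoder, and a stack of three attention layers that pass a refined multi-modal embedding from one layer to the next and use it to recalibrate the question feature. The layer dimensions, the exact attention formulation, and the recalibration step are assumptions for illustration only, since the abstract does not give the model's equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152


class QuestionEncoder(nn.Module):
    """Encode a question with word embeddings followed by an LSTM."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        # The paper uses GloVe vectors; here the embedding is randomly
        # initialised and could be overwritten with GloVe weights.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens):              # (B, T) int64 token ids
        embedded = self.embedding(question_tokens)   # (B, T, embed_dim)
        _, (h_n, _) = self.lstm(embedded)
        return h_n.squeeze(0)                        # (B, hidden_dim)


class ImageEncoder(nn.Module):
    """Extract region features with a ResNet-152 backbone (classifier removed)."""
    def __init__(self):
        super().__init__()
        backbone = resnet152(weights=None)           # pretrained weights optional
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):                       # (B, 3, H, W)
        fmap = self.features(images)                 # (B, 2048, h, w)
        B, C, h, w = fmap.shape
        return fmap.view(B, C, h * w).permute(0, 2, 1)   # (B, regions, 2048)


class TransmissionAttentionLayer(nn.Module):
    """One attention layer (assumed form): a joint embedding attends over image
    regions and the attended context recalibrates the question representation."""
    def __init__(self, img_dim=2048, q_dim=1024, joint_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.att = nn.Linear(joint_dim, 1)
        self.recalib = nn.Linear(joint_dim, q_dim)

    def forward(self, img_regions, q_feat):
        joint = torch.tanh(self.img_proj(img_regions)
                           + self.q_proj(q_feat).unsqueeze(1))   # (B, regions, joint_dim)
        weights = F.softmax(self.att(joint), dim=1)              # (B, regions, 1)
        attended = (weights * self.img_proj(img_regions)).sum(dim=1)
        # Recalibrate the question feature with the attended visual context.
        return q_feat + self.recalib(attended)


class VQATransmissionNet(nn.Module):
    """Overall sketch: three stacked attention layers transmit the refined
    multi-modal embedding forward; a linear classifier predicts the answer."""
    def __init__(self, vocab_size, num_answers, num_layers=3):
        super().__init__()
        self.q_enc = QuestionEncoder(vocab_size)
        self.v_enc = ImageEncoder()
        self.layers = nn.ModuleList(TransmissionAttentionLayer()
                                    for _ in range(num_layers))
        self.classifier = nn.Linear(1024, num_answers)

    def forward(self, images, question_tokens):
        img_regions = self.v_enc(images)
        q_feat = self.q_enc(question_tokens)
        for layer in self.layers:
            q_feat = layer(img_regions, q_feat)
        return self.classifier(q_feat)               # answer logits
```

As a quick shape check, `VQATransmissionNet(vocab_size=10000, num_answers=3000)` applied to a batch of images of shape (B, 3, 224, 224) and token ids of shape (B, T) returns logits of shape (B, 3000).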

Key words: visual question answering, multi-modal features, combined features, multi-modal embedding, attention, transmission network, fine-grained, multi-modal fusion

CLC Number: 

  • TP391