Journal of Xidian University ›› 2024, Vol. 51 ›› Issue (4): 128-138. DOI: 10.19665/j.issn1001-2400.20240302

• Computer Science and Technology & Cyberspace Security •

Joint feature approach for image-text cross-modal retrieval

GAO Dihui1,2, SHENG Lijie1,2, XU Xiaodong1,2, MIAO Qiguang1,2

  1. School of Computer Science and Technology, Xidian University, Xi’an 710071, China
  2. Key Laboratory of Big Data and Intelligent Vision, Xidian University, Xi’an 710071, China
  • Received: 2023-07-10 Online: 2024-08-20 Published: 2024-03-13
  • Contact: SHENG Lijie E-mail: dhgao@stu.xidian.edu.cn; ljsheng@xidian.edu.cn; xuxiaodong@stu.xidian.edu.cn; qgmiao@xidian.edu.cn

Abstract:

With the rapid development of deep learning, cross-modal retrieval performance has improved significantly. However, existing methods either match an image and a text only as a whole or rely only on local information for matching, so image and text information is not fully exploited and retrieval performance remains limited. To fully exploit the latent semantic relationship between images and texts, this paper proposes a cross-modal retrieval model based on joint features. In the feature extraction stage, two sub-networks process the local and global features of images and texts respectively, and a bilinear layer based on the attention mechanism is designed to filter out redundant information. In the loss function, a triplet ranking loss and a semantic label classification loss are combined to jointly optimize the features. The proposed approach is also broadly applicable and can effectively improve the performance of models that rely only on local information. A series of experiments on the public datasets Flickr30k and MS COCO shows that the proposed model effectively improves performance on cross-modal image-text retrieval tasks. On the Flickr30k retrieval task, the proposed model improves R@1 by 5.1% for text retrieval and by 2.8% for image retrieval.
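The two components the abstract describes, an attention-based bilinear layer that filters redundant fused features and a joint objective combining a triplet ranking loss with a semantic label classification loss, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, the sigmoid gating form, in-batch hardest-negative mining, and the multi-label BCE classification term are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBilinearGate(nn.Module):
    """Hypothetical attention-gated bilinear layer: fuses local and
    global features and suppresses redundant dimensions with a
    learned sigmoid gate (an assumed form of the paper's design)."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, local_feat, global_feat):
        fused = self.bilinear(local_feat, global_feat)
        attn = torch.sigmoid(self.gate(torch.cat([local_feat, global_feat], dim=-1)))
        return F.normalize(attn * fused, dim=-1)  # gated, L2-normalized embedding

def joint_loss(img_emb, txt_emb, logits, labels, margin=0.2, alpha=1.0):
    """Triplet ranking loss over in-batch hardest negatives plus a
    semantic label classification term (multi-label BCE is assumed).
    Embeddings are assumed L2-normalized, so dot products are cosines."""
    sims = img_emb @ txt_emb.t()              # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)            # matched pairs on the diagonal
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg = sims.masked_fill(mask, -1.0)        # exclude positives from negatives
    # hardest negative per image (rows) and per text (columns)
    loss_i2t = F.relu(margin - pos + neg.max(dim=1, keepdim=True).values).mean()
    loss_t2i = F.relu(margin - pos.t() + neg.max(dim=0, keepdim=True).values).mean()
    cls = F.binary_cross_entropy_with_logits(logits, labels)
    return loss_i2t + loss_t2i + alpha * cls
```

The weight alpha balancing the ranking and classification terms is a hypothetical hyperparameter; the abstract states only that the two losses are jointly optimized.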

Key words: cross-modal retrieval, deep learning, self-attention network, image retrieval

CLC Number: TP391