[1] WANG L, LI Y, LAZEBNIK S. Learning Deep Structure-Preserving Image-Text Embeddings[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 5005-5013.
[2] KARPATHY A, LI F. Deep Visual-Semantic Alignments for Generating Image Descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[3] LEE K H, CHEN X, HUA G, et al. Stacked Cross Attention for Image-Text Matching[C]//Proceedings of the European Conference on Computer Vision. Heidelberg: Springer, 2018: 201-216.
[4] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models (2014)[J/OL]. [2014-11-10]. https://arxiv.org/abs/1411.2539.
[5] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (2017)[J/OL]. [2017-07-18]. https://arxiv.org/abs/1707.05612.
[6] QU L, LIU M, CAO D, et al. Context-Aware Multi-View Summarization Network for Image-Text Matching[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1047-1055.
[7] MESSINA N, AMATO G, ESULI A, et al. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(4): 1-23.
[8] ZHANG K, MAO Z, WANG Q, et al. Negative-Aware Attention Framework for Image-Text Matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 15661-15670.
[9] PAN Z, WU F, ZHANG B. Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19275-19284.
[10] FU Z, MAO Z, SONG Y, et al. Learning Semantic Relationship Among Instances for Image-Text Matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 15159-15168.
[11] JIANG D, YE M. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 2787-2797.
[12] HUANG R, LONG Y, HAN J, et al. NLIP: Noise-Robust Language-Image Pre-Training[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2023, 37(1): 926-934.
[13] YANG A, PAN J, LIN J, et al. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese (2022)[J/OL]. [2022-11-02]. https://arxiv.org/abs/2211.01335.
[14] LI J, SELVARAJU R, GOTMARE A, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation[J]. Advances in Neural Information Processing Systems, 2021, 34: 9694-9705.
[15] LI J, LI D, XIONG C, et al. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation[C]//Proceedings of the 39th International Conference on Machine Learning. New York: PMLR, 2022: 12888-12900.
[16] LI J, LI D, SAVARESE S, et al. BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models[C]//Proceedings of the 40th International Conference on Machine Learning. New York: PMLR, 2023: 19730-19742.
[17] JIANG D, YE M. Transformer Network for Cross-Modal Text-to-Image Person Re-Identification[J]. Journal of Image and Graphics, 2023, 28(5): 1384-1395. (in Chinese)
[18] QI J, PENG Y, YUAN Y. Cross-Media Multi-Level Alignment with Relation Attention Network (2018)[J/OL]. [2018-04-25]. https://arxiv.org/abs/1804.09539.
[19] ZHANG Y, ZHOU W, WANG M, et al. Deep Relation Embedding for Cross-Modal Retrieval[J]. IEEE Transactions on Image Processing, 2020, 30: 617-627.
[20] JI Z, WANG H, HAN J, et al. Saliency-Guided Attention Network for Image-Sentence Matching[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5754-5763.
[21] HE K, ZHANG X, REN S, et al. Deep Residual Learning for Image Recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[22] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[23] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[24] KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All You Need[C]//Advances in Neural Information Processing Systems. San Diego: NeurIPS, 2017: 5998-6008.
[26] DEVLIN J, CHANG M, LEE K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding (2018)[J/OL]. [2018-10-11]. https://arxiv.org/abs/1810.04805.
[27] YOUNG P, LAI A, HODOSH M, et al. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1): 67-78.
[28] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common Objects in Context[C]//Proceedings of the European Conference on Computer Vision. Heidelberg: Springer, 2014: 740-755.
[29] LIU C, MAO Z, LIU A, et al. Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching[C]//Proceedings of the 27th ACM International Conference on Multimedia. New York: ACM, 2019: 3-11.
[30] CHEN H, DING G, LIU X, et al. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4321-4329.