Electronic Science and Technology ›› 2025, Vol. 38 ›› Issue (9): 20-25. DOI: 10.16180/j.cnki.issn1007-7820.2025.09.003

Cross-Fusion Encoder-Based Transformer Image Feature Extraction Network

GONG Yu, WU Peng

  1. School of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China
  • Received: 2024-01-15  Revised: 2024-02-25  Online: 2025-09-15  Published: 2025-09-23
  • Corresponding author: WU Peng (1980-), male, Ph.D., associate professor, E-mail: wupeng@zstu.edu.cn. Research interests: linear control, human-computer interaction, image processing.
  • About the author: GONG Yu (1997-), male, master's degree candidate. Research interests: computer vision, video-driven methods.
  • Supported by: Natural Science Foundation of Zhejiang Province (LY21F010016)


Abstract:

In view of the problems that window-based vision Transformers tend to destroy fine-grained features and carry a large number of parameters, this study proposes a Transformer image feature extraction network based on a cross-fusion encoder module. Exploiting the correlation consistency of image channel features, the feature map is split into two subsets; two attention modules connected in parallel perform attention computation on them to obtain local and global information respectively, and a cross mechanism is adopted to fuse the two streams of information. Building on the inter-window attention module of the CAT Transformer, an intra-window attention scheme operating across the channel dimension of the feature map is designed to avoid destroying texture information and to enhance the representation ability of local features. Experimental results show that the proposed model achieves 79.86% Top-1 accuracy with 7.8 MB of parameters on the CIFAR-100 dataset and 80.7% accuracy on the ImageNet-1K dataset. Grad-CAM (Gradient-weighted Class Activation Mapping) is also used to visualize the model's decision-making process.

Key words: computer vision, image classification, self-attention, feature extraction, contextual information, encoder, channel feature, convolutional neural network

CLC number: TP391.4    Document code: A