电子科技 ›› 2024, Vol. 37 ›› Issue (7): 25-32. DOI: 10.16180/j.cnki.issn1007-7820.2024.07.004


Scene Recognition Algorithm Based on Discriminative Patch Extraction and Two-Stage Classification

HAN Yinghao, LI Feifei

  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2023-02-03  Online: 2024-07-15  Published: 2024-07-17
  • About the authors: HAN Yinghao (1997-), male, master's student. Research interests: image processing and pattern recognition.
    LI Feifei (1970-), female, PhD, professor. Research interests: multimedia information processing, image processing and pattern recognition, and information retrieval.
  • Supported by: Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (ES2015XX)



Abstract:

In scene recognition, scenes from different classes may contain highly similar object categories, while scenes from the same class may vary widely in spatial layout; these two problems are known as inter-class similarity and intra-class variability. Existing methods improve the discriminative ability of classifiers through data augmentation or by exploiting complementary multi-level information, and although performance has improved, limitations remain. This study proposes a Discriminative Patch Extraction (DPE) module and a Two-Stage Classification (TSC) network to overcome the inter-class similarity and intra-class variability of scenes. DPE reduces the impact of intra-class variability by retaining the key informative regions of an image, while the TSC network reduces the impact of inter-class similarity through coarse-to-fine two-stage training. Combined with baseline networks such as ViT (Vision Transformer), the proposed method achieves classification accuracies of 96.9%, 88.4%, and 76.0% on the classical scene recognition datasets Scene15, MITindoor67, and SUN397, respectively, and achieves the highest classification accuracy of 60.5% on Places365, the largest scene recognition dataset.
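The abstract is all the technical detail this page provides, so the following PyTorch sketch is only a hedged illustration of the two ideas it describes: ranking ViT patch tokens by CLS-token attention and keeping the top-k as the "key regions" (a stand-in for DPE), and routing a pooled feature from a coarse classifier to a group-specific fine classifier (a stand-in for TSC). Every name, shape, and the attention-based selection rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: the paper's code is not given here, so all names,
# shapes, and the coarse-to-fine class grouping below are assumptions.
import torch
import torch.nn as nn

class DPETwoStageHead(nn.Module):
    """Toy stand-in for DPE + TSC on top of ViT-style patch features."""

    def __init__(self, dim=768, n_coarse=10, n_fine_per_coarse=7, top_k=64):
        super().__init__()
        self.top_k = top_k
        self.coarse_head = nn.Linear(dim, n_coarse)   # stage 1: coarse classes
        self.fine_heads = nn.ModuleList(              # stage 2: one head per coarse class
            nn.Linear(dim, n_fine_per_coarse) for _ in range(n_coarse)
        )

    def forward(self, patch_tokens, cls_attn):
        # patch_tokens: (B, N, dim) patch embeddings from a ViT backbone
        # cls_attn:     (B, N) attention of the CLS token over the N patches
        # DPE stand-in: keep only the top-k most-attended ("key") patches.
        idx = cls_attn.topk(self.top_k, dim=1).indices                 # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))  # (B, k, dim)
        pooled = patch_tokens.gather(1, idx).mean(dim=1)               # (B, dim)

        # TSC stand-in: the coarse prediction routes each sample to a fine head.
        coarse_logits = self.coarse_head(pooled)                       # (B, n_coarse)
        routes = coarse_logits.argmax(dim=1)                           # (B,)
        fine_logits = torch.stack(
            [self.fine_heads[int(c)](pooled[b]) for b, c in enumerate(routes)]
        )                                                              # (B, n_fine_per_coarse)
        return coarse_logits, fine_logits

# Smoke test with random tensors standing in for backbone outputs.
head = DPETwoStageHead()
coarse, fine = head(torch.randn(2, 196, 768), torch.rand(2, 196))
print(coarse.shape, fine.shape)  # torch.Size([2, 10]) torch.Size([2, 7])
```

In this reading, the coarse stage would absorb inter-class similarity by first deciding among broad scene groups, while patch selection discards layout-dependent background so intra-class variability matters less; how the paper actually scores and groups classes is not specified on this page.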

Key words: scene recognition, deep neural networks, inter-class similarity, intra-class variability, data augmentation, discriminative patch extraction, two-stage classification, ViT

CLC number:

  • TP391