Journal of Xidian University ›› 2023, Vol. 50 ›› Issue (2): 92-100. doi: 10.19665/j.issn1001-2400.2023.02.010

• Information and Communication Engineering •

GPGPU cache bypassing system for 2D and 3D convolution

JIA Shiwei1, ZHANG Yuming1, QIN Xiang2, SUN Chenglu2, TIAN Ze3

  1. School of Microelectronics, Xidian University, Xi'an 710071, China
  2. Department of Integrated Circuit R&D, Xiangteng Microelectronics Corporation, Xi'an 710068, China
  3. Key Laboratory of Aviation Science and Technology on Integrated Circuit and Micro-System Design, China Institute of Aeronautical Computing Technology, Xi'an 710068, China
  • Received: 2022-05-23  Online: 2023-04-20  Published: 2023-05-12
  • About the authors: JIA Shiwei (1993-), male, Ph.D. candidate at Xidian University, E-mail: 18111210124@xidian.edu.cn; ZHANG Yuming (1965-), male, professor, E-mail: zhangym@xidian.edu.cn; QIN Xiang (1991-), male, engineer, E-mail: 18991316149@189.cn; SUN Chenglu (1991-), female, engineer, E-mail: 101449175@qq.com; TIAN Ze (1965-), male, research fellow, E-mail: tarmz@126.com
  • Supported by: the Equipment Joint Fund (6141B05200305)

Abstract:

The general-purpose graphics processing unit (GPGPU) is the core acceleration platform for convolutional neural networks, and its performance on two-dimensional (2D) and three-dimensional (3D) convolution determines whether such networks can be applied effectively to real-time target recognition and detection. However, limited by the design of its inherent cache system, the current GPGPU architecture cannot accelerate 2D and 3D convolution efficiently. To address this problem, a dynamic L1D cache bypassing design is proposed. First, a data structure that dynamically reflects the cache access characteristics of an instruction is defined, and a memory-access-feature record table built on it records the execution status of different memory access instructions when they request the cache. Second, a warp scheduling strategy that prioritizes one thread block is adopted to speed up the sampling of this memory access state. Finally, the sampled state is used to decide, for each program counter (PC) value, whether its memory requests should bypass the L1D cache, and low-locality data requests are dynamically made to bypass it. The L1D cache space is thereby reserved for high-locality data, memory access stall cycles during 2D and 3D convolution are reduced, and the memory access efficiency of 2D and 3D convolution on the GPGPU is improved. Experimental results show that, compared with the original architecture, the design brings performance improvements of about 2.16% for 2D convolution and 19.79% for 3D convolution, which demonstrates its effectiveness and practicality.
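The abstract describes per-PC bookkeeping: the execution status of each memory access instruction is sampled into a record table while one prioritized thread block runs, and the sampled hit behaviour then decides which PCs bypass the L1D cache. The C++ sketch below is a minimal illustration of that bookkeeping under a simple hit-rate criterion; the class, field names, threshold, and sample count are hypothetical and are not taken from the paper's implementation.

    #include <cstdint>
    #include <unordered_map>

    // Execution status sampled for one memory-access instruction
    // (identified by its PC) while the prioritized thread block runs.
    struct AccessFeature {
        uint64_t accesses = 0;  // L1D requests issued by this PC
        uint64_t hits = 0;      // requests that hit in the L1D cache
        bool bypass = false;    // decision made once sampling is done
    };

    class BypassTable {
    public:
        // Record one sampled L1D access for the given instruction PC.
        void record(uint64_t pc, bool hit) {
            AccessFeature& f = table_[pc];
            ++f.accesses;
            if (hit) ++f.hits;
        }

        // After the sampling thread block finishes, mark low-locality PCs
        // (hit rate below a hypothetical threshold) as L1D-bypassing.
        void decide(double hit_rate_threshold = 0.25, uint64_t min_samples = 32) {
            for (auto& entry : table_) {
                AccessFeature& f = entry.second;
                if (f.accesses < min_samples) continue;
                double hit_rate = static_cast<double>(f.hits) / f.accesses;
                f.bypass = hit_rate < hit_rate_threshold;
            }
        }

        // Subsequent requests from a bypassing PC skip the L1D cache and
        // go directly to the lower memory levels, keeping L1D space for
        // high-locality data.
        bool should_bypass(uint64_t pc) const {
            auto it = table_.find(pc);
            return it != table_.end() && it->second.bypass;
        }

    private:
        std::unordered_map<uint64_t, AccessFeature> table_;  // keyed by PC
    };

A typical use would call record() for every sampled L1D request, call decide() once the prioritized thread block has drained, and consult should_bypass() on each later request; in the paper's design the thread-block-prioritized warp scheduler exists precisely so that this sampling phase completes quickly.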

Key words: convolution, GPGPU, memory system, cache bypassing

CLC number: TN4