Journal of Xidian University ›› 2020, Vol. 47 ›› Issue (2): 98-107. doi: 10.19665/j.issn1001-2400.2020.02.014


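Memory access optimization for convolutional neural network training

WANG Jijun, HAO Ziyu, LI Hongliang

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083
  • Received: 2019-11-14 Online: 2020-04-20 Published: 2020-04-26
  • About the author: WANG Jijun (born 1990), male, Ph.D. candidate at the Jiangnan Institute of Computing Technology, E-mail: wjjxjtu@mail.ustc.edu.cn
  • Supported by:
    National Major Science and Technology Project on Core Electronic Devices, High-end Generic Chips and Basic Software: Intelligent Computing Unit for Data Centers (Cloud Platforms) and Cluster Computing, 2018ZX01028-102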

Optimization of memory access for the convolutional neural network training

WANG Jijun,HAO Ziyu,LI Hongliang   

  1. Jiangnan Institute of Computing Technology, Wuxi 214083, China
  • Received: 2019-11-14 Online: 2020-04-20 Published: 2020-04-26

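Abstract (Chinese version):

Although the batch normalization algorithm effectively accelerates the convergence of deep convolutional network models, its complex data dependences cause a severe "memory wall" bottleneck during training. For convolutional neural networks that use batch normalization, a training method that fuses multiple layers and reconstructs the batch normalization layer is therefore proposed to reduce the volume of memory access during model training. First, the data dependences and memory access characteristics of the batch normalization layer during training, together with the memory access characteristics of model training as a whole, are analyzed to identify the key factors behind the memory access bottleneck. Second, following the idea of "trading computation for memory access", a method for fusing the "convolution layer + batch normalization layer + activation layer" structure is proposed. Based on the computation and memory access characteristics of the batch normalization layer, that layer is reconstructed into two sub-layers, each fused with an adjacent layer, which further reduces reads and writes to main memory during training. A memory access model and a computation model for training are also constructed. Experimental results show that, when ResNet-50, Inception V3 and DenseNet models are trained on an NVIDIA Tesla V100 GPU, the volume of memory access is reduced by 33%, 22% and 31%, respectively, compared with the original training method, and the actual computational efficiency of the V100 is improved by 20.5%, 18.5% and 18.1%, respectively. This optimization method exploits the network structure and the memory access characteristics of model training, and it can be used together with other memory access optimization methods to further reduce memory access during model training.

Keywords: deep convolutional neural network, model training, multi-layer fusion, batch normalization reconstruction, memory access optimization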

Abstract:

Batch Normalization (BN) effectively speeds up the training of deep neural networks, but its complex data dependences cause a serious "memory wall" bottleneck. To address this bottleneck when training convolutional neural networks (CNNs) with BN layers, a memory access optimization method based on BN reconstruction and fused-layer computation is proposed. First, a detailed analysis of BN's data dependences and memory access characteristics during training identifies the key factors behind the large volume of memory access. Second, the "Convolution + BN + ReLU (Rectified Linear Unit)" block is fused into a single computational block, using a recomputation strategy during training to reduce memory access. In addition, the BN layer is split into two sub-layers, each fused with its adjacent layer, which further reduces memory access during training and improves the accelerator's computational efficiency. Experimental results show that, when ResNet-50, Inception V3 and DenseNet are trained on the NVIDIA Tesla V100 GPU with the proposed method, the amount of memory access decreases by 33%, 22% and 31%, respectively, and the actual computing efficiency of the V100 improves by 20.5%, 18.5% and 18.1%, respectively. The proposed method exploits the memory access characteristics of training and can be combined with other optimization methods to further reduce memory access during training.

Key words: deep convolutional neural networks, model training, fused layers, batch normalization reconstruction, off-chip memory access optimization
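
The idea of splitting BN into two sub-layers can be illustrated with a minimal pure-Python sketch. The abstract does not say how the split is drawn, so the division used here (a statistics sub-layer that could be accumulated alongside the preceding convolution, and a normalization sub-layer combined with ReLU) is an assumption, as are all function names, the `EPS` constant, and the toy data. It is not the authors' GPU implementation and does not model the recomputation strategy; it only checks that the split form gives the same result as unfused BN followed by ReLU.

```python
import math

EPS = 1e-5  # assumed small constant for numerical stability


def bn_statistics(x):
    """Hypothetical sub-layer 1: per-channel mean and biased variance.

    x is a list of rows, one value per channel. The running sums are the
    kind of quantity a fused kernel could accumulate while the preceding
    convolution produces its output, instead of re-reading that output.
    """
    n = len(x)
    channels = len(x[0])
    sums = [0.0] * channels
    sq_sums = [0.0] * channels
    for row in x:
        for c, v in enumerate(row):
            sums[c] += v
            sq_sums[c] += v * v
    mean = [s / n for s in sums]
    var = [sq / n - m * m for sq, m in zip(sq_sums, mean)]
    return mean, var


def bn_normalize_relu(x, mean, var, gamma, beta):
    """Hypothetical sub-layer 2: normalize, scale/shift, and ReLU in one pass."""
    inv_std = [1.0 / math.sqrt(v + EPS) for v in var]
    return [
        [max(0.0, gamma[c] * (v - mean[c]) * inv_std[c] + beta[c])
         for c, v in enumerate(row)]
        for row in x
    ]


def bn_relu_reference(x, gamma, beta):
    """Unfused reference: full BN, then a separate ReLU pass."""
    n = len(x)
    channels = len(x[0])
    mean = [sum(row[c] for row in x) / n for c in range(channels)]
    var = [sum((row[c] - mean[c]) ** 2 for row in x) / n
           for c in range(channels)]
    bn = [[gamma[c] * (row[c] - mean[c]) / math.sqrt(var[c] + EPS) + beta[c]
           for c in range(channels)] for row in x]
    return [[max(0.0, v) for v in row] for row in bn]


# Toy input: 4 samples, 2 channels (spatial positions folded into the batch).
x = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0], [2.0, 1.5]]
gamma, beta = [1.0, 0.5], [0.0, 0.1]

mean, var = bn_statistics(x)
fused = bn_normalize_relu(x, mean, var, gamma, beta)
reference = bn_relu_reference(x, gamma, beta)
```

In this sketch the unfused reference makes separate passes over the data for the statistics, the normalization, and the ReLU, whereas the split form needs one accumulation pass and one combined normalize-and-activate pass. That is the kind of saving the abstract attributes to fusing each sub-layer with its neighbor.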

CLC number:

  • TP391