Optimization of memory access for convolutional neural network training

doi:10.19665/j.issn1001-2400.2020.02.014

Abstract

Batch Normalization (BN) effectively speeds up deep neural network training, but its complex data dependence causes a serious "memory wall" bottleneck. To address this bottleneck when training convolutional neural networks (CNNs) with BN layers, an effective memory access optimization method is proposed that combines BN reconstruction with fused-layer computation. First, a detailed analysis of BN's data dependence and memory access behavior during training identifies the key factors behind the large volume of memory access. Second, the "Convolution + BN + ReLU (Rectified Linear Unit)" block is fused into a single computational block, and a re-computing strategy is used to reduce memory access during training. In addition, the BN layer is split into two sub-layers, each fused with its adjacent layer, which further reduces memory access during training and effectively improves the accelerator's computational efficiency. Experimental results show that, when ResNet-50, Inception V3 and DenseNet are trained on the NVIDIA Tesla V100 GPU with the optimization method, the amount of memory access decreases by 33%, 22% and 31% respectively, and the actual computing efficiency of the V100 improves by 20.5, 18.5 and 18.1 percentage points respectively. The proposed method exploits the characteristics of memory access during training and can be combined with other optimization methods to further reduce memory access during training.

Key words: deep convolutional neural networks, model training, fused-layers, batch normalization reconstruction, off-chip memory access optimization.

CLC Number: TP391

WANG Jijun, HAO Ziyu, LI Hongliang. Optimization of memory access for the convolutional neural network training[J]. Journal of Xidian University, 2020, 47(2): 98-107.

Category	ResNet-50	InceptionV3	YoloV3	DenseNet
Convolution layers	53	107	75	121
Batch normalization layers	53	107	72	121
Activation layers	49	109	72	121
Total layers	177	369	318	547

Pass	Computing μ, σ²	Scaling and normalization	Computation/FLOP	Memory access/B	Arithmetic intensity/(FLOP/B)
Forward	3EFN	2EFN	5EFN	8EFN	0.625
Backward	3EFN	4EFN	7EFN	16EFN	0.4375
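
The table above separates BN work into computing the statistics μ and σ² and then scaling and normalizing, and reports arithmetic intensity as FLOP per byte. A minimal sketch of that two-step dependence for a single channel follows. The function names, the `gamma`/`beta`/`eps` parameters, and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def bn_statistics(x):
    """Step 1: reduce over all values of one channel to get mean and biased variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return mean, var

def bn_scale_normalize(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Step 2: normalize, then scale and shift. Needs the statistics from step 1."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in x]

def arithmetic_intensity(flop_coeff, byte_coeff):
    """FLOP per byte when both counts share the same EFN factor, as in the table."""
    return flop_coeff / byte_coeff

channel = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations for one channel
mean, var = bn_statistics(channel)
normalized = bn_scale_normalize(channel, mean, var)
forward_intensity = arithmetic_intensity(5, 8)    # 5EFN FLOP / 8EFN B
backward_intensity = arithmetic_intensity(7, 16)  # 7EFN FLOP / 16EFN B
```

Because the second step needs statistics gathered over the whole batch, the activations have to be traversed more than once, which is consistent with the low FLOP-per-byte ratios in the table.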

Model	Input image size	Batch size	Total intermediate results and parameters/MB
AlexNet	224×224×3	16	119
VGG-16	224×224×3	16	1761
ResNet-50	224×224×3	16	2202

Layer	Phase	Input feature map	Parameters	Output feature map	Read accesses	Write accesses
Convolution	Forward pass	[N,C₁,H₁,W₁]	K²C₁C₂	[N,C₂,H₂,W₂]	NC₁H₁W₁+K²C₁C₂	NC₁H₁W₁
	Parameter gradient	[C₁,N,H₁,W₁]	[C₂,N,H₂,W₂]	K²C₂C₁	NC₁H₁W₁+K+C₂NH₂W₂	K²C₂C₁
	Input gradient	[N,C₂,H₂,W₂]	K²C₂C₁	[N,C₁,H₁,W₁]	NC₂H₂W₂+K²C₂C₁	NC₁H₁W₁
Batch normalization	Forward pass	[C₂,N,H₂,W₂]		[C₂,N,H₂,W₂]	2NC₂H₂W₂
Batch normalization	Input gradient	[C₂,N,H₂,W₂]		[C₂,N,H₂,W₂]	4NC₂H₂W₂	NC₂H₂W₂
Activation	Forward pass	[N,C₂,H₂,W₂]		[N,C₂,H₂,W₂]		NC₂H₂W₂
Activation	Input gradient	[N,C₂,H₂,W₂]		[N,C₂,H₂,W₂]	NC₂H₂W₂
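
To give a feel for the read volumes tabulated above, the sketch below evaluates the forward-pass read formulas from the convolution and BN rows. The layer shape (N=16, C₁=C₂=64, 56×56 feature maps, K=3) is a hypothetical example chosen for illustration, and the counts are in elements rather than bytes.

```python
def conv_forward_reads(n, c1, h1, w1, k, c2):
    """Convolution forward reads: input feature map plus weights (N*C1*H1*W1 + K^2*C1*C2)."""
    return n * c1 * h1 * w1 + k * k * c1 * c2

def bn_forward_reads(n, c2, h2, w2):
    """BN forward reads per the table: twice the feature-map size (2*N*C2*H2*W2)."""
    return 2 * n * c2 * h2 * w2

# Hypothetical layer shape, not taken from the paper.
conv_reads = conv_forward_reads(n=16, c1=64, h1=56, w1=56, k=3, c2=64)
bn_reads = bn_forward_reads(n=16, c2=64, h2=56, w2=56)
print(conv_reads, bn_reads)
```

For this shape the BN layer reads roughly twice as many elements as the convolution that feeds it, even though it performs far less arithmetic.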

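Method	Structure	Pass	Convolution layer	BN_A	BN_B	ReLU
Fused-layer training	BRCB	Forward	K₁+D₂	0	D₁	0
	BRCB	Backward	2K₁	D₁	4D₂	2D₁
	BRC	Forward	K₁+D₂	2D₁	0
	BRC	Backward	2D₁+2K₁+2D₂	D₁	4D₁
Original training		Forward	D₁+K₁+D₂	3D₂	2D₂
Original training		Backward	2D₁+2K₁+2D₂	5D₂	3D₂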

Network model	Original training method	Fused-layer BRC	BRC reduction ratio	Fused-layer BRCB	BRCB reduction ratio
ResNet-50	18.18	14.3	0.22	12.27	0.33
InceptionV3	11.51	9.82	0.15	8.93	0.22
DenseNet	25.07	20.45	0.18	17.19	0.31
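
The ratio columns above can be recomputed from the raw volumes as (original − fused) / original. The helper below is a hypothetical illustration using the tabulated values; the recomputed ratios agree with the table to within about 0.01.

```python
def reduction_ratio(original, fused):
    """Fraction of memory access removed relative to the original training method."""
    return (original - fused) / original

# (original, BRC, BRCB) memory-access volumes copied from the table.
volumes = {
    "ResNet-50": (18.18, 14.3, 12.27),
    "InceptionV3": (11.51, 9.82, 8.93),
    "DenseNet": (25.07, 20.45, 17.19),
}
ratios = {
    model: (reduction_ratio(orig, brc), reduction_ratio(orig, brcb))
    for model, (orig, brc, brcb) in volumes.items()
}
for model, (brc_ratio, brcb_ratio) in ratios.items():
    print(f"{model}: BRC {brc_ratio:.2f}, BRCB {brcb_ratio:.2f}")
```

The BRCB ratios correspond to the 33%, 22% and 31% reductions quoted in the abstract.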

Network model	OP/PIC	Training speed/(PIC/S)			Compute performance/TFLOPS			Compute efficiency/%
Network model	OP/PIC	Original	BRC	BRCB	Original	BRC	BRCB	Original	BRC	BRCB
ResNet-50	2.3×10¹⁰	1254.3	1960.2	2327.1	28.8	45.1	53.5	24.1	37.6	44.6
InceptionV3	3.7×10¹⁰	867.2	1287.5	1466.4	32.1	46.7	54.3	26.7	38.9	45.2
DenseNet	1.8×10¹⁰	1725.6	2627.2	3004.2	31.1	47.3	52.7	25.8	39.4	43.9
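
The performance and efficiency columns follow from the throughput figures: compute performance is OP/PIC multiplied by PIC/S, and efficiency is that value divided by the accelerator's peak. The sketch below checks the ResNet-50 BRCB entry. The 120 TFLOPS peak is an assumption inferred from the ratio of the table's own columns, not a figure stated in this section.

```python
# Assumption: inferred from the table (e.g. 53.5 / 0.446), not stated in the text.
ASSUMED_PEAK_TFLOPS = 120.0

def compute_tflops(ops_per_image, images_per_second):
    """Sustained performance in TFLOPS from per-image work and training speed."""
    return ops_per_image * images_per_second / 1e12

def efficiency_percent(tflops, peak_tflops=ASSUMED_PEAK_TFLOPS):
    """Share of peak performance actually achieved, in percent."""
    return 100.0 * tflops / peak_tflops

brcb_tflops = compute_tflops(2.3e10, 2327.1)   # ResNet-50 with BRCB
brcb_efficiency = efficiency_percent(brcb_tflops)
efficiency_gain = round(44.6 - 24.1, 1)        # BRCB minus original, in percentage points
print(f"{brcb_tflops:.1f} TFLOPS, {brcb_efficiency:.1f}% of assumed peak, gain {efficiency_gain} points")
```

Since OP/PIC is rounded in the table, recomputed values for other entries may differ slightly from the tabulated ones.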
