Design of FPGA-Based SqueezeNet Inference Accelerator

doi:10.16180/j.cnki.issn1007-7820.2022.02.004

Abstract

The lightweight deep neural network SqueezeNet moves a large volume of intermediate data and consumes many computation cycles. To accelerate it, this paper partitions the entire network into processing blocks, each composed of an Expand layer and a Squeeze layer. Because every processing block ends with a Squeeze layer, less intermediate data flows between the computing module and memory, which lowers read and write overhead. Exploiting the characteristics of the activation function, the core computing module adopts an early-termination technique for convolution, supported by three dedicated units: an effective index survival unit, an effective index control value unit, and a convolution judgment unit. Together they skip the computation and the computation cycles that invalid values would otherwise occupy. Experimental results show that the accelerator reduces data flow by 55.38% and reduces the computation and computation cycles occupied by invalid values by 14.68%.

Key words: lightweight deep neural network, SqueezeNet, processing block, activation function, early termination of convolution computation, effective index, invalid value, computation cycle
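The abstract does not spell out how the early-termination logic is built, so the following is only a minimal Python sketch of one plausible reading. It assumes a ReLU activation (as used in SqueezeNet), treats the zeros it produces as invalid values, and uses a list of effective indices so that the multiply-accumulate loop visits only nonzero activations. The function names (`relu`, `effective_indices`, `sparse_dot`) and the sample data are hypothetical and are not taken from the paper.

```python
def relu(x):
    """ReLU activation: negative inputs become zero."""
    return x if x > 0 else 0


def effective_indices(activations):
    """Indices of nonzero activations (assumption: zeros count as invalid values)."""
    return [i for i, a in enumerate(activations) if a != 0]


def sparse_dot(activations, weights):
    """Dot product that visits only effective indices.

    Returns the accumulated sum and the number of multiply-accumulates performed.
    """
    indices = effective_indices(activations)
    acc = 0
    for i in indices:
        acc += activations[i] * weights[i]
    return acc, len(indices)


# Hypothetical 3x3 window, flattened, after the previous layer's ReLU.
window = [relu(v) for v in [3, -1, 0, 5, -2, 4, -7, 1, 2]]
weights = [1, 2, 3, 4, 5, 6, 7, 8, 9]

dense = sum(a * w for a, w in zip(window, weights))
result, macs = sparse_dot(window, weights)
print(result, dense, macs, len(window))
```

In this toy window, five of the nine activations are nonzero, so the sketch performs five multiply-accumulates instead of nine while producing the same sum as the dense computation.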

CLC number: TP183

CHU Ping, NI Wei. Design of FPGA-Based SqueezeNet Inference Accelerator[J]. Electronic Science and Technology, 2022, 35(2): 20-26.

Figures and tables (11): Figures 1-9, Tables 1-2

References (16)

[1]	Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]. San Diego:Proceedings of the International Conference on Learning Representations, 2015.
[2]	Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]. Boston:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[3]	He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]. Las Vegas:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[4]	Han S, Mao H, Dally W J, et al. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding[C]. San Juan:Proceedings of the International Conference on Learning Representations, 2016.
[5]	Han S, Liu X Y, Mao H Z, et al. EIE: efficient inference engine on compressed deep neural network[J]. International Symposium on Computer Architecture, 2016, 44(3):243-254.
[6]	Courbariaux M, Bengio Y, David J P. BinaryConnect: training deep neural networks with binary weights during propagations[C]. Montreal:Proceedings of the Twenty-ninth Annual Conference on Neural Information Processing Systems, 2015.
[7]	Rastegari M, Ordonez V, Redmon J, et al. XNOR-Net: ImageNet classification using binary convolutional neural networks[C]. Amsterdam:Proceedings of the Fourteenth European Conference on Computer Vision, 2016.
[8]	Zhang X Y, Zhou X Y, Lin M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices[C]. Salt Lake City:Proceedings of the Thirty-first IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[9]	Sandler M, Howard A, Zhu M, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]. Salt Lake City:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[10]	Santos A G, Souza C D, Zanchettin C, et al. Reducing SqueezeNet storage size with depthwise separable convolutions[C]. Rio de Janeiro:Proceedings of the International Joint Conference on Neural Networks, 2018.
[11]	Bi Pengcheng, Luo Jianxin, Chen Weiwei, et al. Lightweight convolutional neural network structure for mobile terminal[J]. Information Technology and Network Security, 2019, 38(9):24-29.
[12]	Hu Ting, Zhu Yongxin, Tian Li, et al. Lightweight convolutional neural network architecture for mobile platforms[J]. Computer Engineering, 2019, 45(1):17-22.
[13]	Qin Xing, Gao Xiaoqi, Chen Bin. Image super-resolution algorithm based on SqueezeNet convolution neural network[J]. Electronic Science and Technology, 2020, 33(5):1-8.
[14]	Huang C, Ni S Y, Chen G S. A layer-based structured design of CNN on FPGA[C]. Guiyang:Proceedings of the Twelfth IEEE International Conference on ASIC, 2017.
[15]	Aimar A, Mostafa H, Calabrese E, et al. NullHop: a flexible convolutional neural network accelerator based on sparse representations of feature maps[J]. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(3):644-656. doi: 10.1109/TNNLS.2018.2852335
[16]	Mousouliotis P G, Petrou L P. SqueezeJet: high-level synthesis accelerator design for deep convolutional neural networks[C]. Santorini:Proceedings of the International Symposium on Applied Reconfigurable Computing, 2018.

Resource	LUT	FF	DSP	BRAM
Available	1 221 600	2 443 200	2 160	2 584
Used	477 940	152 120	1 536	2 279
Utilization	39.1%	6.2%	71.1%	88.1%
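As a quick arithmetic check, the utilization row can be recomputed from the available and used counts. The sketch below copies the figures from the resource table; the dictionary layout and the helper name `utilization_percent` are illustrative choices, not part of the original work.

```python
# (available, used) counts copied from the resource table.
resources = {
    "LUT": (1_221_600, 477_940),
    "FF": (2_443_200, 152_120),
    "DSP": (2_160, 1_536),
    "BRAM": (2_584, 2_279),
}


def utilization_percent(available, used):
    """Return used/available expressed as a percentage."""
    return 100.0 * used / available


utilization = {
    name: utilization_percent(available, used)
    for name, (available, used) in resources.items()
}

for name, pct in utilization.items():
    print(f"{name}: {pct:.2f}%")
```

The LUT, FF and DSP ratios match the listed values to one decimal place. The BRAM ratio works out to roughly 88.2%, so the listed 88.1% appears to be truncated rather than rounded.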

	Ref. [16]	Ref. [14]	This accelerator
Platform	Zynq XC7Z020	Xilinx XC7Z020	XC7V 2000T
Frequency/MHz	100	110	100
DSP	186	1 879	1 536
BRAM/kB	269	2 715	2 279
Precision	8/16-bit fixed	16-bit fixed	16-bit fixed
Latency/s	0.333 00	0.003 65	0.003 58
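To put the latency row in relative terms, the ratios between the reported latencies can be computed directly. This is a minimal sketch using only the three latency values from the comparison table; the labels and the helper name `latency_ratio` are hypothetical, and the ratio ignores the differences in platform, clock frequency, and numeric precision listed in the same table.

```python
# Latencies in seconds, copied from the comparison table.
latency_s = {
    "Ref. [16]": 0.33300,
    "Ref. [14]": 0.00365,
    "This accelerator": 0.00358,
}


def latency_ratio(baseline_s, ours_s):
    """How many times longer the baseline latency is than ours."""
    return baseline_s / ours_s


ours = latency_s["This accelerator"]
ratios = {
    name: latency_ratio(t, ours)
    for name, t in latency_s.items()
    if name != "This accelerator"
}

for name, r in ratios.items():
    print(f"{name}: {r:.2f}x the latency of this accelerator")
```

On these figures the design is about 93 times faster than the accelerator of Ref. [16] and about 2% faster than that of Ref. [14], while using fewer DSP and BRAM resources than Ref. [14].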