Crowd Counting Algorithm Based on Residual Dense Connection and Attention Fusion

doi:10.16180/j.cnki.issn1007-7820.2022.06.002

Abstract

Existing crowd counting algorithms use multi-column fusion structures to handle scale variation within a single image, but this approach cannot make effective use of low-level feature information, which leads to inaccurate counts. To address this shortcoming, this paper proposes a crowd counting algorithm based on residual dense connection and attention fusion. The front end of the algorithm uses an improved VGG16 network to extract low-level feature information. The main branch of the back end is built on a residual dense connection structure, which combines a residual network with a dense network to capture feature information between layers and thereby captures multi-scale information efficiently. The side branch introduces an attention mechanism that generates an attention map at the corresponding scale, which effectively separates the foreground of the feature map from the background and reduces the influence of background noise. The algorithm is evaluated on three mainstream public datasets. The experimental results show that the algorithm counts effectively and achieves better counting accuracy than the compared algorithms.

Keywords: crowd counting, residual dense, attention, convolutional neural network, density map, feature fusion, multi-scale, nearest neighbor interpolation
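The two back-end ideas named in the abstract can be illustrated with a small sketch. It is not the authors' network: it works on a single pixel's channel vector, treats a 1x1 convolution as a matrix-vector product, and uses hand-picked weights. The function names (`linear`, `residual_dense_block`, `attention_gate`) and the sigmoid gating are assumptions made for illustration only.

```python
import math


def linear(vec, weights, bias):
    """1x1 convolution at one pixel: a matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def residual_dense_block(x, layer_params, fusion_params):
    """Dense connections inside the block, residual connection around it."""
    features = [x]  # every earlier output stays available to later layers
    for weights, bias in layer_params:
        # Dense part: each layer sees the concatenation of all previous outputs.
        concat = [v for f in features for v in f]
        features.append(relu(linear(concat, weights, bias)))
    concat = [v for f in features for v in f]
    # Fusion layer maps the concatenation back to len(x) channels.
    fused = linear(concat, *fusion_params)
    # Residual part: add the block input to the fused output.
    return [a + b for a, b in zip(x, fused)]


def attention_gate(feature_map, attention_logits):
    """Scale each position by a sigmoid attention weight in (0, 1)."""
    return [[f / (1.0 + math.exp(-a)) for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature_map, attention_logits)]


# Hypothetical weights: two input channels, two dense layers of one channel each.
x = [1.0, 2.0]
layers = [([[0.5, 0.25]], [0.0]),
          ([[1.0, 0.0, -1.0]], [0.5])]
fusion = ([[0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 2.0]], [0.0, 0.0])
block_out = residual_dense_block(x, layers, fusion)

# A logit of 0 halves the response; a large logit passes it almost unchanged.
gated = attention_gate([[4.0, 4.0]], [[0.0, 20.0]])
```

With all fusion weights set to zero the block returns its input unchanged, which is the identity shortcut that the residual connection provides. In the gating step, positions with low attention (background) are attenuated while positions with high attention (foreground) are kept.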

CLC number: TP391

SHEN Ningjing, YUAN Jian. Crowd Counting Algorithm Based on Residual Dense Connection and Attention Fusion[J]. Electronic Science and Technology, 2022, 35(6): 6-12.

Figures and tables (9): Figures 1 to 5; Tables 1 to 4.

References (24)

[1]	Leibe B, Seemann E, Schiele B. Pedestrian detection in crowded scenes[C]. San Diego: IEEE Computer Society Conference on Computer Vision & Pattern Recognition. IEEE, 2005.
[2]	Xia Jingjing, Gao Lin, Fan Yong, et al. People counting based on skeleton feature[J]. Journal of Computer Applications, 2014, 34(2):585-588. (in Chinese)
[3]	Yang Lin, Lü Xueqiang, Zhang Xin, et al. People counting method combining pixel feature and conglutination human body segmentation[J]. Computer Engineering and Design, 2019, 40(2):455-461. (in Chinese)
[4]	Yu Mingjuan, Zhang Yinglie, Chen Linqiang. Crowd density estimation method for hospital surveillance[J]. Electronic Science and Technology, 2016, 29(3):75-78. (in Chinese)
[5]	Fan Longfei, Jiang Zizheng, Li Haifeng, et al. Population statistics algorithm based on local density classification[J]. Control Engineering of China, 2019, 26(6):1015-1020. (in Chinese)
[6]	Boominathan L, Kruthiventi S S S, Babu R V. CrowdNet: A deep convolutional network for dense crowd counting[C]. Amsterdam: Proceedings of the Twenty-fourth ACM International Conference on Multimedia, 2016.
[7]	Zhang Y, Zhou D, Chen S, et al. Single-image crowd counting via multi-column convolutional neural network[C]. Las Vegas: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[8]	Sindagi V A, Patel V M. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting[C]. Lecce: Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017.
[9]	Sam D B, Surya S, Babu R V. Switching convolutional neural network for crowd counting[C]. Honolulu: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[10]	Zeng L, Xu X, Cai B, et al. Multi-scale convolutional neural networks for crowd counting[C]. Beijing: Proceedings of the IEEE International Conference on Image Processing, 2017.
[11]	Liu M, Jiang J, Guo Z Q, et al. Crowd counting with fully convolutional neural network[C]. Athens: The Twenty-fifth IEEE International Conference on Image Processing, 2018.
[12]	Li Y, Zhang X, Chen D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes[C]. Salt Lake City: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[13]	Wang S, Wang H, Li Q. Multi-Dilation network for crowd counting[C]. Beijing: Proceedings of the ACM Multimedia Asia, 2019.
[14]	Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. Computer Science, 2014(11):332-345.
[15]	Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]. Honolulu: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[16]	Zhao B, Wu X, Feng J, et al. Diversified visual attention networks for fine-grained object classification[J]. IEEE Transactions on Multimedia, 2017, 19(6):1245-1256. doi: 10.1109/TMM.2017.2648498
[17]	Park J, Woo S, Lee J Y, et al. BAM: Bottleneck attention module[C]. Newcastle: Proceedings of the British Machine Vision Conference, 2018.
[18]	Chen L C, Yang Y, Wang J, et al. Attention to scale: Scale-aware semantic image segmentation[C]. Las Vegas: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[19]	Zheng Meng. Research on intelligent English translation based on improved attention mechanism model[J]. Electronic Science and Technology, 2020, 33(11):84-87. (in Chinese)
[20]	Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]. Salt Lake City: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[21]	Chen K, Loy C C, Gong S, et al. Feature mining for localised crowd counting[C]. London: Proceedings of the British Machine Vision Conference, 2012.
[22]	Xiong F, Shi X, Yeung D Y. Spatiotemporal modeling for crowd counting in videos[C]. Venice: Proceedings of the IEEE International Conference on Computer Vision, 2017.
[23]	Idrees H, Tayyab M, Athrey K, et al. Composition loss for counting, density map estimation and localization in dense crowds[C]. Munich: Proceedings of the European Conference on Computer Vision, 2018.
[24]	Sheng B, Shen C, Lin G, et al. Crowd counting via weighted VLAD on a dense attribute feature map[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2016, 28(8):1788-1797. doi: 10.1109/TCSVT.2016.2637379

Method	Part A		Part B
	MAE	RMSE	MAE	RMSE
SwitchCNN^[9]	90.4	135.0	21.6	33.4
MSCNN^[10]	83.8	127.4	17.7	30.2
CSRNet^[12]	68.2	115.0	10.6	16.0
MDNet^[13]	66.9	108.3	9.2	15.1
RDCAF	65.6	108.0	8.51	14.2
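
The tables report MAE and RMSE. Assuming the paper follows the usual crowd counting definitions (mean absolute error and root mean squared error between the true and predicted count per test image), the two metrics can be computed as in the sketch below. The counts are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def mae(true_counts, predicted_counts):
    """Mean absolute error over the test images."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


def rmse(true_counts, predicted_counts):
    """Root mean squared error over the test images."""
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ground-truth and predicted counts for four test images.
true_counts = [100, 250, 40, 10]
predicted_counts = [98.0, 256.0, 40.0, 14.0]

print("MAE :", mae(true_counts, predicted_counts))
print("RMSE:", rmse(true_counts, predicted_counts))
```

Because RMSE squares each error before averaging, a few badly miscounted images raise it much more than they raise MAE, which is why the two columns are usually read as accuracy and robustness respectively.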

Method	MAE	RMSE
Ridge Regression^[21]	3.59	19.0
Weighted VLAD^[24]	2.86	13.05
MCNN^[7]	2.24	8.50
Bi-ConvLSTM^[22]	2.10	7.60
RDCAF	1.79	2.32

Method	MAE	RMSE
MCNN^[7]	277.0	426.0
C-MTL^[8]	252.0	514.0
SwitchCNN^[9]	228.0	445.0
CL-CNN^[23]	132.0	191.0
RDCAF	108.5	180.6

Method	MAE	RMSE
Without attention mechanism	9.20	15.40
Without residual dense connection structure	11.40	16.90
RDCAF	8.51	14.20
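
One way to read the ablation rows above is as the relative increase in error when a component is removed from the full model. The sketch below does that arithmetic using only the values in the table; the helper name `relative_increase` and the choice of the full RDCAF model as the baseline are assumptions made for illustration.

```python
# (MAE, RMSE) values copied from the ablation table.
full_model = (8.51, 14.20)
ablations = {
    "without attention mechanism": (9.20, 15.40),
    "without residual dense connection structure": (11.40, 16.90),
}


def relative_increase(value, baseline):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100.0


for name, (mae_value, rmse_value) in ablations.items():
    mae_up = relative_increase(mae_value, full_model[0])
    rmse_up = relative_increase(rmse_value, full_model[1])
    print(f"{name}: MAE +{mae_up:.2f}%, RMSE +{rmse_up:.2f}%")
```

On these numbers, removing the attention branch raises MAE by about 8%, while removing the residual dense connection structure raises it by about 34%, so the residual dense structure accounts for the larger share of the gain in this comparison.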