Video Retrieval Algorithm Based on 3D Convolution and Hash Method

doi:10.16180/j.cnki.issn1007-7820.2022.04.006

Abstract

Video retrieval differs from other multimedia retrieval chiefly in the large amount of information each video carries, which makes computing the similarity between videos expensive. In addition, feature extraction often ignores the temporal correlation between video frames, so the extracted features are incomplete and retrieval accuracy suffers. To address both problems, this paper proposes a video retrieval method based on 3D convolution and hashing. The method builds an end-to-end framework that uses a 3D convolutional neural network to extract features from representative frames selected from each video, maps those features into a low-dimensional Hamming space, and computes similarity there. Experiments on two video datasets show that the proposed method achieves higher accuracy than recent video retrieval algorithms.

Key words: video retrieval, 3D convolution, feature representation, hash method, supervised learning, feature reduction, Hamming space, similarity matching
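The retrieval step described in the abstract (map features to binary codes, then compare codes in Hamming space) can be illustrated with a minimal sketch. It is not the authors' implementation: the zero threshold used for binarization, the function names, and the toy 8-dimensional feature values are all assumptions made for illustration, and the real features would come from the 3D convolutional network.

```python
from typing import List, Sequence, Tuple


def binarize(features: Sequence[float], threshold: float = 0.0) -> int:
    """Pack a real-valued feature vector into an integer hash code.

    One bit per dimension; thresholding at zero is an assumption here.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > threshold else 0)
    return code


def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query: int, database: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs sorted from nearest to farthest."""
    distances = [(i, hamming_distance(query, code)) for i, code in enumerate(database)]
    return sorted(distances, key=lambda pair: pair[1])


# Toy "video features" (made-up numbers, not taken from the paper).
database_features = [
    [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8],
    [-0.6, 0.5, -0.1, 0.2, -0.9, -0.3, 0.7, -0.4],
    [0.8, -0.1, 0.5, -0.6, -0.2, 0.4, -0.3, 0.9],
]
query_features = [0.7, -0.3, 0.6, -0.8, 0.2, 0.1, -0.4, 0.6]

database_codes = [binarize(f) for f in database_features]
query_code = binarize(query_features)
ranking = rank_by_hamming(query_code, database_codes)
print(ranking)  # [(0, 0), (2, 1), (1, 8)]
```

Because each comparison reduces to an XOR and a bit count, ranking a large database is far cheaper than comparing the original real-valued features, which is the computational motivation given in the abstract.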

CLC number: TP391

Hanqing CHEN, Feifei LI, Qiu CHEN. Video Retrieval Algorithm Based on 3D Convolution and Hash Method[J]. Electronic Science and Technology, 2022, 35(4): 35-39.

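Table 1

3D ResNet18 network parameters

Layer name	18-layer structure
Conv1	7×7×7, 64
Conv2_x	3×3×3 max pool, stride 2; $\begin{bmatrix} 3 \times 3 \times 3,\ 64 \\ 3 \times 3 \times 3,\ 64 \end{bmatrix} \times 2$
Conv3_x	$\begin{bmatrix} 3 \times 3 \times 3,\ 128 \\ 3 \times 3 \times 3,\ 128 \end{bmatrix} \times 2$
Conv4_x	$\begin{bmatrix} 3 \times 3 \times 3,\ 256 \\ 3 \times 3 \times 3,\ 256 \end{bmatrix} \times 2$
Conv5_x	$\begin{bmatrix} 3 \times 3 \times 3,\ 512 \\ 3 \times 3 \times 3,\ 512 \end{bmatrix} \times 2$
	Average pool, 400-d fc, softmax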
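To make the layer listing more concrete, the sketch below traces how the feature-map size would shrink through these layers. Only the kernel sizes, channel counts, and the max-pool stride come from the table; the 16×112×112 input clip, the padding values, the (1, 2, 2) stride of Conv1, and the stride-2 downsampling at the start of Conv3_x to Conv5_x are assumptions based on common 3D ResNet practice, not values stated here.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]  # (frames, height, width)


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def apply_layer(shape: Shape, kernel: int, stride: Shape, padding: int) -> Shape:
    """Apply the formula independently along frames, height, and width."""
    return tuple(conv_out(n, kernel, s, padding) for n, s in zip(shape, stride))


# (name, kernel, stride, padding, output channels).
# Strides and paddings are ASSUMED; kernels and channels follow the table.
# Within each Conv*_x stage only the first convolution is assumed to
# downsample; the remaining 3x3x3 convolutions keep the shape unchanged.
LAYERS = [
    ("Conv1", 7, (1, 2, 2), 3, 64),
    ("max pool", 3, (2, 2, 2), 1, 64),
    ("Conv2_x", 3, (1, 1, 1), 1, 64),
    ("Conv3_x", 3, (2, 2, 2), 1, 128),
    ("Conv4_x", 3, (2, 2, 2), 1, 256),
    ("Conv5_x", 3, (2, 2, 2), 1, 512),
]


def trace(input_shape: Shape) -> Dict[str, Shape]:
    """Return the (channels, frames, height, width) output of every stage."""
    shapes: Dict[str, Shape] = {}
    shape = input_shape
    for name, kernel, stride, padding, channels in LAYERS:
        shape = apply_layer(shape, kernel, stride, padding)
        shapes[name] = (channels,) + shape
    return shapes


shapes = trace((16, 112, 112))  # assumed clip: 16 frames of 112x112 pixels
for name, shape in shapes.items():
    print(f"{name:9s} -> {shape}")
```

Under these assumptions the last stage outputs a 512×1×4×4 tensor, which the average pool then collapses to a 512-dimensional vector ahead of the fully connected layer.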

References (22)

[1]	Wu X, Hauptmann A G, Ngo C W. Practical elimination of near-duplicates from web video search[C]. Augsburg:Proceedings of the Fifteenth ACM International Conference on Multimedia, 2007.
[2]	Shang L, Yang L, Wang F, et al. Real-time large scale near-duplicate web video retrieval[C]. Firenze:Proceedings of the Eighteenth ACM International Conference on Multimedia, 2010.
[3]	Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2):91-110. doi: 10.1023/B:VISI.0000029664.99615.94
[4]	Wang L, Bao Y, Li H, et al. Compact CNN based video representation for efficient video copy detection[C]. Reykjavik:Proceedings of the International Conference on Multimedia Modeling, 2017.
[5]	Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]. Boston:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[6]	Jiang Y G, Wang J. Partial copy detection in videos: A benchmark and an evaluation of popular methods[J]. IEEE Transactions on Big Data, 2016, 2(1):32-42. doi: 10.1109/TBDATA.2016.2530714
[7]	Kordopatis-Zilos G, Papadopoulos S, Patras I, et al. Near-duplicate video retrieval by aggregating intermediate CNN layers[C]. Reykjavik:Proceedings of the International Conference on Multimedia Modeling, 2017.
[8]	Douze M, Jégou H, Schmid C, et al. Compact video description for copy detection with precise temporal alignment[C]. Heidelberg:Proceedings of the European Conference on Computer Vision, 2010.
[9]	Tan H K, Ngo C W, Hong R, et al. Scalable detection of partial near-duplicate videos by visual-temporal consistency[C]. Beijing:Proceedings of the Seventeenth ACM International Conference on Multimedia, 2009.
[10]	Lu C W, Li F F, Chen Q. An image retrieval algorithm based on improved hashing method[J]. Electronic Science and Technology, 2020, 33(5):28-32.
[11]	Shen L, Hong R C, Zhang H R, et al. Video retrieval with similarity-preserving deep temporal hashing[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(4):1-16.
[12]	Ji S W, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1):221-231. doi: 10.1109/TPAMI.2012.59
[13]	Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition[C]. Venice:Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[14]	He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]. Las Vegas:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[15]	Lin K, Yang H F, Hsiao J H, et al. Deep learning of binary hash codes for fast image retrieval[C]. Boston:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015.
[16]	Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human action classes from videos in the wild[EB/OL]. (2012-12-01)[2020-10-11]. http://crcv.ucf.edu/data/ucf101.php.
[17]	Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large video database for human motion recognition[C]. Barcelona:Proceedings of the International Conference on Computer Vision, 2011.
[18]	Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. Santiago:Proceedings of the IEEE International Conference on Computer Vision, 2015.
[19]	Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[C]. Berkeley:Proceedings of the Forty-seventh Annual IEEE Symposium on Foundations of Computer Science, 2006.
[20]	Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(12):2916-2929. doi: 10.1109/TPAMI.2012.193
[21]	Liong V E, Lu J, Tan Y P, et al. Deep video hashing[J]. IEEE Transactions on Multimedia, 2016, 19(6):1209-1219. doi: 10.1109/TMM.2016.2645404
[22]	Dong Y, Li J. Video retrieval based on deep convolutional neural network[C]. Shenzhen:Proceedings of the Third International Conference on Multimedia Systems and Signal Processing, 2018.

Method	UCF-101
	64 bits	128 bits	256 bits
LSH	0.605	0.671	0.710
ITQ	0.701	0.735	0.750
DH	0.723	0.759	0.778
DCNNH	0.747	0.783	0.796
Ours	0.815	0.821	0.831

Method	HMDB-51
	64 bits	128 bits	256 bits
LSH	0.356	0.393	0.431
ITQ	0.408	0.416	0.436
DH	0.424	0.433	0.443
DCNNH	0.485	0.451	0.467
Ours	0.529	0.534	0.542
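
The size of the improvement reported in the two results tables above can be read off with a little arithmetic. The sketch below copies the table values verbatim and, for each hash-code length, subtracts the best baseline score from the score of the proposed method. The dictionary and function names are illustrative, the row for the proposed method is keyed as `Ours` in both tables, and the values are called scores because the metric is not named in the tables.

```python
from typing import Dict, List, Tuple

BITS = (64, 128, 256)  # hash-code lengths used as table columns

# Values copied from the first and second results tables, in column order.
first_table: Dict[str, Tuple[float, ...]] = {
    "LSH": (0.605, 0.671, 0.710),
    "ITQ": (0.701, 0.735, 0.750),
    "DH": (0.723, 0.759, 0.778),
    "DCNNH": (0.747, 0.783, 0.796),
    "Ours": (0.815, 0.821, 0.831),
}
second_table: Dict[str, Tuple[float, ...]] = {
    "LSH": (0.356, 0.393, 0.431),
    "ITQ": (0.408, 0.416, 0.436),
    "DH": (0.424, 0.433, 0.443),
    "DCNNH": (0.485, 0.451, 0.467),
    "Ours": (0.529, 0.534, 0.542),
}


def gains_over_best_baseline(
    table: Dict[str, Tuple[float, ...]], ours: str = "Ours"
) -> List[float]:
    """Absolute gain of the proposed method over the strongest baseline per column."""
    gains = []
    for column in range(len(BITS)):
        best = max(scores[column] for name, scores in table.items() if name != ours)
        gains.append(round(table[ours][column] - best, 3))
    return gains


first_gains = gains_over_best_baseline(first_table)
second_gains = gains_over_best_baseline(second_table)
print(first_gains)   # [0.068, 0.038, 0.035]
print(second_gains)  # [0.044, 0.083, 0.075]
```

In both tables the strongest baseline at every code length is DCNNH, so these gains are margins over that method.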