Journal of Xidian University ›› 2022, Vol. 49 ›› Issue (4): 144-155. doi: 10.19665/j.issn1001-2400.2022.04.017

• Computer Science and Technology •

Method to recognize human action by using the convolutional block attention mechanism

GAO Deyong1,2, KANG Zibing1, WANG Song1,2, WANG Yangping1,3

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
    2. Gansu Provincial Engineering Research Center for Artificial Intelligence and Graphic and Image Processing, Lanzhou 730070, China
    3. Gansu Provincial Key Lab of System Dynamics and Reliability of Rail Transport Equipment, Lanzhou 730070, China
  • Received: 2021-03-24 Online: 2022-08-20 Published: 2022-08-15
  • Contact: Zibing KANG E-mail: 258680916@qq.com; 914764692@qq.com; wangsong@mail.lzjtu.cn; 1328396793@qq.com

Abstract:

In action recognition, attention mechanisms that locate regions of interest in an image sequence tend to focus on channel-level feature correlations and ignore the spatial location information of the features, so they cannot accurately identify the dynamic regions of a video. This paper therefore proposes an action recognition algorithm based on the attention mechanism and the convolutional LSTM. First, a ResNet-50 network extracts the feature representation of each video frame. The convolutional block attention module then applies channel attention to reweight the feature maps across convolution channels, followed by spatial attention over the resulting feature maps. This adjusts the weights of the convolutional feature maps and suppresses or reduces the influence of regions unrelated to the action. Because the long short-term memory network (LSTM) loses the spatial structure of image frames when processing spatiotemporal data, the convolutional long short-term memory network (ConvLSTM), which uses convolution operations to capture spatial correlations in the image, is adopted instead, giving a more complete representation of the video's attributes. The ConvLSTM models the sequence information of the features to produce frame-level predictions. Finally, the predictions of all frames are combined to determine the video classification. Experimental results on three public datasets show that the proposed method effectively highlights the key regions in the video and improves the accuracy of action recognition to a certain extent.

Key words: machine vision, action recognition, attention mechanism, region of interest, convolutional LSTM

CLC Number: 

  • TP391.4