Journal of Xidian University ›› 2021, Vol. 48 ›› Issue (5): 222-230.doi: 10.19665/j.issn1001-2400.2021.05.025


Survey of self-supervised video representation learning

TIAN Chunna, YE Yanyu, SHAN Xiao, DING Yuxuan, ZHANG Xiangnan

  School of Electronic Engineering, Xidian University, Xi'an 710071, China
  Received: 2021-05-21   Online: 2021-10-20   Published: 2021-11-09

Abstract:

Learning high-quality video representations helps machines understand video content accurately. Video representation based on supervised learning requires annotating massive amounts of video data, which is extremely time-consuming and laborious. Self-supervised video representation learning, which uses unannotated data, has therefore become a hot research topic. It exploits massive amounts of unlabeled data, using the spatio-temporal continuity of videos as supervision to design pretext tasks for representation learning, and then applies the learned representations to downstream tasks. Since there is no survey covering recent developments in self-supervised video representation learning, we analyze and summarize methods published mostly within the past three years. According to the information used in the pretext tasks, we categorize the methods into three groups: those based on temporal information, on spatio-temporal information, and on multi-modal information. We compare the experimental results of self-supervised video representation learning on two downstream tasks, action recognition and video retrieval, and analyze the advantages and disadvantages of the models and the reasons behind them. Finally, we summarize open issues and discuss promising directions for self-supervised video representation learning.
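To make the idea of temporal supervision concrete, the following is a minimal sketch (not drawn from any specific method in the survey) of how a temporal-order pretext task can generate labels from unannotated video: a clip is either kept in its natural frame order or shuffled, and a model would be trained to predict which case it sees. The function name and the use of NumPy arrays as stand-ins for video frames are illustrative assumptions.

```python
import numpy as np

def make_order_sample(clip, rng):
    """Build one self-supervised training sample from an unlabeled clip.

    clip: array of shape (T, H, W) or (T, H, W, C), frames in temporal order.
    Returns (clip_out, label): label 1 if frames keep their natural order,
    label 0 if they were shuffled. The label comes for free from the video
    itself, so no human annotation is needed.
    """
    if rng.random() < 0.5:
        return clip, 1  # positive sample: natural temporal order
    perm = rng.permutation(len(clip))
    # Re-draw in the rare case the permutation is the identity.
    while np.array_equal(perm, np.arange(len(clip))):
        perm = rng.permutation(len(clip))
    return clip[perm], 0  # negative sample: temporal order destroyed
```

A backbone network trained to solve this binary classification must attend to motion and temporal continuity, and the learned features can then be transferred to downstream tasks such as action recognition or video retrieval.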

Key words: self-supervised learning, video, multi-modal learning, unsupervised learning

CLC Number: TP391