Journal of Xidian University ›› 2021, Vol. 48 ›› Issue (5): 222-230. doi: 10.19665/j.issn1001-2400.2021.05.025

Survey of self-supervised video representation learning

TIAN Chunna, YE Yanyu, SHAN Xiao, DING Yuxuan, ZHANG Xiangnan

  1. School of Electronic Engineering, Xidian University, Xi'an 710071, China
  • Received: 2021-05-21 Online: 2021-10-20 Published: 2021-11-09
  • About the authors: TIAN Chunna (born 1980), female, professor, E-mail: chnatian@xidian.edu.cn | YE Yanyu (born 1997), female, master's student at Xidian University, E-mail: yyye_1@stu.xidian.edu.cn | SHAN Xiao (born 1998), female, master's student at Xidian University, E-mail: xshan@stu.xidian.edu.cn | DING Yuxuan (born 1995), male, doctoral student at Xidian University, E-mail: yxding@stu.xidian.edu.cn | ZHANG Xiangnan (born 1991), male, doctoral student at Xidian University, E-mail: zxnn81@outlook.com
  • Supported by:
    National Natural Science Foundation of China (61571354); National Natural Science Foundation of China (62173265)

Abstract:

Learning high-quality video representations helps machines understand video content more accurately. Supervised video representation learning requires annotating massive amounts of video data, which is extremely time-consuming and laborious, so self-supervised methods that need no annotated data have become an active research topic. Self-supervised video representation learning exploits large amounts of unlabeled data: it uses properties of the video itself, such as its spatiotemporal continuity, as supervision signals to design pretext (auxiliary) tasks for representation learning, and then applies the learned representations to downstream tasks. Because a survey of recent progress in this area is lacking, we first analyze and summarize self-supervised video representation learning algorithms, most of them published in the last three years, grouping them by the information their pretext tasks use: temporal information, spatiotemporal information, and multi-modal information. We then compare the experimental results of these models on two downstream tasks, action recognition and video retrieval, and analyze their strengths and weaknesses and the reasons behind them. Finally, we summarize the open problems in self-supervised video representation learning and discuss prospects for its development.

Key words: self-supervised learning, video, multi-modal learning, unsupervised learning

CLC number (Chinese Library Classification):

  • TP391