Journal of Xidian University ›› 2024, Vol. 51 ›› Issue (3): 136-146. doi: 10.19665/j.issn1001-2400.20231004

• Computer Science and Technology & Artificial Intelligence •

Texture-aware video inpainting algorithm based on the multi-attention mechanism

XIA Yilan1, WANG Xiumei1, CHENG Peitao2

  1. School of Electronic Engineering, Xidian University, Xi’an 710071, China
  2. School of Mechano-Electronic Engineering, Xidian University, Xi’an 710071, China
  • Received: 2023-03-13 Publication date: 2024-06-20 Release date: 2023-11-15
  • Corresponding author: CHENG Peitao (b. 1978), male, associate professor, E-mail: chengpeitao@163.com
  • About the authors: XIA Yilan (b. 1998), female, master's student at Xidian University, E-mail: ylxia@stu.xidian.edu.cn
    WANG Xiumei (b. 1978), female, professor, E-mail: wangxm@xidian.edu.cn
  • Funding:
    National Natural Science Foundation of China (62372355); National Natural Science Foundation of China (61972305); National Natural Science Foundation of China (61871308); Natural Science Basic Research Program of Shaanxi (2023-JC-ZD-39); Key Research and Development Program of Shaanxi (2021ZDLGY02-03)

Abstract:

Existing video inpainting methods cannot effectively exploit distant spatial content, which leads to implausible structures and textures in the inpainted results. To address this problem, a texture-aware video inpainting method based on a multi-attention mechanism is proposed. The method designs a multi-attention mechanism composed of multi-head spatial-temporal attention and single-image local attention to preserve global structure and enhance local texture: the multi-head spatial-temporal attention attends to the overall spatial-temporal information, while the single-image local attention refines local information through self-attention within local windows. In addition, a plug-and-play fast Fourier convolution residual block replaces the vanilla convolution in the feed-forward network, expanding the receptive field to the entire image and further strengthening the model's ability to capture global texture and structure information. The fast Fourier convolution residual block and the single-image local attention complement each other and jointly improve the quality of local texture inpainting. Experimental results on the YouTube-VOS and DAVIS datasets show that, although the proposed method ranks second to the best-performing method, Fuseformer, on objective quality metrics, it reduces the number of parameters and the running time by 54.8% and 21.5%, respectively, and generates inpainted content that is visually more realistic and semantically more plausible.

Key words: video inpainting, Transformer, fast Fourier convolution, multi-attention mechanism, texture-aware
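
The single-image local attention described in the abstract applies self-attention inside local windows of one frame. The abstract does not give the window size, the number of heads, or the projection layers, so the following is only a minimal NumPy sketch of generic window-based self-attention under assumed settings: the function names, the 8x8 feature map, the 4x4 window, and the random single-head projection matrices `Wq`, `Wk`, `Wv` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feat, win):
    # Split an (H, W, C) feature map into (num_windows, win*win, C) token groups.
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be multiples of win"
    x = feat.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_merge(windows, H, W, win):
    # Inverse of window_partition: back to an (H, W, C) feature map.
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def local_window_attention(feat, win, Wq, Wk, Wv):
    # Single-head self-attention computed independently inside each window.
    H, W, _ = feat.shape
    tokens = window_partition(feat, win)                 # (nW, win*win, C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v                   # (nW, win*win, C)
    return window_merge(out, H, W, win)

# Hypothetical toy input: one frame's 8x8 feature map with 4 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
out = local_window_attention(feat, 4, Wq, Wk, Wv)
```

Because every window is processed on its own, a change to one pixel affects only the outputs of the window that contains it. This locality is what lets such a module concentrate on fine texture while a separate global attention handles long-range structure.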

CLC Number:

  • TP391
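
The abstract states that a fast Fourier convolution residual block expands the receptive field to the entire image. The block's internals are not given on this page, so the sketch below illustrates only the general idea behind a Fourier-domain layer with a skip connection: transform the feature map with a real 2-D FFT, mix the stacked real and imaginary channels pointwise, apply a ReLU, transform back, and add the input. The function name `spectral_residual_block`, the random mixing matrix `weight`, and the tensor sizes are assumptions for illustration. Normalization layers, any local convolution branch, and training are omitted.

```python
import numpy as np

def spectral_residual_block(x, weight):
    # x: real feature map of shape (C, H, W).
    # weight: (2C, 2C) pointwise mixing matrix applied at every frequency
    # (the frequency-domain analogue of a 1x1 convolution).
    C, H, W = x.shape
    spec = np.fft.rfft2(x, axes=(-2, -1))                       # (C, H, W//2+1), complex
    stacked = np.concatenate([spec.real, spec.imag], axis=0)    # (2C, H, W//2+1)
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    mixed = np.maximum(mixed, 0.0)                              # ReLU
    spec_out = mixed[:C] + 1j * mixed[C:]
    y = np.fft.irfft2(spec_out, s=(H, W), axes=(-2, -1))        # back to (C, H, W)
    return x + y                                                # residual connection

# Hypothetical toy input: 3 channels on an 8x8 grid.
C, H, W = 3, 8, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
weight = 0.1 * rng.standard_normal((2 * C, 2 * C))
y = spectral_residual_block(x, weight)

# Perturb a single corner pixel and measure how far the change propagates.
x_perturbed = x.copy()
x_perturbed[0, 0, 0] += 1.0
diff = np.abs(spectral_residual_block(x_perturbed, weight) - y)
far_change = diff[:, H // 2:, W // 2:].max()   # response in the opposite quadrant
```

Each frequency coefficient depends on every pixel, so a pointwise operation in the frequency domain can influence locations far from the perturbed pixel within a single layer. A small spatial convolution would instead need many stacked layers to reach the same distance.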