Journal of Xidian University, 2025, Vol. 52, Issue (2): 57-84. doi: 10.19665/j.issn1001-2400.20240907
LIU Long 1, LI Haosheng 1, ZHANG Mengxuan 2, DU Ying 3, CHANG Yaqi 1, ZHANG Wenbo 1
Received:2024-05-05
Online:2025-04-20
Published:2024-09-25
Contact: ZHANG Mengxuan
E-mail:longliu@xidian.edu.cn;haoshengli@stu.xidian.edu.cn;mxzhang@xidian.edu.cn;duying@bfa.edu.cn;yqchang@stu.xidian.edu.cn;wbzhang@xidian.edu.cn
LIU Long, LI Haosheng, ZHANG Mengxuan, DU Ying, CHANG Yaqi, ZHANG Wenbo. Review of deep learning-based methods for driving facial animation[J]. Journal of Xidian University, 2025, 52(2): 57-84.
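| Focus | Year | Survey | Main content |
|---|---|---|---|
| Acquisition and processing of driving data sources | 2018 | Ref. [6] | Advances in 3D face modeling and data acquisition strategies |
| Acquisition and processing of driving data sources | 2019 | Ref. [7] | Survey of performance-driven expression animation and classification of driving modes |
| Acquisition and processing of driving data sources | 2022 | Ref. [8] | Classification of driving data sources and their application in animation |
| Driven data or objects | 2008 | Ref. [9] | Implementation methods for facial animation and analysis of their strengths and weaknesses |
| Driven data or objects | 2022 | Ref. [10] | Advances in speech-driven 3D facial animation |
| Driven data or objects | 2024 | Ref. [11] | Survey of facial modeling, animation, and rendering for 3D digital humans |
| Driving implementation methods | 2015 | Ref. [12] | Key technologies for speech-driven facial animation |
| Driving implementation methods | 2017 | Ref. [13] | Application of audio-visual mapping in facial animation |
| Driving implementation methods | 2020 | Ref. [14] | Evaluation criteria and methods for the quality of talking-head video generation |
| Driving implementation methods | 2021 | Ref. [15] | Survey classifying facial animation methods by driving purpose |
| Driving implementation methods | 2022 | Ref. [16] | Summary of audio-driven 3D lip-sync animation |
| Driving implementation methods | 2023 | Ref. [18] | Applications of deep learning in talking face generation |
| Driving implementation methods | 2024 | Ref. [20] | Latest research progress in visual speech analysis and generation |

| Year | Method | Motion representation | Main technical contribution |
|---|---|---|---|
| 2018 | X2Face | Embedding vector | Controls face generation using image, audio, and pose codes |
| 2018 | Vid2Vid | Embedding vector | First to use a GAN for image-to-image translation |
| 2019 | Monkey Net | Sparse keypoints | Proposes an unsupervised method that drives animation from keypoints to dense optical flow |
| 2019 | FOMM | Sparse keypoints + local affine transformations | Adds affine transformations to model more complex motion |
| 2020 | Fast Bi-layer | Embedding vector | Uses two generators to obtain the image texture (high frequency) and the main body (low frequency) separately |
| 2021 | HeadGAN | Dense optical flow | Based on 3DMM; extracts arbitrary facial geometry while disentangling facial identity from expression |
| 2021 | MRAA | Region heatmaps + global affine transformation | Uses regions instead of keypoints and introduces an additional global affine transformation to model camera motion |
| 2021 | Face-vid2vid | 3D keypoints + facial pose and expression | Uses multiple estimators to obtain head pose |
| 2022 | DAM | Deformable keypoints | Adds deformable keypoint constraints |
| 2022 | DaGAN++ | Sparse keypoints + local affine transformations | Combines keypoint extraction with depth estimation |
| 2022 | TPSM | TPS-transformed keypoints + global affine transformation | Introduces the TPS (thin-plate spline) motion model |
| 2022 | MoTrans | Keypoints + local affine transformations | Uses a Transformer to extract keypoints |
| 2022 | Face2Face ρ | 3DMM parameters + landmarks at different scales | Models the motion representation with multi-scale landmarks |
| 2024 | AniFaceDiff | 3DMM parameters + motion flow field | Uses the facial motion representation as the generation condition of a diffusion model |
| 2024 | DiffusionAct | 3DMM parameters + 3D landmarks | Uses DiffAE and a diffusion model to transfer head pose and facial expression accurately |

| Year | Method | Driving source | Main technical contribution |
|---|---|---|---|
| 2017 | Syn Obama | Audio | First audio-driven generation of facial animation video, synchronizing audio with the lips |
| 2017 | Speech2Vid | Audio | First single-stage audio-driven method, synchronizing audio with the lips |
| 2019 | CRGAN | Audio | Proposes a conditional recurrent generation network and designs three discriminators to improve quality |
| 2019 | DAVs | Audio/video | Disentangles video into identity and speech information; supports arbitrary input audio or video |
| 2019 | ATVG Net | Audio | MMCRNN-based generator; introduces a dynamically adjustable loss to maintain temporal continuity |
| 2020 | Wav2Lip | Audio | Improves generalization; audio can drive a person of arbitrary identity |
| 2021 | PC-AVS | Audio and video | Extracts pose information from an additional video; supports multimodal input |
| 2022 | EAMM | Audio and video | Extracts additional dynamic emotion features to produce emotional facial animation video |
| 2023 | StyleTalk | Audio and video | Accounts for temporal continuity; adds emotional style information to the video frame by frame |
| 2023 | DIRFA | Audio | Transformer-based probabilistic mapping network that derives facial animation from audio |
| 2023 | IP LAP | Audio | Uses a Transformer encoder to extract 3D keypoints of the lower half of the face |
| 2023 | DreamTalk | Audio | Introduces a style predictor into the diffusion model to predict expressions directly from audio |
| 2023 | DiT-Head | Audio | Uses Diffusion Transformers to extract high-resolution, multi-identity dynamic emotion features |
| 2024 | VLOGGER | Audio | Predicts 3D motion information directly from audio and uses it as the control condition |
| 2024 | Hallo | Audio | Uses hierarchical audio-driven visual synthesis, providing adaptive control over a variety of expressions |

| Dataset | Duration | Subjects | Vocabulary | Sentences | Pose and motion | Emotion | Recording environment |
|---|---|---|---|---|---|---|---|
| The GRID | 27.5 | 33 | 51 | 33k | × | × | Laboratory |
| LRW | 173 | 1k+ | 500 | 539k | √ | × | In the wild |
| VoxCeleb1 | 352 | 1.2k | N/A | 153.5k | √ | × | In the wild |
| VoxCeleb2 | 2.4k | 6.1k | N/A | 1.1m | √ | × | In the wild |
| LRS2 | 224.5 | 500+ | 59k | 140k+ | √ | × | In the wild |
| LRS3 | 438 | 5k+ | N/A | 152k | √ | × | In the wild |
| ObamaSet | 14 | 1 | N/A | N/A | × | × | In the wild |
| TCD-TIMIT | 11.1 | 62 | N/A | 6.9k | × | × | Laboratory |
| MEAD | 2 400 | 60 | N/A | 2.6k | √ | √ | Laboratory |
| MELD | 13.7 | 407 | 17k | 13.7k | √ | √ | In the wild |

| [1] | KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[J]. Advances in Neural Information Processing Systems 25. New York: Curran Associates, Inc., 2012:1097-1105. |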
| [2] | ELMAN J L. Finding Structure in Time[J]. Cognitive Science, 1990, 14(2):179-211. |
| [3] | GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative Adversarial Nets[J]. Advances in Neural Information Processing Systems, 2014, 27:1-9. |
| [4] | HO J, JAIN A, ABBEEL P. Denoising Diffusion Probabilistic Models[J]. Advances in Neural Information Processing Systems, 2020, 33:6840-6851. |
| [5] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need[J]. Advances in Neural Information Processing Systems, 2017, 30:1-11. |
| [6] | NOORA N M N, SUAIB N M, AHMAD M A, et al. Review on 3D Facial Animation Techniques[J]. International Journal of Engineering & Technology, 2018, 7(4.44):181-187. |
| [7] | WEI Wei, LIU Shangwu, DUAN Xiaodong, et al. A Summary of Performance-Driven 3D Expression Animation[J]. Journal of Dalian Minzu University, 2019, 21(1):69-77. |
| [8] | LIU Jinpeng, CHEN Peng, WANG Xi, et al. Critical Review of Human Face Reenactment Methods[J]. Journal of Image and Graphics, 2022, 27(9):2629-2651. |
| [9] | PAN Hongyan, LIU Yanghua, XU Guangyou. Review on Methods of Facial Synthesis[J]. Application Research of Computers, 2008, 2:327-331. |
| [10] | LIU Xianmei, LIU Lu, JIA Di, et al. Overview on Speech-Driven 3D Facial Animation Technology[J]. Computer Systems & Applications, 2022, 31(10):44-50. |
| [11] | ZHANG Y, SU R, YU J, et al. 3D Facial Modeling,Animation,and Rendering for Digital Humans:A Survey[J]. Neurocomputing, 2024:128168. |
| [12] | WANG Huihui, ZHAO Hui. Survey of Speech-Driven Facial Animation[J]. Modern Computer, 2015(5):54-59. |
| [13] | LI Xinyi, ZHANG Zhichao. Review of Speech Driven Facial Animation[J]. Computer Engineering and Applications, 2017, 53(22):21-28. doi: 10.3778/j.issn.1002-8331.1704-0345 |
| [14] | CHEN L, CUI G, KOU Z, et al. What Comprises A Good Talking-Head Video Generation?:A Survey and Benchmark(2020)[J/OL].[2020-05-07]. https://arxiv.org/abs/2005.03201. |
| [15] | FEI Jianwei, XIA Zhihua, YU Peipeng, et al. Survey of Face Synthesis[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(11):2025-2047. |
| [16] | HWANG J, PARK K. Audio-Driven Facial Animation:A Survey[C]// 2022 13th International Conference on Information and Communication Technology Convergence(ICTC).Piscataway:IEEE, 2022: 614-617. |
| [17] | EDWARDS P, LANDRETH C, POPŁAWSKI M, et al. JALI-Driven Expressive Facial Animation and Multilingual Speech in Cyberpunk 2077[C]// Special Interest Group on Computer Graphics and Interactive Techniques Conference Talks. New York: ACM, 2020:1-2. |
| [18] | TOSHPULATOV M, LEE W, LEE S. Talking Human Face Generation:A Survey[J]. Expert Systems with Applications, 2023:119678. |
| [19] | MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. Nerf:Representing Scenes as Neural Radiance Fields for View Synthesis[J]. Communications of the ACM, 2021, 65(1):99-106. |
| [20] | SHENG C, KUANG G, BAI L, et al. Deep Learning for Visual Speech Analysis:A Survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(9):6001-6022. |
| [21] | LYU S. Deepfake Detection:Current Challenges and Next Steps[C]// 2020 IEEE International Conference on Multimedia & Expo Workshops(ICMEW).Piscataway:IEEE, 2020: 1-6. |
| [22] | MIRSKY Y, LEE W. The Creation and Detection of Deepfakes:A survey[J]. ACM computing surveys(CSUR), 2021, 54(1):1-41. |
| [23] | PARKE F I. Computer Generated Animation of Faces[C]// Proceedings of the ACM Annual Conference-Volume 1.New York:ACM, 1972:451-457. |
| [24] | PARKE F I. A Parametric Model for Human Faces[M]. Utah: The University of Utah,1974. |
| [25] | PLATT S M, BADLER N I. Animating Facial Expressions[C]// Proceedings of the 8th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM, 1981:245-252. |
| [26] | BREGLER C, COVELL M, SLANEY M. Video Rewrite:Visual Speech Synthesis from Video[C]// Audio-Visual Speech Processing:Computational & Cognitive Science Approaches.Rhodes:AVSP, 1997:153-156. |
| [27] | BRAND M. Voice Puppetry[C]// Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM, 1999:21-28. |
| [28] | CHAI J, XIAO J, HODGINS J. Vision-Based Control of 3D Facial Animation[C]// Symposium on Computer Animation. New York: ACM, 2003,2. |
| [29] | WILLIAMS L. Performance-Driven Facial Animation[C]// ACM SIGGRAPH 2006 Courses.New York: ACM,2006:16-es. |
| [30] | HOCHREITER S, SCHMIDHUBER J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8):1735-1780. doi: 10.1162/neco.1997.9.8.1735 pmid: 9377276 |
| [31] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words:Transformers for Image Recognition at Scale(2020)[J/OL].[2021-06-03]. https://arxiv.org/abs/2010.11929. |
| [32] | RADFORD A, METZ L, CHINTALA S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks(2015)[J/OL].[2016-01-07]. https://arxiv.org/abs/1511.06434. |
| [33] | MIRZA M, OSINDERO S. Conditional Generative Adversarial Nets(2014)[J/OL].[2014-11-06]. https://arxiv.org/abs/1411.1784. |
| [34] | NEWELL A, YANG K, DENG J. Stacked Hourglass Networks for Human Pose Estimation[C]//Computer Vision-ECCV 2016. Berlin:Springer, 2016:483-499. |
| [35] | RONNEBERGER O, FISCHER P, BROX T. U-Net:Convolutional Networks for Biomedical Image Segmentation[C]//Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015. Berlin:Springer, 2015:234-241. |
| [36] | WILES O, KOEPKE A, ZISSERMAN A. X2face:A Network for Controlling Face Generation Using Images,Audio,and Pose Codes[C]//Proceedings of the European Conference on Computer Vision(ECCV). Berlin:Springer, 2018:670-686. |
| [37] | ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-Image Translation with Conditional Adversarial Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:1125-1134. |
| [38] | SIMONYAN K, ZISSERMAN A. Very Deep Convolutional Networks for Large-Scale Image Recognition(2014)[J/OL].[2015-04-10]. https://arxiv.org/abs/1409.1556. |
| [39] | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. Animating Arbitrary Objects Via Deep Motion Transfer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:2377-2386. |
| [40] | LI K, XU F, WANG J, et al. A Data-Driven Approach for Facial Expression Synthesis in Video[C]// 2012 IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE, 2012: 57-64. |
| [41] | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. First Order Motion Model for Image Animation[C]// Advances in Neural Information Processing Systems 32(NeurIPS 2019). New York: Curran Associates,Inc,2019:7137-7147. |
| [42] | SIAROHIN A, WOODFORD O J, REN J, et al. Motion Representations for Articulated Animation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2021:13653-13662. |
| [43] | TAO J, WANG B, XU B, et al. Structure-Aware Motion Transfer with Deformable Anchor Model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2022:3637-3646. |
| [44] | ZHAO J, ZHANG H. Thin-plate Spline Motion Model for Image Animation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2022:3657-3666. |
| [45] | WANG T C, MALLYA A, LIU M Y. One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2021:10039-10049. |
| [46] | YANG K, CHEN K, GUO D, et al. Face2Face ρ:Real-Time High-Resolution One-Shot Face Reenactment[C]//European Conference on Computer Vision. Berlin:Springer, 2022:55-71. |
| [47] | BLANZ V, VETTER T. A Morphable Model for The Synthesis of 3D Faces[J]. Seminal Graphics Papers:Pushing the Boundaries, 2023, 2:157-164. |
| [48] | LIU K, SU Y C, CANG R, et al. Controllable One-Shot Face Video Synthesis with Semantic Aware Prior(2023)[J/OL].[2023-04-27]. https://arxiv.org/abs/2304.14471. |
| [49] | ZHANG H, REN Y, CHEN Y, et al. Exploiting Multiple Guidance from 3DMM for Face Reenactment[C]//The AAAI-23 Workshop on Creative AI Across Modalities. Reston:AAAI, 2023:1-8. |
| [50] | ZHANG Z, DING Y. Adaptive Affine Transformation:A Simple and Effective Operation for Spatial Misaligned Image Generation[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022:1167-1176. |
| [51] | TAO J, WANG B, GE T, et al. Motion Transformer for Unsupervised Image Animation[C]//European Conference on Computer Vision. Berlin:Springer, 2022:702-719. |
| [52] | HA S, KERSNER M, KIM B, et al. Marionette:Few-Shot Face Reenactment Preserving Identity of Unseen Targets[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Reston:AAAI, 2020:10893-10900. |
| [53] | TRAN P, ZAKHAROV E, HO L N, et al. VOODOO 3D:Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2024:10336-10348. |
| [54] | WANG T C, LIU M Y, ZHU J Y, et al. Video-to-Video Synthesis(2018)[J/OL].[2018-12-03]. https://arxiv.org/abs/1808.06601. |
| [55] | WANG T C, LIU M Y, ZHU J Y, et al. High-Resolution Image Synthesis and Semantic Manipulation with Conditional Gans[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2018:8798-8807. |
| [56] | WANG T C, LIU M Y, TAO A, et al. Few-Shot Video-to-Video Synthesis(2019)[J/OL].[2019-10-28]. https://arxiv.org/abs/1910.12713. |
| [57] | PARK T, LIU M Y, WANG T C, et al. Semantic Image Synthesis with Spatially-Adaptive Normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:2337-2346. |
| [58] | BURKOV E, PASECHNIK I, GRIGOREV A, et al. Neural Head Reenactment with Latent Pose Descriptors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2020:13786-13795. |
| [59] | DOUKAS M C, ZAFEIRIOU S, SHARMANSKA V. Headgan:One-Shot Neural Head Synthesis and Editing[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway:IEEE, 2021:14398-14407. |
| [60] | ZAKHAROV E, IVAKHNENKO A, SHYSHEYA A, et al. Fast Bi-Layer Neural Synthesis of One-Shot Realistic Head Avatars[C]//Computer Vision-ECCV 2020. Berlin:Springer, 2020:524-540. |
| [61] | HONG F T, ZHANG L, SHEN L, et al. Depth-Aware Generative Adversarial Network for Talking Head Video Generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2022:3397-3406. |
| [62] | HONG F T, SHEN L, XU D. DaGAN++:Depth-Aware Generative Adversarial Network for Talking Head Video Generation(2023)[J/OL].[2023-12-10]. https://arxiv.org/abs/2305.06225. |
| [63] | BEHROUZI T, SHAHROUDNEJAD A, MOUSAVI P. MaskRenderer:3D-Infused Multi-Mask Realistic Face Reenactment(2023)[J/OL].[2023-09-10]. https://arxiv.org/abs/2309.05095. |
| [64] | GAO Y, ZHOU Y, WANG J, et al. High-Fidelity and Freely Controllable Talking Head Video Generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:5609-5619. |
| [65] | LI X, DE MELLO S, LIU S, et al. Generalizable One-Shot 3D Neural Head Avatar[J]. Advances in Neural Information Processing Systems, 2024, 36:1-12. |
| [66] | XUE H, LING J, TANG A, et al. High-Fidelity Face Reenactment Via Identity-Matched Correspondence Learning[J]. ACM Transactions on Multimedia Computing,Communications and Applications, 2023, 19(3):1-23. |
| [67] | BOUNARELI S, TZELEPIS C, ARGYRIOU V, et al. Hyperreenact:One-Shot Reenactment Via Jointly Learning to Refine and Retarget Faces[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway:IEEE, 2023:7149-7159. |
| [68] | BOUNARELI S, TZELEPIS C, ARGYRIOU V, et al. One-Shot Neural Face Reenactment Via Finding Directions in GAN’s Latent Space[J]. International Journal of Computer Vision, 2024:1-31. |
| [69] | CHEN K, SENEVIRATNE S, WANG W, et al. AniFaceDiff:High-Fidelity Face Reenactment Via Facial Parametric Conditioned Diffusion Models(2024)[J/OL].[2024-12-02]. https://arxiv.org/abs/2406.13272. |
| [70] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-Resolution Image Synthesis with Latent Diffusion Models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2022:10684-10695. |
| [71] | BOUNARELI S, TZELEPIS C, ARGYRIOU V, et al. DiffusionAct:Controllable Diffusion Autoencoder for One-Shot Face Reenactment(2024)[J/OL].[2024-03-25]. https://arxiv.org/abs/2403.17217. |
| [72] | ZENG B, LIU X, GAO S, et al. Face Animation with an Attribute-Guided Diffusion Model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:628-637. |
| [73] | WANG Q, ZHANG J, XU C, et al. DiffFAE:Advancing High-fidelity One-Shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation(2024)[J/OL].[2024-03-26]. https://arxiv.org/abs/2403.17664. |
| [74] | KIM G, SHIM H, KIM H, et al. Diffusion Video Autoencoders:Toward Temporally Consistent Face Video Editing Via Disentangled Video Encoding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:6091-6100. |
| [75] | CHUNG J S, JAMALUDIN A, ZISSERMAN A. You Said That?(2017)[J/OL].[2017-07-18]. https://arxiv.org/abs/1705.02966. |
| [76] | SUWAJANAKORN S, SEITZ S M, KEMELMACHER-SHLIZERMAN I. Synthesizing Obama:Learning Lip Sync from Audio[J]. ACM Transactions on Graphics(ToG), 2017, 36(4):1-13. |
| [77] | JAMALUDIN A, CHUNG J S, ZISSERMAN A. You Said That?:Synthesising Talking Faces From Audio[J]. International Journal of Computer Vision, 2019, 127:1767-1779. |
| [78] | ZHOU H, SUN Y, WU W, et al. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2021:4176-4186. |
| [79] | ZHOU Y, HAN X, SHECHTMAN E, et al. MakeItTalk:Speaker-Aware Talking-Head Animation[J]. ACM Transactions on Graphics(TOG), 2020, 39(6):1-15. |
| [80] | HUANG Z, XU W, YU K. Bidirectional LSTM-CRF Models for Sequence Tagging(2015)[J/OL].[2015-08-09]. https://arxiv.org/abs/1508.01991. |
| [81] | LU Y, CHAI J, CAO X. Live Speech Portraits:Real-Time Photorealistic Talking-Head Animation[J]. ACM Transactions on Graphics(TOG), 2021, 40(6):1-17. |
| [82] | JI X, ZHOU H, WANG K, et al. Audio-Driven Emotional Video Portraits[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2021:14080-14089. |
| [83] | JI X, ZHOU H, WANG K, et al. Eamm:One-Shot Emotional Talking Face Via Audio-Based Emotion-Aware Motion Model[C]// ACM SIGGRAPH 2022 Conference Proceedings.New York:ACM, 2022:1-10. |
| [84] | TAN S, JI B, DING Y, et al. Say Anything with Any Style[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Reston:AAAI, 2024:5088-5096. |
| [85] | WANG J, ZHAO Y, LIU L, et al. Emotional Talking Head Generation Based on Memory-Sharing and Attention-Augmented Networks(2023)[J/OL].[2023-06-06]. https://arxiv.org/abs/2306.03594. |
| [86] | WU H, ZHOU S, JIA J, et al. Speech-Driven 3D Face Animation with Composite and Regional Facial Movements[C]// Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023:6822-6830. |
| [87] | CHU Z, GUO K, XING X, et al. CorrTalk:Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation(2023)[J/OL].[2023-10-17]. https://arxiv.org/abs/2310.11295. |
| [88] | WANG S, LI L, DING Y, et al. One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Reston:AAAI, 2022:2531-2539. |
| [89] | HUANG R, ZHONG W, LI G. Audio-Driven Talking Head Generation with Transformer and 3D Morphable Model[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022:7035-7039. |
| [90] | WU R, YU Y, ZHAN F, et al. Audio-Driven Talking Face Generation with Diverse Yet Realistic Facial Animations(2023)[J/OL].[2023-04-18]. https://arxiv.org/abs/2304.08945. |
| [91] | ZHONG W, FANG C, CAI Y, et al. Identity-Preserving Talking Face Generation with Landmark and Appearance Priors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:9729-9738. |
| [92] | LING J, TAN X, CHEN L, et al. Stableface:Analyzing and Improving Motion Stability for Talking Face Generation[J]. IEEE Journal of Selected Topics in Signal Processing, 2023, 17(6):1232-1247. |
| [93] | MA Y, WANG S, HU Z, et al. Styletalk:One-Shot Talking Head Generation with Controllable Speaking Styles(2023)[J/OL].[2023-06-10]. https://arxiv.org/abs/2301.01081. |
| [94] | WANG J, ZHAO K, ZHANG S, et al. Lipformer:High-Fidelity and Generalizable Talking Face Generation with A Pre-Learned Facial Codebook[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:13844-13853. |
| [95] | WU X, HU P, WU Y, et al. Speech2lip:High-Fidelity Speech to Lip Generation by Learning from A Short Video[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway:IEEE, 2023:22168-22177. |
| [96] | XU C, ZHU J, ZHANG J, et al. High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:6609-6619. |
| [97] | TAN S, JI B, PAN Y. FlowVQTalker:High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2024:26317-26327. |
| [98] | LEE D, KIM C, YU S, et al. RADIO:Reference-Agnostic Dubbing Video Synthesis[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway:IEEE, 2024:4168-4178. |
| [99] | WANG S, MA Y, DING Y, et al. StyleTalk++:A Unified Framework for Controlling the Speaking Styles of Talking Heads[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(6):4331-4347. |
| [100] | PRAJWAL K R, MUKHOPADHYAY R, NAMBOODIRI V P, et al. A Lip Sync Expert Is All You Need for Speech to Lip Generation in The Wild[C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020:484-492. |
| [101] | CHEN L, MADDOX R K, DUAN Z, et al. Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:7832-7841. |
| [102] | ZHOU H, LIU Y, LIU Z, et al. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Reston:AAAI, 2019:9299-9306. |
| [103] | VOUGIOUKAS K, PETRIDIS S, PANTIC M. End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs[C]//CVPR Workshops. Piscataway:IEEE, 2019:37-40. |
| [104] | SAITO M, MATSUMOTO E, SAITO S. Temporal Generative Adversarial Nets with Singular Value Clipping[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway:IEEE, 2017:2830-2839. |
| [105] | SONG Y, ZHU J, LI D, et al. Talking Face Generation by Conditional Recurrent Adversarial Network[J/OL].[2018-04-13]. http://arxiv.org/abs/1804.04786v1. |
| [106] | GUAN J, ZHANG Z, ZHOU H, et al. Stylesync:High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:1505-1515. |
| [107] | ZHANG C, WANG C, ZHAO Y, et al. Dr2:Disentangled Recurrent Representation Learning for Data-Efficient Speech Video Synthesis[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway:IEEE, 2024:6204-6214. |
| [108] | XU M, LI H, SU Q, et al. Hallo:Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation(2024)[J/OL].[2024-06-16]. https://arxiv.org/abs/2406.08801. |
| [109] | CORONA E, ZANFIR A, BAZAVAN E G, et al. VLOGGER:Multimodal Diffusion for Embodied Avatar Synthesis(2024)[J/OL].[2024-03-13]. https://arxiv.org/abs/2403.08764. |
| [110] | SHEN S, ZHAO W, MENG Z, et al. Difftalk:Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2023:1982-1991. |
| [111] | MA Y, ZHANG S, WANG J, et al. Dreamtalk:When Expressive Talking Head Generation Meets Diffusion Probabilistic Models(2023)[J/OL].[2024-08-10]. https://arxiv.org/abs/2312.09767. |
| [112] | MIR A, ALONSO E, MONDRAGÓN E. DiT-Head:High-Resolution Talking Head Synthesis Using Diffusion Transformers(2023)[J/OL].[2023-12-11]. https://arxiv.org/abs/2312.06400. |
| [113] | ZHANG B, ZHANG X, CHENG N, et al. Emotalker:Emotionally Editable Talking Face Generation Via Diffusion Model[C]// ICASSP 2024-2024 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).Piscataway:IEEE, 2024: 8276-8280. |
| [114] | MUKHOPADHYAY S, SURI S, GADDE R T, et al. Diff2lip:Audio Conditioned Diffusion Models for Lip-synchronization[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway:IEEE, 2024:5292-5302. |
| [115] | TAN S, JI B, PAN Y. Style2talker:High-Resolution Talking Head Generation with Emotion Style and Art Style[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Reston:AAAI, 2024:5079-5087. |
| [116] | XU S, CHEN G, GUO Y X, et al. Vasa-1:Lifelike Audio-Driven Talking Faces Generated in Real Time(2024)[J/OL].[2024-10-31]. https://arxiv.org/abs/2404.10667. |
| [117] | ZHANG C, WANG C, ZHANG J, et al. Dream-Talk:Diffusion-Based Realistic Emotional Audio-Driven Method for Single Image Talking Face Generation(2023)[J/OL].[2023-12-21]. https://arxiv.org/abs/2312.13578. |
| [118] | COOKE M, BARKER J, CUNNINGHAM S, et al. An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition[J]. The Journal of the Acoustical Society of America, 2006, 120(5):2421-2424. |
| [119] | CHUNG J S, ZISSERMAN A. Lip Reading in The Wild[C]// Computer Vision-ACCV 2016: 13th Asian Conference on Computer Vision.Berlin:Springer,2017:87-103. |
| [120] | NAGRANI A, CHUNG J S, ZISSERMAN A. Voxceleb:A Large-Scale Speaker Identification Dataset(2017)[J/OL].[2018-05-30]. https://arxiv.org/abs/1706.08612. |
| [121] | CHUNG J S, NAGRANI A, ZISSERMAN A. Voxceleb2:Deep Speaker Recognition(2018)[J/OL].[2018-06-27]. https://arxiv.org/abs/1806.05622. |
| [122] | AFOURAS T, CHUNG J S, SENIOR A, et al. Deep Audio-Visual Speech Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 44(12):8717-8727. |
| [123] | AFOURAS T, CHUNG J S, ZISSERMAN A. LRS3-TED:A Large-Scale Dataset for Visual Speech Recognition(2018)[J/OL].[2018-10-28]. https://arxiv.org/abs/1809.00496. |
| [124] | HARTE N, GILLEN E. TCD-TIMIT:An Audio-Visual Corpus of Continuous Speech[J]. IEEE Transactions on Multimedia, 2015, 17(5):603-615. |
| [125] | WANG K, WU Q, SONG L, et al. Mead:A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation[C]//European Conference on Computer Vision. Berlin:Springer, 2020:700-717. |
| [126] | PORIA S, HAZARIKA D, MAJUMDER N, et al. MELD:A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations(2018)[J/OL].[2019-06-04]. https://arxiv.org/abs/1810.02508. |
| [127] | GAROFOLO J S. TIMIT Acoustic Phonetic Continuous Speech Corpus[J]. Linguistic Data Consortium,1993. |
| [128] | CHUNG J S, ZISSERMAN A. Out of Time:Automated Lip Sync in The Wild[C]// Computer Vision-ACCV 2016 Workshops:ACCV 2016 International Workshops.Berlin:Springer,2017: 251-263. |
| [129] | WANG Z, BOVIK A C, SHEIKH H R, et al. Image Quality Assessment:from Error Visibility to Structural Similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4):600-612. |
| [130] | SONAWANE R V, SUJATHA K. CPBD Metric for Blur Detection of a No-Reference Image & Its Removal[J]. International Journal of Innovative Research in Science,Engineering and Technology, 2016, 5(8):15515-15521. |
| [131] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. Gans Trained by A Two Time-Scale Update Rule Converge to A Local Nash Equilibrium[J]. Advances in Neural Information Processing Systems, 2017, 30:1-12. |
| [132] | NARVEKAR N D, KARAM L J. A No-Reference Image Blur Metric Based on The Cumulative Probability of Blur Detection(CPBD)[J]. IEEE Transactions on Image Processing, 2011, 20(9):2678-2683. |
| [133] | SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the Inception Architecture for Computer Vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:2818-2826. |