[1] Zuo Bin, Li Feifei. An effective segmentation method for COVID-19 CT image based on attention mechanism and Inf-Net[J]. Electronic Science and Technology, 2023, 36(2):22-28.
[2] Lin Chaowei, Li Feifei, Chen Qiu. Global and local scene representation method based on deep convolutional features[J]. Electronic Science and Technology, 2022, 35(4):20-27.
[3] Mittal T, Bhattacharya U, Chandra R, et al. M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues[C]. New York: Proceedings of the AAAI Conference on Artificial Intelligence, 2020:1359-1367.
[4] Liu K, Li Y, Xu N, et al. Learn to combine modalities in multimodal deep learning[EB/OL]. (2018-05-29)[2023-03-10].
[5] Tzirakis P, Chen J, Zafeiriou S, et al. End-to-end multimodal affect recognition in real-world environments[J]. Information Fusion, 2021, 68(1):46-53.
[6] Lyu H, Sha N, Qin S, et al. Manifold denoising by nonlinear robust principal component analysis[J]. Advances in Neural Information Processing Systems, 2019, 32(1):2-12.
[7] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-03-09]. https://arxiv.53yu.com/abs/1810.04805.
[8] Yang K, Xu H, Gao K. CM-BERT: Cross-modal BERT for text-audio sentiment analysis[C]. Beijing: Proceedings of the Twenty-eighth ACM International Conference on Multimedia, 2020:521-528.
[9] Rahman W, Hasan M K, Lee S, et al. Integrating multimodal information in large pretrained transformers[C]. Online: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020:2359-2371.
[10] Kim K, Park S. AOBERT: All-modalities-in-one BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92(6):37-45.
[11] Zadeh A, Liang P P, Poria S, et al. Multi-attention recurrent network for human communication comprehension[C]. New Orleans: Proceedings of the AAAI Conference on Artificial Intelligence, 2018:1145-1156.
[12] Wu Y, Schuster M, Chen Z, et al. Google's neural machine translation system: Bridging the gap between human and machine translation[EB/OL]. (2016-09-26)[2023-03-09]. https://arxiv.53yu.com/abs/1609.08144.
[13] Ba J L, Kiros J R, Hinton G E. Layer normalization[EB/OL]. (2016-07-21)[2023-03-09]. https://arxiv.53yu.com/abs/1607.06450.
[14] Zadeh A, Zellers R, Pincus E, et al. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[EB/OL]. (2016-06-20)[2023-03-09]. https://arxiv.53yu.com/abs/1606.06259.
[15] Zadeh A A B, Liang P P, Poria S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]. Melbourne: Proceedings of the Fifty-sixth Annual Meeting of the Association for Computational Linguistics, 2018:2236-2246.
[16] Ekman P, Friesen W V, Ancoli S. Facial signs of emotional experience[J]. Journal of Personality and Social Psychology, 1980, 39(6):1125-1132.
[17] Yang B, Wu L, Zhu J, et al. Multimodal sentiment analysis with two-phase multitask learning[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30(10):2015-2024.
[18] Degottex G, Kane J, Drugman T, et al. COVAREP: A collaborative voice analysis repository for speech technologies[C]. Florence: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014:960-964.
[19] Drugman T, Alwan A. Joint robust voicing detection and pitch estimation based on residual harmonics[EB/OL]. (2019-12-28)[2023-03-10]. https://arxiv.53yu.com/abs/2001.00459.
[20] Alku P, Bäckström T, Vilkman E. Normalized amplitude quotient for parametrization of the glottal flow[J]. Journal of the Acoustical Society of America, 2002, 112(2):701-710.
[21] Kane J, Gobl C. Wavelet maxima dispersion for breathy to tense voice discrimination[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(6):1170-1179.
[22] Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation[C]. Doha: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014:1532-1543.
[23] Zadeh A, Chen M, Poria S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2023-03-09]. https://arxiv.53yu.com/abs/1707.07250.
[24] Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[EB/OL]. (2018-05-31)[2023-03-09]. https://arxiv.53yu.com/abs/1806.00064.
[25] Tsai Y H H, Liang P P, Zadeh A, et al. Learning factorized multimodal representations[EB/OL]. (2019-05-14)[2023-03-08]. https://arxiv.53yu.com/abs/1806.06176.
[26] Sun Z, Sarma P, Sethares W, et al. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis[C]. New York: Proceedings of the AAAI Conference on Artificial Intelligence, 2020:8992-8999.
[27] Tsai Y H H, Bai S, Liang P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]. Florence: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019:6558-6562.
[28] Hazarika D, Zimmermann R, Poria S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis[C]. Beijing: Proceedings of the Twenty-eighth ACM International Conference on Multimedia, 2020:1122-1131.
[29] Han W, Chen H, Poria S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[EB/OL]. (2021-09-16)[2023-03-09]. https://arxiv.53yu.com/abs/2109.00412.
[30] Wang Y, Shen Y, Liu Z, et al. Words can shift: Dynamically adjusting word representations using nonverbal behaviors[C]. Honolulu: Proceedings of the AAAI Conference on Artificial Intelligence, 2019:7216-7223.
[31] Chauhan D S, Akhtar M S, Ekbal A, et al. Context-aware interactive attention for multi-modal sentiment and emotion analysis[C]. Hong Kong: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the Ninth International Joint Conference on Natural Language Processing, 2019:5647-5657.