Electronic Science and Technology ›› 2024, Vol. 37 ›› Issue (10): 81-87. doi: 10.16180/j.cnki.issn1007-7820.2024.10.011



Emotion Recognition Algorithm Based on Multimodal Cross-Interaction

ZHANG Hui, LI Feifei   

  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2023-03-10  Online: 2024-10-15  Published: 2024-11-04
  • About the authors: ZHANG Hui (1998-), female, master's degree candidate. Research interests: computer vision and pattern recognition.
    LI Feifei (1970-), female, PhD, professor. Research interests: multimedia information processing, image processing and pattern recognition, and information retrieval.
  • Supported by:
    The Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (ES2015XX)


Abstract:

Due to the limitations of single-modality emotion recognition, researchers have shifted their focus to multimodal emotion recognition, which centers on two problems: optimally extracting the features of each modality and effectively fusing the extracted features. This study proposes an emotion recognition method based on multimodal cross-interaction to capture the diversity of modality expressions. A separate encoder for each modality extracts features carrying emotional information, and stacked interaction modules built on inter-modal attention model the latent relationships among the visual, textual and audio modalities. Experiments are conducted on the CMU-MOSI and CMU-MOSEI emotion recognition datasets, which contain text, audio and visual data. On the five metrics Acc2 (Accuracy2), Acc7 (Accuracy7), F1, MAE (Mean Absolute Error) and Corr (Correlation), the proposed method achieves 86.5%, 47.7%, 86.4%, 0.718 and 0.776 on CMU-MOSI, and 83.4%, 51.5%, 83.4%, 0.566 and 0.737 on CMU-MOSEI, respectively. These results show that the proposed method yields a significant performance improvement and that the cross-modal mapping and mutual representation mechanism outperforms single-modality representation methods.
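For readers unfamiliar with the stacked inter-modal attention described above, the following is a minimal, illustrative sketch (not the authors' implementation) of a single cross-modal attention block in which the features of one modality attend to those of another. The class name CrossModalBlock, the dimensions and the layer choices are assumptions made only for this example.

```python
# Minimal sketch of a cross-modal attention block (illustrative only).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One modality (query) attends to another modality (key/value)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_mod, key_mod):
        # Queries come from one modality, keys/values from the other,
        # so information flows across modalities.
        attended, _ = self.attn(query_mod, key_mod, key_mod)
        x = self.norm1(query_mod + attended)
        return self.norm2(x + self.ff(x))

# Example: text features enriched by audio features.
text = torch.randn(2, 20, 128)    # (batch, text length, dim)
audio = torch.randn(2, 50, 128)   # (batch, audio length, dim)
enriched_text = CrossModalBlock()(text, audio)  # -> (2, 20, 128)
```

In the paper, several such inter-modal attention blocks are stacked and applied across the text-audio-visual pairs before fusion; the exact arrangement follows the paper rather than this sketch.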

Key words: multimodal, feature fusion, emotion recognition, sentiment analysis, attention mechanism, Transformer, bidirectional encoder representations from Transformers (BERT), interactive mapping

CLC Number: 

  • TP391.41