CHEN Shihan,MA Hongjiang,WANG Ting,et al.Video Sentiment Analysis Technology based on Multimodal Fusion[J].Journal of Chengdu University of Information Technology,2022,37(06):656-661.[doi:10.16836/j.cnki.jcuit.2022.06.007]
Video Sentiment Analysis Technology based on Multimodal Fusion
- Title:
- Video Sentiment Analysis Technology based on Multimodal Fusion
- Article ID:
- 2096-1618(2022)06-0656-06
- CLC Number:
- TP391
- Document Code:
- A
- Abstract:
- A method for multimodal sentiment recognition in video is introduced in this paper. A video usually expresses a single sentiment theme through multimodal information such as text, audio, and visual images, and how to fuse the information carried by the different heterogeneous modalities of the same video and make the fullest possible use of it is a key problem that currently needs to be overcome. This paper uses mutual information maximization to efficiently fuse the multimodal heterogeneous data in a video, such as text, audio, and visual images, eliminating as much of the discrepancy between modalities as possible, and finally recognizing and analyzing the sentiment of the video. Experiments are carried out on the public MOSEI multimodal dataset, and the results show that the MAE value reaches 55.4. Compared with several previous models, the proposed model performs better, and its construction is not cumbersome, which lays a good foundation for subsequent related research.
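The fusion strategy summarized above, maximizing the mutual information shared by the text, audio, and visual streams of the same video, is commonly realized with a contrastive lower bound such as InfoNCE (see refs. [16], [17] and [25]). The following PyTorch sketch is only an illustration of that idea, not the paper's exact network: the feature dimensions (768-dimensional BERT text features, 74-dimensional COVAREP audio features, 35-dimensional facial features, as in typical MOSEI pipelines), the module names, and the loss weight are all assumptions.

```python
# Minimal sketch (not the authors' exact architecture) of mutual-information-
# maximization-based fusion: an InfoNCE-style contrastive lower bound pulls
# paired modality representations together, while a simple L1 (MAE) head
# regresses the sentiment score. All dimensions and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def infonce_lower_bound(x, y, temperature=0.1):
    """InfoNCE estimate of the MI lower bound between paired batches x, y.

    x, y: (batch, dim) representations of the same videos in two modalities.
    Matching rows are positives; all other pairings in the batch are negatives.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature               # (batch, batch) similarities
    labels = torch.arange(x.size(0), device=x.device)
    # The negative cross-entropy over the batch is the MI lower bound.
    return -F.cross_entropy(logits, labels)


class FusionRegressor(nn.Module):
    """Project each modality to a shared space, concatenate, regress sentiment."""

    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35, hidden=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.head = nn.Sequential(nn.Linear(3 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, text, audio, visual):
        t = self.text_proj(text)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        score = self.head(torch.cat([t, a, v], dim=-1)).squeeze(-1)
        return score, (t, a, v)


# Toy usage: random utterance-level features stand in for BERT/COVAREP/facial
# features; in practice these come from the MOSEI feature-extraction pipeline.
model = FusionRegressor()
text, audio, visual = torch.randn(16, 768), torch.randn(16, 74), torch.randn(16, 35)
target = torch.randn(16)                           # sentiment labels, e.g. in [-3, 3]

pred, (t, a, v) = model(text, audio, visual)
mae_loss = F.l1_loss(pred, target)                 # MAE regression objective
mi_bound = (infonce_lower_bound(t, a)
            + infonce_lower_bound(t, v)) / 2       # maximize MI across modalities
loss = mae_loss - 0.1 * mi_bound                   # 0.1 is an assumed trade-off weight
loss.backward()
```

In this sketch the MI term is maximized jointly with the L1 regression loss, so reducing the discrepancy between modalities and fitting the sentiment score are optimized together; the actual weighting and encoders in the paper may differ.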
References:
[1] XI Chen.Multimodal sentiment analysis based on facial expressions,speech and text[D].Nanjing:Nanjing University of Posts and Telecommunications,2021.
[2] WANG Die.Research on multimodal fusion technology based on attention mechanism[D].Nanjing:Nanjing Normal University,2021.
[3] FENG Yaqin,SHEN Lingjie,HU Tingting,et al.Improving speech emotion recognition by fusing speech and text features[J].Journal of Data Acquisition and Processing,2019,34(4):625-631.
[4] QIN Fang,ZENG Weijia,LUO Jiawei,et al.Research on multimodal fusion image recognition based on deep learning[J].Information Technology,2022(4):29-34.
[5] MOU Zhijia,FU Yaru.A review of research on multimodal learning analytics[J].Modern Educational Technology,2021,31(6):23-31.
[6] Zadeh A,Chen M,Poria S,et al.Tensor Fusion Network for Multimodal Sentiment Analysis[C].Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,2017:1103-1114.
[7] XUE Qiwei,WU Xiru.Vehicle detection for autonomous driving systems based on multimodal feature fusion[J].Journal of Guangxi Normal University(Natural Science Edition),2022,40(2):37-48.
[8] Sun Z,Sarma P,Sethares W,et al.Learning relationships between text,audio,and video via deep canonical correlation for multimodal language analysis[C].Proceedings of the AAAI Conference on Artificial Intelligence,2020,34(5):8992-8999.
[9] YAN Zengxian,KONG Chao,OU Weihua.Research on face anti-spoofing algorithms based on multimodal fusion[J].Computer Technology and Development,2022,32(4):63-68.
[10] WANG Xuyang,DONG Shuai,SHI Jie.Multimodal sentiment analysis based on composite hierarchical fusion[J/OL].http://kns.cnki.net/kcms/detail/11.5602.TP.20220331.1739.003.html,2022(8):31.
[11] Tsai Y H,Bai S,Kolter J Z,et al.Multimodal Transformer for Unaligned Multimodal Language Sequences[C].Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,2019:6558-6569.
[12] Hazarika D,Zimmermann R,Poria S.MISA:Modality-invariant and -specific representations for multimodal sentiment analysis[C].Proceedings of the 28th ACM International Conference on Multimedia,2020:1122-1131.
[13] Makiuchi M R,Uto K,Shinoda K.Multimodal emotion recognition with high-level speech and text features[C].2021 IEEE Automatic Speech Recognition and Understanding Workshop(ASRU).IEEE,2021:350-357.
[14] Byun S W,Kim J H,Lee S P.Multi-Modal Emotion Recognition Using Speech Features and Text-Embedding[J].Applied Sciences,2021,11(17):7967.
[15] HUANG Huan,SUN Lijuan,CAO Ying,et al.Attention-based multimodal sentiment analysis of short videos[J].Journal of Graphics,2021,42(1):8-14.
[16] Poole B,Ozair S,Van Den Oord A,et al.On variational bounds of mutual information[C].International Conference on Machine Learning.PMLR,2019:5171-5180.
[17] Belghazi M I,Baratin A,Rajeshwar S,et al.Mutual information neural estimation[C].International conference on machine learning.PMLR,2018:531-540.
[18] Devlin J,Chang M W,Lee K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C].Proceedings of NAACL-HLT,2019:4171-4186.
[19] Hochreiter S,Schmidhuber J.Long short-term memory[J].Neural computation,1997,9(8):1735-1780.
[20] Tishby N,Zaslavsky N.Deep learning and the information bottleneck principle[C].2015 IEEE Information Theory Workshop.IEEE,2015:1-5.
[21] Alemi A A,Fischer I,Dillon J V,et al.Deep Variational Information Bottleneck[J].arXiv preprint arXiv:1612.00410,2016.
[22] Bachman P,Hjelm R D,Buchwalter W.Learning representations by maximizing mutual information across views[C].Proceedings of the 33rd International Conference on Neural Information Processing Systems,2019:15535-15545.
[23] Barber D,Agakov F.The IM algorithm:a variational approach to Information Maximization[C].Proceedings of the 16th International Conference on Neural Information Processing Systems,2003:201-208.
[24] Huber M F,Bailey T,Durrant-Whyte H,et al.On entropy approximation for Gaussian mixture random vectors[C].2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems.IEEE,2008:181-188.
[25] Gutmann M,Hyvärinen A.Noise-contrastive estimation:A new estimation principle for unnormalized statistical models[C].Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.JMLR Workshop and Conference Proceedings,2010:297-304.
[26] Zadeh A A B,Liang P P,Poria S,et al.Multimodal language analysis in the wild:CMU-MOSEI dataset and interpretable dynamic fusion graph[C].Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics,2018:2236-2246.
[27] Yuan J,Liberman M.Speaker identification on the SCOTUS corpus[J].The Journal of the Acoustical Society of America,2008,123(5):3878-3878.
[28] Degottex G,Kane J,Drugman T,et al.COVAREP—A collaborative voice analysis repository for speech technologies[C].2014 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2014:960-964.
[29] Yu W,Xu H,Yuan Z,et al.Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C].Proceedings of the AAAI Conference on Artificial Intelligence,2021,35(12):10790-10797.
[30] Liu Z,Shen Y,Lakshminarasimhan V B,et al.Efficient Low-rank Multimodal Fusion With Modality-Specific Factors[C].Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers),2018:2247-2256.
Similar Documents:
[1] TAO Quanhui,AN Junxiu,CHEN Hongsong.Multi-modal Sentiment Analysis based on Cross-modal Fusion ERNIE[J].Journal of Chengdu University of Information Technology,2022,37(05):501.[doi:10.16836/j.cnki.jcuit.2022.05.003]
Memo
Received:2022-07-19
Foundation items:Key R&D Projects of the Sichuan Provincial Department of Science and Technology(2021YFG0031,2022YFG0375); Sichuan Science and Technology Service Industry Demonstration Project(2021GFW130)