AN Junxiu,TIAN Maoyun.Research on Multi-scale Speech Emotion Recognition based on Dual Channel Feature Fusion[J].Journal of Chengdu University of Information Technology,2025,40(05):582-588.[doi:10.16836/j.cnki.jcuit.2025.05.002]
Research on Multi-scale Speech Emotion Recognition based on Dual Channel Feature Fusion
- Title:
- Research on Multi-scale Speech Emotion Recognition based on Dual Channel Feature Fusion
- Article ID:
- 2096-1618(2025)05-0582-07
- Keywords:
- deep learning; speech emotion recognition; attention mechanism; multi-scale; feature fusion
- CLC number:
- TP391
- Document code:
- A
- Abstract:
- Current speech emotion recognition technology faces severe challenges when processing speech data that differ in language, cultural background, and individual speaker characteristics. In response, this paper proposes a multi-scale feature attention network (MFAN). MFAN combines a main-channel feature extraction network (a CNN with two attention mechanisms to extract local features, and a BiGRU with Bahdanau attention to capture global features) with an auxiliary-channel feature extraction network (a Transformer encoder that mines multi-scale contextual information), enhancing both the depth and the breadth of the feature representation. Through a joint training strategy with two loss functions, MFAN encourages samples of the same class to cluster and samples of different classes to separate, improving the model's discriminative power. Experiments verify MFAN's strong performance on the RAVDESS, Emo-DB, SAVEE, and CASIA datasets, with accuracies of 90.97%, 94.39%, 94.97%, and 88.33%, respectively; compared with existing models, MFAN demonstrates stronger generalization and a finer ability to capture subtle emotional differences.
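Since the abstract describes the MFAN architecture and its joint training objective only at a high level, the following minimal PyTorch sketch may help map the description onto code. It is an illustration, not the authors' implementation: the layer sizes (`n_mels`, `hidden`, `n_classes`), the reduction of the CNN's two attention mechanisms to plain convolution blocks, the pooling choices, and the cross-entropy plus center-style clustering term (one plausible reading of the "dual loss function joint training strategy") are all assumptions.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive (Bahdanau-style) attention pooling over BiGRU outputs."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                        # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)  # attention weights over time
        return (w * h).sum(dim=1)                # weighted sum: (batch, dim)

class MFANSketch(nn.Module):
    """Dual-channel sketch: CNN + BiGRU main channel, Transformer-encoder auxiliary channel."""
    def __init__(self, n_mels=64, hidden=128, n_classes=8):
        super().__init__()
        # Main channel, local features: CNN stack (the paper additionally applies
        # two attention mechanisms here, i.e. SimAM [8] and coordinate attention [9]).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),     # collapse frequency axis, keep time
        )
        # Main channel, global features: BiGRU + Bahdanau attention over time.
        self.bigru = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.attn = BahdanauAttention(2 * hidden)
        # Auxiliary channel: Transformer encoder mining contextual information.
        enc = nn.TransformerEncoderLayer(d_model=n_mels, nhead=4, batch_first=True)
        self.aux = nn.TransformerEncoder(enc, num_layers=2)
        self.classifier = nn.Linear(2 * hidden + n_mels, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, time)
        t = self.cnn(x).squeeze(2).transpose(1, 2)                # (batch, time', 64)
        g, _ = self.bigru(t)                                      # (batch, time', 2*hidden)
        main = self.attn(g)                                       # (batch, 2*hidden)
        aux = self.aux(x.squeeze(1).transpose(1, 2)).mean(dim=1)  # (batch, n_mels)
        fused = torch.cat([main, aux], dim=1)    # dual-channel feature fusion
        return self.classifier(fused), fused

# Hypothetical dual-loss objective: cross-entropy separates classes, while a
# center-loss-style term pulls same-class embeddings toward shared centers.
def dual_loss(logits, feats, labels, centers, lam=0.01):
    ce = nn.functional.cross_entropy(logits, labels)
    center = ((feats - centers[labels]) ** 2).sum(dim=1).mean()
    return ce + lam * center

model = MFANSketch()
x = torch.randn(4, 1, 64, 100)             # four hypothetical log-mel spectrograms
logits, feats = model(x)
centers = torch.zeros(8, feats.shape[1])   # would be learnable parameters in practice
loss = dual_loss(logits, feats, torch.tensor([0, 1, 2, 3]), centers)
```

The structure the abstract emphasizes is visible in `forward`: the main channel yields a local/global feature via CNN, BiGRU, and additive attention; the auxiliary channel yields a contextual feature via the Transformer encoder; the two are concatenated before classification, and the second loss term acts on that fused embedding.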
References:
[1] Sun T W.End-to-end speech emotion recognition with gender information[J]. IEEE Access,2020,8:152423-152438.
[2] Song P,Zheng W,Yu Y,et al.Speech emotion recognition based on robust discriminative sparse regression[J]. IEEE Transactions on Cognitive and Developmental Systems,2020,13(2):343-353.
[3] Dey A,Chattopadhyay S,Singh P K,et al.A hybrid meta-heuristic feature selection method using golden ratio and equilibrium optimization algorithms for speech emotion recognition[J]. IEEE Access,2020,8:200953-200970.
[4] Yang L,Xie K,Wen C,et al.Speech emotion analysis of netizens based on bidirectional LSTM and PGCDBN[J]. IEEE Access,2021,9:59860-59872.
[5] Andayani F,Theng L B,Tsun M T,et al.Hybrid LSTM-transformer model for emotion recognition from speech audio files[J]. IEEE Access,2022,10:36018-36027.
[6] Atila O,Şengür A.Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition[J]. Applied Acoustics,2021,182:108260.
[7] Gao P Q,Huang H M,Fan Y H.Interactive speech emotion recognition fusing coordinate and multi-head attention mechanisms[J/OL]. Journal of Computer Applications,http://kns.cnki.net/kcms/detail/51.1307.TP.20231205.1603.014.html,2024-05-16.
[8] Yang L,Zhang R Y,Li L,et al.SimAM:A simple,parameter-free attention module for convolutional neural networks[C]. International Conference on Machine Learning.PMLR,2021:11863-11874.
[9] Hou Q,Zhou D,Feng J.Coordinate attention for efficient mobile network design[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.2021:13713-13722.
[10] Bahdanau D,Cho K,Bengio Y.Neural machine translation by jointly learning to align and translate[J]. arXiv preprint arXiv:1409.0473,2014.
[11] Vaswani A,Shazeer N,Parmar N,et al.Attention is all you need[J]. Advances in neural information processing systems,2017,30.
[12] Livingstone S R,Russo F A.The Ryerson Audio-Visual Database of Emotional Speech and Song(RAVDESS):A dynamic,multimodal set of facial and vocal expressions in North American English[J]. PloS one,2018,13(5):e0196391.
[13] Burkhardt F,Paeschke A,Rolfes M,et al.A database of German emotional speech[C]. Interspeech.2005,5:1517-1520.
[14] Jackson P,Haq S.Surrey audio-visual expressed emotion(SAVEE)database[J]. University of Surrey:Guildford,UK,2014.
[15] Zhang J,Jia H.Design of speech corpus for mandarin text to speech[C]. The blizzard challenge 2008 workshop,2008.
[16] Jothimani S,Premalatha K.MFF-SAug:Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network[J]. Chaos,Solitons & Fractals,2022,162:112512.
[17] Qiao D,Chen Z J,Deng L,et al.Convolutional neural network based Chinese speech emotion recognition method with improved speech processing[J]. Computer Engineering,2022,48(2):281-290.
[18] Ancilin J,Milton A.Improved speech emotion recognition with Mel frequency magnitude coefficient[J]. Applied Acoustics,2021,179:108046.
Similar References:
[1] ZHANG Bin,WANG Qiang.An Improved Convolution Neural Network Image Classification Method[J].Journal of Chengdu University of Information Technology,2019,(01):39.[doi:10.16836/j.cnki.jcuit.2019.01.009]
[2] TANG Ming-xuan,LI Xiao-jie,ZHOU Ji-liu.Automatic Retinal Vascular Segmentation Method based on Densely Connected Convolution Neural Network[J].Journal of Chengdu University of Information Technology,2018,(05):525.[doi:10.16836/j.cnki.jcuit.2018.05.007]
[3] CAI Jiaojiao,HE Jia.A Classification Application based on a Hybrid Autoencoder[J].Journal of Chengdu University of Information Technology,2016,(Suppl.1):1.
[4] REN Bo,WANG Lu-tao,DENG Xu,et al.An Improved Deep Learning Network Structure for English Character Recognition[J].Journal of Chengdu University of Information Technology,2017,(03):259.[doi:10.16836/j.cnki.jcuit.2017.03.005]
[5] FENG Jinhui,TAO Hongcai.An Attention-based Deep Collaborative Filtering Model for Online Course Recommendation[J].Journal of Chengdu University of Information Technology,2020,35(02):151.[doi:10.16836/j.cnki.jcuit.2020.02.005]
[6] YANG Ming,WEN Bin.An Improved YOLOv3-Tiny Target Detection Algorithm[J].Journal of Chengdu University of Information Technology,2020,35(05):531.[doi:10.16836/j.cnki.jcuit.2020.05.009]
[7] CAO Yuanjie,GAO Yuxiang,DU Xinchang,et al.Optimization of the Tiny-YOLOv3 Model Algorithm for Mask Wearing Recognition[J].Journal of Chengdu University of Information Technology,2021,36(02):154.[doi:10.16836/j.cnki.jcuit.2021.02.005]
[8] CAO Yuanjie,GAO Yuxiang,LIU Haibo,et al.Model Pruning Algorithm based on YOLOv4-Tiny[J].Journal of Chengdu University of Information Technology,2021,36(06):610.[doi:10.16836/j.cnki.jcuit.2021.06.005]
[9] YAN Meijuan,WEI Min,WEN Wu.A Method of Road Extraction for High-Resolution Satellite Images[J].Journal of Chengdu University of Information Technology,2022,37(01):46.[doi:10.16836/j.cnki.jcuit.2022.01.008]
[10] BAI Kaiyi,SHENG Zhiwei,HUANG Yuanyuan.High Noise Traffic Classification based on Penalty Regression[J].Journal of Chengdu University of Information Technology,2025,40(02):125.[doi:10.16836/j.cnki.jcuit.2025.02.001]
Memo:
Received: 2024-05-08
Funding: National Social Science Fund of China project (22BXW048)
Corresponding author: AN Junxiu. E-mail: 86631589@qq.com
