LI Tongyan,PEI Haoyan,PEI Yan,et al.GAN Speech Enhancement Algorithm based on Attention Mechanism and Mask Learning[J].Journal of Chengdu University of Information Technology,2025,40(02):137-142.[doi:10.16836/j.cnki.jcuit.2025.02.003]
- Title:
- GAN Speech Enhancement Algorithm based on Attention Mechanism and Mask Learning
- Article ID:
- 2096-1618(2025)02-0137-06
- Keywords:
- speech enhancement; generative adversarial networks; attention-BLSTM; mask reconstruction; low signal-to-noise ratio
- CLC number:
- TP391.1
- Document code:
- A
- Abstract:
- Speech enhancement is an important component of automatic speech recognition (ASR). In recent years, the modeling capability of generative adversarial networks (GANs) and their variants for speech enhancement has steadily improved, but these models still suffer from weak generalization and an inability to adapt to low signal-to-noise ratio (SNR) environments. To address this, a GAN-based speech enhancement model called Mask-LAGAN is proposed, combining an attention-based bidirectional LSTM (BLSTM) with mask learning. The framework redesigns the enhancement mechanism: a BLSTM with attention layers serves as the GAN's generator, and mask learning is introduced for spectrum reconstruction. The enhanced signal, obtained by overlaying the mask-filtered signal on the original signal, is then fed to the discriminator, and the two networks are trained adversarially to achieve speech enhancement. The model is evaluated comparatively on the TIMIT dataset under different SNR conditions using speech evaluation metrics such as PESQ, STOI, and CSIG. Experimental results show that the proposed model improves speech enhancement by 11.8% on average over baseline models such as SEGAN, and retains strong acoustic modeling capability in environments with heavy noise interference.
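To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the described generator: a BLSTM with an attention layer predicts a spectral mask, the mask filters the noisy spectrum, and the filtered signal is overlaid on the original input before being passed to the discriminator. The class name MaskLAGANGenerator, the 257-bin spectrum, and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of the generator described
# in the abstract: a BLSTM with self-attention predicts a spectral mask, the
# mask filters the noisy spectrum, and the filtered signal is overlaid on the
# original input to form the enhanced output fed to the GAN discriminator.
import torch
import torch.nn as nn

class MaskLAGANGenerator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):  # sizes are illustrative
        super().__init__()
        # bidirectional LSTM for temporal modeling of the spectrogram
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # self-attention over frames (one reading of the attention-BLSTM block)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        # project to a [0, 1] mask per time-frequency bin
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq),
                                       nn.Sigmoid())

    def forward(self, noisy_spec):            # (batch, frames, n_freq)
        h, _ = self.blstm(noisy_spec)
        h, _ = self.attn(h, h, h)
        mask = self.mask_head(h)              # learned spectral mask
        filtered = mask * noisy_spec          # mask-filtered spectrum
        return filtered + noisy_spec          # overlay with the original signal

# toy usage: a batch of 8 utterances, 100 frames, 257 frequency bins
enhanced = MaskLAGANGenerator()(torch.randn(8, 100, 257))
```

Under this reading, the residual overlay means the generator only learns a correction on top of the noisy input, which is consistent with the abstract's description of adding the filtered signal back onto the original signal.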
Memo:
Received: 2023-09-25
Funding: Supported by the Science and Technology Department of Sichuan Province (2023YFS0422)
Corresponding author: LI Tongyan. E-mail: lty@cuit.edu.cn