Speech Enhancement Method Based on Convolutional Recurrent Network and Non-Local Module

doi:10.16180/j.cnki.issn1007-7820.2022.03.002

Abstract

Existing deep neural network speech enhancement methods neglect phase spectrum learning, which limits the quality of the enhanced speech. To address this problem, this study proposes a speech enhancement method based on a convolutional recurrent network and non-local modules. An encoder-decoder network takes the time-domain representation of the speech signal as the encoder input for deep feature extraction, so that both the amplitude and the phase information of the signal are fully used. Non-local modules are added to the convolutional layers of the encoder and decoder to extract key features of the speech sequence while suppressing irrelevant ones. A gated recurrent unit (GRU) network is introduced to capture temporal correlations within the speech sequence. Experimental results on the ST-CMDS Chinese speech dataset show that, compared with the unprocessed noisy speech, the quality and intelligibility of the enhanced speech improve by 61% and 7.93% on average, respectively.
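
The non-local operation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest form of a non-local block: pairwise dot-product similarity, softmax normalisation over all positions, and a residual connection. The learned projection layers used in practice are omitted, and the function name `non_local_1d` and the toy input are hypothetical, not taken from the paper.

```python
import math

def non_local_1d(x):
    """Apply a projection-free non-local operation.

    x is a list of T feature vectors, each a list of C floats.
    Returns a list with the same shape.
    """
    num_positions = len(x)
    out = []
    for i in range(num_positions):
        # Similarity between position i and every position j (dot product).
        scores = [sum(a * b for a, b in zip(x[i], x[j]))
                  for j in range(num_positions)]
        # Softmax over j; subtracting the maximum keeps exp() stable.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum over all positions: similar positions contribute more.
        y = [sum(w * x[j][c] for j, w in enumerate(weights))
             for c in range(len(x[i]))]
        # Residual connection: the block adds context to the original feature.
        out.append([xi + yi for xi, yi in zip(x[i], y)])
    return out

# Toy input: 4 time steps with 2 channels each (illustrative values only).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
enhanced = non_local_1d(features)
```

Because every output position aggregates information from all other positions, the block can relate distant parts of the sequence in a single step, which a convolution with a small kernel cannot do.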

Key words: speech enhancement, deep neural network, convolutional recurrent network, non-local module, supervised learning, gated recurrent unit, magnitude spectrum, phase spectrum

CLC Number: TN912.35

Hui LI, Hao JING, Kanghua YAN, Lianghao XU. Speech Enhancement Method Based on Convolutional Recurrent Network and Non-Local Module[J]. Electronic Science and Technology, 2022, 35(3): 8-15.


Table 1
Layer	Input dimension	Output dimension	Hyperparameters
Input layer	1×16384	1×16384	-
Encoder	1×16384	288×4	k=15, s=1, n=24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264, 288
Non-local module	288×4	288×4	-
Reshape layer	288×4	4×288	-
GRU layer 1	4×288	4×288	288
GRU layer 2	4×288	4×288	288
Feature fusion layer	[4×288, 4×288]	4×288	-
Reshape layer	4×288	288×4	-
Non-local module	288×4	288×4	-
Decoder	576×8	24×16384	k=5, s=1, n=288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48, 24
Output layer	25×16384	1×16384	k=5, s=1, n=1
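
The dimensions in the network table above can be cross-checked with a short shape trace. This is a sketch under assumptions the table itself does not state: each of the 12 encoder layers is followed by decimation by a factor of 2, the decoder upsamples by 2 and concatenates the matching encoder feature map (a skip connection), and the output layer sees the decoder output concatenated with the 1-channel input. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(length=16384, widths=tuple(range(24, 289, 24))):
    """Trace (channels, length) through the encoder under the stated assumptions."""
    skips = []
    for n in widths:
        # Convolution output, kept as the skip feature for the decoder.
        skips.append((n, length))
        # Assumed decimation by 2 after every encoder layer.
        length //= 2
    bottleneck = (widths[-1], length)
    # Assumed: upsample by 2, then concatenate the deepest skip along channels.
    first_decoder_in = (bottleneck[0] + skips[-1][0], bottleneck[1] * 2)
    # Assumed: last decoder output (24 channels) concatenated with the raw input.
    output_in = (widths[0] + 1, skips[0][1])
    return bottleneck, first_decoder_in, output_in

bottleneck, first_decoder_in, output_in = trace_shapes()
print(bottleneck, first_decoder_in, output_in)
# prints: (288, 4) (576, 8) (25, 16384)
```

Under these assumptions the trace reproduces the 288×4 encoder output, the 576×8 decoder input, and the 25×16384 output-layer input listed in the table.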

Table 2
Configuration	Model/parameters
Operating system	Windows 10
Programming language	Python 3.6
Processor	Intel Core i5-9400F @ 2.90 GHz
Graphics card	RTX 2060S
Memory	16 GB

Table 3
SNR	Noise type	a	b	c	d	e	f	g	h
-3 dB	Babble	1.12	1.18	1.14	1.17	1.25	1.28	1.29	1.35
	Destroyer engine	1.21	1.43	1.34	1.43	1.55	1.59	1.56	1.59
	F16	1.15	1.43	1.36	1.50	1.50	1.57	1.51	1.66
	HF channel	1.21	1.46	1.39	1.47	1.52	1.69	1.63	1.69
	M109	1.16	1.63	1.58	1.71	1.71	1.63	1.65	1.83
	White	1.10	1.41	1.14	1.43	1.54	1.59	1.60	1.68
0 dB	Babble	1.31	1.41	1.35	1.41	1.51	1.53	1.51	1.60
	Destroyer engine	1.28	1.53	1.44	1.58	1.67	1.76	1.70	1.80
	F16	1.17	1.59	1.48	1.65	1.75	1.80	1.71	1.87
	HF channel	1.27	1.59	1.49	1.63	1.80	1.86	1.81	1.83
	M109	1.25	1.82	1.72	1.94	1.93	1.91	1.90	2.05
	White	1.12	1.59	1.24	1.64	1.73	1.78	1.78	1.88
3 dB	Babble	1.31	1.93	1.79	2.08	1.71	1.82	1.77	1.99
	Destroyer engine	1.36	1.67	1.58	1.73	1.90	1.98	1.91	1.96
	F16	1.28	1.83	1.71	1.90	2.03	2.08	1.98	2.12
	HF channel	1.36	1.68	1.61	1.74	1.97	2.05	1.99	1.97
	M109	1.42	2.00	1.93	2.10	2.26	2.21	2.21	2.30
	White	1.16	1.76	1.40	1.79	1.90	1.93	1.93	2.03
Mean		1.23	1.61	1.48	1.66	1.73	1.78	1.75	1.84

Table 4
SNR	Noise type	a	b	c	d	e	f	g	h
-3 dB	Babble	63.91	61.11	62.04	60.80	71.17	70.83	70.95	72.02
	Destroyer engine	68.61	71.86	70.39	73.51	77.57	77.04	77.03	79.13
	F16	67.39	75.06	74.52	75.81	79.15	78.48	78.17	80.57
	HF channel	69.01	73.28	72.87	75.58	78.81	78.78	78.51	78.69
	M109	79.45	83.67	83.28	84.29	86.74	88.72	86.44	88.69
	White	72.82	73.33	55.57	74.11	81.25	81.12	80.74	83.56
0 dB	Babble	76.11	80.04	80.02	78.88	84.82	84.69	84.43	83.95
	Destroyer engine	77.06	80.08	79.32	80.81	84.02	83.55	83.61	85.02
	F16	74.86	81.98	81.21	82.26	84.41	84.18	83.77	86.11
	HF channel	76.87	80.24	79.68	80.90	84.13	83.76	83.61	84.29
	M109	85.69	87.91	87.41	88.19	90.68	90.56	90.53	91.67
	White	79.45	79.92	68.29	80.12	85.41	85.49	85.11	85.39
3 dB	Babble	80.13	91.79	89.27	92.71	88.66	88.67	88.58	92.11
	Destroyer engine	83.81	84.97	85.11	85.29	88.06	87.63	87.70	88.11
	F16	83.06	86.25	85.91	85.99	89.09	88.78	88.65	89.33
	HF channel	82.45	84.27	84.42	84.41	87.73	87.25	87.56	86.95
	M109	91.03	90.70	91.10	90.65	93.71	93.67	93.49	94.25
	White	84.99	85.82	80.40	85.95	88.45	88.43	89.71	89.53
Mean		77.59	80.68	78.38	81.13	84.66	84.54	84.37	85.52
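
The average gains can be recomputed from the mean rows of the two results tables. The sketch below assumes that column a is the unprocessed noisy speech and column h the proposed method, and that the first table reports a quality score and the second an intelligibility score in percent; none of these labels is stated in the tables themselves. The helper name `gains` is hypothetical.

```python
# Mean-row values copied from the two results tables (columns a and h).
mean_rows = {
    "quality score": (1.23, 1.84),
    "intelligibility (%)": (77.59, 85.52),
}

def gains(before, after):
    """Return the absolute gain and the gain relative to the baseline in percent."""
    absolute = after - before
    relative_pct = 100.0 * absolute / before
    return absolute, relative_pct

report = {name: gains(a, h) for name, (a, h) in mean_rows.items()}
for name, (absolute, relative_pct) in report.items():
    print(f"{name}: +{absolute:.2f} absolute, +{relative_pct:.1f}% relative")
```

Under these assumptions the absolute gains are 0.61 for the quality score and 7.93 points for intelligibility, which match the figures quoted in the abstract; the corresponding relative gains are about 49.6% and 10.2%.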
