Electronic Science and Technology, 2022, Vol. 35, Issue (3): 8-15. doi: 10.16180/j.cnki.issn1007-7820.2022.03.002


Speech Enhancement Method Based on Convolutional Recurrent Network and Non-Local Module

Hui LI¹, Hao JING², Kanghua YAN², Lianghao XU²

  1. School of Physics and Electronic Information Engineering, Henan Polytechnic University, Jiaozuo 454000, China
  2. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454000, China
  • Received: 2020-11-16 Online: 2022-03-15 Published: 2022-04-02
  • About the authors: Hui LI (born 1974), male, Ph.D., professor. Research interests: intelligent signal processing, communication technology, etc. | Hao JING (born 1995), male, master's student. Research interests: speech enhancement and speech signal processing.
  • Supported by:
    National Natural Science Foundation of China (11804081); Basic and Frontier Technology Research Program of Henan (152300410103)

Abstract:

Existing deep neural network speech enhancement methods neglect the importance of learning the phase spectrum, which leaves the quality of the enhanced speech unsatisfactory. To address this problem, this paper proposes a speech enhancement method based on a convolutional recurrent network and non-local modules. An encoder-decoder network is designed that takes the time-domain representation of the speech signal as the encoder input for deep feature extraction, so that both the magnitude and the phase information of the signal are fully used. Non-local modules are added to the convolutional layers of the encoder and decoder to extract the key features of the speech sequence while suppressing irrelevant ones, and a gated recurrent unit (GRU) network is introduced to capture temporal correlations in the speech sequence. Experimental results on the ST-CMDS Chinese speech dataset show that, compared with the unprocessed noisy speech, the enhanced speech produced by the proposed method improves quality and intelligibility by 61% and 7.93% on average, respectively.

Key words: speech enhancement, deep neural network, convolutional recurrent network, non-local module, supervised learning, gated recurrent unit, magnitude spectrum, phase spectrum
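
The abstract does not specify how the non-local modules are built. As a rough illustration only, the sketch below applies a simplified non-local operation to a sequence of feature vectors: each position is updated with a softmax-weighted sum over all positions, plus a residual connection. It assumes identity embeddings (no learned projections) and dot-product similarity; the function names `softmax`, `dot`, and `non_local_1d` are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def non_local_1d(frames):
    """Simplified non-local operation over T feature vectors of size C.

    Assumption: identity embeddings and dot-product similarity, so
    output[i] = frames[i] + sum_j weights[i][j] * frames[j].
    Returns (output, weights), where each row of weights sums to 1.
    """
    output, weights = [], []
    for x_i in frames:
        # Similarity of position i to every position j, normalised by softmax.
        row = softmax([dot(x_i, x_j) for x_j in frames])
        weights.append(row)
        # Aggregate features from all positions, channel by channel.
        agg = [
            sum(w * x_j[c] for w, x_j in zip(row, frames))
            for c in range(len(x_i))
        ]
        # Residual connection keeps the original feature.
        output.append([a + b for a, b in zip(x_i, agg)])
    return output, weights


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # toy data, T=3, C=2
    out, attn = non_local_1d(features)
    for row in attn:
        print([round(w, 3) for w in row])
```

Because every output position draws on every input position, an operation of this kind can relate frames that are far apart in time. In a trained network the embeddings would normally be learned (for example as 1x1 convolutions), which this sketch leaves out.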

CLC Number:

  • TN912.35