Research on Perception Quantification-based Neural Speech Synthesis Methods

doi:10.16180/j.cnki.issn1007-7820.2019.09.016

Abstract

Current neural network-based speech synthesis frameworks are designed for a single speaker, require at least a few hours of training data, and cannot make use of speech data from different speakers, languages, and styles. To address this problem, a perception quantification-based neural network speech synthesis method is proposed. In this method, a perception quantification model learns representations of the different attributes of speech. A unified acoustic model is then built on the learned representations for different speakers, languages, and styles. An adaptation method transfers knowledge from the unified acoustic model to new speakers with limited speech data. The proposed method can effectively control the speaker, language, and style of synthetic speech and achieves cross-language and cross-style synthesis, while the adaptation method reduces the training data requirement to a few minutes. The proposed methods significantly improve the quality and flexibility of speech synthesis systems, and the naturalness of the synthesized speech is comparable to or better than that of an average Mandarin speaker.
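The abstract gives no implementation detail, so the following is only a toy sketch of the general idea it describes: a shared acoustic model whose input is conditioned on speaker, language, and style codes, and an adaptation step that fits a code for a new speaker from very little data. Every name, dimension, and weight here is hypothetical, a linear map stands in for the neural acoustic model, and updating only the speaker code is one common adaptation strategy that may differ from the authors' method.

```python
# Toy illustration only: names, sizes, and weights are hypothetical,
# and a linear map stands in for the neural acoustic model.
LING_DIM, CODE_DIM, OUT_DIM = 4, 2, 3
IN_DIM = LING_DIM + 3 * CODE_DIM

# Arbitrary fixed weights standing in for an already trained unified model.
WEIGHTS = [[(((k + 1) * (i + 1)) % 5 - 2) / 2.0 for i in range(IN_DIM)]
           for k in range(OUT_DIM)]


def make_input(linguistic, speaker, language, style):
    """Concatenate text features with the three attribute codes."""
    return linguistic + speaker + language + style


def predict(x):
    """Map one input vector to acoustic features: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]


def adapt_speaker_code(data, language, style, steps=200, lr=0.1):
    """Estimate a new speaker code by gradient descent on squared error.

    Only the code is updated; WEIGHTS stays frozen.
    """
    code = [0.0] * CODE_DIM
    for _ in range(steps):
        grad = [0.0] * CODE_DIM
        for linguistic, target in data:
            x = make_input(linguistic, code, language, style)
            err = [p - t for p, t in zip(predict(x), target)]
            for j in range(CODE_DIM):
                # Gradient of 0.5 * ||err||^2 with respect to code[j].
                grad[j] += sum(e * row[LING_DIM + j]
                               for e, row in zip(err, WEIGHTS))
        code = [c - lr * g / len(data) for c, g in zip(code, grad)]
    return code


# Simulate a few frames from an unseen speaker, then recover that speaker's code.
mandarin, neutral = [1.0, 0.0], [0.0, 1.0]
true_code = [0.8, -0.4]
frames = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.4, 0.4, 0.1, 0.9]]
data = [(f, predict(make_input(f, true_code, mandarin, neutral))) for f in frames]
adapted = adapt_speaker_code(data, mandarin, neutral)
print([round(c, 3) for c in adapted])
```

Because the shared weights are frozen, only `CODE_DIM` numbers are estimated for the new speaker, which is why a handful of frames is enough in this toy setting. Swapping the language or style code at synthesis time, while keeping the adapted speaker code, is the analogue of the cross-language and cross-style control mentioned above.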

Key words: speech synthesis, perception quantification, neural networks, limited data, cross-language, style control

CLC Number: TN912.33

LIU Qingfeng, JIANG Yuan, HU Yajun, LIU Lijuan. Research on Perception Quantification-based Neural Speech Synthesis Methods[J]. Electronic Science and Technology, 2019, 32(9): 76-79.



Mandarin naturalness	Baseline system	Perception quantification system
Mandarin, 10 h of data	4.02	4.22
Mandarin, 1 h of data	3.70	4.05
Mandarin, 5 min of data	-	3.46

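Emotion identification accuracy	Baseline system	Perception quantification system	Relative improvement
Neutral	81.6%	92.7%	60.3%
Happy	91.6%	100%	100%
Angry	95.6%	98.3%	61.4%
Sad	100%	100%	-
Average of the four	92.2%	97.75%	71.2%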
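The page does not define how the last column of the emotion accuracy table is computed. Its values are consistent with a relative error reduction, that is, the share of the baseline system's errors that the perception quantification system removes. The short check below reproduces the column under that assumption; the function name is illustrative.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Percentage of the baseline's errors removed by the new system."""
    baseline_err = 100.0 - baseline_acc
    if baseline_err == 0:
        return None  # no errors left to remove; the table shows "-"
    return 100.0 * (system_acc - baseline_acc) / baseline_err


# (baseline accuracy %, perception quantification accuracy %) from the table.
rows = {
    "Neutral": (81.6, 92.7),
    "Happy": (91.6, 100.0),
    "Angry": (95.6, 98.3),
    "Sad": (100.0, 100.0),
    "Average of the four": (92.2, 97.75),
}

for emotion, (baseline, system) in rows.items():
    value = relative_error_reduction(baseline, system)
    print(emotion, "-" if value is None else f"{value:.1f}%")
```

For the neutral row, for example, the baseline misidentifies 18.4% of samples and the new system 7.3%, so 11.1 of the 18.4 error points are removed, which is 60.3%.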
