Electronic Science and Technology (电子科技) ›› 2019, Vol. 32 ›› Issue (9): 76-79. doi: 10.16180/j.cnki.issn1007-7820.2019.09.016


Research on a Neural Network Speech Synthesis Method Based on Perception Quantification Coding

刘庆峰,江源,胡亚军,刘利娟   

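  1. National Engineering Laboratory for Speech and Language Information Processing, Hefei 230027, Anhui, China
  • Received: 2019-06-24 Online: 2019-09-15 Published: 2019-09-19
  • About the authors: LIU Qingfeng (b. 1973), male, Ph.D., professor, doctoral supervisor. Research interests: signal processing, speech and language information processing. | JIANG Yuan (b. 1983), male, Ph.D. candidate. Research interests: speech signal processing, speech synthesis.
  • Supported by:
    National Natural Science Foundation of China (61871358)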

Research on Perception Quantification-based Neural Speech Synthesis Methods

LIU Qingfeng, JIANG Yuan, HU Yajun, LIU Lijuan

  1. National Engineering Laboratory for Speech and Language Information Processing, Hefei 230027, China
  • Received: 2019-06-24 Online: 2019-09-15 Published: 2019-09-19
  • Supported by:
    National Natural Science Foundation of China (61871358)

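Abstract (translated from Chinese):

To address the difficulty of pooling heterogeneous data in current neural network acoustic modeling, this paper proposes a neural network speech synthesis method based on perception quantification coding. A perception quantification coding model is designed to learn representations of how large-scale speech data differ in timbre, language, and emotion, and these representations are used to build a unified neural network acoustic model trained on mixed multi-speaker data. Through data sharing and transfer learning within this unified acoustic model, the amount of data needed to build a synthesis system can be reduced significantly, and attributes of the synthesized speech such as timbre, language, and emotion can be controlled effectively. The method improves the quality and flexibility of neural network speech synthesis: a system built from one hour of data reaches a naturalness of 4.0 MOS, matching or exceeding the level of an ordinary speaker.

Keywords: speech synthesis, perception quantification coding, neural networks, small-data synthesis, cross-language synthesis, emotion control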
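The naturalness figure above is a mean opinion score (MOS), which is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. As a minimal sketch of that calculation: the function name and the ratings below are illustrative assumptions, not the paper's evaluation protocol or data.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings given on the 1-5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return mean(ratings)


# Hypothetical ratings for illustration only; NOT data from the paper.
example_ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]
print(f"MOS = {mean_opinion_score(example_ratings):.2f}")
```

A real listening test would typically also report the number of listeners and a confidence interval, which the abstract does not give.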

Abstract:

Current neural network based speech synthesis frameworks are designed for a single speaker, require at least a few hours of training data, and cannot make use of speech data from different speakers, languages, and styles. To address this problem, a perception quantification-based neural network speech synthesis method is proposed. In the proposed method, a perception quantification-based model is designed to learn representations of different attributes of speech. A unified acoustic model is then built using the learnt perception quantification representations for different speakers, languages, and styles. An adaptation method is introduced to transfer knowledge from the unified acoustic model to new speakers with limited speech data. The proposed method can effectively control the speaker, language, and style of synthetic speech and achieve cross-language and cross-style speech synthesis, while the adaptation method reduces the demand for training data to a few minutes. The proposed methods significantly improve the quality and flexibility of speech synthesis systems, and the naturalness of the synthesized speech is similar to or better than that of an average Mandarin speaker.

Key words: speech synthesis, perception quantification, neural networks, limited data, cross-language, style control
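The abstract does not say how the learnt representations enter the unified acoustic model. One common way to condition a shared model on speaker, language, and style is to append an attribute code to every frame of linguistic features. The sketch below illustrates only that general idea: the code tables, their values, and `build_model_input` are hypothetical and are not taken from the paper.

```python
# Hypothetical attribute code tables. The values are placeholders,
# not learnt perception quantification codes from the paper.
SPEAKER_CODES = {"spk_a": [0.1, 0.9], "spk_b": [0.8, 0.2]}
LANGUAGE_CODES = {"zh": [1.0, 0.0], "en": [0.0, 1.0]}
STYLE_CODES = {"neutral": [0.0], "happy": [1.0]}


def build_model_input(linguistic_features, speaker, language, style):
    """Append the same attribute codes to every frame of linguistic features."""
    condition = (
        SPEAKER_CODES[speaker] + LANGUAGE_CODES[language] + STYLE_CODES[style]
    )
    return [list(frame) + condition for frame in linguistic_features]


# Two frames of made-up linguistic features (assumption: 3 values per frame).
frames = [[0.5, 0.3, 0.0], [0.1, 0.7, 1.0]]

# Same text and speaker; only the style code changes.
neutral = build_model_input(frames, "spk_a", "zh", "neutral")
happy = build_model_input(frames, "spk_a", "zh", "happy")
print(neutral[0])
print(happy[0])
```

Under this assumption, changing one code while holding the others fixed is what would allow the speaker, language, or style of the output to be controlled independently by a single shared model.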

CLC Number:

  • TN912.33