Electronic Science and Technology ›› 2024, Vol. 37 ›› Issue (4): 1-7. doi: 10.16180/j.cnki.issn1007-7820.2024.04.001

Multi-Encoder Transformer for End-to-End Speech Recognition

PANG Jiangfei, SUN Zhanquan   

  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2022-10-31 Online: 2024-04-15 Published: 2024-04-19
  • Supported by:
    National Defense Basic Scientific Research Program (JCKY2019413D001); Medical and Engineering Cross Project of University of Shanghai for Science and Technology (10-21-302-413)

Abstract:

The widely used Transformer model is good at capturing global dependencies, but its shallow layers tend to ignore local feature information. To address this problem, this study proposes a multi-encoder method that improves speech feature extraction. An additional convolutional encoder branch is added to strengthen the capture of local features, compensating for the shallow Transformer layers' neglect of local information and effectively fusing the global and local dependencies of the audio feature sequence. The result is a Transformer-based multi-encoder model. Experiments on the open-source Mandarin Chinese data set Aishell-1 show that, without an external language model, the proposed model achieves a 4.00% relative reduction in character error rate compared with the Transformer model. On an internal, non-public Shanghainese dialect data set the improvement is more pronounced: the character error rate falls from 19.92% to 10.31%, a relative reduction of 48.24%.
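
The abstract reports its results as character error rates and relative reductions. As a reading aid, the sketch below shows how these two quantities are conventionally computed. It is not the authors' evaluation code: the function names `edit_distance`, `cer`, and `relative_reduction` are hypothetical, and it assumes the usual definition of character error rate as Levenshtein distance divided by reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent (assumed definition: edits / reference length)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, proposed):
    """Relative reduction in percent of an error rate with respect to a baseline."""
    return 100.0 * (baseline - proposed) / baseline
```

With the Shanghainese figures from the abstract, `relative_reduction(19.92, 10.31)` evaluates to roughly 48.24, which matches the reported relative reduction.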

Key words: Transformer, speech recognition, end-to-end, deep neural networks, multi-encoder, multi-head attention, feature fusion, convolution branch networks

CLC Number: TN912.34