Detection of voice transformation spoofing using the dense convolutional neural network

doi:10.19665/j.issn1001-2400.2021.04.022

Abstract

Voice transformation (VT) spoofing hides a speaker's identity by altering the speaker's acoustic features with speech processing algorithms, which causes extremely high false reject rates in automatic speaker recognition (ASR) systems. VT spoofing is cheap to implement and has been integrated into many audio editing tools, so it poses a serious threat to social security. Research on VT spoofing detection, however, remains insufficient. This paper proposes a VT detection method based on a dense convolutional neural network (DenseNet) to distinguish spoofed voices from genuine ones. The proposed network has 135 layers in total. Maximizing the skip connections between layers strengthens the flow of data through the network and lets both deep and shallow edge features contribute to classification, which alleviates the degradation phenomenon and improves detection accuracy. Experimental results show that the detection accuracy for various spoofing factors is over 98%.
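The dense connectivity described in the abstract can be illustrated with a minimal sketch. It is not the authors' 135-layer network: the layer function, the growth rate, and the toy input below are all hypothetical, and a random linear map followed by ReLU stands in for a real convolutional layer. The sketch only shows that each layer receives the concatenation of every earlier output, so shallow features reach the end of the block unchanged.

```python
import random


def make_layer(in_width, growth, rng):
    """Hypothetical stand-in for a convolutional layer: random linear map + ReLU."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_width)]
               for _ in range(growth)]

    def layer(x):
        # One output value per weight row, clipped at zero (ReLU).
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return layer


def dense_block(x, num_layers, growth, seed=0):
    """Toy dense block: every layer sees all earlier feature maps."""
    rng = random.Random(seed)
    features = list(x)  # running concatenation of all features so far
    for _ in range(num_layers):
        layer = make_layer(len(features), growth, rng)
        new_features = layer(features)       # input = all earlier outputs
        features = features + new_features   # concatenate, do not add
    return features


# Assumed toy input of width 3, with 4 layers that each add 2 features.
out = dense_block([0.5, -1.0, 2.0], num_layers=4, growth=2)
print(len(out))   # 3 + 4 * 2 = 11
print(out[:3])    # the original input is still present, untouched
```

Concatenation, rather than the addition used in residual networks [28], is what keeps the shallow features directly available to later layers.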

Key words: voice transformation spoofing, detection, security, neural network

CLC Number: TP39

WANG Yong, SU Zhuoyi, ZHU Zhengyu. Detection of voice transformation spoofing using the dense convolutional neural network[J]. Journal of Xidian University, 2021, 48(4): 168-175.

Figures/Tables 8

References 31

[1]	PERROT P, AVERSANO G. Voice Disguise and Automatic Detection:Review and Perspectives[C]//Progress in Nonlinear Speech Processing.Berlin:Springer-Verlag, 2007:101-117.
[2]	GOMEZ-ALANIS A, PEINADO A M, GONZALEZ J A, et al. A Gated Recurrent Convolutional Neural Network for Robust Spoofing Detection[J]. IEEE/ACM Transactions on Audio,Speech,and Language Processing (TASLP), 2019, 27(12):1985-1999.
[3]	ALAM J, KENNY P. Spoofing Detection Employing Infinite Impulse Response—Constant Q Transform-Based Feature Representations[C]//Proceedings of the 2017 25th European Signal Processing Conference(EUSIPCO).Piscataway:IEEE, 2017:101-105.
[4]	HANILCI C. Speaker Verification Anti-Spoofing Using Linear Prediction Residual Phase Features[C]//Proceedings of the 2017 25th European Signal Processing Conference(EUSIPCO).Piscataway:IEEE, 2017:96-100.
[5]	KAMBLE M R, PATIL H. Novel Energy Separation Based Instantaneous Frequency Features for Spoof Speech Detection[C]//Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO).Piscataway:IEEE, 2017:106-110.
[6]	MUCKENHIRN H, KORSHUNOV P, MAGIMAI-DOSS M, et al. Long-Term Spectral Statistics for Voice Presentation Attack Detection[J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2017, 25(11):2098-2111. doi: 10.1109/TASLP.2017.2743340
[7]	DINKEL H, QIAN Y, YU K. Small-Footprint Convolutional Neural Network for Spoofing Detection[C]//Proceedings of the 2017 International Joint Conference on Neural Networks(IJCNN).Piscataway:IEEE, 2017:3086-3091.
[8]	SAHIDULLAH M, THOMSEN D A L, HAUTAMAKI R G, et al. Robust Voice Liveness Detection and Speaker Verification Using Throat Microphones[J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2018, 26(1):44-56. doi: 10.1109/TASLP.2017.2760243
[9]	LEE K, PARK C, KIM N, et al. Accelerating Recurrent Neural Network Language Model Based Online Speech Recognition System[C]//Proceedings of the 2018 IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway:IEEE, 2018:5904-5908.
[10]	SAILOR H, KAMBLE M, PATIL H. Auditory Filterbank Learning for Temporal Modulation Features in Replay Spoof Speech Detection[C]//Proceedings of the Interspeech.Piscataway:IEEE, 2018:666-670.
[11]	KUMAR M G, KUMAR R S. Spoof Detection Using Time-Delay Shallow Neural Network and Feature Switching[C]//Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop(ASRU).Piscataway:IEEE, 2019:1011-1017.
[12]	GOMEZ-ALANIS A, GONZALEZ-LOPEZ A, PEINADO A M. A Kernel Density Estimation Based Loss Function and Its Application to ASV-Spoofing Detection[J]. IEEE Access, 2020, 8:108530-108543. doi: 10.1109/Access.6287639
[13]	BALAMURALI B T, LIN K, LUI S, et al. Toward Robust Audio Spoofing Detection:a Detailed Comparison of Traditional and Learned Features[J]. IEEE Access, 2019, 7:84229-84241. doi: 10.1109/Access.6287639
[14]	KAMBLE M, PATIL H. Analysis of Reverberation via Teager Energy Features for Replay Spoof Speech Detection[C]//Proceedings of the 2019 IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway:IEEE, 2019:2607-2611.
[15]	YE Y, LAO L, YAN D, et al. Detection of Replay Attack Based on Normalized Constant Q Cepstral Feature[C]//Proceedings of the 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis.Piscataway:IEEE, 2019:407-411.
[16]	NOSEK T, SUZIC S, PAPIC B, et al. Synthesized Speech Detection Based on Spectrogram and Convolutional Neural Networks[C]//Proceedings of the 2019 27th Telecommunications Forum.Belgrade:Serbia, 2019:1-4.
[17]	ACHARYA R, PATIL H, KOTTA H. Novel Enhanced Teager Energy Based Cepstral Coefficients for Replay Spoof Detection[C]//Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop.Piscataway:IEEE, 2019:342-349.
[18]	MALIK K M, JAVED A, MALIK H, et al. A Light-Weight Replay Detection Framework for Voice Controlled IoT Devices[J]. IEEE Journal of Selected Topics in Signal Processing,Early Access Article, 2020, 14(5):982-996.
[19]	KAMBLE M R, KRISHNA SAI P A. Speech Demodulation-Based Techniques for Replay and Presentation Attack Detection[C]//Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.Piscataway:IEEE, 2019:1545-1550.
[20]	SINITCA A M, EFIMCHIK N V, SHALUGIN E D, et al. Voice Antispoofing System Vulnerabilities Research[C]//Proceedings of the 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering,St.Piscataway:IEEE, 2020:505-508.
[21]	MONTEIRO J, ALAM J, FALK T H. An Ensemble Based Approach for Generalized Detection of Spoofing Attacks to Automatic Speaker Recognizers[C]//Proceedings of the 2020 IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway:IEEE, 2020:6599-6603.
[22]	WANG Y, DENG Y H, WU H J, et al. Blind Detection of Electronic Voice Transformation with Natural Disguise[C]//Proceedings of the Digital Forensics and Watermarking,LNCS 7809.Berlin:Springer-Verlag, 2013:336-343.
[23]	WU H, WANG Y, HUANG J. Identification of Electronic Disguised Voices[J]. IEEE Transactions on Information Forensics and Security, 2014, 9(3):489-500. doi: 10.1109/TIFS.2014.2301912
[24]	WU H, WANG Y, HUANG J. Blind Detection of Electronic Disguised Voice[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway:IEEE, 2013:3013-3017.
[25]	LIANG H, LIN X, ZHANG Q, et al. Recognition of Spoofed Voice Using Convolutional Neural Networks[C]//Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing(GlobalSIP).Piscataway:IEEE, 2017:293-297.
[26]	LAROCHE J. Time and Pitch Scale Modification of Audio Signals[M]. Applications of Digital Signal Processing to Audio and Acoustics.Moscow:Kluwer Academic Publishers, 2002:279-310.
[27]	TREHUB S, COHEN A, THORPE L, et al. Development of the Perception of Musical Relations:Semitone and Diatonic Structure[J]. Journal of Experimental Psychology Human Perception and Performance, 1986, 12(3):295-301. doi: 10.1037/0096-1523.12.3.295
[28]	HE K M, ZHANG X Y, REN S Q, et al. Deep Residual Learning for Image Recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE, 2016:770-778.
[29]	SRIVASTAVA R K, GREFF K, SCHMIDHUBER J. Training Very Deep Networks[C]//Conference and Workshop on Neural Information Processing Systems,Advances in Neural Information Processing Systems 28.New York:Curran Associates, 2015:2377-2385.
[30]	LARSSON G, MAIRE M, SHAKHNAROVICH G. FractalNet:Ultra-Deep Neural Networks without Residuals[C]//Proceedings of the International Conference on Learning Representations.Piscataway:IEEE, 2017:403-410.
[31]	HUANG G, SUN Y, LIU Z, et al. Deep Networks with Stochastic Depth[C]//Proceedings of the European Conference on Computer Vision.Piscataway:IEEE, 2016:646-661.

Corpus	Number of segments	Corpus	Number of segments
Training set Timit-1	7 996	Test set Timit-2	8 967
Training set NIST-1	18 601	Test set NIST-2	14 589
Training set UME-1	7 482	Test set UME-2	6 952

Training set	Test set	135-DenseNet	CNN[14]	MFCC-SVM[24]
Timit-1	Timit-2	99.45	96.52	95.87
NIST-1	NIST-2	98.04	95.93	94.56
UME-1	UME-2	97.56	94.85	93.63
Mean		98.35	95.77	94.69

Experiment	Training set	Test set	135-DenseNet	CNN[25]
Experiment 1	Timit-1&NIST-1	UME-2	96.45	94.37
Experiment 2	NIST-1&UME-1	Timit-2	95.26
Experiment 3	Timit-1&UME-1	NIST-2	80.20
Mean			90.63

Training set	Test set	10 dB noise added	15 dB noise added	20 dB noise added	30 dB noise added	Clean speech
Timit-1	Timit-2	97.65	98.56	99.09	99.15	99.18
NIST-1	NIST-2	91.97	91.97	93.82	96.07	96.86
UME-1	UME-2	91.82	91.82	94.12	94.94	96.31
N-1&T-1	UME-2	87.62	87.62	93.58	95.84	96.16
N-1&U-1	Timit-2	90.21	90.21	95.06	95.91	96.22
T-1&U-1	NIST-2	75.28	75.28	80.07	80.91	81.37
Mean		89.09	89.09	92.62	93.80	94.35
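
As a rough illustration of how a test segment might be corrupted at a given noise level, the sketch below scales white Gaussian noise to reach a target signal-to-noise ratio. It assumes that the dB values in the table above are SNRs and that the noise is white; neither is stated here. The function names are hypothetical, and the 440 Hz tone at a 16 kHz sampling rate is only a stand-in for a speech segment.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return (noisy_signal, noise) with the noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]  # assumed white Gaussian noise
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    noise = [n * scale for n in noise]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal, noise):
    """SNR in dB computed from the clean signal and the added noise."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


# Hypothetical one-second test tone standing in for a speech segment.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy, noise = add_noise_at_snr(tone, snr_db=10)
print(round(measured_snr_db(tone, noise), 2))  # 10.0
```

Lower SNR values mean stronger noise relative to the speech, which is consistent with the accuracy falling as the noise level moves from 30 dB to 10 dB.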

Training set	Test set	Compressed WAV	Uncompressed WAV
Timit-1	Timit-2	93.18	99.45
NIST-1	NIST-2	97.57	98.04
UME-1	UME-2	94.88	97.56
N-1&T-1	UME-2	92.61	96.45
N-1&U-1	Timit-2	93.65	95.26
T-1&U-1	NIST-2	82.08	80.21
Mean		92.33	94.50
