LEARNING SPEED OF CONVOLUTIONAL NEURAL NETWORKS ON GPU AND CPU FOR DETECTING SYNTHESIZED SPEECH USING SPECTROGRAMS
Abstract
This work investigates the possibility of using convolutional neural networks to detect synthesized speech. The software application was built with the Python programming language, the TensorFlow library together with the high-level Keras API, and the ASVspoof 2019 audio database in FLAC format. Voice signals of synthesized and natural speech were converted into mel-frequency spectrograms. A convolutional neural network architecture with high recognition accuracy is proposed. The learning speed of the neural networks on GPU and CPU is compared using the CUDA library. The influence of the batch size parameter on the accuracy of the neural network is also investigated. The TensorBoard tool was used to monitor and profile the training process of the neural networks.
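The abstract describes converting the voice signal into mel-frequency spectrograms. A minimal sketch of that conversion step is given below; it assumes the librosa library, and the sample rate and mel parameters are placeholder values, since the paper only confirms the use of Python, TensorFlow/Keras and FLAC input.

import librosa
import numpy as np

def audio_to_mel_spectrogram(path, sr=16000, n_mels=128, hop_length=512):
    # Load the audio; librosa reads FLAC through its soundfile backend.
    # sr, n_mels and hop_length are illustrative, not values from the paper.
    signal, sr = librosa.load(path, sr=sr)
    # Mel-scaled power spectrogram of the voice signal.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    # Convert power to decibels, a common input scaling for CNN inputs.
    return librosa.power_to_db(mel, ref=np.max)

# Example (hypothetical file name in the style of the ASVspoof 2019 corpus):
# spec = audio_to_mel_spectrogram("LA_T_1000137.flac")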
Keywords: audio deepfake, mel-frequency spectrograms, convolutional neural networks, learning speed of neural networks.
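Likewise, the GPU-versus-CPU comparison, the batch size study and the TensorBoard profiling described in the abstract can be sketched. The following hypothetical TensorFlow/Keras snippet times identical training runs on '/GPU:0' and '/CPU:0' using synthetic data; the network layers, input shape, batch size and epoch count are illustrative placeholders, not the architecture or configuration reported in the paper.

import time
import numpy as np
import tensorflow as tf

def build_cnn(input_shape=(128, 128, 1)):
    # Small binary classifier (natural vs. synthesized speech); the layer
    # sizes here are placeholders, not the paper's proposed architecture.
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Random arrays standing in for batches of mel spectrograms.
x = np.random.rand(512, 128, 128, 1).astype("float32")
y = np.random.randint(0, 2, size=(512,)).astype("float32")

for device in ("/GPU:0", "/CPU:0"):  # "/GPU:0" needs a CUDA-capable GPU
    with tf.device(device):
        model = build_cnn()
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        # TensorBoard callback for monitoring/profiling the training run.
        log_dir = "logs/" + device.strip("/").replace(":", "_")
        tb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
        start = time.time()
        model.fit(x, y, batch_size=64, epochs=2, callbacks=[tb], verbose=0)
        print(device, "training time: %.1f s" % (time.time() - start))

Varying the batch_size argument in model.fit is the natural hook for the batch-size experiment the abstract mentions, and the TensorBoard logs can be inspected with "tensorboard --logdir logs".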
DOI: http://dx.doi.org/10.30970/eli.16.1