Application of a Time Delay Neural Network in the Acoustic Model of a Sundanese Voice-to-Text System
Abstract
Deep learning techniques have recently produced promising results in many research areas, especially pattern recognition. Neural networks, as a branch of deep learning, have been widely used to build models for various pattern recognition tasks, including speech recognition. In a neural network, the weights, i.e. the parameters between layers, play an important role in capturing information from the input data; they are updated in each iteration based on the input features. In speech recognition, a neural network is used to build an acoustic model trained on speech from different speakers. An acoustic model is built for a specific language, such as English, Mandarin, or Indonesian. In recent years, speech recognition systems using deep neural networks have been well developed for English and are used in many applications, but deep neural networks are rarely applied to local languages. In this research, a time delay neural network is used to build the acoustic model of a speech recognition system for the Sundanese language. Experimental results show that a time delay neural network with well-tuned hyperparameters reduces the WER to 0.57%.
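The defining operation of a time delay neural network is frame splicing: each layer looks at the current acoustic frame together with frames at fixed temporal offsets, so deeper layers cover progressively wider temporal contexts. The following is a minimal illustrative sketch of one such layer in NumPy; it is not the paper's implementation (the function name, context offsets, and dimensions are assumptions for illustration), and a real system would typically build this with a toolkit such as Kaldi.

```python
import numpy as np

def tdnn_layer(x, w, b, context):
    """One TDNN layer as explicit frame splicing (illustrative sketch).

    x       : (T, d_in) matrix of T acoustic feature frames.
    w       : (len(context) * d_in, d_out) weight matrix.
    b       : (d_out,) bias vector.
    context : temporal offsets, e.g. [-2, 0, 2], relative to the
              current frame.
    """
    T, d_in = x.shape
    offsets = np.asarray(context)
    lo, hi = -offsets.min(), offsets.max()
    outputs = []
    # Only frames with a full context window produce an output,
    # so the layer shrinks the sequence by (lo + hi) frames.
    for t in range(lo, T - hi):
        spliced = np.concatenate([x[t + c] for c in context])
        outputs.append(np.maximum(spliced @ w + b, 0.0))  # ReLU
    return np.stack(outputs)

# Example: 10 frames of 4-dim features, context {-2, 0, +2},
# 5 output units -> 6 output frames (2 lost on each side).
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
w = rng.normal(size=(3 * 4, 5))
y = tdnn_layer(x, w, np.zeros(5), [-2, 0, 2])
print(y.shape)  # (6, 5)
```

Stacking such layers with different offset sets is what lets a TDNN model long temporal contexts with relatively few parameters, since the splicing weights are shared across all time steps.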