Adapting F5-TTS Model to Kurdish Sorani: Diffusion-based Speech Synthesis with a Specialized Dataset
DOI: https://doi.org/10.21928/uhdjst.v9n2y2025.pp198-207
Keywords: Speech Synthesis, Diffusion-Based Model, Low-Resource Language, F5-Text-to-Speech, Kurdish Sorani
Abstract
The Kurdish language is a low-resource language in the field of speech synthesis, and most of the currently available Kurdish text-to-speech (TTS) systems lack both accuracy and naturalness. To address this issue, this study fine-tunes F5-TTS, an advanced diffusion-based model that had not previously been adapted to this language, for Kurdish (Sorani). The process began with the creation of a high-quality, single-speaker dataset containing more than 10 hours of recorded speech collected from news, interviews, and short videos. The dataset was carefully curated to ensure balanced sample durations, accurate transcriptions, emotional diversity, and a wide range of topics and speaking styles. After verifying that the training data was sufficient, clean, and properly prepared, the F5-TTS model was fine-tuned on it. Both objective and subjective evaluations were conducted to assess the model's performance. In the objective evaluation, the model achieved a character error rate (CER) of 4.3% and a word error rate (WER) of 20.37%, indicating high transcription accuracy for the generated audio. In the subjective evaluation, the mean opinion score (MOS) reached 4.72, showing that the synthesized speech is very close to the original speaker's voice. These results demonstrate that diffusion-based models such as F5-TTS can be effectively adapted to low-resource languages when supported by a well-designed dataset.
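For readers unfamiliar with how these metrics are computed, the sketch below shows one common way to obtain CER/WER from ground-truth and ASR-produced transcripts using the open-source jiwer package, and to average listener ratings into a MOS. The transcripts, ratings, and library choice are illustrative assumptions; the abstract does not state which toolkit or ASR system the authors used.

```python
# Minimal sketch of the objective (CER/WER) and subjective (MOS) metrics
# mentioned in the abstract. Assumes the `jiwer` package is installed
# (pip install jiwer); all strings and ratings below are hypothetical placeholders.
from statistics import mean

import jiwer

# Ground-truth transcripts of the test sentences, and the transcripts an ASR
# system produced for the corresponding synthesized audio.
references = ["ground-truth transcript one", "ground-truth transcript two"]
hypotheses = ["asr transcript one", "asr transcript too"]

# CER compares at the character level, WER at the word level; lower is better.
cer = jiwer.cer(references, hypotheses)
wer = jiwer.wer(references, hypotheses)
print(f"CER: {cer:.2%}  WER: {wer:.2%}")

# MOS is simply the arithmetic mean of listener ratings on a 1-5 scale.
listener_ratings = [5, 5, 4, 5, 4]  # hypothetical scores for one utterance
print(f"MOS: {mean(listener_ratings):.2f}")
```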
License
Copyright (c) 2025 Hamreen Ahmad, Aree Mohammed

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.