LINGUISTIC MODELLING OF LOANWORDS IN UZBEK TEXT-TO-SPEECH SYSTEMS
- Authors
Abduraxmanova Nazokat Azamjonovna
Doctoral Researcher (PhD Candidate) at Fergana State University
- Keywords:
- Uzbek electronic corpus, loanwords, homograph, bi-phonemic grapheme, pronunciation variant, vowel sequence, morphophonology, linguistic front-end, text-to-speech (TTS).
- Abstract
This paper discusses how loanwords in Uzbek, especially those of the Russian/international and Arabic-Persian strata, should be handled in the linguistic front-end of text-to-speech (TTS) systems, using evidence from the Uzbek electronic corpus. The study aims to identify and systematize the orthography-pronunciation mismatches that cause typical synthesis errors: bi-phonemic graphemes (the Uzbek letter <j>, which encodes both affricate and fricative realizations); homography caused by the lack of distinct letters for foreign phonemes (e.g., the letter <o> in certain borrowed forms); vowel sequences in loanwords that are realized with inserted glides (y/v); and variants that depend on stress and syllable structure. The methodology combines descriptive analysis, phonetic-morphological interpretation, identification and analysis of typical word combinations in the electronic corpus, and a comparison with practical experience from Turkish, Kazakh, and Tatar. The study proposes a set of practical front-end strategies: an enriched pronunciation lexicon for loanwords, context-based homograph disambiguation, explicit glide-insertion rules for vowel sequences, and curated exception lists for fricative-j items. Overall, corpus-driven linguistic normalization is argued to be the key factor in improving naturalness and reducing errors in Uzbek TTS.
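Two of the proposed strategies, the exception list for fricative-j items and the glide-insertion rules for vowel sequences, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function `to_phones`, the tables `FRICATIVE_J_WORDS` and `GLIDE_AFTER`, and their entries are hypothetical placeholders, not data from the paper or from the Uzbek electronic corpus. The sketch treats each Latin letter as one phone, so digraphs such as sh, ch, and ng and the letters o‘ and g‘ are not handled, and homograph disambiguation is left out entirely.

```python
# Illustrative sketch only. The word list and the glide table below are
# placeholder assumptions, not rules or data taken from the paper.

# Assumed exception list: loanwords in which <j> is read as fricative [ʒ].
FRICATIVE_J_WORDS = {"jurnal", "jirafa"}

# Assumed glide table: vowel pair -> glide inserted between the two vowels.
GLIDE_AFTER = {"io": "y", "ia": "y"}


def to_phones(word, fricative_j=FRICATIVE_J_WORDS, glides=GLIDE_AFTER):
    """Map a Latin-script word to a rough phone list, one phone per letter."""
    word = word.lower()
    # Whole-word lookup decides how every <j> in the word is realized;
    # the affricate [dʒ] is the default outside the exception list.
    j_phone = "ʒ" if word in fricative_j else "dʒ"
    phones = []
    for i, ch in enumerate(word):
        phones.append(j_phone if ch == "j" else ch)
        # If this letter and the next form a listed vowel pair,
        # insert the glide between them.
        pair = word[i:i + 2]
        if pair in glides:
            phones.append(glides[pair])
    return phones


if __name__ == "__main__":
    for w in ("jurnal", "juda", "radio"):
        print(w, to_phones(w))
```

Keeping the exception list and the glide table as plain data passed into the function means a corpus-derived lexicon could replace the placeholders without changing the lookup logic.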
- Published
- 2026-02-20
- Issue
- Vol. 2 No. 2 (2026)
- Section
- Articles
- License
This work is licensed under a Creative Commons Attribution 4.0 International License.