Abstract
Processing text across scripts is a persistent challenge in natural language processing, particularly for low-resource languages, where non-standardized orthographies and informal writing systems are common. To address this, we introduce Sawtone, an integrated framework for consistent cross-script phonetic alignment and text normalization. At its core is an architecture built for interoperability: it combines a unified, linguistically grounded phonological feature space with modular, language-specific adapters. This structure supports robust mapping and comparison between any pair of scripts and, crucially, lets adapters developed with different methods or data work together on cross-language tasks. The framework supports alloglottographic text and is designed to run with minimal resource requirements. We demonstrate its practicality through implementations of transliteration, cross-script sequence alignment, and text normalization, and illustrate it further with a case study on preprocessing Moroccan Arabic data for Large Language Model (LLM) training. Initial results are encouraging: transliteration reached a BLEU score of 88%, phonetics-based text sequence alignment achieved 87-95% accuracy across various language and script pairs, and text normalization significantly reduced spelling variation. Sawtone offers a structured, interoperable foundation for advancing phonetically aware NLP across linguistic boundaries.
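To make the adapter idea concrete, the following is a minimal sketch and not Sawtone's actual implementation. It assumes a toy feature inventory, two hypothetical adapters (`LATIN_ADAPTER`, `ARABIC_ADAPTER`) that map graphemes to feature bundles in one shared space, a Jaccard distance between bundles, and a standard Needleman-Wunsch style global alignment with an arbitrary gap cost. All names, features, and parameter values are illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Features = FrozenSet[str]

# Shared feature bundles (toy inventory, assumed for illustration only).
S = frozenset({"consonant", "alveolar", "fricative", "voiceless"})
L = frozenset({"consonant", "alveolar", "lateral", "voiced"})
M = frozenset({"consonant", "bilabial", "nasal", "voiced"})
A = frozenset({"vowel", "open", "unrounded"})

# Hypothetical per-script adapters: grapheme -> bundle in the shared space.
LATIN_ADAPTER: Dict[str, Features] = {"s": S, "l": L, "m": M, "a": A}
ARABIC_ADAPTER: Dict[str, Features] = {"س": S, "ل": L, "م": M, "ا": A}


def encode(text: str, adapter: Dict[str, Features]) -> List[Features]:
    """Map each known grapheme to its feature bundle; skip unknown ones."""
    return [adapter[ch] for ch in text if ch in adapter]


def distance(a: Features, b: Features) -> float:
    """Jaccard distance between two feature bundles (0 = identical)."""
    return 1.0 - len(a & b) / len(a | b)


def align(
    x: List[Features], y: List[Features], gap: float = 0.6
) -> Tuple[float, List[Tuple[Optional[int], Optional[int]]]]:
    """Global alignment minimizing total feature distance plus gap costs."""
    n, m = len(x), len(y)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + distance(x[i - 1], y[j - 1]),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover aligned index pairs.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    eps = 1e-9
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            cost[i][j] - (cost[i - 1][j - 1] + distance(x[i - 1], y[j - 1]))
        ) < eps:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and abs(cost[i][j] - (cost[i - 1][j] + gap)) < eps:
            pairs.append((i - 1, None))  # symbol in x with no counterpart in y
            i -= 1
        else:
            pairs.append((None, j - 1))  # symbol in y with no counterpart in x
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


latin = encode("salam", LATIN_ADAPTER)
arabic = encode("سلام", ARABIC_ADAPTER)
total, pairs = align(latin, arabic)
print(total, pairs)
```

In this toy case the first Latin "a" has no written counterpart in the Arabic spelling, so it aligns to a gap while the remaining symbols match exactly. Because both adapters target the same feature space, a new script can be added in this sketch by supplying one more mapping, without touching the alignment code.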
License
Copyright (c) 2025 Omar Kamali

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
