Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction


  • Hanar Hoshyar Mustafa Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq
  • Rebwar M. Nabi Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq




Kurdish Language, Kurmanji Dialect, Kurdish Lemmatizer, Kurdish Spell-checker and Spell-correction, Kurdish Dataset


Many studies address lemmatization, spell-checking, and spell-correction for English, Arabic, and Persian, but only a few address low-resource languages such as Kurdish, and fewer still its Kurmanji dialect, which increases the need for such systems. Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form, whereas a spell-checker determines whether a word is correctly spelled and a spell-corrector repairs a range of spelling errors. This research presents a lemmatization system and a word-level error correction system for the Kurmanji dialect of Kurdish; to the best of our knowledge, these are the first such tools for this dialect. The lemmatizer is built on morphological rules, while the spell-checker and spell-corrector use a hybrid approach that relies on an n-gram language model and the Jaccard coefficient similarity algorithm. As detailed in this article, the lemmatizer achieves accuracy rates of 97.7% and 99.3% for noun and verb lemmatization, respectively. The spell-checker and spell-corrector achieve accuracy rates of 100% and 90.77%, respectively.
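To illustrate how Jaccard coefficient similarity can rank correction candidates, the following is a minimal sketch and not the system evaluated in this article. It assumes that words are compared as sets of character bigrams with boundary padding, which the abstract does not specify. The function names, the `#` padding symbol, and the toy lexicon are hypothetical choices made only for this example.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suggest(word, lexicon, n=2, top_k=3):
    """Return the word itself if it is in the lexicon, otherwise the
    top_k lexicon entries ranked by Jaccard similarity of their n-gram sets."""
    if word in lexicon:
        return [word]  # spell-check step: the word is accepted as correct
    grams = char_ngrams(word, n)
    scored = [(jaccard(grams, char_ngrams(entry, n)), entry) for entry in lexicon]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entry for score, entry in scored[:top_k] if score > 0]


# Toy lexicon for illustration only (assumption, not the article's dataset).
LEXICON = {"heval", "mal", "nan", "dest", "av"}

if __name__ == "__main__":
    print(suggest("hevall", LEXICON))  # misspelled input, candidates are ranked
    print(suggest("nan", LEXICON))     # correctly spelled input is returned as-is
```

In this sketch the misspelling "hevall" shares six of its seven padded bigrams with "heval", giving a Jaccard coefficient of 6/7, so "heval" is ranked first. A full system would combine such a similarity score with the n-gram language model mentioned above; that combination is not shown here.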

