Kurdish Text Segmentation using Projection-Based Approaches

Authors

  • Tofiq Ahmed Tofiq, Department of Computer Science, College of Science, University of Sulaimani, Sulaimani, Iraq
  • Jamal Ali Hussein, Department of Computer Science, College of Science, University of Sulaimani, Sulaimani, Iraq

DOI:

https://doi.org/10.21928/uhdjst.v5n1y2021.pp56-65

Keywords:

Optical character recognition, Character segmentation, Kurdish text segmentation, Projection-based approach, Cursive writing optical character recognition

Abstract

An optical character recognition (OCR) system can ease data entry by converting printed documents into soft copies. OCR systems are therefore being developed for all languages, and Kurdish is no exception. Kurdish presents special challenges to OCR, chiefly because its script is mostly cursive. A segmentation process must therefore be able to locate the beginning and end of each character, a step that is important for character recognition. This paper presents an algorithm for Kurdish character segmentation. The proposed algorithm uses projection-based concepts to separate lines, words, and characters. It computes the vertical projection of a word and then identifies the splitting areas between the word's characters. A post-processing stage then handles the over-segmentation that occurs in the initial segmentation stage. The proposed method is tested on a data set of text images that vary in font size, type, and style and contain more than 63,000 characters. The experiments show that the proposed algorithm segments Kurdish words with an average accuracy of 98.6%.
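
The projection idea outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the splitting rules or the post-processing criteria, so the threshold (taken here as the thickness of the connecting baseline stroke), the minimum-width merge rule, and all function names are assumptions made purely for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in each row (useful for separating text lines)."""
    return [sum(row) for row in img]


def vertical_projection(img: Image) -> List[int]:
    """Count ink pixels in each column (useful for words and characters)."""
    if not img:
        return []
    return [sum(row[x] for row in img) for x in range(len(img[0]))]


def runs_above(profile: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where profile > threshold."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def merge_narrow(segments: List[Tuple[int, int]],
                 min_width: int) -> List[Tuple[int, int]]:
    """Assumed post-processing rule: a segment narrower than min_width is
    treated as over-segmentation and joined to the segment on its left.
    A narrow first segment has no left neighbour and is kept as is."""
    merged: List[Tuple[int, int]] = []
    for start, end in segments:
        if merged and (end - start) < min_width:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Toy "word": three ink blobs joined by a one-pixel-thick baseline (row 3).
rows = [
    "110000000000",
    "111001100000",
    "111001101000",
    "1" * 11 + "0",
    "0" * 12,
]
word = [[int(c) for c in row] for row in rows]

# Rows with any ink form a text line.
lines = runs_above(horizontal_projection(word), 0)      # [(0, 4)]

# Columns whose ink count does not exceed the stroke thickness (1 here)
# are candidate splitting areas; the runs between them are candidates.
profile = vertical_projection(word)
candidates = runs_above(profile, 1)                     # [(0, 3), (5, 7), (8, 9)]

# The one-column fragment at (8, 9) is merged into its left neighbour.
characters = merge_narrow(candidates, 2)                # [(0, 3), (5, 9)]
print(lines, candidates, characters)
```

In this toy word, thresholding the vertical projection at the stroke thickness yields three candidate segments, and the assumed minimum-width rule merges the one-column fragment into the segment on its left, leaving two character ranges.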

Published

2021-05-16
