Rough Set-Based Feature Selection for Predicting Diabetes Using Logistic Regression with Stochastic Gradient Descent Algorithm


  • Kanaan M. Kaka-Khan Department of Information Technology, University of Human Development, Iraq
  • Hoger Mahmud Department of Information Technology, the American University of Iraq, Sulaimani
  • Aras Ahmed Ali University College of Goizha, Sulaymaniyah



Logistic Regression, Stochastic Gradient Descent, Rough Set Theory, K-fold cross-validation, Diabetes prediction


Disease prediction and decision-making play an important role in medical diagnosis. Research has shown that the cost of disease prediction and diagnosis can be reduced by applying interdisciplinary approaches, and interdisciplinary researchers have shown that machine learning and data mining techniques from computer science have high potential in this field. In this research, a new approach is proposed to predict diabetes in patients. The approach uses stochastic gradient descent, a machine learning technique, to perform logistic regression on a dataset. The dataset contains eight original variables (features) collected from patients before they were diagnosed with diabetes, and these features are used as input values to predict diabetes in the patients. To examine the effect of choosing the right variables when making predictions, five variables are selected from the dataset based on rough set theory (RST), and the approach is applied again using only the selected features. The results of both applications are documented and compared as part of the evaluation of the approach. They show that the accuracy of predicting diabetes improves when RST is used to select the variables on which the prediction is based. This paper contributes to the ongoing efforts to find innovative ways to improve the prediction of diabetes in patients.
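The workflow described in the abstract (fit a logistic regression with stochastic gradient descent, evaluate it with k-fold cross-validation, then repeat on a reduced feature set) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic, the function names (`predict`, `fit_sgd`, `kfold_accuracy`, `make_toy_data`, `select_columns`) and the hyperparameters are hypothetical, the update rule is the standard log-loss gradient step, and the feature subset is hard-coded as a stand-in for a reduct that RST would compute.

```python
import math
import random


def predict(row, coef):
    """Return P(y=1 | row); coef[0] is the intercept, coef[1:] the feature weights."""
    z = coef[0] + sum(c * x for c, x in zip(coef[1:], row))
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def fit_sgd(rows, labels, lr=0.1, epochs=100, seed=0):
    """Fit logistic regression by updating the coefficients one sample at a time."""
    rng = random.Random(seed)
    coef = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            error = labels[i] - predict(rows[i], coef)
            coef[0] += lr * error  # log-loss gradient step for the intercept
            for j, x in enumerate(rows[i]):
                coef[j + 1] += lr * error * x
    return coef


def kfold_accuracy(rows, labels, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_sgd([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(int(predict(rows[i], coef) >= 0.5) == labels[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / k


def make_toy_data(n=200, seed=1):
    """Synthetic stand-in data: columns 0 and 1 decide the label, 2 and 3 are noise."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(4)] for _ in range(n)]
    labels = [1 if r[0] + r[1] > 1.0 else 0 for r in rows]
    return rows, labels


def select_columns(rows, cols):
    """Keep only the listed feature columns."""
    return [[row[j] for j in cols] for row in rows]


rows, labels = make_toy_data()
selected = [0, 1]  # assumption: hard-coded here; RST would derive this subset
print("all features:     ", kfold_accuracy(rows, labels))
print("selected features:", kfold_accuracy(select_columns(rows, selected), labels))
```

On real patient data, the features would normally be rescaled to a common range before training, since a single learning rate applies to every coefficient in the per-sample update.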

