1Department of Information Technology, University of Human Development, Iraq, 2Department of Information Technology, the American University of Iraq, Sulaimani, 3University College of Goizha, Sulaymaniyah
DOI: 10.21928/uhdjst.v6n2y2022.pp85-93
ABSTRACT
Disease prediction and decision-making play an important role in medical diagnosis. Research has shown that the cost of disease prediction and diagnosis can be reduced by applying interdisciplinary approaches. Interdisciplinary researchers have shown that machine learning and data mining techniques from computer science have high potential in the field of disease prediction and diagnosis. In this research, a new approach is proposed to predict diabetes in patients. The approach utilizes stochastic gradient descent, a machine learning technique, to perform logistic regression on a dataset. The dataset is populated with eight original variables (features) collected from patients before they were diagnosed with diabetes. The features are used as input values in the proposed approach to predict diabetes in the patients. To examine the effect of having the right variables in the process of making predictions, five variables are selected from the dataset based on rough set theory (RST). The proposed approach is then applied again, this time on the selected features, to predict diabetes in the patients. The results obtained from both applications have been documented and compared as part of the evaluation of the approach. The results show that the proposed approach improves the accuracy of predicting diabetes when RST is used to select variables for making the prediction. This paper contributes toward the ongoing efforts to find innovative ways to improve the prediction of diabetes in patients.
Index Terms: Logistic Regression, Stochastic Gradient Descent, Rough Set Theory, K-fold Cross-validation, Diabetes Prediction
Changes in human lifestyle and the deterioration of the environment have had a negative impact on human health. For that reason, human health has always been the subject of research with the aim of improving it. Diabetes is a group of metabolic diseases which result in high blood sugar levels for a prolonged period. As stated by the International Diabetes Federation, 537 million adults (20–79 years) are living with diabetes, which is 1 in 10 of the adult population. This number is predicted to rise to 643 million by 2030 and 783 million by 2045 [1]. Diabetes has been the subject of research for some time by multidisciplinary scientists with the aim of finding and improving methods that lead to effective prevention, diagnosis, and treatment of the disease. For instance, in a similar approach, in 2013, Anouncia et al. proposed a diagnosis system for diabetes. The system is implemented to diagnose the type of diabetes based on symptoms provided by patients. They used rough set-based knowledge representation in developing their system, and the results showed improvements in the accuracy of diabetes type diagnosis and in the time the diagnosis takes [2]. Despite all the efforts invested in researching diagnostic techniques for diabetes, research shows that there is still room for improvement, especially in the level of accuracy in predicting the disease in a patient. Rough set theory (RST) has been used by researchers for a wide array of prediction tasks such as time series prediction [3], crop prediction [4], currency crisis prediction [5], and stock market trend prediction [6]. In this research, we use RST to select variables in a dataset with the aim of improving the level of accuracy in predicting diabetes in a patient. The stochastic gradient descent algorithm is used to process the selected variables and make diabetes predictions based on logistic regression values computed from the dataset.
The dataset used for all experiments in this study is the Pima Indians Diabetes dataset [7]. This paper contributes toward the ongoing efforts to find innovative ways to improve the prediction of diabetes in patients by proposing a new approach to predict diabetes in patients using machine learning techniques. The results presented in Sections 5.1 and 5.2 show that the approach improves accuracy in making diabetes predictions compared to other available approaches.
The rest of this paper is organized as follows: Section 2 provides the theoretical background needed to understand the selected techniques and Section 3 provides a survey of the related literature. Section 4 describes the methodology used in this study. Experimental results and discussion are provided in Section 5. Finally, conclusions are drawn in Section 6.
This section provides a basic background on the theories used in the study.
Rough set theory [8] was proposed by Pawlak to deal with uncertainty and incompleteness. It offers mathematical tools to discover patterns hidden in datasets and identifies partial or total dependencies in a dataset based on the indiscernibility relation. The technique evaluates selections of features to determine the relevant ones. The general concepts in rough set theory are as follows:
The Lower Approximation of a set X is the set of objects in a table of information which certainly belong to the class X:
The Upper Approximation of a set X includes all objects in a table of information which possibly belong to the class X:
Boundary Region is the difference between the upper approximation set and the lower approximation set, referred to as Bnd (X)
Positive Region is the set of all objects that belong to a lower approximation, that is, the union of all the lower approximation sets:
Indiscernibility of the positive region for any G ⊆ Att is the associated equivalence relation:
Reducts are the minimal representations of the original data without loss of information:
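The lower approximation, upper approximation, and boundary region listed above can be illustrated with a minimal Python sketch. It assumes the standard Pawlak definitions: elementary sets are the equivalence classes of the indiscernibility relation, the lower approximation is the union of elementary sets fully contained in X, and the upper approximation is the union of elementary sets that intersect X. The function names, the toy table, and its attribute values are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def elementary_sets(table, attributes):
    """Group object ids that have identical values on the given attributes."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        groups[key].add(obj_id)
    return list(groups.values())


def approximations(table, attributes, target):
    """Return (lower, upper, boundary) approximations of the object set `target`."""
    lower, upper = set(), set()
    for block in elementary_sets(table, attributes):
        if block <= target:      # block certainly belongs to the class
            lower |= block
        if block & target:       # block possibly belongs to the class
            upper |= block
    return lower, upper, upper - lower


# Hypothetical toy information table (not data from the study).
toy_table = {
    "p1": {"glucose": "high", "age": "old"},
    "p2": {"glucose": "high", "age": "old"},
    "p3": {"glucose": "low", "age": "young"},
    "p4": {"glucose": "high", "age": "young"},
}
diabetic = {"p1", "p4"}  # hypothetical class X

lower, upper, boundary = approximations(toy_table, ["glucose", "age"], diabetic)
print(sorted(lower), sorted(upper), sorted(boundary))
```

In this toy table, p1 and p2 are indiscernible but only p1 is in X, so both fall in the boundary region, while p4 forms its own elementary set and belongs to the lower approximation.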
According to [9], stochastic gradient descent is a process that minimizes a function by following the slope (gradient) of that function. In machine learning, stochastic gradient descent can be considered a technique that evaluates and updates the weights at every iteration so as to minimize the error of a model on the training data. During training, this optimization technique shows the training samples to the model one at a time. For each training sample, the model produces an output (prediction), calculates the error, and updates the weights to reduce the error for the next output. This process is repeated for a fixed number of epochs or iterations. Equation 7 describes how the set of weights (coefficients) in a model is found and updated from the training data.
Here, b is the coefficient (weight) being estimated, learning rate is a learning value that can be con
Logistic regression [10] is a linear classification algorithm for two-class problems. Equation 9 represents the logistic regression algorithm. In this algorithm, the input values (X) are combined linearly using coefficient (weight) values to make a prediction (y). Logistic regression produces a binary output value (0 or 1).
Logistic regression is built on Euler’s number (e). The estimated output is represented as yhat, the algorithm’s bias is b0, and the coefficient (weight) for the single input value (x1) is represented as b1. Logistic regression produces a real value as an output (yhat) which is between 0 and 1. To be mapped to an estimated class value, the output needs to be rounded to an integer value (0 or 1). Each column (attribute) of the dataset has an associated coefficient (b) that should be estimated from the training data; these coefficients are the actual representation of the model and can be saved for further use.
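The description above can be sketched in a few lines of Python. The sketch assumes the standard logistic form yhat = 1 / (1 + e^-(b0 + b1*x1 + ...)), generalized to several inputs; the function signature and the coefficient values are illustrative assumptions, not the study's actual script.

```python
from math import exp


def predict(features, coefficients):
    """Logistic regression prediction.

    coefficients[0] is the bias b0; coefficients[i + 1] is the weight for features[i].
    Returns yhat, a real value strictly between 0 and 1.
    """
    z = coefficients[0]
    for x, b in zip(features, coefficients[1:]):
        z += b * x
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients: b0 = -1.0, b1 = 1.0, with a single input x1 = 2.0.
yhat = predict([2.0], [-1.0, 1.0])
predicted_class = round(yhat)  # map the real-valued output to class 0 or 1
print(yhat, predicted_class)
```

With all coefficients equal to zero the function returns exactly 0.5, which is the point where the model is undecided between the two classes.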
Prediction is a widely used approach in many fields of science, including healthcare, to foresee the possible outcomes of a cause. Disease prediction is an area where researchers have been applying a number of different theories, including machine learning theories, with the aim of finding methods that make the most accurate prediction possible. RST is one of the theories used to classify and predict diseases. For instance, the authors of [11] used the theory to classify medical diagnoses, and the authors of [12] and [13] modified and used the theory to improve disease prediction. Type 1 and 2 diabetes were the focus of the authors of [14], who developed a hybrid reasoning model to address prediction accuracy issues. Based on their results, they claim that their approach raises diabetes prediction accuracy to 95% compared to other existing approaches. In 2017, RST was used by the authors of [15] to develop a model for clustering the patients in a dataset. The authors clustered the patients using average values calculated from the diabetes indicators in the dataset. In the same year, deep learning was utilized by the authors of [16] to establish an intelligent diabetes prediction model, in which patients’ risk factors collected in a dataset were considered to make the prediction.
In 2018, fuzzy RST was first applied to select specific features in a dataset; the Optimized Genetic Algorithm (OGA) was then applied to improve prediction performance, save processing time, and improve diagnosis accuracy. The results obtained from the study show that the approach achieved the objectives of the study [17]. In 2020, Vamsidhar Talasila and Kotakonda Madhubabu proposed the use of the RST technique to select the most relevant features to be input to a Recurrent Neural Network (RNN) for disease prediction. They claimed that the RST-RNN method achieved an accuracy of 98.57% [18]. In the same year, Gao and Cheng proposed an improved neighborhood rough set attribute reduction algorithm (INRS) that increases the dependence of conditional attributes by considering the importance of individual features for diabetes prediction [14]. In 2021, Gadekallu and Gao proposed a model using an approach based on rough sets to reduce the attributes needed in heart disease and diabetes prediction [19]. The main limitation of these studies is that none has considered the quantity and quality of the variables used to make diagnostic predictions.
The approach used in this study is similar to the ones used in the surveyed literature but differs in its objectives. We use RST to select the best features in a dataset and use the stochastic gradient descent algorithm to compute the logistic regression values from the selected features, with the aim of improving the prediction accuracy of diabetes in a patient.
This section provides insights into the methodology used to achieve the objectives of the study. The methodology comprises six major steps:
A dataset is selected, examined for suitability and reliability based on a number of characteristics, and uploaded to be analyzed. The dataset selected and uploaded for the purpose of this research is the Pima Indians Diabetes dataset [7]. The selected dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details. The dataset is a 2-class classification problem and consists of 768 samples with 8 input variables and 1 output variable. The variable names are as follows: Number of Times Pregnant, Plasma Glucose Concentration at 2 h in an oral glucose tolerance test, Diastolic Blood Pressure (mm Hg), Triceps Skinfold Thickness (mm), 2-h Serum Insulin (mu U/ml), Body Mass Index (weight in kg/[height in m]^2), Diabetes Pedigree Function, Age, and Class Variable (0 or 1). Before implementing the model, preprocessing is highly recommended because of some deficiencies in the data. Datasets usually contain features that vary widely in magnitude, unit, and range, which may result in inaccurate output [20]. In this work, because the stochastic gradient descent algorithm is used, the dataset has been normalized using min-max scaling to bring all values between 0 and 1. Table 1 shows a sample of the selected dataset.
TABLE 1: The first ten records of the diabetes dataset used in this study
The selected diabetes dataset is preprocessed and normalized. To increase the efficiency and accuracy of the model, the dataset needs to be preprocessed before applying the proposed model, since the data may contain null values and incorrect or redundant information. In general, data preprocessing involves two major steps: data cleaning and data normalization. Data cleaning means removing incorrect information or filling in missing values to increase the validity and quality of a dataset by applying a number of different methods [21]. In this study, if a tuple contains missing values, the missing attribute value is assumed to be 0 (this is achieved using the fill_mising_values() function from the Python script developed for the implementation phase of this study). Redundant or unnecessary columns are deleted to obtain a high-quality dataset (this is achieved using the remove_duplicate_columns() function from the Python script). To let all features have equal weight and contribution to the model, the range of each feature needs to be scaled. For this purpose, the dataset is normalized to the range [0,1] by the following process. String column conversion: the string columns are converted to float using the str_column_to_float() function. Min-max finding: the min and max values of each column of the dataset are found using the dataset_minmax() function. Finally, the dataset is normalized by the min-max normalization method using the following equation, adapted from [22].
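The min-max step described above can be sketched as follows, assuming the usual formula x' = (x - min) / (max - min) applied column by column. The function names dataset_minmax() and normalize_dataset() follow the names mentioned in the text, but their bodies here are an assumption rather than the study's actual script, and the sample rows are made up for illustration.

```python
def dataset_minmax(dataset):
    """Return a (min, max) pair for every column of a list-of-rows dataset."""
    return [(min(column), max(column)) for column in zip(*dataset)]


def normalize_dataset(dataset, minmax):
    """Rescale every value to [0, 1] using x' = (x - min) / (max - min)."""
    normalized = []
    for row in dataset:
        new_row = []
        for value, (lo, hi) in zip(row, minmax):
            # A constant column has hi == lo; map it to 0.0 to avoid dividing by zero.
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical rows with two numeric features (not records from the dataset).
rows = [[50.0, 30.0], [20.0, 90.0], [35.0, 60.0]]
minmax = dataset_minmax(rows)
scaled = normalize_dataset(rows, minmax)
print(minmax)
print(scaled)
```

The guard for constant columns is a design choice of this sketch; the text does not say how such columns are handled.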
In this step, RST is applied to select the features which might produce a better prediction. There are nine variables in total in the dataset, as shown in Table 1. The class variable is considered the dependent variable and the other eight variables are treated as predictors (independent variables). Table 2 presents the regression calculation summary for diabetes classification of the dataset. The result of the calculation shows that the accuracy of diabetes prediction is 30.32% if all variables in the dataset are considered in the calculation. The low accuracy is an indication that there might be one or more variables which are not fit to be used for prediction. The regression calculation also shows that the unstandardized regression coefficient (b) is 0.06 for pregnancies, which indicates that if all other predictors are controlled, then an increment of one unit in pregnancies increases the accuracy by 0.06. The same statement can be made for the other variables. To filter the features that might produce a better diabetes prediction, the dataset is grouped into nine elementary sets based on the level of the indiscernibility relation between the data elements. Table 3 shows the details of the groups. To further process the groups, the discernibility matrix has been developed for the elementary sets and the result is shown in Table 4. From the discernibility matrix, a discernibility function has been developed, as shown in equation 11.
TABLE 2: Linear regression statistics of diabetes dataset
TABLE 3: Elementary sets
TABLE 4: Discernibility matrix
As the result of discernibility function of all elementary sets for the entire dataset, we found that:
f(A) = a1∨a2∨a5∨a6∨a8 where a1 is Pregnancies; a2 is Plasma glucose; a5 is Insulin; a6 is DPF; and a8 is age attribute. Table 5 shows the reduct matrix for the elementary sets. From the reduct matrix, all reducts and core attributes have been found:
TABLE 5: Reducts matrix
f(R1) = a1∨a2∨a6; f(R2) = a1∨a2∨a5∨a8; f(R3) = a2∨a5∨a8; f(R4) = a1∨a2∨a8; f(R5) = a2∨a6∨a8; f(R6) = a1∨a2∨a6∨a8; f(R7) = a2∨a5∨a6; f(R8) = a1∨a2∨a5; f(R9) = ∨a2∨a5∨a6∨a8. Finally, Table 6 shows the features that are selected to be used for making diabetes prediction.
TABLE 6: Indiscernibility table
Table 3 shows the indiscernibility level of the relation between the patients.
Table 6 represents the last step of the RST process, in which the data are simplified and the indiscernibility relations are stated. The * symbol means that a certain variable has no impact in a certain case. For example, if the patient’s number of pregnancies is (0–1), plasma glucose is (0–22), and DPF is (0–0.25), then the patient has diabetes regardless of the values of the other attributes, and so on.
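The discernibility matrix used in this step can be illustrated with a small sketch. It assumes the standard construction, where the entry for a pair of elementary sets lists the attributes on which the two sets differ, and the standard property that attributes appearing alone in an entry form the core. The three elementary sets and their discretized values below are hypothetical and do not reproduce Tables 3 to 5.

```python
from itertools import combinations


def discernibility_matrix(elementary, attributes):
    """Map each pair (i, j) of elementary sets to the attributes that tell them apart."""
    matrix = {}
    for (i, a), (j, b) in combinations(enumerate(elementary), 2):
        matrix[(i, j)] = {attr for attr in attributes if a[attr] != b[attr]}
    return matrix


def core_attributes(matrix):
    """Attributes that appear as singleton entries cannot be dropped from any reduct."""
    return {next(iter(entry)) for entry in matrix.values() if len(entry) == 1}


# Hypothetical discretized elementary sets (a1 = pregnancies, a2 = glucose, a8 = age).
attributes = ["a1", "a2", "a8"]
elementary = [
    {"a1": "0-1", "a2": "high", "a8": "young"},
    {"a1": "0-1", "a2": "low", "a8": "young"},
    {"a1": "2+", "a2": "low", "a8": "old"},
]

matrix = discernibility_matrix(elementary, attributes)
core = core_attributes(matrix)
print(matrix)
print(core)
```

In this toy example only a2 separates the first two elementary sets, so a2 is a core attribute, which mirrors the fact that a2 appears in every reduct listed in the text.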
In this step, the logistic regression algorithm with stochastic gradient descent technique is applied on the selected features in the previous step. The major steps of the application are as follows:
The dataset is loaded into the model through the load_dataset() function.
The dataset is preprocessed through the str_column_to_float(), dataset_minmax(), and normalize_dataset() functions, in that order.
The dataset is split into k folds; the train and test sets used for training and evaluating the model are created through the cross_validation_split() function.
Coefficients (weights) are the values that determine the accuracy of the model, and they can be estimated from the training data using stochastic gradient descent. The algorithm uses two parameters to estimate the coefficients. The first is the learning rate, which limits the amount by which each coefficient is corrected every time it is updated. The second is the number of epochs, that is, the number of passes through the training data while updating the coefficients. Coefficient estimation is achieved through the coefficients_sgd() function.
For each instance in the training data, each coefficient is updated in every epoch. The error that the model makes is the criterion for updating the coefficients. The error can be calculated using a simple equation (equation 12).
Error = (Expected output value) – (Prediction made with the candidate coefficients) (12)
Predictions are generated; equation 9 describes the prediction process, which is the most important part of the model. The prediction process is needed twice: first in stochastic gradient descent to evaluate candidate coefficient values, and second in the finalized model to produce outputs (predictions) on test data. The prediction process is achieved through the predict() function. Fig. 1 shows the execution flow of the proposed approach.
Fig. 1. Proposed diabetes prediction method.
Finally, the results obtained are compared. Fig. 1 shows the proposed diabetes prediction method.
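The coefficient estimation and prediction steps listed above can be sketched as follows. Because the update equation is not reproduced in the text, the sketch assumes one common formulation of the per-sample update for logistic regression trained with stochastic gradient descent: b = b + learning_rate × error × yhat × (1 − yhat) × x, with error defined as in equation 12. The names coefficients_sgd() and predict() follow the functions mentioned in the text, but their bodies and the toy training rows are assumptions.

```python
from math import exp


def predict(features, coefficients):
    """Logistic prediction; coefficients[0] is the bias term."""
    z = coefficients[0]
    for x, b in zip(features, coefficients[1:]):
        z += b * x
    return 1.0 / (1.0 + exp(-z))


def coefficients_sgd(train, l_rate, n_epoch):
    """Estimate coefficients with stochastic gradient descent.

    Each row of `train` holds the feature values followed by the class label.
    """
    coef = [0.0] * len(train[0])  # bias plus one weight per feature
    for _ in range(n_epoch):
        for row in train:
            features, label = row[:-1], row[-1]
            yhat = predict(features, coef)
            error = label - yhat  # expected output minus prediction (equation 12)
            step = l_rate * error * yhat * (1.0 - yhat)  # assumed update form
            coef[0] += step  # the bias has an implicit input of 1
            for i, x in enumerate(features):
                coef[i + 1] += step * x
    return coef


# Hypothetical normalized training rows: one feature, then the class label.
toy_train = [[0.1, 0], [0.2, 0], [0.8, 1], [0.9, 1]]
coef = coefficients_sgd(toy_train, l_rate=0.1, n_epoch=100)
print(coef)
print(predict([0.9], coef), predict([0.1], coef))
```

The learning rate of 0.1 and the 100 epochs match the values reported for this work; everything else in the example is illustrative.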
In this research, the k-fold cross-validation technique has been used to evaluate the learned model’s performance on unseen data. Cross-validation is a resampling procedure used to validate machine learning models on a limited data sample. Using k-fold cross-validation means that k models are constructed and evaluated, and the model’s performance is estimated from the mean model error. The predicted value of each row, which is a float number between 0 and 1, is rounded and then compared to the row’s actual value. If they are equal, the prediction is considered correct. A simple error equation (equation 13) is used to evaluate each model.
The general procedure is as follows: (1) shuffle the dataset randomly; (2) split the dataset into k groups; (3) take one group as the test set and the remaining groups as the training set, repeating the same procedure for each and every group; (4) fit the model on the training set and evaluate it on the test set; and (5) retain the result (evaluation score), after which the model can be discarded [17], [23]. For this work, the learning rate, number of training epochs, and k value are 0.1, 100, and 5, respectively.
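The shuffling and splitting steps of this procedure can be sketched as follows. The name cross_validation_split() follows the function mentioned in the text, but the body, the fixed seed, the helper train_test_pairs(), and the placeholder rows are assumptions made for illustration. In this sketch, rows left over when the dataset size is not divisible by k are simply dropped.

```python
from random import Random


def cross_validation_split(dataset, n_folds, seed=1):
    """Shuffle the rows and cut them into n_folds equally sized folds."""
    shuffled = list(dataset)
    Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    fold_size = len(shuffled) // n_folds  # leftover rows are dropped
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]


def train_test_pairs(folds):
    """Yield (train, test) pairs, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Hypothetical dataset of 10 rows, split into k = 5 folds.
data = [[i] for i in range(10)]
folds = cross_validation_split(data, n_folds=5)
pairs = list(train_test_pairs(folds))
for train, test in pairs:
    print(len(train), len(test))
```

Each of the five (train, test) pairs corresponds to one of the five models, so every row is used for testing exactly once.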
After implementing the model twice; first on the dataset with all features, and second with features selected by applying RST, the results can be discussed as follows:
The aim of using logistic regression is to predict the dependent variable (output variable) based on equation 9, and the aim of using the stochastic gradient descent technique is to minimize the error of the predicted coefficient values while training the model on the dataset. For model training, the k-fold cross-validation technique is used to split the dataset into 5 folds (groups); one fold is used as the test set and the others as the training set, for example:
Model 1: Fold1 for test and fold2, fold3, fold4, and fold5 for train
Model 2: Fold2 for test and fold1, fold3, fold4, and fold5 for train
Model 3: Fold3 for test and fold1, fold2, fold4, and fold5 for train
Model 4: Fold4 for test and fold1, fold2, fold3, and fold5 for train
Model 5: Fold5 for test and fold1, fold2, fold3, and fold4 for train.
For each model, after training for 100 epochs (iterations) to minimize the error, the accuracy is calculated using equation 13, and the overall score is then calculated using equation 14.
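The two evaluation quantities can be sketched as follows. Since equations 13 and 14 are not reproduced in the text, the sketch assumes that the accuracy of a model is the percentage of correctly predicted rows and that the overall score is the mean of the per-model accuracies. The function names and the numbers in the example are hypothetical and are not the results of the study.

```python
def accuracy_metric(actual, predicted):
    """Percentage of rows whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0


def mean_score(scores):
    """Overall score: the mean of the per-model accuracies."""
    return sum(scores) / len(scores)


# Hypothetical model outputs (floats between 0 and 1) and actual classes.
outputs = [0.2, 0.9, 0.4, 0.1]
actual = [0, 1, 1, 0]
predicted = [round(y) for y in outputs]  # round each output to class 0 or 1

fold_accuracy = accuracy_metric(actual, predicted)
overall = mean_score([70.0, 80.0, 90.0])  # made-up per-model accuracies
print(fold_accuracy, overall)
```

Note that Python's round() rounds an output of exactly 0.5 to 0, so a different tie-breaking rule would need an explicit threshold comparison.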
The total number of models used is five. Table 7 summarizes the results of the models and the overall score. The overall score is 77.12% for the model on the dataset with all features.
TABLE 7: Accuracy score of each model used
The same process is applied to the dataset with the features selected based on RST; the result is presented in Table 8.
TABLE 8: Accuracy and score for all five models for selected features
Table 9 shows the comparison between the results obtained from both implementations: the model applied to the dataset with all features and the model applied to the RST-based selected features. The results show that the RST-based selected features give more accurate predictions than the dataset with all features.
TABLE 9: Accuracy and score for all five models using all features, RST-based selected features
The baseline score for the selected dataset is 65%. Our experimental results indicate that the proposed approach increases the prediction accuracy on the diabetes dataset from 65% to 77% with all features and to 80% with the RST-based features, as shown in Table 10.
TABLE 10: Accuracy summary of baseline and proposed algorithm for diabetes
Finally, on the basis of the results, it can be concluded that the logistic regression algorithm with the stochastic gradient descent technique is a suitable choice for diabetes prediction. At the same time, rather than using all features, more precise predictions can be made by feature selection based on rough set for neural network. Table 11 summarizes a comparison between our work and some of the most recently published works.
TABLE 11: Dataset classification comparison
In the health-care sector, predicting the presence or absence of diseases is important to help people know their health status so that they can take the necessary steps to control the disease.
This paper explores the use of the stochastic gradient descent algorithm to apply logistic regression on datasets to make predictions on the presence of diabetes. The Pima Indians Diabetes dataset is used to produce results using the proposed technique. The experimental results show that diabetes can be predicted more accurately using logistic regression with the stochastic gradient descent algorithm when RST is used to select the important features on a normalized dataset. This paper contributes to the use of interdisciplinary techniques to improve prediction mechanisms in the health-care sector in general and diabetes prediction in particular. The main purpose of this work is to show the significance of using RST with machine learning algorithms; hence, in the future, the same theory can be applied with other algorithms to obtain better results.
[1]. “Diabetesatlas”. Available from: https://www.diabetesatlas.org [Last accessed on 2022 Aug 08].
[2]. M. Anouncia, C. Maddona, P. Jeevitha and R. Nandhini. “Design of a diabetic diagnosis system using rough sets”. Cybernetics and Information Technologies, vol. 13, no. 3, pp. 124-169, 2013.
[3]. F. E. Gmati, S. Chakhar, W. L. Chaari and H. Chen. “A rough set approach to events prediction in multiple time series”. In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, vol. 10868, pp. 796-807, 2018.
[4]. H. Patel and D. Patel. “Crop prediction framework using rough set theory”. International Journal of Engineering and Technology, vol. 9, pp. 2505-2513, 2017.
[5]. S. K. Manga. “Currency crisis prediction by using rough set theory”. International Journal of Computer Applications, vol. 32, pp. 48-52, 2011.
[6]. B. B. Nair, V. Mohandas and N. Sakthivel. “A decision tree-rough set hybrid system for stock market trend prediction”. International Journal of Computer Applications, vol. 6, no. 9, pp. 1-6, 2010.
[7]. “Pima-Indians-Diabetes-Dataset”. Available from: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database [Last accessed on 2022 May 04].
[8]. Z. Pawlak. “Rough set theory and its applications to data analysis”. Cybernetics and Systems, vol. 29, no. 7, pp. 661-688, 1998.
[9]. P. Achlioptas. “Stochastic Gradient Descent in Theory and Practice”. Stanford University, Stanford, CA, 2019.
[10]. J. Brownlee. Machine Learning Algorithms from Scratch with Python. Machine Learning Mastery, 151 Calle de San Francisco, US, 2016.
[11]. H. H. Inbarani and S. U. Kumar. “A novel neighborhood rough set based classification approach for medical diagnosis”. Procedia Computer Science, vol. 47, pp. 351-359, 2015.
[12]. E. S. Al-Shamery and A. A. R. Al-Obaidi. “Disease prediction improvement based on modified rough set and most common decision tree”. Journal of Engineering and Applied Sciences, vol. 13, no. Special issue 5, pp. 4609-4615, 2018.
[13]. R. Ghorbani and R. Ghousi. “Predictive data mining approaches in medical diagnosis: A review of some diseases prediction”. International Journal of Data and Network Science, vol. 3, no. 2, pp. 47-70, 2019.
[14]. R. Ali, J. Hussain, M. H. Siddiqi, M. Hussain and S. Lee. “H2RM: A hybrid rough set reasoning model for prediction and management of diabetes mellitus”. Sensors, vol. 15, no. 7, pp. 15921-15951, 2015.
[15]. S. Sawa, R. D. Caytiles and N. C. S. Iyengar. “A Rough Set Theory Approach to Diabetes”. In: Conference: Next Generation Computer and Information Technology, 2017.
[16]. S. Ramesh, H. Balaji, N. Iyengar and R. D. Caytiles. “Optimal predictive analytics of pima diabetics using deep learning”. International Journal of Database Theory and Application, vol. 10, no. 9, pp. 47-62, 2017.
[17]. K. Thangadurai and N. Nandhini. “Integration of rough set theory and genetic algorithm for optimal feature subset selection on diabetic diagnosis”. ICTACT Journal on Soft Computing, vol. 8, no. 2, 2018.
[18]. V. Talasila, K. Madhubabu, M. Mahadasyam, N. Atchala and L. Kande. “The prediction of diseases using rough set theory with recurrent neural network in big data analytics”. International Journal of Intelligent Engineering and Systems, vol. 13, no. 5, pp. 10-18, 2020.
[19]. T. R. Gadekallu and X. Z. Gao. “An efficient attribute reduction and fuzzy logic classifier for heart disease and diabetes prediction”. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), vol. 14, no. 1, pp. 158-165, 2021.
[20]. “Medium”. Available from: https://www.medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e [Last accessed on 2022 Jun 05].
[21]. E. Rahm and H. H. Do. “Data cleaning: Problems and current approaches”. IEEE Data Engineering Bulletin, vol. 23, no. 4, pp. 3-13, 2000.
[22]. D. Borkin, A. Némethová, G. Michal'conok and K. Maiorov. “Impact of data normalization on classification model accuracy”. Research Papers Faculty of Materials Science and Technology Slovak University of Technology, vol. 27, no. 45, pp. 79-84, 2019.
[23]. “Machine Learning Mastery”. Available from: https://www.machinelearningmastery.com/k-fold-cross-validation [Last accessed on 2022 Aug 06].
[24]. G. Battineni, G. G. Sagaro, C. Nalini, F. Amenta and S. K. Tayebati. “Comparative machine-learning approach: A follow-up study on Type 2 diabetes predictions by cross-validation methods”. Machines, vol. 7, no. 4, pp. 74, 2019.
[25]. D. K. Choubey, P. Kumar, S. Tripathi and S. Kumar. “Performance evaluation of classification methods with PCA and PSO for diabetes”. Network Modeling Analysis in Health Informatics and Bioinformatics, vol. 9, no. 1, 5, 2020.
[26]. R. Patra and B. Khuntia. “Analysis and prediction of Pima Indian diabetes dataset using SDKNN classifier technique”. IOP Conference Series: Materials Science and Engineering, vol. 1070, no. 1, 012059, 2021.
[27]. V. Chang, J. Bailey, Q. A. Xu and Z. Sun. “Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms”. Neural Computing and Applications, vol. 34, no. 10, pp. 1-7, 2022.