
Diabetes Datasets Using Data Mining



Data mining is the process of sorting through large data sets to identify patterns and establish relationships that help solve problems through data analysis. Data mining tools allow enterprises to predict future trends. According to a 2014 WHO report, around 422 million people worldwide have diabetes. Methods based on data mining techniques can also be applied effectively to high blood pressure risk prediction. In this paper, we explore the early prediction of diabetes with three data mining methods: Naive Bayes, logistic regression and KNN.

The experiments use the WEKA Explorer and WEKA Experimenter interfaces; WEKA is a good classification tool and is the one used in this paper. We used the diabetes dataset from the UCI Machine Learning Repository. The performance of the three algorithms was analyzed on the diabetes dataset using the training-data testing mode.

Key words: WEKA tool, classification, association, clustering, prediction, KDD.


Data mining is described as the process of discovering correlations, patterns and trends by searching through large amounts of data stored in repositories, databases and data warehouses.

New tools and techniques are therefore being developed to solve this problem through automation. Diabetes mellitus is a chronic disease and a major public health challenge worldwide. Diabetes leads to many other conditions, such as blindness, high blood pressure, heart disease, kidney disease and liver damage. In the medical field, however, the relevant datasets are widely distributed, diverse and enormous in nature. These datasets are sorted and integrated by hospital management systems.


Many researchers are conducting experiments on diagnosing diseases using various machine learning classification algorithms such as J48, SVM, Naive Bayes and Decision Tree, as research has shown that machine learning algorithms work better in diagnosing different diseases.

Classification is a supervised learning task, in contrast with clustering, which is unsupervised. It maps a data item into one of several predefined classes. Classification algorithms build a model that should accurately predict the category of unseen instances. Classification has a wide variety of applications in diverse domains such as medical diagnosis, document organization and many others.
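
As an illustration of how a classifier maps an unseen instance to a predefined class, the following is a minimal Gaussian Naive Bayes sketch in plain Python. It is not the WEKA implementation used in the paper; the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the six toy rows (plasma glucose, BMI) are hypothetical and exist only for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one instance."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the Gaussian density of feature value x under this class.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: [plasma glucose, BMI]; 1 = diabetic, 0 = not diabetic.
X = [[85, 22.0], [90, 24.5], [95, 23.1], [150, 33.0], [165, 35.2], [170, 31.8]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [160, 34.0]))  # an unseen instance
print(predict_gaussian_nb(model, [88, 23.0]))   # another unseen instance
```

The model stores only a prior, a mean and a variance per class and feature, which is why Naive Bayes is cheap to train; the independence assumption between features is a simplification and may not hold for real clinical attributes.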

A classification experiment proceeds through the following steps:

  • Input the data
  • Choose the classifier technique
  • Train the classifier on the data set
  • Test the classifier
  • Calculate the accuracy
  • Compare the accuracy of the classifiers
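
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the WEKA workflow itself: the eight toy rows are invented, the `KNN` and `MajorityClass` classes and the `accuracy` helper are hypothetical names, and the majority-class baseline is added here purely to have something to compare against. Testing is done on the training data, mirroring the training-data testing mode mentioned in the paper.

```python
import math
from collections import Counter

# 1. Input the data (hypothetical toy rows: [plasma glucose, BMI], label).
data = [
    ([85, 22.0], 0), ([90, 24.5], 0), ([95, 23.1], 0), ([110, 27.0], 0),
    ([150, 33.0], 1), ([165, 35.2], 1), ([170, 31.8], 1), ([120, 29.5], 1),
]


# 2. Choose the classifier technique: k-NN and a majority-class baseline.
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, rows, labels):
        # 3. "Training" k-NN simply stores the labelled instances.
        self.rows, self.labels = rows, labels
        return self

    def predict(self, row):
        # Take the k stored instances closest to row (Euclidean distance)
        # and return the most common label among them.
        nearest = sorted(
            zip(self.rows, self.labels),
            key=lambda pair: math.dist(pair[0], row),
        )[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


class MajorityClass:
    def fit(self, rows, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label


def accuracy(clf, rows, labels):
    # 5. Accuracy = correctly classified instances / all instances.
    hits = sum(clf.predict(r) == t for r, t in zip(rows, labels))
    return hits / len(labels)


rows = [r for r, _ in data]
labels = [t for _, t in data]

# 4. Test on the training data itself (training-data testing mode).
classifiers = {"kNN (k=3)": KNN(3), "majority": MajorityClass()}
results = {
    name: accuracy(clf.fit(rows, labels), rows, labels)
    for name, clf in classifiers.items()
}

# 6. Compare the classifier accuracy, best first.
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Accuracy measured on the training data tends to be optimistic, so a held-out test set or cross-validation would normally be preferred for a fair comparison. The features are also left unscaled for brevity, which lets the glucose value dominate the distance.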

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.


Saba Bashir, Usman Qamar, Farhan Hassan Khan, M. Younus Javed. An Efficient Rule-based Classification of Diabetes Using ID3, C4.5 & CART Ensembles. 2014, IEEE. Diabetes is one of the most rapidly increasing diseases worldwide and occurs mostly due to obesity and lack of exercise. The paper presents an efficient rule-based approach to predicting diabetes using ID3, C4.5 and CART. Similar ensemble techniques can be applied to other disease datasets such as breast cancer, heart disease and liver disease. Moreover, heterogeneous individual classifiers such as Naive Bayes, SVM and neural networks can be used as base classifiers.

Vrushali R. Balpande, Rakhi D. Wajgi. Prediction and Severity Estimation of Diabetes Using Data Mining Technique. 2017, IEEE. Diabetes is a metabolic disease in which improper management of blood glucose levels leads to a risk of abnormalities in the functioning of critical organs, such as heart attack and kidney and eye diseases. The work improves different algorithms, such as SVM and KNN, for predicting diabetes, which in turn leads to other diseases. In future work, more tests will be performed for prediction and severity estimation, and other parameters will be considered. There might be other risk factors that were not considered, including family history, smoking, metabolic syndrome and an inactive lifestyle. By considering all other attributes, more accurate prediction and quantification of severity may be achieved.

Wenqian Chen, Shuyu Chen, Hancui Zhang. A Hybrid Prediction Model for Type 2 Diabetes Using K-means and Decision Tree. 2017, IEEE. Type 2 diabetes mellitus results from insulin resistance, a condition in which cells fail to use insulin properly, sometimes combined with an absolute insulin deficiency. This type was previously referred to as non-insulin-dependent diabetes mellitus. The data set is the Pima Indian diabetes dataset, containing attributes such as age, sex, BMI and diabetes test results. The dataset is also built from test results of diabetic and non-diabetic patients, along with identification of ranges. A few aspects of this study could be extended in the future. For instance, the proposed model is applied to Type 2 diabetes diagnosis, which is a two-class classification problem; it would be interesting to see its behavior on multi-class classification problems. The proposed model is applied to numeric data only, so it could be improved to handle different types of medical data, such as images and signals. The effectiveness of the proposed method also needs to be assessed on a larger amount of data.

Deepthi Sisodia, Dilip Singh Sisodia. Prediction of Diabetes using classification techniques. 2018, IEEE. Many complications occur if diabetes remains untreated and unidentified. The tedious identification process requires the patient to visit a diagnostic center and consult a doctor. Here diabetes is predicted with classification techniques such as SVM, Decision Tree and KNN. The designed system, with the machine learning classification algorithms it uses, can be applied to predict or diagnose other diseases. The work can be extended and improved for the automation of diabetes analysis by including other machine learning algorithms.

S. Ananthi, V. Bhuvaneswari. Prediction of heart and kidney risks in Diabetic Prone Population using Fuzzy Classification. 2017, IEEE. Early diagnosis of diabetes-induced heart, kidney and eye complications is difficult and challenging. Data mining techniques are applied to clinical data attributes of diabetics to predict the risk factors. The authors develop a fuzzy classification model to predict heart and kidney complications using diabetic clinical data. This prediction of risk complications in diabetics can be applied in big data analytics.

Messan Komi, Jun Li, Yongxin Zhai, Xianguo Zhang. Application of Data Mining Methods in Diabetes Prediction. 2017, IEEE. Diabetes mellitus, or simply diabetes, is a disease caused by an increased level of blood glucose. Various traditional methods, based on physical and chemical tests, are available for diagnosing diabetes. The paper compares the accuracy of different techniques for predicting diabetes. Based on the results, the ANN method provides the highest accuracy, 0.89, in predicting the disease. Compared to the other methods, and due to the complexity and variety of the data set, logistic regression and SVM are less able to obtain the expected result. The proposed work can be further enhanced and expanded for disease prediction. For instance, the features used in the paper can incorporate other medical attributes. Other data mining techniques, such as time series, clustering and association rules, can also be considered.

Raid M. Khalil, Adel Al-Jumaily. Machine Learning Based Prediction of Depression among Type 2 Diabetic Patients. 2017, IEEE. Type 2 diabetes has quite a high incidence all over the world. For the prevention and treatment of Type 2 diabetes, early detection is needed. The work develops an application that predicts diabetes from parameters and different instances. Other learning methods can be tried for better accuracy. Depression is a multi-factorial disease. There may be spurious associations between factors due to confounding, so optimization is needed.

Pradeep K R, Dr. Naveen N C. Predictive Analysis of Diabetes using J48 classification technique. 2016, IEEE. The paper presents an application with which all types of diabetes can be predicted. Subtractive clustering can be used to produce accurate results by using a large number of membership functions. In ANN there is a reduction in performance when the training database is too large, which inhibits the performance of the ANN and also results in long training times. Early diagnosis is required, and because of its low cost the J48 algorithm is preferable for online web applications.

Aparimita Swain, Sachi Nandan Mohanty, Ananta Chandra Das. 2016, IEEE. According to World Health Organization figures for 2014, around 422 million people are affected by diabetes. The aim is to find accurate results using different algorithms such as SVM and ANN. Other network training algorithms can be used, and more input variables and parameters might be considered, to obtain better classification and accuracy for decision making in diabetes mellitus. The computational complexity could be addressed in future work.

Deepika Verma, Dr. Nidhi Mishra. Analysis and Prediction of Breast Cancer and Diabetes Disease Datasets Using Data Mining Classification Techniques. 2017, IEEE. The paper addresses the prediction and identification of various diseases such as stroke, diabetes, cancer, hypothyroid and heart disease. These data sets can be predicted with algorithms such as SVM, logistic regression and kNN. The machine learning classification algorithms used can be applied to predict or diagnose diseases and to automate diabetes analysis.


One of the important real-world medical problems is the detection of diabetes at an early stage. Diabetes is considered one of the deadliest chronic diseases and causes an increase in blood sugar. Many complications occur if diabetes remains untreated and unidentified. Although some models outperform other data mining methods, the relationship between attributes is more difficult to understand in them. Type 2 diabetes mellitus results from insulin resistance, a condition in which cells fail to use insulin properly, sometimes combined with an absolute insulin deficiency. This type was previously referred to as non-insulin-dependent diabetes mellitus. Here the parameters can also be used for problem identification. The field covers the prediction and identification of various diseases such as stroke, diabetes, cancer, hypothyroid and heart disease.


The proposed model is applied to numeric data only; it could be improved to examine its behavior on different types of medical data, such as images and signals. Moreover, practical implementation requires assessing the effectiveness of the proposed method on a larger amount of data. Other network training algorithms can be used, and more input variables and parameters might be considered, to obtain better classification and accuracy for decision making in diabetes mellitus. The computational complexity could also be addressed. With the rapidly growing demand for medical data analysis, the proposed model can be useful to researchers and doctors in their decision-making about patients, since such an efficient model lets them make more accurate decisions. The machine learning classification algorithms used can be applied to predict or diagnose diseases and to automate diabetes analysis. This prediction of risk complications in diabetics can be applied in big data analytics, and testing would be carried out on the datasets.



Cite this page

Diabetes Datasets Using Data Mining. (2019, Aug 20). Retrieved from
