Helping patients know whether they are diabetic or not using Machine Learning
Diabetes is an illness characterized by a high glucose level in the blood. If left untreated, it can cause many complications that may affect most of the organs in the body. In simple terms, diabetes occurs when the blood glucose level is higher than normal, which happens when the body produces too little insulin or cannot use it effectively. The carbohydrates in the food we eat are digested in the intestines and converted into a sugar called glucose. This glucose circulates in the blood and needs to be passed on to the body's cells. To do this, the pancreas produces a hormone called insulin, which helps glucose move from the blood into the cells. There are different types of diabetes:
- Type-1: The pancreas produces little or no insulin.
- Type-2: Insulin is generated, but the body's cells are not able to use it properly.
- Gestational diabetes: This occurs during pregnancy, when the hormones produced change the body's metabolism and keep insulin from doing its job.
Random Forest Classifier:
Random Forest is an ensemble algorithm that trains many decision trees in parallel and combines their outputs. It is a supervised algorithm and can be used for both classification and regression problems. The prediction for new data is obtained by majority voting (classification) or by averaging the trees' outputs (regression). Because the algorithm uses the bagging technique, each tree is trained on a bootstrap sample of the training data. A key difference between decision trees and random forests is that a single decision tree considers all the available features when choosing a split, whereas each tree in a random forest considers only a random subset of them. Random forests work well with large, high-dimensional datasets.
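The bagging and voting idea can be sketched in a few lines of plain Python. This is only a toy illustration, not the project's code: the trees are one-split "stumps", the eight data rows (a glucose-like and an age-like value) are made up, and the names fit_stump, fit_forest and predict are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree: choose the (feature, threshold) with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right of the threshold, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]       # bagging: sample with replacement
        feats = rng.sample(all_features, n_features)      # random subset of the features
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Made-up toy data: (glucose-like value, age-like value) -> 0 or 1.
rows = [(85, 25), (90, 31), (95, 22), (100, 28), (150, 50), (160, 45), (170, 60), (180, 41)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict(forest, (80, 20)), predict(forest, (190, 65)))
```

A real random forest grows full trees rather than stumps and picks a new feature subset at every split; the sketch keeps one subset per tree only to stay short.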
Decision Tree Classifier:
A decision tree is a tool for making decisions in which the decision process is laid out in a tree-like structure. It is a supervised machine learning algorithm that predicts an outcome by testing the attributes of the input one at a time. The flow goes from the root node, through the internal decision nodes, to a leaf node, which holds the final prediction.
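To make the root-to-leaf flow concrete, here is a minimal sketch in plain Python. The tree is written by hand: the feature names, thresholds and labels are hypothetical values chosen for illustration, not something learned from the dataset, and classify is an assumed helper name.

```python
# Hypothetical hand-written tree: the features and thresholds are made up
# for illustration and are not learned from the project's dataset.
TREE = {
    "feature": "glucose",
    "threshold": 125,
    "left": {"leaf": "not diabetic"},
    "right": {
        "feature": "age",
        "threshold": 30,
        "left": {"leaf": "not diabetic"},
        "right": {"leaf": "diabetic"},
    },
}


def classify(node, sample):
    """Walk from the root to a leaf and return (prediction, path taken)."""
    path = []
    while "leaf" not in node:
        value = sample[node["feature"]]
        go_left = value <= node["threshold"]
        op = "<=" if go_left else ">"
        path.append(f"{node['feature']} {op} {node['threshold']}")
        node = node["left"] if go_left else node["right"]
    return node["leaf"], path


prediction, path = classify(TREE, {"glucose": 150, "age": 45})
print(prediction, path)
```

In a trained tree the features and thresholds at each node are chosen by the learning algorithm; only the traversal shown here stays the same.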
Logistic Regression:
Logistic Regression is a supervised algorithm mostly used for binary classification problems. It is sensitive to how the data is prepared: features should be normalized or scaled, categorical columns should be converted to numerical values, and the data should be cleaned. The model computes a logit score, which is then converted into a probability that lets us predict the likelihood of an event occurring. The characteristic S-shaped curve comes from the sigmoid function, which maps any real number into the range between 0 and 1.
$$\mathrm{Sigmoid}(z) = Y = \frac{1}{1 + e^{-z}}$$
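As a small sketch, the sigmoid can be written in plain Python as below. The weights, bias and feature values in the example are hypothetical numbers picked only to show how a logit score becomes a probability; they are not fitted to the dataset.

```python
import math


def sigmoid(z):
    """Map any real number z into (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(features, weights, bias):
    """Logit score z = bias + sum(w * x), squashed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights and bias, for illustration only.
p = predict_probability(features=[150, 40], weights=[0.03, 0.02], bias=-5.0)
print(round(p, 3))
```

A probability above 0.5 would then be read as the positive class, although the threshold itself is a choice and can be moved.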
Gradient Boosting Classifier:
Gradient Boosting is a boosting algorithm built from an ensemble of classification and regression trees. In gradient boosting, each predictor tries to correct the errors made by its predecessors. Unlike AdaBoost, it does not reweight the training instances; instead, each new tree is trained on the residual errors left by the previous trees. All the learners (trees) contribute with the same weight, set by the learning rate, which is small in magnitude.
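The "fit the next tree to the residuals" loop can be sketched as follows. This is a simplified illustration under stated assumptions: it uses squared error on a single made-up feature and one-split regression stumps, whereas a real gradient boosting classifier fits the gradients of a classification loss. The function names are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """One-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        # Every tree is scaled by the same small learning rate.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return (base, learning_rate, trees), history


def boosted_predict(model, x):
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree(x) for tree in trees)


# Made-up toy data: one feature, target 0 or 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model, history = fit_gradient_boosting(xs, ys)
print(history[0], history[-1])
```

The training error in history shrinks from round to round, which is the effect of each tree correcting what the previous ones left over.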
How to Execute?
Make sure you have checked the "Add to PATH" tick boxes while installing Python and Anaconda.
If you are just starting and want to know how to install Anaconda, refer to this link.
If you already have Anaconda and want to see how to create an Anaconda environment, refer to this article on setting up Jupyter Notebook. You can skip the article if you already know how to install Anaconda, set up an environment, and install the packages listed in requirements.txt.
1. Install the prerequisites and software required to execute the code by following the blog linked above.
2. Press the Windows key and type "anaconda prompt"; a terminal opens up.
3. Before executing the code, create a dedicated environment in which to install the libraries required for the project.
4. Type conda create --name env_name, e.g.: conda create --name project_1
5. Type conda activate env_name, e.g.: conda activate project_1
6. Go to the directory where your requirements.txt file is present: cd <>.
7. E.g., if the file is on the D drive: cd d:\License-Plate-Recognition-main (CHANGE THE PATH AS PER YOUR PROJECT, THIS IS JUST AN EXAMPLE)
8. If your project is on the C drive, you can skip the previous step and go to it directly, e.g.: cd C:\Users\Hi\License-Plate-Recognition-main (CHANGE THE PATH AS PER YOUR PROJECT, THIS IS JUST AN EXAMPLE)
9. Run pip install -r requirements.txt or conda install --file requirements.txt (requirements.txt is a text file listing all the libraries required to execute this Python project. If you get an error while installing the libraries, you might need to install them individually.)
10. To run a .py file, make sure you are in the Anaconda terminal and that the current directory is the folder where the file is saved. Before running, open main.py and change the path of the dataset. Then type python main.py in the terminal.
11. To run an .ipynb file, follow the link to set up and open Jupyter Notebook. You will be redirected to the local server, where you can select whichever .ipynb file you'd like to run, open it, and execute each cell one by one by pressing Shift+Enter.
Please follow the above links on how to install and set up an Anaconda environment to execute the files.
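Once the environment is set up, a quick way to confirm that the libraries were installed is a small check like the one below. It is only a sketch: the package list is an assumption (the actual list lives in requirements.txt), and missing_packages is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed package list for illustration; check requirements.txt for the real one.
REQUIRED = ["pandas", "numpy", "sklearn"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install them with pip:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Run it from the activated environment so that it checks the same Python installation that will run the project.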
The dataset is collected from the Kaggle repository and contains 769 instances with 9 features. Some of the features that are highly correlated with the target class are pregnancies, glucose level, blood pressure, insulin, and age. The goal of this study is to predict whether or not a patient is affected by diabetes, with the prediction ranging from no presence to likely presence.
1. Logistic Regression
2. Decision Tree
3. Random Forest Classifier
4. Support Vector Machine
Exploratory Data Analysis
Evaluating the model is one of the most important steps in any machine learning or deep learning project, because it shows how well the model performs on new, unseen data. Many evaluation metrics are available, such as ROC AUC, F1 score, recall, and precision, and each suits a particular kind of problem. For this project we use the confusion matrix and the classification report, which let us evaluate not just the accuracy of the model but also its precision, recall, and F1 score.
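To show what these numbers mean, here is a minimal sketch that computes them by hand in plain Python. The label lists are made up for illustration and are not results from this project; confusion_counts and report are hypothetical helper names, not library functions.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives and true positives."""
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tn, fp, fn, tp


def report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))
print(report(y_true, y_pred))
```

For a medical screening task like this one, recall on the positive class is often the number to watch, since a false negative means a missed case.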
Issues you may face while executing the code
- You might face an issue while installing specific libraries; in this case, you might need to install the libraries manually. Example: pip install module_name, e.g., pip install pandas
- Make sure you have the latest version of Python, or the specific version the project needs, since a version mismatch can cause errors.
- Add the Python and Anaconda paths to your environment variables so that Python files and the Anaconda environment can be run from any code editor.
- Make sure to change the paths in the code to match where your dataset/model is saved.
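For the last point, one possible way to catch a wrong path early is to check it before loading anything. This is a sketch under assumptions: the file name diabetes.csv and the helper resolve_dataset are hypothetical and do not come from the project's code.

```python
from pathlib import Path


def resolve_dataset(path_str):
    """Return the absolute dataset path, or fail early with a readable message."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}; update the path in the code.")
    return path.resolve()


if __name__ == "__main__":
    try:
        # Hypothetical file name; replace it with the location of your dataset.
        print(resolve_dataset("diabetes.csv"))
    except FileNotFoundError as err:
        print(err)
```

Using pathlib also avoids problems with backslashes in Windows paths, since the same code works with either style of separator.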
Refer to the links below for more details on installing Python and Anaconda and on how to configure them.
All the required data has been provided over here. Please feel free to contact me for model weights and if you face any issues.
Yes, you now have more knowledge than yesterday. Keep going.