Owner: Jacob McEwen Contact: [email protected]
This project is a machine learning classifier that predicts whether a patient is non-diabetic or is prediabetic/diabetic.
Dataset: https://www.kaggle.com/datasets/julnazz/diabetes-health-indicators-dataset/data
The dataset includes 21 features and ~236,000 entries. The project applies standard techniques, starting with exploratory data analysis (EDA) before any models are built. Two classification algorithms are used: Random Forest and Logistic Regression.
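As a rough illustration of the two-model setup described above, here is a minimal sketch using scikit-learn. It is not the notebook's actual code. The data is a synthetic stand-in generated with `make_classification` (21 features, binary target), and the class imbalance, split ratio, scaling step, and model settings are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data (assumption): 21 features, binary target,
# with an imbalanced positive class. Replace with the real DataFrame when available.
X, y = make_classification(
    n_samples=2000,
    n_features=21,
    n_informative=8,
    weights=[0.85, 0.15],
    random_state=42,
)

# Hold out 20% for testing; stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Logistic Regression is wrapped in a scaling pipeline (an assumption, it often
# helps convergence); Random Forest does not need feature scaling.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Because the classes are imbalanced, accuracy alone can look better than the model really is. Checking precision and recall for the positive class alongside accuracy is a reasonable addition.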
The project uses the following language and libraries:
- Python
- NumPy
- Pandas
- Matplotlib
- Scikit-learn
The following tools are used for the environment:
- Anaconda
- Jupyter Notebook
To run the notebook:
- Clone the repo, or download and unzip it
- Launch Anaconda Prompt
- cd into the directory of the .ipynb file
- Activate the conda environment
- Launch Jupyter Notebook
Both models reach an accuracy of ~86%. Hyperparameter tuning may improve this further, and the project welcomes it for any part of the pipeline. Please feel free to experiment with the models and see what improvements or modifications you can make!
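One possible starting point for the hyperparameter tuning mentioned above is a small grid search with scikit-learn's `GridSearchCV`. This is a sketch under stated assumptions, not part of the original notebook. The data is again a synthetic stand-in, and the parameter grid, fold count, and scoring metric are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real dataset (assumption): 21 features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=21, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid (assumption): 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validation on the training split only; the test split stays
# untouched until the final evaluation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")

# GridSearchCV refits the best setting on the full training split by default,
# so score() evaluates that refitted model.
test_accuracy = search.score(X_test, y_test)
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

On the full ~236,000-row dataset an exhaustive grid can be slow. `RandomizedSearchCV` from the same module samples a fixed number of settings instead and may be a more practical choice.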