Predicting Heart Disease Based on Personal Key Indicators

The presentation of this project can be accessed [here]

Interactive Dashboard displaying Exploratory Analysis [Dashboard]

Significance

Several health conditions, your lifestyle, your age, and your family history can increase your risk for heart disease. These are called risk factors. About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Some risk factors for heart disease cannot be controlled, such as your age or family history. But you can take steps to lower your risk by changing the factors you can control. [Source: Centers for Disease Control and Prevention]

Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow the application of machine learning methods to detect "patterns" from the data that can predict a patient's condition.

Questions

Does sex/gender have an association with heart disease?
Is there an association between age & heart disease?
Are BMI, smoking, alcohol drinking, and prior stroke associated with heart disease?

Technologies Used

Data Cleaning & Analysis:

The Pandas library will be used to clean the data and perform an exploratory analysis. Further analysis will be completed using Python.

Database Storage:

Our database will be hosted in PostgreSQL, and will be populated from Jupyter Notebook using SQLite.

Machine Learning

The scikit-learn machine learning library will be used to create a classifier. The data is split 70/30 into training and testing sets.

Dashboard

A functioning and interactive dashboard will be created using Tableau to present findings and results.

Machine Learning Model - Flow Chart

We plan to follow the steps of the following machine learning flow chart. This chart is subject to change and will be updated as we dive deeper into our analysis.

Communication Protocols

Team members will communicate using the #final-project Slack channel.
Team members will meet at least once per week.
Team members will plan their next meeting at the end of every meeting, being flexible and open to each other's schedules.
Team members will communicate with each other if they need internal deadline extensions or help with their part of the project.
Team members will distribute work evenly amongst each other and be responsible for their distrib

Update on the status of the project

We have divided the project into several steps that follow the way this type of analysis has to be performed:

1- Explore the dataset

Degree of progress: 90% / 100%

We have explored and analyzed the dataset to understand the information it contains and its quality. A summary of the findings has been built in Tableau for better visualization:

Dashboard Stories: [Heart Disease Stories]

General Overview of the Dataset

Overview of the non-numeric Variables

2- Transform values and variables

We have analyzed the dataset and concluded that it is quite clean and complete. There are no null values, and the data contained in the variables is consistent. The only variable that needed cleaning is Sleeptime, where we found some values that could be considered mistakes, as shown in the graph below:

Considering that some respondents report a sleep time higher than what we believe is normal for a human being, we decided to exclude the outliers by using the quantile method.

The results obtained after applying the method make more sense (sleeping hours between 3 and 11 hours). By applying this method we reduced the dataset by 4,523 rows, which represents only 1.42% of the original dataset.
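The text above does not say which quantile rule was applied. As a minimal sketch, assuming the common 1.5 x IQR rule and a column named `SleepTime` (both are assumptions, as are the function name and the toy values), the filtering could look like this:

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # `between` is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Toy data for illustration only, not the project's dataset.
toy = pd.DataFrame({"SleepTime": [6, 7, 8, 7, 6, 8, 7, 24, 1, 7]})
cleaned = drop_iqr_outliers(toy, "SleepTime")
print(len(toy) - len(cleaned), "rows removed")
```

The multiplier `k` controls how aggressive the filter is; a larger value keeps more of the extreme responses.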

We have also used a correlation matrix to analyze which are the most correlated variables to predict a heart disease condition:
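A minimal sketch of how such a correlation ranking can be produced with Pandas is shown here. The column names and values are hypothetical stand-ins, and yes/no answers are assumed to be already encoded as 0/1:

```python
import pandas as pd

# Toy data with hypothetical column names, not the project's dataset.
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 1, 1, 0, 1, 0, 1],
    "Stroke":       [0, 0, 1, 0, 0, 1, 0, 1],
    "BMI":          [22.0, 25.5, 31.2, 29.8, 24.1, 33.0, 27.3, 30.5],
})

corr = toy.corr()  # Pearson correlation by default

# Rank the features by absolute correlation with the target,
# excluding the target's correlation with itself.
ranking = (
    corr["HeartDisease"]
    .drop("HeartDisease")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```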

Finally, we detected that the dataset is not balanced, which matters for the machine learning step. Some class imbalance is normal in classification problems, but in some cases the imbalance is acute, with the majority class far more frequent than the minority class.

Now that we have split our data into training and testing sets, we can scale the data using scikit-learn's StandardScaler. The standard scaler standardizes the data, which means that each feature is rescaled so that its mean is 0 and its standard deviation is 1.

We have decided to use the RandomOverSampler and RandomUnderSampler methods to adjust the balance of the dataset.
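The split and scaling steps can be sketched as follows. This is an illustration on synthetic data; the variable names, the random seeds, and the use of `stratify` are assumptions and not necessarily what the notebooks do:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and binary target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 70/30 split; stratify keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std
```

Fitting the scaler on the training set only avoids leaking information from the test set into the model.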

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
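To make the two ideas concrete, here is a small NumPy sketch of random oversampling and random undersampling as described above. It is not the code of the RandomOverSampler or RandomUnderSampler classes; the function names and toy data are hypothetical:

```python
import numpy as np


def random_oversample(X, y, rng):
    """Add randomly chosen rows (with replacement) until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


def random_undersample(X, y, rng):
    """Keep a random subset (without replacement) of each class, sized to the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[keep], y[keep]


# Toy imbalanced data: 90 negatives and 10 positives.
rng = np.random.default_rng(0)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y_over), np.bincount(y_under))
```

Resampling is normally applied to the training set only, so that the test set keeps the original class distribution.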

3- Load the data into the model

We have loaded the dataset with the adjustments described in the previous step. We are in the process of analyzing whether all the variables help to predict heart disease or whether we could obtain a better model with fewer variables. Modeling is an iterative process: you may need more data, more cleaning, another model parameter, or a different model. It's also important to have a goal that's been agreed upon, so that you know when the model is good enough.

4- Testing different Machine Learning Models

We have decided to test two machine learning models, Logistic Regression and Random Forest Classifier, and to evaluate each of them with a confusion matrix.
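As a rough sketch of this comparison, the snippet here fits both scikit-learn classifiers on synthetic data and stores a confusion matrix for each. The data, hyperparameters, and dictionary keys are assumptions for illustration and do not reproduce the project's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Rows are the true classes, columns are the predicted classes.
    matrices[name] = confusion_matrix(y_test, model.predict(X_test))
    print(name)
    print(matrices[name])
```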

Confusion Matrix:

Logistic Regression:

Random Forest Classifier

The best results have been achieved with the Random Forest Classifier, which reached the following performance values:

Accuracy 0.9211621

Precision 0.863853

Recall 0.9602351

In summary, this model is one of the best for predicting heart disease based on personal key indicators: its accuracy (0.92) is high, and its precision (0.86) and recall (0.96) are good enough to expect that it will predict heart disease well.
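For reference, accuracy, precision, and recall (and the F1-score that combines the last two) follow directly from the four cells of a binary confusion matrix. The sketch uses hypothetical counts, not the project's confusion matrix:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from binary confusion-matrix counts."""
    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)  # share of predicted positives that are correct
    recall = ratio(tp, tp + fn)     # share of actual positives that are found
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

In a screening context, recall is often the metric to watch, because a false negative means a person at risk is missed.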

5- Machine Learning Model Improvements

In the next steps we improved the model by excluding some variables that do not increase the quality of the prediction.

According to the classification report, the avg/total values are 0.88 for precision, 0.86 for recall, and 0.86 for F1-score, with a support of 52.6.
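This section does not state how the excluded variables were chosen. One possible approach, shown here purely as an assumption, is to rank the features by the importances of a fitted random forest and keep only the top ones. The data, feature names, and the choice of keeping three features are all hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort feature indices from most to least important and keep the top 3.
order = np.argsort(importances)[::-1]
keep = np.sort(order[:3])
X_reduced = X[:, keep]

for i in order:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```

After dropping features, the model should be refit and evaluated again on the test set to confirm that the prediction quality did not get worse.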

6- Conclusions and Communication

We performed a chi-square analysis to test whether there is an association between gender and heart disease. Our null hypothesis was that the distribution of heart disease is equal among the genders. The p-value of our chi-square analysis (reported as p=0.0) is below the 0.05 significance level, thus we reject our null hypothesis. There is a difference in the distribution of heart disease among the genders.
We performed a chi-square analysis to test whether there is an association between age category and heart disease. Our null hypothesis was that the distribution of heart disease is equal among the age groups. The p-value of our chi-square analysis (reported as p=0.0) is below the 0.05 significance level, thus we reject our null hypothesis. There is a difference in the distribution of heart disease among the different age groups.
We performed a multivariate logistic regression to test the association of BMI, smoking, alcohol drinking, and prior stroke with heart disease.
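A chi-square test of independence like the ones described above can be sketched with SciPy as follows. The contingency table holds hypothetical counts, not the project's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are two groups (for example sex),
# columns are heart disease status (no, yes).
observed = np.array([
    [9000, 600],
    [8000, 1000],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Reject the null hypothesis of equal distribution at the 0.05 level.
reject_null = p_value < 0.05
print(f"chi2={chi2:.2f}, p={p_value:.3g}, dof={dof}, reject={reject_null}")
```

With very large samples the p-value can be so small that it is displayed as 0.0 after rounding; printing it in scientific notation shows the actual magnitude.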

