Hobot Documentation

Introduction:

When we talk about the deadliest diseases, it is hard not to think of cardiovascular diseases (CVDs), which were and still are the number one cause of death globally.

According to Policy Advice, CVDs take around 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs. Fortunately, most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol through population-wide strategies.

People with cardiovascular disease or at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management. In this project, we apply our statistical, machine learning, and programming knowledge to tackle this problem, and let users interact with the machine learning model through a web application. We also hope to educate users by visualizing insights from the data we used.

Objective:

To implement a machine learning model that detects early signs of heart failure, with high accuracy, in people who have previously had heart failure and left ventricular systolic dysfunction.

To visualize insights from the data we used, so that users can learn from it.

To make statistical inferences on the sample data.

To let users interact with everything we created through our web application.

Question:

To create a product that satisfies all of our objectives, we need to ask the right questions.

For the machine learning objective (predictive task), we chose one question:

Which machine learning algorithm achieves the best performance based on F1 score?

For the visualization objective, we chose four questions:

How are the characteristics of the patients distributed? (Univariate analysis)
Is there any relationship between features? (Bivariate analysis)
Is any feature highly correlated with heart failure (death event)?
What are the mean and standard deviation of each laboratory result for each group of a qualitative feature?

For the statistical inference objective, we chose one question:

Is there a significant difference in the proportion of one qualitative feature between the levels of another qualitative feature?

The answers to these questions take the form of a product (a web application) that users can explore interactively.

Data:

Machine learning, visualization, and statistical inference all depend on the data. In this case, we need data on patients who have had heart failure and left ventricular systolic dysfunction, so that our machine learning model has something to learn from.

The dataset was originally published with a research paper. Thankfully, someone uploaded it to Kaggle, which makes the data easier to obtain.

The data of patients that we used was collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan), during April–December 2015 [52, 66]. The patients consisted of 105 women and 194 men, and their ages range between 40 and 95 years old. All 299 patients had left ventricular systolic dysfunction and had previous heart failures that put them in classes III or IV of New York Heart Association (NYHA) classification of the stages of heart failure.

The following table shows the features of our dataset.

Feature	Explanation	Measurement
Age	Age of the patient	Years
Anaemia	Decrease of red blood cells or hemoglobin	Boolean
High blood pressure	If a patient has hypertension	Boolean
Creatinine phosphokinase	Level of the CPK enzyme in the blood	mcg/L
Diabetes	If the patient has diabetes	Boolean
Ejection fraction	Percentage of blood leaving the heart at each contraction	Percentage
Sex	Woman or man	Binary
Platelets	Platelets in the blood	kilo platelets/mL
Serum creatinine	Level of creatinine in the blood	mg/dL
Serum sodium	Level of sodium in the blood	mEq/L
Smoking	If the patient smokes	Boolean
Time	Follow-up period	Days
(target) death event	If the patient died during the follow-up period	Boolean

Overall, there are 6 categorical features (including the target) and 7 numerical features.

Web Application:

(You can visit it yourself at this link: https://hobotai.herokuapp.com/)
To answer our 6 questions, we launched a web application called Hobot. Hobot is an A.I. cardiologist in disguise. It was trained on our dataset and uses a Random Forest algorithm to predict the likelihood of having heart failure again.

Furthermore, it provides a classroom for users who want to learn insights from this dataset.

In short, it provides a data analytics report of the dataset, a dashboard of insights from the data, and a heart failure prediction. We look at each section of the web application below.

However, when building a medical application, we need to include a medical disclaimer so that users do not treat this web application as a substitute for a doctor. We do not guarantee the accuracy of the predictions.

Figure A shows the medical disclaimer in our web application, and Figure B shows the reminder displayed before the user runs a prediction. The user also has to tick the agreement box before predicting.

Figure A

Figure B

Perhaps you are wondering which programming languages this web application relies on.

Most of the code is Python, since it lets us launch the web application while also handling data frame manipulation, machine learning, interactive dashboard plots, statistical inference, and more.

The rest are the languages used to build the components of the web page (HTML, CSS).

Figure C shows the percentage of each language used. Figure D shows the files we run for this web application (not including the model training part, which lives solely in a Jupyter notebook).

Figure C

Figure D

The random forest model is stored in model.pkl, as Figure E shows.

The folders shown in Figure D are uploaded here.

Figure D does not show the file used to train the machine learning model (Random Forest). The training process, which produces the model.pkl file (as shown in Figure E), is kept in a separate file here.

Prediction:

This report does not aim to describe the training process thoroughly, but we give a brief overview of our model here.

In brief, our model contains two main components: a StandardScaler and a random forest with 89 trees, as visualized in Figure F. We chose the Random Forest Classifier because its performance (F1 score) was better than that of Logistic Regression, Support Vector Classifier, and K-nearest neighbors, and adding StandardScaler improved performance compared with no standardization.

Figure F
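The two components described above could be sketched with scikit-learn roughly as follows. This is a minimal sketch, not the project's actual training code: the synthetic data from make_classification, the train/test split, and the random_state values are assumptions made for illustration. Only the StandardScaler step and the 89 trees come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 299-patient, 12-feature dataset (assumption).
X, y = make_classification(n_samples=299, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1 standardizes each feature; step 2 is a random forest with 89 trees.
model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("forest", RandomForestClassifier(n_estimators=89, random_state=0)),
    ]
)
model.fit(X_train, y_train)

# F1 on the held-out part of the synthetic data (not the report's figures).
test_f1 = f1_score(y_test, model.predict(X_test))
print(f"F1 on the synthetic test split: {test_f1:.2f}")
```

Wrapping both steps in one Pipeline means the scaler is fitted on the training data only and then reused unchanged at prediction time.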

The StandardScaler standardizes our data, and the standardized data is then fed to the random forest, which has 89 trees. The trees do not always return the same output, so the random forest aggregates them by majority voting. Suppose 45 trees predict that a patient will have heart failure and the remaining 44 trees predict that the patient will not; the random forest then predicts heart failure. What makes each tree different is the data it was trained on: every tree receives a bootstrapped replica of the original dataset. Bootstrapping is a resampling technique that samples with replacement. You may wonder why the forest has 89 trees rather than, say, 100.
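Bootstrapping and majority voting can be illustrated with a few lines of plain Python. This is a toy sketch under stated assumptions: the helper names bootstrap_sample and majority_vote, the tiny list of row indices, and the vote counts are all hypothetical and are not taken from the project code.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replica)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # hypothetical row indices of a tiny dataset
replica = bootstrap_sample(rows, rng)  # some rows repeat, some are left out

# Hypothetical votes from 89 trees: 45 say 1 (heart failure), 44 say 0.
votes = [1] * 45 + [0] * 44
prediction = majority_vote(votes)
print(replica, prediction)
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities, which usually, but not always, agrees with a hard majority vote.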

We used hyperparameter tuning, as shown in Figure H, and found that n_estimators = 89 achieves the highest F1 score (a metric that combines precision and recall). The model achieved an F1 score of 89% on the training set and 75% on the test set. Figure G shows the confusion matrix of our model on the test set.

Figure G

Figure H
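A search over the number of trees, scored by F1, might look like the sketch below. It is an illustration under assumptions rather than the tuning code used for Figure H: the synthetic data, the candidate grid, and the 5-fold cross-validation are all hypothetical choices. F1 is the harmonic mean of precision P and recall R, that is 2PR / (P + R).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient data (assumption).
X, y = make_classification(n_samples=299, n_features=12, random_state=0)

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("forest", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical candidate forest sizes; "forest__" targets the pipeline step.
param_grid = {"forest__n_estimators": [25, 50, 89, 100]}

# Each candidate is scored by cross-validated F1 on the positive class.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_n = search.best_params_["forest__n_estimators"]
print(best_n, round(search.best_score_, 3))
```

On real data, the winning value depends on the grid, the folds, and the random seed, so a result such as 89 should be read as the best of the candidates tried rather than a universal optimum.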

Figure I shows three of the trees in our random forest: trees 2, 51, and 89, respectively.

Figure I

Once the model was trained, we deployed it to the prediction zone of our web application.

The GIF below shows how the user can interact with the prediction zone of our web application.

Our model returns not only the classification but also the predicted probability.
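With scikit-learn, a class label and its probability can be obtained from the same fitted model. The snippet below is a hedged sketch: the model is refitted on synthetic data instead of being loaded from model.pkl, and the "patient" is just the first synthetic row.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient data (assumption).
X, y = make_classification(n_samples=299, n_features=12, random_state=0)
model = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=89, random_state=0)
)
model.fit(X, y)

patient = X[:1]  # one hypothetical patient, shape (1, 12)
label = int(model.predict(patient)[0])

# Columns of predict_proba follow model.classes_, here [0, 1].
probabilities = model.predict_proba(patient)[0]
print(label, probabilities)
```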

Overview of Data:

Perhaps the user wants to see what data Hobot was fed, so we built an Insight section to let them explore it by themselves. The section is shown in Figure J.

Figure J

You may wonder what the left sidebar does. It lets users filter patients by their characteristics. The GIF below shows how the dataset changes when the user adjusts the filters.

Our web application also lets people download the filtered dataset as a .csv or Excel file, so they can analyze it themselves.
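Filtering and exporting of this kind can be sketched with pandas. The rows below are made up, the helper filter_patients is hypothetical, and the column names are assumed to follow the Kaggle file, so treat this as an illustration rather than the app's code.

```python
import pandas as pd

# Hypothetical rows; column names assumed to match the dataset.
patients = pd.DataFrame(
    {
        "age": [45, 60, 72, 85],
        "sex": [1, 0, 1, 1],
        "time": [30, 120, 180, 250],
        "DEATH_EVENT": [1, 0, 0, 1],
    }
)


def filter_patients(df, age_range, time_range):
    """Keep rows whose age and follow-up time fall inside the inclusive ranges."""
    mask = df["age"].between(*age_range) & df["time"].between(*time_range)
    return df[mask]


filtered = filter_patients(patients, age_range=(40, 75), time_range=(100, 200))

# CSV text that a download button could serve to the user.
csv_text = filtered.to_csv(index=False)
print(csv_text)
```

In a Streamlit app, a string like csv_text could be handed to st.download_button; that wiring is an assumption about the app, not something stated here. An Excel export would use DataFrame.to_excel, which needs an Excel engine such as openpyxl installed.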

Sometimes the user does not want to see the whole dataset but rather the analytics, so we let them generate a report for the filtered dataset. The GIF below shows the report for the entire dataset (no filter).

This report covers univariate analysis, bivariate analysis in the form of scatter plots, and the correlation between features.

Since there are many variables to cover, we analyze only the Age and Diabetes variables here.

When you toggle the details as in the GIF below, you can see that the age distribution has no clear shape and is not monotonic. The minimum age in this dataset is 40 and the maximum is 95. This also tells the user the limits of the patient data: if the user's age is outside the range [40, 95], the model might not return an accurate result because it would be extrapolating.
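A simple guard against extrapolation could compare the user's input with the range seen in training. The function name is_extrapolating is hypothetical and this check is only a sketch of the idea; whether the app performs such a check is not stated.

```python
TRAINING_AGE_RANGE = (40, 95)  # minimum and maximum age in the dataset


def is_extrapolating(age, age_range=TRAINING_AGE_RANGE):
    """Return True when the age lies outside the range the model was trained on."""
    low, high = age_range
    return not (low <= age <= high)


for age in (30, 40, 95, 96):
    print(age, is_extrapolating(age))
```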

The most common patient age is 60 (35 patients).

Figure K shows the summary of quantile statistics and descriptive statistics.

Figure K

Next, we move on to a categorical feature: Diabetes.

41.8% of the patients are diabetic, and 58.2% are not.

Bivariate analysis only concerns relationships between numerical features. In this case, the report covers 6 numerical features, so there are 6 × 6 = 36 possible interaction plots. Since that is too many, we pick just one: serum creatinine against ejection fraction.

Figure L

According to Figure L, although there is no clear linear relationship, patients with a high ejection fraction tend to have high serum creatinine as well.

Now, our goal is to check which patient characteristic is most strongly associated with heart failure.

Figure M : Pearson Coefficient

Based on the Pearson coefficient, time and DEATH_EVENT are strongly negatively correlated. One interpretation is that patients with a short follow-up period (those who saw the doctor recently) are more likely to have had heart failure again. This makes sense: if a patient has gone a long time without heart failure, they appear to be staying healthy.
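For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up values; the helper pearson_r and the sample numbers are hypothetical and only mimic the pattern of short follow-up times going with death events.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical follow-up times (days) and death indicators.
follow_up_days = [10, 20, 30, 150, 200, 250]
death_event = [1, 1, 1, 0, 0, 0]

r = pearson_r(follow_up_days, death_event)
print(round(r, 3))  # negative: shorter follow-up goes with death events
```

With a pandas DataFrame, the full correlation matrix is available through DataFrame.corr(method="pearson").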

Figure M : PhiK Coefficient

According to the Phi_K documentation, Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.

The data might not have a strong linear relationship but rather a non-linear one. PhiK is one of the correlation metrics we use to capture non-linear relationships.

According to Figure M, sex and smoking are highly correlated. To interpret this, we need to look at the grouped bar plot in the Inference section.

Statistical Inference:

Figure N

According to Figure N, female patients are extremely unlikely to smoke.

However, we need to know whether there is a significant difference in the proportion of smokers between the sexes. This leads us to the next section (Inference). The GIF below shows the section.

The GIF above shows the two-proportion inference with alpha level 0.025.

Hence, we obtain the following statistical test.

We can conclude, at the 2.5% significance level, that the proportion of smokers differs significantly between the sexes. In other words, if the two proportions were actually equal, there would be at most a 2.5% chance of rejecting that hypothesis by mistake.
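A two-proportion z-test of this kind can be written with the standard library alone. The sketch below assumes a two-sided test with a pooled standard error; the function name two_proportion_z_test and the smoker counts are hypothetical and are not the figures from the app.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: p_a == p_b using the pooled proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: smokers among 194 men and 105 women.
z, p_value = two_proportion_z_test(90, 194, 5, 105)

ALPHA = 0.025  # significance level used in the text
reject_null = p_value < ALPHA
print(round(z, 2), reject_null)
```

Rejecting the null hypothesis here only says the two proportions differ; the sign of z shows which group has the larger proportion.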

Another interesting question for this kind of statistical inference is whether there is a significant difference in the proportion of each sex between diabetic and non-diabetic patients at the 2.5% significance level. We run the test again, as shown in the GIF below.

We then obtain the following statistical test and conclude that female patients are more likely to be diabetic than male patients. (However, this does not necessarily hold in general, since patients who have heart failure and left ventricular systolic dysfunction do not represent the whole population.)

Interesting Insight:

Sometimes, all we really want is the mean and standard deviation of a lab result. Have you ever wondered what the mean and standard deviation of creatine phosphokinase are for patients who have heart failure again and those who do not? The GIF below answers that.

Figure O
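The per-group mean and standard deviation can be computed with a pandas groupby. The values below are made up for illustration, and the column name creatinine_phosphokinase is assumed to follow the Kaggle file.

```python
import pandas as pd

# Hypothetical CPK values (mcg/L) with the death-event label for each patient.
patients = pd.DataFrame(
    {
        "creatinine_phosphokinase": [200, 400, 900, 1100, 300, 500],
        "DEATH_EVENT": [0, 0, 1, 1, 0, 1],
    }
)

# One row per DEATH_EVENT group, with the mean and sample standard deviation.
summary = patients.groupby("DEATH_EVENT")["creatinine_phosphokinase"].agg(
    ["mean", "std"]
)
print(summary)
```

Applying the same groupby to a filtered DataFrame (for example a restricted age and follow-up range) gives the statistics for that subgroup only.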

Perhaps you do not want insights across every age or follow-up period. You can narrow down the patients as in the following GIF (for example, age: [40, 60], time: [100, 200]).

Figure P : Filtered patient

If you look carefully, patients in this age and time range who have heart failure again tend to have higher creatine phosphokinase (CPK). (Patients who have heart failure again with age: [40, 60] and time: [100, 200] have a mean CPK of around 1000, while with no filter the mean is around 500.)

In this filtered group, you will not see relatively young patients who keep their CPK at a good level and still have heart failure again.

Conclusion:

Creating a web application helps users explore the data and predict the risk of having heart failure again. This could help them detect early signs by themselves, although the prediction is not always correct.

We tried to maximize the performance (F1 score) of the models, and the Random Forest algorithm turned out to score best. The web application's generated report lets users see how patient characteristics are distributed, along with summary statistics. Although there is no clear relationship between features, Pearson's coefficient shows that time and death event are highly correlated, while PhiK (which also captures non-linear relationships) shows that sex and smoking are highly correlated. The two-proportion test indicates that male patients are statistically more likely to smoke than female patients. We also looked at the mean and standard deviation of given features: patients who are relatively young and have a fairly long follow-up period but still have heart failure again are more likely to have a high CPK level, for both males and females.

Appendix:

The source code is available on GitHub.

The whole web application : https://github.com/saranpan/heart-failure-detector

The model training : https://github.com/saranpan/heart-failure-detector/tree/pipeline-train

👉Chat now with Hobot Cardiologist :

At the end of the month, you may no longer be able to access the website because our Heroku dyno will expire. However, you can still use the chat bot offline by downloading the zip file and running the command "streamlit run app.py" in your terminal. This will launch the web application for you to use.