Prediction of Heart Failure Using Machine Learning (with Project)

8 min read · Jun 11, 2023

In today’s fast-paced world, people often prioritize their daily responsibilities and neglect their health, leading to a rise in various illnesses. Among them, heart disease has emerged as a major concern, with approximately 31% of global deaths attributed to heart-related conditions, according to the World Health Organization (WHO). Therefore, it becomes crucial to predict the occurrence of heart disease and take proactive measures for prevention and timely intervention.

The medical sector and hospitals generate a vast amount of data, which can sometimes be challenging to analyze manually. However, leveraging the power of machine learning techniques for predictive analysis and data handling can significantly enhance efficiency for healthcare professionals. This study aims to explore heart disease and its risk factors while utilizing machine learning techniques to predict its occurrence. Furthermore, a comparative analysis of different machine learning algorithms will be conducted to evaluate their performance in heart disease prediction.

The primary objective of this research is to develop an accurate prediction model for heart disease using machine learning. By harnessing the capabilities of advanced algorithms, this study seeks to provide medical practitioners with a valuable tool for early detection and proactive management of heart disease. Comparative analysis of machine learning algorithms will offer insights into the strengths and weaknesses of each approach, enabling informed decision-making regarding the most effective prediction model.

By combining medical expertise with machine learning techniques, we aim to contribute to the field of healthcare by improving heart disease prediction and ultimately reducing the impact of this prevalent condition on individuals and society as a whole.

Objective:

The main objective of this project is to build a robust predictive model that can effectively identify individuals at high risk of heart failure. By analyzing historical data and leveraging machine learning algorithms, the model will be trained to predict the likelihood of heart failure based on input features such as age, gender, medical history, blood pressure, cholesterol levels, and lifestyle factors. The ultimate goal is to improve patient outcomes by enabling early intervention and personalized healthcare management.

Outline:

1. Introduction

Overview of Heart Failure and its Impact on Public Health
Importance of early detection and proactive management
Role of machine learning in predicting heart failure

2. Data Collection

Identify and gather a comprehensive dataset containing relevant features
Explore medical records, clinical databases, and publicly available datasets
Ensure data privacy and ethical considerations

3. Data Preprocessing

Handle missing values, duplicates, and inconsistencies in the dataset
Perform feature engineering to extract meaningful information
Normalize or scale numerical features
Encode categorical variables

4. Exploratory Data Analysis

Perform statistical analysis and visualization to gain insights into the data
Examine correlations between features and heart failure
Identify any patterns or trends in the data

5. Feature Selection

Select the most relevant features for heart failure prediction
Utilize techniques like correlation analysis, feature importance, and domain knowledge
Remove irrelevant or redundant features

6. Model Selection and Training

Choose suitable machine learning algorithms for prediction
Split the dataset into training and testing sets
Train the models using various algorithms (e.g., logistic regression, random forests, support vector machines)
Use appropriate evaluation metrics to assess model performance

7. Model Evaluation and Comparison

Evaluate trained models using metrics such as accuracy, precision, recall, and AUC-ROC curve
Compare the performance of different models to identify the most accurate and reliable one
Consider factors like interpretability, computational complexity, and scalability

8. Model Deployment

Deploy the selected model into a production environment
Develop an API or integrate the model into a web application or healthcare system
Ensure real-time predictions and handle input data securely

9. Performance Monitoring and Improvement

Continuously monitor the model’s performance in the production environment
Collect feedback from healthcare professionals and end-users
Update and retrain the model as new data becomes available
Fine-tune hyperparameters and consider ensemble techniques for further improvements

10. Conclusion

Recap project objectives and achievements
Highlight the significance of the predictive model for the early detection of heart failure
Discuss potential future enhancements and extensions of the project

Here are some of the most commonly used machine learning algorithms for predicting heart failure:

Logistic regression is a simple but effective algorithm that can be used to predict binary outcomes, such as whether or not a patient will develop heart failure.
Support vector machines (SVMs) are powerful algorithms that can be used to predict both binary and continuous outcomes.
Neural networks are more complex algorithms that can learn complex relationships between features and outcomes.
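
As a rough illustration of how these three families can be compared, the sketch below cross-validates each of them on a synthetic binary dataset. The data from `make_classification`, the hyperparameters, and the `candidates` dictionary are assumptions made for illustration; the scores it prints are not results for the heart failure data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a clinical table: 299 rows, 12 numeric features.
X, y = make_classification(n_samples=299, n_features=12, n_informative=5,
                           weights=[0.68, 0.32], random_state=0)

# Hypothetical candidate models with illustrative settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf"),
    "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```

Cross-validation is used here instead of a single split because a dataset of a few hundred rows gives noisy estimates from any one split.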

Introduction

The prediction of heart failure using machine learning techniques is a crucial application in healthcare. This project aims to develop a predictive model that can accurately predict the likelihood of an individual experiencing heart failure based on various risk factors and medical indicators. By leveraging machine learning algorithms and a dataset containing relevant features, this project aims to assist healthcare professionals in the early detection and proactive management of heart failure.

The heart is a vital organ that pumps blood to every part of the body. Its proper functioning is essential for the brain and other organs to operate effectively. When the heart stops functioning correctly, the person's life is at immediate risk, because the brain and other organs cease to work within minutes. Unfortunately, changes in lifestyle, work-related stress, and poor dietary habits have contributed to an alarming increase in heart-related diseases.

Heart diseases have emerged as one of the leading causes of death worldwide. According to the World Health Organization (WHO), these diseases claim the lives of 17.7 million people annually, accounting for 31% of all global deaths. In India, heart-related diseases have become the primary cause of mortality, with 1.7 million deaths recorded in 2016, as reported by the 2016 Global Burden of Disease Report.

The impact of heart diseases extends beyond loss of life. It significantly burdens healthcare systems and reduces individuals’ productivity. The WHO estimates that India has incurred losses of up to $237 billion between 2005 and 2015 due to cardiovascular diseases. Therefore, accurate and feasible prediction of heart-related diseases is of utmost importance.

Medical organizations worldwide collect vast amounts of data on various health-related issues, including heart diseases. These datasets hold immense potential for gaining valuable insights when analyzed using machine learning techniques. However, the sheer size and complexity of these datasets can be overwhelming for human minds to comprehend. This is where machine learning algorithms come into play, providing effective tools to explore and extract meaningful patterns and predictions from the data.

By leveraging machine learning techniques, medical professionals can accurately predict the presence or absence of heart-related diseases. These algorithms have proven to be invaluable in handling large, noisy datasets, enabling more accurate diagnoses and timely interventions. With the ability to process vast amounts of data, machine learning algorithms have become indispensable tools in the quest for better understanding and prediction of heart diseases.

In conclusion, the prediction of heart-related diseases is paramount in the field of healthcare. The use of machine learning techniques allows medical organizations to harness the power of data and gain valuable insights for improved diagnoses, prevention, and treatment. By combining medical expertise with advanced technology, we can strive to reduce the impact of heart disease, enhance patient outcomes, and promote overall well-being in communities worldwide.

References:

[1] World Health Organization (WHO)

[2] Global Burden of Disease Report (2016)

Prediction of the Occurrence of Heart Failure

This project aims to predict the occurrence of heart failure using multiple classification algorithms.

Data Import and Exploration

Dataset

We will use the Heart Failure Prediction dataset available on Kaggle (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). The dataset contains 299 records and 13 columns: 12 features, including age, sex, and various clinical indicators, plus the target. The target variable, DEATH_EVENT, is binary and records whether the patient died during the follow-up period.

# import what we need here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import os
import time


# the data source
# the corresponding file is available at https://www.kaggle.com/datasets/ineubytes/
# If you use google colab, PLEASE put the corresponding csv dataset into the root d
# The file will be deleted everytime in google colab!!! And you might use additiona
# If you use jupyter lab, make sure that you set the directory to the place where t
# os.getcwd()
# os.chdir('your directory goes here')
df = pd.read_csv('heart.csv')

# explore data
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2

Heart Failure Prediction

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv

In [2]:

data=pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

In [3]:

data.head()

Out[3]:

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT
0  75.0        0                       582         0                 20                    1  265000.00               1.9           130    1        0     4            1
1  55.0        0                      7861         0                 38                    0  263358.03               1.1           136    1        0     6            1
2  65.0        0                       146         0                 20                    0  162000.00               1.3           129    1        1     7            1
3  50.0        1                       111         0                 20                    0  210000.00               1.9           137    1        0     7            1
4  65.0        1                       160         1                 20                    0  327000.00               2.7           116    0        0     8            1

In [4]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB

In [5]:

data.describe()

Out[5]:

              age     anaemia  creatinine_phosphokinase    diabetes  ejection_fraction  high_blood_pressure      platelets  serum_creatinine  serum_sodium         sex    smoking        time  DEATH_EVENT
count  299.000000  299.000000                299.000000  299.000000         299.000000           299.000000     299.000000         299.00000    299.000000  299.000000  299.00000  299.000000    299.00000
mean    60.833893    0.431438                581.839465    0.418060          38.083612             0.351171  263358.029264           1.39388    136.625418    0.648829    0.32107  130.260870      0.32107
std     11.894809    0.496107                970.287881    0.494067          11.834841             0.478136   97804.236869           1.03451      4.412477    0.478136    0.46767   77.614208      0.46767
min     40.000000    0.000000                 23.000000    0.000000          14.000000             0.000000   25100.000000           0.50000    113.000000    0.000000    0.00000    4.000000      0.00000
25%     51.000000    0.000000                116.500000    0.000000          30.000000             0.000000  212500.000000           0.90000    134.000000    0.000000    0.00000   73.000000      0.00000
50%     60.000000    0.000000                250.000000    0.000000          38.000000             0.000000  262000.000000           1.10000    137.000000    1.000000    0.00000  115.000000      0.00000
75%     70.000000    1.000000                582.000000    1.000000          45.000000             1.000000  303500.000000           1.40000    140.000000    1.000000    1.00000  203.000000      1.00000
max     95.000000    1.000000               7861.000000    1.000000          80.000000             1.000000  850000.000000           9.40000    148.000000    1.000000    1.00000  285.000000      1.00000

In [6]:

data.columns

Out[6]:

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

In [7]:

# missing values
data.isnull().sum()

Out[7]:

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

In [8]:

data["DEATH_EVENT"].value_counts()

Out[8]:

0    203
1     96
Name: DEATH_EVENT, dtype: int64
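
These counts mean roughly two thirds of the patients fall in class 0, so accuracy alone can look respectable for a model that never flags anyone. The sketch below uses scikit-learn's metric functions on labels rebuilt only from the counts above, and scores a hypothetical baseline that always predicts 0. The baseline is an illustration, not a model from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels rebuilt from the printed counts: 203 zeros and 96 ones.
y_true = np.array([0] * 203 + [1] * 96)

# Hypothetical baseline: always predict the majority class (0).
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
baseline_recall = recall_score(y_true, y_pred, zero_division=0)
baseline_precision = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy:  {baseline_accuracy:.3f}")
print(f"recall:    {baseline_recall:.3f}")
print(f"precision: {baseline_precision:.3f}")
```

Any trained model should therefore be judged against an accuracy of about 0.68 and on its recall for class 1, not on accuracy alone.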

In [9]:

# unique value analysis
for i in list(data.columns):
    print("{}->{}".format(i, data[i].value_counts().shape[0]))

age->47
anaemia->2
creatinine_phosphokinase->208
diabetes->2
ejection_fraction->17
high_blood_pressure->2
platelets->176
serum_creatinine->40
serum_sodium->27
sex->2
smoking->2
time->148
DEATH_EVENT->2
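
The same counts can be obtained more directly with `DataFrame.nunique()`. A minimal sketch on a small illustrative frame (the frame `toy` is an assumption for the example and is not the full dataset):

```python
import pandas as pd

# Small illustrative frame standing in for the real data.
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "sex": [1, 1, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 0],
})

# Loop-based count, as in the cell above.
loop_counts = {col: toy[col].value_counts().shape[0] for col in toy.columns}

# Vectorised equivalent: one call for all columns.
nunique_counts = toy.nunique().to_dict()

print(loop_counts)
print(nunique_counts)
```

Both approaches ignore missing values by default, so they agree on any column.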

In [10]:

# categorical feature analysis
categorical_list = ["anaemia", "diabetes", "high_blood_pressure", "sex", "smoking", "DEATH_EVENT"]

In [11]:

import matplotlib.pyplot as plt
import seaborn as sns

data_categoric = data.loc[:, categorical_list]
fig, axs = plt.subplots(ncols=len(categorical_list), figsize=(20, 5))
for i, col in enumerate(categorical_list):
    sns.countplot(x=col, data=data_categoric, hue="DEATH_EVENT", ax=axs[i])
    axs[i].set_title(col)
plt.tight_layout()
plt.show()

In [12]:

# numeric feature analysis
numeric_list = ["age", "creatinine_phosphokinase",
                "ejection_fraction", "platelets",
                "serum_creatinine", "serum_sodium", "time", "DEATH_EVENT"]

In [13]:

data_numeric = data.loc[:, numeric_list]
sns.pairplot(data_numeric, hue="DEATH_EVENT", diag_kind="kde")
plt.show()

In [14]:

# standardization (all numeric columns except the target, which is last in numeric_list)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_array = scaler.fit_transform(data[numeric_list[:-1]])

In [15]:

pd.DataFrame(scaled_array).describe()

Out[15]:

                   0           1              2              3              4              5              6
count   2.990000e+02  299.000000   2.990000e+02   2.990000e+02   2.990000e+02   2.990000e+02   2.990000e+02
mean    5.703353e-16    0.000000  -3.267546e-17   7.723291e-17   1.425838e-16  -8.673849e-16  -1.901118e-16
std     1.001676e+00    1.001676   1.001676e+00   1.001676e+00   1.001676e+00   1.001676e+00   1.001676e+00
min    -1.754448e+00   -0.576918  -2.038387e+00  -2.440155e+00  -8.655094e-01  -5.363206e+00  -1.629502e+00
25%    -8.281242e-01   -0.480393  -6.841802e-01  -5.208700e-01  -4.782047e-01  -5.959961e-01  -7.389995e-01
50%    -7.022315e-02   -0.342574  -7.076750e-03  -1.390846e-02  -2.845524e-01   8.503384e-02  -1.969543e-01
75%     7.718891e-01    0.000166   5.853888e-01   4.111199e-01   5.926150e-03   7.660638e-01   9.387595e-01
max     2.877170e+00    7.514640   3.547716e+00   6.008180e+00   7.752020e+00   2.582144e+00   1.997038e+00
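
The `std` row shows 1.001676 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 299 rows the ratio between the two is sqrt(299/298), which is about 1.001676. The sketch below checks this on random data; the array `values` is a synthetic stand-in, not a column from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
values = rng.normal(loc=60.0, scale=12.0, size=(299, 1))  # synthetic column

scaled = StandardScaler().fit_transform(values)

# Manual z-score using the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0, ddof=0)

sample_std = pd.DataFrame(scaled)[0].std()  # pandas defaults to ddof=1
expected_ratio = np.sqrt(299 / 298)

print(np.allclose(scaled, manual))
print(round(float(sample_std), 6), round(float(expected_ratio), 6))
```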

In [16]:

# correlation analysis
f, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(data.corr(), annot=True, fmt=".1f", linewidths=.5, ax=ax)
plt.show()
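
To turn a correlation matrix into a ranked list for feature selection, one option is to sort the features by their absolute correlation with the target. This is a minimal sketch on synthetic data; the frame `demo` and the relationship between its columns are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 299

# Synthetic stand-in: 'signal' drives the target, 'noise' does not.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0.5).astype(int)

demo = pd.DataFrame({"signal": signal, "noise": noise, "DEATH_EVENT": target})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Correlation only captures linear, pairwise relationships, so a low value does not by itself justify dropping a feature.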

In [17]:

# encoding categorical columns (every categorical column except the target)
data1 = data.copy()
data1 = pd.get_dummies(data1, columns=categorical_list[:-1], drop_first=True)
data1.head()

Out[17]:

    age  creatinine_phosphokinase  ejection_fraction  platelets  serum_creatinine  serum_sodium  time  DEATH_EVENT  anaemia_1  diabetes_1  high_blood_pressure_1  sex_1  smoking_1
0  75.0                       582                 20  265000.00               1.9           130     4            1          0           0                      1      1          0
1  55.0                      7861                 38  263358.03               1.1           136     6            1          0           0                      0      1          0
2  65.0                       146                 20  162000.00               1.3           129     7            1          0           0                      0      1          1
3  50.0                       111                 20  210000.00               1.9           137     7            1          1           0                      0      1          0
4  65.0                       160                 20  327000.00               2.7           116     8            1          1           1                      0      0          0
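
Because these columns are already coded 0/1, `get_dummies` with `drop_first=True` yields one indicator per column that matches the original values and only adds a `_1` suffix to the name. A small check on made-up rows (the frame `toy` is an assumption for the example):

```python
import pandas as pd

toy = pd.DataFrame({"sex": [1, 1, 0, 1, 0], "smoking": [0, 1, 0, 0, 1]})

encoded = pd.get_dummies(toy, columns=["sex", "smoking"], drop_first=True)

# Recent pandas versions return boolean indicators, so cast before comparing.
same_sex = bool((encoded["sex_1"].astype(int) == toy["sex"]).all())
same_smoking = bool((encoded["smoking_1"].astype(int) == toy["smoking"]).all())

print(list(encoded.columns))
print(same_sex, same_smoking)
```

For binary columns the encoding step is therefore optional; it matters once a categorical column has three or more levels.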

In [18]:

x_data = data1.drop(["DEATH_EVENT"], axis=1)
y = data1.DEATH_EVENT.values
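
With the features and target separated, the next steps in the outline are a train/test split, scaling, and model fitting. The sketch below shows one way to do this, assuming a stratified 80/20 split and a logistic regression. `X_demo` and `y_demo` are synthetic stand-ins for `x_data` and `y`, and none of the settings are taken from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for x_data and y (299 rows, 12 features, ~32% positives).
X_demo, y_demo = make_classification(n_samples=299, n_features=12,
                                     n_informative=5, weights=[0.68, 0.32],
                                     random_state=0)

# Stratify so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fitting the scaler inside the pipeline means it only sees training rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(classification_report(y_test, model.predict(X_test)))
print(f"test ROC AUC = {test_auc:.3f}")
```

Scaling inside the pipeline differs from the standardization cell earlier in the notebook, which fits the scaler on all rows. Fitting on the training split only keeps test-set statistics out of the model.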