Neural Networks Assignment: Heart Disease Classification

Source: The dataset was obtained from the Heart Disease Classification - Neural Network Kaggle Notebook data page.

Introduction and Task Details

The goal is to create a classification model that can accurately predict whether a patient will or won't have heart disease based on their recorded health metrics.

Contents

  1. Getting the imports and data
  2. Assessing the quality of the data
  3. Preparing the data for analysis

Importing Python Packages and the Dataset
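
A minimal sketch of the loading step, assuming the data arrives as a CSV containing the fourteen columns named in the feature table later in this notebook. The helper `load_heart_data` and the file name `heart.csv` are hypothetical, and the two rows below are invented purely to show the call.

```python
import io

import pandas as pd

# Column names as they appear in this notebook's feature table.
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]


def load_heart_data(source):
    """Read the CSV and fail early if any expected column is missing.

    `source` may be a path (e.g. the hypothetical "heart.csv") or a file-like object.
    """
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df[EXPECTED_COLUMNS]


# Invented rows, not taken from the dataset; replace with the real file path.
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "50,1,2,130,240,0,1,150,0,1.0,1,0,2,1\n"
    "60,0,0,140,260,1,0,140,1,2.0,2,1,3,0\n"
)
df = load_heart_data(sample_csv)
print(df.shape)
```

Checking the column list at load time means a renamed or missing column is caught here rather than surfacing as a confusing error further down the notebook.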

Column Names Explained

Data Quality Assessment

Overview of the Data
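
One possible way to produce this kind of overview is sketched below. The dataframe is a synthetic stand-in (only the column names are borrowed from this notebook), and `quality_overview` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe; values are randomly generated.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, size=100),
    "chol": rng.normal(246, 50, size=100).round(),
    "target": rng.integers(0, 2, size=100),
})
df.loc[3, "chol"] = np.nan  # inject one missing value so the check has something to find


def quality_overview(frame):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })


overview = quality_overview(df)
print(overview)
print("duplicate rows:", int(df.duplicated().sum()))
```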

Considerations

Source: For more information on normalisation, and on how to decide between standardisation and normalisation, see the linked references.

Visual Data Quality Assessment

At this stage, we check the distribution of each feature and look for outliers, to help decide which normalisation technique is best to use.

Source: See this Medium article for a guide on detecting and removing outliers from datasets in Python.
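
The visual checks can be complemented by a numeric one. The sketch below assumes the common 1.5 × IQR rule for flagging outliers and uses pandas' sample skewness; the notebook's own criteria may differ. The data is synthetic and `iqr_outlier_mask` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Synthetic columns: one roughly symmetric, one with a long right tail.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 500),
    "right_tailed": rng.exponential(1.0, 500),
})

report = pd.DataFrame({
    "n_outliers": df.apply(lambda col: int(iqr_outlier_mask(col).sum())),
    "skewness": df.skew(),  # positive values indicate a longer right tail
})
print(report)
```

A table like `report` gives a quick per-feature summary that can be read alongside the box plots and histograms.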

Features with Outliers

Features with skews

Given the high presence of outliers, MinMaxScaler is not an appropriate normalisation technique for this dataset: it rescales using each feature's minimum and maximum, so a single extreme value compresses the remaining values into a narrow range. RobustScaler will be implemented initially, as it appears to be the most robust to outliers (compared to MinMaxScaler, Normalization, and StandardScaler). For more information on the technique, see scikit-learn's guide on RobustScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler
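
As a rough illustration of what RobustScaler does, the sketch below applies it with default settings to synthetic data. The column names are borrowed from this dataset, but the values, including the injected extreme value, are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic values; only the column names come from the dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "trestbps": rng.normal(130, 17, 200),
    "chol": rng.normal(246, 50, 200),
})
X.loc[0, "chol"] = 600.0  # one invented extreme value, to mimic an outlier

# Defaults: centre each feature on its median, scale by its interquartile range.
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

scaled_median = X_scaled.median()
scaled_iqr = X_scaled.quantile(0.75) - X_scaled.quantile(0.25)
print(scaled_median.round(6))
print(scaled_iqr.round(6))
```

After scaling, each feature has a median of approximately 0 and an interquartile range of approximately 1. The extreme value is still present; it simply no longer dictates the scale of the other values.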

Data Preprocessing

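Did it Work? Overview of RobustScaler Impact

The table below shows that the presence of outliers and skew in each feature has not changed after applying RobustScaler. This is expected: RobustScaler subtracts the median and divides by the interquartile range, which is a linear transformation. It changes the scale of each feature but not the shape of its distribution.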

Before vs After: Features with Outliers, Features with skew

Feature    Outliers Before   Outliers After   Skew Before   Skew After
age        NO                NO               NO            NO
sex        NO                NO               NO            NO
cp         NO                NO               LEFT          LEFT
trestbps   YES               YES              NO            NO
chol       YES               YES              NO            NO
fbs        YES               YES              NO            NO
restecg    NO                NO               LEFT          LEFT
thalach    YES               YES              NO            NO
exang      NO                NO               NO            NO
oldpeak    YES               YES              LEFT          LEFT
slope      NO                NO               RIGHT         RIGHT
ca         YES               YES              LEFT          LEFT
thal       YES               YES              RIGHT         RIGHT
target     NO                NO               NO            NO
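
The before/after comparison can also be checked programmatically. The sketch below assumes outliers are flagged with the 1.5 × IQR rule, which may not be the criterion used for the table, and runs on synthetic data. `iqr_outlier_mask` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler


def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Synthetic data: one bell-shaped feature and one long-tailed feature.
rng = np.random.default_rng(1)
before = pd.DataFrame({
    "bell": rng.normal(0, 1, 300),
    "tailed": rng.exponential(2.0, 300),
})
after = pd.DataFrame(RobustScaler().fit_transform(before), columns=before.columns)

# Compare which rows are flagged, and the skewness, before and after scaling.
same_outliers = all(
    iqr_outlier_mask(before[c]).equals(iqr_outlier_mask(after[c])) for c in before
)
same_skew = np.allclose(before.skew(), after.skew())
print(same_outliers, same_skew)
```

Because the scaler shifts and rescales every value of a feature by the same amounts, the IQR fences move with the data and the skewness is unaffected.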

See below for a summarised comparison of the two dataframes (RobustScaler is presented first).

Correlations

Numerical categories key:

Pairplot Explained (this will be a lot, skim read for summaries)

NB: Self-pairs (e.g. thal against thal) and pairs that differ only by inverted axes (e.g. age against thal and thal against age) are not repeated; only one version of each pair is discussed.

It is very difficult to see any strong pairwise relationships between the features, and the pairplot did not highlight any significant correlations. The next step is to create heatmaps so that the correlations can be seen more clearly.

Individual correlations to target summarised
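
A minimal sketch of how each feature's correlation with the target could be summarised, using Pearson correlation (the pandas default) on synthetic data. The feature names `related`, `inverse`, and `noise` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: two features tied to the target, one pure noise.
rng = np.random.default_rng(0)
n = 400
target = rng.integers(0, 2, n)
df = pd.DataFrame({
    "related": target * 2.0 + rng.normal(0, 1, n),
    "inverse": -target * 2.0 + rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
    "target": target,
})

corr_matrix = df.corr()  # Pearson correlation between every pair of columns

# Keep only the target column, drop its self-correlation, rank by strength.
to_target = (
    corr_matrix["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(to_target.round(2))
```

The full `corr_matrix` is what a heatmap would visualise, for example with seaborn's `heatmap` function (not shown here, to keep the dependencies light).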

Split dataset for models
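
One possible split is sketched below, assuming an 80/20 train/test split stratified on the target. Both the ratio and the use of stratification are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a balanced binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, 200),
    "thalach": rng.normal(150, 23, 200),
    "target": np.repeat([0, 1], 100),
})

X = df.drop(columns="target")
y = df["target"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

If the scaler is fitted after splitting, fitting it on `X_train` only and then transforming `X_test` avoids leaking test-set statistics into training.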

Models (currently unfinished)
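
Since this section is unfinished, the following is only a placeholder sketch of what a small neural network classifier could look like. It uses scikit-learn's `MLPClassifier` as a stand-in; the framework, the layer sizes, and the synthetic data are all assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic two-class data: class 1 is shifted away from class 0 on every feature.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 5)) + y[:, None] * 1.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then trains the network.
model = make_pipeline(
    RobustScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in a single pipeline keeps the preprocessing and the model together, so the same transformation is applied at training and prediction time.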

See this tutorial for a beginner's guide to deep learning in Python.