Heart Disease ML Pipeline

Sep 2025 · 5 min read · MSc AI, Kristiania University College — Introduction to AI

Python · scikit-learn · pandas · seaborn · logistic regression · random forest · cross-validation

The Problem

Heart disease is the leading cause of death globally, yet many of its key risk factors are measurable through routine clinical tests — blood pressure, cholesterol, resting ECG, maximum heart rate achieved during exercise. The data exists; the question is whether a machine learning model can reliably predict disease presence from it, and whether that model can be made interpretable enough for a clinical audience.

This project built a complete supervised ML pipeline from raw clinical data through to evaluated, interpretable predictions, and extended it with an unsupervised component to discover structure in the patient population without using the diagnosis label.

My Approach

The dataset contains clinical measurements from patients, with a binary target indicating heart disease presence. I treated this as a structured binary classification problem and designed the pipeline in two phases: supervised learning to predict the outcome, and unsupervised learning to find naturally occurring patient groupings.

For the supervised phase, I chose two models for their contrasting properties: Logistic Regression as a transparent, coefficient-based baseline, and Random Forest as a stronger ensemble method whose feature importances could complement the logistic coefficients. Using both together tells you more than either alone.

Evaluation was built around 5-fold cross-validation rather than a single train/test split, to get stable performance estimates across the full dataset.
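As a rough illustration of the 5-fold evaluation described above, the sketch below scores a classifier with scikit-learn's cross_val_score. The data is synthetic (make_classification stands in for the clinical table), and the stratified, shuffled folds and the ROC-AUC scoring choice are assumptions rather than confirmed details of the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the clinical table (assumed shape: 300 rows, 13 features).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

# Stratified folds keep the positive/negative ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")

print("ROC-AUC per fold:", scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Every row is used for validation exactly once, which is why the mean and spread of the five scores are a steadier estimate than one split.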

What I Built

Preprocessing pipeline: A scikit-learn Pipeline object chained four stages: SimpleImputer for missing values (mean strategy for numerical features, most-frequent for categorical), one-hot encoding for categorical features, StandardScaler for numerical features, and VarianceThreshold to drop near-zero-variance columns. Wrapping these in a Pipeline prevented data leakage across cross-validation folds: every fitted step, including the imputer and scaler, is fit only on the training folds and then applied unchanged to the held-out fold.
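One possible way to wire these stages together is sketched below. It assumes a ColumnTransformer is used to route numerical and categorical columns to their own sub-pipelines; the column names, the synthetic data, and the 0.01 variance threshold are all hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's schema may differ.
numeric_cols = ["age", "trestbps", "chol", "thalach"]
categorical_cols = ["cp", "restecg"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("variance", VarianceThreshold(threshold=0.01)),  # assumed cut-off
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small synthetic frame with a couple of missing values.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.integers(30, 80, n).astype(float),
    "trestbps": rng.normal(130, 15, n),
    "chol": rng.normal(240, 40, n),
    "thalach": rng.normal(150, 20, n),
    "cp": rng.choice(["typical", "atypical", "non-anginal"], n),
    "restecg": rng.choice(["normal", "abnormal"], n),
})
df.loc[0, "chol"] = np.nan
df.loc[1, "cp"] = np.nan
y = rng.integers(0, 2, n)  # random labels, for illustration only

# Each fold refits the imputers, encoder, and scaler on its training rows only.
scores = cross_val_score(model, df, y, cv=5)
model.fit(df, y)
print(model.predict(df.head()))
```

Because the classifier is the last step of the same Pipeline, passing the whole object to cross_val_score is enough to keep the preprocessing inside each fold.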

Logistic Regression: The interpretable baseline. Coefficients were extracted and visualised to show which clinical features push predictions toward or away from a positive diagnosis. This gives clinicians a direct feature-level explanation — a requirement in medical ML applications.
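A minimal sketch of the coefficient extraction, assuming standardised inputs so that coefficient magnitudes are comparable across features. The feature names and data are placeholders; the notebook's actual plotting code is not reproduced here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical clinical feature names.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["age", "trestbps", "chol", "thalach", "oldpeak"]

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem.
coefs = pd.Series(pipe[-1].coef_[0], index=feature_names)

# Order by absolute size; the sign shows the direction of the effect.
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

A positive coefficient raises the predicted log-odds of disease as the feature increases, and a negative one lowers it. A horizontal bar chart of this Series is one simple way to visualise the result.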

Random Forest: The ensemble model. Feature importances were computed and compared to the logistic coefficients — agreement between the two methods increases confidence that a feature is genuinely predictive rather than an artefact of one model's assumptions.
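The comparison could be made concrete along the following lines. This is a sketch on synthetic data with placeholder feature names; summarising agreement as a rank correlation is an assumption, not necessarily how the notebook compares the two.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["age", "trestbps", "chol", "thalach", "oldpeak"]  # hypothetical

# Logistic regression on standardised inputs; trees do not need scaling.
lr = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

comparison = pd.DataFrame({
    "lr_abs_coef": np.abs(lr.coef_[0]),
    "rf_importance": rf.feature_importances_,
}, index=feature_names)
comparison["lr_rank"] = comparison["lr_abs_coef"].rank(ascending=False)
comparison["rf_rank"] = comparison["rf_importance"].rank(ascending=False)

# Pearson correlation of the two rank columns equals Spearman's rho.
rank_agreement = comparison["lr_rank"].corr(comparison["rf_rank"])
print(comparison.sort_values("rf_rank"))
print(f"rank agreement: {rank_agreement:.2f}")
```

Note that impurity-based importances are unsigned, so only the magnitude of the logistic coefficients is compared here.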

Evaluation suite: Each model was assessed with a full classification report (precision, recall, F1, support), a confusion matrix visualisation, a ROC curve with its AUC, and 5-fold cross-validation scores. Reporting both precision and recall matters in medical settings, where the cost of a false negative (a missed disease) differs from that of a false positive.
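The metrics listed above map onto scikit-learn calls roughly as follows. The sketch uses a single held-out split on synthetic data purely to keep it short; the class names and split size are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision, recall, F1, and support per class.
print(classification_report(y_test, y_pred, target_names=["no disease", "disease"]))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"false negatives (missed disease): {fn}, false positives: {fp}")

# AUC is computed from probabilities, not from hard predictions.
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.3f}")
```

Reading fn and fp straight off the confusion matrix makes the false-negative versus false-positive trade-off explicit.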

Unsupervised component: After the supervised analysis, the diagnosis label was removed and clustering was applied to the patient feature space to discover natural groupings — asking whether patients with similar clinical profiles tend to cluster in ways that correspond to disease presence, without telling the algorithm the answer.
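The write-up does not name the clustering algorithm, so the sketch below assumes k-means with two clusters and measures alignment with the held-back label using the adjusted Rand index. Both choices, like the synthetic data, are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# y is kept aside and never shown to the clustering algorithm.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

# k-means is distance-based, so features are standardised first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Cross-tabulate cluster membership against the held-back diagnosis.
table = pd.crosstab(pd.Series(y, name="diagnosis"), pd.Series(clusters, name="cluster"))
print(table)

# Adjusted Rand index: about 0 for chance-level agreement, 1 for a perfect match.
ari = adjusted_rand_score(y, clusters)
print(f"adjusted Rand index: {ari:.3f}")
```

Cluster IDs are arbitrary, which is why a permutation-invariant score such as the adjusted Rand index is used instead of plain accuracy.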

CV Folds: 5-fold

Models: LR + RF

Eval Metrics: F1, ROC-AUC

Results

Both models achieved strong classification performance on the heart disease dataset. The Random Forest outperformed Logistic Regression on overall accuracy and ROC-AUC, as expected given its capacity to model non-linear interactions between features. However, Logistic Regression's interpretability advantage remained significant — its coefficients produced clinically sensible rankings of feature importance that were easier to communicate and audit.

5-fold cross-validation confirmed that the performance numbers were stable and not the result of a favourable single split. The standard deviation of CV scores was low for both models, indicating consistent generalisation.

The unsupervised component found that patient clusters aligned meaningfully with diagnosis status — suggesting that the clinical feature space contains genuine structure separating higher- and lower-risk patients, even without using the label.

Note: exact accuracy, F1, and AUC values are in the full notebook (IAI4100-1_25H_ARTEFACT).

What I Learned

The Pipeline object in scikit-learn changed how I think about preprocessing. Before this project I applied transformations globally before splitting. Wrapping everything in a Pipeline makes cross-validation correct by construction — there is no opportunity to leak validation fold statistics into the training process. It's a small habit with significant consequences for the validity of reported metrics.
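To make the leakage point concrete, this small sketch compares the mean a scaler learns when fit on all rows with the mean it learns from the training rows of one fold. The single synthetic feature and fold settings are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=130, scale=15, size=(100, 1))  # synthetic stand-in for one feature

# Global fit: the scaler sees every row, including future validation rows.
global_mean = StandardScaler().fit(X).mean_[0]

# Per-fold fit: what a Pipeline does inside cross-validation.
train_idx, val_idx = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
fold_mean = StandardScaler().fit(X[train_idx]).mean_[0]

print(f"mean from all rows:        {global_mean:.3f}")
print(f"mean from training rows:   {fold_mean:.3f}")
print(f"difference (leaked info):  {global_mean - fold_mean:.3f}")
```

The gap between the two means is information about the validation rows that a globally fitted scaler would pass into training.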

I also came away with a stronger appreciation for why interpretability matters in healthcare ML. A black-box model that is 2% more accurate will face a much harder adoption path than one that can tell a clinician "this patient's high resting heart rate and ST depression are the primary drivers of this prediction." The logistic regression's transparency is not a weakness; in context, it's a feature.

The comparison between supervised and unsupervised results was the most intellectually interesting part. When unsupervised clusters align with known labels, it is evidence that the features carry real signal about the diagnosis. When they don't, it usually means the label is capturing something the features don't, or that the clustering algorithm is finding a different, also-real structure.
