Bank Customer Churn — What-If Calculator

May 2023 · 5 min read · BSc Applied Data Science, Noroff — Statistical Analysis Tools and Techniques

Python · scikit-learn · pandas · logistic regression · interpretability · what-if analysis

The Problem

Customer churn — when a profitable customer cancels or walks away — is one of the most expensive failure modes in retail banking. Acquiring a replacement customer costs many times more than retaining an existing one, and the highest-value customers are typically also the most mobile. The question is whether routine customer attributes — age, balance, product holdings, satisfaction scores, activity flags — carry enough signal to identify customers at elevated churn risk early enough to do something about it.

The dataset was a 10 000-row European retail-banking sample with demographic, account, and engagement features, and a binary Exited label flagging customers who closed their accounts. The deliverable was a model plus the analytical narrative around what drives churn — the kind of artefact a retention team could actually argue with.

My Approach

I chose Logistic Regression deliberately, even though tree-based models tend to win on raw accuracy for tabular churn problems. The reason is interpretability: for a retention team to act on the model's predictions, they need to be able to point at a customer and say "this is why we think they'll leave". Logistic Regression hands you that directly. Each signed coefficient, multiplied by the customer's standardized input value, is that feature's additive contribution to the predicted log-odds of churn, which gives a clean per-customer decomposition of the prediction.

One important design choice was to exclude the Complain feature. In this dataset Complain correlates near-perfectly (~0.99) with Exited, which means a model that uses it is trivially accurate but operationally useless: by the time a customer has formally complained, the retention window is often closed. Excluding it forces the model to reason from the leading indicators (engagement, balance, product mix, satisfaction scores) that a retention team can actually intervene on.
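A leakage check of this kind can be sketched in a few lines of pandas. The data below is synthetic (a stand-in built so that Complain almost mirrors Exited), and the 0.95 cut-off is an assumed threshold for illustration, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption): ~20% churn, with
# Complain matching Exited for all but a handful of rows.
rng = np.random.default_rng(0)
n_rows = 10_000
exited = (rng.random(n_rows) < 0.20).astype(int)
complain = exited.copy()
flipped = rng.choice(n_rows, size=10, replace=False)
complain[flipped] = 1 - complain[flipped]

df = pd.DataFrame({
    "Exited": exited,
    "Complain": complain,
    "Age": rng.integers(18, 90, size=n_rows),  # unrelated to the label here
})

# Absolute Pearson correlation of each candidate feature with the label.
corr_with_label = df.drop(columns="Exited").corrwith(df["Exited"]).abs()

LEAK_THRESHOLD = 0.95  # hypothetical cut-off for "suspiciously predictive"
leaky = corr_with_label[corr_with_label > LEAK_THRESHOLD].index.tolist()
print(corr_with_label.round(3).to_dict(), "-> drop:", leaky)
```

A high correlation alone does not prove leakage; it flags a column for the kind of domain reasoning described above (does the signal arrive too late to act on?).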

The pipeline is a single scikit-learn Pipeline wrapping a ColumnTransformer: numeric columns get median imputation followed by StandardScaler, categorical columns get most-frequent imputation followed by OneHotEncoder, and the result feeds into LogisticRegression(max_iter=1000). Performance is reported via 5-fold stratified cross-validation rather than a single train/test split.
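A minimal sketch of a pipeline with this shape might look as follows. The column names come from the feature list in this write-up; the step names, the build_pipeline helper, and handle_unknown="ignore" are assumptions made for illustration rather than details confirmed by the project code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts",
           "EstimatedSalary", "Satisfaction Score", "Point Earned"]
CATEGORICAL = ["Location", "Gender", "HasCreditCard", "IsActiveMember",
               "Card Type"]


def build_pipeline(numeric=NUMERIC, categorical=CATEGORICAL):
    """Preprocessing and classifier in one estimator (hypothetical helper)."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Assumption: unseen categories are encoded as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, numeric),
        ("cat", categorical_steps, categorical),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```

Because imputers, scaler, encoder, and classifier live in one estimator, calling fit on a training fold refits every preprocessing statistic on that fold only.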

What I Built

Cleaned dataset: Loaded the 10 000-row sample, dropped CSV-export artefact columns, removed the leaky Complain column, and split features into eight numeric (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary, Satisfaction Score, Point Earned) and five categorical (Location, Gender, HasCreditCard, IsActiveMember, Card Type) inputs.

Sklearn pipeline: A single Pipeline chained the ColumnTransformer preprocessing through to the LogisticRegression classifier. Wrapping everything in a Pipeline means cross-validation cannot leak scaler statistics from validation folds into training folds.

5-fold stratified CV: Performance measured across five stratified folds to get an honest estimate that doesn't depend on a lucky split. Reported accuracy, F1, and ROC-AUC, with standard deviations — F1 is the one that matters here because the dataset is imbalanced (~20% churn rate) and a naive "predict stayed for everyone" classifier already gets ~80% accuracy.

Live what-if calculator: A Gradio app wrapping the trained pipeline. Pick a real customer from the dataset and their actual attributes populate the inputs. Then tweak anything — sliders for numeric features, dropdowns for categoricals — and watch the churn probability and driver breakdown update live, on every input change, without a Predict button. The interactivity is the point: it lets you ask counterfactual questions like "what if this customer became active again?" or "would lowering their balance trigger churn?"

Driver-contribution panel: For each prediction, the logistic-regression coefficients are applied to that customer's scaled feature values and aggregated back to the original feature names. The result is a per-customer breakdown of which inputs are pushing the prediction toward churn (red) or staying (green), with the magnitude visible on a horizontal bar chart.
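The driver decomposition can be illustrated without the trained model. In the sketch below every coefficient and the intercept are invented for illustration, and the column labels are hypothetical; only the mechanics (coefficient times transformed value, summed per original feature) follow the description above.

```python
import math
from collections import defaultdict

# Hypothetical values, invented for illustration; not the fitted model's.
INTERCEPT = -1.5
COEFS = {
    # transformed column: (original feature, coefficient)
    "Age (scaled)": ("Age", 0.75),
    "Balance (scaled)": ("Balance", 0.25),
    "IsActiveMember=No": ("IsActiveMember", 0.5),
    "IsActiveMember=Yes": ("IsActiveMember", -0.5),
    "Location=Germany": ("Location", 0.5),
    "Location=France": ("Location", -0.25),
}


def driver_contributions(transformed_row):
    """Sum coefficient * value per original feature, merging one-hot columns."""
    contributions = defaultdict(float)
    for column, value in transformed_row.items():
        feature, coef = COEFS[column]
        contributions[feature] += coef * value
    return dict(contributions)


def churn_probability(contributions):
    """Intercept plus all contributions is the log-odds; squash it to [0, 1]."""
    log_odds = INTERCEPT + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# One customer after preprocessing: z-scores for numerics, 0/1 for one-hot.
customer = {
    "Age (scaled)": 2.0,
    "Balance (scaled)": -1.0,
    "IsActiveMember=No": 1.0,
    "IsActiveMember=Yes": 0.0,
    "Location=Germany": 1.0,
    "Location=France": 0.0,
}
drivers = driver_contributions(customer)
probability = churn_probability(drivers)
for feature, value in sorted(drivers.items(), key=lambda kv: -abs(kv[1])):
    direction = "toward churn" if value > 0 else "toward staying"
    print(f"{feature:>15}: {value:+.2f} ({direction})")
```

Positive contributions correspond to the red bars and negative ones to the green bars. Since the contributions plus the intercept sum to the log-odds, the bar chart accounts for the whole prediction rather than approximating it.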

Dataset Size

~10k rows

Features Used

13 (8 numeric, 5 categorical)

CV Folds

5-fold

Results

5-fold stratified CV yielded accuracy ≈ 0.81, ROC-AUC ≈ 0.77, and F1 ≈ 0.32 on the positive (churn) class. The F1 looks low at first glance but is honest — on an imbalanced dataset with the leaky Complain feature removed, predicting the minority class from leading indicators alone is genuinely hard, and a 0.77 AUC means the model is meaningfully ranking customers even when the default 0.5 threshold misses many of them.
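Metrics of this kind, and the gap to a naive baseline, could be produced along the following lines. This is a sketch on synthetic data from make_classification with roughly 20% positives, so its numbers will not match the figures reported here; the summarise helper and the simplified scaler-plus-classifier pipeline are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with ~20% positives; the bank data is not reproduced.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]


def summarise(model):
    """Mean and standard deviation of each metric across the five folds."""
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
            for m in scoring}


model_scores = summarise(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)))
# Baseline that always predicts the majority class ("stayed"). Its F1 is 0
# because it never predicts churn; scikit-learn may warn about that.
baseline_scores = summarise(DummyClassifier(strategy="most_frequent"))

for name, result in [("logistic regression", model_scores),
                     ("always 'stayed'", baseline_scores)]:
    print(name, {m: f"{mean:.2f} +/- {std:.2f}"
                 for m, (mean, std) in result.items()})
```

The baseline row makes the imbalance argument concrete: accuracy near 0.8 with an F1 of zero and an AUC of 0.5, which is why accuracy alone says little about a churn model.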

The driver breakdown surfaced consistent and operationally useful patterns. IsActiveMember = No, Age in the upper third, and Location = Germany were the strongest churn-pushing factors. NumOfProducts = 1 also pushed predictions toward churn, while NumOfProducts = 2 was the strongest protective factor, consistent with the well-known relationship between product count and engagement in retail banking.

The most informative use of the what-if calculator is on customers the model gets wrong — e.g. a customer flagged as low-risk who in reality churned. Nudging their attributes to see what would have flipped the prediction surfaces exactly where the model's blind spots are, which is more useful than any single accuracy number.
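The counterfactual query behind such a what-if can be sketched as a small function, independent of the Gradio front end. The data, the two-feature model, and the what_if helper below are all assumptions made for illustration; the real app wraps the full trained pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in (an assumption): older and inactive customers churn more.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n).astype(float),
    "IsActiveMember": rng.choice(["Yes", "No"], size=n),
})
log_odds = -2.0 + 0.04 * (X["Age"] - 40) + 1.5 * (X["IsActiveMember"] == "No")
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["IsActiveMember"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)


def what_if(model, customer, changes):
    """Churn probability before and after overriding attributes.

    `customer` is a one-row DataFrame; `changes` maps column name to new value.
    """
    modified = customer.assign(**changes)
    before = model.predict_proba(customer)[0, 1]
    after = model.predict_proba(modified)[0, 1]
    return before, after


# Pick an inactive customer and ask: what if they became active again?
customer = X[X["IsActiveMember"] == "No"].iloc[[0]]
before, after = what_if(model, customer, {"IsActiveMember": "Yes"})
print(f"churn probability: {before:.2f} -> {after:.2f}")
```

For a misclassified customer, looping such a function over a range of values for one attribute shows how far the input has to move before the prediction crosses the decision threshold.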

What I Learned

The decision to exclude the Complain feature was the most important choice in the whole project. It's the difference between a model that demonstrates technical accuracy and a model that's useful to a retention team. Leaky features make benchmark numbers look great and operational performance terrible — a lesson that generalises far beyond banking.

Interpretability is not the same as transparency. Logistic regression coefficients are transparent — you can read them — but they're only interpretable in context. Showing a coefficient table is much less useful than showing a per-customer decomposition that demonstrates how that coefficient is actually working in a specific case. The what-if interactivity does exactly that.

Live reactivity changes how people engage with a model. A Predict button frames the model as something you query for an answer. A live-updating interface frames it as something you play with. The second framing produces much richer questions — and richer questions are usually how you find out what the model doesn't know.

GitHub → Live Demo →