A six-model machine learning benchmark on 10,000 bank customers — investigating not just which algorithm wins, but how to meaningfully improve recall on a real-world imbalanced dataset using SMOTE, sample weighting, and threshold tuning.
Domain
Banking · Customer Churn
Data Sources
Kaggle
Stack
Python · ML
Published
March 2026

Every year, banks lose billions to customer churn — and most find out a customer has left only after the account closes. This project builds a machine learning early warning system that flags who is likely to leave before they do, giving a retention team time to act.
Six algorithms were tested. Three techniques were applied to push recall beyond the baseline. The result is an honest, end-to-end analysis of what works, what doesn't, and why — including where the model still falls short.
In banking, retaining an existing customer costs 5–7× less than acquiring a new one. The objective here wasn't just to build a model — it was to give a retention team a reliable, ranked list of customers most likely to leave, early enough to intervene.
Three things the model needed to deliver: reliable predictions, a ranked list of at-risk customers, and enough lead time for the retention team to act.
The goal: Maximise churn detection. Minimise the cost of missing someone who was about to walk away.
The model was trained on a Kaggle dataset of 10,000 bank customers, structured like real account data: 14 features including age, balance, credit score, number of products, and active-membership status.

Customer Churn - Class Distribution
The trap most projects fall into: This dataset has a structural imbalance — 79.6% of customers stayed, only 20.4% churned. A model that simply predicts "stays" for every single customer would score 80% accuracy without identifying a single person at risk. Accuracy alone is a completely misleading metric here.
The real measure is recall on churners — of all the customers who actually left, how many did the model catch? That's the number this project was built to maximise.
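To make the trap concrete, here is a minimal illustrative sketch (not the project code) of what an always-predict-"stays" baseline scores on a synthetic 1,000-customer sample with the same class ratio:

```python
# Illustrative only: a "predict everyone stays" baseline on a synthetic sample
# with the same 79.6% / 20.4% class split as the dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 796 + [1] * 204)   # 0 = stays, 1 = churns
y_naive = np.zeros_like(y_true)            # always predict "stays"

print(accuracy_score(y_true, y_naive))     # 0.796, looks respectable
print(recall_score(y_true, y_naive))       # 0.0, not a single churner caught
```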
No missing values or duplicates. Administrative fields (CustomerID, Surname) were dropped — they carry no predictive signal and would introduce noise.
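A minimal sketch of that preparation step, assuming the standard Kaggle churn file and column names (the write-up doesn't show its exact pipeline):

```python
# Assumed preprocessing sketch: drop administrative fields, one-hot encode
# the categorical columns, and hold out a stratified test set.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Churn_Modelling.csv")                        # assumed filename
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"], errors="ignore")

X = pd.get_dummies(df.drop(columns=["Exited"]), drop_first=True)
y = df["Exited"]                                               # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```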
Six models were evaluated across a complexity spectrum: five base algorithms plus a Histogram Gradient Boosting variant.
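The full roster isn't listed in this excerpt; as a rough sketch of the benchmarking loop, here are the four models the write-up names explicitly, using illustrative sklearn defaults and the train/test split from the preprocessing sketch above:

```python
# Benchmark sketch: fit each candidate model and report accuracy plus
# churner recall, the metric this project optimises for.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Hist Gradient Boosting": HistGradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:24s} accuracy={accuracy_score(y_test, pred):.3f} "
          f"churner_recall={recall_score(y_test, pred):.3f}")
```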
After benchmarking, three techniques were applied to push churner recall beyond the 49% baseline: SMOTE, sample weighting, and threshold tuning.

GBM Feature Importance - Top 10 Predictors
The Gradient Boosting model's feature importance reveals what's actually driving customer exits: activity status, number of products, age, and balance dominate the top of the ranking.
Business insight: A retention campaign targeting inactive single-product customers in older age brackets, particularly those with growing dormant balances, is directly derivable from this analysis. These aren't just model outputs — they're campaign briefs.
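For reference, a sketch of how that ranking is pulled from the fitted model, assuming sklearn's GradientBoostingClassifier and the feature matrix built earlier:

```python
# Sketch: rank the GBM's feature importances to find the top 10 predictors.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

top10 = (
    pd.Series(gbm.feature_importances_, index=X_train.columns)
      .sort_values(ascending=False)
      .head(10)
)
print(top10)   # the ranking behind the chart above
```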

Model Leaderboard: Accuracy vs Recall (Catch Rate)
The leaderboard exposes the accuracy trap clearly. Every model scores between 80% and 87% accuracy, so the spread looks narrow. But recall on churners tells a completely different story, ranging from 19% (Logistic Regression) to 49% (GBM).
Gradient Boosting wins the baseline benchmark — not because it has the highest accuracy, but because it catches the most churners. Complex, sequential tree-building allowed it to capture the non-linear patterns that simpler models like Logistic Regression and SVM missed entirely.
But 49% recall means the model still missed more than half of the customers who were about to leave. That's the honest baseline — and it's the starting point for the improvement work below.

ROC Curve comparison - All 6 Models
The ROC curve measures each model's ability to separate churners from non-churners across every possible decision threshold — not just the default 50% cutoff. An AUC of 1.0 is perfect; 0.5 is random.
GBM leads with an AUC of 0.871, meaning it correctly ranks a churner above a non-churner 87% of the time. This confirms that the model has genuine discriminative power — the 49% recall limitation isn't a fundamental weakness of the model, but an artefact of the imbalanced training data and the default decision threshold.
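The AUC is computed from the model's ranked churn probabilities rather than its hard 0/1 predictions; a minimal sketch, reusing the fitted GBM from the benchmark:

```python
# Sketch: ROC AUC is computed on predicted probabilities, not 0/1 labels.
from sklearn.metrics import roc_auc_score, roc_curve

proba = gbm.predict_proba(X_test)[:, 1]          # P(churn) for each test customer
print(roc_auc_score(y_test, proba))              # ~0.87 reported for the GBM

fpr, tpr, thresholds = roc_curve(y_test, proba)  # points behind the curve above
```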

Recall Improvement - GBM Mitigation Approaches
All three approaches improved recall meaningfully. Each involves a different tradeoff:

Model Recall Improvement Table
Sample weighting produced the largest raw improvement — recall jumped from 48.9% to 75.9%, catching three quarters of all churners. The cost is a ~7 point drop in overall accuracy and a higher false-alarm rate.
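A minimal sketch of the sample-weighting approach, assuming sklearn's balanced weighting scheme (the exact weights used in the project aren't specified):

```python
# Sketch: weight each training example inversely to its class frequency,
# so churners count for roughly 4x as much as non-churners during fitting.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

weights = compute_sample_weight(class_weight="balanced", y=y_train)
gbm_weighted = GradientBoostingClassifier(random_state=42)
gbm_weighted.fit(X_train, y_train, sample_weight=weights)
```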
SMOTE offers the better operational balance for most retention teams — 66.3% recall with 61.5% precision means the majority of flagged customers are genuine risks.
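And the SMOTE variant, sketched with imbalanced-learn's defaults and applied to the training split only so the test set stays untouched:

```python
# Sketch: SMOTE synthesises new minority-class examples by interpolating
# between existing churners, then the GBM is fit on the balanced data.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
gbm_smote = GradientBoostingClassifier(random_state=42).fit(X_res, y_res)
```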
Which to deploy depends on the business, not the model. A large retention team with budget to contact many customers should use sample weighting. A smaller team needing high-confidence leads should use SMOTE.

Precision vs Recall Tradeoff across Thresholds

Precision–Recall Tradeoff Scatter — Default GBM
These two charts make the threshold tradeoff explicit and actionable. As the decision boundary lowers, the model catches more churners (recall rises) but also raises more false alarms (precision falls).
The recommended threshold of 0.35 sits at the inflection point — recall improves from 48.9% to 60.0% with a manageable precision reduction. For a bank with a dedicated retention team, this is the lowest-effort, highest-impact change: no retraining, no new data, just a configuration adjustment.
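In code, that configuration adjustment is roughly this, reusing the default-trained GBM:

```python
# Sketch: threshold tuning needs no retraining, only a different cut-off
# applied to the predicted churn probabilities.
proba = gbm.predict_proba(X_test)[:, 1]
flagged = (proba >= 0.35).astype(int)   # flag anyone with P(churn) >= 0.35
```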

Results Summary Table
No project should end without naming what it doesn't do well: