Predicting Customer Churn: Selecting the Optimal ML Model

March 25, 2026
Data Science
Machine Learning

Customer retention is significantly more cost-effective than customer acquisition. This project developed a machine learning "early warning system" to identify bank customers at high risk of churning (leaving the bank).

By evaluating five different predictive algorithms, a Gradient Boosting Machine (GBM) was selected for deployment. The winning model achieved an 87% overall accuracy and, most importantly for the business, correctly identified 49% of all actual churners, the highest catch rate among all models tested.

πŸ’Ό Business Problem & Context

In the competitive banking sector, understanding why customers leave is paramount to maintaining a stable and profitable customer base. The primary objective of this project was to provide the bank's marketing and retention teams with a tool to:

  1. Predict Probability: Assign a probability score indicating a customer's likelihood to churn.
  2. Identify Drivers: Determine the primary features (age, balance, salary) that influence a customer's decision to leave.
  3. Proactive Intervention: Enable targeted retention campaigns (e.g., personalised offers, improved service) for high-risk customers before they walk away.

The Goal: Maximise active retention efforts while minimising costs.

πŸ“Š Data Source & Description

The model was trained and validated using a popular Churn Modelling dataset from Kaggle.

πŸ› οΈ Methodology & Technical Workflow

This project followed a complete end-to-end data science lifecycle:

1. Data Exploration & Cleaning

Initial analysis ensured data integrity, confirming no missing values or duplicate records. Non-predictive features (Surname, CustomerID) were dropped to improve model focus.
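A minimal sketch of this cleaning step, using a tiny synthetic stand-in for the data (column names follow the standard Kaggle Churn Modelling dataset; the sample values here are illustrative only):

```python
import pandas as pd

# Tiny synthetic stand-in for the Kaggle Churn Modelling dataset
df = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Surname": ["Smith", "Ng", "Rossi"],
    "Age": [42, 29, 55],
    "Balance": [0.0, 83807.86, 125510.82],
    "Exited": [1, 0, 1],
})

# Confirm data integrity: no missing values, no duplicate records
assert df.isnull().sum().sum() == 0
assert not df.duplicated().any()

# Drop non-predictive identifier columns so the models focus on signal
df = df.drop(columns=["CustomerId", "Surname"])
print(list(df.columns))
```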

2. Preprocessing & Feature Engineering

Data transformation was crucial for model performance:

  • Encoding: Converted categorical text (Gender, Geography) into numerical data using Label Encoding and One-Hot Encoding (drop_first=True to avoid the Dummy Variable Trap).
  • Feature Scaling: Standardised features like Balance and EstimatedSalary using StandardScaler to ensure large numbers did not unfairly dominate the models.
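The two transformations above can be sketched as follows, again on a small synthetic frame (the exact encoder choices mirror the description, not the project's verbatim code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Balance": [0.0, 83807.86, 125510.82, 0.0],
    "EstimatedSalary": [101348.88, 112542.58, 79084.10, 149756.71],
})

# Label-encode the binary Gender column (Female=0, Male=1)
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# One-hot encode Geography, dropping the first dummy to avoid the Dummy Variable Trap
df = pd.get_dummies(df, columns=["Geography"], drop_first=True)

# Standardise large-magnitude features so they don't dominate the models
cols = ["Balance", "EstimatedSalary"]
df[cols] = StandardScaler().fit_transform(df[cols])
```

After scaling, each standardised column has zero mean and unit variance, so Balance and EstimatedSalary sit on the same footing as the encoded categoricals.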

3. Multi-Model Evaluation

I adopted an experimental approach, evaluating five algorithms of varying complexity to establish a robust baseline and find the top performer:

  • Linear Baseline: Logistic Regression.
  • Margin Optimizer: Linear SVM.
  • Instance-Based: K-Nearest Neighbors (KNN).
  • Parallel Ensemble: Random Forest.
  • Sequential Ensemble (Perfectionist Relay): Gradient Boosting (GBM).
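The comparison loop can be sketched like this. Synthetic imbalanced data (roughly 80/20, mirroring the churn rate described later) stands in for the prepared features; hyperparameters are scikit-learn defaults, not the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in: ~80% "stay", ~20% "churn"
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = (accuracy_score(y_test, preds), recall_score(y_test, preds))
    print(f"{name}: acc={results[name][0]:.3f}, recall={results[name][1]:.3f}")
```

Tracking accuracy and recall side by side for every model is what makes the leaderboard comparison below the headline numbers possible.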

πŸ“‰ Visual Analysis & Strategic Insights

This section explains the crucial data visualisations that drove the project's strategy and conclusions.

Chart 1: Understanding the "Why" (Feature Importance)

Bar chart: Feature Importance: What Drives Customer Churn?

What is it? This horizontal bar chart visualises the "logic" the Random Forest model used to make decisions. It ranks features based on how much they reduced uncertainty about whether a customer would churn.

Interpretation:

  • Age is Dominant: By a significant margin, Age is the most powerful predictor of churn. This tells us that life stage is the primary driver of customer mobility in this dataset.
  • Financial Drivers: EstimatedSalary, CreditScore, and Balance form the next tier of importance. This is a financially driven dataset, as expected in banking.
  • Low Importance: Gender and Geography_Spain are at the very bottom, indicating that for this specific bank, location and gender had minimal predictive value.

Business Insight: Churn is driven by demographics (Age), not geography or gender. This is actionable: The bank should investigate tailored retention strategies for specific age demographics (e.g., offers for younger customers moving toward home ownership, or products focused on security for older age brackets).
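The ranking behind this chart comes from the Random Forest's impurity-based importances. A minimal sketch, on synthetic data with hypothetical feature names standing in for the real columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; these names are illustrative, not the real dataset
feature_names = ["Age", "EstimatedSalary", "CreditScore", "Balance", "Gender"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Mean impurity decrease per feature, normalised to sum to 1, ranked descending
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Plotting this Series as a horizontal bar chart (e.g. `importances.plot.barh()`) reproduces the style of chart shown above.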

Chart 2: The Project Executive Summary & Model Leaderboard

Model Leaderboard: Accuracy vs. Recall (Catch Rate)

What is it? This professional infographic provides a high-level summary of the entire project. It visually ties together the approach, the key features, the tools used (scikit-learn, Python), the central leaderboard, and the final results of the winning model.

Interpretation:

  • Approach: Showcases the multi-model, comparative methodology.
  • Key Features Highlight: Re-emphasises Age, Balance, and Salary as the critical data points.
  • The Leaderboard: Visually confirms that the complex ensemble models (GBM, Random Forest) significantly outperformed the linear models (LogReg, SVM). The sequential perfectionism of GBM allowed it to edge out Random Forest.
  • Results Panel: Highlights the two final competing metrics: the impressive 86.75% overall accuracy and the critical 49% Recall.

Business Insight: Human behaviour is complex and rarely follows a straight line. Simple models (LogReg/SVM) were "blind" to non-linear patterns. GBM's victory confirms that capturing complex, branching "if-then" relationships is necessary to successfully identify at-risk bank customers.

πŸ† Model Evaluation & Comparison (The Verdict)

While Overall Accuracy (86.75%) is a strong headline number, it is not the most important metric for churn prediction. In imbalanced datasets like this (where only 20% of customers leave), you can achieve 80% accuracy just by predicting "Everyone stays".

The True "North Star" is RECALL (our successful catch rate for churners).
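The "everyone stays" trap is easy to demonstrate with a few lines, using the 20% churn rate described above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 customers, 20% churners (label 1), matching the class balance above
y_true = np.array([1] * 200 + [0] * 800)

# Naive baseline: predict "everyone stays"
y_naive = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_naive))  # 0.8 -- looks strong
print(recall_score(y_true, y_naive))    # 0.0 -- catches zero churners
```

An 80% accurate model that never flags a single at-risk customer is useless for retention, which is exactly why Recall is the North Star here.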

Final Project Leaderboard


Model Evaluation Leaderboard

Final Reasoning: Why GBM Wins for Production

The Gradient Boosting Machine (GBM) is the superior choice for deployment. It achieved the highest Recall (0.49), identifying 7 more actual churners than the Random Forest (192 total vs. 185).

In banking retention, the cost of missing a customer about to leave (False Negative) is vastly greater than the minimal cost of sending a personalised discount to someone who was going to stay anyway (False Positive). The GBM model provides the bank with the maximum opportunity to intervene and secure loyalty.

πŸ“ˆ Tools & Technologies

  • Language: Python
  • Data Manipulation: Pandas, NumPy
  • Visualisation: Matplotlib, Seaborn
  • Machine Learning (scikit-learn): Logistic Regression, Linear SVC, KNeighborsClassifier, RandomForestClassifier, GradientBoostingClassifier
  • Evaluation Metrics: classification_report, confusion_matrix, accuracy_score