Predicting Customer Churn Using XGBoost: A Comprehensive Guide

Table of Contents

  1. Introduction

  2. Understanding the Dataset

  3. Setting Up the Environment

  • Clone the GitHub Repository

  • Install Dependencies

  • Load the Dataset

  • Run the Jupyter Notebook

  4. Data Preprocessing

  • Handling Missing Data and Categorical Variables

  • Correcting Numerical Data Formats

  • Feature Scaling

  5. Model Building

  • Splitting the Data

  • Training the XGBoost Classifier

  • Evaluating the Model

  6. Hyperparameter Tuning

  • Setting Up GridSearchCV

  • Evaluating the Tuned Model

  7. Conclusion

  8. Next Steps

  • Experiment with Additional Features

  • Try Different Algorithms

  • Deploy the Model

  9. References

1. Introduction

In today’s highly competitive market, customer retention is as crucial as acquiring new customers. For subscription-based businesses, understanding and predicting customer churn — when a customer stops using a service — can significantly impact revenue. By leveraging machine learning techniques, companies can predict which customers are likely to churn and take proactive measures to retain them.

In this blog post, we’ll walk through a detailed process of building a machine learning model to predict customer churn using the XGBoost algorithm, known for its efficiency and performance in classification tasks. We will cover everything from data preprocessing, model building, and evaluation to hyperparameter tuning. The dataset used in this project is sourced from Kaggle, and by the end of this post, you’ll have a clear understanding of how to implement a churn prediction model for your own datasets.

2. Understanding the Dataset

The dataset for this project provides a rich set of features related to customer behavior, including:

  • Average Order Value: The average value of orders placed by the customer.

  • Discount Rates: The average discount the customer receives.

  • Product Views: The number of product pages viewed by the customer.

  • Session Details: Information about the customer’s interactions during their sessions.

The target variable in this dataset is Churn, a binary indicator (0 or 1) representing whether a customer has churned.

Dataset Overview:

  • File Name: data.csv

  • Number of Columns: 20

  • Key Features: average_order_value, discount_rate_per_visited_product, product_detail_view, location_code, etc.

  • Target Variable: Churn
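
Before any modeling, a quick pandas look at the file confirms its shape, column types, and class balance (a minimal sketch, assuming data.csv sits in the project root as described in Section 3):

import pandas as pd

# load the raw file and inspect its structure
df = pd.read_csv('data.csv')
print(df.shape)                     # rows and the 20 columns noted above
print(df.dtypes)                    # flags columns that need type fixes later
print(df['Churn'].value_counts())   # class balance of the target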

3. Setting Up the Environment

Before we dive into the model-building process, you need to set up your Python environment. This involves installing the necessary libraries and tools required to execute the code.

3.1 Clone the GitHub Repository

The first step is to clone the repository containing all the code and data for this project.

git clone https://github.com/Gayathri-Selvaganapathi/customer_churn_prediction.git
cd customer_churn_prediction

3.2 Install Dependencies

Install the required Python packages using the requirements.txt file.

pip install -r requirements.txt

3.3 Load the Dataset

Download the dataset from Kaggle and place the data.csv file in the root directory of the project.

3.4 Run the Jupyter Notebook

Open the Jupyter Notebook or JupyterLab and navigate to Customer_Churn_Prediction.ipynb. This notebook contains all the steps for data preprocessing, model building, and evaluation.
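
Assuming Jupyter is installed (pip install notebook or pip install jupyterlab if it is not), the notebook can be launched directly from the project root:

jupyter notebook Customer_Churn_Prediction.ipynb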

4. Data Preprocessing

Data preprocessing is a crucial step that prepares the dataset for model training. Proper preprocessing can greatly enhance model performance and ensure that the features fed into the model are relevant and correctly formatted.

4.1 Handling Missing Data and Categorical Variables

The dataset includes a variety of features, some of which are categorical and need to be converted into a format that the machine learning model can process. For example:

  • Location Code: Although stored as an integer, this column is really categorical (like a postal code). We convert it to a string so that downstream steps treat it as a category rather than a numeric quantity.

  • Yes/No Columns: Columns such as credit_card_info_save and push_status are binary categorical variables. These are converted to integers (0 and 1) to facilitate the model's learning process.

# treat location_code as a categorical label rather than a number
df['location_code'] = df['location_code'].astype(str)
# map the binary Yes/No columns to 1/0
df['credit_card_info_save'] = df['credit_card_info_save'].replace({'Yes': 1, 'No': 0})
df['push_status'] = df['push_status'].replace({'Yes': 1, 'No': 0})
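
XGBoost's scikit-learn interface expects numeric input unless its categorical support is explicitly enabled, so the string-typed location_code still needs a numeric encoding before training. One common option, sketched here as an assumption about the pipeline rather than taken verbatim from the notebook, is one-hot encoding with pandas:

# expand location_code into one 0/1 indicator column per code
df = pd.get_dummies(df, columns=['location_code'], prefix='loc')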

4.2 Correcting Numerical Data Formats

Some numerical columns use commas as decimal separators (a common European convention), and these need to be replaced with dots before the values can be parsed as floats. This step ensures the columns can be used in mathematical operations during model training.

# swap the decimal comma for a dot, then cast to float
df['average_order_value'] = df['average_order_value'].str.replace(',', '.').astype(float)
df['discount_rate_per_visited_product'] = df['discount_rate_per_visited_product'].str.replace(',', '.').astype(float)
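
Alternatively, if the decimal convention is known up front, pandas can parse it at load time; a hedged one-liner (note that decimal=',' applies to every column pandas tries to parse as numeric):

# treat ',' as the decimal separator while reading the file
df = pd.read_csv('data.csv', decimal=',')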

4.3 Feature Scaling

Feature scaling helps ensure that no feature dominates the input purely because of its magnitude. Following the project notebook, we use scikit-learn's Normalizer here. One caveat worth stating plainly: Normalizer rescales each sample (row) to unit norm rather than scaling each feature to a common range, and tree-based models such as XGBoost are largely insensitive to monotonic feature scaling in any case.

import pandas as pd
from sklearn.preprocessing import Normalizer

# Normalizer rescales each row (sample) to unit norm
scaler = Normalizer()
scaled_features = scaler.fit_transform(df[['average_order_value', 'discount_rate_per_visited_product']])
df_scaled = pd.DataFrame(scaled_features, columns=['average_order_value', 'discount_rate_per_visited_product'])
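
If true per-feature range scaling is what you want (each column squeezed into [0, 1]), MinMaxScaler is the usual alternative; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# scale each column independently to the [0, 1] range
minmax = MinMaxScaler()
df[['average_order_value', 'discount_rate_per_visited_product']] = minmax.fit_transform(
    df[['average_order_value', 'discount_rate_per_visited_product']])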

5. Model Building

With our data preprocessed and ready, we can now focus on building the model. The XGBoost classifier is a powerful tool that uses gradient boosting techniques to achieve high accuracy, especially for structured data.

5.1 Splitting the Data

Before training the model, we need to split the dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.

from sklearn.model_selection import train_test_split

# separate the features (X) from the binary target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']
# hold out a third of the rows; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

5.2 Training the XGBoost Classifier

We initialize the XGBoost classifier and train it on the training data. After training, we evaluate the model on the test set.

import xgboost as xgb

# initialize the classifier with default hyperparameters and fit on the training set
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)
# predict churn labels for the held-out test set
y_pred = xgb_clf.predict(X_test)

5.3 Evaluating the Model

The model’s performance is first summarized with the accuracy score, the proportion of correct predictions. Initially, the model achieves an accuracy of 91.54%. Because churn datasets are often imbalanced, accuracy alone can be misleading, so a confusion matrix and per-class report are sketched after the snippet below.

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Initial Model Accuracy: {accuracy * 100:.2f}%")

6. Hyperparameter Tuning

Hyperparameter tuning involves adjusting the model’s parameters to optimize performance. XGBoost offers several hyperparameters that can be fine-tuned to improve the model’s accuracy.

6.1 Setting Up GridSearchCV

We use GridSearchCV to systematically test different combinations of hyperparameters. The parameters tuned include max_depth, learning_rate, gamma, and subsample.

from sklearn.model_selection import GridSearchCV

# grid of candidate values for four influential XGBoost hyperparameters
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'gamma': [0, 1, 5],
    'subsample': [0.8, 1.0]
}
# 3-fold cross-validation over all 54 combinations, scored by accuracy
grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
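
Once the search finishes, it is worth printing the winning combination (the exact values depend on the data and the random seed):

# best parameter combination found by the grid search
print(grid_search.best_params_)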

6.2 Evaluating the Tuned Model

After hyperparameter tuning, the best cross-validated accuracy improves to 92.72%, demonstrating the effectiveness of fine-tuning in enhancing model performance.

# best_score_ is the mean cross-validation accuracy of the best parameter combination
final_accuracy = grid_search.best_score_
print(f"Final Model Accuracy after Tuning: {final_accuracy * 100:.2f}%")

7. Conclusion

Predicting customer churn is a vital aspect of maintaining a strong customer base in subscription-based businesses. By building a machine learning model using XGBoost, we were able to predict customer churn with an accuracy of over 92%. This project highlights the importance of data preprocessing, feature scaling, and hyperparameter tuning in developing robust machine learning models.

The techniques and methods demonstrated in this project can be applied to various business cases, making XGBoost a versatile tool for classification problems well beyond churn prediction.

8. Next Steps

If you’re interested in exploring this project further, consider the following:

  1. Experiment with Additional Features: Incorporate more features from the dataset or external sources to improve model performance.

  2. Try Different Algorithms: Compare XGBoost’s performance with other classification algorithms like Random Forest, SVM, or Neural Networks.

  3. Deploy the Model: Once satisfied with the model’s performance, deploy it into a production environment using tools like Flask, Django, or FastAPI (a minimal FastAPI sketch follows this list).
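
As a taste of option 3, here is a minimal, illustrative FastAPI sketch. The endpoint name, feature subset, and model path are hypothetical; adapt them to your trained pipeline:

import pandas as pd
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# hypothetical path: the tuned model saved earlier with save_model('churn_model.json')
model = xgb.XGBClassifier()
model.load_model('churn_model.json')

class Customer(BaseModel):
    # hypothetical subset of the 20 features; list every model feature in practice
    average_order_value: float
    discount_rate_per_visited_product: float
    product_detail_view: int

@app.post('/predict')
def predict(customer: Customer):
    # build a one-row frame matching the training feature layout
    features = pd.DataFrame([customer.dict()])
    churn_probability = float(model.predict_proba(features)[0][1])
    return {'churn_probability': churn_probability}

Run it with uvicorn main:app --reload (assuming the code is saved as main.py); the endpoint then returns a churn probability for each posted customer record.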

9. References

1. Kaggle Dataset

2. My GitHub Repo

3. Referred model