R for UX Researchers Series: Article #4
Tutorial: Predicting User Churn with Decision Trees
Summary: This article covers data preparation, building and evaluating a decision tree predictive model, and creating visualizations that communicate actionable insights with real-world data.
The Scenario
I did some consultation work with a subscription-based service company; we meticulously tracked user interactions, subscription status, engagement metrics, and customer support questions. One day, the Director of Customer Success approached me with an important question:
"Can you predict which users will likely cancel their subscriptions next month? We want to proactively engage with these users, but we don't know which ones."
Given all of our user data, I was pretty sure this question could be answered with a predictive model. After some googling, I decided a decision tree would be the best type of model to answer the Director of Customer Success's question. A decision tree is like a flowchart where each internal node represents a decision based on an attribute, each "branch" represents the outcome of that decision, and each "leaf" node represents a final decision or classification. Because building one calls for real statistical modeling rather than spreadsheet formulas, using R and RStudio was a no-brainer for this project. When it comes to predictive modeling, I suggest not even considering Excel or Google Sheets as a solution.
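To make the flowchart idea concrete, here is a tiny hand-written "tree" in R. It is only a sketch: the feature names and thresholds are invented for illustration, and a real model learns its own rules from the data.

```r
# A hand-written "decision tree": each if() is an internal node,
# each branch is an outcome, and each returned label is a leaf.
# The feature names and thresholds are invented for illustration.
classify_user <- function(support_calls, tenure_months) {
  if (support_calls > 5) {
    "churn"                      # leaf: lots of support calls
  } else if (tenure_months < 6) {
    "churn"                      # leaf: few calls, but a very new user
  } else {
    "stay"                       # leaf: few calls and an established user
  }
}

classify_user(support_calls = 7, tenure_months = 24)
classify_user(support_calls = 1, tenure_months = 24)
```

The rpart package used later in this tutorial builds this kind of nested rule set automatically instead of us guessing the thresholds.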
Why Not Just Use Excel or Google Sheets?
To predict user churn, we need a solution that can handle complex data and build legit predictive models. Here's why R and RStudio are the ideal tools for this analysis:
Building Predictive Models: Building a decision tree in Excel or Google Sheets is nearly impossible. They don't have the built-in capabilities for advanced statistical modeling like R does. You'd need to jump through hoops with complex scripts and add-ons, and even then, it wouldn't be nearly as flexible or powerful. This is why my go-to for this kind of thing is R.
Handling Large and Complex Data: R manages and manipulates large datasets better than Excel and Google Sheets. An Excel worksheet, for example, tops out at 1,048,576 rows, while R is limited mainly by your computer's memory.
Advanced Data Cleaning: R provides powerful functions for data prep (recoding, reshaping, and filtering in a few lines of code) that are clumsy or manual in Excel and Google Sheets.
Model Validation and Tuning: R has tools that help you check how well your model works (cross-validation) and tweak its settings for the best performance (hyperparameter tuning). Cross-validation is like testing your model multiple times with different parts of the data to make sure it works well overall, not just on one specific sample. Hyperparameter tuning is adjusting the "settings" of your model to find the combination that gives the most accurate predictions.
Reproducibility and Automation: R scripts automate the entire process, ensuring consistency and saving time when updating analyses or applying them to different datasets.
Advanced Visualization: R packages like ggplot2 create detailed, customizable, and high-quality visualizations that are way better than what you can get out of Excel and Google Sheets.
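For the curious, here is a rough sketch of what cross-validation and hyperparameter tuning look like in code. It assumes the small kyphosis example dataset that ships with the rpart package (not the churn data), and the fold count and candidate depths are arbitrary choices of mine.

```r
library(rpart)  # kyphosis is a small example dataset bundled with rpart

set.seed(42)    # make the random fold assignment repeatable
k <- 5
folds <- sample(rep(1:k, length.out = nrow(kyphosis)))

# Cross-validated accuracy for one candidate value of maxdepth
cv_accuracy <- function(depth) {
  fold_acc <- sapply(1:k, function(i) {
    train <- kyphosis[folds != i, ]   # fit on k - 1 folds
    test  <- kyphosis[folds == i, ]   # check on the held-out fold
    fit   <- rpart(Kyphosis ~ Age + Number + Start, data = train,
                   method = "class",
                   control = rpart.control(maxdepth = depth))
    pred  <- predict(fit, newdata = test, type = "class")
    mean(pred == test$Kyphosis)
  })
  mean(fold_acc)
}

# Hyperparameter tuning: try several depths and keep the best one
depths  <- 1:4
results <- sapply(depths, cv_accuracy)
names(results) <- depths
results
best_depth <- depths[which.max(results)]
```

Each depth is scored on data the tree did not see while it was being fit, which is the whole point of cross-validation.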
In this tutorial, we'll walk through how to set up the analysis, interpret the results, and create appropriate data visualizations in RStudio. By the end, you'll be able to provide data-driven insights to help reduce user churn. Let's get started!
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Customer Churn Dataset from Kaggle:
Visit the Customer Churn Dataset page on Kaggle.
If you don't already have a Kaggle account, create one and log in.
On the dataset page, click the Download button to get the dataset files customer_churn_dataset-testing-master.csv and customer_churn_dataset-training-master.csv. Save the downloaded files on your computer.
✏️ NOTE: Make a note of the full paths to the customer_churn_dataset-testing-master.csv and customer_churn_dataset-training-master.csv files on your computer. You'll need these paths to import the datasets in step 5 below.
Step 2: Start a New RStudio Project
Open RStudio.
Go to File > New Project > New Directory > New Project.
Name your project (e.g., "User_Churn_Analysis") and choose a location on your computer.
Click Create Project.
Set Up Your Project Structure: Within your new project directory folder on your computer, create a new folder and name it data.
Move the downloaded customer_churn_dataset-testing-master.csv and customer_churn_dataset-training-master.csv files into the new data folder that is in your project directory folder. A file's location described from the project folder (for example, data followed by the file name) is referred to as the file's relative path.
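As a small sketch of what a relative path looks like in code, assuming the data folder and file names from the steps above and that your RStudio project is open (so the project folder is the working directory):

```r
# Build paths relative to the project folder.
# file.path() joins the pieces with "/" on every operating system.
train_path <- file.path("data", "customer_churn_dataset-training-master.csv")
test_path  <- file.path("data", "customer_churn_dataset-testing-master.csv")

train_path
file.exists(train_path)  # TRUE only once the file is actually in the data folder
```

If file.exists() returns FALSE, the file is not where R is looking, which is worth checking before you try to import anything.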

Step 3: Install Necessary Packages
Install the required packages by copy/pasting the code snippet below into the RStudio Console:
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyverse")
install.packages("corrplot")
install.packages("caret")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("pROC")
install.packages("scales")
Step 4: Load the Libraries
Next, in your R script, load the necessary libraries:
library(dplyr)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(caret)
library(rpart)
library(rpart.plot)
library(pROC)
library(scales)
✏️ NOTE: Disregard the Conflicts section shown in the Console.
Step 5: Import the Datasets
Now, let's import the datasets into RStudio using the code snippet below. Make sure to adjust the paths to the locations where you saved the datasets.
# Import the datasets
trainData <- read.csv("path/to/your/dataset/customer_churn_dataset-training-master.csv")
testData <- read.csv("path/to/your/dataset/customer_churn_dataset-testing-master.csv")
✏️ NOTE: Replace "path/to/your/dataset" with the full paths on your own machine to the CSV files.
Step 6: Clean the Data
Let's prepare the data for analysis by checking for missing values, converting categorical variables into factors, and dropping the ID column.
# Check for missing values in training and testing datasets
sum(is.na(trainData))
sum(is.na(testData))
# Convert categorical variables to factors in both datasets
trainData$Gender <- as.factor(trainData$Gender)
trainData$Subscription.Type <- as.factor(trainData$Subscription.Type)
trainData$Contract.Length <- as.factor(trainData$Contract.Length)
trainData$Churn <- as.factor(trainData$Churn)
testData$Gender <- as.factor(testData$Gender)
testData$Subscription.Type <- as.factor(testData$Subscription.Type)
testData$Contract.Length <- as.factor(testData$Contract.Length)
testData$Churn <- as.factor(testData$Churn)
# Drop CustomerID column from both datasets as it is not a predictor
trainData <- trainData %>% select(-CustomerID)
testData <- testData %>% select(-CustomerID)
💡 TIP: If at any time during these steps you get an error message, simply copy/paste that message verbatim into a Google search. You’ll likely find the fix in the first few result links. Also, checking forums like Stack Overflow can be very helpful for resolving common issues.
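If the missing-value check in this step returns anything other than 0, one simple option is to drop the incomplete rows. Here is a minimal sketch on a toy data frame; the values are invented and stand in for trainData, and dropping rows is just one of several reasonable ways to handle missing data.

```r
# Toy data frame standing in for trainData; the values are invented.
toy <- data.frame(
  Age           = c(34, NA, 51, 28),
  Support.Calls = c(2, 7, NA, 1),
  Churn         = c(0, 1, 1, 0)
)

colSums(is.na(toy))        # missing values per column
toy_clean <- na.omit(toy)  # drop every row that has at least one NA
nrow(toy_clean)            # rows that survived
```

Looking at colSums(is.na(...)) first tells you which columns the gaps are in, which helps you decide whether dropping rows is acceptable.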
Step 7: Perform Exploratory Data Analysis (EDA)
Now, let's explore the dataset to understand its structure and identify any patterns. This step involves summarizing and visualizing the main characteristics of the dataset.
# Summary statistics
summary(trainData)
# Bar plot of churn distribution
ggplot(trainData, aes(x = Churn)) +
geom_bar() +
labs(title = "Churn Distribution", x = "Churn", y = "Count")
# Correlation matrix for numerical variables
numeric_vars <- trainData %>% select_if(is.numeric)
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle")
📘 Explanation: The correlation matrix shown here reveals how different numerical features are related to each other. For instance, it can show if there is a strong positive correlation between 'Tenure' and 'Total Spend', indicating that customers who stay longer tend to spend more, which is a key insight for understanding customer value. In general, EDA helps uncover key insights early in the analysis, ensuring that the modeling efforts are well-informed and targeted. By understanding the underlying data structure, we can make more accurate predictions and derive meaningful insights from the decision tree and feature importance plots. This plot tells me we are good to start building our model.
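To get a feel for what the circles in the correlation plot stand for, here is a small sketch with made-up numbers (not the Kaggle data): spend is built to grow with tenure, while support calls are not related to tenure.

```r
# Invented numbers for illustration only
toy <- data.frame(
  tenure        = c(1, 4, 8, 12, 20, 30, 42, 60),
  total_spend   = c(20, 75, 170, 230, 410, 590, 860, 1190),
  support_calls = c(3, 0, 5, 1, 4, 0, 2, 3)
)

cor_toy <- cor(toy)   # same kind of matrix that corrplot() draws
round(cor_toy, 2)

cor_toy["tenure", "total_spend"]    # close to 1: strong positive relationship
cor_toy["tenure", "support_calls"]  # close to 0: little linear relationship
```

Values near 1 or -1 show up as large, dark circles in corrplot, and values near 0 as small, faint ones.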
Step 8: Build the Decision Tree Model
Finally, the moment we've all been waiting for. Time to build our decision tree model!
# Build the decision tree model
tree_model <- rpart(Churn ~ ., data = trainData, method = "class")
# View the model
printcp(tree_model)
📘 Explanation: Building the decision tree model involves fitting the model to the training data, where the tree learns how different features influence the target variable (churn). The rpart function creates the decision tree, and printcp prints its complexity parameter (CP) table: for each candidate tree size, it lists the number of splits, the error on the training data, and the cross-validated error.
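The table that printcp prints is also stored on the model object, so it can be used to trim the tree. The sketch below assumes the kyphosis example dataset bundled with rpart rather than the churn data, and picking the lowest cross-validated error is one common rule of thumb, not the only option.

```r
library(rpart)

set.seed(123)  # the cross-validated error column depends on random folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

cp_table <- fit$cptable  # the same table that printcp() prints
colnames(cp_table)       # CP, nsplit, rel error, xerror, xstd

# Pick the CP value with the lowest cross-validated error and prune to it
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

With the churn model, the same idea would apply to tree_model in place of fit.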
Step 9: Visualize the Decision Tree
Now that we have our model, let's visualize it!
# Plot the decision tree with enhanced readability
rpart.plot(tree_model,
main = "Decision Tree for Churn Prediction",
type = 4,
extra = 104,
box.palette = "RdYlGn",
shadow.col = "gray",
nn = TRUE,
cex = 0.7,
branch = 0.3,
split.cex = 1.2,
split.box.col = "lightgray",
split.border.col = "darkgray",
split.round = 0.5,
under.cex = 1.0,
digits = -1,
varlen = 0,
faclen = 0,
roundint = TRUE,
clip.right.labs = FALSE,
compress = TRUE)
📘 Explanation: This code is designed to be more user-friendly than some other stuff you’ll find online. For the geeks out there, here’s how I made it more understandable:
type = 4: This labels every node, not just the leaves, and draws a separate split label on each branch, making it clear where each split occurs.
extra = 104: This shows the probability of each class in every node, plus the percentage of all observations that fall in that node, making it clear how the tree makes decisions.
digits = -1: A negative value tells rpart.plot to format numbers with R's standard format function instead of its own compact notation.
varlen = 0 and faclen = 0: This ensures that variable and factor names are not truncated.
roundint = TRUE: For predictors whose values in the training data are all integers, this rounds split thresholds to integers (e.g., 3 instead of 2.5), making them easier to understand.
clip.right.labs = FALSE: This keeps the variable name on the right-hand split labels instead of clipping it.
compress = TRUE: This shifts nodes horizontally where space allows so the tree fits better within the plot window.
Example of Plot Interpretation
Node Labels: Each node shows the predicted class, the probability of each class, and the percentage of all observations that reach that node.
Branch Labels: These show the decision rule used to split the data at each node (e.g., "Tenure > 30"), with thresholds presented in a more readable format.
Node Numbers: These help reference specific nodes when discussing the tree structure.
Color Coding: The box color indicates the predicted class, and its intensity indicates how strongly the node leans toward that class. Check the class label in each box to confirm which color stands for churn in your plot.
This approach should make the decision tree plot more intuitive and visually appealing, so that even stakeholders without a deep understanding of data science can grasp the key findings. It shows how different features split the data, and the extra information on the nodes (class probabilities and percentages) makes the tree easier to interpret.
Step 10: Evaluate the Model
Now, we have to evaluate the model to see how well it predicts churn on data it has not seen before (the test dataset).
# Make predictions on the test data
predictions <- predict(tree_model, newdata = testData, type = "class")
# Ensure predictions are factors with the same levels as the actual values
predictions <- factor(predictions, levels = c("0", "1"))
# Align levels for testData$Churn
testData$Churn <- factor(testData$Churn, levels = c("0", "1"))
# Confusion matrix, treating churn ("1") as the positive class
conf_matrix <- confusionMatrix(predictions, testData$Churn, positive = "1")
print(conf_matrix)
📘 Explanation: This code ensures that the predicted values (predictions) and the actual values (testData$Churn) are factors with the same levels, which the confusionMatrix function needs in order to work correctly. Setting positive = "1" makes metrics such as sensitivity describe how well the model catches churners; by default, confusionMatrix treats the first level ("0") as the positive class.
With the levels matched, the confusion matrix can correctly compare the predicted and actual values, giving us a reliable assessment of the model's accuracy, which is what we are looking for here.
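If the confusion matrix output feels like a wall of numbers, it can help to build one by hand. This is a minimal sketch in base R with ten invented users (1 = churned, 0 = stayed); it is not the output of the churn model.

```r
# Invented labels for ten users: 1 = churned, 0 = stayed
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)
cm

tp <- cm["1", "1"]  # churners the model flagged
fp <- cm["1", "0"]  # loyal users flagged by mistake
fn <- cm["0", "1"]  # churners the model missed
tn <- cm["0", "0"]  # loyal users correctly left alone

accuracy  <- (tp + tn) / sum(cm)  # share of all users classified correctly
recall    <- tp / (tp + fn)       # share of real churners the model caught
precision <- tp / (tp + fp)       # share of flagged users who really churned
```

For a retention campaign, recall and precision are often more telling than accuracy, because they describe the churners specifically.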
Step 11: Feature Importance
This is the part where we find the actual insight: the model tells us which features are most important in predicting churn. In other words, this is the answer to the original question from the Director of Customer Success.
# Extract variable importance from the decision tree model
importance <- tree_model$variable.importance
# Create a data frame for plotting
importance_df <- data.frame(Feature = names(importance), Importance = importance)
# CustomerID was already dropped in Step 6; this filter is only a safeguard
importance_df <- importance_df[importance_df$Feature != "CustomerID", ]
# Plot variable importance, with the most important feature at the top
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance", x = "Features", y = "Importance")
📘 Explanation: Feature importance analysis shows which features contribute most to the model's predictions. This helps in understanding the key drivers of churn and can guide strategic decisions to address these factors. For example, features that rank near the top, such as 'Support Calls' and 'Total Spend' in this model, point us toward the most at-risk users so we can target them effectively.
Interpretation and Insights
Combining the Decision Tree plot and the Feature Importance plot provides a powerful visual summary of our predictive model's insights.
Decision Tree Plot Insights
Tenure: Users with shorter tenures might have higher churn rates.
Subscription Type: Certain subscription types are associated with higher churn.
Support Calls: A high number of support calls can indicate potential dissatisfaction, leading to higher churn.
Feature Importance Plot Insights
The Feature Importance plot highlights the features based on their importance scores as determined by the decision tree model. In this plot, the features 'Support Calls', 'Total Spend', 'Contract Length', 'Payment Delay', and 'Age' are shown in the order of their importance.
Reconciling the Insights
The apparent discrepancy between the Decision Tree plot and the Feature Importance plot arises due to the differences in how these visualizations represent feature importance and influence.
Decision Tree Plot: This plot shows the actual pathways and decisions made by the model to predict churn. It highlights features that play a crucial role in splitting the data and making predictions.
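Feature Importance Plot: This plot quantifies the overall importance of each feature in making predictions across all decision paths in the tree. In rpart, a feature also earns importance when it serves as a surrogate (backup) split, so it can rank highly without ever appearing in the plotted tree.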
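One way to see the difference between the two views is to compare the features that appear as split rules with the features that receive an importance score. This sketch uses the kyphosis example dataset bundled with rpart, not the churn data, so the feature names differ from the ones discussed here.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Features that actually appear as split rules in the tree
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Features that receive an importance score
imp         <- fit$variable.importance
scored_vars <- names(imp)

split_vars
round(100 * imp / sum(imp))       # importance as a rough percentage
setdiff(scored_vars, split_vars)  # scored but never drawn in the tree, if any
```

Running the same comparison on tree_model would show which churn features get credit only through surrogate splits.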
Why Some Features Appear More Important
The Feature Importance plot includes 'Support Calls', 'Total Spend', 'Contract Length', 'Payment Delay', and 'Age'. These features are highlighted due to their overall importance in making predictions across all decision paths in the tree.
Support Calls: Frequent support calls can be a strong indicator of user dissatisfaction, which is a significant predictor of churn.
Total Spend: Users who spend more might have higher expectations and could churn if those expectations are not met.
Contract Length: Longer contract lengths might be associated with lower churn as users are committed for a longer period.
Payment Delay: Delays in payment can signal potential churn as it indicates possible financial difficulties or dissatisfaction.
Age: Different age groups might have varying levels of engagement and loyalty, affecting churn rates.
These features provide actionable insights into user behavior and can guide strategies to reduce churn effectively.
Key Takeaway
Decision Tree Plot: Helps us understand specific decision rules and their impact on churn.
Feature Importance Plot: Provides an overall importance ranking, which can sometimes include features that are less actionable but still important from a model perspective.
Data-Driven Next Steps
Based on these insights, we can take several steps to proactively engage with users at risk of churning:
Personalized Retention Campaigns: Target users with high churn risk using personalized offers and incentives to retain them.
Enhance User Engagement: Develop features and content that cater to the needs of high-risk users, improving their overall experience and satisfaction.
Monitor Key Metrics: Continuously track and analyze the key drivers of churn to stay ahead of potential issues and adjust strategies accordingly.
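To turn these steps into an actual list of users to contact, the model can output a churn probability for each user instead of a yes/no label. The sketch below is only an illustration: the users data frame, its column names, and the rule that generates churn are all made up, and in practice you would score current users with the model trained earlier.

```r
library(rpart)

# Made-up users; the columns and the churn rule are invented for illustration
set.seed(1)
n <- 200
users <- data.frame(
  user_id       = 1:n,
  support_calls = sample(0:10, n, replace = TRUE),
  tenure_months = sample(1:60, n, replace = TRUE)
)
users$churn <- factor(
  ifelse(users$support_calls > 5 | users$tenure_months < 6, 1, 0),
  levels = c(0, 1)
)

fit <- rpart(churn ~ support_calls + tenure_months, data = users, method = "class")

# type = "prob" returns one column per class; column "1" is the churn probability
users$churn_prob <- predict(fit, newdata = users, type = "prob")[, "1"]

# The ten users with the highest predicted churn risk
at_risk <- head(users[order(-users$churn_prob), ], 10)
at_risk[, c("user_id", "support_calls", "tenure_months", "churn_prob")]
```

A ranked list like this is usually easier for a Customer Success team to act on than a confusion matrix.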
Conclusion
Congratulations! You have successfully performed a decision tree analysis to predict user churn using real-world data. We walked through the entire process, from setting up your RStudio project and importing the dataset to cleaning the data, building the model, and visualizing the results. Here are the key takeaways from this project:
Data Preparation:
Properly cleaning and preparing your data is crucial for accurate model building.
Exploratory Data Analysis (EDA) helps in understanding the dataset's structure and guiding the analysis.
Model Building:
Decision trees are powerful tools for predictive modeling and are particularly useful for their interpretability and ease of use.
Keeping separate training and testing datasets lets us check whether the model generalizes to new data instead of just memorizing the data it was trained on.
Model Evaluation:
Evaluating the model with a confusion matrix helps in assessing its performance and reliability.
Feature importance analysis provides insights into the key drivers of churn.
By using this tutorial as a template for analyzing your own data, you can leverage the insights gained from the decision tree model to create more personalized and effective user engagement strategies. Over time, this approach can help improve user retention and drive business success.
Feedback
How was this tutorial? Was it helpful? Easy to follow? Hard to follow? Please feel free to comment here or DM me with your feedback. Your input is invaluable as I continue to create content to help you leverage data in UX research. Thanks all!










