R for UX Researchers Series: Article #6
Tutorial: Customer Lifetime Value with Survival Analysis
Summary: Learn how to use R and RStudio to perform survival analysis and uncover the factors that influence customer lifetime value (CLTV). This article covers data preparation, building and evaluating a Cox Proportional Hazards model and creating visually appealing plots to support data-driven decision-making.
The Scenario
During a recent project with a subscription-based insurance company, the Director of Customer Success approached me and asked if I could answer this question:
"How long do customers typically remain active, and what factors influence their lifetime value to the company?"
Because their website tracks user interactions, subscription status, and insurance policy metrics, I knew that answering this question would require what's called in the industry a 'survival analysis.' I looked up a couple of different survival analysis methods, and the Cox Proportional Hazards model seemed like the perfect fit to tackle this problem. This method allows us to model the time until an event occurs (in this case, customer churn) and understand the influence of various factors on this duration.
Like all scenarios in this series, using R and RStudio for this analysis was a no-brainer due to their powerful data-handling capabilities and advanced visualization options. For complex analyses like this, relying on Excel or Google Sheets would not suffice.
Why Not Just Use Excel or Google Sheets?
To figure out customer lifetime value and understand the factors influencing it, we need software that can handle complex data and build sophisticated statistical models. Here's why R and RStudio are the superior tools for this task:
Advanced Statistical Modeling: Building a Cox Proportional Hazards model in Excel or Google Sheets is nearly impossible. These tools lack the built-in capabilities for advanced statistical modeling that R provides.
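Handling Large and Complex Data: R handles large datasets more comfortably than Excel or Google Sheets. An Excel worksheet, for example, is capped at 1,048,576 rows, while R works with whatever fits in your computer's memory.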
Advanced Data Cleaning: R is better for data prep. It's way quicker to handle missing values, identify outliers, and transform data in R than Excel or Google Sheets.
Model Validation and Tuning: R provides tons of options for validating and tuning models. For example, tools like cross-validation and hyperparameter tuning are necessary for creating good models and are not available in Excel or Google Sheets.
Reproducibility and Automation: R scripts can automate the entire analysis process, ensuring consistency and saving time. This is especially useful for something like this, where you need to update analyses regularly or apply the same methods to evolving datasets.
Advanced Visualization: With R, you can produce visualizations that can be far more complex and informative than those typically produced in Excel or Google Sheets.
In this tutorial, we'll walk through setting up the analysis, interpreting the results, and creating data visualizations in RStudio. By the end, you'll be able to provide data-driven insights into customer lifetime value and support strategic decision-making. Let's go!
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Customer Lifetime Value Dataset from Kaggle.
Visit the Customer Lifetime Value Dataset page on Kaggle.
If you don't already have a Kaggle account, create one and log in.
On the dataset page, click the Download button to get the dataset files.
Save the downloaded file, titled train_BRCpofr.csv, on your computer.
✏️ NOTES:
We will not be using the other files that came with train_BRCpofr.csv in the Kaggle download in this tutorial.
Make a note of the full path to the train_BRCpofr.csv file on your computer. You'll need that path to import the dataset in Step 5 below.
Step 2: Start a New RStudio Project
Open RStudio.
Go to File > New Project > New Directory > New Project.
Name your project (e.g., "Customer_Lifetime_Value_Analysis") and choose a location on your computer.
Click Create Project.
Set up your project structure: Within your new project directory folder on your computer, create a new folder and name it data.
Move the downloaded train_BRCpofr.csv file into the data folder you just created.
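If you would rather do the folder setup from the RStudio Console, here is a minimal base R sketch. It assumes your working directory is the project folder (the default when an RStudio project is open). The names data_dir, csv_path, and file_found are illustrative only.

```r
# Create the data folder inside the project directory if it is not there yet
data_dir <- "data"
if (!dir.exists(data_dir)) dir.create(data_dir)

# Build the path to the CSV and report whether the file has been moved into place
csv_path <- file.path(data_dir, "train_BRCpofr.csv")
file_found <- file.exists(csv_path)
message("Dataset found at ", csv_path, ": ", file_found)
```

If the message reports FALSE, the file is still in your downloads folder or the working directory is not the project folder.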

Step 3: Install Necessary Packages
Install the required packages by copying and pasting the code snippet below into the RStudio Console:
install.packages("tidyverse")
install.packages("survival")
install.packages("survminer")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("corrplot")
install.packages("cvTools")
✏️ NOTE: Click the Yes button if any modals pop up while you are installing these packages.
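If you rerun this tutorial often, you may not want to reinstall everything each time. The sketch below is one possible way to install only what is missing. The variable names required and missing_pkgs are my own, and the interactive() guard is an assumption I added so nothing is downloaded when the script runs unattended.

```r
# Packages used in this tutorial
required <- c("tidyverse", "survival", "survminer", "ggplot2",
              "dplyr", "corrplot", "cvTools")

# requireNamespace() returns FALSE for any package that is not installed
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

# Install only the missing packages, and only in an interactive session
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```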
Step 4: Load the Libraries
Next, load the necessary libraries in your R script:
library(tidyverse)
library(survival)
library(survminer)
library(ggplot2)
library(dplyr)
library(corrplot)
library(cvTools)
✏️ NOTE: Disregard the Conflicts section shown in the Console.
Step 5: Import the Dataset
Now, import the dataset into RStudio using the following code snippet. Adjust the file path as needed based on where you saved the dataset.
# Import the dataset
file_path <- "path/to/your/dataset/data/train_BRCpofr.csv"
data <- read_csv(file_path) # read_csv() comes from readr, which is loaded with tidyverse
# View the first few rows of the dataset
head(data)
✏️ NOTES:
Replace "path/to/your/dataset" with the full path to your project folder on your own machine.
I added head(data) so you can see the first few rows of your data in the Console. This is helpful for verifying that the dataset has been imported correctly and understanding the initial structure and content of your data before proceeding with further analysis.
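Beyond eyeballing head(data), it can help to confirm that the columns the later steps rely on are actually present. The sketch below shows the idea on a two-row stand-in file with made-up values, so it runs without the Kaggle download. The column list is taken from the code used later in this tutorial, and base R's read.csv is used only to keep the sketch free of package dependencies.

```r
# Hypothetical two-row stand-in for train_BRCpofr.csv (values are made up)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy",
  "Male,Urban,Bachelor,5L-10L,1,5,5790,More than 1,A,Platinum",
  "Female,Rural,High School,2L-5L,0,8,0,1,A,Gold"
), toy_path)
toy <- read.csv(toy_path, stringsAsFactors = FALSE)

# Columns referenced in the cleaning, survival object, and modeling steps
expected_cols <- c("gender", "area", "qualification", "income", "marital_status",
                   "vintage", "claim_amount", "num_policies", "policy", "type_of_policy")

# Stop early with a readable message if anything is missing
missing_cols <- setdiff(expected_cols, names(toy))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
str(toy)
```

With the real file, you would run the same check on data instead of toy.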
Step 6: Clean the Data
Data cleaning is a critical step that prepares the dataset for further analysis by addressing inconsistencies and ensuring that the data is in a usable format. For now, just copy/paste this code snippet into RStudio. I’ll be writing an entire article about advanced data cleaning soon. Learn a little more about data cleaning in Article #5 in this series.
# Clean the data
# Note: survival_time and event are not created until Step 8, so this step
# filters on the raw columns they are derived from (vintage and claim_amount)
data <- data %>%
  filter(!is.na(vintage) & !is.na(claim_amount) & !is.na(gender) & !is.na(area) &
         !is.na(qualification) & !is.na(income) & !is.na(marital_status) &
         !is.na(num_policies) & !is.na(policy) & !is.na(type_of_policy)) %>%
  mutate(
    gender = as.factor(gender),
    area = as.factor(area),
    qualification = as.factor(qualification),
    marital_status = as.factor(marital_status),
    policy = as.factor(policy),
    type_of_policy = as.factor(type_of_policy)
  )
# View the first few rows of the cleaned data
head(data)
Step 7: Perform Exploratory Data Analysis (EDA)
Now, let's explore the dataset to understand its structure and identify any patterns. Learn more about why I almost always run an EDA in my article R for UX Researchers Series: Article #2.
# Summary statistics
summary(data)
# Visualize the distribution of customer vintage
# (survival_time is derived from vintage later, in Step 8)
ggplot(data, aes(x = vintage)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
  labs(title = "Customer Vintage Distribution", x = "Vintage (Years)", y = "Frequency") +
  theme_minimal()
# Visualize the distribution of policy types
ggplot(data, aes(x = policy)) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Policy Distribution", x = "Policy", y = "Count") +
  theme_minimal()
📘 Explanation: This EDA generates summary statistics, policy distribution, claim amount by gender, customer vintage distribution, and a correlation matrix to better understand the dataset. Let’s explore each in a little more depth:
Summary Statistics
This provides an overview of the dataset and highlights key metrics such as the mean, median, and range of each variable. Here’s what we learned about this dataset:
Gender: There are 38,895 females and 50,497 males in the dataset.
Area: There are 26,937 entries from rural areas and 62,455 from urban areas.
Qualification: The dataset includes 39,399 individuals with a Bachelor’s degree, 46,247 with a high school education, and 3,746 with other qualifications.
Income: The income is categorized into four groups: <=2L (1,854), 2L-5L (21,158), 5L-10L (52,716), and More than 10L (13,664).
Marital Status: There are 37,948 entries with a marital status of 0 and 51,444 with a marital status of 1.
Vintage: The vintage (customer tenure in years) ranges from 0 to 8 years, with a mean of 4.596 years.
Claim Amount: Claim amounts range from 0 to 31,894, with a mean of 4,352.
Number of Policies: There are 29,129 customers with 1 policy and 60,263 customers with more than one policy.
Policy: The policy types are distributed as follows: A (56,644), B (24,635), and C (8,113).
Type of Policy: The type of policy distribution includes Gold (20,586), Platinum (47,796), and Silver (21,010).
CLTV: Customer Lifetime Value ranges from 24,828 to 724,068, with a mean of 97,953.
Policy Distribution
This bar plot shows the distribution of different policy types in the dataset. It helps us see the frequency of each policy type in the dataset.
Claim Amount by Gender
This box plot compares the distribution of claim amounts between genders. It shows the median, quartiles, and potential outliers for claim amounts within each gender.
Customer Vintage Distribution
This histogram visualizes how long customers have been with the company. It helps us see the distribution of customer loyalty in the dataset.
Correlation Matrix
Use this visual to explore the relationships between numerical variables in the dataset. It helps identify potential multicollinearity issues and understand the relationships between variables.
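The snippet in this step does not include code for the claim-amount box plot or the correlation matrix described above. Here is a minimal base R sketch of both on simulated stand-in data (the values are random, not taken from the Kaggle file). The column name cltv is an assumption on my part; check names(data) for the exact spelling in your download.

```r
# Simulated stand-in data; the real dataset would be used in practice
set.seed(42)
toy <- data.frame(
  gender       = sample(c("Female", "Male"), 200, replace = TRUE),
  vintage      = sample(0:8, 200, replace = TRUE),
  claim_amount = round(runif(200, 0, 30000)),
  cltv         = round(runif(200, 25000, 700000))  # assumed column name
)

# Box plot of claim amount by gender (median, quartiles, and outliers)
bp <- boxplot(claim_amount ~ gender, data = toy,
              main = "Claim Amount by Gender",
              xlab = "Gender", ylab = "Claim Amount")

# Correlation matrix of the numeric columns only
num_cols <- toy[sapply(toy, is.numeric)]
cor_matrix <- cor(num_cols)
print(round(cor_matrix, 2))
```

Since corrplot is already loaded in Step 4, passing cor_matrix to corrplot() would give a visual version of the same table.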
I hope you can see how this information gives us the confidence to approach the modeling process with a clear understanding of what the data looks like. Now we are all set up to define the survival time and event occurrence, the data objects we’ll need for the rest of the analysis.
Step 8: Create the Survival Object
It's time to create a survival object that will be used for the survival analysis. This involves defining the survival time and event occurrence.
# Create the survival object
data$survival_time <- data$vintage * 12 # Assuming vintage is in years, convert to months
data$event <- ifelse(data$claim_amount > 0, 1, 0) # Event indicator (1 = churn, 0 = no churn)
# View the first few rows to verify the new columns
head(data)
📘 Explanation: In this step, we prepare our data for survival analysis by defining the survival time and event indicators. We assume the 'vintage' column represents the number of years a customer has been with the company, and we multiply it by 12 so the time axis reads in months. The 'event' column flags a customer as churned (1) when 'claim_amount' is greater than 0 and as active (0) otherwise; in other words, this tutorial uses a filed claim as its stand-in for churn. By viewing the first few rows, we confirm that the new columns were added correctly before moving on to the survival analysis.
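To see what the survival object itself looks like, here is a small sketch on four made-up customers. It applies the same two transformations as above and then wraps the result in Surv() from the survival package. The toy values are hypothetical.

```r
library(survival)

# Four hypothetical customers: tenure in years and total claim amount
toy <- data.frame(vintage = c(1, 3, 5, 8),
                  claim_amount = c(0, 1200, 0, 4500))

# Same definitions as in this step
toy$survival_time <- toy$vintage * 12               # years to months
toy$event <- ifelse(toy$claim_amount > 0, 1, 0)     # 1 = event, 0 = censored

# Surv() pairs each time with its event indicator
surv_obj <- Surv(toy$survival_time, toy$event)
print(surv_obj)
```

When printed, censored observations (event = 0) carry a trailing plus sign, which is a quick way to check that the indicator is coded the way you intended.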
Step 9: Fit the Model
Next, we'll fit the Cox Proportional Hazards model to our data. This model will help us understand the relationship between the survival time and various predictor variables. The coxph function from the survival package will be used to fit the model.
# Fit the Cox Proportional Hazards model
cox_model <- coxph(Surv(survival_time, event) ~ gender + area + qualification + income + marital_status + num_policies + policy + type_of_policy, data = data)
# View the model summary
summary(cox_model)
📘 Explanation: By fitting the model, we aim to identify which factors significantly impact customer lifetime value. The summary of the model reports a coefficient for each variable along with its hazard ratio, exp(coef). A hazard ratio above 1 means a higher churn risk (and so a shorter expected survival time) relative to the reference group, and a value below 1 means a lower risk. This step is crucial as it allows us to quantify the influence of different predictors on customer churn, enabling more targeted strategies for improving customer retention.
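If you want the hazard ratios and their confidence intervals in one compact table, something like the sketch below may help. It fits a two-predictor model on simulated data, so the numbers mean nothing; only the pattern matters. The names toy, fit, and hr_table are illustrative.

```r
library(survival)

# Simulated stand-in data with a higher event rate for urban customers
set.seed(123)
n <- 500
toy <- data.frame(
  area   = factor(sample(c("Rural", "Urban"), n, replace = TRUE)),
  policy = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$survival_time <- rexp(n, rate = ifelse(toy$area == "Urban", 0.04, 0.02))
toy$event <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(survival_time, event) ~ area + policy, data = toy)

# Hazard ratios with 95% confidence intervals (exponentiated coefficients)
hr_table <- cbind(hazard_ratio = exp(coef(fit)), exp(confint(fit)))
print(round(hr_table, 2))
```

A confidence interval that spans 1 suggests the data cannot distinguish that group from the reference group.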
Step 10: Evaluate the Model
When you run this code snippet, it will generate 8 plots, one for each predictor variable in our Cox Proportional Hazards model. These plots help us ensure that our model assumptions hold true.
# Check proportional hazards assumption
cox.zph_test <- cox.zph(cox_model)
print(cox.zph_test)
# Plot the scaled Schoenfeld residuals
plot(cox.zph_test)
📘 Explanation: Each of the 8 plots corresponds to one predictor variable in the Cox Proportional Hazards model. In this case, these plots confirmed that our model is correctly specified, allowing us to move forward with the analysis. Here’s how to interpret these plots:
Random Scatter Around Zero: If the residuals are randomly scattered around zero without any discernible pattern or trend, it indicates that the proportional hazards assumption holds for that predictor. In this case, all the plots indicate that the residuals are randomly scattered around zero, which means we are good to move forward.
Trend Over Time: If there is a noticeable trend (e.g., upward or downward) in the residuals over time, it suggests that the proportional hazards assumption might be violated for that predictor. This means the effect of that predictor on survival is not constant over time. In this case we see no trends, which means everything is all good. Woot, woot!
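Reading residual plots by eye can be subjective, so it may be worth checking the test table that print(cox.zph_test) produces as well. The sketch below, on simulated data, pulls the p-values out of that table and lists any terms below 0.05. The 0.05 cutoff is a common convention I am assuming here, not something prescribed by this tutorial.

```r
library(survival)

# Simulated stand-in data
set.seed(123)
n <- 500
toy <- data.frame(
  area   = factor(sample(c("Rural", "Urban"), n, replace = TRUE)),
  policy = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$survival_time <- rexp(n, rate = ifelse(toy$area == "Urban", 0.04, 0.02))
toy$event <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(survival_time, event) ~ area + policy, data = toy)
zph <- cox.zph(fit)

# The table has one row per term plus a GLOBAL row; small p-values
# point to a possible violation of the proportional hazards assumption
p_values <- zph$table[, "p"]
flagged <- names(p_values)[p_values < 0.05]
print(round(p_values, 3))
print(flagged)
```

An empty flagged vector is consistent with the assumption holding, though with very large samples even trivial departures can produce small p-values.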
Step 11: Visualize the Survival Curves
Now that we have fitted our Cox Proportional Hazards model, it's time to visualize the survival curves. These curves help us understand how different policies impact customer survival over time. We will use the ggsurvplot function from the survminer package to create these plots.
# Fit survival curves for policy
surv_policy <- survfit(Surv(survival_time, event) ~ policy, data = data)
# Plot survival curves by policy
ggsurvplot(surv_policy, data = data, conf.int = TRUE, pval = TRUE, risk.table = TRUE,
ggtheme = theme_minimal(), title = "Survival Curves by Policy",
xlab = "Time (Months)", ylab = "Survival Probability",
palette = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2"))
📘 Explanation: Visualizing the survival curves allows us to compare the survival probabilities of customers under different policies over time.
Here we see the survival curves for three different insurance policy types (A, B, and C). The x-axis represents time in months, and the y-axis represents the survival probability.
Real World Interpretation
Policy A (blue line) has the highest survival probability over time, suggesting that customers with this policy tend to stay with the company longer compared to the other policies.
Policy B (orange line) shows a moderate survival probability, indicating a higher churn rate than Policy A but lower than Policy C.
Policy C (green line) has the lowest survival probability, indicating that customers with this policy are more likely to churn earlier.
The p-value displayed on the plot (p < 0.0001) indicates that the differences in survival probabilities among the policies are statistically significant.
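The numbers behind a plot like this can also be pulled out directly. In the sketch below (simulated data again), the median survival time per policy comes from the survfit summary, and the p-value is recomputed with a log-rank test via survdiff(), which is the test ggsurvplot reports by default when pval = TRUE.

```r
library(survival)

# Simulated stand-in data
set.seed(123)
n <- 500
toy <- data.frame(policy = factor(sample(c("A", "B", "C"), n, replace = TRUE)))
toy$survival_time <- rexp(n, rate = 0.03)
toy$event <- rbinom(n, 1, 0.8)

# Kaplan-Meier curves by policy and the median survival time of each group
km <- survfit(Surv(survival_time, event) ~ policy, data = toy)
medians <- summary(km)$table[, "median"]
print(medians)

# Log-rank test for a difference between the curves
lr <- survdiff(Surv(survival_time, event) ~ policy, data = toy)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(p_value)
```

Reporting the medians alongside the plot gives stakeholders a concrete "months until half the group has churned" figure for each policy.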
Step 12: Evaluate Feature Importance
Understanding which features are most influential in predicting customer churn is crucial for strategic decision-making. We will extract the feature importance from the Cox Proportional Hazards model and visualize it using a bar plot.
# Extract variable importance from the Cox model
importance <- summary(cox_model)$coefficients[, "exp(coef)"]
# Create a data frame for plotting
importance_df <- data.frame(Feature = names(importance), Importance = importance)
# Plot variable importance
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "#1f77b4", color = "black") +
coord_flip() +
labs(title = "Feature Importance from Cox Proportional Hazards Model", x = "Features", y = "Importance") +
theme_minimal() +
theme(legend.position = "top")
📘 Explanation: This step evaluates and visualizes the importance of each feature in predicting customer churn. By examining the hazard ratios, exp(coef), from the Cox Proportional Hazards model, we can identify which variables have the largest effect on survival time.
Here you can see the x-axis represents the importance of each feature (its hazard ratio), and the y-axis lists the features themselves.
Real World Interpretation
areaUrban: This feature has the highest importance, suggesting that whether a customer lives in an urban area significantly impacts their likelihood of churning. Customers in urban areas might have different expectations or access to alternative services that influence their retention.
qualificationHigh School and qualificationOthers: These educational qualifications also have substantial effects, indicating that education level impacts customer retention. This insight could guide targeted marketing or service adjustments.
genderMale: The gender of the customer is another critical factor, with male customers showing different churn rates compared to females.
type_of_policyPlatinum and type_of_policySilver: The type of insurance policy held by the customer significantly affects churn, highlighting the importance of product offerings in customer retention.
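One caveat with ranking by exp(coef): features that lower churn have hazard ratios below 1, so they land at the bottom of the chart even when their effect is strong. As an alternative view (my suggestion, not the method used above), the sketch below ranks features by the absolute z statistic from the model summary, which treats protective and harmful effects symmetrically. It runs on simulated data.

```r
library(survival)

# Simulated stand-in data
set.seed(123)
n <- 500
toy <- data.frame(
  area   = factor(sample(c("Rural", "Urban"), n, replace = TRUE)),
  policy = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$survival_time <- rexp(n, rate = ifelse(toy$area == "Urban", 0.04, 0.02))
toy$event <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(survival_time, event) ~ area + policy, data = toy)
coefs <- summary(fit)$coefficients

# Rank by |z| so strong effects in either direction rise to the top
ranking <- data.frame(
  feature      = rownames(coefs),
  hazard_ratio = coefs[, "exp(coef)"],
  abs_z        = abs(coefs[, "z"]),
  row.names    = NULL
)
ranking <- ranking[order(-ranking$abs_z), ]
print(ranking)
```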
Step 13: Predicted Risk Scores vs Actual Events
The final step in our analysis involves predicting risk scores using the Cox Proportional Hazards model and comparing these predicted risk scores to the actual event occurrences. This step helps us evaluate the model's performance and understand how well it can differentiate between customers who are likely to churn and those who are not.
# Predict the risk scores
risk_scores <- predict(cox_model, type = "risk")
# Combine with the actual event data
pred_data <- data.frame(RiskScore = risk_scores, Event = data$event)
# Plot predicted risk scores vs actual events
ggplot(pred_data, aes(x = RiskScore, fill = as.factor(Event))) +
geom_histogram(binwidth = 0.1, position = "dodge", color = "black") +
scale_fill_manual(values = c("0" = "#1f77b4", "1" = "#ff7f0e"), name = "Event") + # Using distinct colors
labs(title = "Predicted Risk Scores vs Actual Events", x = "Predicted Risk Score", fill = "Event") +
theme_minimal() +
theme(legend.position = "top")
📘 Explanation: In this step, we use the Cox model to predict risk scores for each customer. With type = "risk", each score is a relative hazard: a score of 1 matches the model's reference point, and higher scores mean a higher risk of the event (i.e., churn). By comparing these predicted scores with the actual event data, we can assess the model's predictive power.
The histogram visualizes the distribution of predicted risk scores for customers who churned (Event = 1) versus those who did not churn (Event = 0).
Real World Explanation and Interpretation
This plot shows us how well our model differentiates between customers who are likely to churn and those who are not. The orange bars represent customers who churned, while the blue bars represent those who did not. A clear separation between the blue and orange bars would indicate a strong predictive model.
From the plot, we can see that customers with higher risk scores (closer to 2.0) are more likely to churn, as indicated by the taller orange bars. Conversely, customers with lower risk scores (closer to 0.5) are less likely to churn, as shown by the taller blue bars in that range.
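A histogram shows separation visually, but a single number can be easier to report. One common choice is the concordance index (C-index), where 0.5 is no better than chance and 1 is perfect ranking. The sketch below, on simulated data, reads it from the model summary and also compares the average risk score between the two event groups.

```r
library(survival)

# Simulated stand-in data
set.seed(123)
n <- 500
toy <- data.frame(
  area   = factor(sample(c("Rural", "Urban"), n, replace = TRUE)),
  policy = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$survival_time <- rexp(n, rate = ifelse(toy$area == "Urban", 0.04, 0.02))
toy$event <- rbinom(n, 1, 0.8)

fit <- coxph(Surv(survival_time, event) ~ area + policy, data = toy)

# The first element of the concordance entry is the C-index estimate
c_index <- unname(summary(fit)$concordance[1])
print(c_index)

# Average predicted risk score for each event group
risk <- predict(fit, type = "risk")
mean_risk_by_event <- tapply(risk, toy$event, mean)
print(mean_risk_by_event)
```

On the real model, you would swap fit for cox_model and toy for data.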
Interpretation of Results
Here are the key conclusions we can draw from our analysis:
Gender: Gender has a statistically significant impact on customer churn, with male customers having a slightly higher risk of churn than female customers.
Area: Customers in urban areas are more likely to churn compared to those in rural areas. This might be due to increased competition or different customer expectations in urban settings.
Qualification: Educational qualifications significantly influence churn rates, with high school graduates showing a higher risk of churn compared to those with other qualifications.
Income: Higher income levels are associated with lower churn rates, indicating that more affluent customers are more likely to stay with the service.
Marital Status: Married customers are slightly less likely to churn compared to single customers.
Number of Policies: Customers with more than one policy have a lower risk of churn, suggesting that bundling products might improve customer retention.
Policy Type: The type of policy held by customers significantly affects their likelihood of churn. Policy A has the highest survival probability, followed by Policy B, and then Policy C.
Type of Policy: Different policy types (Gold, Platinum, Silver) also impact churn rates, with Platinum and Silver policyholders showing different survival probabilities.
These insights help us understand the various factors that influence customer retention and can guide targeted interventions to reduce churn.
Responding to the Director of Customer Success's Question
Remember our stakeholder's original question: "How long do customers typically remain active, and what factors influence their lifetime value to the company?"
Here’s how we can respond based on what we’ve learned:
Customer Longevity: On average, customers remain active for approximately 4.6 years. However, the specific duration varies based on several factors.
Key Factors Influencing Customer Lifetime Value (CLTV):
Area: Customers in urban areas are more likely to churn earlier compared to those in rural areas.
Qualification: Customers with a high school education are at a higher risk of churn compared to those with a Bachelor's degree or other qualifications.
Income: Higher income levels are associated with lower churn rates, indicating that more affluent customers tend to stay longer.
Marital Status: Married customers have a slightly lower churn rate compared to single customers.
Number of Policies: Customers holding multiple policies are less likely to churn, suggesting the benefits of policy bundling.
Policy Type: The type of policy significantly affects churn rates, with Policy A showing the highest survival probability and Policy C the lowest.
Strategic Implications: To improve customer retention and enhance CLTV, we recommend targeted retention strategies focusing on high-risk groups, enhancing product offerings, implementing income-based segmentation, providing additional support to single customers, and continuously monitoring and adapting our strategies based on ongoing data analysis.
Imagine being able to report this back to your own stakeholders. This is a great example of the power of statistics.
What company wouldn’t value a UX researcher who could provide insights with this level of depth and actionability?
If you’d like to be taken seriously in your workplace, consider conducting these types of analyses on your own data.
Data-Driven Next Steps
Targeted Retention Strategies: Focus retention efforts on high-risk groups identified in our analysis. In this case, I’d recommend targeting customers in urban areas and those with lower educational qualifications. Personalized communication, tailored offers, and targeted marketing campaigns can help address their specific needs and concerns, thereby reducing churn rates.
Enhance Policy Bundling: Promote bundles of policies to encourage customers to hold multiple policies. This analysis shows that customers with more than one policy have a lower risk of churn. Creating attractive bundles or offering discounts for holding multiple policies could improve customer retention.
Income-Based Segmentation: Develop targeted marketing campaigns for different income groups. Our analysis indicates that higher-income customers are less likely to churn. Offering premium services, exclusive benefits, or loyalty programs to higher-income customers can enhance their loyalty and reduce churn rates.
Support Programs for Single Customers: Implement support programs or special offers specifically for single customers, who have a higher churn rate compared to married customers. Providing additional services or benefits tailored to single customers could improve their retention.
Policy-Specific Improvements: Evaluate and improve the features and benefits of different policy types (A, B, C). Our analysis shows that Policy A has the highest survival probability, while Policy C has the lowest. Enhancing the value proposition of policies with lower survival probabilities can make them more attractive to customers.
Continuous Monitoring and Adaptation: Continuously monitor customer behavior and churn rates. Use predictive modeling to identify at-risk customers early and take proactive measures to retain them. Regularly updating and refining the model based on new data will ensure its continued effectiveness.
Conclusion
In this tutorial, we've explored how to use R and RStudio to perform survival analysis on insurance company user data. We did this to uncover the factors influencing customer lifetime value for the company's insurance policies. By preparing and cleaning the dataset, fitting a Cox Proportional Hazards model, and visualizing the results, we gained valuable insights into customer behavior and identified key factors that impact churn rates.
Our analysis highlighted the importance of various demographic and policy-related factors in predicting customer churn. Armed with this information, you can develop targeted strategies to retain customers and improve their overall experience with your service.
Feedback
I hope you found this tutorial on analyzing customer lifetime value helpful. I encourage you to experiment with different datasets and models to further enhance your skills in R and survival analysis. If you have any questions, suggestions, or feedback, please feel free to reach out. Your input is valuable in helping me improve the articles in this series.
Thank you for following along!

















