R for UX Researchers Series: Article #3
Tutorial: Segmenting Users with K-means Clustering
Summary: Learn how to use R and RStudio to segment users based on their shopping behavior using K-means clustering. This article covers data preparation, determining the optimal number of clusters, performing cluster analysis, and creating compelling visualizations to gain actionable insights and support your stakeholders with data.
The Scenario
As I explained in last week's article with the subtitle ‘A Step-by-Step Tutorial for Logistic Regression’, I used to work for an e-commerce company that sells a wide range of products, including electronics, clothing, and home goods. We tracked user interactions such as page views, product views, cart additions, and purchases. One day, the Director of UX came up to me and asked,
“Can you segment our users based on their shopping behavior? I want to understand different user groups so we can design more personalized workflows for them.”
Given our detailed tracking of user interactions, I realized that our data could provide the insights she needed, but it would require thorough analysis. That's why I decided to use R and RStudio instead of Excel or Google Sheets for this particular ask.
Why Not Just Use Excel or Google Sheets?
For this project our goal is to use cluster analysis to segment the website's users based on their shopping behavior. This analysis will help identify distinct user groups, allowing the UX team to tailor the website's features and enhance the overall user experience, ultimately improving usability and engagement.
But why use R and RStudio instead of Excel or Google Sheets? Here's why R and RStudio are superior for this kind of analysis:
Clustering Capabilities: You can't even do K-means clustering in Excel or Google Sheets without a bunch of custom scripts, and even with those, your options to customize are wayyyyyyy more limited than in R.
Handling Large and Complex Data: R is designed to efficiently manage and manipulate large and complex datasets, which can be cumbersome for Excel and Google Sheets. In this tutorial, we'll work with a very large sample dataset, similar to the ones you are likely to deal with in the real world.
Advanced Data Cleaning: R has powerful tools for cleaning and prepping your data. It's great at handling missing values, dealing with outliers, and transforming data in ways that can be a pain in Excel and Google Sheets.
Reproducibility and Automation: R scripts allow for the entire analysis process to be automated and reproduced easily. This is especially useful for updating analyses regularly or applying the same methods to different datasets.
Advanced Visualization: With R packages like ggplot2, you can create detailed, customizable, and high-quality visualizations. These visualizations can be far more complex and informative than those typically produced in Excel or Google Sheets.
In this tutorial, I'll guide you through setting up the analysis, interpreting the results, applying more sophisticated clustering techniques, and creating compelling visualizations in RStudio. By the end, you should be able to present these insights to your stakeholders effectively. Let's get started!
Prerequisites
Before starting the tutorial, make sure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
To get started, download the E-Commerce Data from Kaggle:
Visit the E-Commerce Data page on Kaggle.
If you don't already have a Kaggle account, create one and log in.
On this page, click the Download button to get the dataset file 2019-Oct.csv.
Save the downloaded file on your computer.
👉 UPDATE: A reader informed me that the link above wasn’t working for them. If you have trouble accessing the 2019-Oct.csv file on Kaggle.com, you can use this link to the same file on my personal Dropbox.
Dataset credit: Data files © REES46 eCommerce Marketing Platform contributed by Michael Kechinov
Step 2: Start a New RStudio Project
Open RStudio
Go to File > New Project > New Directory > New Project
Name your project (e.g., "Ecommerce_User_Segmentation") and choose a location on your computer.
Click Create Project.
Set Up Your Project Structure: Within your new project directory folder on your computer, create a new folder and name it data.
Move the downloaded 2019-Oct.csv file into the new data folder in your project directory. Its location relative to the project folder, data/2019-Oct.csv, is referred to as the file's relative path.

Step 3: Install Necessary Packages
Install the required packages by copy/pasting the code snippet below into the RStudio Console:
install.packages("tidyverse")
install.packages("cluster")
install.packages("factoextra")
install.packages("reshape2") # Used for the heatmap in Step 13
Step 4: Load the Libraries
In your R script, load the necessary libraries:
library(tidyverse)
library(cluster)
library(factoextra)
✏️ NOTE: Disregard the Conflicts section shown in the Console.
Step 5: Import the Dataset
Use a relative path to import the dataset, ensuring portability and ease of sharing:
# Import the dataset
data <- read.csv("data/2019-Oct.csv") # Adjust the path based on your setup
# View the first few rows of the dataset
head(data)
✏️ NOTE: Unlike the last two R and RStudio tutorials in this series, this sample dataset is BIG. That means from here on out, some of the following steps may take a few minutes to compute. (Welcome to the real world of using R! hahaha.) You can follow along with the computational status of each step via the progress indicator in the Environment pane at the top right of your screen and/or the stop sign icon and infinity sign icon in the top right in the Console pane.
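Because the full file is large, it can help to preview a small slice before committing to the full import. The sketch below is a minimal illustration, assuming a stand-in CSV written to a temporary file (the column names mirror the dataset, but the rows are made up). It relies on nrows, a standard argument of read.csv():

```r
# Write a tiny stand-in CSV so the example runs anywhere (made-up rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "event_time,event_type,product_id,price,user_id",
  "2019-10-01 00:00:00 UTC,view,1001,35.79,501",
  "2019-10-01 00:00:05 UTC,cart,1002,12.50,502",
  "2019-10-01 00:00:09 UTC,purchase,1002,12.50,502",
  "2019-10-01 00:01:00 UTC,view,1003,99.00,503"
), tmp)

# Read only the first 2 data rows to check column names and types
preview <- read.csv(tmp, nrows = 2)
str(preview)
```

With the real file, the equivalent call would be read.csv("data/2019-Oct.csv", nrows = 1000), which returns quickly and lets you confirm the columns before loading everything.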
Step 6: Explore and Clean the Data
Now, let's check for missing values and convert categorical variables to factors:
# Check for missing values
sum(is.na(data))
# Example of handling missing values by removing rows with NAs
data <- na.omit(data)
# Convert categorical variables to factors
data$event_type <- as.factor(data$event_type)
data$category_code <- as.factor(data$category_code)
data$brand <- as.factor(data$brand)
data$user_session <- as.factor(data$user_session)
# Display the structure of the dataset
str(data)
💡 TIP: If at any time during these steps you get an error message, simply copy/paste that message verbatim into a Google search. You’ll likely find the fix in the first few result links. Also, checking forums like Stack Overflow can be very helpful for resolving common issues.
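One thing worth checking at this point: by default, read.csv() treats only the string NA as missing in text columns, so blank text fields are read as empty strings and is.na() will not count them. Here is a minimal sketch on a made-up data frame (the column names follow the dataset; the values are invented) that counts both kinds of gaps per column:

```r
# Made-up rows: one true NA in price, two blank strings in brand
toy <- data.frame(
  event_type = c("view", "cart", "view", "purchase"),
  brand      = c("acme", "", "", "zenith"),
  price      = c(10.5, NA, 3.2, 8.0),
  stringsAsFactors = FALSE
)

na_counts    <- colSums(is.na(toy))              # true missing values
blank_counts <- colSums(toy == "", na.rm = TRUE) # empty strings

print(na_counts)
print(blank_counts)

# Optionally turn blanks into NA so later NA handling sees them
toy$brand[toy$brand == ""] <- NA
```

If blanks should count as missing from the start, another option is to pass na.strings = c("", "NA") to read.csv() when importing.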
Step 7: Data Aggregation
Next we need to aggregate the data to compute key metrics for each user:
# Take a smaller sample of the data for quicker computation during testing
set.seed(123) # Make the random sample reproducible
sample_data <- data %>% sample_n(100000)
# Aggregate data to compute key metrics for each user
user_data <- sample_data %>%
  group_by(user_id) %>%
  summarise(
    num_views = sum(event_type == "view"),
    num_cart_adds = sum(event_type == "cart"),
    num_purchases = sum(event_type == "purchase"),
    avg_price_viewed = mean(price[event_type == "view"], na.rm = TRUE),
    avg_price_purchased = mean(price[event_type == "purchase"], na.rm = TRUE),
    session_duration = as.numeric(difftime(max(as.POSIXct(event_time)), min(as.POSIXct(event_time)), units = "mins"))
  ) %>%
  filter(!is.na(user_id)) # Filter out rows with NA user_id
# View the first few rows of the aggregated data
head(user_data)
📘 Explanation: In this step, we aim to compute key metrics for each user to understand their shopping behavior. Aggregating data helps summarize user interactions in a meaningful way, making it easier to analyze and draw insights.
Sampling the Data: We use a smaller sample (sample_n(100000)) of the data for quicker computation during testing. This approach helps speed up the development process, and you can remove this sampling step for a full analysis.
Grouping by User ID: The group_by(user_id) function groups the data by individual users, allowing us to compute metrics for each user separately.
Calculating Metrics:
num_views: Total number of product views.
num_cart_adds: Total number of items added to the cart.
num_purchases: Total number of purchases made.
avg_price_viewed: Average price of the products viewed.
avg_price_purchased: Average price of the products purchased.
session_duration: Time in minutes between the user's first and last recorded events. Because the data is grouped by user_id rather than user_session, this spans all of a user's events in the sample, not a single session.
Handling Missing Data: The na.rm = TRUE parameter tells mean() to ignore missing prices. Users with no events of a given type still get NaN for that average, which Step 8 deals with.
Filtering NA User IDs: The filter(!is.na(user_id)) step ensures that any rows with missing user IDs are removed from the dataset.
By aggregating these metrics, we can obtain a comprehensive view of each user's behavior, which is crucial for effective clustering and subsequent analysis.
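One caveat with row-level sampling is that it keeps only a fraction of each user's events, so per-user counts and time spans shrink. An alternative is to sample users and keep all of their events. The sketch below is one way to do that on made-up events (the events data frame and its values are hypothetical, and it assumes the dplyr package from the tidyverse is installed):

```r
library(dplyr)

set.seed(123)

# Made-up event log: 6 users, several events each
events <- data.frame(
  user_id    = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 6),
  event_type = c("view", "cart", "purchase", "view", "view", "view",
                 "view", "view", "cart", "purchase", "view", "view"),
  price      = c(20, 20, 20, 5, 7, 50, 10, 12, 12, 12, 99, 3)
)

# Sample 3 users, then keep every event belonging to them
sampled_ids  <- sample(unique(events$user_id), size = 3)
sample_users <- events %>% filter(user_id %in% sampled_ids)

# Same style of aggregation as in the step above
user_summary <- sample_users %>%
  group_by(user_id) %>%
  summarise(
    num_views           = sum(event_type == "view"),
    num_purchases       = sum(event_type == "purchase"),
    avg_price_purchased = mean(price[event_type == "purchase"])
  )

print(user_summary)
```

Any sampled user without a purchase shows NaN for avg_price_purchased, because the mean of an empty set of prices is undefined.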
Step 8: Data Normalization
Time to normalize, woot woot! K-means groups users by distance, so you will almost always want to put your metrics on a common scale before clustering:
# Normalize the data and handle NA/NaN/Inf values
user_data_normalized <- scale(user_data[,-1], center = TRUE, scale = TRUE) # Exclude the user_id column from normalization
# Replace any remaining NA/NaN/Inf values with zeros (or another strategy)
user_data_normalized[is.na(user_data_normalized)] <- 0
user_data_normalized[is.nan(user_data_normalized)] <- 0
user_data_normalized[is.infinite(user_data_normalized)] <- 0
# View the first few rows of the normalized data
head(user_data_normalized)
📘 Explanation: Normalizing the data ensures that all features contribute equally to the analysis, which matters for distance-based methods like K-means. Without normalization, features with larger ranges can dominate the results, leading to biased outcomes. The scale() function standardizes each column to a mean of 0 and a standard deviation of 1, so a difference of one unit means the same thing for every metric. This makes the clusters more reliable and easier to interpret, and it is standard practice before K-means clustering.
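To make the transformation concrete: scale() with its defaults subtracts each column's mean and divides by its standard deviation, giving z-scores. Here is a small sketch on made-up numbers (the matrix m is hypothetical) that repeats the calculation by hand for one column:

```r
# Made-up metrics for four users; columns are on very different scales
m <- cbind(
  num_views        = c(1, 3, 5, 7),
  avg_price_viewed = c(100, 250, 400, 1050)
)

scaled <- scale(m, center = TRUE, scale = TRUE)

# The same calculation by hand for one column
manual <- (m[, "num_views"] - mean(m[, "num_views"])) / sd(m[, "num_views"])

print(round(scaled, 3))
print(all.equal(as.numeric(scaled[, "num_views"]), as.numeric(manual)))

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Because a scaled value of 0 sits exactly at the column mean, replacing NA with 0 after scaling amounts to treating that user as average on that metric.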
Step 9: Determine Optimal Number of Clusters
Use the elbow method to determine the optimal number of clusters:
# Determine the optimal number of clusters using the elbow method
set.seed(123)
wss <- (nrow(user_data_normalized) - 1) * sum(apply(user_data_normalized, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(user_data_normalized, centers = i)$withinss)
}
# Plot the elbow method
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
📘 Explanation: The elbow method helps to determine the optimal number of clusters. The goal is to identify the point where adding more clusters doesn't significantly improve the fit. This point, known as the "elbow," represents the optimal number of clusters. In the plot, look for the "elbow" point where the rate of decrease sharply slows down. For instance, if the elbow occurs around 5 clusters, this suggests that 5 is the optimal number of clusters for our analysis.
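To see what an elbow looks like when the answer is known, it can help to run the same idea on synthetic data. The sketch below uses three made-up, well-separated groups, so the drop in within-cluster sum of squares should be large up to 3 clusters and small afterwards. The nstart = 10 argument (several random starts per value of k) is an addition that is not in the code above:

```r
set.seed(42)

# Three well-separated made-up groups of 50 points each
blobs <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 1), ncol = 2),
  cbind(rnorm(50, mean = -10, sd = 1), rnorm(50, mean = 10, sd = 1))
)

max_k <- 6
wss <- numeric(max_k)

# k = 1: total sum of squares around the overall mean
wss[1] <- sum(scale(blobs, center = TRUE, scale = FALSE)^2)

# k = 2..max_k: total within-cluster sum of squares
for (k in 2:max_k) {
  wss[k] <- kmeans(blobs, centers = k, nstart = 10)$tot.withinss
}

# How much each extra cluster reduces the within-cluster sum of squares
drops <- -diff(wss)
print(round(wss, 1))
print(round(drops, 1))
```

On real behavioral data the bend is usually less sharp than in this toy case, so treat the elbow as a guide rather than a definitive answer.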
Step 10: Apply K-means Clustering
Now, perform K-means clustering to segment users based on their shopping behavior. We'll assume the optimal number of clusters is 5 based on the elbow method.
# Apply K-means clustering
set.seed(123)
kmeans_result <- kmeans(user_data_normalized, centers = 5) # Assuming 5 clusters based on the elbow method
user_data$cluster <- as.factor(kmeans_result$cluster)
# View the first few rows of the data with cluster assignments
head(user_data)
Step 11: Visualize the Clusters
Now for the fun part. Let’s create scatter plots to visualize the clusters. This will help in understanding how the users are grouped based on their shopping behavior.
# Visualize the clusters
fviz_cluster(kmeans_result, data = user_data_normalized, geom = "point", ellipse.type = "convex") +
  labs(title = "User Segments Based on Shopping Behavior") +
  theme_minimal()
📘 Explanation: The fviz_cluster function from the factoextra package helps in visualizing the clusters formed by K-means clustering. The plot shows how users are grouped into distinct segments based on their shopping behavior, with each point representing a user and each color representing a different cluster. The convex outlines drawn around the clusters (ellipse.type = "convex") give an idea of the cluster boundaries.
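When the data has more than two columns, fviz_cluster() draws the points on the first two principal components rather than on any pair of original metrics, so a useful companion check is how much of the variation those two axes capture. Here is a minimal sketch using base R's prcomp() on made-up standardized data (the matrix x is hypothetical and stands in for user_data_normalized):

```r
set.seed(123)

# Made-up stand-in for the normalized user metrics: 200 users, 6 metrics
x <- scale(matrix(rnorm(200 * 6), ncol = 6))
colnames(x) <- c("num_views", "num_cart_adds", "num_purchases",
                 "avg_price_viewed", "avg_price_purchased", "session_duration")

pca <- prcomp(x)

# Share of total variance carried by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Share visible in a two-dimensional cluster plot
print(round(sum(var_share[1:2]), 3))
```

If the first two components carry only a small share, clusters that overlap in the plot may still be well separated in the full set of metrics.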
Step 12: Summarize and Interpret the Cluster Characteristics
To provide actionable insights, we need to summarize and interpret the characteristics of each cluster. This involves analyzing the average metrics for each cluster and understanding the distinct behaviors of users in each group.
# Summarize cluster characteristics
cluster_summary <- user_data %>%
  group_by(cluster) %>%
  summarise(
    avg_num_views = mean(num_views, na.rm = TRUE),
    avg_num_cart_adds = mean(num_cart_adds, na.rm = TRUE),
    avg_num_purchases = mean(num_purchases, na.rm = TRUE),
    avg_price_viewed = mean(avg_price_viewed, na.rm = TRUE),
    avg_price_purchased = mean(avg_price_purchased, na.rm = TRUE),
    avg_session_duration = mean(session_duration, na.rm = TRUE)
  )
# View the cluster summary
print(cluster_summary)
📘 Explanation: The cluster_summary table provides a concise summary of each cluster's characteristics, highlighting the average number of views, cart additions, purchases, average price of viewed and purchased products, and average session duration. This summary helps in identifying the key behaviors and preferences of users in each cluster.
For example:
Cluster 1 might consist of users who view many products but rarely make purchases.
Cluster 2 could include users who frequently add items to their cart but have lower purchase rates.
Cluster 3 might represent high-value customers who view, add to cart, and purchase high-priced items frequently.
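The centers stored in a kmeans result are in standardized units, while a summary table like the one above is in original units. One way to move between the two is to reverse the scaling using the attributes that scale() attaches to its output. Here is a minimal sketch on made-up data (all values are hypothetical, and nstart = 10 is an added argument):

```r
set.seed(123)

# Made-up user metrics in original units: light browsers and heavy browsers
raw <- cbind(
  num_views        = c(rpois(30, 3), rpois(30, 40)),
  avg_price_viewed = c(rnorm(30, 50, 5), rnorm(30, 400, 30))
)

scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 10)

# scale() stores the column means and standard deviations as attributes
mu    <- attr(scaled, "scaled:center")
sigma <- attr(scaled, "scaled:scale")

# Reverse the z-score: original = z * sd + mean, column by column
centers_original <- sweep(sweep(km$centers, 2, sigma, `*`), 2, mu, `+`)
print(round(centers_original, 1))

# Cross-check against plain per-cluster means of the raw data
check <- apply(raw, 2, function(col) tapply(col, km$cluster, mean))
print(round(check, 1))
```

The two printed tables should agree, which confirms that the centers and the per-cluster averages describe the same segments on different scales.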
Step 13: Creating Additional Visualizations
Lastly, let’s create additional plots to provide a deeper understanding of the user segments. These visualizations include:
The Cluster Size Bar Chart
The Heatmap of Cluster Centers
The Distribution of avg_price_viewed by Cluster Plot
The Distribution of avg_price_purchased by Cluster Plot
Plot 1: Cluster Size Bar Chart
This plot shows the number of users in each cluster, providing insight into the relative size of each segment.
# Bar plot of cluster sizes
cluster_sizes <- user_data %>%
  group_by(cluster) %>%
  summarise(size = n())
ggplot(cluster_sizes, aes(x = cluster, y = size, fill = cluster)) +
  geom_bar(stat = "identity") +
  labs(title = "Cluster Sizes", x = "Cluster", y = "Number of Users") +
  theme_minimal()
Plot 2: Heatmap of Cluster Centers
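This heatmap displays the cluster centers, which are the average standardized values (z-scores) of each metric within each cluster, highlighting the main characteristics that define each user segment. Values above 0 are above the overall user average for that metric, and values below 0 are below it.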
# Compute cluster centers
cluster_centers <- kmeans_result$centers
# Create a heatmap of cluster centers
library(reshape2)
cluster_centers_melt <- melt(cluster_centers)
names(cluster_centers_melt) <- c("Cluster", "Metric", "Value")
ggplot(cluster_centers_melt, aes(x = Metric, y = as.factor(Cluster), fill = Value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Heatmap of Cluster Centers", x = "Metric", y = "Cluster") +
  theme_minimal()
Plot 3: Distribution of avg_price_viewed by Cluster Plot
This plot illustrates how the average price of products viewed varies across different clusters, offering insight into spending intentions.
# Distribution of avg_price_viewed by cluster
ggplot(user_data, aes(x = cluster, y = avg_price_viewed, fill = cluster)) +
  geom_boxplot() +
  labs(title = "Distribution of Average Price Viewed by Cluster", x = "Cluster", y = "Average Price Viewed") +
  theme_minimal()
Plot 4: Distribution of avg_price_purchased by Cluster Plot
This plot shows how the average price of purchased products differs among clusters, highlighting actual spending behavior.
# Distribution of avg_price_purchased by cluster
ggplot(user_data, aes(x = cluster, y = avg_price_purchased, fill = cluster)) +
  geom_boxplot() +
  labs(title = "Distribution of Average Price Purchased by Cluster", x = "Cluster", y = "Average Price Purchased") +
  theme_minimal()
Interpretation and Insights
When viewed together, these plots provide a holistic understanding of user segments based on their shopping behavior. The User Segment Cluster Plot offers a visual overview of the distinct groups, while the Cluster Size Bar Chart gives us an idea of the proportion of users in each segment. The Heatmap of Cluster Centers dives deeper into the characteristics that define each cluster, providing actionable insights into user behavior. The Distribution of avg_price_viewed and avg_price_purchased plots further refine our understanding by highlighting spending intentions and actual purchasing behavior.
Together, these visualizations allow us to answer the Director of UX's question by clearly showing how users can be segmented based on their shopping behavior. This segmentation helps us design more personalized workflows tailored to the unique needs and preferences of each user group.
Data-Driven Next Steps
Personalized User Experiences
Cluster 1 (High Views, Low Purchases): Users in this cluster view many products but rarely make purchases. To convert these viewers into buyers:
Implement personalized promotions and discounts tailored to their browsing history.
Highlight product reviews and ratings prominently to build trust and reduce purchase hesitation.
Cluster 2 (High Cart Adds, Low Purchases): These users frequently add items to their cart but do not complete the purchase. To reduce cart abandonment:
Send cart reminder emails to nudge them towards completing the purchase.
Offer limited-time discounts or free shipping incentives to encourage checkout.
Tailored Product Recommendations
Cluster 3 (High-Value Customers): High-value customers often browse and purchase premium products. To enhance their shopping experience:
Provide personalized product recommendations based on their browsing and purchase history.
Highlight premium products and exclusive deals to cater to their preferences.
Cluster 4 (Cost-Conscious Shoppers): These users prefer budget-friendly options. To meet their needs:
Offer product recommendations that emphasize value for money.
Highlight sales, discounts, and bundle offers to attract these price-sensitive shoppers.
Enhanced Website Features
Cluster 5 (Engaged Browsers): Engaged browsers spend significant time on the site but may not always make purchases. To improve their experience:
Optimize product search and navigation to make finding items easier.
Implement features like wishlists, product comparisons, and user reviews to enhance engagement and encourage future purchases.
Targeted Marketing Campaigns
Design marketing campaigns that address the specific needs and preferences of each user segment:
Use insights from the heatmap to craft targeted messages that resonate with each group.
Tailor your marketing strategy to the unique behaviors and preferences of each cluster to maximize effectiveness.
Ongoing User Segmentation and Analysis
Continuously monitor user behavior and update the segmentation model to adapt to changing user patterns.
Integrate real-time data streams to keep the analysis current and relevant, ensuring you always have up-to-date insights to inform your strategies.
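Between full re-runs of the model, new users can be placed into the existing segments by standardizing their metrics with the original means and standard deviations and picking the nearest cluster center. The sketch below is one way to do this on made-up data. Note that assign_cluster() is a hypothetical helper written for this example, not part of any package:

```r
set.seed(123)

# Made-up training data in original units
train <- cbind(
  num_views     = c(rpois(40, 2), rpois(40, 30)),
  num_purchases = c(rpois(40, 0.2), rpois(40, 5))
)

train_scaled <- scale(train)
km <- kmeans(train_scaled, centers = 2, nstart = 10)

# Hypothetical helper: scale new rows with the training parameters,
# then return the index of the closest center (squared Euclidean distance)
assign_cluster <- function(new_rows, scaled_train, centers) {
  z <- scale(new_rows,
             center = attr(scaled_train, "scaled:center"),
             scale  = attr(scaled_train, "scaled:scale"))
  apply(z, 1, function(row) {
    which.min(colSums((t(centers) - row)^2))
  })
}

# Two made-up new users: a light browser and a heavy buyer
new_users <- cbind(num_views = c(1, 35), num_purchases = c(0, 6))
new_clusters <- assign_cluster(new_users, train_scaled, km$centers)
print(new_clusters)
```

This keeps segment labels stable between refreshes. Once behavior drifts noticeably, re-running the full clustering on fresh data is the safer choice.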
Isn't it remarkable what insights can come from this data? Data that you most likely have access to right now. This approach will ultimately improve usability and drive sales. I hope you are envisioning how you could do this or something similar with your own data.
Conclusion
Congratulations! You have successfully performed a cluster analysis to segment users based on their shopping behavior using real-world data. We walked through the entire process, from setting up your RStudio project and importing the dataset to normalizing the data, determining the optimal number of clusters, applying K-means clustering, and visualizing and interpreting the results. Here are the four key takeaways from this data discovery project:
Data Aggregation and Normalization:
Aggregating user data to compute key metrics like views, cart additions, purchases, and session duration is crucial for understanding shopping behavior.
Normalizing the data ensures that all metrics contribute equally to the clustering process.
Determining the Optimal Number of Clusters: The elbow method is a valuable tool for identifying the optimal number of clusters, helping to balance the complexity and interpretability of the model.
Clustering and Visualization: K-means clustering effectively segments users into distinct groups based on their behavior, which can be visualized using tools like fviz_cluster.
Actionable Insights: Summarizing and interpreting cluster characteristics allows us to derive meaningful insights that can drive targeted marketing campaigns, product recommendations, and website optimization.
By using this tutorial as a template for analyzing your own data, you can leverage the insights gained from the cluster analysis to create more personalized workflows for your users, ultimately improving usability and driving sales.
Feedback
How was this tutorial? Was it helpful? Easy to follow? Hard to follow? Please feel free to comment here or DM me with your feedback. Your input is invaluable as I continue to create content to help you leverage data in UX research.
By incorporating these techniques and insights, you can enhance your ability to understand and improve the user experience on your website. Happy segmenting, everyone!










