R for UX Researchers Series: Article #5
Tutorial: Product Bundling and Recommendation Engines with Market Basket Analysis
Summary: In this step-by-step tutorial, learn how to use R and RStudio to perform market basket analysis, similar to what Amazon does, to uncover frequently bought together products. This article covers data preparation, association rule mining, and creating compelling visualizations to provide actionable insights and support your cross-selling strategies.
The Scenario
Before Shopify, BigCommerce, and WooCommerce were widely used, e-commerce companies had to build their own small-scale product recommendation engines. During a consulting project, a stakeholder at an e-commerce client asked me the following question, which will serve as the prompt for this tutorial.
"Which of our products are frequently purchased together, and how can we use this information for cross-selling?"
Because this company collected a lot of user data, I knew this question could be answered through 'market basket analysis.' Market basket analysis is a data mining technique that discovers patterns or associations between items purchased together within a single transaction. Those patterns or associations are then turned into 'rules,' which can be applied to the dataset to make data-driven cross-selling recommendations.
Using R and RStudio for this analysis was an obvious choice due to the powerful data-handling capabilities and advanced visualization options. For complex and large-scale analyses like this, I recommend avoiding tools like Excel or Google Sheets, which just don't have the functionality and efficiency you’ll need.
Why Not Just Use Excel or Google Sheets?
For this project, we need a solution that can handle complex data and build sophisticated association rules. Here's why R and RStudio are superior to Excel or Google Sheets for this kind of analysis:
Building Association Rules: R packages like arules and arulesViz are specifically designed for market basket analysis, allowing us to create and visualize association rules with ease. You can't do this in Excel or Google Sheets without a large number of complex scripts or add-ons.
Handling Large and Complex Data: R is better than Excel and Google Sheets for manipulating large datasets.
Advanced Data Cleaning: R is better for data prep. It's way quicker to handle missing values, identify outliers, and transform data in R compared to Excel or Google Sheets.
Reproducibility and Automation: R scripts can automate the entire analysis process, ensuring consistency and saving time. This is especially useful for a project like this one, where you need to update analyses regularly or apply the same methods to evolving datasets.
Advanced Visualization: With R, you can produce visualizations that can be far more complex and informative than those typically produced in Excel or Google Sheets.
Model Validation and Tuning: R offers extensive options for validating and tuning models to ensure robustness. Tools like cross-validation help in verifying the reliability of our association rules, a feature not available in Excel or Google Sheets.
In this tutorial, I'll walk you through setting up the analysis, interpreting the results, and creating compelling visualizations in RStudio. By the end, you'll be able to provide data-driven insights that will help inform cross-selling strategies and enhance user experiences. Let's get started!
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Online Retail Dataset from the UCI Machine Learning Repository.
Go to the Online Retail Dataset page.
Click the Download button to get the dataset file.
Download the file titled Online Retail.xlsx and save it on your computer.
✏️ NOTE: Make a note of the full path to the Online Retail.xlsx file on your computer. You'll need that path to import the dataset in Step 5 below.
Step 2: Start a New RStudio Project
Open RStudio.
Go to File > New Project > New Directory > New Project.
Name your project (e.g., "Market_Basket_Analysis") and choose a location on your computer.
Click Create Project.
Set up your project structure: Within the new project directory folder on your computer, create a new folder and name it data.
Move the downloaded Online Retail.xlsx file into the data folder you just created.
Step 3: Install Necessary Packages
Install the required packages by copying and pasting the code snippet below into the RStudio Console:
install.packages("tidyverse")
install.packages("readxl")
install.packages("arules")
install.packages("arulesViz")
install.packages("igraph")
install.packages("reshape2")
install.packages("pheatmap")
Step 4: Load the Libraries
Next, load the necessary libraries in your R script:
library(tidyverse)
library(readxl)
library(arules)
library(arulesViz)
library(igraph)
library(reshape2)
library(pheatmap)
✏️ NOTE: Disregard the Conflicts section shown in the Console.
Step 5: Import the Dataset
Now, import the dataset into RStudio using the following code snippet. Adjust the file path as needed based on where you saved the dataset.
# Import the dataset
file_path <- "path/to/your/dataset/data/Online Retail.xlsx"
data <- read_excel(file_path)
# View the first few rows of the dataset
head(data)
✏️ NOTES:
Replace "path/to/your/dataset" with the full path to your project folder on your own machine, so that the complete path points to the .xlsx file.
I added the head(data) call (under the comment "# View the first few rows of the dataset") so you can see the first few rows of your data in the Console. This is helpful for verifying that the dataset has been imported correctly and understanding the initial structure and content of your data before proceeding with further analysis.
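Because the project from Step 2 already contains a data folder, a relative path is one way to avoid hard-coding a machine-specific location. The sketch below assumes your working directory is the project root (RStudio normally sets this when you open the project) and that the file kept its original name. Both are assumptions about your setup, so adjust as needed.

```r
# Build the path relative to the project root (assumes the Step 2 folder layout)
file_path <- file.path("data", "Online Retail.xlsx")

# Check for the file first so a wrong path gives a readable message
if (!file.exists(file_path)) {
  message("File not found at: ", normalizePath(file_path, mustWork = FALSE))
} else {
  data <- readxl::read_excel(file_path)
  print(head(data))
}
```

If the message appears, compare the printed location with where you actually saved the file.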
Step 6: Clean the Data
After last week's post, a subscriber asked me to discuss data cleaning in more depth. I'll write an entire post about data cleaning later, but for this tutorial, I broke out the steps into different code snippets. This should give you a sense of the general process and will allow you to reverse-engineer some of my thoughts and best practices here.
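For this dataset, let's prepare everything for analysis by handling missing values and filtering the dataset to include only relevant transactions. The first snippet below does the full cleaning in one pass; the snippets after it break the same work into individual steps so you can inspect the data before and after each one. If you want the "before" views to show anything, run the step-by-step snippets instead of the combined one.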
Clean the Data
# Clean the data
data <- data %>%
  filter(!is.na(InvoiceNo) & !is.na(StockCode) & !is.na(Description) & !is.na(Quantity) & !is.na(InvoiceDate) & !is.na(UnitPrice) & !is.na(CustomerID)) %>%
  filter(Quantity > 0) %>%
  mutate(InvoiceNo = as.factor(InvoiceNo), StockCode = as.factor(StockCode))
# View the cleaned data
head(data)
Handling Missing Values
Before cleaning:
# Display rows with missing values
missing_data <- data %>% filter(is.na(InvoiceNo) | is.na(StockCode) | is.na(Description) | is.na(Quantity) | is.na(InvoiceDate) | is.na(UnitPrice) | is.na(CustomerID))
head(missing_data)
After removing missing values:
# Remove rows with missing values
data <- data %>% filter(!is.na(InvoiceNo) & !is.na(StockCode) & !is.na(Description) & !is.na(Quantity) & !is.na(InvoiceDate) & !is.na(UnitPrice) & !is.na(CustomerID))
Filtering Negative Quantities
Before filtering:
# Display rows with negative quantities
negative_quantity_data <- data %>% filter(Quantity <= 0)
head(negative_quantity_data)
After filtering:
# Remove rows with negative quantities
data <- data %>% filter(Quantity > 0)
Converting Columns to Factors
# Convert columns to factors
data <- data %>% mutate(InvoiceNo = as.factor(InvoiceNo), StockCode = as.factor(StockCode))
# View the structure of the cleaned data
str(data)
📘 Explanation: Proper data cleaning of this kind ensures our analysis is based on accurate and reliable information. It is a best practice, and we'll explore it further in later articles in this series. Stay tuned!
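Before removing anything, it can help to count how much is missing in each column. Here is a minimal base R sketch on a made-up data frame (the values and the reduced set of columns are invented for illustration). Note that complete.cases() looks at every column, which is a slightly broader check than naming columns one by one.

```r
# Toy data frame standing in for the retail data (all values are made up)
toy <- data.frame(
  InvoiceNo   = c("1001", "1001", "1002", "C1003"),
  Description = c("MUG", NA, "PLATE", "MUG"),
  Quantity    = c(2, 1, 6, -2),
  CustomerID  = c(17850, 17850, NA, 13047)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows that have no missing values and a positive quantity
keep <- complete.cases(toy) & toy$Quantity > 0
cat("Rows before:", nrow(toy), "Rows after:", sum(keep), "\n")
```

Comparing the row counts before and after gives a quick sense of how much data the cleaning rules remove.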
Step 7: Perform Exploratory Data Analysis (EDA)
Now, let's explore the dataset to understand its structure and identify any patterns. Learn more about why I almost always run an EDA in my article R for UX Researchers Series: Article #2.
# Summary statistics
summary(data)
# Top 10 most frequently bought products
top_products <- data %>%
  count(Description, sort = TRUE) %>%
  slice_max(n, n = 10)
# Plot top 10 products
ggplot(top_products, aes(x = reorder(Description, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Most Frequently Bought Products", x = "Product", y = "Frequency")
📘 Explanation: Exploratory Data Analysis (EDA) helps us get a sense of the data by summarizing its main characteristics. Here, we generate summary statistics and visualize the top 10 most frequently bought products, counted by the number of invoice lines each product appears on. This gives us an overview of the most popular items in the dataset.
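One more check that can be useful before mining rules is basket size, since an invoice with a single product cannot contribute to a rule of the form X -> Y. This is a small base R sketch on invented data. The column names mirror the dataset, but the rows are hypothetical.

```r
# Toy invoice lines (made up); invoice A lists MUG twice
toy <- data.frame(
  InvoiceNo   = c("A", "A", "A", "B", "C", "C"),
  Description = c("MUG", "PLATE", "MUG", "MUG", "BOWL", "PLATE")
)

# Number of distinct products on each invoice
basket_size <- tapply(toy$Description, toy$InvoiceNo,
                      function(x) length(unique(x)))
print(basket_size)

# Share of invoices with at least two distinct products
multi_item_share <- mean(basket_size >= 2)
print(multi_item_share)
```

A low share of multi-item invoices would suggest there is little co-purchase signal to find.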
Step 8: Convert Data into Transactions
To perform a market basket analysis, we need to convert the data into a transaction format that the arules package can work with. The arules package is an R toolkit specifically designed for association rule mining. This transformation will enable us to identify patterns and relationships between items bought together.
# Convert data to transactions
# Keep one row per invoice/product pair; a repeated product adds nothing to a basket
transaction_items <- data %>%
  select(InvoiceNo, Description) %>%
  distinct()
# Group the product descriptions by invoice and convert to the transactions class
transactions <- as(split(transaction_items$Description, transaction_items$InvoiceNo), "transactions")
# View summary of transactions
summary(transactions)
📘 Explanation: We transform the dataset into a format suitable for market basket analysis by grouping items under each transaction (invoice). The code keeps one row per invoice and product, splits the descriptions into one list of items per invoice, and converts that list into a transactions object, which is required by the arules package in R. Working from the rows directly, instead of joining the descriptions into one comma-separated string per invoice, means a product description that itself contains a comma stays intact.
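To see what a transactions object looks like on a small scale, here is a minimal sketch using three hypothetical baskets. The invoice numbers and item names are invented; only the arules calls are real.

```r
library(arules)

# Hypothetical baskets; the list names play the role of invoice numbers
baskets <- list(
  "536365" = c("MUG", "PLATE"),
  "536366" = c("MUG", "PLATE", "BOWL"),
  "536367" = c("BOWL")
)

# Convert the list into the transactions class used by arules
toy_transactions <- as(baskets, "transactions")

# Show each basket and its items
inspect(toy_transactions)

# Share of baskets that contain each item
item_share <- itemFrequency(toy_transactions)
print(item_share)
```

Each row of the inspect() output is one basket, which is the unit that the rule mining in the next step works on.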
Step 9: Perform Market Basket Analysis
Now, let's perform the market basket analysis to identify ‘association rules.’ An association rule is a statement like X -> Y, meaning if item X is purchased, item Y is likely to be purchased at the same time. In simple terms, it's like a product recommendation algorithm that suggests items often bought together (i.e., exactly what our stakeholder originally asked for).
# Perform market basket analysis
rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.8))
# View summary of rules
summary(rules)
# Inspect the top 10 rules by lift
inspect(head(sort(rules, by = "lift"), 10))
📘 Explanation: We use the apriori algorithm from the arules package to generate ‘association rules.’ The parameters supp (support) and conf (confidence) determine the minimum thresholds for the rules. Here, we set a support of 0.001 and confidence of 0.8. The generated rules are then sorted by ‘lift’ to identify the strongest associations.
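The thresholds have a large effect on how many rules come back. As a rough illustration, the sketch below mines the Groceries sample dataset that ships with arules (not the retail data from this tutorial) at three confidence levels. The support value, the confidence levels, and the minlen = 2 setting (which skips rules with an empty left-hand side) are arbitrary choices for the demo.

```r
library(arules)

# Sample transaction data bundled with arules (not the tutorial dataset)
data("Groceries")

# Confidence thresholds to compare (arbitrary values for illustration)
conf_levels <- c(0.2, 0.4, 0.6)

# Mine rules at each threshold and record how many rules are returned
rule_counts <- sapply(conf_levels, function(conf_level) {
  rules_at_level <- apriori(
    Groceries,
    parameter = list(supp = 0.01, conf = conf_level, minlen = 2),
    control = list(verbose = FALSE)
  )
  length(rules_at_level)
})
names(rule_counts) <- conf_levels
print(rule_counts)
```

Raising the confidence threshold can only keep or shrink the rule set, so this kind of comparison is a quick way to find a manageable number of rules.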
Terms
Support indicates how frequently the itemset appears in the dataset. It is calculated as the number of transactions containing the itemset divided by the total number of transactions. For example, if the support for "Heart T-Light Holder" is 0.05, it means that 5% of all transactions include a Heart T-Light Holder.
Confidence is a measure of the reliability of the rule. It is calculated as the number of transactions containing both X and Y divided by the number of transactions containing X. In simpler terms, it indicates the likelihood that a basket contains Y given that it contains X. For instance, a confidence of 0.8 for the rule "Heart T-Light Holder -> White Hanging Star" means that 80% of the transactions that include a Heart T-Light Holder also include a White Hanging Star.
Lift measures how much more likely item Y is to be purchased when item X is in the basket, compared with how often Y is purchased overall. It is calculated as the confidence of the rule divided by the support of item Y. A lift of 1 means X and Y appear together about as often as chance alone would predict, and values above 1 indicate a positive association. For instance, if the lift for the rule "Heart T-Light Holder -> White Hanging Star" is 3, it means that transactions containing a Heart T-Light Holder are three times as likely to also contain a White Hanging Star as transactions in general.
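The three definitions above can be checked by hand on a tiny example. This base R sketch uses five invented baskets and the rule "candle holder -> hanging star"; the item names and baskets are hypothetical.

```r
# Five made-up baskets
baskets <- list(
  c("candle holder", "hanging star"),
  c("candle holder", "hanging star", "bag"),
  c("candle holder"),
  c("hanging star"),
  c("bag")
)
n <- length(baskets)

# Which baskets contain X (candle holder) and Y (hanging star)
has_x <- sapply(baskets, function(b) "candle holder" %in% b)
has_y <- sapply(baskets, function(b) "hanging star" %in% b)

# Support of the rule: baskets with both X and Y, out of all baskets (2 of 5)
support_xy <- sum(has_x & has_y) / n

# Confidence: baskets with both X and Y, out of baskets with X (2 of 3)
confidence <- sum(has_x & has_y) / sum(has_x)

# Lift: confidence divided by the overall support of Y (3 of 5 baskets)
support_y <- sum(has_y) / n
lift <- confidence / support_y

print(c(support = support_xy, confidence = confidence, lift = lift))
```

Here the lift is only slightly above 1, so the two items appear together just a little more often than chance would predict.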
Now we have market basket analysis rules for our data. These rules contain insights into the relationships between products that directly answer the stakeholder's original question: "Which of our products are frequently purchased together, and how can we use this information for cross-selling?" However, the real value comes from visualizing these rules to see how they interact and to understand their context within our dataset. Let's finally get to the fun part and visualize this analysis!
Step 10: Visualize the Rules
Visualization helps in understanding the rules better. Let's create some visualizations to interpret the results and give us what we need to answer our stakeholder’s question.
Scatter Plot for Support vs. Confidence
# Plot the scatter plot for support vs. confidence
plot(rules, measure = c("support", "confidence"), shading = "lift", jitter = 0)
📘 Explanation: In my run, the work up to this point produced 271,408 rules. (That’s too many! Hahaha.) We need to pick the few rules that actually matter in the real world. This plot helps by visualizing the relationship between support and confidence for the association rules, with the shading indicating the lift. Higher lift values signify stronger associations, making it easier to identify significant rules.
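One way to narrow a large rule set is to filter on the quality measures and drop redundant rules. The sketch below does this on the Groceries sample data bundled with arules rather than the tutorial dataset, and the lift and confidence cut-offs are arbitrary values chosen for illustration, not recommendations.

```r
library(arules)

# Sample transaction data bundled with arules (not the tutorial dataset)
data("Groceries")

demo_rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.3, minlen = 2),
  control = list(verbose = FALSE)
)

# Keep rules that clear a lift and a confidence bar (arbitrary cut-offs)
strong_rules <- subset(demo_rules, lift > 2 & confidence > 0.4)

# Drop rules that arules flags as redundant given a more general rule
strong_rules <- strong_rules[!is.redundant(strong_rules)]

# Look at the strongest few that remain
inspect(head(sort(strong_rules, by = "lift"), 5))
```

The same subset() and is.redundant() calls should apply to the rules object from Step 9, with cut-offs tuned to your own data.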
Graph-based Visualization of Association Rules
# Create a graph-based visualization of association rules
# Convert the rules to a data frame
rule_df <- as(rules, "data.frame")
# Check if the data frame has rules
if (nrow(rule_df) > 0) {
  # Get the top 10 rules by lift
  top_rules <- head(sort(rules, by = "lift"), 10)
  # Create a data frame for the top rules
  top_rule_df <- as(top_rules, "data.frame")
  # Split each rule label into its left-hand and right-hand side
  rule_parts <- strsplit(as.character(top_rule_df$rules), "=>", fixed = TRUE)
  # Create the edges data frame; trimws() removes the spaces around "=>"
  # so the same itemset always maps to the same node
  edges <- data.frame(
    from = trimws(sapply(rule_parts, function(x) x[1])),
    to = trimws(sapply(rule_parts, function(x) x[2])),
    weight = top_rule_df$lift
  )
  # Create a graph from data frame
  g <- graph_from_data_frame(edges, directed = TRUE)
  # Set vertex attributes
  V(g)$color <- "skyblue"
  V(g)$size <- 15
  V(g)$label.cex <- 0.8
  # Plot the graph
  plot(g, vertex.label = V(g)$name, vertex.label.color = "black",
       edge.width = E(g)$weight, edge.color = "grey",
       main = "Graph-based Visualization of Association Rules")
} else {
  print("No rules found to plot.")
}
📘 Explanation: The graph-based visualization displays the network of item associations, where ‘nodes’ represent items and edges represent the rules. This helps in understanding the interconnectedness of items and identifying key product pairs. This shows the top 10 product pairs for cross-selling. Don’t worry, you can display more than 10 when you do this on your own dataset. The limited number of 10 was used for presentation purposes only.
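If you would rather not build the edge list by hand, arulesViz also provides a graph method for rules. The following is a minimal sketch on the Groceries sample data bundled with arules, with arbitrary thresholds. It assumes the igraph rendering engine is available in your installed arulesViz version, so treat the engine argument as something to verify locally.

```r
library(arules)
library(arulesViz)

# Sample transaction data bundled with arules (not the tutorial dataset)
data("Groceries")

demo_rules <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# Take the ten rules with the highest lift
top_demo_rules <- head(sort(demo_rules, by = "lift"), 10)

# Draw items and rules as a network using the package's graph method
plot(top_demo_rules, method = "graph", engine = "igraph")
```

The hand-built version gives full control over node and edge styling, while the built-in method is quicker for a first look.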
Heatmap of Association Rules
# Heatmap of association rules
top_rules <- head(sort(rules, by = "lift"), 20)
plot(top_rules, method = "matrix", measure = "lift")
📘 Explanation: The heatmap visualizes the lift of the top association rules. The color intensity represents the strength of the associations, with darker colors indicating stronger rules. This provides a quick overview of the most significant product pairs. By glancing at the heatmap, you can identify which pairs of products have the strongest associations, guiding effective cross-selling strategies.
Item Pair Matrix
# Get the top 20 rules by lift
top_rules <- head(sort(rules, by = "lift"), 20)
# Create a data frame for the top rules
top_rule_df <- as(top_rules, "data.frame")
# Extract the left-hand side (lhs) and right-hand side (rhs) items
lhs_items <- labels(lhs(top_rules))
rhs_items <- labels(rhs(top_rules))
# Combine lhs and rhs into a data frame
item_pairs <- data.frame(
lhs = unlist(lhs_items),
rhs = unlist(rhs_items),
confidence = top_rule_df$confidence,
lift = top_rule_df$lift
)
# Load necessary library
library(reshape2)
# Reshape the data frame for the heatmap
item_pair_wide <- dcast(item_pairs, lhs ~ rhs, value.var = "lift", fill = 0)
# Convert to a matrix, keeping the lhs labels as row names
item_pair_matrix <- as.matrix(item_pair_wide[-1])
rownames(item_pair_matrix) <- as.character(item_pair_wide$lhs)
# Load necessary library
library(pheatmap)
# Plot the heatmap
pheatmap(item_pair_matrix, cluster_rows = FALSE, cluster_cols = FALSE, display_numbers = TRUE, main = "Item Pair Matrix")
📘 Explanation: This visualization, called an item pair matrix, shows the lift values for the top association rules in a matrix format, with left-hand-side items as rows and right-hand-side items as columns. The color and text indicate the lift of each rule, and a 0 marks a pair with no rule among the top 20. This detailed view helps in pinpointing key product pairs and understanding their relationships in a more granular way. By examining the matrix, you can determine which product pairs are most strongly associated, aiding in making data-driven decisions for cross-selling.
Interpretation of Results
By analyzing the association rules, we can derive valuable insights into product combinations that are frequently bought together. These insights can be used to inform cross-selling strategies.
Scatter Plot for Support vs. Confidence: This plot shows the relationship between support and confidence for the rules. Higher lift values indicate stronger associations.
Graph-based Visualization: This visualization helps in understanding the network of item associations, highlighting the most significant rules.
Heatmap of Association Rules: This heatmap shows the strength of associations between different product pairs, making it easy to identify strong rules.
Item Pair Matrix: This matrix provides a detailed view of the lift values for the top association rules, helping to pinpoint key product pairs for cross-selling.
These visualizations and the interpretation of the results provide a comprehensive view of the product associations within the dataset, enabling data-driven decisions to enhance cross-selling strategies.
Data-Driven Next Steps
Based on the insights derived from the market basket analysis, we can implement the following five strategies to enhance cross-selling and improve user experience.
1. Personalized Recommendations
Action: By identifying the products that are frequently bought together, we can create personalized product recommendations for users.
Example: If a customer purchases a "Heart T-Light Holder," we can recommend related items such as the "White Hanging Star" or "Jumbo Bag Red Retrospot" in the "Customers Who Bought This Item Also Bought" section.
Implementation Steps:
Integrate the recommendation engine into your e-commerce platform.
Use the association rules to feed the recommendation engine with product pairs.
Test the recommendations on a subset of users and gather feedback.
Refine the recommendations based on user interactions and feedback.
Benefit: This enhances the shopping experience by showing customers items that complement their purchases, potentially increasing the average order value.
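To make the "feed the recommendation engine with product pairs" step more concrete, here is a minimal base R sketch of a lookup function. The rule table, its lift values, and the function name recommend_for are all hypothetical; in practice the table could be built from the rules object, for example from the labels and quality measures used in Step 10.

```r
# Hypothetical rule table with single-item rules (all values are invented)
rule_table <- data.frame(
  lhs  = c("HEART T-LIGHT HOLDER", "HEART T-LIGHT HOLDER", "HOT WATER BOTTLE"),
  rhs  = c("JUMBO BAG RED RETROSPOT", "WHITE HANGING STAR", "WOOLLY HOTTIE"),
  lift = c(1.8, 3.0, 4.2),
  stringsAsFactors = FALSE
)

# Return up to top_n recommended products for an item, strongest lift first
recommend_for <- function(item, rules_df, top_n = 2) {
  matches <- rules_df[rules_df$lhs == item, ]
  matches <- matches[order(-matches$lift), ]
  head(matches$rhs, top_n)
}

recommendations <- recommend_for("HEART T-LIGHT HOLDER", rule_table)
print(recommendations)

# An item with no matching rule returns an empty result
no_match <- recommend_for("UNKNOWN ITEM", rule_table)
print(length(no_match))
```

A production recommender would also need to handle multi-item baskets and stock availability, which this sketch leaves out.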
2. Bundling Products
Action: Create product bundles based on the identified associations and offer discounts on these bundles.
Example: Bundle a "Hot Water Bottle" with a "Woolly Hottie" and offer a small discount on the combined purchase.
Implementation Steps:
Identify top product pairs from the analysis.
Create bundle offers in your e-commerce system.
Promote the bundles on your website and through marketing campaigns.
Monitor sales data to assess the effectiveness of the bundles.
Benefit: Bundling incentivizes customers to purchase more items at once, increasing sales volume and improving customer satisfaction by offering better value.
3. Targeted Marketing Campaigns
Action: Use the association rules to design targeted marketing campaigns.
Example: Customers who purchase a "Heart T-Light Holder" can be targeted with email promotions for related products like the "White Hanging Star" and "Coat Hanger."
Implementation Steps:
Segment your customer base based on purchase history.
Design email campaigns that feature the associated products.
Schedule and send the emails to the targeted segments.
Track the campaign performance and adjust based on results.
Benefit: Targeted campaigns are more likely to resonate with customers, leading to higher engagement rates and increased sales.
4. Store Layout Optimization
Action: For physical stores, understanding which products are frequently bought together can help in optimizing store layout.
Example: Place "Heart T-Light Holder" near the "White Hanging Star" in the store layout to make it easier for customers to find and purchase these items together.
Implementation Steps:
Analyze the store's current layout and product placement.
Use the association rules to determine optimal product placements.
Rearrange the store layout based on the analysis.
Monitor customer behavior and sales to evaluate the impact of the changes.
Benefit: Optimized store layouts can lead to a more intuitive shopping experience, encouraging customers to buy more products and spend more time in the store.
5. Inventory Management
Action: Use the identified product associations to inform better inventory management practices.
Example: Ensure that frequently bought together products, such as "Heart T-Light Holder" and "White Hanging Star," are always stocked together.
Implementation Steps:
Review inventory levels for the associated products.
Adjust reorder quantities based on the association rules.
Implement automatic restocking alerts for these products.
Continuously monitor inventory levels and sales data.
Benefit: This reduces the chances of losing sales due to stockouts and improves overall customer satisfaction by consistently meeting demand.
By implementing even one of these five strategies, businesses can leverage insights from market basket analysis to create more personalized and effective user engagement strategies. This approach answers the crucial question: "Which of our products are frequently purchased together, and how can we use this information for cross-selling?"
Reality Check
I want to take a quick pause to look back at what we just did. Imagine answering your own stakeholders' questions using the recommended action themes above. Even if only one of the five strategies aligns with your real-world needs, it's still a huge win!!!! I read A LOT of posts from UX researchers talking about how no one in our companies respects our work. How we never have a seat at the table. How it’s a shame that in mid-size and smaller organizations, we aren’t given the platforms or tools to do things like we just did. I hope this tutorial inspires you to think scrappier and jump at upskilling opportunities.
In just a few hours, you could provide this level of insight to your stakeholders. Just because the leaders we work with ask simple questions doesn't mean your answers have to be simple or not based on data.
They want complex answers; that's why they hired you. They want data-driven confidence in reporting. That's what the world expects from UX Research. RStudio is free. This blog series is free. You can do this! You can replicate what the large FAANG companies do without multi-billion dollar backing. You can rinse and repeat this tutorial to provide a monthly report to help drive actual dollars in your companies. If you are claiming to be a truly mixed methods researcher, data analysis like we have done in this series so far should 100% be a part of your toolkit. So what are you waiting for? Stop faking it until you make it. You've got this. I believe in you!
Conclusion
Congratulations! You have successfully performed a market basket analysis to uncover product associations using real-world data. We walked through the entire process, from setting up your RStudio project and importing the dataset to cleaning the data, building the model, and visualizing the results. Here are the key takeaways from this project:
Data Preparation: Properly cleaning and preparing your data is crucial for accurate analysis.
Exploratory Data Analysis (EDA): Helps in understanding the dataset's structure and guiding the analysis.
Market Basket Analysis: Association rule mining is an effective method for identifying relationships between products.
Visualization: Visualizing the rules helps in better understanding and communicating the insights.
Actionable Insights: The identified associations can be used to drive personalized recommendations, product bundling, targeted marketing, store layout optimization, and inventory management.
By following this tutorial, you can now conduct a market basket analysis to create more personalized and effective user engagement strategies. This approach will ultimately enhance cross-selling efforts and drive business success in the real world.
Feedback
How was tutorial #5 for you? Did you find it helpful? Easy to follow? Hard to follow? Feel free to comment here or DM me with your thoughts. Thanks for your continued engagement and support.
For those of you who wonder when I will be getting back to writing about other UX research-related topics that don't involve R and RStudio, don't worry. I only have a handful more articles planned for this series. After that, we'll get back to our regularly scheduled programming. Hahahaha. Thanks again!













