R for UX Researchers Series: Article #2
Tutorial: Predicting Purchase Behavior with Logistic Regression
Summary: In this step-by-step tutorial, learn how to use R and RStudio through a real-world scenario and dataset. This article covers performing data analysis, building predictive models, and creating compelling visualizations to gain actionable insights and support your stakeholders with data.
The Scenario
I used to work for an e-commerce company that operates an online shopping platform. There, we tracked basic user interactions such as page views, product clicks, session duration, and purchase behavior. One day, the Director of UX came up to me and asked,
"Can you figure out what factors influence whether users make a purchase during their session via our site analytics?"
Knowing how our site was instrumented, I immediately knew the data could answer her question, but I'd have to do some analysis first. That's why, for this business question, I chose to use R and RStudio rather than Excel or Google Sheets.
Disclaimer: For the record, I'm not a quant genius. I am a true mixed-methods researcher who gravitates toward qualitative methods by nature. I've been described as a people person and have always enjoyed the qual side of the profession. But my nine years working at Minitab, the world's most sophisticated desktop statistical software package, really gave me an appreciation and understanding of the power and necessity of quant. When telling my career story, I refer to my time at Minitab as being immersed in the X-Men school for stats, which I hope provides some context around where I'm coming from. Learn more about why crafting a career narrative is important in my article "Experienced UX Research Resume Writing: A Storytelling Approach."
That said, my intention here is to provide a more real-world and practical scenario that applies 1-to-1 to questions asked of us as UX Researchers in our everyday work. In this article, we'll walk through how to do all of this in RStudio step by step. I hope you work alongside me and actually follow the steps on your own computer.
Before we dive into the step-by-step tutorial, let's talk more about why I chose R and RStudio instead of more familiar tools like Excel or Google Sheets.
Why Not Just Use Excel or Google Sheets?
R and RStudio are way better options for this scenario because they offer several advantages over tools like Excel and Google Sheets:
Handling Large and Complex Data: R can efficiently manage and manipulate large and diverse datasets, which Excel and Google Sheets may struggle with. (We'll use a sample dataset in this tutorial, but it closely resembles the datasets I used at the e-commerce company.)
Advanced Data Cleaning: R provides powerful functions for precise and efficient data preparation. It allows for sophisticated handling of missing data, outliers, and data transformations that would be cumbersome in Excel and Google Sheets.
Building a Predictive Model: R allows for advanced statistical modeling, such as logistic regression, which provides detailed summaries and diagnostics that are possible but not as easily accessible or customizable in Excel and Google Sheets.
Model Validation and Tuning: R offers extensive options for validating and tuning models to ensure robustness. Tools like caret provide functionality for cross-validation, hyperparameter tuning, and more.
Reproducibility and Automation: R scripts automate the entire process, ensuring consistency and saving time. This is particularly useful when you need to update your analysis regularly or apply it to different datasets.
Advanced Visualization: R packages like ggplot2 create detailed, customizable, and high-quality visualizations. These visualizations can be far more complex and informative than those typically produced in Excel or Google Sheets.
Documentation and Reporting: RMarkdown integrates code, output, and narrative into comprehensive, reproducible reports. This makes it easier to document your analysis and share it with stakeholders in a clear and professional manner.
In this tutorial, I'll show you how to set up the analysis, interpret the results, incorporate some more complex modeling techniques, and create compelling visualizations in RStudio. By the end, you'll be ready to persuade your stakeholders with these predictive powers. Let's get started!
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Online Shoppers Purchasing Intention Dataset from Kaggle.
Save the dataset as a CSV file to a convenient location on your computer.
✏️ NOTE: Make a note of the full path to the online_shoppers_intention.csv file on your computer. You'll need this path to import the dataset in Step 5 below.
Step 2: Start a New RStudio Project
Open RStudio
Go to File > New Project > New Directory > New Project
Name your project (e.g., "User_Purchase_Analysis") and choose a location on your computer
Click Create Project

Step 3: Install Necessary Packages
Install the required packages by copy/pasting the code snippet below into the RStudio Console:
install.packages("tidyverse")
install.packages("corrplot")
install.packages("caret")
install.packages("e1071")

Step 4: Load the Libraries
Next, in your R script, load the necessary libraries:
library(tidyverse)
library(corrplot)
library(caret)
library(e1071)

✏️ NOTE: Disregard the Conflicts section shown in the Console; it is expected when loading the tidyverse.
Step 5: Import the Dataset
Now, import the dataset into RStudio using the code snippet below.
data <- read.csv("path/to/your/dataset/online_shoppers_intention.csv")
# Replace "path/to/your/dataset" with the full path to the CSV file on your own machine
Step 6: Clean the Data
Now, let's prepare the data for analysis by handling missing values and converting categorical variables into factors.
# Check for missing values
sum(is.na(data))
# Convert categorical variables to factors
data$Month <- as.factor(data$Month)
data$VisitorType <- as.factor(data$VisitorType)
data$Weekend <- as.factor(data$Weekend)
data$Revenue <- as.factor(data$Revenue)

💡 TIP: If at any time during these steps you get an error message, simply copy/paste that message verbatim into a Google search. You'll likely find the fix in the first few result links. Also, checking forums like Stack Overflow can be very helpful for resolving common issues.
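The snippet above only counts missing values. If `sum(is.na(data))` returns a number greater than zero, you'll want to handle them before modeling. A minimal sketch (my addition, and only one of several reasonable strategies) is to locate them and drop the incomplete rows:

```r
# See which columns the missing values live in
colSums(is.na(data))

# Simplest option: drop any rows containing missing values
data <- na.omit(data)

# Confirm nothing is left
sum(is.na(data))
```

If dropping rows would discard too much data, imputation (for example, replacing missing numeric values with the column median) is a common alternative.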
Step 7: Perform EDA
EDA stands for Exploratory Data Analysis. It is a step in the data analysis process that involves summarizing and visualizing the main characteristics of a dataset. The goal of EDA is to understand the data's structure, spot outliers, identify patterns, and gain insights for further analysis and modeling. I usually run this code, or something similar, at the start of most projects when I'm using R.
Copy and paste the following EDA code snippet into your R script and run it section by section.
# Summary statistics
summary(data)
# Histograms for numerical variables
ggplot(data, aes(x = ProductRelated_Duration)) + geom_histogram(binwidth = 30)
ggplot(data, aes(x = PageValues)) + geom_histogram(binwidth = 10)
# Density plots for numerical variables
ggplot(data, aes(x = ProductRelated_Duration)) + geom_density()
ggplot(data, aes(x = PageValues)) + geom_density()
# Boxplots for numerical variables by a categorical variable
ggplot(data, aes(x = Revenue, y = ProductRelated_Duration)) + geom_boxplot()
ggplot(data, aes(x = Revenue, y = PageValues)) + geom_boxplot()
# Scatter plot to examine relationships between numerical variables
ggplot(data, aes(x = ProductRelated_Duration, y = PageValues)) + geom_point()
# Bar chart for a categorical variable
ggplot(data, aes(x = Month, fill = Revenue)) + geom_bar(position = "fill")
# Correlation matrix and heatmap
cor_matrix <- cor(data %>% select_if(is.numeric))
corrplot(cor_matrix, method = "circle")
# Contingency table for categorical variables
table(data$VisitorType, data$Revenue)
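If you'd rather read the contingency table as conversion rates than raw counts, a small addition (not part of the original EDA code) is to normalize it with `prop.table()`:

```r
# Proportion of purchases within each visitor type;
# margin = 1 makes each row sum to 1
prop.table(table(data$VisitorType, data$Revenue), margin = 1)
```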
Step 8: Interpret the EDA Results
Let's take a closer look at the EDA results. Review the summary statistics, histograms, density plots, boxplots, scatter plots, bar charts, correlation matrix, and contingency table. You'll start to see patterns emerge, like how ProductRelated_Duration and PageValues might influence whether a user makes a purchase. These insights are crucial for understanding our data's story.
Step 9: Decide on a Model Type
Seeing the binary nature of the Revenue variable (purchase or no purchase), it became clear that a logistic regression would be the best fit for our analysis. This model will help us predict the probability of a purchase based on various predictors. Let's get our dataset ready for the logistic regression to ensure everything runs smoothly.
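As a quick refresher on what logistic regression estimates: the model relates the log-odds of a purchase to a linear combination of the predictors, and `plogis()` (the inverse logit) converts log-odds back into a probability. A toy illustration, with made-up coefficient values purely for intuition:

```r
# Hypothetical fitted values: intercept of -2 and a PageValues
# coefficient of 0.1 (illustrative numbers only)
log_odds <- -2 + 0.1 * 25   # a session with PageValues = 25

# plogis() maps log-odds to a probability between 0 and 1
plogis(log_odds)  # ≈ 0.62, i.e. about a 62% predicted purchase probability
```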
Step 10: Split the Data into Sets
Next, we'll have to split the data into training and testing sets so we can build and validate our model, ensuring its predictions are accurate and generalizable to new data. (I did a quick Google search to find a template for this code snippet.)
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(data$Revenue, p = .8, list = FALSE, times = 1)
trainData <- data[trainIndex,]
testData <- data[-trainIndex,]
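It's worth sanity-checking the split before modeling. Because `createDataPartition()` samples in a stratified way on Revenue, both sets should show roughly the same purchase rate. A quick check I like to run (my addition, not part of the original snippet):

```r
# Roughly 80% of rows should land in the training set
nrow(trainData) / nrow(data)

# The purchase rate should be similar in both sets
prop.table(table(trainData$Revenue))
prop.table(table(testData$Revenue))
```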
Step 11: Build the Logistic Regression Model
Now, we need to build a logistic regression model to predict whether a user will make a purchase, helping us understand the influence of various factors. Copy/paste this code snippet to build the model.
model <- train(Revenue ~ ., data = trainData, method = "glm", family = "binomial")
summary(model)
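The coefficients in the summary are on the log-odds scale, which can be hard to present to stakeholders. One way to translate them (an addition of mine, assuming the `model` object from above) is to exponentiate them into odds ratios:

```r
# caret stores the underlying glm fit in model$finalModel;
# exponentiating the coefficients turns log-odds into odds ratios
exp(coef(model$finalModel))

# An odds ratio above 1 means the feature increases the odds of a
# purchase; below 1 means it decreases them
```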
Step 12: Evaluate the Model
Now, we'll evaluate the model to see how well it predicts purchases and identify any areas where it may need improvement. (I did a quick Google search to find this code snippet as well.)
predictions <- predict(model, newdata = testData)
confusionMatrix(predictions, testData$Revenue)

Review the confusion matrix output. If the accuracy and related metrics look reasonable for our scenario, the model is good enough to proceed; let's move on to creating visuals, the fun stuff!
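One useful extension of the evaluation step: by default, `predict()` on a caret model returns hard class labels using a 0.5 probability cutoff. If you want the underlying probabilities, for example to experiment with a different decision threshold, you can request them explicitly. This is a sketch beyond the original tutorial; depending on your caret version you may need to set `classProbs = TRUE` in `trainControl()` and give the Revenue factor valid R-name levels (e.g., via `make.names()`) for this to work:

```r
# Predicted purchase probabilities instead of class labels
probs <- predict(model, newdata = testData, type = "prob")
head(probs)

# Re-classify with a custom threshold, e.g. 0.3 instead of 0.5,
# which trades some false positives for catching more purchasers
custom_preds <- factor(ifelse(probs[, 2] > 0.3,
                              levels(testData$Revenue)[2],
                              levels(testData$Revenue)[1]),
                       levels = levels(testData$Revenue))
confusionMatrix(custom_preds, testData$Revenue)
```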
Step 13: Predicted vs. Actual Values Scatter Plot
This plot will show us how well the model's predictions match the actual values so we can assess its accuracy.
# Plotting the predicted vs actual values
ggplot(testData, aes(x = Revenue, y = predictions)) +
geom_point(alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, col = "red") +
labs(title = "Predicted vs Actual Revenue", x = "Actual Revenue", y = "Predicted Revenue") +
theme_minimal()

Conclusions From the Scatter Plot
Points on the Line: The points that lie on the red diagonal line represent instances where the model's predictions perfectly match the actual outcomes. In our case, there are points at (False, False) and (True, True), indicating that the model correctly predicted some instances of both non-purchases and purchases.
Points off the Line: The two points that do not lie on the red diagonal line represent instances where the model's predictions do not match the actual outcomes.
This means the model is good enough to explore more visuals. Let's create enhanced visualizations to help us communicate our findings to the Director of UX.
Step 14: Enhanced Interpretation
I created two additional plots, a feature importance plot and a density plot of predictions, to explore more persuasive data visualizations.
Plot 1: Feature Importance Plot
This plot shows the importance of each feature in predicting the target variable. It helps in understanding which features are most influential in determining whether a user makes a purchase during their session.
# Feature importance plot using caret
importance <- varImp(model, scale = FALSE)

# varImp() returns a varImp.train object; extract its data frame of
# scores and keep the feature names before plotting
imp_df <- importance$importance
imp_df$Feature <- rownames(imp_df)

ggplot(imp_df, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance", x = "Features", y = "Importance") +
theme_minimal()

Observation
The PageValues feature has the highest importance, meaning it has the biggest impact on the model's predictions. The MonthJul, MonthSep, and SpecialDay features have the least importance.
Insight
Knowing which features are most important helps us understand what affects revenue predictions the most. For example, focusing on the PageValues feature can have a significant impact on our predictions.
Plot 2: Density Plot of Predictions
This plot compares the distribution of predicted and actual values, illustrating how well the model's predictions align with the actual outcomes.
# Density plot of predicted vs actual values
# predictions and Revenue are factors, so convert them to numeric
# (1 = no purchase, 2 = purchase) before computing densities
ggplot() +
geom_density(aes(x = as.numeric(predictions), fill = "Predicted"), alpha = 0.5) +
geom_density(aes(x = as.numeric(testData$Revenue), fill = "Actual"), alpha = 0.5) +
labs(title = "Density Plot of Predicted vs Actual Revenue", x = "Revenue", fill = "Legend") +
theme_minimal()

Observation
The overlap between the density curves of predicted and actual values shows how well the distributions match.
Insight
The close overlap between the density curves indicates good model performance.
Step 15: Finalizing Impactful Insights
This model is well positioned to give the Director of UX a confident answer to her original question: "Can you figure out what factors influence whether users make a purchase during their session via our site analytics?"
I included the Boxplots from Step 8 and the two enhanced plots, the Density Plot and the Feature Importance Plot, in my report to the Director of UX. These plots clearly showed how ProductRelated_Duration and PageValues might influence whether a user makes a purchase, emphasizing the importance of these features in our analysis.
These visualizations do a better job of telling the story of the predictive analysis, which is why I used them in my report: they provided clearer insights into the factors influencing user purchases.
The main insight is that concentrating the UX team's efforts on the PageValues feature will impact user purchase behavior most.
This data gave the Director of UX a clear direction as to where to allocate the team's efforts to increase conversions.
Step 16: Using R Markdown and Persuasive Reporting
Creating detailed and interactive visualizations in RStudio is a powerful way to communicate your findings, but doing this is its own challenge and deserves an article of its own. Because of this, I've decided to cover how to connect your analysis to a dynamic dashboard or create an HTML report in a future article.
In the meantime, let's quickly go over how to create a shareable HTML report for the analysis we've just completed:
R Markdown
First, make sure you have the R Markdown package installed. If not, you can install it by running:
install.packages("rmarkdown")

Now, you can create your report:
Create a New R Markdown File:
In RStudio, go to File > New File > R Markdown...
In the dialog box that appears, give your document a title, author name, and choose HTML as the output format.
Click OK to create the new R Markdown file.
Copy and Paste Your Analysis: Replace the default content of the new R Markdown file with the entire code from your analysis. Make sure to include all the steps, code snippets, and narrative explanations.
Render the HTML Report:
Click the Knit button at the top of the RStudio editor window.
RStudio will process the R Markdown file and generate an HTML report.
The HTML file will open in your default web browser and will be saved in your project directory.
Share the HTML Report:
You can now share the HTML file with your stakeholders. It contains all the code, visualizations, and narrative explanations, making it easy for others to understand and reproduce your analysis.
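If you prefer rendering from code rather than the Knit button, you can also call rmarkdown directly from the Console. (The filename `report.Rmd` here is just a placeholder; use whatever you named your R Markdown file.)

```r
# Render the R Markdown file to an HTML report in the project directory
rmarkdown::render("report.Rmd", output_format = "html_document")
```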
In an upcoming article, I will dive deep into how to use R Markdown to create reproducible reports and integrate R with various dashboard tools to make your data presentations even more impactful. Stay tuned for a detailed guide on reporting out of RStudio that will streamline your process and enhance your reporting skills.
Conclusion
By following these steps, you have successfully analyzed factors influencing user purchases using the Online Shoppers Purchasing Intention Dataset. Woot, woot! You can celebrate. This tutorial not only demonstrated how to handle and clean data but also how to build a predictive model and create compelling visualizations in RStudio. These skills are crucial for any UX researcher aiming to leverage data to improve digital products. I hope that, as you went through this tutorial, you were envisioning how your own data would work in the real world.
I also hope this tutorial demonstrated how, with R and RStudio, you can perform complex analyses and create powerful visualizations that are hard to achieve with Excel or Google Sheets. I've found that this kind of thing goes a long way when persuading stakeholders and helping them make data-driven and user-centered decisions.
Feedback
How was this tutorial? Was it helpful? Easy to follow? Hard to follow? Please feel free to comment here or DM me with your feedback. I've never created anything like this before, and I have no clue if it's any good, so any and all feedback is encouraged. Thanks all!