R for UX Researchers Series: Article #8
Tutorial: Understanding User Flows with Interactive Diagrams
Summary: In this tutorial, you’ll learn sequence analysis and how to create an interactive flow diagram. You’ll end up with data visualizations that explain user navigation patterns and pinpoint where users drop off on your website. Plus, the interactive data viz is fun to play with!
The Scenario
I used to work for a real estate tech company. We made a lot of different tech products, and one of those was our customer-facing website. Sometimes, I'd help the Chief Marketing Officer answer questions based on website analytics. One day, she approached me and said:
"Despite high traffic to our site, the number of leads has been disappointingly low. Can you figure out how our users usually navigate the site and where they tend to give up and leave?"
I immediately recommended that we run a sequence analysis and produce a Sankey diagram, because our dataset contained a handful of variables that could help us map user journeys. It better have; I helped set it up! Hahaha. It included page types like administrative, informational, and product-related pages, each with durations indicating how long users spent on them. It also had critical behavioral metrics like bounce rates and exit rates, plus factors like the month, operating system, browser, region, and traffic type that would help us make sense of the user pathways per session.
Again, using R and RStudio for this analysis is a no-brainer. They provide the data-handling capabilities and advanced data viz options we need, unlike Excel or Google Sheets, which straight-up struggle with complex sequence data.
Why Not Just Use Excel or Google Sheets?
To analyze user pathways effectively, we need tools that can handle complex sequence data and provide sophisticated analytical techniques. Here's why R and RStudio are far superior to Excel or Google Sheets for this task:
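Advanced Sequence Analysis Capabilities: R has specialized packages for sequence analysis, such as TraMineR (built for categorical sequence data like event histories) and seqinr (built for biological sequences). Excel or Google Sheets simply can't do sequence analysis without a lot of cumbersome workarounds.
Handling Large and Complex Data: R comfortably works with datasets that run to millions of rows, while an Excel worksheet tops out at 1,048,576 rows and often gets sluggish well before that.
Advanced Data Cleaning: R cleans data with scripted, repeatable steps (for example, with dplyr and tidyr) instead of manual cell-by-cell edits, so every fix is documented and can be rerun.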
Reproducibility and Automation: R scripts automate the entire process, keeping things consistent and saving time when updating reports or applying them to different datasets.
Advanced Visualization: R can produce complex, layered, and interactive visualizations with packages like ggplot2 and networkD3, which Excel or Google Sheets cannot match.
Using R and RStudio for sequence analysis allows us to handle navigation data better than the off-the-shelf alternatives. Let's get started.
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository:
Visit the Online Shoppers Purchasing Intention Dataset page on the UCI Machine Learning Repository.
Click the Download button in the top right of the page to get the file named online_shoppers_intention.csv.
Save the downloaded file on your computer in a location you can easily access.
✏️ NOTE: Make a note of the full path to the online_shoppers_intention.csv file on your computer. You'll need that path to import the dataset in Step 5 below.
Step 2: Start a New RStudio Project
Open RStudio.
Go to File > New Project > New Directory > New Project.
Name your project (e.g., "User_Pathways_Analysis") and choose a location on your computer.
Click Create Project.
Set Up Your Project Structure: Within your new project directory folder on your computer, create a new folder and name it 'data'.
Move the downloaded online_shoppers_intention.csv file into the new 'data' folder you just created.

Step 3: Install Necessary Packages
Install the required packages by copying and pasting the code snippet below into the RStudio Console:
# Install Necessary Packages
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("corrplot")
install.packages("Amelia")
install.packages("naniar")
install.packages("networkD3")
Step 4: Load the Libraries
Next, load the necessary libraries in your R script:
# Load Necessary Libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(Amelia)
library(naniar)
library(networkD3)
✏️ NOTES:
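Disregard the Conflicts section and most of the red text shown in the Console. RStudio prints ordinary messages and warnings in red too; only lines that start with "Error" mean something actually failed.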
You would normally not batch install and load all the packages and libraries you need upfront. I do this to simplify these tutorials. In the real world, you will do this as you decide which analysis to try next, based on the insights you gain during each step.
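If you would rather install only what is missing, here is a minimal sketch of one way to do it. The helper name find_missing_packages is hypothetical (it is not part of any package), and the install step is limited to interactive sessions so that a non-interactive run never tries to reach CRAN.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The libraries loaded in Step 4
required <- c("dplyr", "tidyr", "ggplot2", "corrplot",
              "Amelia", "naniar", "networkD3")
missing_pkgs <- find_missing_packages(required)

# Install only what is missing, and only when working interactively in RStudio
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Running this again after everything is installed does nothing, which makes it safe to leave at the top of a script.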
Step 5: Import the Dataset
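Import the dataset by pointing read.csv() at the CSV file. If you followed Step 2, the file sits in the 'data' folder of your project, so its path relative to the project folder is data/online_shoppers_intention.csv.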
# Load the dataset
file_path <- "path/to/your/dataset/data/online_shoppers_intention.csv"
data <- read.csv(file_path)
# Explore the dataset
str(data)
head(data)
summary(data)
✏️ NOTES:
Replace "path/to/your/dataset/" with the correct path to the CSV file if you have placed it in a different directory.
Viewing the first few rows of the dataset and checking column names using head(data) and colnames(data) is good for verifying the dataset has been imported correctly. This helps ensure that all necessary columns are present and allows you to get an overview of the data before proceeding.
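As a rough sketch of the relative-path approach, assuming the CSV sits in the 'data' folder created in Step 2 and that the RStudio project is open (so the working directory is the project folder), the import and the quick checks could look like this:

```r
# Build the path relative to the project folder (the one holding the .Rproj file)
file_path <- file.path("data", "online_shoppers_intention.csv")

if (file.exists(file_path)) {
  data <- read.csv(file_path)
  print(dim(data))       # number of rows and columns
  print(colnames(data))  # column names
  print(head(data))      # first few rows
} else {
  # Point at the most common cause: the working directory is not the project folder
  message("File not found at '", file_path,
          "'. Current working directory: ", getwd())
}
```

Checking file.exists() first turns a cryptic read error into a message that tells you where R is actually looking.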
Step 6: Exploratory Data Analysis (EDA)
Before diving into sequence analysis, let’s take a closer look at the dataset's structure, get some summary statistics, check for missing values, and explore the distribution of numeric variables. This helps identify any potential issues that need to be addressed and gives us a good overview of the data.
# Summary of the data
summary(data)
# Check for missing values
sapply(data, function(x) sum(is.na(x)))
# Check data structure
str(data)
# Log transform the numeric variables to handle wide ranges
# log1p(x) is log(1 + x), so zero counts and durations stay at zero
numeric_vars <- data %>% select(where(is.numeric))
numeric_vars_log <- numeric_vars %>%
  mutate(across(everything(), ~log1p(.)))
# Convert the data to long format for ggplot
numeric_vars_long <- numeric_vars_log %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
# Box and Violin plots for log-transformed numeric variables
ggplot(numeric_vars_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(alpha = 0.5) +
  geom_violin(aes(color = variable), alpha = 0.3) +
  labs(title = "Box and Violin Plots of Log-Transformed Numeric Variables", x = "Variable", y = "Log(Value)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") # Remove the legend to avoid overlap
# Correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle", type = "upper",
         tl.cex = 0.8, title = "Correlation Matrix",
         mar = c(0, 0, 2, 0)) # Leave room at the top so the title is not clipped
# Create a summary table of missing values
missing_summary <- sapply(data, function(x) sum(is.na(x)))
missing_summary_df <- data.frame(variable = names(missing_summary), missing_count = missing_summary)
print(missing_summary_df)
# Plot missing values only if there are any
if (any(missing_summary > 0)) {
  # print() makes sure the plot is drawn even when the script is sourced
  print(
    gg_miss_var(data) +
      labs(title = "Missing Values Summary", x = "Variables", y = "Number of Missing Values") +
      theme_minimal()
  )
} else {
  print("No missing values in the dataset.")
}
📘 Explanation: This EDA generates summary statistics and visualizations to help us understand the dataset. The distribution of log-transformed numeric variables and the correlation matrix provide insights into the data's structure and relationships between variables.
1. Summary Statistics
Overall Dataset:
Number of Rows: 12,330
Number of Columns: 18
Key Variables Summary:
Administrative: Pages with administrative functions (e.g., login, account settings).
Informational: Pages with informational content (e.g., FAQs, company info).
Product-Related: Pages related to products (e.g., product listings, details).
For each of these page types, the dataset records the number of pages viewed during the session (e.g., Administrative) and the total time spent on them in seconds (e.g., Administrative_Duration).
2. Box and Violin Plots
These plots show the distribution and spread of log-transformed numeric variables in our dataset. By using log transformation, we can more easily spot patterns and outliers, which are crucial for understanding user behavior. Each box spans the middle 50% of the values (the interquartile range), the line inside it marks the median, and the violin outline shows where values are concentrated, making it easier to identify unusual patterns and behaviors.
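To see why the log transformation helps, here is a small sketch with made-up durations (the numbers are illustrative, not taken from the dataset). It uses log1p(), the same function as the EDA code, which computes log(1 + x).

```r
# Toy durations in seconds: zeros and small values, plus one very long session
durations <- c(0, 5, 30, 120, 3600)

raw_range <- range(durations)
log_range <- range(log1p(durations))

# On the raw scale the long session dominates the axis;
# on the log scale all values fit into a much narrower band
print(raw_range)
print(round(log_range, 2))

# log1p(0) is 0, whereas log(0) would be -Inf and break the plot
print(log1p(0))
```

Because many sessions have zero administrative or informational page views, keeping zeros finite is the main reason to prefer log1p() over log() here.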

3. Correlation Matrix
This plot shows how different numeric variables relate to each other. By examining these relationships, we can understand how various features interact. Strong correlations (close to 1 or -1) suggest a significant relationship between variables, which can be crucial for modeling and further analysis.
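If you want the strongest relationships as a ranked list rather than reading them off the plot, a sketch along these lines works. It uses R's built-in mtcars data as a stand-in so it runs on its own; with the tutorial data you would pass numeric_vars instead. The name cor_pairs is just an illustrative choice.

```r
# Built-in demo data stands in for the numeric columns of the tutorial dataset
num <- mtcars[, c("mpg", "hp", "wt", "qsec")]
cor_matrix <- cor(num, use = "complete.obs")

# Keep each pair once by blanking the diagonal and the lower triangle
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_pairs <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "correlation")
cor_pairs <- cor_pairs[!is.na(cor_pairs$correlation), ]

# Sort by absolute strength so strong negative correlations rank high too
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]
print(head(cor_pairs, 3))
```

Sorting on the absolute value matters: a correlation of -0.8 is just as strong as one of 0.8, it only points the other way.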

4. Missing Values Summary
I wrote this part of the code to print a summary table of missing values in the RStudio Console and to simply output “No missing values in the dataset.” if there were none to visualize. Step 5 only imported the data, so nothing has been cleaned yet; the dataset simply has no missing values, which is why every count in the table is zero and no plot was generated. If any values had been missing, the table would show their counts and a plot would have appeared. Since this didn’t happen, we’re good to move on!
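To see what the other branch looks like, here is a small sketch on a made-up data frame that does contain gaps. The column names mirror the dataset, but the values are invented for illustration.

```r
# Toy data with one missing value per column
toy <- data.frame(
  Administrative = c(1, NA, 3),
  BounceRates    = c(0.2, 0.0, NA),
  Month          = c("Feb", "Mar", NA)
)

# Count missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Same decision as in the EDA code: only act when something is missing
if (any(missing_counts > 0)) {
  message("Columns with missing values: ",
          paste(names(missing_counts)[missing_counts > 0], collapse = ", "))
} else {
  message("No missing values in the dataset.")
}
```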
Step 7: Sequence Analysis Preparation
Now that we have a solid understanding of our data, it's time to prepare for the sequence analysis. We'll create sequences for each session, which will help us understand user navigation paths on our website. Sequence analysis involves identifying patterns in the order of events. In this case, we'll examine how users navigate through different page types during their sessions. We'll focus on page types such as administrative, informational, and product-related pages.
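Before the full preparation code, here is a tiny sketch of the core idea, using one invented session. Each page is paired with the page that follows it, and the last page leads to "Exit". These from/to pairs are what a Sankey diagram draws as flows.

```r
# One toy session, in the order the pages were visited (invented for illustration)
pages <- c("Administrative", "Informational", "ProductRelated", "Informational")

# Pair each page with the one that follows it; the last page leads to "Exit"
transitions <- data.frame(
  from = pages,
  to   = c(pages[-1], "Exit")
)
print(transitions)

# Count how often each transition occurs
transition_counts <- table(paste(transitions$from, "->", transitions$to))
print(transition_counts)
```

With thousands of sessions, the same pairing is done within each session and the counts are summed across sessions.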
✏️ NOTE: I knew from the beginning that I wanted to show user pathways in an interactive Sankey diagram, so I added the preparation steps for the Sankey data at the bottom of the code snippet. This will set us up for creating an insightful visualization that clearly shows user flows through the website.
# Function to create sequence for each session with more granular subcategories
create_sequence <- function(row) {
  sequence <- c()
  if (row$Administrative > 0) {
    sequence <- c(sequence, rep("Administrative", row$Administrative))
  }
  if (row$Informational > 0) {
    sequence <- c(sequence, rep("Informational", row$Informational))
  }
  if (row$ProductRelated > 0) {
    # Hypothetical segmentation of ProductRelated:
    # the six labels are assigned in a repeating cycle, one per product page view
    product_steps <- c("ViewProduct", "AddToWishlist", "CompareProduct", "AddToCart", "StartCheckout", "CompletePurchase")
    segments <- rep(product_steps, length.out = row$ProductRelated)
    sequence <- c(sequence, segments)
  }
  return(paste(sequence, collapse = " -> "))
}
# Give each session an ID, then apply the function to create sequences for all rows
data <- data %>%
  mutate(session_id = row_number()) %>%
  rowwise() %>%
  mutate(sequence = create_sequence(pick(everything()))) %>%
  ungroup()
# Split sequences into individual steps (one row per step)
data_long <- data %>%
  separate_rows(sequence, sep = " -> ")
# Create next page column within each session; the last step leads to "Exit"
# (grouping by row_number() here would put every step in its own group,
# so every next_page would be "Exit")
data_long <- data_long %>%
  group_by(session_id) %>%
  mutate(next_page = lead(sequence, default = "Exit")) %>%
  filter(next_page != sequence) %>%
  ungroup()
# Remove empty steps (sessions with no recorded page views)
data_long <- data_long %>%
  filter(sequence != "")
# Prepare data for Sankey diagram
sankey_data <- data_long %>%
  count(sequence, next_page, name = "value")
# Create nodes with descriptive labels
nodes <- data.frame(name = unique(c(sankey_data$sequence, sankey_data$next_page)))
nodes <- nodes %>%
  mutate(node_id = row_number() - 1)
# Identify key steps
key_steps <- c("AddToCart", "StartCheckout", "CompletePurchase")
# Add a highlight column to nodes
nodes <- nodes %>%
  mutate(highlight = ifelse(name %in% key_steps, "key", "normal"))
# Update sankey_data with node IDs for source and target
sankey_data <- sankey_data %>%
  left_join(nodes, by = c("sequence" = "name")) %>%
  rename(source = node_id) %>%
  left_join(nodes, by = c("next_page" = "name")) %>%
  rename(target = node_id) %>%
  select(sequence, next_page, value, source, target) # Ensure no duplicate columns
# Print sankey_data to check the columns after the joins
print("Refined Sankey Data after joins:")
print(sankey_data)
📘 Explanation: The above code creates sequences for each user session, breaks them into individual steps, and prepares the data for visualization. We identify key steps in the sequence (e.g., "AddToCart", "StartCheckout", "CompletePurchase") and highlight them for further analysis. The sankey_data dataframe is refined to include only the necessary columns for creating a Sankey diagram. Keep in mind that the product sub-steps are a hypothetical segmentation: the dataset only records how many product-related pages a session viewed, so the code assigns the six labels in a repeating cycle. This sets us up for creating an interactive Sankey diagram that clearly shows user pathways, helping us visualize and understand user navigation patterns on the website.
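To make the hypothetical segmentation concrete, this sketch shows what create_sequence() does with the product-related page count of a single session. The counts 8 and 3 are invented examples.

```r
product_steps <- c("ViewProduct", "AddToWishlist", "CompareProduct",
                   "AddToCart", "StartCheckout", "CompletePurchase")

# A session with 8 product-related page views cycles through the six labels
# and then starts over with the first two
labels_8 <- rep(product_steps, length.out = 8)
print(labels_8)

# A session with only 3 product-related page views never gets past "CompareProduct"
labels_3 <- rep(product_steps, length.out = 3)
print(labels_3)
```

The labels depend only on how many product pages were viewed, not on what the user actually did, so they illustrate the technique rather than measure real cart or checkout behavior. With your own analytics data you would replace this step with real page or event names.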
Next, we'll visualize these sequences using a Sankey diagram to gain insights into user navigation paths and identify potential drop-off points.
Step 8: Visualize User Pathways with a Sankey Diagram
Now for the moment of truth. Let’s create an interactive diagram that will help anyone who views it understand how users navigate through our website and where they tend to drop off.
# Create Sankey Diagram
library(networkD3)
# Define nodes
nodes <- data.frame(
  name = c("ViewProduct (70,648 users, 16.58%)",
           "AddToCart (64,001 users, 15.02%)",
           "StartCheckout (62,058 users, 14.57%)",
           "CompletePurchase (60,193 users, 14.13%)",
           "AddToWishlist (68,247 users, 16.02%)",
           "Administrative (28,546 users, 6.70%)",
           "CompareProduct (66,102 users, 15.52%)",
           "Informational (6,209 users, 1.46%)",
           "Exit (64,001 users, 15.02%)")
)
# Define links (source and target are zero-based positions in nodes)
links <- data.frame(
  source = c(0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7),
  target = c(1, 4, 2, 3, 8, 3, 8, 8, 8, 8, 8, 1, 8, 8, 8, 8),
  value = c(70648, 70648, 62058, 62058, 68247, 60193, 60193, 60193, 60193, 60193, 68247, 28546, 28546, 66102, 66102, 6209)
)
# Create Sankey plot
sankey <- sankeyNetwork(
  Links = links, Nodes = nodes,
  Source = "source", Target = "target",
  Value = "value", NodeID = "name",
  units = "Users", fontSize = 14, nodeWidth = 30
)
# Save Sankey plot as HTML
saveNetwork(sankey, "granular_highlighted_sankey_diagram.html", selfcontained = TRUE)
✏️ NOTE: This code snippet saves the Sankey diagram as an HTML file in your R project folder. Open the HTML file in your preferred browser and start clicking and hovering around. These kinds of visuals allow stakeholders to investigate and answer their own questions. It’s truly remarkable what you can do for free in R that would cost hundreds, if not thousands, of dollars a month for the same visual through a third-party app.
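The nodes and links above are typed in by hand. As an alternative, here is a minimal sketch of building them directly from a transition table shaped like the sankey_data from Step 7. The small table below is invented so the example runs on its own, and NodeGroup is used to color nodes by the key/normal flag that Step 7 prepares.

```r
# Toy transition counts standing in for the sankey_data built in Step 7
sankey_data <- data.frame(
  sequence  = c("Administrative", "ViewProduct", "ViewProduct", "AddToCart"),
  next_page = c("ViewProduct", "AddToCart", "Exit", "Exit"),
  value     = c(5, 3, 2, 3)
)

# One node per distinct page name, flagged as a key step or not
nodes <- data.frame(name = unique(c(sankey_data$sequence, sankey_data$next_page)))
key_steps <- c("AddToCart", "StartCheckout", "CompletePurchase")
nodes$highlight <- ifelse(nodes$name %in% key_steps, "key", "normal")

# networkD3 expects zero-based positions in the nodes table
links <- data.frame(
  source = match(sankey_data$sequence, nodes$name) - 1,
  target = match(sankey_data$next_page, nodes$name) - 1,
  value  = sankey_data$value
)

# Only draw the diagram when networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  sankey <- networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target",
    Value = "value", NodeID = "name",
    NodeGroup = "highlight",  # color nodes by the key/normal flag
    units = "Steps", fontSize = 14, nodeWidth = 30
  )
  print(class(sankey))
}
```

Deriving the links from the data keeps the diagram in sync with the analysis: rerun the script on next month's export and the flows update without retyping any numbers.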
Now let's move on to interpreting the results and making data-driven decisions!
Interpreting the Results
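Now that we have our interactive Sankey diagram, let's dive into what the data is telling us about user behavior on our website. One caveat before we start: the dataset has 12,330 sessions, so the counts in the node labels (for example, 70,648 for ViewProduct) are page-view steps rather than unique users, and the product sub-steps come from the hypothetical segmentation in Step 7. Read the percentages as shares of page-view steps.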
Key Insights
High Drop-off at ViewProduct:
Observation: ViewProduct is the largest node (16.58%). Many journeys start with viewing a product, but many of them drop off at this stage.
Implication: This could indicate that users are not finding the product information compelling enough to continue their journey. It might be worth revisiting the content and layout of the product pages.
AddToWishlist and AddToCart:
Observation: A significant portion of users (16.02% and 15.02%, respectively) move from viewing a product to adding it to their wishlist or cart.
Implication: Users are interested in the products but might not be ready to purchase immediately. Consider implementing strategies to convert these wishlists and cart additions into purchases, such as email reminders or limited-time offers.
Checkout and Purchase Steps:
Observation: There is a noticeable drop-off during the checkout process (14.57% start checkout, but only 14.13% complete the purchase).
Implication: The checkout process might be too complicated or there could be unexpected costs (e.g., shipping fees) that deter users from completing the purchase. Simplifying the checkout process and being transparent about costs upfront could help reduce this drop-off.
Administrative and Informational Pages:
Observation: A smaller portion of users (6.7% and 1.46%) spend time on administrative and informational pages.
Implication: While these pages are less frequented, they are still essential for users seeking specific information. Ensure these pages are easy to navigate and contain relevant information to support users’ needs.
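The percentages quoted above can be reproduced from the counts in the node labels. This sketch copies those counts (leaving out Exit) and shows two calculations: each node's share of the total, and the relative drop between starting and completing checkout.

```r
# Node counts copied from the diagram labels in Step 8 (Exit excluded)
counts <- c(
  ViewProduct = 70648, AddToWishlist = 68247, CompareProduct = 66102,
  AddToCart = 64001, StartCheckout = 62058, CompletePurchase = 60193,
  Administrative = 28546, Informational = 6209
)

# Each label's percentage is its share of the total count
shares <- round(100 * counts / sum(counts), 2)
print(shares)

# Relative drop between starting and completing checkout
checkout_drop <- 1 - counts[["CompletePurchase"]] / counts[["StartCheckout"]]
print(round(100 * checkout_drop, 2))
```

The gap between 14.57% and 14.13% looks tiny because both numbers are shares of everything. Measured against the checkout starts alone, the drop is about 3%, which is the more useful figure when you discuss the checkout funnel with stakeholders.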
Actionable Insights
Based on the insights derived from the Sankey diagram, we can implement several specific actions to enhance user experience and increase conversions on our website.
1. Improve Product Pages
Observation: Many users start their journey by viewing a product but drop off at this stage.
Action:
Enhance Visual Content: Use high-quality images and videos to showcase the products. Include multiple views and close-up shots.
Detailed Descriptions: Provide thorough and engaging product descriptions that highlight key features, benefits, and specifications.
User Reviews and Ratings: Integrate user reviews and ratings to build trust and provide social proof.
2. Simplify the Checkout Process
Observation: There is a noticeable drop-off during the checkout process.
Action:
Streamline Checkout Steps: Reduce the number of steps in the checkout process. Consider implementing a one-page checkout system.
Transparency in Costs: Ensure that all costs, including shipping and taxes, are clearly displayed early in the checkout process to avoid surprises.
Guest Checkout Option: Allow users to checkout as guests without requiring account creation, simplifying the process for new customers.
3. Follow-Up on Wishlists and Abandoned Carts
Observation: A significant portion of users add items to their wishlist or cart but do not complete the purchase.
Action:
Automated Email Campaigns: Set up automated email reminders for users who have abandoned their carts or added items to their wishlist. Include images of the products and a clear call-to-action.
Limited-Time Offers: Offer discounts or limited-time promotions to incentivize users to complete their purchases.
Retargeting Ads: Use retargeting ads to remind users of the products they viewed or added to their cart, encouraging them to return and complete their purchase.
4. Optimize Administrative and Informational Pages
Observation: A smaller portion of users spend time on administrative and informational pages.
Action:
Improve Navigation: Ensure these pages are easily accessible from the main navigation menu and relevant product pages.
Clear and Concise Information: Provide clear, concise, and relevant information that addresses common user queries. Use bullet points, headings, and FAQs to make the content easy to scan.
Customer Support Integration: Include links to customer support, live chat options, or contact forms to assist users who need further help.
Reality Check
Let’s pause for a moment to reflect on what we’ve just accomplished. Imagine presenting this interactive diagram to your Chief Marketing Officer and showcasing how you can pinpoint exactly where users drop off and which areas of the website need improvement.
This is the kind of work that not only garners respect but also influences the decision-makers in your company. It’s about adding value from the ground up, embodying the essence of servant leadership to inspire and drive user-centered change.
RStudio is free, and so is this blog series. You have everything you need to level up your skills. If you aim to be a true mixed methods researcher, incorporating this level of data analysis should be a key part of your toolkit. You've got this!
Conclusion
Well, that’s it. You’ve completed another tutorial! I can still remember presenting this interactive pathways diagram to my Chief Marketing Officer. She told me she felt empowered as she hovered around, making sense of the data. She absolutely loved it, and I’m confident your stakeholders will too.
Feedback
I hope you found this tutorial on sequence analysis using R and RStudio helpful. If you have any questions, suggestions, or feedback, please feel free to reach out. Your input is invaluable as I continue to create content to help you leverage data in UX research. Thanks again!








