R for UX Researchers Series: Article #9
Tutorial: Fraud Detection Using Machine Learning (Random Forest Modeling)
Summary: This tutorial guides you through using Random Forest models in R to detect fraudulent transactions. You'll learn how to subset the data, train a model, and evaluate its performance. We'll also go through the process of visualizing the model's effectiveness and producing plots that help tell the story hidden in a huge dataset. By the end, you'll have worked through an advanced machine learning technique that can be applied in real-world scenarios.
About This Tutorial (A Disclaimer)
✏️ NOTE: Skip to 'The Scenario' below if you are familiar with R.
This tutorial is intended as the final case study in the R for UX Researchers Series. I designed the series to help you learn R by simulating nine real-world projects, all based on my own experience. The prior articles in this series have progressively built up to this advanced look into machine learning. I learned all this stuff when I worked at Minitab, the world's leading statistical software company, and I'm happy to share.
This tutorial gets advanced quickly, and I struggled to explain some foundational ideas as the tutorial progresses through its steps. My goal was to avoid turning this into an academic document and instead provide a practical, real-world explanation. If you're unfamiliar with R, I recommend reviewing the earlier articles in this series before diving into this one. Here they are, in order for you to explore:
Okay, enough of this nonsense. Let's get into the tutorial!
The Scenario
I once did some UX quant work for a small regional bank as a consultant. One day, the Chief Technology Officer (CTO) reached out to me with a problem:
"We're experiencing an uptick in fraudulent transactions. Can you help me identify specific transaction characteristics that will help us flag possible fraudulent behavior moving forward?"
This was a big deal. Fraudulent activities were not just causing financial losses but also hurting customer trust. To tackle this, I suggested using a Random Forest model to analyze the transaction data. This model could pinpoint the features likely to signal future fraud, giving us the insights needed to improve the bank's fraud detection systems. The CTO wanted the ability to predict potential fraudulent transactions in the future, and this was the way I could provide him with the data he needed.
With R and RStudio in mind, I set out to build the bank a fraud detection model. This tutorial walks through the steps I took and shows how to replicate this analysis with your own data.
Why Random Forest?
When the CTO asked for help in detecting fraudulent transactions, my first thought was to use a Random Forest model. Why? Because Random Forest is like having a whole team of decision trees working together to make the best possible predictions. This is something I learned while working at Minitab. Random Forest is a machine learning method that can handle large datasets with many features, just like the one we were dealing with at the bank.
A Random Forest model works by building many decision trees during training and then combining their outputs: the most common class across the trees for classification, or the average prediction for regression. Each tree is trained on a random sample of the rows, and each split within a tree considers only a random subset of the features, which makes the model more robust and reduces the chance of overfitting.
This means it's great for finding patterns in data, even if those patterns are complicated or subtle.
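To make the "team of trees" idea concrete, here is a minimal base R sketch. The votes are made up for illustration and this is not how the randomForest package stores its trees; it only shows how individual votes turn into one prediction.

```r
# Toy illustration of how a forest combines its trees (not the randomForest internals).
# Each row is one transaction; each column is one hypothetical tree's vote (1 = fraud, 0 = not fraud).
tree_votes <- matrix(
  c(0, 0, 1, 0, 0,
    1, 1, 0, 1, 1,
    0, 1, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("txn_", 1:3), paste0("tree_", 1:5))
)

# Classification: the forest predicts the most common vote (the mode) for each row
majority_vote <- apply(tree_votes, 1, function(votes) {
  as.integer(names(which.max(table(votes))))
})

# The share of trees voting "fraud" doubles as a rough fraud probability
fraud_share <- rowMeans(tree_votes)

print(majority_vote)
print(fraud_share)
```

The second transaction gets flagged because four of its five trees voted for fraud. That vote share is the same idea behind the probabilities used for the ROC and Precision-Recall curves later on.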
For our fraud detection problem, the Random Forest could help identify which transaction characteristics were most indicative of fraud. It would look at various aspects like transaction amounts, balances, and transaction types, and determine their importance in predicting fraudulent activity. This way, the bank could better understand what to watch out for and improve their fraud detection processes.
Why Not Just Use Excel or Google Sheets?
To analyze fraud detection effectively, we need tools that can handle complex data and provide sophisticated analytical techniques. Here's why R and RStudio are far superior to Excel or Google Sheets for this task:
Advanced Machine Learning Capabilities: R has specialized packages like randomForest that are designed for machine learning tasks. Excel or Google Sheets simply can't do machine learning without a lot of painful workarounds. Don't even try it!
Handling Large and Complex Data: R manages and manipulates large datasets a lot better than Excel and Google Sheets.
Advanced Data Cleaning: R cleans data better than Excel or Google Sheets, hands down.
Reproducibility and Automation: R scripts automate the entire process, keeping things consistent and saving time when updating reports or applying them to different datasets.
Advanced Visualization: R can produce complex plots with packages like ggplot2, which Excel or Google Sheets cannot match.
Using R and RStudio for fraud detection lets us handle transaction data better than the off-the-shelf alternatives. Okay, let's gooooo!
Prerequisites
Before starting the tutorial, ensure you have completed the basic setup for R and RStudio as described in the "Getting Started with R & RStudio Tutorial." Additionally, if you are using Windows, you will need to install Rtools.
Step 1: Download the Dataset
First, download the Synthetic Financial Datasets For Fraud Detection from Kaggle:
Visit the Synthetic Financial Datasets For Fraud Detection page on Kaggle.
Click the Download button in the top right of the page to get the file named PS_20174392719_1491204439457_log.csv.
Save the downloaded file on your computer in a location you can easily access.
✏️ NOTE: Make a note of the full path to the PS_20174392719_1491204439457_log.csv file on your computer. You'll need that path to import the dataset in Step 5 below.
Step 2: Start a New RStudio Project
Open RStudio.
Go to File > New Project > New Directory > New Project.
Name your project (e.g., "Fraud_Detection_RF") and choose a location on your computer.
Click Create Project.
Set Up Your Project Structure: Within your new project directory folder on your computer, create a new folder and name it ’data’.
Move the downloaded PS_20174392719_1491204439457_log.csv file into the new ’data’ folder you just created.

Step 3: Install Necessary Packages
Install the required packages by copying and pasting the code snippet below into the RStudio Console:
# Install Necessary Packages
install.packages("dplyr")
install.packages("caret")
install.packages("randomForest")
install.packages("ggplot2")
install.packages("pROC")
install.packages("precrec")
Step 4: Load the Libraries
Next, load the necessary libraries in your R script:
# Load Necessary Libraries
library(dplyr)
library(caret)
library(randomForest)
library(ggplot2)
library(pROC)
library(precrec)
✏️ NOTE: Disregard the Conflicts section and all the red text shown in the Console.
Step 5: Import the Dataset
Use a relative path to import the dataset. We're also going to read the dataset and take a quick look at its structure to understand what we're dealing with.
# Load the full dataset
data <- read.csv("path/to/your/dataset/PS_20174392719_1491204439457_log.csv")
# View the first few rows of the dataset
head(data)
# Summary of the dataset
summary(data)
# Check the structure of the dataset
str(data)
# Ensure the 'isFraud' column is a factor for stratified sampling
data$isFraud <- as.factor(data$isFraud)
✏️ NOTE: Replace "path/to/your/dataset/" with the correct path to the CSV file if you have placed it in a different directory.

📘 Explanation: This step 5 dives a bit deeper than the import data step of all the other articles in this series. After importing, I added some code to view the first few rows of the dataset like I usually do in these tutorials.
Next, I generate a statistical summary to get an overview of the dataset, including key metrics like mean, median, and ranges. This helps me understand the distribution and central tendencies of the data before subsetting it for the Random Forest model. We'll look at this again after subsetting to see what changed and if the structure is intact in later steps.
Finally, converting the 'isFraud' column to a factor ensures that it is treated as a categorical variable, which is essential for accurate modeling and analysis in subsequent steps. These prep steps ensure the data is clean and ready for analysis.
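If the factor conversion feels abstract, here is a small base R sketch using made-up labels (not the Kaggle data). It shows what as.factor() changes and how to check the class balance once the column is categorical.

```r
# Made-up labels standing in for data$isFraud (0 = legitimate, 1 = fraud)
is_fraud_raw <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0)

# As numbers, R would treat the column as a quantity; as a factor, it is a category
is_fraud <- as.factor(is_fraud_raw)
print(levels(is_fraud))  # the two classes, in order

# Count and share of each class, to see how imbalanced the data is
class_counts <- table(is_fraud)
class_share <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 2))
```

Running the same table() and prop.table() calls on the real isFraud column is a quick way to see just how rare fraud is before you start modeling.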
Step 6: Data Subsetting
Since the dataset is HUGE, it is essential to subset it for efficient processing while maintaining the fraud/non-fraud ratio.
# Stratified sampling to maintain fraud/non-fraud ratio
set.seed(123)
trainIndex <- createDataPartition(data$isFraud, p = 0.01, list = FALSE, times = 1)
subset_data <- data[trainIndex, ]
# Filter the data to include only CASH_OUT and TRANSFER types
subset_data <- subset_data %>% filter(type %in% c("CASH_OUT", "TRANSFER"))
# Create dummy variables for the type column
type_dummies <- model.matrix(~type - 1, data = subset_data) %>% as.data.frame()
# Combine the dummy variables with the rest of the data
subset_data <- cbind(type_dummies, subset_data[,-which(names(subset_data) %in% c("type"))])
# Ensure isFraud is a factor for modeling
subset_data$isFraud <- as.factor(subset_data$isFraud)
📘 Explanation: I subset the data at this point to keep processing fast, and I used stratified sampling so the 1% sample keeps roughly the same fraud/non-fraud ratio as the full dataset. Then, I focused only on the "CASH_OUT" and "TRANSFER" transaction types since they're more relevant for fraud detection (things I learned during step 5). Creating dummy variables for these transaction types turns the text column into 0/1 columns the model can use directly. This streamlines everything, making it more manageable for analysis.
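Here is a rough base R sketch of what stratified sampling and dummy coding do, using a made-up data frame (the toy object and its columns are assumptions for illustration). It approximates the idea behind createDataPartition() for a factor outcome rather than reproducing caret's exact procedure.

```r
set.seed(123)

# Made-up data standing in for the transactions: 1,000 rows, 5% fraud
toy <- data.frame(
  id = 1:1000,
  type = sample(c("CASH_OUT", "TRANSFER"), 1000, replace = TRUE),
  isFraud = factor(rep(c(0, 1), times = c(950, 50)))
)

# Stratified sampling by hand: draw 10% of the row indices within each class
sampled_rows <- unlist(lapply(split(seq_len(nrow(toy)), toy$isFraud), function(rows) {
  sample(rows, size = round(length(rows) * 0.10))
}))
toy_sample <- toy[sampled_rows, ]

# The fraud share in the sample matches the share in the full data
print(prop.table(table(toy$isFraud)))
print(prop.table(table(toy_sample$isFraud)))

# One 0/1 column per transaction type, as model.matrix() does in Step 6
type_dummies <- as.data.frame(model.matrix(~ type - 1, data = toy_sample))
print(head(type_dummies))
```

A plain random sample of 100 rows could easily end up with two fraud cases or nine. Sampling within each class is what keeps the ratio steady.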
Step 7: Exploratory Data Analysis (EDA)
Before getting into the model building, let's understand the data better through EDA. We'll visualize key aspects of the dataset to gain insights into the transaction patterns.
# Summary of the data after data subsetting
summary(subset_data)
# Plot the counts of non-fraud vs. fraud transactions with adjusted labels
ggplot(subset_data, aes(x = isFraud, fill = isFraud)) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 1.5, color = "white", data = subset_data %>% filter(isFraud == 0)) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", data = subset_data %>% filter(isFraud == 1)) +
  scale_x_discrete(labels = c("Non-Fraud", "Fraud")) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(x = "Transaction Status", y = "Count", title = "Transaction Types by Class") +
  theme_minimal() +
  theme(legend.position = "none")
# Plot the distribution of transaction amounts (trimmed below the 99th percentile)
max_transaction_amount <- max(subset_data$amount[subset_data$amount < quantile(subset_data$amount, 0.99)])
ggplot(subset_data %>% filter(amount <= max_transaction_amount), aes(x = amount, fill = isFraud)) +
  # Plot each bar as a share of all transactions so the percent labels are accurate
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 50, position = "identity", alpha = 0.65) +
  scale_x_continuous(labels = scales::dollar_format(), limits = c(0, max_transaction_amount)) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Percentage of Transactions", title = "Transaction Amount Distribution by Class") +
  theme_minimal() +
  theme(legend.position = "none")
📘 Explanation: This will help us visualize the balance between fraudulent and non-fraudulent transactions and understand the distribution of transaction amounts.
In the first plot, we visualize the counts of non-fraudulent and fraudulent transactions, making it easier to see the imbalance.
The second plot shows the distribution of transaction amounts, with a focus on the most common ranges, to provide a clearer picture of typical transaction values.
Step 8: Train and Evaluate the Random Forest Model
Now, let's move on to training and evaluating the Random Forest model. This step will include building the model, predicting the outcomes, and evaluating its performance.
# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(subset_data$isFraud, p = 0.8, list = FALSE)
train_data <- subset_data[trainIndex, ]
test_data <- subset_data[-trainIndex, ]
# Ensure isFraud is a factor for modeling
train_data$isFraud <- as.factor(train_data$isFraud)
test_data$isFraud <- as.factor(test_data$isFraud)
# Build the Random Forest model
set.seed(123)
rf_model <- randomForest(isFraud ~ ., data = train_data, ntree = 100, mtry = 3, importance = TRUE)
# Predict on the test data
test_predictions <- predict(rf_model, test_data, type = "response")
# Create confusion matrix
confusionMatrix(test_predictions, test_data$isFraud)
📘 Explanation: First, I split the data into training and testing sets, ensuring an 80-20 split to train the model effectively and test its performance. Converting 'isFraud' to a factor again ensures it is correctly treated as a categorical variable during modeling. Building the Random Forest model involves setting parameters like the number of trees (ntree) and the number of variables considered at each split (mtry), which helps in optimizing the model's performance.
After training, I use the model to predict outcomes on the test data.
Finally, the confusion matrix evaluates the model's performance by comparing predicted values with actual outcomes, providing insights into its accuracy and reliability.
More About the Confusion Matrix
5 Takeaways From The Confusion Matrix:
Accuracy: The model achieved an accuracy of 99.86%, indicating that it correctly predicted the class (fraud or non-fraud) for most of the transactions.
Sensitivity: The sensitivity is 1.0000. Because caret's confusionMatrix() treats the first factor level as the positive class by default, and that level is "0" (non-fraud) here, this means the model correctly identified every non-fraud transaction. In other words, it raised no false alarms.
Specificity (Recall for Fraud): The specificity is 0.5000. With "0" as the positive class, specificity describes the fraud class: the model caught only half of the actual fraud cases and missed the rest. This is the critical number for fraud detection, as missing a fraud case can be costly. Pass positive = "1" to confusionMatrix() if you want these metrics reported from the fraud class's point of view.
Kappa: The kappa value of 0.666 shows moderate agreement beyond chance between the predicted and actual classifications, which is good but indicates there is room for improvement.
Next Steps Context: The confusion matrix suggests high accuracy and no false positives, but the 0.5000 specificity points to missed fraud cases (false negatives). This led to further analysis and tuning steps to balance precision and recall better, as detailed in the following sections.
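It helps to see where these numbers come from. The base R sketch below rebuilds the metrics from a hand-made confusion matrix. The counts are made up, chosen only to resemble the figures above, and are not the tutorial's actual output. Note that caret's confusionMatrix() treats the first factor level ("0" here) as the positive class unless you pass the positive argument.

```r
# Made-up counts, chosen only to resemble the figures discussed above
actual    <- factor(rep(c("0", "1"), times = c(5524, 16)), levels = c("0", "1"))
predicted <- factor(c(rep("0", 5524), rep("0", 8), rep("1", 8)), levels = c("0", "1"))

cm <- table(Predicted = predicted, Actual = actual)
print(cm)

accuracy <- sum(diag(cm)) / sum(cm)

# Recall for each class: correct predictions for that class / actual cases of that class
recall_non_fraud <- cm["0", "0"] / sum(cm[, "0"])  # reported as Sensitivity when "0" is the positive class
recall_fraud     <- cm["1", "1"] / sum(cm[, "1"])  # reported as Specificity in that same setup

# Cohen's kappa: agreement beyond what chance alone would produce
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa_value <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, recall_non_fraud = recall_non_fraud,
              recall_fraud = recall_fraud, kappa = kappa_value), 4))
```

With fraud this rare, a model can miss half of the fraud cases and still score above 99% accuracy, which is why I never report accuracy on its own for this kind of problem.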
Step 9: Plot the ROC Curve
In this step, we'll generate a Receiver Operating Characteristic (ROC) curve to visualize the performance of our fraud detection model. The ROC curve helps us understand how well our model distinguishes between fraudulent and non-fraudulent transactions by plotting the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity).
# Predict probabilities instead of classes
test_probabilities <- predict(rf_model, test_data, type = "prob")[,2]
# Create ROC curve
roc_curve <- roc(test_data$isFraud, test_probabilities)
# Plot ROC curve
ggplot() +
  geom_point(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "coral1", size = 2.25) +
  geom_line(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "coral1", linewidth = 1.25) +
  geom_abline(linetype = "dashed", color = "steelblue3", linewidth = 1) +
  annotate("text", x = 0.75, y = 0.25, label = paste("AUC =", round(roc_curve$auc, 3)), size = 5, hjust = 0) +
  labs(
    title = "ROC Curve for Fraud Detection",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(linewidth = 0.5, linetype = 'dotted', colour = "grey"),
    panel.grid.minor = element_line(linewidth = 0.25, linetype = 'dotted', colour = "grey")
  )
📘 Explanation: Instead of predicting class labels directly, we predict the probabilities of the test data being classified as fraud. This allows us to plot the ROC curve. This visualization will help stakeholders quickly grasp the effectiveness of our fraud detection model, providing a clear visual representation of the model's performance.
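For intuition about what the ROC curve and AUC actually measure, here is a base R sketch with made-up scores and labels (not output from the model above). Each threshold gives one point on the curve, and the AUC can be read as the chance that a randomly chosen fraud case scores higher than a randomly chosen non-fraud case.

```r
# Made-up scores: model-estimated fraud probabilities and true labels (1 = fraud)
labels <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 1)
scores <- c(0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80, 0.95)

# TPR and FPR at one threshold: flag every transaction scoring at or above it
rates_at <- function(threshold) {
  flagged <- scores >= threshold
  c(tpr = sum(flagged & labels == 1) / sum(labels == 1),
    fpr = sum(flagged & labels == 0) / sum(labels == 0))
}
print(rates_at(0.5))
print(rates_at(0.3))

# AUC as the chance that a random fraud case outscores a random non-fraud case
wins <- outer(scores[labels == 1], scores[labels == 0], ">")
ties <- outer(scores[labels == 1], scores[labels == 0], "==")
auc_by_hand <- mean(wins + 0.5 * ties)
print(auc_by_hand)
```

Lowering the threshold from 0.5 to 0.3 catches the third fraud case but also flags one more legitimate transaction. That trade-off, repeated at every threshold, is what the curve traces out.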
Step 10: Plot Precision-Recall Curve
The Precision-Recall (PR) curve is another vital metric for evaluating the performance of a classification model, especially when dealing with imbalanced datasets like fraud detection. It provides a more informative picture of the classifier's performance by focusing on the minority class (fraudulent transactions) and showing the trade-off between precision and recall.
Precision measures the accuracy of the positive predictions made by the model. It's the ratio of true positive predictions to the total positive predictions (both true and false positives).
Recall (Sensitivity) measures the model's ability to capture all the actual positive instances in the dataset. It's the ratio of true positive predictions to the total actual positives (true positives and false negatives).
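As a quick worked example of those two ratios, here is a base R sketch with made-up counts for a single threshold (the numbers are illustrative only). It also shows the precision you would expect from flagging transactions at random, which is simply the share of fraud in the data.

```r
# Made-up counts for the fraud class at one threshold
true_positives  <- 8     # fraud flagged as fraud
false_positives <- 2     # legitimate transactions flagged as fraud
false_negatives <- 8     # fraud the model missed
total_cases     <- 5540  # all transactions in this hypothetical test set

precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)

# Flagging at random would be right only as often as fraud actually occurs
random_precision <- (true_positives + false_negatives) / total_cases

print(round(c(precision = precision, recall = recall,
              random_precision = random_precision), 4))
```

In this made-up case, 8 of the 10 flagged transactions really are fraud (precision), but only 8 of the 16 fraud cases got flagged (recall). Both are far above the random baseline, which is why the PR curve is so useful for rare events.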
Here's how you can create and plot the Precision-Recall curve:
# Predict probabilities instead of classes
test_probabilities <- predict(rf_model, test_data, type = "prob")[,2]
# Create Precision-Recall curve
pr_curve <- precrec::evalmod(scores = test_probabilities, labels = test_data$isFraud)
# Extract AUC value for the Precision-Recall curve
auc_table <- precrec::auc(pr_curve)
auc_value <- auc_table$aucs[auc_table$curvetypes == "PRC"]
# Extract the PRC data
prc_data <- data.frame(
  recall = pr_curve$prcs[[1]]$x,
  precision = pr_curve$prcs[[1]]$y
)
# Baseline precision for random guessing: the share of fraud cases in the test data
baseline_precision <- mean(test_data$isFraud == "1")
# Plot Precision-Recall curve
ggplot(prc_data, aes(x = recall, y = precision)) +
  geom_point(color = "coral1", size = 2.25) +
  geom_line(color = "coral1", linewidth = 1.25) +
  geom_hline(yintercept = baseline_precision, linetype = "dashed", color = "steelblue3", linewidth = 1) +
  annotate("text", x = 0.75, y = 0.25, label = paste("AUC =", round(auc_value, 3)), size = 5, hjust = 0) +
  labs(
    title = "Precision-Recall Curve for Fraud Detection",
    x = "Recall (Sensitivity)",
    y = "Precision"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(linewidth = 0.5, linetype = 'dotted', colour = "grey"),
    panel.grid.minor = element_line(linewidth = 0.25, linetype = 'dotted', colour = "grey")
  )
📘 Explanation: This plot has 5 main elements to it:
The x-axis represents Recall (Sensitivity).
The y-axis represents Precision.
The dashed horizontal line represents the baseline model performance (random guessing), which for a Precision-Recall curve sits at the share of fraud cases in the test data.
The coral1 line represents the model's Precision-Recall curve.
The annotated text shows the Area Under the Curve (AUC) value of the PR curve, providing a single scalar value to summarize the plot.
This data viz helps stakeholders understand the trade-off between precision and recall for the fraud detection model and how well the model performs in identifying fraudulent transactions.
Step 11: Feature Importance Analysis
In this final step, we will analyze the importance of various features in our dataset for predicting fraudulent transactions. This will help us understand which transaction attributes are most indicative of fraud. We will use the Random Forest model we built earlier to determine each feature's importance.
# Feature importance using Random Forest
importance_matrix <- importance(rf_model)
# Convert importance to a data frame and rescale the importance values
# (select columns by name: for a classifier, the first columns hold per-class scores)
importance_df <- data.frame(
  Feature = rownames(importance_matrix),
  MeanDecreaseAccuracy = importance_matrix[, "MeanDecreaseAccuracy"] / max(importance_matrix[, "MeanDecreaseAccuracy"]) * 100, # Rescale to % of the top feature
  MeanDecreaseGini = importance_matrix[, "MeanDecreaseGini"] / max(importance_matrix[, "MeanDecreaseGini"]) * 100 # Rescale to % of the top feature
)
# Rename the features to be more human-understandable
feature_labels <- c(
"amount" = "Dollar Amount",
"step" = "Time of Transaction",
"oldbalanceOrg" = "From Old Balance",
"newbalanceOrig" = "From New Balance",
"newbalanceDest" = "To New Balance",
"typeTRANSFER" = "Money Transfers",
"typeCASH_OUT" = "Money Cashed Out",
"nameDest" = "Name on Accounts"
)
importance_df$Feature <- recode(importance_df$Feature, !!!feature_labels)
# Remove 'isFlaggedFraud' from the data frame
importance_df <- importance_df[importance_df$Feature != "isFlaggedFraud",]
# Plot with meaningful feature names and rescaled importance
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top Feature Importances for Fraud Detection",
    x = "Transaction Features",
    y = "Relative Importance (% of Top Feature)"
  ) +
  theme_minimal()
📘 Explanation: We calculate the importance of each feature using the importance() function from the randomForest package. The MeanDecreaseAccuracy and MeanDecreaseGini values indicate how much each feature contributes to the model's accuracy and the purity of the nodes in the trees, respectively. The plot rescales MeanDecreaseAccuracy so the top feature scores 100, which means the bars show relative importance, not the probability that a transaction is fraudulent. This step helps us identify which transaction characteristics are most useful for predicting fraud, providing valuable insights for stakeholders to enhance fraud detection strategies.
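The idea behind MeanDecreaseAccuracy is easier to trust once you have done it by hand: shuffle one feature, re-score the model, and see how much accuracy drops. The sketch below is a simplified stand-in under stated assumptions. It uses made-up data, a logistic regression instead of a Random Forest, and a held-out test set, whereas randomForest computes this per tree on out-of-bag samples.

```r
set.seed(123)

# Made-up data: 'amount' drives the outcome, 'noise' does not
n <- 2000
toy <- data.frame(amount = rnorm(n), noise = rnorm(n))
toy$isFraud <- as.integer(toy$amount + rnorm(n, sd = 0.5) > 1)

train <- toy[1:1500, ]
test  <- toy[1501:2000, ]

# Stand-in model (logistic regression); the tutorial's model is a Random Forest
fit <- glm(isFraud ~ amount + noise, data = train, family = binomial)

accuracy_of <- function(newdata) {
  predicted <- as.integer(predict(fit, newdata, type = "response") > 0.5)
  mean(predicted == newdata$isFraud)
}

baseline <- accuracy_of(test)

# Shuffle one feature at a time and record how far accuracy falls
accuracy_drop <- sapply(c("amount", "noise"), function(feature) {
  shuffled <- test
  shuffled[[feature]] <- sample(shuffled[[feature]])
  baseline - accuracy_of(shuffled)
})
print(round(accuracy_drop, 3))
```

Shuffling a feature the model relies on hurts accuracy, while shuffling an irrelevant one barely matters. Keep in mind that this tells you how much the model leans on a feature, not whether high or low values of it point toward fraud.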
Interpreting the Results
After running our Random Forest model and evaluating its performance, we gathered several insights:
Confusion Matrix: This matrix shows us how many fraudulent and non-fraudulent transactions were correctly and incorrectly classified. High accuracy alone can mislead on data this imbalanced, so read it alongside sensitivity and specificity, which were far from balanced here (1.0000 vs. 0.5000).
ROC Curve: The ROC curve demonstrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity). The area under the curve (AUC) value is close to 1, meaning our model is excellent at distinguishing between fraudulent and non-fraudulent transactions.
Precision-Recall Curve: This curve focuses on the performance of the model in identifying fraudulent transactions. High precision and recall values indicate that the model not only captures a significant proportion of actual frauds but also minimizes false positives.
Feature Importance: This analysis reveals which transaction features are most predictive of fraud. For example, the "Dollar Amount" and "Time of Transaction" are significant indicators. This information can help the bank focus on monitoring and analyzing these key attributes more closely.
Actionable Insights
Based on our analysis, here are some actionable insights for the bank:
Monitor High-Risk Transactions: Transactions with large amounts and those occurring at specific times are more likely to be fraudulent. The bank can implement additional verification steps for these transactions.
Enhance Fraud Detection Systems: Using the identified key features, the bank can refine its fraud detection algorithms to prioritize these indicators, improving overall detection rates.
Customer Education: Educate customers about the importance of monitoring their accounts and reporting suspicious activity, especially for large transactions or unusual transaction times.
Regular Model Updates: Continuously update and retrain the fraud detection model with new data to ensure it adapts to evolving fraud patterns.
Conclusion
Well, that's it. If you've reached this point, you have finished the nine foundational R for UX Researchers Series tutorials. Woot, woot! Take time to reflect and celebrate. That wasn't easy, and you did it! I will be posting a few more supplementary tutorials over the next few weeks that address subscriber requests, but this concludes my organizational plan for the series.
I hope that going through these tutorials, one at a time, taught you R by actually using R in the real world.
For this tutorial, I hope you learned how leveraging Random Forest models in R and RStudio can be super powerful and add HUGE value. This approach helps identify fraudulent transactions more accurately and provides valuable insights into the key characteristics of fraudulent behavior. Imagine how you could use the power of machine learning in R for your own projects!
Feedback
I'd love to hear from you! How did this tutorial work for you? How did the full series work for you? Did you find the steps clear and the insights valuable? Your feedback helps me improve and tailor the content here on UX Researcher in the Wild. Please DM me or leave comments and suggestions below.













