Know Your Customers: Market Basket Analysis with R

A step-by-step guide to Market Basket Analysis

Joel Choe
8 min read · Jun 23, 2021
Source: https://www.kissclipart.com/grocery-store-transparent-clipart-grocery-store-sh-nozxps/

Introduction

As a retail business owner, you need to know your customers. But what should you do, and where do you start?

In this article, I will introduce a way to get to know your customers using hard data and a well-known technique called Market Basket Analysis.

What is Market Basket Analysis?

Market Basket Analysis is a technique commonly used by retailers to identify associations between different items within a transaction database. Often, these associations can be analyzed so that retailers can better understand patterns in the purchasing behaviors of their customers.

Depending on the business context, applications of such analyses can vary. Possible applications include:

  1. Promotions
  2. Customizing the store layout
  3. Online Recommendation Engines

Done well, such an analysis can bring a lot of value to a company. Hence, today, I will be covering the basics behind Market Basket Analysis and how you can replicate a similar analysis.

Preprocessing

For today’s exercise, we will use four R libraries to aid us in our analysis. Their purposes are as follows:

tidyverse : Data Manipulation
data.table : Converting a Data Frame into a Transaction dataset
arules / arulesViz : Market Basket Analysis

library(tidyverse)
library(arules)
library(arulesViz)
library(data.table)

For this exercise, I obtained a retail dataset from the UCI Machine Learning Repository. The dataset contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011.

Before the analysis, we first need to load the dataset and preprocess the data.

In the context of this example, I will only be analyzing the transactions that have taken place within the United Kingdom. Depending on the dataset you are working with, you might have to preprocess your data differently. I would highly suggest that you check the structure of your data.

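# Check the structure (assumes the dataset is already loaded as retail_data)
str(retail_data)

# Parse the invoice date and keep only United Kingdom transactions
united_kingdom <- retail_data %>%
  mutate(InvoiceDate = as.Date(InvoiceDate, format = "%Y-%m-%d %H:%M:%S")) %>%
  filter(Country == "United Kingdom")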

If you are using the dataset from the UCI repository, the resulting dataset should look similar to this.

Table 1: Sample Dataset after preprocessing

The two columns that matter to us are Invoice and Description. The details of the two columns are as follows:

Invoice : Unique 6-digit integral number assigned to each transaction
Description : Item name

Having preprocessed the dataset, we now need to convert it into a ‘transactions’ dataset. To do so, we use the dcast() function from the data.table library.

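# Spread each invoice's distinct items across columns (one row per invoice)
invoiced_items <- dcast(
  setDT(united_kingdom %>% group_by(Invoice) %>% select(Invoice, Description) %>%
          distinct(Description, .keep_all = TRUE)),
  Invoice ~ rowid(Invoice), value.var = "Description") %>%
  select(!Invoice)

# write.csv() ignores col.names, so write.table() is used to omit the header row
write.table(invoiced_items, "invoiced_items.csv", sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "")

# Read the file back in 'basket' format: one transaction per line
transaction <- read.transactions("invoiced_items.csv", format = "basket", sep = ",")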

The resulting ‘transactions’ dataset should look similar to this. Each row contains the items belonging to one invoice.

Table 2: Transactions Dataset
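
As an aside, the CSV round trip is not the only route. The sketch below shows a common alternative, not the approach used in this article: split the item names by invoice and coerce the resulting list straight to a ‘transactions’ object. The data frame toy and its values are made up for illustration; only the column names Invoice and Description come from the dataset.

```r
library(arules)

# Hypothetical long-format data: one row per invoice line (not real UCI rows)
toy <- data.frame(
  Invoice = c("489434", "489434", "489435", "489435", "489435"),
  Description = c("COFFEE", "SUGAR", "COFFEE", "MILK", "COFFEE"),
  stringsAsFactors = FALSE
)

# Group item names by invoice and drop repeats within an invoice;
# the coercion below rejects baskets that contain duplicated items
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Coerce the named list of baskets to a 'transactions' object
toy_transactions <- as(baskets, "transactions")

inspect(toy_transactions)
```

Because no file is written, item names that contain commas are not split into separate items, which can happen when an unquoted CSV is read back in.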

We can now move on to the analysis.

Exploratory Analysis

First, we can identify the most frequent items bought to determine their popularity.

itemFrequencyPlot(transaction, topN = 10, type = 'absolute')
Fig 1: Item Frequency Plot for top 10 Items

From our frequency plot above (Fig 1), it is clear that the popular items include ‘WHITE HANGING HEART T-LIGHT HOLDER’, ‘REGENCY CAKESTAND 3 TIER’ and ‘JUMBO BAG RED RETROSPOT’.
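
If you want the numbers behind the plot, itemFrequency() from arules returns the counts directly. Here is a minimal sketch on a made-up toy_transactions object (an assumption for illustration; in the article you would pass transaction instead).

```r
library(arules)

# Hypothetical toy baskets, used only to keep the example self-contained
toy_transactions <- as(list(
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("COFFEE")
), "transactions")

# Absolute counts per item, most frequent first
counts <- sort(itemFrequency(toy_transactions, type = "absolute"),
               decreasing = TRUE)

# Show the ten most frequent items (here there are only three)
head(counts, 10)
```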

Possible questions one might ask after generating the plot include: Am I maximizing my revenue by selling these popular items at the current price? Which customers tend to buy certain products?

Through answering these questions, we can mine insights to drive business decisions. However, is there a way to take our analysis one step further? One way is to apply the aforementioned Market Basket Analysis. This allows us to dig deeper to identify possible patterns that exist between the different items.

Market Basket Analysis — Theory

Apriori Algorithm

Before we begin our Market Basket Analysis, it is worth introducing the Apriori Algorithm, the key idea behind our entire analysis. The analysis relies on three main concepts, Support, Confidence and Lift, which I will explain in more detail below.

To aid our understanding of the Apriori Algorithm, I have listed the definition of several commonly used terms below.

Itemset: A collection of one or more items e.g. {COFFEE, SUGAR}
Candidate Rule: An expression of the form A → B where A and B are itemsets, e.g. {COFFEE, SUGAR} → {MILK}

How does the Apriori Algorithm work?

The Apriori algorithm takes an iterative, level-by-level approach to uncovering the itemsets within a set of transactions. It starts with single items and builds larger itemsets only from smaller ones that have already passed the support check.

At each level, the algorithm keeps the itemsets that are ‘frequent’ based on a minimum support criterion. This is important because we wouldn’t want to consider an itemset that appears in only a small proportion of recorded transactions.
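
To illustrate the level-by-level idea, here is a simplified sketch in base R. It is not how arules implements Apriori: it omits the usual subset-pruning shortcut and simply re-checks support for every candidate. The baskets, the min_support value and the support() helper are all made up for illustration; support is the share of baskets containing the itemset, as defined in the next subsection.

```r
# Hypothetical toy baskets (not from the UCI dataset)
baskets <- list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
)
min_support <- 0.3  # assumed threshold: an itemset must appear in at least 2 of 5 baskets

# Share of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: single items that meet the minimum support
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support(s, baskets) >= min_support, as.list(items))
all_frequent <- frequent

# Level k: join frequent (k-1)-itemsets, keep size-k candidates that pass
k <- 2
while (length(frequent) > 1) {
  pairs <- combn(length(frequent), 2, simplify = FALSE)
  candidates <- unique(lapply(pairs, function(p) {
    sort(union(frequent[[p[1]]], frequent[[p[2]]]))
  }))
  candidates <- Filter(function(s) length(s) == k, candidates)
  frequent <- Filter(function(s) support(s, baskets) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

# Print every frequent itemset found, one per line
for (s in all_frequent) cat("{", paste(s, collapse = ", "), "}\n")
```

On this toy data, TEA drops out at the first level, so no larger itemset containing TEA is ever considered. That early elimination is what keeps the search manageable on real transaction data.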

Support
The support of an itemset A is the proportion of transactions that contain A. For example, if 70% of all transactions contain the itemset {COFFEE}, the support of {COFFEE} is 0.7.
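
As a quick check of the definition, the sketch below computes support by hand in base R. The baskets are made-up toy data and support() is a helper defined here for illustration; it is not an arules function.

```r
# Hypothetical toy baskets (not from the UCI dataset)
baskets <- list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
)

# Share of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

support("COFFEE", baskets)              # 4 of 5 baskets -> 0.8
support(c("COFFEE", "SUGAR"), baskets)  # 2 of 5 baskets -> 0.4
```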

With the collection of ‘frequent’ itemsets, a collection of candidate rules can be formed based on these itemsets. For example, a ‘frequent’ itemset {COFFEE, SUGAR} may suggest candidate rules {COFFEE} → {SUGAR} and {SUGAR} → {COFFEE}.

Having obtained our different candidate rules, we can evaluate the suitability of these rules using the concept of confidence.

Confidence
Confidence measures how certain a discovered rule is. The confidence of a candidate rule A → B is the proportion of transactions containing A that also contain B. Mathematically, it looks like this: Confidence(A → B) = Support(A ∧ B) / Support(A)
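
The formula translates almost directly into base R. This is a minimal sketch on made-up baskets; support() and confidence() are helpers defined here for illustration, not arules functions.

```r
# Hypothetical toy baskets (not from the UCI dataset)
baskets <- list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
)

# Share of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Confidence(A -> B) = Support(A and B together) / Support(A)
confidence <- function(a, b, baskets) {
  support(union(a, b), baskets) / support(a, baskets)
}

confidence("SUGAR", "COFFEE", baskets)  # 0.4 / 0.6, about 0.67
confidence("COFFEE", "SUGAR", baskets)  # 0.4 / 0.8 = 0.5
```

Note that confidence is not symmetric: the two directions of the same itemset give different values because the denominators differ.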

Often, a relationship can be considered as interesting when the algorithm identifies it to have a measure of confidence greater than or equal to a predefined threshold. This predefined threshold is also known as the minimum confidence.

When choosing a minimum confidence, it is helpful to keep in mind that a higher confidence means the rule holds in a larger share of the transactions that contain A.

However, confidence alone is not sufficient. Given a rule A → B, confidence considers only A and the co-occurrence of A and B; it ignores how common B is on its own. If B appears in almost every transaction, the confidence will be high even when A has nothing to do with B. One measure used to address this issue is lift.

Lift
Lift measures how many times more often A and B occur together than expected if they were statistically independent of each other. It indicates whether A and B are genuinely related rather than just happening to occur together. Mathematically, it looks like this: Lift(A → B) = Support(A ∧ B) / (Support(A) × Support(B))

Typically, a lift that is approximately equal to 1 suggests that A and B are statistically independent of each other, whereas a lift greater than 1 suggests that the rule is useful: A and B occur together more often than chance alone would predict.
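
To see why lift adds something beyond confidence, here is a minimal sketch in base R on made-up baskets. The helpers support() and lift() are defined here for illustration and are not arules functions.

```r
# Hypothetical toy baskets (not from the UCI dataset)
baskets <- list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
)

# Share of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Lift(A -> B) = Support(A and B together) / (Support(A) * Support(B))
lift <- function(a, b, baskets) {
  support(union(a, b), baskets) / (support(a, baskets) * support(b, baskets))
}

lift("SUGAR", "COFFEE", baskets)  # 0.4 / (0.6 * 0.8), about 0.83
lift("MILK", "COFFEE", baskets)   # 0.4 / (0.4 * 0.8) = 1.25
```

In this toy data, {SUGAR} → {COFFEE} has a confidence of about 0.67, yet its lift is below 1, because COFFEE is already in 80% of baskets. {MILK} → {COFFEE} has a lift above 1, so MILK buyers pick up COFFEE more often than the average customer does.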

To find out more about Market Basket Analysis, you can check out the following link.

Market Basket Analysis — Code & Insights

For the purpose of this exercise, I chose to run the Apriori Algorithm on a minimum support of 0.01 and minimum confidence of 0.7. I have also set the maximum length of an itemset to be 3 and sorted the rules by their confidence. Feel free to adjust the parameters as you wish!

rules <- apriori(transaction, parameter = list(supp = 0.01, conf = 0.7, maxlen = 3))
rules <- sort(rules, by = 'confidence', decreasing = TRUE)
Table 3: Summary of Candidate Rules
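
To print a rules table like this yourself, inspect() is the usual call. Below is a minimal, self-contained sketch on made-up toy baskets. The thresholds are chosen for the toy data rather than taken from the analysis above, and minlen = 2 is my own addition so that rules with an empty left-hand side are skipped.

```r
library(arules)

# Hypothetical toy baskets (not from the UCI dataset)
toy_transactions <- as(list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
), "transactions")

# Mine rules; verbose = FALSE silences the progress log
toy_rules <- apriori(
  toy_transactions,
  parameter = list(supp = 0.3, conf = 0.7, minlen = 2, maxlen = 3),
  control = list(verbose = FALSE)
)
toy_rules <- sort(toy_rules, by = "confidence", decreasing = TRUE)

# Show the top rules with their support, confidence and lift
inspect(head(toy_rules, n = 5))

# The same measures as a plain data frame, handy for further filtering
quality(toy_rules)
```

On the real data, the equivalent call would be inspect(head(rules, n = 12)) or similar, depending on how many rules you want to see.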

Once the algorithm produces the set of candidate rules, we should take some time to interpret and understand these rules. The interpretation of the rules is pretty intuitive. Let’s take the first rule, {SUGAR} → {COFFEE}, as an example.

From Table 3, we can observe that the itemset {SUGAR, COFFEE} appears in approximately 1.2% of recorded invoices. From the confidence, we can see that 100% of customers who purchased ‘SUGAR’ also purchased ‘COFFEE’. Looking at the lift, we can also see that there is a strong positive association between ‘COFFEE’ and ‘SUGAR’.

How exactly can we use this insight?

One way this insight can be used by retailers is in the form of online recommendation engines. Having seen that customers are extremely likely to buy ‘COFFEE’ if they buy ‘SUGAR’, retailers can recommend ‘COFFEE’ to customers if they have already placed ‘SUGAR’ in their checkout basket. This is one way retailers can subtly entice customers to make more purchases.
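
A very small version of such a lookup can be sketched as follows. Everything here is illustrative: the toy baskets, the thresholds, and the helper recommend_for(), which is a hypothetical function of my own and not part of arules. It returns the right-hand-side items of every rule whose left-hand side contains the given item.

```r
library(arules)

# Hypothetical toy baskets (not from the UCI dataset)
toy_transactions <- as(list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE", "MILK"),
  c("TEA", "SUGAR"),
  c("COFFEE")
), "transactions")

toy_rules <- apriori(
  toy_transactions,
  parameter = list(supp = 0.3, conf = 0.7, minlen = 2, maxlen = 3),
  control = list(verbose = FALSE)
)

# Hypothetical helper: items to suggest once `item` is in the basket.
# `item` must be a label that exists in the transactions.
recommend_for <- function(item, rules) {
  matched <- rules[lhs(rules) %in% item]
  as.character(unique(unlist(LIST(rhs(matched)))))
}

recommend_for("MILK", toy_rules)  # items suggested for a basket with MILK
recommend_for("TEA", toy_rules)   # no rule mentions TEA on the left-hand side
```

A production recommender would also need to rank suggestions, for example by lift, and exclude items already in the basket. This sketch only shows where the mined rules plug in.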

I would highly encourage you to spend some time understanding the rules that have been generated. After all, it is the insights that we uncover from understanding these rules that ultimately add value to an organization.

If you would like a visual representation of the rules plotted according to their support, confidence and lift, you can get one by calling the plot() function.

plot(rules)
Fig 2: Scatter Plot for the 12 Candidate Rules

If you would like to visualize some of these associations in the form of a graph, you can do so by setting the method to “graph”.

plot(rules, method = "graph")
Fig 3: Visual Representation of the Candidate Rules

Finally, if you would prefer something more interactive, setting the engine to “interactive” does the trick.

plot(rules, method = "graph", engine = "interactive", shading = NA)
Fig 4: Visual Representation of the Candidate Rules

Final Thoughts

To conclude, Market Basket Analysis is an application of an unsupervised learning method called Association Rules, and it is just one way to put that method to work. Done in conjunction with the right industry knowledge, it can bring much value to an organization.

However, it is worth noting that the concept of Association Rules is not limited to Market Basket Analysis. Association Rules can also be used for other purposes such as stock analysis and medical diagnosis — the possibilities are endless.

With this, thank you for your time and I hope that this article was beneficial to you! Do reach out if you have questions regarding this article or the implementation of the code :~)

You can check out the full project on GitHub here.

Joel Choe

A final year Data Science and Analytics undergraduate specializing in Artificial Intelligence