Know Your Customers: Market Basket Analysis with R

A step-by-step guide to Market Basket Analysis



In this article, we will be introducing a way you can get to know your customers using hard data and a widely used technique: Market Basket Analysis.

What is Market Basket Analysis?

Market Basket Analysis is a data mining technique that uncovers associations between items customers tend to purchase together. Depending on the business context, applications of such analyses can vary. Possible applications include:

  1. Promotions
  2. Customizing the store layout
  3. Online Recommendation Engines

Done right, a proper analysis can bring much value to a company. Hence, today, I will be covering the basics behind Market Basket Analysis and how you can replicate a similar analysis.


tidyverse : Data Manipulation
data.table : Converting a Data Frame into a Transaction dataset
arules / arulesViz : Market Basket Analysis
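Assuming these packages are installed, they can be loaded along these lines; the file name in the read step is just a placeholder for wherever your local copy of the UCI data lives:

```r
# Load the packages used in this walkthrough
library(tidyverse)   # data manipulation
library(data.table)  # dcast() for reshaping into a transactions layout
library(arules)      # mining association rules with apriori()
library(arulesViz)   # visualizing the mined rules

# Read in the raw data; "online_retail_II.csv" is a placeholder file name
retail_data <- read_csv("online_retail_II.csv")
```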


For this exercise, I have obtained a retail dataset from the UCI Machine Learning Repository. The dataset contains all the transactions occurring between 01/12/2009 and 09/12/2011 for a UK-based, registered, non-store online retailer.

Prior to our analysis, we would first need to load the dataset and preprocess the data.

In the context of this example, I will only be analyzing the transactions that have taken place within the United Kingdom. Depending on the dataset you are working with, you might have to preprocess your data differently. I would highly suggest that you check the structure of your data.

# Check Structure
str(retail_data)

united_kingdom <- retail_data %>%
  mutate(InvoiceDate = as.Date(InvoiceDate, "%Y-%m-%d %H:%M:%S")) %>%
  filter(Country == "United Kingdom")

If you are using the dataset from the UCI repository, the resulting dataset should look similar to this.

Table 1: Sample Dataset after preprocessing

The two columns that are of importance to us are Invoice and Description. The details of the two columns are as follows:

Invoice : Unique 6-digit integral number assigned to each transaction
Description : Item name

Having preprocessed the dataset, we now need to convert it into a ‘transactions’ dataset. To do so, we use the dcast() function from the data.table library.

# Reshape the data so that each row lists the distinct items on one invoice
invoiced_items <- dcast(setDT(united_kingdom %>% group_by(Invoice) %>%
  select(Invoice, Description) %>% distinct(Description, .keep_all = TRUE)),
  Invoice ~ rowid(Invoice))

# Drop the invoice number so it is not treated as an item
invoiced_items[, Invoice := NULL]

# Write the baskets out without headers, then read them back in as transactions
write.table(invoiced_items, "invoiced_items.csv", sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "")

transaction <- read.transactions("invoiced_items.csv", format = "basket", sep = ",")

The resulting ‘transactions’ dataset should look similar to this. Each row lists the items corresponding to a single invoice.

Table 2: Transactions Dataset

We can now proceed over to the analysis portion.

Exploratory Analysis

itemFrequencyPlot(transaction, topN = 10, type = "absolute")
Fig 1: Item Frequency Plot for top 10 Items

From our frequency plot above (Fig 1), it is clear that the popular items include ‘WHITE HANGING HEART T-LIGHT HOLDER’, ‘REGENCY CAKESTAND 3 TIER’ and ‘JUMBO BAG RED RETROSPOT’.

Possible questions one might ask after generating this plot include: Am I maximizing my revenue by selling these popular items at their current prices? Which customers tend to buy certain products? And so on.

Through answering these questions, we can mine insights to drive business decisions. However, is there a way to take our analysis one step further? One way is to apply the aforementioned Market Basket Analysis. This allows us to dig deeper to identify possible patterns that exist between the different items.

Market Basket Analysis — Theory

Apriori Algorithm

To aid our understanding of the Apriori Algorithm, I have listed the definition of several commonly used terms below.

Itemset: A collection of one or more items e.g. {COFFEE, SUGAR}
Candidate Rule: An expression of the form A → B where A and B are itemsets, e.g. {COFFEE, SUGAR} → {MILK}

How does the Apriori Algorithm work?

The algorithm begins by counting how often each itemset appears across all transactions. It then identifies the itemsets that are ‘frequent’ based on a minimum support criterion. This is important because we wouldn’t want to consider an itemset that only appears in a small proportion of recorded transactions.

The support of an itemset, A, is the percentage of transactions that contain A. For example, if 70% of all transactions contain itemset {COFFEE}, intuitively, the support of {COFFEE} is 0.7.

With the collection of ‘frequent’ itemsets, a collection of candidate rules can be formed based on these itemsets. For example, a ‘frequent’ itemset {COFFEE, SUGAR} may suggest candidate rules {COFFEE} → {SUGAR} and {SUGAR} → {COFFEE}.

Having obtained our different candidate rules, we can evaluate the suitability of these rules using the concept of confidence.

Confidence is defined as the measure of certainty associated with each discovered rule. The confidence for candidate rule A → B is the percent of transactions that contain both A and B out of all the transactions that contain A. Mathematically, it looks like this: Confidence(A → B) = Support(A ∧ B) / Support(A)

Often, a rule is considered interesting when the algorithm identifies it as having a confidence greater than or equal to a predefined threshold. This predefined threshold is also known as the minimum confidence.

When choosing a minimum confidence, it is helpful to keep in mind that a higher confidence often indicates that the rule is more significant.

However, using confidence alone is not sufficient. Given a rule A → B, confidence considers only A and the co-occurrence of A and B; it does not take into account how common B is on its own, so a rule can score highly simply because B appears in most transactions. One measure that addresses this issue is lift.

Lift measures how many times more often A and B occur together than would be expected if they were statistically independent of each other. In other words, lift captures whether A and B are genuinely related rather than coincidentally appearing together. Mathematically, it looks like this: Lift(A → B) = Support(A ∧ B) / (Support(A) × Support(B))

Typically, a lift approximately equal to 1 suggests that A and B are statistically independent of each other, whereas a lift greater than 1 suggests that the rule captures a genuine association.
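As a quick sanity check, the three measures can be computed by hand on a handful of made-up baskets; the item names and counts below are purely illustrative:

```r
# Five hypothetical baskets (illustrative only)
baskets <- list(
  c("COFFEE", "SUGAR", "MILK"),
  c("COFFEE", "SUGAR"),
  c("COFFEE"),
  c("SUGAR", "MILK"),
  c("BREAD")
)

n <- length(baskets)

# Count the baskets that contain every item in `items`
contains <- function(items) {
  sum(sapply(baskets, function(b) all(items %in% b)))
}

support_coffee       <- contains("COFFEE") / n              # 3/5 = 0.6
support_sugar        <- contains("SUGAR") / n               # 3/5 = 0.6
support_coffee_sugar <- contains(c("COFFEE", "SUGAR")) / n  # 2/5 = 0.4

# Confidence({COFFEE} -> {SUGAR}) = Support(A ∧ B) / Support(A)
confidence <- support_coffee_sugar / support_coffee         # 0.4 / 0.6 ≈ 0.667

# Lift({COFFEE} -> {SUGAR}) = Support(A ∧ B) / (Support(A) * Support(B))
lift <- support_coffee_sugar / (support_coffee * support_sugar)  # ≈ 1.11
```

Here the lift is slightly above 1, hinting at a mild positive association between the two items in this toy data.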

To find out more about Market Basket Analysis, you can check out the following link.

Market Basket Analysis — Code & Insights

rules <- apriori(transaction, parameter = list(supp = 0.01, conf = 0.7, maxlen = 3))
rules <- sort(rules, by = "confidence", decreasing = TRUE)
Table 3: Summary of Candidate Rules
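Beyond the summary table, individual rules can be printed with arules’ inspect() function; for instance, to view the rules with the highest confidence after the sort above:

```r
# Print the five rules with the highest confidence
inspect(head(rules, 5))
```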

Once the algorithm produces the set of candidate rules, we should take some time to interpret and understand them. The interpretation of the rules is fairly intuitive. Let’s take the first rule, {SUGAR} → {COFFEE}, as an example.

From Table 3, we can observe that the itemset {SUGAR, COFFEE} appears in approximately 1.2% of recorded invoices. From the confidence, we can see that 100% of invoices containing ‘SUGAR’ also contain ‘COFFEE’. Looking at the lift, we can also see that there is a strong positive association between ‘COFFEE’ and ‘SUGAR’.

How exactly can we use this insight?

I would highly encourage you to spend some time understanding the rules that have been generated. After all, it is the insights that we uncover from understanding these rules that ultimately add value to an organization.

If you would like a visual representation of the rules plotted according to their support, confidence and lift, you can do so by calling the plot() function.
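Assuming the rules object generated above, the default arulesViz scatter plot takes a single call:

```r
# Scatter plot of the rules: support vs. confidence, shaded by lift
plot(rules)
```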

Fig 2: Scatter Plot for the 12 Candidate Rules

If you would like to visualize some of these associations in the form of a graph, you can do so by setting the method to “graph”.

plot(rules, method = "graph")
Fig 3: Visual Representation of the Candidate Rules

Finally, if you would prefer something more interactive, adjusting the code slightly by setting the engine to "interactive" does the trick.

plot(rules, method = "graph", engine = "interactive", shading = NA)
Fig 4: Visual Representation of the Candidate Rules

Final Thoughts

It is worth noting that the concept of Association Rules is not limited to Market Basket Analysis. Association Rules can also be used for other purposes such as stock analysis and medical diagnosis; the possibilities are endless.

With this, thank you for your time, and I hope that this article was beneficial to you! Do reach out if you have questions regarding this article or the implementation of the code :~)

You can check out the full project on GitHub here.

A penultimate-year Data Science and Analytics undergraduate specializing in Artificial Intelligence