Product Recommendation: An Analytical Approach

Azeem Mumtaz
Jul 4, 2021

Project Goal (Hypothetical)

  • ABC is a company that sells consumer products.
  • ABC wants to increase revenue by up-selling and cross-selling products to its existing customers.

Analytics Plan

  • Component of Plan: ABC’s online ordering app.
  • Discovery Business Problem: How can ABC increase revenue by up-selling and cross-selling products to its existing customers?
  • Initial Hypothesis: Patterns in past purchase history may suggest relationships useful for up-selling and cross-selling products.
  • Data and Scope: Randomly generated purchase history for 5 years.
  • Model Planning: Association rules will give ABC a way to discover relationships in past purchases.
  • Results: When a customer checks out, the online ordering site can show a list of products derived from the association rules as recommendations before confirming the order.
  • Business Impact: ABC can use Tealium or Mixpanel to capture events when customers add such suggested products to their cart and checkout. ABC can determine how much additional revenue they gain through these suggestions and understand the model’s usefulness.

Approach

  • A randomly generated dataset with a large number of observations and six variables.
  • Each observation in the dataset contains an item a customer bought.
  • Pre-processed the dataset down to 134,845 observations with six variables.
  • Developed an association rules model using the Apriori algorithm in R, with the arules, arulesViz, and other R packages.
  • Loaded the CSV file into a data frame in R (i.e., originalOrderDF) for processing.

Data Pre-Processing

The following command will describe the structure of the dataset.

str(originalOrderDF)

The summary of the dataset indicates a few issues with some observations.

summary(originalOrderDF)
Figure 1 — Summary of the Dataset

There were products with negative quantities and prices, indicating deleted or returned products; such observations were removed as they are not required for the goal of this analysis.

orderDF <- originalOrderDF
## Let's remove line items with non-positive qty_case
orderDF <- orderDF[orderDF$qty_case > 0, ]
## Let's remove line items with negative qty_each
orderDF <- orderDF[orderDF$qty_each >= 0, ]
## Let's remove line items with negative price
orderDF <- orderDF[orderDF$price > 0, ]

The dataset contains multiple observations per order, which together constitute a transaction. Each observation represents a product in an “order”. Therefore, the dataset was transformed and reduced so that each observation contains one transaction, as follows.

library(plyr)  ## provides ddply
## Functions to reduce multiple observations from orderDF to one
## to form an order transaction / cart
## Note: The paste function in R is used to collapse multiple
## product descriptions into one variable separated by a comma.
reduceProductToOne <- function(orderDF) {
  paste(orderDF$product_desc, collapse = ",")
}
## Calculate the sum of the products
reducePriceToOne <- function(orderDF) {
  sum(orderDF$price)
}
## Calculate the total quantity ordered by customers
reduceQtyToOne <- function(orderDF) {
  sum(orderDF$qty_case) + sum(orderDF$qty_each)
}
## Select the order date (assumption: the order date is the same for
## all observations with the same order_id)
reduceDateToOne <- function(orderDF) {
  as.Date(as.POSIXct(head(orderDF$order_date, 1) / 1000, origin = "1970-01-01", tz = "GMT"))
}
## Extract the order year (useful to understand purchases by year)
reduceYearToOne <- function(orderDF) {
  format(as.Date(as.POSIXct(head(orderDF$order_date, 1) / 1000, origin = "1970-01-01", tz = "GMT")), format = "%Y")
}
## Let's apply reduce functions and create a dataframe with order_id
## and combin product_description using a comma
orderProductDF <- ddply(orderDF, ~ order_id, reduceProductToOne)
colnames(orderProductDF) <- c("order_id", "product_descs")
## Let's apply reduce functions and create dataframe with order_id
## and sum of total
orderPriceDF <- ddply(orderDF, ~ order_id, reducePriceToOne)
colnames(orderPriceDF) <- c("order_id", "total")
## Let's apply reduce functions and create DF with order_id and sum
## of quantity
orderQtyDF <- ddply(orderDF, ~ order_id, reduceQtyToOne)
colnames(orderQtyDF) <- c("order_id", "qty")
## Let's apply reduce functions and create DF with order_id and
## select one ordered date from the head of multiple observations
orderDateDF <- ddply(orderDF, ~ order_id, reduceDateToOne)
colnames(orderDateDF) <- c("order_id", "date")
## Let's apply reduce functions and create DF with order_id and year
## of the ordered date
orderYearDF <- ddply(orderDF, ~ order_id, reduceYearToOne)
colnames(orderYearDF) <- c("order_id", "year")
## Merge all above data frames to form the final order data frame
finalOrderDF <-
  merge(
    merge(
      merge(
        merge(
          orderProductDF,
          orderQtyDF,
          by = c("order_id")),
        orderPriceDF,
        by = c("order_id")),
      orderDateDF,
      by = c("order_id")),
    orderYearDF,
    by = c("order_id"))

Afterwards, each observation contains the order id, total value, total quantity, date, year, and products. The product description variable contains all products in the order, separated by commas.

The dataset does not contain any missing values.

library(DataExplorer)  ## provides plot_missing
plot_missing(orderDF)
Figure 2— Missing Values in the Dataset

Then, the dataset was analysed to remove noise and outliers. Outliers are data with very different behaviours from expectations (Han, et al., 2012). When the dataset was analysed with a boxplot, there were many instances where customers placed orders for a very large quantity to cater to special needs (e.g., charity). Abnormal values in this field may not be appropriate for the analysis, as such outliers might skew the output and model performance (Han, et al., 2012). The interquartile range (IQR) method was used to remove outliers from this dataset.

## Let's remove outliers in qty
boxplot(finalOrderDF$qty, ylab = "qty", main = "Boxplot for the qty")
outliers <- boxplot(finalOrderDF$qty, ylab = "qty", main = "Boxplot for the qty")$out
qt <- quantile(finalOrderDF$qty, probs = c(0.25, 0.75))
span <- 1.5 * IQR(finalOrderDF$qty)
finalOrderDF$qty <- ifelse((finalOrderDF$qty < (qt[1] - span) | finalOrderDF$qty > (qt[2] + span)), NA, finalOrderDF$qty)
finalOrderDF <- finalOrderDF[complete.cases(finalOrderDF), ]
boxplot(finalOrderDF$qty, ylab = "qty", main = "Boxplot for the qty")
Figure 3— Boxplot on Qty (with noise)
Figure 4— Boxplot on Qty (after)

Finally, after the data pre-processing, the dataset contains 134,845 observations with six variables.

Association Rules — Why?

Now that the dataset has been pre-processed, surveyed, and visualised, an unsupervised learning model will be used to identify rules among products in the dataset. The method used here is association rules: a descriptive, not predictive, method often used to discover interesting relationships hidden in a large dataset such as purchase history (Dietrich, et al., 2015).

This is also known as “market basket analysis”.

Association rules will give ABC a way to discover relationships in past purchases. These relationships indicate products that customers frequently bought together.

ABC has three options with this information.

  1. When a customer builds a cart, the online ordering app can show a list of products derived from the association rules (based on the products currently in the cart) as recommendations before the customer checks out.
  2. Based on the customer’s past purchased products, the online ordering app can show those recommendations under the “Suggestion for you” carousel in the UI.
  3. In the order confirmation email or any marketing emails, ABC can include these recommendations for the customer’s future consideration.

If a recommendation contains a product the customer already purchased within the last three months, it should be filtered out of the recommendations (out of scope for this blog). All three approaches allow ABC to increase revenue by up-selling and cross-selling products.
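Although the filtering itself is out of scope for this blog, it could be sketched roughly as follows; the function name and the example product vectors are hypothetical.

```r
## Hypothetical sketch: drop recommended products the customer
## already bought within the last three months.
filterRecommendations <- function(recommended, recentPurchases) {
  setdiff(recommended, recentPurchases)
}

## Example with made-up values
recommended <- c("BEEF", "RICE", "CHEESE")
recentPurchases <- c("CHEESE")
filterRecommendations(recommended, recentPurchases)
## [1] "BEEF" "RICE"
```

In practice the recentPurchases vector would come from the customer's order history, restricted to the last three months.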

Implementation of Recommendation using Association Rules

Next, association rules were mined from the dataset using the Apriori algorithm, one of the earliest and most fundamental algorithms for generating association rules (Dietrich, et al., 2015), to identify product recommendations.

The implementation is done in R using the arules package.

The dataset, processed in the previous section, was loaded as transactions via the “read.transactions” function in the arules library as finalOrderTX.

## Let's remove unnecessary variables for the association 
## rules analysis.
productDF <- finalOrderDF
productDF$order_id = NULL
productDF$qty = NULL
productDF$total = NULL
productDF$date = NULL
productDF$year = NULL
## Let's write the productDF to a CSV file for further analysis
write.csv(productDF, transactionFile, quote = FALSE, row.names = FALSE)
## Let's load the productDF data from CSV into a transactions object
## from arules.
## Note: The format is basket, because the product_desc values are
## saved in one column as comma-separated values.
finalOrderTX <- read.transactions(transactionFile, format = 'basket', sep = ',')

The summary of finalOrderTX lists the products most frequently bought by ABC’s customers and the number of transactions containing those products.

summary(finalOrderTX)

Rule Generation

The Apriori algorithm takes an iterative, bottom-up approach with a pruning strategy to identify frequent itemsets and rules that satisfy the minimum support criterion (Dietrich, et al., 2015). It was applied to the finalOrderTX transactions to generate association rules.

## Let's generate the association rules.
## Note: maxlen is default 10
## Note: support and confidence values are calculated after trying
## out many value combinations.
associationRules <- apriori(finalOrderTX,
  parameter = list(
    minlen = 2,
    maxtime = 0,
    support = 0.017,
    confidence = 0.9,
    target = "rules"))

The above function was executed multiple times with different support and confidence values to identify the optimal number of rules. The rule generation started with 0.005 as the minimum support criterion. Since there are 134,845 transactions, an itemset is considered frequent if it was purchased at least 675 times.
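The threshold above follows directly from the definition of relative support; a quick check:

```r
## Minimum absolute count implied by a relative support threshold
nTransactions <- 134845
minSupport <- 0.005
ceiling(minSupport * nTransactions)
## [1] 675
```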

+---------+------------+-----------------+
| Support | Confidence | Number of Rules |
+---------+------------+-----------------+
| 0.005   | 0.9        |         1567483 |
| 0.009   | 0.9        |           72015 |
| 0.01    | 0.9        |           37507 |
| 0.012   | 0.9        |           10240 |
| 0.014   | 0.9        |            2171 |
| 0.015   | 0.95       |             639 |
| 0.017   | 0.9        |              63 |
+---------+------------+-----------------+

After multiple executions, the following support and confidence values were chosen; they yield 63 rules, an optimal number for this hypothetical analysis.

  • Support = 0.017
  • Confidence = 0.9

The summary of the generated association rules for the above values shows 63 rules, along with their support, confidence, and lift distributions. Notice that the minimum support and confidence values are greater than or equal to those defined above.

summary(associationRules)
Figure 5 — Summary of the Association Rules

The following code will print the top 10 rules by the support value.

inspect(head(sort(associationRules, by = "support"), 10))

The following graph displays a scatterplot of the 63 rules with a minimum support of 0.017 and a confidence of 0.9.

plot(associationRules)
Figure 6— Scatterplot for 63 Rules

The lift value generally increases with higher confidence and lower support of the consequent. Lift is proportional to confidence when Support(Y) is held constant, because Lift(X → Y) equals Confidence(X → Y) / Support(Y) (Dietrich, et al., 2015).

plot(associationRules@quality)

This behaviour is shown in the following graph in the 4th row and 2nd column as well.

Figure 7— Scatterplot on All Quality Metrics
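As a quick numeric illustration of the lift formula above (the values are hypothetical, not taken from this dataset):

```r
## Hypothetical example: Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
confidenceXY <- 0.9   ## P(Y | X)
supportY <- 0.3       ## P(Y)
confidenceXY / supportY
## [1] 3
```

A lift of 3 would mean customers buy Y three times more often when X is in the cart than they do overall.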

The following graph shows confidence and support against the number of products in each rule. It shows that rules with more products tend to have lower support.

plot(associationRules, method = "two-key plot")
Figure 8— Two Way Scatterplot

The following code will print the top 10 rules by lift value.

inspect(head(sort(associationRules, by = "lift"), 10))

A graph plot shows the top 5 rules with the highest lift value.

## Let's plot the high-lift rules in a graph plot
associationRulesByLift <- sort(associationRules, by = "lift")
rulesWithHighestLift <- head(associationRulesByLift, 5)
inspect(rulesWithHighestLift)
plot(rulesWithHighestLift, method = "graph")
Figure 9— Graph on Top 5 Rules by Lift

In it, the following rule is the top rule.

{BEAN, BEEF} => {CHICKEN}

Itemset Generation

An itemset refers to a collection of products that share some relationship (Dietrich, et al., 2015). The apriori function in the arules package can generate frequent itemsets using the same support and confidence values.

## Let's generate the frequent itemset
## Let's use the same support and confidence value which used in the
## association rule mining
## Note: When the maxlen parameter is not set, the algorithm
## continues each iteration until it runs out of support or until k
## reaches the default maxlen=10
itemsets <- apriori(finalOrderTX,
  parameter = list(
    minlen = 2,
    maxtime = 0,
    support = 0.017,
    confidence = 0.9,
    target = "frequent itemsets"))

The summary shows that the support ranges from 0.01702 to 0.01861 for k-itemsets, where k ranges from 2 to 4.

summary(itemsets)
Figure 10 — Summary of the Itemset

We can inspect the itemsets by their support value.

inspect(head(sort(itemsets, by = "support"), 10))

Now that we have the model, how do we generate recommendations based on the cart?

When a customer checks out, the online ordering app can show a list of products derived from the association rules as recommendations before confirming the order.

Suppose the customer has the following products in the cart.

  • CHEESE
  • CHICKEN

test <- apriori(finalOrderTX,
  parameter = list(
    minlen = 2,
    maxtime = 0,
    support = 0.017,
    confidence = 0.9),
  appearance = list(
    lhs = c(
      "CHEESE",
      "CHICKEN"),
    default = "rhs"))

When “test” is sorted by lift, the online ordering app can, based on the output, suggest the following two products.

{CHEESE, CHICKEN} => {BEEF, RICE}
  • BEEF
  • RICE

associationRulesForTestByLift <- sort(test, by = "lift")
rulesWithHighestLiftForTest <- head(associationRulesForTestByLift, 5)
inspect(rulesWithHighestLiftForTest)

“test” is sorted by lift because lift indicates the usefulness of a rule; a higher value suggests a stronger association between the antecedent (the left-hand side of the rule) and the consequent (the right-hand side) (Dietrich, et al., 2015).

plot(rulesWithHighestLiftForTest, method = "graph")

Further, ABC can use Tealium or Mixpanel to capture events when customers add such suggested products to their cart and checkout. In this way, ABC can determine how much additional revenue they gain through these suggestions and understand the model’s usefulness.

ABC should regenerate the model once a day and roll it out with a blue/green deployment strategy, meeting its SLA with zero downtime.
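A minimal sketch of that daily regeneration job, assuming the rules are serialised to a file that the serving side reads; the function name, file paths, and the rename-based swap are hypothetical details, not part of the original analysis.

```r
## Hypothetical daily job: rebuild the rules and publish them
## atomically so the serving side never reads a half-written file.
library(arules)

regenerateRules <- function(transactionFile, outFile) {
  tx <- read.transactions(transactionFile, format = 'basket', sep = ',')
  rules <- apriori(tx,
    parameter = list(
      minlen = 2,
      support = 0.017,
      confidence = 0.9,
      target = "rules"))
  tmp <- paste0(outFile, ".tmp")
  saveRDS(rules, tmp)
  file.rename(tmp, outFile)  ## atomic swap on the same filesystem
}
```

Writing to a temporary file and renaming mirrors the blue/green idea at the file level: the old rules stay live until the new ones are fully written.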

Sourcecode

https://github.com/azeemigi/product-rec-using-apriori-algorithm

Note: the source code contains a sample dataset with anonymized values.

References and Further Reading

Dietrich, D., Heller, B. & Beibei, Y., 2015. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. 1st Edition ed. Indianapolis: John Wiley & Sons, Inc.

Han, J., Kamber, M. & Pei, J., 2012. Data Mining Concepts and Techniques. Third ed. Waltham, MA: Elsevier Inc.

Hahsler, M. & Chelluboina, S., 2018. Visualizing Association Rules: Introduction to the R-extension Package arulesViz. [Online]
Available at: https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf
[Accessed 23 May 2021].
