library(tidyverse)
library(dplyr)
library(arules)
library(arulesViz)
data("Groceries")
Association rules
For this task, we will be using Groceries, a dataset that ships with the arules package. Each row represents one buyer’s purchases. We will generate item frequency plots, identify strong association rules involving a specific product, and visualize rules using scatter and graph-based methods.
This work is part of an assignment for the AD699 Data Mining course.
Groceries is an object of class transactions, stored as a sparse matrix. The data consists of 9835 rows (transactions) and 169 columns (items).
summary(Groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables rolls/buns soda
2513 1903 1809 1715
yogurt (Other)
1372 34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
17 18 19 20 21 22 23 24 26 27 28 29 32
29 14 14 9 11 4 6 1 1 1 1 3 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
includes extended item information - examples:
labels level2 level1
1 frankfurter sausage meat and sausage
2 sausage sausage meat and sausage
3 liver loaf sausage meat and sausage
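The density and mean basket size reported by summary() can be reproduced from the counts in the output above. The following is a small arithmetic sketch in base R; the variable names are illustrative, and all numbers are copied from the summary.

```r
# Numbers copied from the summary(Groceries) output above
n_transactions <- 9835
n_items <- 169
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items + (Other)

# Total number of item occurrences across all baskets
total_purchases <- sum(item_counts)

# Density = share of non-empty cells in the transactions-by-items matrix
density <- total_purchases / (n_transactions * n_items)

# Mean basket size = item occurrences per transaction
mean_basket <- total_purchases / n_transactions

round(density, 4)
round(mean_basket, 3)
```

Both values should agree with the density and Mean lines printed by summary(Groceries), which is a quick check that the sparse matrix is being read as intended.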
The bar plot below displays the frequent items, that is, the items that meet the minimum support threshold. The threshold is set at 7.25%, meaning that only items appearing in at least 7.25% of all transactions are considered frequent. This leaves 16 frequent products.
itemFrequencyPlot(Groceries,
support=0.0725,
horiz = TRUE,
col = "ivory",
main = "Frequent Grocery items (Support > 7.25%)")
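The number of bars in the plot can also be checked numerically. The sketch below assumes that the plot keeps items whose relative support is at or above the threshold, and uses itemFrequency() from arules; the names item_support and frequent_items are illustrative.

```r
library(arules)
data("Groceries")

# Relative support of every single item
item_support <- itemFrequency(Groceries, type = "relative")

# Keep items at or above the 7.25% threshold used in the plot
frequent_items <- sort(item_support[item_support >= 0.0725], decreasing = TRUE)

length(frequent_items)      # number of frequent items
round(frequent_items, 3)    # support of each frequent item
```

Printing the sorted vector gives the same ranking as the horizontal bars, with the exact support values attached.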
Let’s create subsets of rules that contain my grocery item, cream cheese.
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.5 0.1 1 none FALSE TRUE 5 0.001 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 9
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.01s].
writing ... [5668 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
# summary(rules)
# inspect(rules[1:5])
# itemLabels(Groceries)
# The item label in Groceries is "cream cheese " (with a trailing space)
lhs_rules <- subset(rules, lhs %in% "cream cheese ")
rhs_rules <- subset(rules, rhs %in% "cream cheese ")
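The filter above depends on matching the item label exactly, including whitespace. The following sketch shows one way to look up the label before subsetting; the variable name matches is illustrative.

```r
library(arules)
data("Groceries")

# Find all item labels that mention cream cheese
matches <- grep("cream cheese", itemLabels(Groceries), value = TRUE)

# Wrap in brackets so any leading or trailing whitespace is visible
paste0("[", matches, "]")

# Exact matching with %in% only works for the label as stored
"cream cheese" %in% itemLabels(Groceries)
"cream cheese " %in% itemLabels(Groceries)
```

If exact labels are inconvenient, arules also provides the partial-matching operator %pin%, so subset(rules, lhs %pin% "cream cheese") is a possible alternative, at the cost of also matching any other label that contains the same text.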
summary(lhs_rules)
set of 233 rules
rule length distribution (lhs + rhs):sizes
3 4 5
59 137 37
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 3.000 4.000 3.906 4.000 5.000
summary of quality measures:
support confidence coverage lift
Min. :0.001017 Min. :0.5000 Min. :0.001118 Min. : 1.957
1st Qu.:0.001118 1st Qu.:0.5556 1st Qu.:0.001729 1st Qu.: 2.642
Median :0.001220 Median :0.6154 Median :0.002034 Median : 3.138
Mean :0.001536 Mean :0.6480 Mean :0.002494 Mean : 3.563
3rd Qu.:0.001729 3rd Qu.:0.7143 3rd Qu.:0.002847 3rd Qu.: 3.982
Max. :0.006609 Max. :1.0000 Max. :0.012405 Max. :11.041
count
Min. :10.00
1st Qu.:11.00
Median :12.00
Mean :15.11
3rd Qu.:17.00
Max. :65.00
mining info:
data ntransactions support confidence
Groceries 9835 0.001 0.5
call
apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.5))
summary(rhs_rules)
set of 1 rules
rule length distribution (lhs + rhs):sizes
5
1
Min. 1st Qu. Median Mean 3rd Qu. Max.
5 5 5 5 5 5
summary of quality measures:
support confidence coverage lift
Min. :0.001017 Min. :0.5882 Min. :0.001729 Min. :14.83
1st Qu.:0.001017 1st Qu.:0.5882 1st Qu.:0.001729 1st Qu.:14.83
Median :0.001017 Median :0.5882 Median :0.001729 Median :14.83
Mean :0.001017 Mean :0.5882 Mean :0.001729 Mean :14.83
3rd Qu.:0.001017 3rd Qu.:0.5882 3rd Qu.:0.001729 3rd Qu.:14.83
Max. :0.001017 Max. :0.5882 Max. :0.001729 Max. :14.83
count
Min. :10
1st Qu.:10
Median :10
Mean :10
3rd Qu.:10
Max. :10
mining info:
data ntransactions support confidence
Groceries 9835 0.001 0.5
call
apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.5))
There are 233 rules that contain my product on the left-hand side: 59 rules involve three items, 137 involve four items, and 37 involve five items. Only 1 rule has cream cheese on the right-hand side, and it is a five-item rule. This indicates that cream cheese tends to appear in combination with other products, mostly as part of the antecedent.
Let’s look at the first rule: if a person buys other vegetables, curd, yogurt, and whipped/sour cream, this person is 14.83 times more likely to buy cream cheese than a random customer in the store. The support is 0.001, meaning that 0.1% of all transactions contain this exact item set. Confidence is 0.59: if someone buys other vegetables, curd, yogurt, and whipped/sour cream, there is a 59% chance that they also buy cream cheese. Coverage indicates how often the rule can be applied; here it is roughly 0.002, so the left-hand side occurs in about 0.2% of all transactions in the dataset.
inspect(sort(rhs_rules, by="lift"))
lhs rhs support confidence coverage lift count
[1] {other vegetables,
curd,
yogurt,
whipped/sour cream} => {cream cheese } 0.001016777 0.5882353 0.001728521 14.83409 10
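The quality measures of this rule are linked by simple ratios, which can be checked by hand. The sketch below is base R arithmetic on the values printed above; the variable names are illustrative, and the right-hand-side count is derived from the lift rather than read from the data.

```r
# Values copied from the inspect() output above
n <- 9835              # total transactions
rule_count <- 10       # transactions containing LHS and RHS together
coverage <- 0.001728521
lift <- 14.83409

# Coverage is the support of the LHS, so this recovers the LHS count
lhs_count <- round(coverage * n)

support <- rule_count / n              # share of baskets with LHS and RHS
confidence <- rule_count / lhs_count   # share of LHS baskets that also hold RHS

# lift = confidence / support(RHS), so the RHS support is implied
rhs_support <- confidence / lift
rhs_count <- round(rhs_support * n)    # implied number of cream cheese baskets

c(support = support, confidence = confidence,
  lhs_count = lhs_count, rhs_count = rhs_count)
```

Seen this way, the high lift comes from comparing a 59% conditional purchase rate against a baseline of about 4% for cream cheese overall, while the rule itself rests on only 10 transactions.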
The next rule: if a person buys citrus fruit, other vegetables, whole milk, and cream cheese, they are 9.12 times more likely to buy domestic eggs than a random shopper in the store. The support is 0.001, so the full item set appears in only about 0.1% of all transactions. This rule also has fairly high confidence: if a customer buys citrus fruit, other vegetables, whole milk, and cream cheese, there is a 58% chance they also buy domestic eggs. Coverage is 0.002, meaning that the left-hand-side combination occurs in about 0.2% of all transactions.
inspect(sort(lhs_rules, by="lift")[2])
lhs rhs support confidence coverage lift count
[1] {citrus fruit,
other vegetables,
whole milk,
cream cheese } => {domestic eggs} 0.001118454 0.5789474 0.001931876 9.124916 11
From these rules we see that certain sets of products tend to be purchased together. In combination they may be ingredients for salads or other recipes. Cream cheese, in particular, is commonly used in baking and is a key ingredient in cheesecake. Beyond that, cream cheese is widely used for frosting, spreads, pasta sauces, and dips, making it a versatile ingredient in a variety of dishes.
By identifying frequent combinations with cream cheese, the store can strategically place those items nearby, for example by positioning cream cheese close to the vegetables/fruits section or within the dairy aisle for convenient access. Offering special discounts can also boost sales. For example, if a customer buys cream cheese, offering a discount on berries or bagels can encourage bundled purchases. Additionally, analyzing product pairs allows the store to anticipate demand and adjust inventory accordingly, ensuring high-demand combinations are well-stocked ahead of time, especially during peak shopping seasons.
inspect(lhs_rules[7:9])
lhs rhs support confidence
[1] {cream cheese , frozen meals} => {whole milk} 0.001016777 0.7142857
[2] {hard cheese, cream cheese } => {other vegetables} 0.001118454 0.5789474
[3] {hard cheese, cream cheese } => {whole milk} 0.001016777 0.5263158
coverage lift count
[1] 0.001423488 2.795464 10
[2] 0.001931876 2.992090 11
[3] 0.001931876 2.059815 10
plot(lhs_rules[7:9])
From the plot above, we observe the distribution of the three association rules by support (x-axis), confidence (y-axis), and lift (represented by color).
First rule: {cream cheese , frozen meals} => {whole milk} has the highest confidence and the second-highest lift (2.80), but low support. The rule is fairly reliable: there is a 71.43% chance that a customer who buys cream cheese and frozen meals will also buy whole milk. However, it applies to only 0.10% of the transactions.
Second rule: {hard cheese, cream cheese } => {other vegetables} has slightly higher support (0.11%) and the highest lift of the three (2.99), but its confidence is lower (57.9%). This combination occurs a little more often and shows the strongest association relative to the baseline, yet it predicts the purchase of other vegetables less reliably than the first rule predicts whole milk.
Third rule: {hard cheese, cream cheese } => {whole milk} has the same support as the first rule, meaning that it applies to the same small proportion of transactions. However, it has the lowest confidence (52.6%) and the lowest lift (2.06), making it the least reliable of the three. It likely reflects a rare combination of items.
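With only three points on the scatter plot, it can help to rank the rules on each measure explicitly rather than reading positions and colors by eye. The sketch below does this in base R with values copied from the inspect(lhs_rules[7:9]) output; the data frame three_rules and its column names are illustrative.

```r
# Values copied from the inspect(lhs_rules[7:9]) output above
three_rules <- data.frame(
  rule = c("{cream cheese, frozen meals} => {whole milk}",
           "{hard cheese, cream cheese} => {other vegetables}",
           "{hard cheese, cream cheese} => {whole milk}"),
  support = c(0.001016777, 0.001118454, 0.001016777),
  confidence = c(0.7142857, 0.5789474, 0.5263158),
  lift = c(2.795464, 2.992090, 2.059815)
)

# Which rule ranks first on each measure?
top_by_confidence <- three_rules$rule[which.max(three_rules$confidence)]
top_by_lift <- three_rules$rule[which.max(three_rules$lift)]

# Full ordering by lift, strongest association first
three_rules[order(-three_rules$lift), c("rule", "confidence", "lift")]
```

Ranking by confidence and by lift separately makes it clear when the two measures disagree about which rule is strongest.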
plot(lhs_rules[7:9], method = "graph", engine="htmlwidget")
The same three rules are now shown as a graph. The central node represents cream cheese, which appears in all three rules, indicating that it is the key item in these associations. Hard cheese and whole milk appear in two rules each, showing that these items are involved in more than one product combination. Frozen meals and other vegetables appear in one rule each, which indicates that they are specific to particular combinations. The color of each rule node corresponds to its lift: rule 3, which has the lowest lift, is shown in light red, while rules 1 and 2 are shown in deeper red, reflecting their higher lift and stronger associations. Compared to the scatter plot, this view shows the items that make up each rule, making it easy to identify the central items and the relative strength of the rules. Clicking on a rule node also displays its measures (support, confidence). However, for selecting strong rules the scatter plot is more useful, because it shows directly which rules combine high support and high confidence. The choice of plot therefore depends on the purpose.