Conditional Probability Application: Association Rule Learning
2022-12-05

"Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. ... In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected." -- Wikipedia

This post briefly covers the following metrics:

  • Support measures how frequently an itemset appears in the given data.

  • Confidence measures how often the if-then statement is found to be true.

  • Lift compares the actual confidence, $\mathrm{conf}(X \Rightarrow Y)$, with the confidence expected if $X$ and $Y$ were independent, which is $\mathrm{supp}(Y)$. To see this, divide both the numerator and the denominator of the lift formula by $\mathrm{supp}(X)$.

Support

The support of $X$ with respect to $T$ is defined as the proportion of transactions in the dataset that contain the itemset (or item) $X$:

$$\mathrm{supp}(X) = \frac{|\{(i, t) \in T : X \subseteq t\}|}{|T|}$$

where $i$ is the transaction ID and $t$ is its full itemset.

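For example, with the five transactions in the Example section below, the support of $\{\text{milk}, \text{bread}\}$ is:

$$\mathrm{supp}(\{\text{milk}, \text{bread}\}) = \frac{2}{5} = 0.4$$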
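As a minimal sketch (not taken from the arules package), the support definition can be written in a few lines of Python. The `TRANSACTIONS` dictionary mirrors the five-transaction table in the Example section of this post, and the `support` helper name is an assumption made for illustration.

```python
# Transactions from this post's example data (transaction ID -> itemset).
TRANSACTIONS = {
    1: {"milk", "bread", "fruit"},
    2: {"butter", "egg", "fruit"},
    3: set(),
    4: {"milk", "bread", "butter", "egg", "fruit"},
    5: {"bread"},
}


def support(itemset, transactions):
    """Proportion of transactions whose itemset contains every item in `itemset`."""
    itemset = set(itemset)
    # A transaction counts as a hit when `itemset` is a subset of it.
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


print(support({"milk", "bread"}, TRANSACTIONS))  # transactions 1 and 4 out of 5
print(support({"bread"}, TRANSACTIONS))          # transactions 1, 4 and 5 out of 5
```

Note that the empty itemset is contained in every transaction, so its support under this definition is 1.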

Confidence

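With respect to $T$, the confidence value of an association rule, often denoted as $X \Rightarrow Y$, is the ratio of the number of transactions containing both $X$ and $Y$ to the number of transactions containing $X$, where $X$ is the antecedent and $Y$ is the consequent:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$

Confidence can also be interpreted as an estimate of the conditional probability $P(Y \mid X)$.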

Example

Data (5 Transactions and 5 Items)

Transaction ID milk bread butter egg fruit
1 1 1 0 0 1
2 0 0 1 1 1
3 0 0 0 0 0
4 1 1 1 1 1
5 0 1 0 0 0

Support and Confidence

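if Antecedent then Consequent            supp          conf          supp × conf
if buy milk, then buy bread              2/5 = 0.4     2/2 = 1.0     0.4 × 1.0 = 0.4
if buy milk, then buy eggs               1/5 = 0.2     1/2 = 0.5     0.2 × 0.5 = 0.1
if buy bread, then buy fruit             2/5 = 0.4     2/3 ≈ 0.67    0.4 × 0.67 ≈ 0.27
if buy fruit, then buy eggs              2/5 = 0.4     2/3 ≈ 0.67    0.4 × 0.67 ≈ 0.27
if buy milk and bread, then buy fruit    2/5 = 0.4     2/2 = 1.0     0.4 × 1.0 = 0.4

  • The itemset $\{\text{milk}, \text{bread}\}$ has a support of $2/5 = 0.4$ since it occurs in 40% of all transactions.

  • The rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence value of $0.2 / 0.4 = 0.5$, suggesting that butter is bought 50% of the time when milk and bread are bought.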
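The following sketch recomputes support, confidence, and their product for the rules listed above, plus the milk-and-bread-then-butter rule. It is an illustration only: the helper names `support` and `confidence` are assumptions, and `Fraction` is used so that values such as 2/3 stay exact instead of being rounded.

```python
from fractions import Fraction

# The five transactions from the Data table (transaction 3 is empty).
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def support(itemset, transactions):
    """Exact proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions if itemset <= items)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """conf(X => Y) = supp(X u Y) / supp(X), an estimate of P(Y | X)."""
    supp_x = support(antecedent, transactions)
    if supp_x == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return support(set(antecedent) | set(consequent), transactions) / supp_x


RULES = [
    ({"milk"}, {"bread"}),
    ({"milk"}, {"egg"}),
    ({"bread"}, {"fruit"}),
    ({"fruit"}, {"egg"}),
    ({"milk", "bread"}, {"fruit"}),
    ({"milk", "bread"}, {"butter"}),
]

for x, y in RULES:
    supp_xy = support(x | y, TRANSACTIONS)
    conf = confidence(x, y, TRANSACTIONS)
    print(sorted(x), "=>", sorted(y), supp_xy, conf, supp_xy * conf)
```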

Lift

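Lift is the ratio of the observed support to the support expected if $X$ and $Y$ were independent:

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$

For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a lift of $\frac{0.2}{0.4 \times 0.4} = 1.25$.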

Lift        Implied Relation
lift = 1    X and Y are independent
lift > 1    The degree to which the two occurrences are dependent on one another
lift < 1    The degree to which the items are substitutes for each other
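To make the lift interpretation concrete, here is a small sketch that computes lift on the post's example data and maps the value to the implied relation. The function names `lift` and `implied_relation` are hypothetical, and the exact comparison with 1 only works because `Fraction` arithmetic is exact; with floating-point values a tolerance would be needed.

```python
from fractions import Fraction

# The five transactions from the Data table (transaction 3 is empty).
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def support(itemset, transactions):
    """Exact proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions if itemset <= items)
    return Fraction(hits, len(transactions))


def lift(antecedent, consequent, transactions):
    """lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    supp_x = support(antecedent, transactions)
    supp_y = support(consequent, transactions)
    if supp_x == 0 or supp_y == 0:
        raise ValueError("lift is undefined when either side never occurs")
    supp_xy = support(set(antecedent) | set(consequent), transactions)
    return supp_xy / (supp_x * supp_y)


def implied_relation(value):
    """Map a lift value to the relation it implies."""
    if value == 1:
        return "independent"
    return "dependent" if value > 1 else "substitutes"


for x, y in [({"milk", "bread"}, {"butter"}), ({"bread"}, {"egg"})]:
    value = lift(x, y, TRANSACTIONS)
    print(sorted(x), "=>", sorted(y), value, implied_relation(value))
```

Equivalently, lift is the confidence of the rule divided by the support of the consequent, which is the comparison between actual and expected confidence.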

Summary

  • If rules were built from every possible itemset in the data, there would be so many rules that they would carry little meaning. That is why association rules are typically restricted to those that are well represented by the data.

    • In practice, you are most likely to use only support and confidence. A rule must then satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
  • Benefit

    Find patterns that help explain the correlations and co-occurrences in the data. A good real-world example is medicine, where association rules of the form [symptoms => illness] are used to help diagnose patients.

  • Drawbacks

    • Finding appropriate parameter and threshold settings for the mining algorithm can be difficult

    • The algorithm can discover a large number of rules without guaranteeing their relevance or reliability
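The minimum-support and minimum-confidence filtering described in the summary can be sketched as a brute-force search over the example data. This is an assumption-laden illustration, not the Apriori algorithm and not how the arules package works: the function name `mine_rules` is hypothetical, and enumerating every itemset is only feasible for a handful of items. Thresholds are passed as `Fraction` values because comparing an exact fraction such as 2/5 with the float `0.4` can give a surprising result.

```python
from fractions import Fraction
from itertools import combinations

# The five transactions from the Data table (transaction 3 is empty).
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every rule
    that meets both thresholds, by checking all itemsets of size >= 2."""
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def supp(itemset):
        return Fraction(sum(1 for t in transactions if itemset <= t), n)

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            supp_xy = supp(itemset)
            # Skip itemsets that never occur or are below the support threshold.
            if supp_xy == 0 or supp_xy < min_support:
                continue
            # Split the itemset into every non-empty antecedent/consequent pair.
            for k in range(1, size):
                for antecedent in combinations(combo, k):
                    x = frozenset(antecedent)
                    y = itemset - x
                    conf = supp_xy / supp(x)
                    if conf >= min_confidence:
                        rules.append((x, y, supp_xy, conf))
    return rules


# Rules that hold in at least 40% of transactions and are always true.
for x, y, s, c in mine_rules(TRANSACTIONS, Fraction(2, 5), Fraction(1)):
    print(sorted(x), "=>", sorted(y), "supp", s, "conf", c)
```

Lowering either threshold makes the list of rules grow quickly, which is the practical reason for the drawbacks listed above.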

R Package and Tutorial

CRAN - Package arules

arules: Association Rule Mining with R - A Tutorial (PDF File)