"Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. ... In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected." -- Wikipedia
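This post briefly covers the following metrics:
- Support: how frequently an itemset appears in the data.
- Confidence: how often the if-then statement holds, that is, how often the consequent appears in transactions that contain the antecedent.
- Lift: compares the actual confidence of a rule with the confidence expected if X and Y were independent. Dividing both the numerator and the denominator of the lift formula by $\mathrm{supp}(X)$ gives $\mathrm{lift}(X \Rightarrow Y) = \mathrm{conf}(X \Rightarrow Y) / \mathrm{supp}(Y)$, where $\mathrm{supp}(Y)$ is the confidence expected under independence.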
Support
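The support of an itemset $X$ with respect to a set of transactions $T$ is the proportion of transactions in $T$ that contain $X$:

$$\mathrm{supp}(X) = \frac{\lvert \{ t \in T : X \subseteq t \} \rvert}{\lvert T \rvert}$$

where $t$ is a single transaction and $T$ is the set of all transactions.
Example: in the data below, the support of $\{\text{milk}, \text{bread}\}$ is $2/5 = 0.4$, since two of the five transactions contain both milk and bread.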
Confidence
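With respect to a set of transactions $T$, the confidence of a rule $X \Rightarrow Y$ is the proportion of the transactions containing $X$ that also contain $Y$:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$

Confidence can also be interpreted as an estimate of the conditional probability $P(Y \mid X)$, the probability of finding $Y$ in a transaction given that the transaction already contains $X$.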
Example
Data (5 Transactions and 5 Items)
Transaction ID | milk | bread | butter | egg | fruit |
---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 1 | 1 |
3 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 1 | 1 | 1 | 1 |
5 | 0 | 1 | 0 | 0 | 0 |
Support and Confidence
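if Antecedent then Consequent | supp | conf | supp × conf |
---|---|---|---|
if buy milk, then buy bread | 2/5 = 0.4 | 2/2 = 1.0 | 0.4 |
if buy milk, then buy eggs | 1/5 = 0.2 | 1/2 = 0.5 | 0.1 |
if buy bread, then buy fruit | 2/5 = 0.4 | 2/3 ≈ 0.67 | ≈ 0.27 |
if buy fruit, then buy eggs | 2/5 = 0.4 | 2/3 ≈ 0.67 | ≈ 0.27 |
if buy milk and bread, then buy fruit | 2/5 = 0.4 | 2/2 = 1.0 | 0.4 |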
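The itemset $\{\text{milk}, \text{bread}\}$ has a support of 0.4 since it occurs in 40% of all transactions. The rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence value of $0.2 / 0.4 = 0.5$, suggesting butter is bought 50% of the time when milk and bread are bought.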
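The support and confidence definitions can be checked directly against the five transactions in the data table. The following is a minimal sketch, not a library API: the names `TRANSACTIONS`, `support`, and `confidence` are illustrative choices, and the transactions are simply the rows of the table rewritten as sets.

```python
from fractions import Fraction

# The five transactions from the data table, as sets of purchased items.
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """supp(X ∪ Y) / supp(X) for the rule X => Y.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    x, y = set(antecedent), set(consequent)
    return support(x | y, transactions) / support(x, transactions)


# Rules from the "Support and Confidence" table, plus the butter rule.
RULES = [
    ({"milk"}, {"bread"}),
    ({"milk"}, {"egg"}),
    ({"bread"}, {"fruit"}),
    ({"fruit"}, {"egg"}),
    ({"milk", "bread"}, {"fruit"}),
    ({"milk", "bread"}, {"butter"}),
]

for x, y in RULES:
    s = support(x | y)
    c = confidence(x, y)
    print(f"{sorted(x)} => {sorted(y)}: "
          f"supp={float(s):.2f} conf={float(c):.2f} "
          f"supp*conf={float(s * c):.2f}")
```

Using `Fraction` keeps the values exact (for example 2/3 rather than a rounded 0.67), which makes them easier to compare with hand calculations.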
Lift
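The ratio of the observed support to that expected if $X$ and $Y$ were independent:

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$

For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a lift of $0.2 / (0.4 \times 0.4) = 1.25$.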
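Implied Relation | Lift |
---|---|
Independent | $\mathrm{lift} = 1$ |
Dependent: the value indicates the degree to which the two occurrences depend on one another | $\mathrm{lift} > 1$ |
Substitutes: the value indicates the degree to which the items are substitutes for each other | $\mathrm{lift} < 1$ |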
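To make the lift values concrete, here is a small sketch that computes lift on the five example transactions and maps the result to the relations in the table. The helper names (`support`, `lift`, `implied_relation`) are hypothetical and chosen only for this illustration; exact fractions are used so the comparison with 1 is not affected by floating-point rounding.

```python
from fractions import Fraction

# The five example transactions, as sets of purchased items.
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def lift(antecedent, consequent, transactions):
    """Observed support of X ∪ Y divided by the support expected
    if X and Y were independent, supp(X) * supp(Y)."""
    x, y = set(antecedent), set(consequent)
    expected = support(x, transactions) * support(y, transactions)
    return support(x | y, transactions) / expected


def implied_relation(lift_value):
    """Map a lift value to the relation it implies."""
    if lift_value == 1:
        return "independent"
    if lift_value > 1:
        return "dependent"
    return "substitutes"


examples = [
    ({"milk", "bread"}, {"butter"}),  # lift = 5/4 = 1.25
    ({"bread"}, {"butter"}),          # lift = 5/6, below 1
]
for x, y in examples:
    value = lift(x, y, TRANSACTIONS)
    print(f"{sorted(x)} => {sorted(y)}: lift={float(value):.3f} "
          f"({implied_relation(value)})")
```

With only five transactions these values are noisy, so the labels should be read as illustrations of the table rather than as findings about the items.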
Summary
- If rules were built from every possible itemset in the data, there would be so many rules that they would carry little meaning. That is why association rules are typically restricted to those that are well represented in the data.
- In practice, you are most likely to use only Support and Confidence. A rule must then satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
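The effect of the two thresholds can be sketched with a brute-force search over the example transactions. This is an illustration under stated assumptions, not how production tools work: `mine_rules` is a hypothetical helper that enumerates every itemset, whereas practical algorithms such as Apriori prune the search instead of checking all combinations.

```python
from itertools import combinations

# The five example transactions, as sets of purchased items.
TRANSACTIONS = [
    {"milk", "bread", "fruit"},
    {"butter", "egg", "fruit"},
    set(),
    {"milk", "bread", "butter", "egg", "fruit"},
    {"bread"},
]


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples for
    every rule that meets both thresholds. Brute force: exponential in
    the number of distinct items, so only suitable for tiny examples."""
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = supp(itemset)
            # Skip itemsets that never occur or fall below min_support.
            if s == 0 or s < min_support:
                continue
            # Try every non-empty proper subset as the antecedent.
            for k in range(1, size):
                for ante in combinations(combo, k):
                    x = frozenset(ante)
                    y = itemset - x
                    c = s / supp(x)
                    if c >= min_confidence:
                        rules.append((x, y, s, c))
    return rules


for x, y, s, c in mine_rules(TRANSACTIONS, min_support=0.3, min_confidence=0.9):
    print(f"{sorted(x)} => {sorted(y)}: supp={s:.2f} conf={c:.2f}")
```

Lowering either threshold lets more rules through, which is the rule explosion described above; raising them keeps only rules that are well represented in the data.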
Benefit
Association rules find patterns that help explain correlations and co-occurrences among items in a data set. Medicine is a good real-world example: association rules can help diagnose patients with rules of the form [symptoms => illness].
Downfalls
- You have to find appropriate parameter and threshold settings for the mining algorithm.
- The algorithm can discover a very large number of rules and does not guarantee their relevance or reliability.
R Package and Tutorial
arules: Association Rule Mining with R - A Tutorial (PDF File)