4.7 Apriori Algorithm
Ø The Apriori algorithm is an unsupervised machine learning algorithm used for association rule learning. Association rule learning is a data mining technique that identifies frequent patterns, connections, and dependencies among different groups of items, called itemsets, in data.
Ø The Apriori property on which it relies is also known as the downward closure property.
Ø The association rule is a strategy for detecting patterns in huge data sets. It involves finding relationships between variables in the data and using those relationships to make predictions or decisions. The purpose of an association rule is to find rules that define the association between distinct elements in a data set.
Ø It is a classic association rule
mining technique used to discover frequent itemsets (groups of items that
often appear together) in transactional databases.
Ø It is widely used in market basket
analysis, recommendation systems, and pattern recognition.
Ø It can also be used in the healthcare field
to find drug reactions for patients.
Ø It is designed to find relationships (or
associations) between items in a dataset, like which products are frequently
purchased together.
Ø It was proposed by R. Agrawal and R. Srikant in 1994.
The algorithm uses the "Apriori property", which implies that if an itemset is frequent, all of its subsets must also be frequent.
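The contrapositive of this property is what makes pruning possible: if any subset of a candidate is infrequent, the candidate itself cannot be frequent. A minimal sketch in Python (the function name and the sample itemsets are illustrative, not from the text):

```python
from itertools import combinations

def has_frequent_subsets(candidate, frequent_prev):
    """Apriori pruning check: a k-itemset can only be frequent if every
    (k-1)-subset already appears in the frequent set of the previous pass."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, k - 1))

# Hypothetical pass: {A,B} and {B,C} are frequent, but {A,C} is not.
frequent_2 = {frozenset("AB"), frozenset("BC")}

# {A,B,C} contains the infrequent subset {A,C}, so it is pruned.
print(has_frequent_subsets(frozenset("ABC"), frequent_2))  # False
```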
Frequent Itemset
Ø Frequent itemsets are those itemsets whose support is greater than or equal to the threshold value, i.e., the user-specified minimum support. This means that if {A, B} is a frequent itemset, then A and B must individually be frequent itemsets as well.
Ø Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}. In these two transactions, 2 and 3 are the frequent items.
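The support counting behind this small example can be reproduced directly (a sketch; the variable names are illustrative):

```python
from collections import Counter

# The two transactions from the text.
transactions = [{1, 2, 3, 4, 5}, {2, 3, 7}]

# Count in how many transactions each item occurs (its support count).
support = Counter(item for t in transactions for item in t)

# With a minimum support count of 2, only items 2 and 3 qualify.
frequent = {item for item, count in support.items() if count >= 2}
print(frequent)  # {2, 3}
```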
Ø The algorithm identifies "frequent itemsets," which are sets of items that appear together more often than a specified minimum support threshold.
Steps for Apriori Algorithm:
Step-1: Select the minimum support and confidence, and determine the support of each itemset in the transactional database.
Step-2: Select all itemsets whose support value is greater than the minimum (selected) support value.
Step-3: Find all the rules of these subsets that have a higher confidence value than the threshold (minimum confidence).
Step-4: Sort the rules in decreasing order of lift.
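The four steps above can be sketched end to end in Python. This is a naive level-wise implementation under stated assumptions; the function `apriori`, its parameter names, and the sample `data` are illustrative and not from the text:

```python
from itertools import combinations

def apriori(transactions, min_support=2, min_conf=0.5):
    """Sketch of the four steps: count supports, keep frequent itemsets,
    derive rules above the confidence threshold, sort by decreasing lift."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: level-wise search for frequent itemsets.
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in sorted(items)]
    k = 1
    while level:
        level = [c for c in level if count(c) >= min_support]
        frequent.update({c: count(c) for c in level})
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        level = list({a | b for a in level for b in level if len(a | b) == k + 1})
        k += 1

    # Step 3: keep rules with confidence >= min_conf.
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = sup / frequent[lhs]          # sup(X ∪ Y) / sup(X)
                lift = conf / (frequent[rhs] / n)   # conf / support-fraction of Y
                if conf >= min_conf:
                    rules.append((lhs, rhs, conf, lift))

    # Step 4: sort the rules in decreasing order of lift.
    rules.sort(key=lambda rule: -rule[3])
    return frequent, rules

# Hypothetical usage on a toy dataset.
data = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "C"], ["A", "B", "C", "D"]]
freq, rules = apriori(data, min_support=2, min_conf=0.5)
```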
Example:
Suppose we have a dataset of transactions over the items A, B, C, D, and E, with a minimum support count of 2 and a minimum confidence of 50%. From this dataset, we need to find the frequent itemsets and generate the association rules using the Apriori algorithm:
Step-1: Calculating C1 and L1:
· In the first step, we will create a table that contains the support count (the frequency of each item individually in the dataset) of each item in the given dataset. This table is called the Candidate set, or C1.
· Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This gives us the table for the frequent itemset L1.
All the itemsets except E meet the minimum support count, so the E itemset will be removed.
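Since the original transaction table did not survive conversion, the sketch below uses a hypothetical dataset chosen to be consistent with the counts described in this walkthrough (minimum support 2, E infrequent, and L3 = {A, B, C}):

```python
from collections import Counter

# Hypothetical transaction table consistent with the walkthrough.
transactions = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

# C1: the support count of every individual item.
c1 = Counter(item for t in transactions for item in t)

# L1: keep items meeting the minimum support count of 2; E (count 1) drops out.
l1 = {item: n for item, n in c1.items() if n >= 2}
print(sorted(l1))  # ['A', 'B', 'C', 'D']
```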
Step-2: Candidate Generation C2, and L2:
· In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets.
· After creating the pairs, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. This gives us the candidate table C2.
· Again, we need to compare each C2 support count with the minimum support count; the itemsets with a lower support count will be eliminated from C2. This gives us the table L2.
Step-3: Candidate generation C3, and L3:
· For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of three items together, and will calculate the support count from the dataset.
· Now we will create the L3 table. In the C3 table, there is only one itemset whose support count is equal to the minimum support count. So L3 will have only one combination, i.e., {A, B, C}.
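The C3 join-and-prune step can be sketched as follows, continuing from the same hypothetical transaction table and assuming L2 = {A,B}, {A,C}, {B,C}, {B,D}:

```python
# Same hypothetical transaction table as in Step-1.
transactions = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
l2 = [frozenset(p) for p in ("AB", "AC", "BC", "BD")]

# Join step: union pairs of L2 itemsets into 3-item candidates (C3).
c3 = {a | b for a in l2 for b in l2 if len(a | b) == 3}

# Prune by support: only {A, B, C} reaches the minimum support count of 2.
support = {c: sum(1 for t in transactions if c <= t) for c in c3}
l3 = [c for c, n in support.items() if n >= 2]
print(l3)  # [frozenset({'A', 'B', 'C'})]
```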
Step-4: Finding the association rules for the subsets:
· To generate the association rules, first, we will create a new table with the possible rules from the frequent combination {A, B, C}.
· For each rule X → Y, we will calculate the Confidence using the formula sup(X ∪ Y) / sup(X).
· After calculating the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).
As the given threshold or minimum confidence is 50%, the first three rules, A^B → C, B^C → A, and A^C → B, can be considered strong association rules for the given problem.
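The confidence computation can be checked directly against the same hypothetical transaction table (the function name is illustrative):

```python
def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    sup = lambda s: sum(1 for t in transactions if s <= t)
    return sup(antecedent | consequent) / sup(antecedent)

# Same hypothetical transaction table as above.
transactions = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

# A^B -> C: sup({A,B,C}) / sup({A,B}) = 2/4 = 50%, so the rule is kept.
print(confidence({"A", "B"}, {"C"}, transactions))  # 0.5

# A -> B^C: 2/6 ≈ 33%, below the 50% threshold, so it is rejected.
print(confidence({"A"}, {"B", "C"}, transactions))
```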
Advantages of Apriori Algorithm
· It is an easy-to-understand algorithm.
· The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages of Apriori Algorithm
· The Apriori algorithm is slow compared to other algorithms.
· Overall performance can suffer because it scans the database multiple times.
· The time and space complexity of the Apriori algorithm is O(2^D), which is very high. Here D represents the number of distinct items (the horizontal width) present in the database.
Applications:
· Market Basket Analysis: Identifying which products are frequently bought together in supermarkets.
· Recommendation Systems: Suggesting products or services based on past purchases or browsing history.
· Healthcare: Identifying patterns in disease outbreaks or patient behavior.
· Bioinformatics: Analysing DNA and protein sequences.