Finding-Frequent-Itemset

Overview

In this project, I implemented SON Algorithm using the Apache Spark Framework to find frequent item sets.

One of the major tasks is to find all the possible combinations of the frequent itemsets in a given input file using A-Priori algorithms. The project involves working of SON algorithm on two different datasets, one simulated dataset and one real-world generated dataset.

Apart from input file, 2 separate inputs are provided: • Filter threshold: Integer that is used to filter out qualified users • Support: Integer that defines the minimum count to qualify as a frequent itemset

The steps for finding frequent itemset includes:

Finding the candidates of frequent itemset (as singletons, pairs, triples, etc.) that maybe qualified as frequent given a support threshold (that maps to a frequent bucket).
Calculating the combinations of frequent itemset (as singletons, pairs, triples, etc.) that are actually frequent given a support threshold.

The code is optimized to run efficiently under 500 seconds for support 50 and filter threshold 20. The printed itemsets are sorted in lexicographical order.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
DataSet		DataSet
FrequentItemSetsSimulatedData.py		FrequentItemSetsSimulatedData.py
FrequentItemSetsTaFengData.py		FrequentItemSetsTaFengData.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Finding-Frequent-Itemset

Overview

About

Releases

Packages

Languages

laidasani/Finding-Frequent-Itemset

Folders and files

Latest commit

History

Repository files navigation

Finding-Frequent-Itemset

Overview

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages