You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When the number of distinct items is larger than the max itemset size this library gives incorrect results.
I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).
Here's a test:
#all combinations of length-4 from 20 different items
database = [frozenset(t) for t in itertools.combinations(range(20), r=4)]
fitemlists_w_sup = fp_growth.find_frequent_itemsets(database, minimum_support=2,
include_support=True)
fitemlists_w_sup = list(fitemlists_w_sup) # I will reuse the generator
#brute force way calculate support
counts = []
for transaction in database:
psets = powerset(transaction)
psets = [pset for pset in psets if len(pset) > 0] # remove empty set
psets = map(frozenset, psets)
counts.extend(psets)
counts = collections.Counter(counts)
#check that the brute-force method agrees with the fp-growth library
fpd = dict([(frozenset(x[0]), x[1]) for x in fitemlists_w_sup])
for fitemlist_w_sup in fitemlists_w_sup:
if fitemlist_w_sup[1] != counts[frozenset(fitemlist_w_sup[0])]:
print 'fp support is wrong. fp support: {0}, true support: {1}'\
.format(fitemlist_w_sup, counts[frozenset(fitemlist_w_sup[0])])
#check that the apriori principle is satisfied
for subpset in powerset(fitemlist_w_sup[0]):
if len(subpset) > 0:
if fpd[frozenset(subpset)] < fpd[frozenset(fitemlist_w_sup[0])]:
print 'apriori?, subset and support: {0} {1}, superset and support: {2}'\
.format(subpset, fpd[frozenset(subpset)],fitemlist_w_sup)
For reference, here's how I calculated powerset:
def powerset(iterable):
xs = list(iterable)
return itertools.chain.from_iterable(
itertools.combinations(xs, n) for n in range(max_size + 1))
The text was updated successfully, but these errors were encountered:
When the number of distinct items is larger than the max itemset size this library gives incorrect results.
I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).
Here's a test:
For reference, here's how I calculated powerset:
The text was updated successfully, but these errors were encountered: