Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect results when the number of distinct items is large #6

Open
rcompton opened this issue Oct 2, 2014 · 0 comments
Open

incorrect results when the number of distinct items is large #6

rcompton opened this issue Oct 2, 2014 · 0 comments

Comments

@rcompton
Copy link

rcompton commented Oct 2, 2014

When the number of distinct items is larger than the max itemset size this library gives incorrect results.

I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).

Here's a test:

#all combinations of length-4 from 20 different items
database = [frozenset(t) for t in itertools.combinations(range(20), r=4)]

fitemlists_w_sup = fp_growth.find_frequent_itemsets(database, minimum_support=2,
                                                    include_support=True)
fitemlists_w_sup = list(fitemlists_w_sup)  # I will reuse the generator

#brute force way calculate support
counts = []
for transaction in database:
    psets = powerset(transaction)
    psets = [pset for pset in psets if len(pset) > 0]  # remove empty set
    psets = map(frozenset, psets)
    counts.extend(psets)
counts = collections.Counter(counts)

#check that the brute-force method agrees with the fp-growth library
fpd = dict([(frozenset(x[0]), x[1]) for x in fitemlists_w_sup])
for fitemlist_w_sup in fitemlists_w_sup:
    if fitemlist_w_sup[1] != counts[frozenset(fitemlist_w_sup[0])]:
        print 'fp support is wrong. fp support: {0}, true support: {1}'\
            .format(fitemlist_w_sup, counts[frozenset(fitemlist_w_sup[0])])

        #check that the apriori principle is satisfied
        for subpset in powerset(fitemlist_w_sup[0]):
            if len(subpset) > 0:
                if fpd[frozenset(subpset)] < fpd[frozenset(fitemlist_w_sup[0])]:
                    print 'apriori?, subset and support: {0} {1}, superset and support: {2}'\
                        .format(subpset, fpd[frozenset(subpset)],fitemlist_w_sup)

For reference, here's how I calculated powerset:

def powerset(iterable):
    xs = list(iterable)
    return itertools.chain.from_iterable(
        itertools.combinations(xs, n) for n in range(max_size + 1))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant