
Optimization of apriori algorithm by replacing iterator with matrix operations #566

Closed
10 of 14 tasks
jmayse opened this issue Jul 18, 2019 · 0 comments · Fixed by #567
jmayse commented Jul 18, 2019

Currently, the apriori implementation constructs a set of possible item combinations, then iterates through this set to determine which combinations have support above min_support. This iteration is slow and can be replaced by matrix equality operations and/or arithmetic. I propose the following changes:

    while max_itemset and max_itemset < (max_len or float('inf')):
        next_max_itemset = max_itemset + 1
        combin = np.array(list(generate_new_combinations(itemset_dict[max_itemset])))

        if combin.size == 0:
            break

        if verbose:
            print('\rProcessing %d combinations | Sampling itemset size %d' %
                  (combin.shape[0], next_max_itemset), end="")

        if is_sparse:
            # Sparse path: compare each candidate's columns against a ones
            # vector and AND the per-item results together.
            all_ones = np.ones((int(rows_count), 1))
            _bools = X[:, combin[:, 0]] == all_ones
            for n in range(1, combin.shape[1]):
                _bools = _bools & (X[:, combin[:, n]] == all_ones)
        else:
            # Dense path: one fancy-indexing pass yields a 3-D array; reducing
            # over the last axis marks the transactions that contain every
            # item of each candidate itemset.
            _bools = np.all(X[:, combin], axis=2)

        support = _support(np.array(_bools), rows_count, is_sparse)
        _mask = (support >= min_support).reshape(-1)

        if any(_mask):
            itemset_dict[next_max_itemset] = np.array(combin[_mask])
            support_dict[next_max_itemset] = np.array(support[_mask])
            max_itemset = next_max_itemset
        else:
            break

In practice, this implementation is much faster than the current implementation, albeit at the cost of a slightly increased memory footprint:

https://gist.github.com/jmayse/ad688d6a7fd842269996a701d7cecd4c

However, it usually remains slower than the fpgrowth implementation, as expected:

https://gist.github.com/jmayse/7c76a2d838ac164b923a47b29527f2ed
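
The gists above contain the actual measurements. For a self-contained illustration of why the matrix form wins (this is a hypothetical harness on synthetic data, not the linked benchmarks), one can time a per-candidate Python loop against the single vectorized pass and check that both produce identical supports:

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
# Synthetic boolean transaction matrix: 2000 transactions x 40 items.
X = rng.random((2000, 40)) < 0.3
# All candidate 2-itemsets as index pairs.
combin = np.array([(i, j) for i in range(40) for j in range(i + 1, 40)])

def loop_support():
    # Iterator-style: compute support for each candidate one at a time.
    return np.array([(X[:, i] & X[:, j]).mean() for i, j in combin])

def matrix_support():
    # Vectorized: one fancy-indexing pass covers all candidates at once.
    return np.all(X[:, combin], axis=2).mean(axis=0)

# Both strategies must agree on the supports.
assert np.allclose(loop_support(), matrix_support())
print("loop:   %.4fs" % timeit.timeit(loop_support, number=3))
print("matrix: %.4fs" % timeit.timeit(matrix_support, number=3))
```

As in the proposal, the vectorized form trades memory (the intermediate 3-D boolean array) for speed.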

  • Open a new "issue" on GitHub to discuss the new feature / bug fix
  • Fork the mlxtend repository from GitHub (if not already done earlier)
  • Create and check out a new topic branch (please don't make modifications in the master branch)
  • Implement the new feature or apply the bug-fix
  • Add appropriate unit test functions in mlxtend/*/tests
  • Run nosetests ./mlxtend -sv and make sure that all unit tests pass
  • Check/improve the test coverage by running nosetests ./mlxtend --with-coverage
  • Check for style issues by running flake8 ./mlxtend (you may want to run nosetests again after you made modifications to the code)
  • Add a note about the modification/contribution to the ./docs/sources/changelog.md file
  • Modify documentation in the appropriate location under mlxtend/docs/sources/
  • Push the topic branch to the server and create a pull request
  • Check the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend
  • Check/improve the unit test coverage at https://coveralls.io/github/rasbt/mlxtend
  • Check/improve the code health at https://landscape.io/github/rasbt/mlxtend