-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apriori optimization #567
Apriori optimization #567
Conversation
Thanks a lot for the PR! 3-6x is definitely a substantial improvement and very welcome :). Regarding the implementation, it'll probably be more memory hungry since everything is done within an array instead of iterating over a generator, but I suppose it's not an issue on big datasets based on your experience with this code? (I assume, big datasets where this would be prohibitive would be so large that the iterative version would run ~forever anyways). Otherwise, the alternative would be to enable both options via a flag " |
I don't really think that is overkill at all. The user should be able to specify a lightweight option in terms of memory usage for applications on low memory platforms (raspberry pi, for instance). Additionally the previous method using the iterator will always work - if given infinite time. There's value in that on its own. I'll add it back in under a flag with testing around the flag behavior. |
print(out.getvalue()) | ||
except TypeError: | ||
# If there is no low_memory argument, don't run the test. | ||
assert True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like the try-except-else
pattern, but it's common in Python and works here. I'm very, very open to suggestions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, since it's a function, that's a tricky one ... one solution, which is not that much better, would be
import inspect
def test_low_memory_flag(self):
if 'low_memory' in inspect.getargspec(self.fpalgo)[0]:
with captured_output() as (out, err):
_ = self.fpalgo(self.df, low_memory=True, verbose=1)
print(out.getvalue())
else:
# If there is no low_memory argument, don't run the test.
assert True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I like that better. I've changed this test to use inspect.signature
and removed that pesky debug print
that snuck through
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
Sounds good then. Also, the low-resource env makes sense. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks great, there are just a few stylistic issues I noted below.
@@ -105,6 +111,19 @@ def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0): | |||
http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/ | |||
|
|||
""" | |||
|
|||
def _support(_x, _n_rows, _is_sparse): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really important since private functions don't show up in the docs, but usually, we use the NumPy-style documentation style. That's because I had written a Python doc to markdown parser that automatically generates the documentation for the HTML website documentation. No need to change it here though because like I said, it won't appear on the website
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is definitely the time to change it if you want me to. I'm happy to do it to conform to the overall style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, if you don't mind, let me just do that real quick...even if it doesn't appear on the website we should be consistent to the overall style. Why build in inconsistencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -> 04e1e31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sure, please go ahead then. Thanks!
I think everything should be good to go now (once the tests pass). Just added an additiona tests such that the whole suite runs for both Unless you have further suggestions, I would merge it once the tests pass. Will then also make a new version release later tonight or tomorrow -- it's about time ;). Thanks a lot for this PR! |
Happy to! Thanks for maintaining the library! |
Description
Currently, the implementation of the apriori algorithm uses a slow iteration to examine each item combination for above-threshold support. This iteration can be replaced by matrix operations that are generally faster, but use slightly more memory in some cases.
While still generally slower than
fpgrowth
, this implementation of apriori is 3-6x faster:https://gist.github.com/jmayse/ad688d6a7fd842269996a701d7cecd4c
https://gist.github.com/jmayse/7c76a2d838ac164b923a47b29527f2ed
The current use of the
verbose
operator doesn't really make sense without the iterator; hence, I have replaced it with a statement that outputs the current number of combinations being compared.The exit behavior for the main
while
loop has also changed from settingmax_itemset = 0
to usingbreak
. If this is undesirable, we can easily reverse this change.Related issues or pull requests
Fixes #566
Pull Request Checklist
./docs/sources/CHANGELOG.md
file (if applicable)./mlxtend/*/tests
directories (if applicable)mlxtend/docs/sources/
(if applicable)PYTHONPATH='.' pytest ./mlxtend -sv
and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv
)flake8 ./mlxtend