Group time series split #915
Conversation
Sorry, I mistakenly added 2 files (settings.json and conda_requirements.txt) and removed them in additional commits.
Thanks for the PR! Wow, this looks really good and super professional! While looking over the code, I had these two thoughts here:
mlxtend/evaluate/__init__.py
@@ -40,4 +41,5 @@
     "RandomHoldoutSplit", "PredefinedHoldoutSplit",
     "ftest", "combined_ftest_5x2cv",
     "proportion_difference", "bias_variance_decomp",
-    "accuracy_score", "create_counterfactual"]
+    "accuracy_score", "create_counterfactual",
+    "time_series"]
I think "time_series" should be "GroupTimeSeriesSplit" here.
Yes, I was wrong; I corrected and committed the changes.
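For reference, the corrected tail of the export list then reads (matching the diff above, with only the last entry swapped):

"accuracy_score", "create_counterfactual",
"GroupTimeSeriesSplit"]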
import numpy as np
import pytest
from mlxtend.evaluate import GroupTimeSeriesSplit
The whole unit test suite here looks pretty comprehensive to me. I wonder if we could add a computationally cheap scikit-learn-related test though. E.g., plugging it into cross_val_score or GridSearchCV as the cv argument?
Added a test to check the usage with cross_val_score, based on DummyClassifier with the "most_frequent" strategy.
With this splitter we get 3 splits, with the following true/predicted test targets and accuracies:
y | y_pred | accuracy
---|---|---
[1 1] | [0 0] | 0
[0 1] | [1 1] | 0.5
[1 0 0 0] | [1 1 1 1] | 0.25
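For context, a minimal sketch of what such a test can look like; the data, group layout, and splitter arguments below are made-up placeholders and do not reproduce the exact folds in the table above:

import numpy as np
from sklearn.dummy import DummyClassifier
from mlxtend.evaluate import GroupTimeSeriesSplit

X = np.arange(32).reshape(16, 2)                    # placeholder features
y = np.tile([0, 1], 8)                              # placeholder binary targets
groups = np.repeat([0, 1, 2, 3], 4)                 # 4 contiguous groups of 4 samples
cv = GroupTimeSeriesSplit(test_size=1, n_splits=3)  # assumed constructor arguments
for train_idx, test_idx in cv.split(X, y, groups):
    clf = DummyClassifier(strategy="most_frequent").fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    # print the true test targets, predictions, and per-fold accuracy,
    # analogous to the three rows of the table above
    print(y[test_idx], y_pred, (y_pred == y[test_idx]).mean())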
This is nice, thanks! The main reason I suggested it is to check for API compatibility. I am pretty sure it is compatible with GridSearchCV and cross_val_score, but you never know. It also makes sure that it keeps working in case we modify it in the future, or in case the scikit-learn API changes.
What I had in mind was something like:
from mlxtend.evaluate import GroupTimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression()
cv = GroupTimeSeriesSplit(...)
cross_val_score(lr, X, y, cv=cv)
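A runnable version of that sketch could look as follows; the data and the splitter arguments are made-up placeholders (the real test would use the project's fixtures and suitable test_size/n_splits values):

import numpy as np
from mlxtend.evaluate import GroupTimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.RandomState(0).rand(16, 3)            # placeholder features
y = np.tile([0, 1], 8)                              # placeholder binary targets
groups = np.repeat([0, 1, 2, 3], 4)                 # contiguous, ordered group labels
lr = LogisticRegression()
cv = GroupTimeSeriesSplit(test_size=1, n_splits=3)  # assumed constructor arguments
print(cross_val_score(lr, X, y, groups=groups, cv=cv))  # one accuracy per split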
Hi Sebastian, during the implementation of the requested changes, I have some questions:
Thank you.
Hello @labdmitriy! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-05-25 06:50:32 UTC
I've implemented additional changes:
There are several updates of __init__.py because imports are automatically sorted by VS Code (using the built-in isort tool), whereas the imports in the file are not sorted; so, to avoid changing your import order, I did it separately in a plain text editor.
Thanks a lot for the PR! Sorry, I recently got really swamped with tasks due to some paper reviews and following up on other PRs! I will hopefully have more time again soon!
This is a good point. Unless it affects your code, there is nothing to worry about. GitHub will automatically flag conflicts if they appear, and we can then try to deal with them in the web interface. Btw I resolved one of the issues, and you may have to pull on your end.
That's a tricky one. Personally, I feel like it's fine to check, e.g., input arguments etc. It's not strictly scikit-learn consistent, but I don't see any harm in it, to be honest. So, I am just checking the
I'd say it's best to discuss it here so that we don't have to jump back and forth too much.
No worries, it's all good! :). I try to keep on top of things, but I am a little bit time constrained this week.
Good point regarding the pep8/black discrepancy. I was thinking about this a bit, but maybe it's just time to adjust to more modern times and use the 88-character limit rather than 79. Then, if users are strict about 79 chars, it will still be okay when they submit. On the other hand, if they are at 88 characters, they don't get a complaint either. I think this will make the PRs a bit more frictionless while still maintaining recommended styles. I will adjust the pep8 checker via #920
Hi Sebastian,
No problem, I will also have more free time during the next 2 weeks, so I can answer more quickly too.
I guess you mean the black reformatting and resolving the conflict in __init__.py; I will pull it into my local feature branch.
Great! Then I will keep it 'as is'.
Great, ok!
Ok, thank you!
Thank you.
To be honest, the code looks really good to me now. The next thing on my list is to check out the Jupyter Nb then. Yeah, I personally also always (often) separated standard lib imports and 3rd party imports (when I don't forget). So, I like the idea of adding the
Oh, I definitely don't want to drop this. It's a really nice and useful PR. I also learned a lot regarding black & isort. Very useful! I will try to review the Nb either later tonight or tomorrow early morning to give some more feedback.
Great!
I made a comment in the relevant pull request about the updated configuration; probably mlxtend was implied, not biopandas.
Excellent!
I've been thinking about it and would like to suggest returning to this task when you have time, if it is still of interest to you.
The documentation is a great start. It looks very comprehensive, and I love the plots. What's nice about them is that they are automatically generated, which allows us and the users to create the plots for all kinds of scenarios. However, regarding the documentation of use cases, I don't think it needs to be exhaustive and show all possible ways you can use it. That would be very overwhelming. My suggestion is to focus on a few, but give the users the tools to explore the other ones if they are interested. (I.e., if we have a few well-explained examples, users can copy and modify them.) So, concretely, here are a few suggestions:
Your first example could be:
Then, the second one could be
and that's it.
Hi Sebastian, thanks a lot for your feedback. I will prepare the notebook based on your requirements and push all the changes.
Hi Sebastian, I've made changes in the Jupyter notebook and the code based on your comments, and have a few questions/notes:
Notes
Questions
Again, I am asking too many questions 😄, but I think (and hope 🤞) they can be useful. Thank you.
Hi @rasbt, could you please tell me whether I need to improve anything for this PR? Thank you.
I really like this restructured version! A few points that I think can be improved:
1. It's nice to start off with a general problem introduction (just as you did). However, considering that there is https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html, people might also be curious about the relationship to
2. A few words about the example data would be helpful. First, I would put the features and targets first, and then start with something like "For the following examples, we are creating an example dataset consisting of 16 training data points ...". And then you can explain that we create 6 different groups so that the first training example belongs to group 0, the next 4 to group 1, and so forth. Btw. a side question about the implementation: do the groups have to be in consecutive order? Or could it be
3. For each example, it would also be nice to start with a few words describing what we are looking at:
Otherwise it is pretty good and much more accessible than before! Thanks for the update!
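As a sketch of that suggested setup (the first two group sizes follow the description above; the remaining group sizes and all feature/target values are made up for illustration):

import numpy as np
# 16 training data points in 6 groups: the first point belongs to group 0,
# the next 4 to group 1, and so forth.
groups = np.array([0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5])
X = np.arange(16).reshape(-1, 1)   # placeholder features
y = np.zeros(16, dtype=int)        # placeholder targets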
Sure, let's revisit this when we have the final version. I can test this in my local mkdocs version then.
That would not be necessary. I will also remove the other folders in the future to save space on GitHub.
Good points regarding the CI/workflow setups.
Regarding line-length, I just wanted to be explicit about that as a visual aid (so that it is clear that this is 88 without knowing the defaults). I think some people are still new to black and perhaps expect it to be 79.
I think I had some issues with multiline, which is why I added it, but I don't remember to be honest. Do you know if you can run it
The inconsistency you mentioned refers to the missing
Hi @rasbt,
I described the advantages of this implementation over scikit-learn's TimeSeriesSplit.
I create a dataset with features and specify months as the index to make the usage examples clearer; therefore, I decided to define the groups and months before the features and target.
They could be as in your example, and I have corresponding tests and a description in the first section of the notebook; I assumed that this order (not ascending, but with contiguous group values) is "consecutive" — see the short illustration below.
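To make that reading of "consecutive" concrete (my illustration, not code from the PR):

groups_ok = [2, 2, 2, 0, 0, 1, 1, 1]   # contiguous blocks, not ascending: accepted
groups_bad = [0, 0, 1, 1, 0, 0]        # group 0 split into two blocks: rejected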
Done
Thank you!
Great!
Unfortunately, I don't have such experience yet, but as I progress through articles about development, I can probably say something more precise.
Here I just tried to share my thought that the CI and pre-commit hooks probably have slightly different configurations for isort and black, and the decision about what to add or delete depends on the desired target configuration. To be more succinct, my other notes were about the following:
Thank you.
@rasbt
Hi @rasbt, could you please tell me whether there are plans to finish this pull request over the next few weeks? Thank you.
Thanks for updating the docs, and thanks for your patience. I currently have too many ongoing projects, so I can't check in every day. So, if you want to revisit this in summer, I can totally understand. On the other hand, I think this PR is very close; it's just a bit of polishing the docs. Btw when going over the docs, there was mainly only one thing that I found a bit confusing. I.e., in example 4, I wasn't sure what exactly the expanding window size was. E.g., here is an image from Example 3:
[figure: split illustration from Example 3]
What exactly is expanded here? Do you mean that there are now 4 instead of 3 training groups (but this is specified), or is there something else I am missing? My best guess was that you mean that the splits depend on the training and test sizes, so I moved this illustration up to example 1, where we talk about the train and test group sizes. I think this way it is a more natural reading order for the user. What do you think?
Hi @rasbt, sorry if you feel that I am pressuring you; it was not my intention. I thought that I had disturbed you several times recently, and decided to switch to other tasks until you have more free time. Thank you.
No worries! It's all good :). I am still very excited about this PR and sometimes just wish the day had more hours 😅
Hi @rasbt, thank you for your update. Your description is much clearer and cleaner than mine, so of course it is fine with me. I just fixed one typo of mine.
Probably I didn't understand your recommendation from here correctly:
I thought that it was about expanding the size of the training or test dataset, but maybe you meant another type of window (expanding).
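For comparison, a sketch of the two window behaviors, assuming the splitter exposes a window_type argument with "rolling" and "expanding" options:

from mlxtend.evaluate import GroupTimeSeriesSplit
# rolling: the training window keeps a fixed number of groups and slides forward
cv_rolling = GroupTimeSeriesSplit(test_size=1, train_size=2, window_type="rolling")
# expanding: the training window keeps its start and grows with each split
cv_expanding = GroupTimeSeriesSplit(test_size=1, train_size=2, window_type="expanding")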
I added a new entry to the Changelog for GroupTimeSeriesSplit. Thank you.
Thanks, I think this PR is ready to merge now. Thanks so much for the hard and dedicated work on this! And sorry for not being as responsive; it's a really busy time for me right now. However, I am super glad about this PR. I may even try to make a new version release tonight so that it can be used. Regarding the
comment. We can always fine-tune the documentation later; it's not coupled to the main code. But what I meant was that I found that example 4 was basically a natural extension of example 1, so I merged them. From a reader's perspective, I was thinking that it is more intuitive this way.
Thank you for your patience in answering all my questions; it was a very useful and interesting experience!
You are totally correct, it is much better now. If you don't mind, I will write an article about this experience, because I know a lot of people (like me until recently) who think that contributing is too difficult to even try.
Actually, this was really great. During this process, we added lots of useful things like pre-commit hooks, black, isort, etc. :)!
Sure, I think this is worthwhile and will be interesting for many people!
Cool! I will keep that in mind!
Code of Conduct
Description
Add group time series cross-validator implementation.
Add tests with 100% coverage using pytest.
I decided to create the pull request before writing the documentation and changelog modifications, to discuss the current implementation and further steps.
Related issues or pull requests
Fixes #910
Pull Request Checklist
- Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
- Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
- Modified documentation in the corresponding location under mlxtend/docs/sources/ (if applicable)
- Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- Checked for style issues by running flake8 ./mlxtend