-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a periodic encoder to the DatetimeEncoder #1235
base: main
Are you sure you want to change the base?
Conversation
skrub/_datetime_encoder.py
Outdated
self._required_transformers[_case] = _t | ||
if _case == "year": | ||
# TODO: In theory we should check that the year is leap | ||
_t = PeriodicEncoder(kind="circular", period=366) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it a bit surprising that the encoding is different for day and year? maybe it could be splines with 4 splines or something like that?
@@ -256,10 +261,19 @@ class DatetimeEncoder(SingleColumnTransformer): | |||
timezone used during ``fit`` and that we get the same result for "hour". | |||
""" # noqa: E501 | |||
|
|||
def __init__(self, resolution="hour", add_weekday=False, add_total_seconds=True): | |||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we not generate the values when we are doing the periodic encoding? eg if we have spline encoding of the time of day, maybe we shouldn't by default output the "hour" feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe users still need the additional columns as features for some reason? Though I guess that by the point they get to vectorizing the table they're already done with exploring the data which is what keeping those features would help with 🤔
I'll remove the old features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure what would the best interface to make the skrub code simple and to make it easy for users to specify which features they want / figure out which features they will get 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A parameter like "keep_original"
would make sense to me. This reminds me somehow of coalesce
in polars join https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html, which pandas doesn't have.
I finished the base implementation and split the I am working locally on the examples, and now I need to write all the tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot @rcap107 ! here is a first few comments
Period to be used as basis of the trigonometric function. | ||
""" | ||
|
||
def __init__(self, period=24): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we shouldn't have a default for the period. eg 24 only makes sense for hours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, I'll remove it
skrub/_datetime_encoder.py
Outdated
@@ -26,6 +29,8 @@ | |||
"nanosecond", | |||
] | |||
|
|||
_DEFAULT_ENCODING_PERIODS = {"year": 4, "month": 30, "weekday": 7, "hour": 24} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 4 for year? shouldn't we set the period to 12 and use the month number or something like that? or set the period to 366 if we use the day of year?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the beginning I had 365 and the spline as default, but having 365 features for each year was too much, and I didn't have any better idea than setting it to a random number (4 seasons I guess)
Now that circular is the default, we can set it to 366
skrub/_datetime_encoder.py
Outdated
Select the strategy used to encode days in a week. By default, use a | ||
CircularEncoder with period=7. If None, no encoding is performed. | ||
|
||
hour_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the naming is inconsistent: does it indicate the time unit that gets encoded, or the one that defines the range?
for example here year_encoding
encodes the position within the year, so the name corresponds to the range, but hour_encoding
encodes the position within the day
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the latter option, so hour_encoding
will be sin(hour / 24 * 2* pi)
whereas day_of_year_encoding
will be sin(day_of_year / 366 * 2 * pi)
skrub/_datetime_encoder.py
Outdated
_enc_attr = [attr for attr in self.__dict__ if attr.endswith("_encoding")] | ||
for _enc_name in _enc_attr: | ||
_enc = self.__getattribute__(_enc_name) | ||
_enc_case = _enc_name.split("_")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why "case"?
skrub/_datetime_encoder.py
Outdated
# parameters | ||
if self.add_periodic: | ||
_enc_attr = [attr for attr in self.__dict__ if attr.endswith("_encoding")] | ||
for _enc_name in _enc_attr: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detail: local variable names don't often start with an underscore. no special reason but it's one more character and I don't see what it adds
skrub/_datetime_encoder.py
Outdated
X_out = sbd.concat_horizontal(X_out, *_new_features) | ||
|
||
# Censoring all the null features | ||
X_out = sbd.where_row(X_out, self.not_nulls, _null_mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the circular and spline encoders could do that themselves on the numpy output and we wouldn't need the where_row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
X_out[~self.not_nulls] = np.nan
works with pandas, but doesn't work with polars, is there another way?
|
||
def __init__(self, period=24, n_splines=None, degree=3): | ||
self.period = period | ||
self.n_splines = n_splines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where you had year period = 4 before I think maybe what you wanted is period of 366 (or 12 if we use the month) and number of splines = 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, correct
The SplineEncoder and the CircularEncoder are not documented in the API page, and thus they don't render correctly in the docs: |
skrub/_datetime_encoder.py
Outdated
integer. | ||
""" | ||
|
||
def __init__(self, period, n_splines=None, degree=3): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In scikit-learn design, it is very frowned upon to have a parameter without a default value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it after @jeromedockes suggested that having a default parameter that has nothing to do with the value to encode (e.g., 24 for month or year) would not be useful
I can add it back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep sorry about that @rcap107 . Indeed I had said there was no sensible default here, but yeah you can add it back
skrub/_datetime_encoder.py
Outdated
def _more_tags(self): | ||
return {"preserves_dtype": []} | ||
|
||
def __sklearn_tags__(self): | ||
tags = super().__sklearn_tags__() | ||
tags.transformer_tags = TransformerTags(preserves_dtype=[]) | ||
return tags | ||
|
||
|
||
class SplineEncoder(SingleColumnTransformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a design perspective, this class seems to me really like a super thin wrapper on the scikit-learn SplineTransformer.
Would it not be possible to get rid of it, and use the SplineTransformer? I would like to minimize almost-redundant functionnality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I just fold the logic of both encoders back in the main DatetimeEncoder then? CircularEncoder is just wrapping a call to np.sin/np.cos
I was thinking that it could be useful to have periodic encoders for non-datetime features, but maybe I'm wrong
skrub/_datetime_encoder.py
Outdated
Period to be used as basis of the trigonometric function. | ||
""" | ||
|
||
def __init__(self, period): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about scikit-learn really liking defaults. Put 365 for instance
@@ -26,6 +29,8 @@ | |||
"nanosecond", | |||
] | |||
|
|||
_DEFAULT_ENCODING_PERIODS = {"year": 366, "month": 30, "weekday": 7, "hour": 24} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naive question: years and months don't actually have a periodicity of 366 and 30. How do we deal with this?
Is a solution to use the actually average length (365.25 days/year, I believe)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The period must be an integer
I don't have a good answer, but my gut feeling is that choosing between period 365 and 366 (and even between 28 and 31 for months) is not going to make a noticeable difference downstream
I don't understand why, when I look at the generated example gallery, it looks like the relevant examples did not run:
I would really like to see the examples running. I don't have a feeling for merging a PR without the examples running. |
skrub/_datetime_encoder.py
Outdated
|
||
year_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular" | ||
Select the strategy used to encode days in a year. By default, use a | ||
CircularEncoder with period=365. If None, no encoding is performed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I follow things right, this is used only if "add_periodic" is True above. We should mention this.
skrub/_datetime_encoder.py
Outdated
CircularEncoder with period=365. If None, no encoding is performed. | ||
|
||
month_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular" | ||
Select the strategy used to encode days in a month. By default, use a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here: this is used only if "add_periodic" is True above
skrub/_datetime_encoder.py
Outdated
add_periodic : bool, default=False | ||
Add periodic features with different granularities. By default, use | ||
trigonometric (circular) encoding of features. Spline encoding and | ||
custom encoders are also supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a simpler API that removes the "*_encoding" arguments below and changes this argument to accepting False, "circular", "spline"?
It would reduce the flexibility of the object, but my hunch is that it would cover most of the usecases. And it would remove the inter-dependency between arguments (aka "add_periodic=False" means that the arguments below are ignored), which make user mistake more likely and hyper-parameter selection easier)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Gael. Too many hyper-parameter options might insensitive users to grid search them all while it would probably not bring much performance lift.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users might want to gridsearch at least some parameters for different variables (like the number of splines for encoding a year)
Either way, I can remove everything and just hardcode the defaults
If we go with the simplified implementation, there is really no reason to have the separate encoders, is there
Co-authored-by: Gael Varoquaux <[email protected]>
I haven't gotten to the examples yet |
skrub/_datetime_encoder.py
Outdated
add_periodic : bool, default=False | ||
Add periodic features with different granularities. By default, use | ||
trigonometric (circular) encoding of features. Spline encoding and | ||
custom encoders are also supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Gael. Too many hyper-parameter options might insensitive users to grid search them all while it would probably not bring much performance lift.
@@ -256,10 +261,19 @@ class DatetimeEncoder(SingleColumnTransformer): | |||
timezone used during ``fit`` and that we get the same result for "hour". | |||
""" # noqa: E501 | |||
|
|||
def __init__(self, resolution="hour", add_weekday=False, add_total_seconds=True): | |||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A parameter like "keep_original"
would make sense to me. This reminds me somehow of coalesce
in polars join https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html, which pandas doesn't have.
Co-authored-by: Vincent M <[email protected]>
Should I just fold the logic of both encoders back in the main DatetimeEncoder then? CircularEncoder is just wrapping a call to np.sin/np.cos
My gut feeling would be to do this
|
I had a IRL discussion with @jeromedockes about the encoder and there are some issues that we might want to discuss in person. The main point of discussion is whether we want the user to have the flexibility to choose different parameters for the circular/spline encoders, or if it's better to keep it simple and have only My opinion is that the flexibility is an advantage, and if we decide to remove everything we will have to either hardcode variables that a user might want to tweak (specifically, the number of splines), or expose parameters to change them anyway. If we keep the code "as-is" (bar some refactoring), meaning with
If instead we simplify everything, the result is folding the logic of
Ultimately, I will be implementing fixes for either version, what I really want to avoid is picking one version, writing all the code, and then have to rewrite it again right before merging. |
The first value in a column may be a null, and the column itself might be all nulls.
Here, I have an example that would benefit greatly from this feature when I compare a linear model with a gradient boosting: It is really close from the bike rental example from scikit-learn: |
Draft for #907
Main points:
SplineTransformer
from scikit-learnQuestions:
Of course, tests and examples are all missing.