Adding a periodic encoder to the DatetimeEncoder #1235

Open
wants to merge 29 commits into
base: main

rcap107
Contributor

@rcap107 rcap107 commented Feb 7, 2025

Draft for #907

Main points:

  • Adding transformers for hours in a day and days in a year
  • Adding the ordinal day (day in a year) as feature
  • Implement transformers using SplineTransformer from scikit-learn

Questions:

  • Should I also implement a CircularTransformer?
  • The main reason I am considering the CircularTransformer is because I am not sure what would be better for encoding days in a year. Should I just not bother, and keep SplineTransformer with a limited number of knots?
  • What should the default parameters be?
  • If we decide to encode the year, there is the problem of having both leap and non-leap years in a dataset. Is that a problem? I don't think it would affect the features that much.
  • Any more features that should be added in this PR?

Of course, tests and examples are all missing.
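For readers following the leap-year question above, here is a minimal sketch of the ordinal day ("day in a year") using only the Python standard library. The helper name `day_of_year` is hypothetical and is not taken from the PR; it only illustrates that the same calendar date maps to a different ordinal after February in leap and non-leap years.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Return the 1-based ordinal day of `d` within its year (hypothetical helper)."""
    return d.timetuple().tm_yday


# The last day of the year is 365 in a common year and 366 in a leap year.
print(day_of_year(date(2023, 12, 31)))  # 365
print(day_of_year(date(2024, 12, 31)))  # 366

# From March onward, the same calendar date is shifted by one in a leap year.
print(day_of_year(date(2023, 3, 1)), day_of_year(date(2024, 3, 1)))  # 60 61
```

The shift is at most one day out of roughly 365, which is consistent with the expectation above that mixing leap and non-leap years should have a small effect on the features.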

self._required_transformers[_case] = _t
if _case == "year":
# TODO: In theory we should check that the year is leap
_t = PeriodicEncoder(kind="circular", period=366)
Member

isn't it a bit surprising that the encoding is different for day and year? maybe it could be splines with 4 splines or something like that?

@@ -256,10 +261,19 @@ class DatetimeEncoder(SingleColumnTransformer):
timezone used during ``fit`` and that we get the same result for "hour".
""" # noqa: E501

def __init__(self, resolution="hour", add_weekday=False, add_total_seconds=True):
def __init__(
Member

should we not generate the values when we are doing the periodic encoding? eg if we have spline encoding of the time of day, maybe we shouldn't by default output the "hour" feature

Contributor Author

Maybe users still need the additional columns as features for some reason? Though I guess that by the point they get to vectorizing the table they're already done with exploring the data which is what keeping those features would help with 🤔

I'll remove the old features

Member

i'm not sure what would be the best interface to make the skrub code simple and to make it easy for users to specify which features they want / figure out which features they will get 🤔

Member

A parameter like "keep_original" would make sense to me. This reminds me somehow of coalesce in polars join https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html, which pandas doesn't have.

@rcap107
Contributor Author

rcap107 commented Feb 10, 2025

I finished the base implementation and split the PeriodicEncoder into SplineEncoder and CircularEncoder

I am working locally on the examples, and now I need to write all the tests

@rcap107 rcap107 marked this pull request as ready for review February 13, 2025 10:16
Member

@jeromedockes jeromedockes left a comment

thanks a lot @rcap107 ! here are a few first comments

Period to be used as basis of the trigonometric function.
"""

def __init__(self, period=24):
Member

maybe we shouldn't have a default for the period. eg 24 only makes sense for hours

Contributor Author

makes sense, I'll remove it

@@ -26,6 +29,8 @@
"nanosecond",
]

_DEFAULT_ENCODING_PERIODS = {"year": 4, "month": 30, "weekday": 7, "hour": 24}
Member

why 4 for year? shouldn't we set the period to 12 and use the month number or something like that? or set the period to 366 if we use the day of year?

Contributor Author

At the beginning I had 365 and the spline as default, but having 365 features for each year was too much, and I didn't have any better idea than setting it to a random number (4 seasons I guess)

Now that circular is the default, we can set it to 366

Select the strategy used to encode days in a week. By default, use a
CircularEncoder with period=7. If None, no encoding is performed.

hour_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular"
Member

the naming is inconsistent: does it indicate the time unit that gets encoded, or the one that defines the range?
for example here year_encoding encodes the position within the year, so the name corresponds to the range, but hour_encoding encodes the position within the day

Member

I think we should keep the latter option, so hour_encoding will be sin(hour / 24 * 2* pi) whereas day_of_year_encoding will be sin(day_of_year / 366 * 2 * pi)
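A minimal sketch of the circular encoding described in the comment above, written with NumPy. The function name `circular_encode` is hypothetical and is not the PR's `CircularEncoder`. The comment only spells out the sine term; the cosine column is added here as an assumption, since sine alone maps two different positions in the period to the same value.

```python
import numpy as np


def circular_encode(values, period):
    """Encode positions within a period as (sin, cos) pairs (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    angle = values / period * 2 * np.pi
    # One row per input value, columns are [sin, cos].
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = np.array([0, 6, 12, 23, 24])
hour_features = circular_encode(hours, period=24)

# Hour 0 and hour 24 land on the same point of the circle,
# and hour 23 is close to hour 0, unlike with the raw integer feature.
print(np.round(hour_features, 3))
```

With this naming, `hour_encoding` would correspond to `circular_encode(hour, period=24)` and `day_of_year_encoding` to `circular_encode(day_of_year, period=366)`.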

_enc_attr = [attr for attr in self.__dict__ if attr.endswith("_encoding")]
for _enc_name in _enc_attr:
_enc = self.__getattribute__(_enc_name)
_enc_case = _enc_name.split("_")[0]
Member

why "case"?

# parameters
if self.add_periodic:
_enc_attr = [attr for attr in self.__dict__ if attr.endswith("_encoding")]
for _enc_name in _enc_attr:
Member

detail: local variable names don't often start with an underscore. no special reason but it's one more character and I don't see what it adds

X_out = sbd.concat_horizontal(X_out, *_new_features)

# Censoring all the null features
X_out = sbd.where_row(X_out, self.not_nulls, _null_mask)
Member

I think the circular and spline encoders could do that themselves on the numpy output and we wouldn't need the where_row

Contributor Author

this

X_out[~self.not_nulls] = np.nan

works with pandas, but doesn't work with polars, is there another way?
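One possible reading of the suggestion above, as a sketch only: if the encoder computes its features on a NumPy array, the null rows can be masked on that array before any dataframe is built, so the boolean assignment never touches a pandas or polars object. This assumes the input column has already been converted to a float array with nulls represented as NaN; the function name is hypothetical and the conversion step used by skrub is not shown.

```python
import numpy as np


def circular_encode_keep_nulls(values, period):
    """Circular encoding on a NumPy array that propagates nulls (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    null_mask = np.isnan(values)
    # Replace nulls with a placeholder so the encoding can run on every row.
    filled = np.where(null_mask, 0.0, values)
    angle = filled / period * 2 * np.pi
    out = np.column_stack([np.sin(angle), np.cos(angle)])
    # Censor the rows that were null in the input, on the NumPy output itself.
    out[null_mask] = np.nan
    return out


encoded = circular_encode_keep_nulls([0.0, np.nan, 12.0], period=24)
print(encoded)
```

The placeholder step is not strictly needed for sine and cosine, which already return NaN for NaN input, but it matters for encoders that reject missing values outright.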


def __init__(self, period=24, n_splines=None, degree=3):
self.period = period
self.n_splines = n_splines
Member

where you had year period = 4 before I think maybe what you wanted is period of 366 (or 12 if we use the month) and number of splines = 4

Contributor Author

yes, correct

@GaelVaroquaux
Member

The SplineEncoder and the CircularEncoder are not documented in the API page, and thus they don't render correctly in the docs:
https://output.circle-artifacts.com/output/job/cc1b279e-769f-4515-bd08-30d0ebebd0f6/artifacts/0/doc/reference/generated/skrub.DatetimeEncoder.html

integer.
"""

def __init__(self, period, n_splines=None, degree=3):
Member

In scikit-learn design, it is very frowned upon to have a parameter without a default value

Contributor Author

I removed it after @jeromedockes suggested that having a default parameter that has nothing to do with the value to encode (e.g., 24 for month or year) would not be useful

I can add it back

Member

yep sorry about that @rcap107 . Indeed I had said there was no sensible default here, but yeah you can add it back

def _more_tags(self):
return {"preserves_dtype": []}

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.transformer_tags = TransformerTags(preserves_dtype=[])
return tags


class SplineEncoder(SingleColumnTransformer):
Member

From a design perspective, this class seems to me really like a super thin wrapper on the scikit-learn SplineTransformer.

Would it not be possible to get rid of it, and use the SplineTransformer? I would like to minimize almost-redundant functionality

Contributor Author

Should I just fold the logic of both encoders back in the main DatetimeEncoder then? CircularEncoder is just wrapping a call to np.sin/np.cos

I was thinking that it could be useful to have periodic encoders for non-datetime features, but maybe I'm wrong

Period to be used as basis of the trigonometric function.
"""

def __init__(self, period):
Member

Same comment about scikit-learn really liking defaults. Put 365 for instance

@@ -26,6 +29,8 @@
"nanosecond",
]

_DEFAULT_ENCODING_PERIODS = {"year": 366, "month": 30, "weekday": 7, "hour": 24}
Member

Naive question: years and months don't actually have a periodicity of 366 and 30. How do we deal with this?
Would one solution be to use the actual average length (365.25 days/year, I believe)?

Contributor Author

The period must be an integer

I don't have a good answer, but my gut feeling is that choosing between period 365 and 366 (and even between 28 and 31 for months) is not going to make a noticeable difference downstream
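A quick way to put a number on this gut feeling, at least at the feature level, is to compare circular encodings of the day of year under different periods. The sketch below uses a hypothetical `circular_encode` helper, not the PR's encoder, and accepts a non-integer period purely for the comparison. It measures only how far the features move, not the effect on a downstream model.

```python
import numpy as np


def circular_encode(values, period):
    """Encode positions within a period as (sin, cos) pairs (hypothetical helper)."""
    angle = np.asarray(values, dtype=float) / period * 2 * np.pi
    return np.column_stack([np.sin(angle), np.cos(angle)])


days = np.arange(1, 367)  # ordinal days 1..366
enc_365 = circular_encode(days, 365)
enc_366 = circular_encode(days, 366)
enc_avg = circular_encode(days, 365.25)

# Largest absolute difference between any two corresponding feature values.
max_gap = np.abs(enc_365 - enc_366).max()
max_gap_avg = np.abs(enc_avg - enc_366).max()
print(f"365 vs 366: {max_gap:.4f}, 365.25 vs 366: {max_gap_avg:.4f}")
```

The angles differ by at most 2π/365 radians (about 0.017) between periods 365 and 366, so each sine or cosine value moves by less than 0.02 on a scale of -1 to 1.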

@GaelVaroquaux
Member

I don't understand why, when I look at the generated example gallery, it looks like the relevant examples did not run:

I would really like to see the examples running. I am not comfortable merging a PR without the examples running.


year_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular"
Select the strategy used to encode days in a year. By default, use a
CircularEncoder with period=365. If None, no encoding is performed.
Member

If I follow things right, this is used only if "add_periodic" is True above. We should mention this.

CircularEncoder with period=365. If None, no encoding is performed.

month_encoding : str, :class:``~CircularEncoder``, :class:``~SplineEncoder`` or None, default="circular"
Select the strategy used to encode days in a month. By default, use a
Member

Same comment here: this is used only if "add_periodic" is True above

add_periodic : bool, default=False
Add periodic features with different granularities. By default, use
trigonometric (circular) encoding of features. Spline encoding and
custom encoders are also supported.
Member

Could we have a simpler API that removes the "*_encoding" arguments below and changes this argument to accepting False, "circular", "spline"?

It would reduce the flexibility of the object, but my hunch is that it would cover most of the use cases. And it would remove the inter-dependency between arguments (i.e., "add_periodic=False" means that the arguments below are ignored), which makes user mistakes more likely; removing it would also make hyper-parameter selection easier.

Member

I agree with Gael. Too many hyper-parameter options might incentivize users to grid search them all, while it would probably not bring much performance lift.

Contributor Author

Users might want to gridsearch at least some parameters for different variables (like the number of splines for encoding a year)

Either way, I can remove everything and just hardcode the defaults

If we go with the simplified implementation, there is really no reason to have the separate encoders, is there
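For illustration, the simplified interface proposed in this thread could be sketched as a single option that is resolved into a per-feature plan with hardcoded periods. The helper `resolve_periodic_encoding` is hypothetical. The periods are copied from the `_DEFAULT_ENCODING_PERIODS` dict quoted in the diff. The accepted values (`False`, `None`, `"circular"`, `"spline"`) follow the suggestion above and are not a confirmed final API.

```python
# Periods copied from the _DEFAULT_ENCODING_PERIODS dict quoted in the diff.
DEFAULT_PERIODS = {"year": 366, "month": 30, "weekday": 7, "hour": 24}


def resolve_periodic_encoding(periodic_encoding):
    """Map one user-facing option to a {feature: (kind, period)} plan.

    Hypothetical helper: returns an empty plan when periodic encoding is off.
    """
    if periodic_encoding is None or periodic_encoding is False:
        return {}
    if periodic_encoding not in ("circular", "spline"):
        raise ValueError(
            "periodic_encoding must be False, None, 'circular' or 'spline', "
            f"got {periodic_encoding!r}"
        )
    return {
        name: (periodic_encoding, period) for name, period in DEFAULT_PERIODS.items()
    }


print(resolve_periodic_encoding(False))
print(resolve_periodic_encoding("spline"))
```

With this shape there are no arguments that are silently ignored, at the cost of the per-feature tuning (for example the number of splines for a year) discussed above.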

@rcap107
Contributor Author

rcap107 commented Feb 17, 2025

> I don't understand why, when I look at the generated example gallery, it looks like the relevant examples did not run:
>
> I would really like to see the examples running. I am not comfortable merging a PR without the examples running.

I haven't gotten to the examples yet

@rcap107 rcap107 changed the title Initial commit for periodic encoder Adding a periodic encoder to the DatetimeEncoder Feb 17, 2025
@GaelVaroquaux
Member

GaelVaroquaux commented Feb 17, 2025 via email

@rcap107
Contributor Author

rcap107 commented Feb 18, 2025

I had an IRL discussion with @jeromedockes about the encoder and there are some issues that we might want to discuss in person.

The main point of discussion is whether we want the user to have the flexibility to choose different parameters for the circular/spline encoders, or if it's better to keep it simple and have only None, circular, spline.

My opinion is that the flexibility is an advantage, and if we decide to remove everything we will have to either hardcode variables that a user might want to tweak (specifically, the number of splines), or expose parameters to change them anyway.

If we keep the code "as-is" (bar some refactoring), meaning with CircularEncoder and SplineEncoder and all the specific parameters:

  • The implementation would let the user choose (and maybe search over) specific values for the period and especially the number of splines. Also, the code is already there.
  • There are multiple parameters (year_encoding etc.) that bloat the docstring and that are relevant only when add_periodic=True. I don't have a better idea on how to write this.
  • The encoders are "there", so if someone needs a periodic encoder for something other than a datetime, they're still available (no idea if this is a realistic use case).

If instead we simplify everything, the result is folding the logic of CircularEncoder and SplineEncoder back into the DatetimeEncoder, and replacing the *_encoding parameters with just add_periodic, which can be either a bool, circular or spline.

  • This would simplify the documentation and the user-experience (I am not fully sold on this, but I get the point).
  • It also means that either 1) the flexibility goes out of the window and we hardcode every parameter, or 2) we try to retain some of that flexibility by exposing the number of splines as a parameter.
    1) is not a problem for periods, but a user might want to tweak the number of splines for a year (or avoid having 365 features for each year feature).
    2) would functionally be the same as providing a custom SplineEncoder, but it exposes a lower-level parameter that would also be irrelevant whenever add_periodic=False.
  • Sidenote, if we go with 1) I would rather avoid starting a fractal discussion of the number of splines to go with for each case.
  • We run the risk of having someone come in and open an issue asking to provide more customization for the Encoders, which may require moving the logic back out to separate objects.

Ultimately, I will be implementing fixes for either version. What I really want to avoid is picking one version, writing all the code, and then having to rewrite it again right before merging.

The first value in a column may be a null, and the column itself might be all nulls.
@glemaitre
Member

Here, I have an example that would benefit greatly from this feature when I compare a linear model with a gradient boosting:

https://skore.probabl.ai/0.7/auto_examples/use_cases/plot_employee_salaries.html#sphx-glr-auto-examples-use-cases-plot-employee-salaries-py

It is really close to the bike rental example from scikit-learn:

https://scikit-learn.org/dev/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py
