ENH: optimize StringEncoder #1248

GaelVaroquaux · 2025-02-26T12:56:16Z

For memory and speed:

Significant memory improvement (not measured)
1.5x speedup on one task

For memory and speed

Force a doc build, and also fix some failing examples (still more to do)

GaelVaroquaux · 2025-02-26T13:47:28Z

FYI, the examples can be seen here: https://output.circle-artifacts.com/output/job/f25f287d-9bba-4153-977a-bbd5fbc2f3ca/artifacts/0/doc/auto_examples/02_text_with_string_encoders.html#sphx-glr-auto-examples-02-text-with-string-encoders-py
And I'm quite happy that with this PR StringEncoder is now faster than MinHashEncoder on this example ✌️

jeromedockes

LGTM, but it would be nice to cover the shape[1] == n_components case in a test. LMK if you don't have time now; I can push a small commit for it.

jeromedockes · 2025-02-26T16:01:37Z

skrub/_string_encoder.py

+        elif X_out.shape[1] == self.n_components:
+            result = X_out.toarray()


Does the 'arpack' algorithm not like it when p == n_components? Or is skipping the tsvd in that case an optimization? If the latter, I guess the case where the number of discovered ngrams is exactly equal to n_components might be too rare to warrant it. We might also want a test for that branch.

I guess when n_components matches the number of dimensions of the vector, running the SVD doesn't make sense? We could coalesce this unlikely condition with the else statement below, though.

It's both, actually.

I think I needed this for the tests to pass. At least my first implementation hit a corner case in the tests (good tests!)
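
For readers following this thread, the branch logic under discussion can be sketched roughly as below. This is a minimal sketch and not the skrub implementation: `reduce_dim` is a hypothetical helper, the `min(X_out.shape) > n_components` condition for the first branch is an assumption, the dense random matrix stands in for the vectorizer output, and `algorithm="arpack"` is used only because the review comment mentions it.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD


def reduce_dim(X_out, n_components):
    """Hypothetical helper mirroring the three branches discussed above."""
    if min(X_out.shape) > n_components:
        # Enough rows and columns: a truncated SVD is meaningful.
        # 'arpack' is an assumption taken from the review comment.
        tsvd = TruncatedSVD(n_components=n_components, algorithm="arpack")
        return tsvd.fit_transform(X_out)
    elif X_out.shape[1] == n_components:
        # Already the requested width: densify without any decomposition.
        return X_out.toarray()
    else:
        # Fewer columns than requested (or too few rows): keep what exists.
        return X_out[:, :n_components].toarray()


# Stand-in for the vectorizer output: 20 samples, 10 "ngram" columns.
rng = np.random.default_rng(0)
X = sparse.csr_matrix(rng.random((20, 10)))

for n in (4, 10, 15):
    # One call per branch: SVD, exact match, too few columns.
    print(n, reduce_dim(X, n).shape)
```

A test for the `shape[1] == n_components` branch, as requested in the review, would amount to checking that the output keeps all columns when the two numbers match.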

jeromedockes · 2025-02-26T16:04:00Z

skrub/_string_encoder.py

@@ -152,6 +157,8 @@ def fit_transform(self, X, y=None):
            # Therefore, self.n_components_ below stores the resulting
            # number of dimensions of result.
            result = X_out[:, : self.n_components].toarray()
+            result = result.copy()


I suppose the copy() is there because otherwise the reference to the slice would prevent X_out from being garbage collected; it might be worth a short comment.
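
The general mechanism behind this concern can be illustrated with plain NumPy. This is only a sketch with dense arrays and made-up variable names; whether the result of `.toarray()` on a sliced sparse matrix actually keeps a reference to `X_out` is not established in this thread.

```python
import numpy as np

big = np.zeros((1000, 1000))

# Basic slicing returns a view: it keeps `big` alive through `.base`.
view = big[:, :10]

# An explicit copy owns its own (much smaller) buffer.
owned = view.copy()

print(view.base is big)    # True
print(owned.base is None)  # True
```

As long as `view` is referenced, the full buffer of `big` cannot be freed, whereas `owned` holds only the 1000 x 10 block.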

GaelVaroquaux · 2025-02-26T17:39:10Z

LGTM, but it would be nice to cover the shape[1] == n_components case in a test. LMK if you don't have time now; I can push a small commit for it.

It would be helpful. Thanks!

Let's also wait for a systematic evaluation before merging

jeromedockes

LGTM, thanks!

Vincent-Maladiere

Looks good!

ENH: optimize StringEncoder

bf6a565

For memory and speed

GaelVaroquaux requested review from rcap107 and jeromedockes February 26, 2025 12:56

GaelVaroquaux added 3 commits February 26, 2025 14:03

style

ef02b73

[doc build]

8464087

Force a doc build, and also fix some failing examples (still more to do)

fix tests

f5a1d84

GaelVaroquaux mentioned this pull request Feb 26, 2025

Investigate memory usage improvements in StringEncoder and TextEncoder #1246

Open

jeromedockes reviewed Feb 26, 2025

GaelVaroquaux and others added 3 commits February 26, 2025 18:41

comment

eafa5ff

add test

4b219dc

fix test

4514d79

jeromedockes approved these changes Feb 27, 2025

Vincent-Maladiere approved these changes Feb 27, 2025

Vincent-Maladiere merged commit 97011bd into skrub-data:main Feb 27, 2025
25 checks passed
