ENH: optimize StringEncoder #1248

Merged — 7 commits, Feb 27, 2025. Changes shown from 4 commits.
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -24,6 +24,9 @@ Changes
 - Progress messages when generating a ``TableReport`` are now written to stderr instead of stdout.
   :pr:`1236` by :user:`Priscilla Baah<priscilla-b>`

+- Optimize the :class:`StringEncoder`: significant memory reduction and 1.5x speed-up.
+  :pr:`1248` by :user:`Gaël Varoquaux <gaelvaroquaux>`
+
 Release 0.5.1
 =============
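For context, a `StringEncoder` is essentially a text vectorizer followed by a truncated SVD, which is the pipeline the tests in this PR compare against. A scikit-learn-only sketch; the `char_wb` analyzer and `(3, 4)` n-gram range are assumed defaults, not taken from this PR:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Rough equivalent of StringEncoder with vectorizer="tfidf": tf-idf on
# character n-grams, reduced to n_components dimensions by a truncated SVD.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
    ("svd", TruncatedSVD(n_components=2)),
])
docs = ["open a bank account", "open an account", "close the account", "hello world"]
embeddings = pipe.fit_transform(docs)
print(embeddings.shape)  # (4, 2)
```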
24 changes: 17 additions & 7 deletions skrub/_string_encoder.py
@@ -27,7 +27,7 @@
 n_components : int, default=30
     Number of components to be used for the singular value decomposition (SVD).
     Must be a positive integer.
-vectorizer : str, "tfidf" or "hashing"
+vectorizer : str, "tfidf" or "hashing", default="tfidf"
     Vectorizer to apply to the strings, either `tfidf` or `hashing` for
     scikit-learn TfidfVectorizer or HashingVectorizer respectively.
@@ -133,12 +133,17 @@
                 f" 'hashing', got {self.vectorizer!r}"
             )

-        X = sbd.fill_nulls(X, "")
-        X_out = self.vectorizer_.fit_transform(X)
+        X_filled = sbd.fill_nulls(X, "")
+        X_out = self.vectorizer_.fit_transform(X_filled).astype("float32")
+        del X_filled  # optimize memory: we no longer need X_filled

-        if (min_shape := min(X_out.shape)) >= self.n_components:
-            self.tsvd_ = TruncatedSVD(n_components=self.n_components)
+        if (min_shape := min(X_out.shape)) > self.n_components:
+            self.tsvd_ = TruncatedSVD(
+                n_components=self.n_components, algorithm="arpack"
+            )
             result = self.tsvd_.fit_transform(X_out)
+        elif X_out.shape[1] == self.n_components:
+            result = X_out.toarray()

Check warning (Codecov / codecov/patch) on skrub/_string_encoder.py#L146: Added line #L146 was not covered by tests.
Comment on lines +145 to +146
Member: does the 'arpack' algorithm not like it when p == n_components? Or is skipping the TSVD in that case an optimization? If the latter, I guess the case where the number of discovered n-grams is exactly equal to n_components might be too rare to warrant it, and we might want a test for that branch.

Member: I guess when n_components matches the number of dimensions of the vector, running the SVD doesn't make sense? We could coalesce this unlikely condition with the else statement below, though.

Member Author: It's both, actually. I think I needed this for the tests to pass; at least my first implementation hit a corner case in the tests (good tests!).
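The constraint behind this thread can be checked directly: scipy's `svds`, which backs `algorithm="arpack"`, requires the number of components to be strictly below `min(X.shape)`, which is why the comparison became a strict `>` and the equality case skips the SVD. A small illustrative sketch, not from the PR:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = csr_matrix(rng.random((10, 5)).astype("float32"))

# Strictly fewer components than min(X.shape): arpack is happy.
ok = TruncatedSVD(n_components=4, algorithm="arpack").fit_transform(X)
print(ok.shape)  # (10, 4)

# n_components == min(X.shape): scipy's svds refuses, hence the separate
# branch that returns the dense matrix without running an SVD at all.
try:
    TruncatedSVD(n_components=5, algorithm="arpack").fit(X)
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```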

         else:
             warnings.warn(
                 f"The matrix shape is {(X_out.shape)}, and its minimum is "
@@ -152,6 +157,8 @@
             # Therefore, self.n_components_ below stores the resulting
             # number of dimensions of result.
             result = X_out[:, : self.n_components].toarray()
+            result = result.copy()
Member: I suppose the copy() is there because otherwise the reference to the slice would prevent X_out from being garbage collected; might be worth a short comment.
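The reviewer's concern is general numpy behavior: a basic slice is a view whose `base` attribute keeps the whole parent array alive. A minimal illustration (note that `.toarray()` on a sparse matrix already returns an owning array, so in this PR the `copy()` is belt-and-braces):

```python
import numpy as np

big = np.ones((1000, 1000), dtype="float32")
view = big[:, :10]         # basic slicing returns a view...
print(view.base is big)    # True: the view keeps all of `big` reachable
owned = view.copy()        # ...while copy() owns its own small buffer
print(owned.base is None)  # True
del big, view              # now the 4 MB buffer can actually be freed
```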

+        del X_out  # optimize memory: we no longer need X_out

         self._is_fitted = True
         self.n_components_ = result.shape[1]
@@ -177,12 +184,15 @@
             The embedding representation of the input.
         """

-        X = sbd.fill_nulls(X, "")
-        X_out = self.vectorizer_.transform(X)
+        X_filled = sbd.fill_nulls(X, "")
+        X_out = self.vectorizer_.transform(X_filled).astype("float32")
+        del X_filled  # optimize memory: we no longer need X_filled
         if hasattr(self, "tsvd_"):
             result = self.tsvd_.transform(X_out)
         else:
             result = X_out[:, : self.n_components].toarray()
+            result = result.copy()
+        del X_out  # optimize memory: we no longer need X_out

         return self._post_process(X, result)
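The other half of the memory saving is the `astype("float32")` cast on the vectorizer output, which halves the value buffer of the sparse matrix (index arrays are unchanged). A quick sketch, illustrative rather than taken from the PR:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# A sparse matrix shaped like a small tf-idf output.
X64 = sparse_random(1000, 500, density=0.01, format="csr",
                    dtype=np.float64, random_state=0)
X32 = X64.astype(np.float32)

# 8 bytes per stored value shrink to 4.
print(X64.data.nbytes, X32.data.nbytes)  # 40000 20000
```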
18 changes: 10 additions & 8 deletions skrub/tests/test_string_encoder.py
@@ -1,4 +1,5 @@
 import pytest
+from numpy.testing import assert_almost_equal
 from sklearn.base import clone
 from sklearn.decomposition import TruncatedSVD
 from sklearn.feature_extraction.text import (
@@ -37,6 +38,7 @@ def test_tfidf_vectorizer(encode_column, df_module):
         ]
     )
     check = pipe.fit_transform(sbd.to_numpy(encode_column))
+    check = check.astype("float32")  # StringEncoder is float32

     names = [f"col1_{idx}" for idx in range(2)]
@@ -197,21 +199,21 @@ def test_missing_values(df_module, vectorizer):
     encoder = StringEncoder(n_components=2, vectorizer=vectorizer)
     out = encoder.fit_transform(col)
     for c in sbd.to_column_list(out):
-        assert c[1] == 0.0
-        assert c[2] == 0.0
+        assert_almost_equal(c[1], 0.0, decimal=6)
+        assert_almost_equal(c[2], 0.0, decimal=6)
     out = encoder.transform(col)
     for c in sbd.to_column_list(out):
-        assert c[1] == 0.0
-        assert c[2] == 0.0
+        assert_almost_equal(c[1], 0.0, decimal=6)
+        assert_almost_equal(c[2], 0.0, decimal=6)
     tv = TableVectorizer(
         low_cardinality=StringEncoder(n_components=2, vectorizer=vectorizer)
     )
     df = df_module.make_dataframe({"col": col})
     out = tv.fit_transform(df)
     for c in sbd.to_column_list(out):
-        assert c[1] == 0.0
-        assert c[2] == 0.0
+        assert_almost_equal(c[1], 0.0, decimal=6)
+        assert_almost_equal(c[2], 0.0, decimal=6)
     out = tv.transform(df)
     for c in sbd.to_column_list(out):
-        assert c[1] == 0.0
-        assert c[2] == 0.0
+        assert_almost_equal(c[1], 0.0, decimal=6)
+        assert_almost_equal(c[2], 0.0, decimal=6)
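The switch to `assert_almost_equal` follows from the float32 cast: values that came out exactly 0.0 in float64 can now carry tiny rounding residue, so exact equality is too strict. A toy illustration of the tolerance involved, not taken from the test suite:

```python
import numpy as np
from numpy.testing import assert_almost_equal

a = np.float32(0.1) * np.float32(3)   # float32 arithmetic
b = 0.1 * 3                           # float64 arithmetic
assert a != b                         # exact comparison across precisions fails
assert_almost_equal(a, b, decimal=6)  # passes: |a - b| < 1.5e-6
```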