Commit 1fed261 (1 parent: 10e6063)
Commit message: example

1 file changed: +93 −12
examples/expressions/10_expressions_intro.py

@@ -121,7 +121,7 @@
 #
 #
 # Again, as this transformation is not in a scikit-learn estimator, we have to
-# keep track of it ourselves so that we can later apply to unseen data, which
+# keep track of it ourselves so that we can later apply it to unseen data, which
 # is error-prone, and we cannot tune any choices (like the choice of the
 # aggregation function).
 #
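
The comment above warns that a transformation done directly with pandas is not recorded by any estimator. A minimal sketch of the pitfall it describes, using hypothetical ``products_train`` / ``products_test`` frames (these names do not appear in the diff):

    # Hypothetical sketch: nothing ties these two calls together, so the
    # aggregation choice ("mean" here) must be kept in sync by hand between
    # training data and unseen data. This is the bookkeeping skrub removes.
    agg_train = products_train.groupby("basket_ID").agg("mean")
    agg_test = products_test.groupby("basket_ID").agg("mean")
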
@@ -132,22 +132,47 @@
 # A solution with skrub
 # ---------------------
 #
-# Here we just show the solution. Subsequent examples dive into the details.
-# TODO TODO TODO
+# Here we show how our credit fraud dataset can be handled with skrub. We do
+# not explain all the details; in-depth explanations are left for the next
+# example. Our goal here is to motivate skrub's way of building pipelines by
+# showing that it can easily tackle a more complex dataset with several tables.
 #
+# The main difference with scikit-learn pipelines is that we do not provide an
+# explicit list of transformation steps. Rather, we manipulate skrub objects
+# that represent intermediate results, and the pipeline is built implicitly as
+# we perform operations (such as applying operators or calling functions) on
+# those objects.
 
 # %%
-# Declare inputs to the pipeline
+# Declare inputs to the pipeline. We create skrub "variables", which represent
+# the inputs to our pipeline: here, the products, baskets, and fraud flags.
+# They are given a name and an (optional) initial value, which is used to show
+# previews of the pipeline's output, detect errors early, and provide data for
+# cross-validation and hyperparameter search.
+#
+# We then build the pipeline by applying transformations to those inputs.
+
 
 # %%
 products = skrub.var("products", dataset.products)
 baskets = skrub.var("baskets", dataset.baskets[["ID"]]).skb.mark_as_x()
 fraud_flags = skrub.var("fraud_flags", dataset.baskets["fraud_flag"]).skb.mark_as_y()
 
 # %%
-# Access to the dataframe's usual API; interactive preview of intermediate
-# results Note below we are using ``products`` and ``baskets`` as if they were
-# a pandas DataFrames
+# Above, ``mark_as_x()`` and ``mark_as_y()`` tell skrub that the baskets and
+# flags are respectively our design matrix and targets, i.e. the tables that
+# should be split into training and testing sets for cross-validation. Here
+# they are direct inputs to the pipeline, but they don't have to be: any
+# intermediate result could be marked as X or y.
+#
+# Because our pipeline expects DataFrames for the products, baskets and fraud
+# flags, we manipulate those objects just like we would manipulate DataFrames.
+# All attribute accesses will be transparently forwarded to the actual input
+# DataFrames when we run the pipeline.
+#
+# For example, let us filter products to keep only those that match one of the
+# baskets in the ``"baskets"`` table, then add a column containing the total
+# amount for each kind of product in a basket:
 
 # %%
 products = products[products["basket_ID"].isin(baskets["ID"])]
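
The hunk stops just before the unchanged lines that create the ``"total_price"`` column mentioned in the comment. A minimal sketch of what such a step could look like, with hypothetical price and quantity column names (they are not visible in this diff):

    # Hypothetical sketch: the column names are assumptions. Because
    # ``products`` is a skrub object, this is recorded as a pipeline step
    # rather than executed eagerly on a single DataFrame.
    products = products.assign(
        total_price=products["price"] * products["quantity"]
    )
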
@@ -157,7 +182,27 @@
 products
 
 # %%
-# Easily specify a hyperparameter grid
+# Note that we are getting previews of the intermediate results. For
+# example, we can see the added ``"total_price"`` column in the output above.
+# The dropdown at the top allows us to check the structure of the pipeline and
+# all the steps it contains.
+#
+# With skrub we do not need to specify a grid of hyperparameters separately
+# from the pipeline. Instead, we can replace a parameter's value with a skrub
+# "choice", which indicates the range of values we would like to consider
+# during hyperparameter selection.
+#
+# Those choices can be nested arbitrarily. They are not restricted to the
+# parameters of a scikit-learn estimator, but can be anything: a choice
+# between different estimators, arguments to function calls, whole sections
+# of the pipeline, etc.
+#
+# In-depth information about choices and hyperparameter/model selection is
+# provided in example (TODO add link).
+#
+# Here we build a skrub ``TableVectorizer`` that contains a couple of choices:
+# the type of encoder for high-cardinality categorical or string columns, and
+# the number of components it uses.
 
 # %%
 n = skrub.choose_int(5, 15, log=True, name="n_components")
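
The ``encoder`` used in the next hunk is defined in unchanged lines the diff does not show. A plausible sketch, assuming skrub's ``choose_from`` and two of its string encoders (the concrete outcomes are an assumption; only ``n`` appears in the diff):

    # Hypothetical sketch: choose the high-cardinality encoder during
    # hyperparameter search; the choice ``n`` is nested inside each outcome.
    encoder = skrub.choose_from(
        {
            "minhash": skrub.MinHashEncoder(n_components=n),
            "lsa": skrub.StringEncoder(n_components=n),
        },
        name="encoder",
    )
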
@@ -171,22 +216,33 @@
 vectorizer = skrub.TableVectorizer(high_cardinality=encoder)
 
 # %%
-# Easily apply estimators to a subset of columns
+# A transformer does not have to apply to the full DataFrame; we can easily
+# restrict it to some columns. Skrub selectors allow us to specify those
+# columns in a flexible way, selecting them by name, name pattern, dtype, or
+# other criteria. They can be combined with the same operators as Python sets.
+# Here, for example, we vectorize all columns except ``"basket_ID"``, which we
+# will need for joining.
 
 # %%
 from skrub import selectors as s
 
 vectorized_products = products.skb.apply(vectorizer, cols=s.all() - "basket_ID")
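
Since selectors combine like Python sets, a few more illustrative expressions, reusing the ``s`` module imported above (these are examples of the documented selector style, not lines from this diff):

    # Illustrative only: other ways to build column selections.
    s.glob("*_ID")                            # names matching a pattern
    s.numeric() - "basket_ID"                 # all numeric columns except one
    s.all() - (s.glob("*_ID") | s.string())   # set-like combination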
 
 # %%
-# Data-wrangling and multiple-table operations as part of the pipeline
+# Having access to the underlying dataframe's API, we can easily perform the
+# data-wrangling we need, including joins and other operations that involve
+# multiple tables. All those transformations are implicitly added as steps in
+# our machine-learning pipeline.
 
 # %%
 aggregated_products = vectorized_products.groupby("basket_ID").agg("mean").reset_index()
 baskets = baskets.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
     columns=["ID", "basket_ID"]
 )
 
+# %%
+# Finally, we add a supervised estimator and our pipeline is complete.
+
 # %%
 from sklearn.ensemble import HistGradientBoostingClassifier
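
The line that actually produces ``predictions`` sits in the unchanged region between hunks. Assuming it follows the ``.skb.apply()`` pattern used above (a sketch, not the diff's actual code):

    # Hypothetical sketch: apply the classifier as the final pipeline step,
    # with the fraud flags (marked as y earlier) as the prediction target.
    predictions = baskets.skb.apply(HistGradientBoostingClassifier(), y=fraud_flags)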
 
@@ -197,7 +253,8 @@
 predictions
 
 # %%
-# We can ask for a report of the pipeline and inspect the results at every step::
+# We can ask for a full report of the pipeline and inspect the results at every
+# step::
 #
 #     predictions.skb.full_report()
 #
@@ -206,13 +263,37 @@
 # `see the output <../../_static/credit_fraud_report/index.html>`_.
 
 # %%
-# Perform hyperparameter search or cross-validation
+# From the choices we inserted at different locations in our pipeline, skrub
+# can build a grid of hyperparameters and run the hyperparameter search for
+# us, backed by scikit-learn's ``GridSearchCV`` or ``RandomizedSearchCV``.
+#
+# The names we gave the choices make the summary of the search results easy
+# to read.
 
 # %%
 search = predictions.skb.get_randomized_search(
     scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
 )
 search.get_cv_results_table()
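
The removed heading also mentioned plain cross-validation. Assuming skrub exposes it on the same ``.skb`` accessor (an assumption; this diff only shows the randomized search), it could look like:

    # Hypothetical sketch: cross-validate the pipeline with the default
    # value of each choice instead of searching over them.
    scores = predictions.skb.cross_validate(scoring="roc_auc", n_jobs=4)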
 
+# %%
+# We can also ask skrub to display a parallel coordinates plot of the results.
+# In this plot, each line corresponds to a combination of hyperparameter
+# (choice) values. It goes through the corresponding test score and the
+# training and scoring durations; the other columns show the hyperparameter
+# values. By clicking and dragging the mouse on any column, we can restrict
+# the set of lines we see. This allows quickly inspecting which
+# hyperparameters matter most, which values perform best, and the trade-offs
+# between prediction quality and computation time.
+#
+# TODO: Gif of how to use the plot.
+
 # %%
 search.plot_parallel_coord()
+
+# Conclusion
+# ----------
+#
+# If after reading this example you are curious to know more and learn how to
+# build your own complex, multi-table pipelines with easy hyperparameter
+# tuning, please see the next examples for an in-depth tutorial.
