#
#
# Again, as this transformation is not in a scikit-learn estimator, we have to
# keep track of it ourselves so that we can later apply it to unseen data, which
# is error-prone, and we cannot tune any choices (like the choice of the
# aggregation function).
#

# A solution with skrub
# ---------------------
#
# Here we show how our credit fraud dataset can be handled with skrub. We do
# not explain all the details; in-depth explanations are left for the next
# example. Our goal here is to motivate skrub's way of building pipelines by
# showing that it can easily tackle a more complex dataset with several
# tables.
#
# The main difference with scikit-learn pipelines is that we do not provide an
# explicit list of transformation steps. Instead, we manipulate skrub objects
# that represent intermediate results, and the pipeline is built implicitly as
# we perform operations (such as applying operators or calling functions) on
# those objects.

# %%
# Declare the inputs to the pipeline. We create skrub "variables", which
# represent the inputs to our pipeline: here, the products, the baskets, and
# the fraud flags. Each variable is given a name and an (optional) initial
# value, which is used to show previews of the pipeline's output, detect errors
# early, and provide data for cross-validation and hyperparameter search.
#
# We then build the pipeline by applying transformations to those inputs.

# %%
products = skrub.var("products", dataset.products)
baskets = skrub.var("baskets", dataset.baskets[["ID"]]).skb.mark_as_x()
fraud_flags = skrub.var("fraud_flags", dataset.baskets["fraud_flag"]).skb.mark_as_y()

# %%
# Above, ``mark_as_x()`` and ``mark_as_y()`` tell skrub that the baskets and
# the fraud flags are respectively our design matrix and our targets, i.e. the
# tables that should be split into training and testing sets for
# cross-validation. Here they are direct inputs to the pipeline, but they do
# not have to be: any intermediate result could be marked as X or y.
#
# Because our pipeline expects DataFrames for the products, baskets and fraud
# flags, we manipulate those objects just as we would manipulate DataFrames.
# All attribute accesses are transparently forwarded to the actual input
# DataFrames when we run the pipeline.
#
# For example, let us filter the products to keep only those that match one of
# the baskets in the ``"baskets"`` table, then add a column containing the
# total amount for each kind of product in a basket:

# %%
products = products[products["basket_ID"].isin(baskets["ID"])]
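# The step that adds the total amount is not shown in this excerpt. As a
# hedged sketch (the column names "cash_price" and "Nbr_of_prod_purchas" are
# assumptions, not taken from this excerpt), it could look like:
products = products.assign(
    total_price=products["cash_price"] * products["Nbr_of_prod_purchas"]
)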

products

# %%
# Note that we are getting previews of intermediate results. For example, we
# can see the added ``"total_price"`` column in the output above. The dropdown
# at the top allows us to check the structure of the pipeline and all the
# steps it contains.
#
# With skrub we do not need to specify a grid of hyperparameters separately
# from the pipeline. Instead, we can replace a parameter's value with a skrub
# "choice", which indicates the range of values we would like to consider
# during hyperparameter selection.
#
# Those choices can be nested arbitrarily. They are not restricted to the
# parameters of a scikit-learn estimator; they can be anything: the choice
# between different estimators, arguments to function calls, whole sections of
# the pipeline, etc.
#
# In-depth information about choices and hyperparameter/model selection is
# provided in a later example (TODO add link).
#
# Here we build a skrub ``TableVectorizer`` that contains a couple of choices:
# the type of encoder for high-cardinality categorical or string columns, and
# the number of components it uses.

# %%
n = skrub.choose_int(5, 15, log=True, name="n_components")
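# The definition of encoder is not shown in this excerpt. As a hedged sketch,
# it could be a choice between two high-cardinality encoders, both reusing the
# n_components choice defined above (the particular encoders named here are
# assumptions, not taken from this excerpt):
encoder = skrub.choose_from(
    {
        "minhash": skrub.MinHashEncoder(n_components=n),
        "string": skrub.StringEncoder(n_components=n),
    },
    name="encoder",
)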
vectorizer = skrub.TableVectorizer(high_cardinality=encoder)

# %%
# A transformer does not have to apply to the full DataFrame; we can easily
# restrict it to a subset of the columns. Skrub selectors allow specifying the
# columns in a flexible way: by name, by name pattern, by dtype, or by other
# criteria, and they can be combined with the same operators as Python sets.
# Here, for example, we vectorize all columns except ``"basket_ID"``, which we
# will need for joining.

# %%
from skrub import selectors as s

vectorized_products = products.skb.apply(vectorizer, cols=s.all() - "basket_ID")

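# %%
# As an aside, selectors compose with set-like operators. The following
# combination is only an illustrative sketch (it is not used in this
# pipeline): it selects the numeric columns plus any column whose name ends in
# "_ID", except "basket_ID" itself.
example_selector = (s.numeric() | s.glob("*_ID")) - "basket_ID"
example_selector
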
# %%
# Since we have access to the underlying dataframe's API, we can easily
# perform the data-wrangling we need, including joins and other operations
# that involve multiple tables. All of those transformations are implicitly
# added as steps in our machine-learning pipeline.

# %%
aggregated_products = vectorized_products.groupby("basket_ID").agg("mean").reset_index()
baskets = baskets.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)

# %%
# Finally, we add a supervised estimator, and our pipeline is complete.

# %%
from sklearn.ensemble import HistGradientBoostingClassifier

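# The line that applies the estimator is not shown in this excerpt. As a
# hedged sketch (the exact call is an assumption), it would apply the
# classifier to the prepared baskets, with the fraud flags as the target:
predictions = baskets.skb.apply(HistGradientBoostingClassifier(), y=fraud_flags)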

predictions

# %%
# We can ask for a full report of the pipeline and inspect the results at
# every step::
#
#     predictions.skb.full_report()
#
# You can `see the output <../../_static/credit_fraud_report/index.html>`_.

# %%
# From the choices we inserted at different locations in our pipeline, skrub
# can build a grid of hyperparameters and run the hyperparameter search for
# us, backed by scikit-learn's ``GridSearchCV`` or ``RandomizedSearchCV``.
#
# The names we gave to the choices make the summary of the search results easy
# to read.

# %%
search = predictions.skb.get_randomized_search(
    scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
)
search.get_cv_results_table()

# %%
# We can also ask skrub to display a parallel coordinates plot of the results.
# In this plot, each line corresponds to one combination of hyperparameter
# (choice) values. It passes through the corresponding test score and the
# training and scoring durations, and the remaining columns show the
# hyperparameter values themselves. By clicking and dragging the mouse on any
# column, we can restrict the set of lines that are displayed. This makes it
# easy to see which hyperparameters matter most, which values perform best,
# and the trade-offs between prediction quality and computation time.
#
# TODO: Gif of how to use the plot.

# %%
search.plot_parallel_coord()

# %%
# Conclusion
# ----------
#
# If, after reading this example, you are curious to learn more and to build
# your own complex, multi-table pipelines with easy hyperparameter tuning,
# please see the next examples for an in-depth tutorial.