axe_data() method for xgb.Booster objects #217

Closed
davidkane9 opened this issue Mar 14, 2022 · 10 comments

@davidkane9

It is documented that the axe_data() method does not work for xgb.Booster objects. (See related discussion here.) This is a major problem for those working in industry because:

  • xgb.Booster objects are perhaps the most popular type of machine learning model, at least for numeric data, in industry.

  • Professional users often work with very large amounts of data.

  • Professional users often create many models and/or use model ensembles.

Without being able to remove the data associated with each model object using axe_data(), we end up with numerous model objects, each with an entire copy of the original data set. This makes a tidymodels approach unworkable, which makes me sad. I, at least, have no choice but to use xgb.train() directly.
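For concreteness, the call I would like to work is a plain axe_data() on the booster. A minimal sketch, assuming xgb_mod is a fitted xgb.Booster (hypothetical object name):

library(butcher)

# Documented today as a no-op for xgb.Booster; the hope is that this call
# would strip any training data carried inside the model object.
slim_mod <- axe_data(xgb_mod)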

@juliasilge
Member

We totally agree that we need to get xgboost working with butcher, because it is such an important modeling approach. I have started some fixes in #218 to address the problems with axe_ctrl() and axe_fitted().

However, I wanted to ask you specifically about axe_data(). I may be mistaken but I don't believe the data is stored anywhere in an xgboost model:

library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000),]
xgb_mod <- xgboost(
  data = as.matrix(more_cars[, -6]),
  label = more_cars[["vs"]],
  nrounds = 10)
#> [1]  train-rmse:0.350062 
#> [2]  train-rmse:0.245052 
#> [3]  train-rmse:0.171519 
#> [4]  train-rmse:0.120046 
#> [5]  train-rmse:0.084052 
#> [6]  train-rmse:0.058839 
#> [7]  train-rmse:0.041190 
#> [8]  train-rmse:0.028835 
#> [9]  train-rmse:0.020180 
#> [10] train-rmse:0.014130

weigh(xgb_mod)
#> # A tibble: 11 × 2
#>    object                            size
#>    <chr>                            <dbl>
#>  1 callbacks.cb.evaluation.log   0.0354  
#>  2 callbacks.cb.print.evaluation 0.0146  
#>  3 raw                           0.00793 
#>  4 call                          0.00151 
#>  5 feature_names                 0.000736
#>  6 handle                        0.000312
#>  7 evaluation_log.iter           0.000176
#>  8 evaluation_log.train_rmse     0.000176
#>  9 niter                         0.000056
#> 10 params.validate_parameters    0.000056
#> 11 nfeatures                     0.000056

Created on 2022-03-17 by the reprex package (v2.0.1)

Do you know anything about how data is stored in a fitted xgboost model?

@davidkane9
Author

xgboost itself does not store the data. But I usually work within a tidymodels approach, where the data is stored (I think) within the fitted object.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

more_cars <- mtcars[rep(1:32, each = 1000),]

most_cars <- mtcars[rep(1:32, each = 100000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = mtcars)

small_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(more_cars)

large_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(most_cars)

object.size(small_data_fit)
#> 3409512 bytes
object.size(large_data_fit)
#> 282193488 bytes

Created on 2022-03-17 by the reprex package (v2.0.1)

Again, I am no tidymodels expert, but what I want is to save and work with a bunch of fitted models.

@EmilHvitfeldt
Member

The increase in object size for large_data_fit happens because workflows keeps the predictors and outcomes in object$pre$mold. axe_data() on the workflow should remove the data stored there.

library(tidymodels)
library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000),]

most_cars <- mtcars[rep(1:32, each = 100000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = mtcars)

small_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(more_cars)

large_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(most_cars)

object.size(small_data_fit)
#> 3773816 bytes

object.size(large_data_fit)
#> 282557792 bytes

weigh(large_data_fit)
#> # A tibble: 167 × 2
#>    object                    size
#>    <chr>                    <dbl>
#>  1 pre.mold.predictors.mpg   25.6
#>  2 pre.mold.predictors.cyl   25.6
#>  3 pre.mold.predictors.disp  25.6
#>  4 pre.mold.predictors.hp    25.6
#>  5 pre.mold.predictors.drat  25.6
#>  6 pre.mold.predictors.wt    25.6
#>  7 pre.mold.predictors.qsec  25.6
#>  8 pre.mold.predictors.am    25.6
#>  9 pre.mold.predictors.gear  25.6
#> 10 pre.mold.predictors.carb  25.6
#> # … with 157 more rows

small_large_data_fit <- axe_data(large_data_fit)

weigh(small_large_data_fit)
#> # A tibble: 156 × 2
#>    object                                       size
#>    <chr>                                       <dbl>
#>  1 pre.actions.recipe.blueprint.forge.process 2.44  
#>  2 pre.mold.blueprint.forge.process           2.44  
#>  3 pre.actions.recipe.blueprint.mold.process  2.43  
#>  4 pre.mold.blueprint.mold.process            2.43  
#>  5 pre.actions.recipe.blueprint.forge.clean   2.42  
#>  6 pre.mold.blueprint.forge.clean             2.42  
#>  7 pre.actions.recipe.blueprint.mold.clean    2.39  
#>  8 pre.mold.blueprint.mold.clean              2.39  
#>  9 fit.fit.fit.callbacks.cb.evaluation.log    0.0354
#> 10 fit.fit.fit.raw                            0.0122
#> # … with 146 more rows

object.size(small_large_data_fit)
#> 954816 bytes

predict(small_large_data_fit, mtcars)
#> # A tibble: 32 × 1
#>      .pred
#>      <dbl>
#>  1 0.00237
#>  2 0.00237
#>  3 0.998  
#>  4 0.998  
#>  5 0.00237
#>  6 0.998  
#>  7 0.00237
#>  8 0.998  
#>  9 0.998  
#> 10 0.998  
#> # … with 22 more rows

Created on 2022-03-17 by the reprex package (v2.0.1)

@juliasilge
Member

Thanks for the clarification @davidkane9!

I believe the fixes in #218 will allow you to successfully use xgboost from tidymodels.

@DavisVaughan
Member

I don't think there is anything to do here because axe_data.workflows removes the outcomes and predictors in the $mold slot, as @EmilHvitfeldt said, which is where the large size comes from.
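For anyone checking their own objects, a quick way to see where the data lives. A minimal sketch, assuming the fitted workflow large_data_fit from the reprex above; the slot names come from hardhat's mold():

# The mold carries the preprocessed training data; axe_data() on the
# workflow empties the predictors and outcomes entries.
names(large_data_fit$pre$mold)
# expected: "predictors" "outcomes" "blueprint" "extras"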

@davidkane9
Author

I still have questions/concerns about this. Can you re-open this issue or would you prefer that I start a new issue?

@juliasilge
Member

juliasilge commented Mar 21, 2022

Go ahead and share here what you're thinking @davidkane9, unless you think it is a different issue than supporting axe_data() for xgb.Booster objects (in which case another issue would be great).

@davidkane9
Author

The key issue is that axe_data() does not remove all copies of the data. It only removes half of it: it does not remove the copy of the data passed in via recipe(). Here is a slightly modified example:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

cars <- mtcars[rep(1:32, each = 10000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = cars)

fit_obj <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(cars)

axed_fit_obj <- axe_data(fit_obj)

object.size(fit_obj)
#> 56910680 bytes
object.size(axed_fit_obj)
#> 28747704 bytes
weigh(axed_fit_obj)
#> # A tibble: 156 × 2
#>    object                                   size
#>    <chr>                                   <dbl>
#>  1 pre.actions.recipe.recipe.template.mpg   2.56
#>  2 pre.actions.recipe.recipe.template.cyl   2.56
#>  3 pre.actions.recipe.recipe.template.disp  2.56
#>  4 pre.actions.recipe.recipe.template.hp    2.56
#>  5 pre.actions.recipe.recipe.template.drat  2.56
#>  6 pre.actions.recipe.recipe.template.wt    2.56
#>  7 pre.actions.recipe.recipe.template.qsec  2.56
#>  8 pre.actions.recipe.recipe.template.am    2.56
#>  9 pre.actions.recipe.recipe.template.gear  2.56
#> 10 pre.actions.recipe.recipe.template.carb  2.56
#> # … with 146 more rows

Created on 2022-03-24 by the reprex package (v2.0.1)

axe_data() removes half the data, which is why object.size() drops from 56 MB to 28 MB. But it does not touch the copy of the data stored in pre.actions.recipe.recipe.template.

Of course, I can solve my problem by just using cars[1, ] in the call to recipe() rather than cars. But this certainly took me a while to figure out. I assumed that axe_data() would get rid of all copies of the data.
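A minimal sketch of that workaround, using the same objects as the reprex above. recipe() only needs the column names and types to build its template, so a one-row slice keeps the stored copy negligible:

# Pass a single row to recipe(); the full data still goes to fit() on the
# workflow, so the model itself is unchanged.
base_recipe <-
  recipe(formula = vs ~ .,
         data = cars[1, ])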

@juliasilge
Member

juliasilge commented Mar 24, 2022

Oh boy @davidkane9 the butcher methods for workflows only butcher the parsnip model, not the recipe. 🥴 I'll open an issue over there. (We do have butcher support for recipes; it's just not getting applied in the workflow.)
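In the meantime, a hedged manual sketch: butcher the recipe stored inside the fitted workflow directly. The slot path below is inferred from the weigh() output above (pre.actions.recipe.recipe.template), not a documented API, so treat it as an assumption:

# Butcher the embedded recipe to drop its template copy of the data;
# $pre$actions$recipe$recipe is a workflows internal and may change.
axed_fit_obj$pre$actions$recipe$recipe <-
  butcher(axed_fit_obj$pre$actions$recipe$recipe)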

@github-actions

github-actions bot commented Apr 8, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 8, 2022