axe_data() method for xgb.Booster objects #217

Closed
davidkane9 opened this issue Mar 14, 2022 · 10 comments

@davidkane9

It is documented that the axe_data() method does not work for xgb.Booster objects. (See related discussion here.) This is a major problem for those working in industry because:

  • xgb.Booster objects are perhaps the most popular type of machine learning model, at least for numeric data, in industry.

  • Professional users often work with very large amounts of data.

  • Professional users often create many models and/or use model ensembles.

Without being able to remove the data associated with each model object using axe_data(), we end up with numerous model objects, each with an entire copy of the original data set. This makes a tidymodels approach unworkable, which makes me sad. I, at least, have no choice but to use xgb.train() directly.
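For concreteness, the call I would like to work is a plain axe_data() on the booster. A minimal sketch, assuming xgb_mod is a fitted xgb.Booster (hypothetical object name):

library(butcher)

# Documented today as a no-op for xgb.Booster; the hope is that this call
# would strip any training data carried inside the model object.
slim_mod <- axe_data(xgb_mod)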

@juliasilge
Member

We totally agree that we need to get xgboost working with butcher, because it is such an important modeling approach. I have started some fixes in #218 to address the problems with axe_ctrl() and axe_fitted().

However, I wanted to ask you specifically about axe_data(). I may be mistaken but I don't believe the data is stored anywhere in an xgboost model:

library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000),]
xgb_mod <- xgboost(
  data = as.matrix(more_cars[, -6]),
  label = more_cars[["vs"]],
  nrounds = 10)
#> [1]  train-rmse:0.350062 
#> [2]  train-rmse:0.245052 
#> [3]  train-rmse:0.171519 
#> [4]  train-rmse:0.120046 
#> [5]  train-rmse:0.084052 
#> [6]  train-rmse:0.058839 
#> [7]  train-rmse:0.041190 
#> [8]  train-rmse:0.028835 
#> [9]  train-rmse:0.020180 
#> [10] train-rmse:0.014130

weigh(xgb_mod)
#> # A tibble: 11 × 2
#>    object                            size
#>    <chr>                            <dbl>
#>  1 callbacks.cb.evaluation.log   0.0354  
#>  2 callbacks.cb.print.evaluation 0.0146  
#>  3 raw                           0.00793 
#>  4 call                          0.00151 
#>  5 feature_names                 0.000736
#>  6 handle                        0.000312
#>  7 evaluation_log.iter           0.000176
#>  8 evaluation_log.train_rmse     0.000176
#>  9 niter                         0.000056
#> 10 params.validate_parameters    0.000056
#> 11 nfeatures                     0.000056

Created on 2022-03-17 by the reprex package (v2.0.1)

Do you know anything about how data is stored in a fitted xgboost model?

@davidkane9
Author

xgboost itself does not store the data. But I usually work within a tidymodels approach, where the data is stored (I think) within the fitted object.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

more_cars <- mtcars[rep(1:32, each = 1000),]

most_cars <- mtcars[rep(1:32, each = 100000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = mtcars)

small_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(more_cars)

large_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(most_cars)

object.size(small_data_fit)
#> 3409512 bytes
object.size(large_data_fit)
#> 282193488 bytes

Created on 2022-03-17 by the reprex package (v2.0.1)

Again, I am no tidymodels expert, but what I want is to save and work with a bunch of fitted models.

@EmilHvitfeldt
Member

The increase in object size for large_data_fit happens because workflows keeps the predictors and outcomes in object$pre$mold. axe_data() on the workflow should remove the data stored there.

library(tidymodels)
library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000),]

most_cars <- mtcars[rep(1:32, each = 100000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = mtcars)

small_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(more_cars)

large_data_fit <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(most_cars)

object.size(small_data_fit)
#> 3773816 bytes

object.size(large_data_fit)
#> 282557792 bytes

weigh(large_data_fit)
#> # A tibble: 167 × 2
#>    object                    size
#>    <chr>                    <dbl>
#>  1 pre.mold.predictors.mpg   25.6
#>  2 pre.mold.predictors.cyl   25.6
#>  3 pre.mold.predictors.disp  25.6
#>  4 pre.mold.predictors.hp    25.6
#>  5 pre.mold.predictors.drat  25.6
#>  6 pre.mold.predictors.wt    25.6
#>  7 pre.mold.predictors.qsec  25.6
#>  8 pre.mold.predictors.am    25.6
#>  9 pre.mold.predictors.gear  25.6
#> 10 pre.mold.predictors.carb  25.6
#> # … with 157 more rows

small_large_data_fit <- axe_data(large_data_fit)

weigh(small_large_data_fit)
#> # A tibble: 156 × 2
#>    object                                       size
#>    <chr>                                       <dbl>
#>  1 pre.actions.recipe.blueprint.forge.process 2.44  
#>  2 pre.mold.blueprint.forge.process           2.44  
#>  3 pre.actions.recipe.blueprint.mold.process  2.43  
#>  4 pre.mold.blueprint.mold.process            2.43  
#>  5 pre.actions.recipe.blueprint.forge.clean   2.42  
#>  6 pre.mold.blueprint.forge.clean             2.42  
#>  7 pre.actions.recipe.blueprint.mold.clean    2.39  
#>  8 pre.mold.blueprint.mold.clean              2.39  
#>  9 fit.fit.fit.callbacks.cb.evaluation.log    0.0354
#> 10 fit.fit.fit.raw                            0.0122
#> # … with 146 more rows

object.size(small_large_data_fit)
#> 954816 bytes

predict(small_large_data_fit, mtcars)
#> # A tibble: 32 × 1
#>      .pred
#>      <dbl>
#>  1 0.00237
#>  2 0.00237
#>  3 0.998  
#>  4 0.998  
#>  5 0.00237
#>  6 0.998  
#>  7 0.00237
#>  8 0.998  
#>  9 0.998  
#> 10 0.998  
#> # … with 22 more rows

Created on 2022-03-17 by the reprex package (v2.0.1)

@juliasilge
Member

Thanks for the clarification @davidkane9!

I believe the fixes in #218 will allow you to successfully use xgboost from tidymodels.

@DavisVaughan
Member

I don't think there is anything to do here because axe_data.workflows removes the outcomes and predictors in the $mold slot, as @EmilHvitfeldt said, which is where the large size comes from.
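For anyone checking their own objects, a quick way to see where the data lives. A minimal sketch, assuming the fitted workflow large_data_fit from the reprex above; the slot names come from hardhat's mold():

# The mold carries the preprocessed training data; axe_data() on the
# workflow empties the predictors and outcomes entries.
names(large_data_fit$pre$mold)
# expected: "predictors" "outcomes" "blueprint" "extras"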

@davidkane9
Author

I still have questions/concerns about this. Can you re-open this issue or would you prefer that I start a new issue?

@juliasilge
Member

juliasilge commented Mar 21, 2022

Go ahead and share here what you're thinking @davidkane9, unless you think it is a different issue than supporting axe_data() for xgb.Booster objects (in which case another issue would be great).

@davidkane9
Author

The key issue is that axe_data() does not remove all copies of the data. It only removes half of it: it does not remove the copy of the data passed in via recipe(). Here is a slightly modified example:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

cars <- mtcars[rep(1:32, each = 10000),] 

base_spec <- 
  boost_tree() %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

base_recipe <- 
  recipe(formula = vs ~ ., 
         data = cars)

fit_obj <-
  workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(base_spec) %>% 
  fit(cars)

axed_fit_obj <- axe_data(fit_obj)

object.size(fit_obj)
#> 56910680 bytes
object.size(axed_fit_obj)
#> 28747704 bytes
weigh(axed_fit_obj)
#> # A tibble: 156 × 2
#>    object                                   size
#>    <chr>                                   <dbl>
#>  1 pre.actions.recipe.recipe.template.mpg   2.56
#>  2 pre.actions.recipe.recipe.template.cyl   2.56
#>  3 pre.actions.recipe.recipe.template.disp  2.56
#>  4 pre.actions.recipe.recipe.template.hp    2.56
#>  5 pre.actions.recipe.recipe.template.drat  2.56
#>  6 pre.actions.recipe.recipe.template.wt    2.56
#>  7 pre.actions.recipe.recipe.template.qsec  2.56
#>  8 pre.actions.recipe.recipe.template.am    2.56
#>  9 pre.actions.recipe.recipe.template.gear  2.56
#> 10 pre.actions.recipe.recipe.template.carb  2.56
#> # … with 146 more rows

Created on 2022-03-24 by the reprex package (v2.0.1)

axe_data() removes half the data, which is why object.size() drops from 56 MB to 28 MB. But it does not touch the copy of the data stored in pre.actions.recipe.recipe.template.

Of course, I can solve my problem by just using cars[1, ] in the call to recipe() rather than cars. But this certainly took me a while to figure out. I assumed that axe_data() would get rid of all copies of the data.
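A minimal sketch of that workaround, using the same objects as the reprex above. recipe() only needs the column names and types to build its template, so a one-row slice keeps the stored copy negligible:

# Pass a single row to recipe(); the full data still goes to fit() on the
# workflow, so the model itself is unchanged.
base_recipe <-
  recipe(formula = vs ~ .,
         data = cars[1, ])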

@juliasilge
Member

juliasilge commented Mar 24, 2022

Oh boy @davidkane9 the butcher methods for workflows only butcher the parsnip model, not the recipe. 🥴 I'll open an issue over there. (We do have butcher support for recipes; it's just not getting applied in the workflow.)
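In the meantime, a hedged manual sketch: butcher the recipe stored inside the fitted workflow directly. The slot path below is inferred from the weigh() output above (pre.actions.recipe.recipe.template), not a documented API, so treat it as an assumption:

# Butcher the embedded recipe to drop its template copy of the data;
# $pre$actions$recipe$recipe is a workflows internal and may change.
axed_fit_obj$pre$actions$recipe$recipe <-
  butcher(axed_fit_obj$pre$actions$recipe$recipe)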

@github-actions

github-actions bot commented Apr 8, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 8, 2022