axe_data() method for xgb.Booster objects #217
It is documented that the `axe_data()` method does not work for xgb.Booster objects. (See related discussion here.) This is a major problem for those working in industry because:

- xgb.Booster models are perhaps the most popular machine learning models, at least for numeric data, in industry.
- Professional users often work with very large amounts of data.
- Professional users often create many models and/or use model ensembles.

Without being able to remove the data associated with each model object using `axe_data()`, we end up with numerous model objects, each with an entire copy of the original data set. This makes a tidymodels approach unworkable, which makes me sad. I, at least, have no choice but to use `xgb.train()` directly.
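As a rough illustration of that direct `xgb.train()` route, here is a minimal sketch; the data frame `df` and its outcome column `y` are hypothetical stand-ins, not objects from this issue.

```r
library(xgboost)

# Hypothetical data frame with numeric predictors and an outcome `y`.
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# xgb.train() works from an xgb.DMatrix; the returned booster holds the
# serialized model, not a copy of `df`.
dtrain <- xgb.DMatrix(
  data = as.matrix(df[, c("x1", "x2")]),
  label = df[["y"]]
)
mod <- xgb.train(
  params = list(objective = "reg:squarederror"),
  data = dtrain,
  nrounds = 10
)
```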
We agree totally that we need to get xgboost working with butcher, because it is such an important modeling approach. I have started some fixes in #218 to address the problems. However, I wanted to ask you specifically about the following:

```r
library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000), ]

xgb_mod <- xgboost(
  data = as.matrix(more_cars[, -6]),
  label = more_cars[["vs"]],
  nrounds = 10)
#> [1]  train-rmse:0.350062
#> [2]  train-rmse:0.245052
#> [3]  train-rmse:0.171519
#> [4]  train-rmse:0.120046
#> [5]  train-rmse:0.084052
#> [6]  train-rmse:0.058839
#> [7]  train-rmse:0.041190
#> [8]  train-rmse:0.028835
#> [9]  train-rmse:0.020180
#> [10] train-rmse:0.014130

weigh(xgb_mod)
#> # A tibble: 11 × 2
#>    object                            size
#>    <chr>                            <dbl>
#>  1 callbacks.cb.evaluation.log   0.0354
#>  2 callbacks.cb.print.evaluation 0.0146
#>  3 raw                           0.00793
#>  4 call                          0.00151
#>  5 feature_names                 0.000736
#>  6 handle                        0.000312
#>  7 evaluation_log.iter           0.000176
#>  8 evaluation_log.train_rmse     0.000176
#>  9 niter                         0.000056
#> 10 params.validate_parameters    0.000056
#> 11 nfeatures                     0.000056
```

Created on 2022-03-17 by the reprex package (v2.0.1)

Do you know anything about how data is stored in a fitted xgboost model?
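One way to poke at that question, sketched here against the `xgb_mod` object from the reprex above: list the booster's components and check its serialized model, which the `weigh()` output suggests is small and independent of the number of training rows.

```r
# List the top-level components of the fitted booster; there is no slot
# holding the training data itself.
str(xgb_mod, max.level = 1)

# `raw` holds the serialized model; its size tracks the number of trees,
# not the number of training rows.
length(xgb_mod$raw)
```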
xgboost itself does not store the data. But I usually work within a tidymodels approach:

```r
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

more_cars <- mtcars[rep(1:32, each = 1000), ]
most_cars <- mtcars[rep(1:32, each = 100000), ]

base_spec <-
  boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

base_recipe <-
  recipe(formula = vs ~ .,
         data = mtcars)

small_data_fit <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(more_cars)

large_data_fit <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(most_cars)

object.size(small_data_fit)
#> 3409512 bytes
object.size(large_data_fit)
#> 282193488 bytes
```

Created on 2022-03-17 by the reprex package (v2.0.1)

Again, I am no tidymodels expert, but I think that I want to save and work with a bunch of fitted models.
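For that save-many-models use case, a hedged sketch of what the saving step might look like, reusing `small_data_fit` from the reprex above: axe what butcher can remove, then serialize.

```r
library(butcher)

# Axe everything butcher knows how to remove from this object, then
# write it to disk; each saved file carries whatever butcher could not
# strip out.
lean_fit <- butcher(small_data_fit)
saveRDS(lean_fit, "small_data_fit.rds")
```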
The increase in object size comes from the training data stored in the fitted workflow; `axe_data()` removes it, and the axed workflow still makes predictions:

```r
library(tidymodels)
library(butcher)
library(xgboost)

more_cars <- mtcars[rep(1:32, each = 1000), ]
most_cars <- mtcars[rep(1:32, each = 100000), ]

base_spec <-
  boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

base_recipe <-
  recipe(formula = vs ~ .,
         data = mtcars)

small_data_fit <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(more_cars)

large_data_fit <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(most_cars)

object.size(small_data_fit)
#> 3773816 bytes
object.size(large_data_fit)
#> 282557792 bytes

weigh(large_data_fit)
#> # A tibble: 167 × 2
#>    object                    size
#>    <chr>                    <dbl>
#>  1 pre.mold.predictors.mpg   25.6
#>  2 pre.mold.predictors.cyl   25.6
#>  3 pre.mold.predictors.disp  25.6
#>  4 pre.mold.predictors.hp    25.6
#>  5 pre.mold.predictors.drat  25.6
#>  6 pre.mold.predictors.wt    25.6
#>  7 pre.mold.predictors.qsec  25.6
#>  8 pre.mold.predictors.am    25.6
#>  9 pre.mold.predictors.gear  25.6
#> 10 pre.mold.predictors.carb  25.6
#> # … with 157 more rows

small_large_data_fit <- axe_data(large_data_fit)

weigh(small_large_data_fit)
#> # A tibble: 156 × 2
#>    object                                       size
#>    <chr>                                       <dbl>
#>  1 pre.actions.recipe.blueprint.forge.process 2.44
#>  2 pre.mold.blueprint.forge.process           2.44
#>  3 pre.actions.recipe.blueprint.mold.process  2.43
#>  4 pre.mold.blueprint.mold.process            2.43
#>  5 pre.actions.recipe.blueprint.forge.clean   2.42
#>  6 pre.mold.blueprint.forge.clean             2.42
#>  7 pre.actions.recipe.blueprint.mold.clean    2.39
#>  8 pre.mold.blueprint.mold.clean              2.39
#>  9 fit.fit.fit.callbacks.cb.evaluation.log    0.0354
#> 10 fit.fit.fit.raw                            0.0122
#> # … with 146 more rows

object.size(small_large_data_fit)
#> 954816 bytes

predict(small_large_data_fit, mtcars)
#> # A tibble: 32 × 1
#>      .pred
#>      <dbl>
#>  1 0.00237
#>  2 0.00237
#>  3 0.998
#>  4 0.998
#>  5 0.00237
#>  6 0.998
#>  7 0.00237
#>  8 0.998
#>  9 0.998
#> 10 0.998
#> # … with 22 more rows
```

Created on 2022-03-17 by the reprex package (v2.0.1)
Thanks for the clarification @davidkane9! I believe the fixes in #218 will allow you to successfully use xgboost from tidymodels.

I don't think there is anything to do here because …

I still have questions/concerns about this. Can you re-open this issue, or would you prefer that I start a new issue?

Go ahead and share here what you're thinking @davidkane9, unless you think it is a different issue than supporting …
The key issue is that `axe_data()` does not remove all copies of the data. It only removes half the data: it does not remove the data that is passed in with `recipe()`. Here is a slightly modified example:

```r
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
library(butcher)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

cars <- mtcars[rep(1:32, each = 10000), ]

base_spec <-
  boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

base_recipe <-
  recipe(formula = vs ~ .,
         data = cars)

fit_obj <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(cars)

axed_fit_obj <- axe_data(fit_obj)

object.size(fit_obj)
#> 56910680 bytes
object.size(axed_fit_obj)
#> 28747704 bytes

weigh(axed_fit_obj)
#> # A tibble: 156 × 2
#>    object                                  size
#>    <chr>                                  <dbl>
#>  1 pre.actions.recipe.recipe.template.mpg  2.56
#>  2 pre.actions.recipe.recipe.template.cyl  2.56
#>  3 pre.actions.recipe.recipe.template.disp 2.56
#>  4 pre.actions.recipe.recipe.template.hp   2.56
#>  5 pre.actions.recipe.recipe.template.drat 2.56
#>  6 pre.actions.recipe.recipe.template.wt   2.56
#>  7 pre.actions.recipe.recipe.template.qsec 2.56
#>  8 pre.actions.recipe.recipe.template.am   2.56
#>  9 pre.actions.recipe.recipe.template.gear 2.56
#> 10 pre.actions.recipe.recipe.template.carb 2.56
#> # … with 146 more rows
```

Created on 2022-03-24 by the reprex package (v2.0.1)

`axe_data()` removes half the data, which is why `object.size()` drops from 56 MB to 28 MB. But it does not touch the copy of the data stored in `pre.actions.recipe.recipe.template`. Of course, I can solve my problem by just using `cars[1, ]` in the call to `recipe()` rather than `cars` (sketched below). But this certainly took me a while to figure out. I assumed that `axe_data()` would get rid of all copies of the data.
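A sketch of that `cars[1, ]` workaround, reusing `cars` and `base_spec` from the reprex above: the recipe only needs a template of the columns, not the full data, so the fitted workflow never stores a second copy.

```r
# Pass a one-row slice to recipe() so its template never stores a full
# copy of the training data; fit() still sees all of `cars`.
base_recipe <-
  recipe(formula = vs ~ .,
         data = cars[1, ])

fit_obj <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(base_spec) %>%
  fit(cars)
```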
Oh boy @davidkane9, the butcher methods for workflows only butcher the parsnip model, not the recipe. 🥴 I'll open an issue over there. (We do have butcher support for recipes; it's just not getting applied in the workflow.)
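Until that lands in workflows, one hedged stopgap might be to butcher the recipe inside the fitted workflow by hand. Note that `pre$actions$recipe$recipe` is an internal path inferred from the `weigh()` output above, not a documented API, and could change between workflows versions.

```r
# Manually apply butcher to the recipe stored in the fitted workflow.
# The $pre$actions$recipe$recipe path is an internal detail inferred
# from the weigh() output, so treat this as an assumption.
axed_fit_obj$pre$actions$recipe$recipe <-
  butcher::butcher(axed_fit_obj$pre$actions$recipe$recipe)

object.size(axed_fit_obj)  # should shrink if the recipe template was axed
```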
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.