Implement axe_fitted.recipe #207

Closed
AshesITR opened this issue Nov 25, 2021 · 5 comments · Fixed by #208

Comments

@AshesITR
Contributor

It should remove x$template, which contains the prepped data of the training set.

reprex stolen and adapted from tidymodels/recipes#859

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(butcher)

rec <- recipe(data = dplyr::bind_rows(rep(list(iris), 10)), formula = Species ~ .) %>%
  step_normalize(starts_with("Petal.")) %>%
  step_BoxCox(starts_with("Sepal."))

rec_prepped <- prep(rec)

lobstr::obj_size(rec_prepped)
#> 128,656 B
lobstr::obj_size(butcher(rec_prepped))
#> 70,320 B
lobstr::obj_size(butcher(prep(rec_prepped, retain = FALSE)))
#> 15,936 B

The proposed implementation is quite simple, if I'm not missing anything:

axe_fitted.recipe <- function(x, verbose = FALSE, ...) {
  old <- x
  x$template <- x$template[integer(), ]

  add_butcher_attributes(
    x,
    old,
    verbose = verbose
  )
}
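
To illustrate what the zero-row subsetting in the proposed method does, here is a minimal base-R sketch. It does not use recipes or butcher internals: `fake_recipe` and `strip_template()` are hypothetical stand-ins for a prepped recipe and for the proposed `axe_fitted.recipe()`, introduced only for this example.

```r
# Hypothetical stand-in for a prepped recipe: a plain list holding a
# `template` data frame (here, iris stacked ten times, as in the reprex).
fake_recipe <- list(
  template = do.call(rbind, rep(list(iris), 10)),
  steps = list()
)

# Same idea as the proposed method: keep zero rows of the template.
# `strip_template()` is an illustrative name, not a butcher function.
strip_template <- function(x) {
  x$template <- x$template[integer(), ]
  x
}

stripped <- strip_template(fake_recipe)

nrow(fake_recipe$template) # rows before stripping
nrow(stripped$template)    # rows after stripping

# Column names and factor levels survive the subsetting.
identical(names(stripped$template), names(iris))
identical(levels(stripped$template$Species), levels(iris$Species))

# The stripped object takes less memory than the original.
object.size(stripped) < object.size(fake_recipe)
```

The point of keeping a zero-row data frame rather than setting the element to `NULL` is that the column structure remains available to anything that inspects the template.
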
@juliasilge
Member

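This seems like a really good idea for butchering a recipe. An alternative to zero-row subsetting would be replacing the template with vctrs::vec_ptype(template). With a zero-row template, juice() returns no rows, but bake() on new data still works: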

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(concrete)

concrete <- 
  concrete %>% 
  group_by(across(-compressive_strength)) %>% 
  summarize(compressive_strength = mean(compressive_strength),
            .groups = "drop")

set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test  <- testing(concrete_split)

rec <- recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_poly(all_predictors()) %>% 
  step_interact(~ all_predictors():all_predictors())

prepped <- prep(rec)
bake(prepped, new_data = concrete_test)
#> # A tibble: 249 × 137
#>    compressive_strength cement_poly_1 cement_poly_2 blast_furnace_slag_poly_1
#>                   <dbl>         <dbl>         <dbl>                     <dbl>
#>  1                 4.57       -0.0632        0.0967                    0.0354
#>  2                 7.68       -0.0632        0.0967                    0.0354
#>  3                 7.72       -0.0609        0.0887                    0.0395
#>  4                20.6        -0.0609        0.0887                    0.0395
#>  5                 6.28       -0.0581        0.0794                    0.0440
#>  6                31.0        -0.0581        0.0794                    0.0440
#>  7                10.4        -0.0558        0.0716                    0.0487
#>  8                33.3        -0.0524        0.0611                    0.0584
#>  9                13.7        -0.0521        0.0600                    0.0556
#> 10                 7.51       -0.0511        0.0571                    0.0571
#> # … with 239 more rows, and 133 more variables:
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, age_poly_1 <dbl>,
#> #   age_poly_2 <dbl>, cement_poly_1_x_cement_poly_2 <dbl>, …

prepped$template <- prepped$template[integer(), ]
juice(prepped)
#> # A tibble: 0 × 137
#> # … with 137 variables: compressive_strength <dbl>, cement_poly_1 <dbl>,
#> #   cement_poly_2 <dbl>, blast_furnace_slag_poly_1 <dbl>,
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, …
bake(prepped, new_data = concrete_test)
#> # A tibble: 249 × 137
#>    compressive_strength cement_poly_1 cement_poly_2 blast_furnace_slag_poly_1
#>                   <dbl>         <dbl>         <dbl>                     <dbl>
#>  1                 4.57       -0.0632        0.0967                    0.0354
#>  2                 7.68       -0.0632        0.0967                    0.0354
#>  3                 7.72       -0.0609        0.0887                    0.0395
#>  4                20.6        -0.0609        0.0887                    0.0395
#>  5                 6.28       -0.0581        0.0794                    0.0440
#>  6                31.0        -0.0581        0.0794                    0.0440
#>  7                10.4        -0.0558        0.0716                    0.0487
#>  8                33.3        -0.0524        0.0611                    0.0584
#>  9                13.7        -0.0521        0.0600                    0.0556
#> 10                 7.51       -0.0511        0.0571                    0.0571
#> # … with 239 more rows, and 133 more variables:
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, age_poly_1 <dbl>,
#> #   age_poly_2 <dbl>, cement_poly_1_x_cement_poly_2 <dbl>, …

Created on 2021-11-29 by the reprex package (v2.0.1)

@juliasilge
Member

@AshesITR would you be interested in contributing a PR to butcher to implement this method for a recipe? We have an article here with some advice on contributing to butcher, but as you have probably already discovered, the method would go in this file.

@AshesITR
Contributor Author

Sure, I'll make this a PR. Regarding df[integer(), ] vs. vctrs::vec_ptype(df): do you have an opinion on which of these alternatives to use?
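
For reference, a small sketch comparing the two alternatives on a toy data frame. It assumes the vctrs package is installed; `df` and its columns are made up for illustration. It only checks the properties that matter here (zero rows, same column names, classes, and factor levels) and does not claim the two results are identical in every attribute or for every column class.

```r
library(vctrs)

# Toy data frame with a numeric and a factor column (illustrative only).
df <- data.frame(
  x = c(1.5, 2.5),
  g = factor(c("a", "b"))
)

by_subset <- df[integer(), ] # base R zero-row subsetting
by_ptype <- vec_ptype(df)    # vctrs prototype of the data frame

# Both results have zero rows.
nrow(by_subset)
nrow(by_ptype)

# Both keep the same column names, column classes, and factor levels.
identical(names(by_subset), names(by_ptype))
identical(sapply(by_subset, class), sapply(by_ptype, class))
identical(levels(by_subset$g), levels(by_ptype$g))
```

The base-R form needs no extra import, which is relevant because butcher does not import vctrs directly.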

@juliasilge
Member

juliasilge commented Nov 30, 2021

Do you have an opinion on that @DavisVaughan? butcher doesn't currently import vctrs but does import tibble, which imports vctrs.

AshesITR added a commit to AshesITR/butcher that referenced this issue Nov 30, 2021
fixes tidymodels#207

The .ignore file changes relate to me using PyCharm IDE. LMK if I should undo these commits and instead locally ignore .idea.
juliasilge added a commit that referenced this issue Dec 2, 2021
* Implement axe_fitted.recipe

fixes #207

The .ignore file changes relate to me using PyCharm IDE. LMK if I should undo these commits and instead locally ignore .idea.

* Tidy up recipe butcher() expansion

* Update .Rbuildignore

Co-authored-by: Julia Silge
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 17, 2021