Allow custom names in transformations #1240

maciejmotyka · 2020-04-20T19:06:00Z

Prework

Read and abide by drake's code of conduct.
Search for duplicates among the existing issues, both open and closed.

Proposal

I'd like to be able to explicitly specify names of targets created with map() or cross().
I don't want to have to rely on variable names (see #1220) or remember numbers (.id = FALSE), but simply supply a character vector with my own meaningful names.

Something like this:

plan <- drake_plan(
    my_fun(path = file_in(my_path)),
    transform = map(
        my_path = c(
            "/nothing/interesting/METHOD1/stuff/DATA1",
            "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
        ),
        .id = FALSE,
        .custom_names = c(
            "METHOD1_DATA1",
            "METHOD2_DATA2"
        )
    )
)

Having short meaningful names would help in downstream prototyping or exploratory analysis. Currently I have to do short_name <- loadd("a_la_la_la_la_long_long_li_long_long_name") for each target.

The text was updated successfully, but these errors were encountered:

wlandau · 2020-04-21T12:59:22Z

Sorry, I do not have plans to implement 100% custom names. You can kind of work around this by defining a custom grouping variable and feeding it to .id.

library(drake)
custom_ids <- c("method1_data1", "method2_data2")
plan <- drake_plan(
  x = target( # Start with a short name.
    my_fun(path = file_in(my_path)),
    transform = map(
      my_path = c(
        "/nothing/interesting/METHOD1/stuff/DATA1",
        "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
      ),
      custom_ids = !!custom_ids, # The !! is important.
      .id = custom_ids,
    )
  )
)

plan
#> # A tibble: 2 x 2
#>   target         command                                                        
#>   <chr>          <expr>                                                         
#> 1 x_method1_dat… my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA…
#> 2 x_method2_dat… my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/st…

drake_plan_source(plan)
#> drake_plan(
#>   x_method1_data1 = my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA1")),
#>   x_method2_data2 = my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"))
#> )

^{Created on 2020-04-21 by the reprex package (v0.3.0)}

januz · 2020-04-21T15:23:23Z

Hm, I was just about to write a similar feature request. The arguments I map() and cross() over aren't really suited to generate target names from (long variable names, vectors, ...). I wanted to suggest that one just supplies named vectors to the parameters that then are used in the target name for those parameters, i.e., something like

plan <- drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = cross(
      param1 = c(
        first = "FirstLongVariableName",
        second = "SecondLongVariableName"
      ),
      param2 = c(
        b1 = "LongVariableName_1",
        b2 = c("LongVariableName_1",  "LongVariableName_2")
      )
    )
  )
)

with resulting target names

model_first_b1
model_first_b2
model_second_b1
model_second_b2

If that won't be possible, would you mind explaining how to work around this for cross()? Your suggestion above only works for map(), right? Thank you!!

wlandau · 2020-04-21T16:45:16Z

You can supply a .data argument to map() with the expanded grid.

library(drake)
library(tidyverse)

grid <- expand_grid(
  param1 = c("FirstLongVariableName", "SecondLongVariableName"),
  param2 = c("LongVariableName_1", "LongVariableName_2")
) %>%
  mutate(
    id1 = recode(
      param1,
      FirstLongVariableName = "first",
      SecondLongVariableName = "second"
    )
  ) %>%
  mutate(
    id2 = recode(
      param2,
      LongVariableName_1 = "b1",
      LongVariableName_2 = "b2"
    )
  )

grid
#> # A tibble: 4 x 4
#>   param1                 param2             id1    id2  
#>   <chr>                  <chr>              <chr>  <chr>
#> 1 FirstLongVariableName  LongVariableName_1 first  b1   
#> 2 FirstLongVariableName  LongVariableName_2 first  b2   
#> 3 SecondLongVariableName LongVariableName_1 second b1   
#> 4 SecondLongVariableName LongVariableName_2 second b2

drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = map(.data = !!grid, .id = c(id1, id2))
  )
)
#> # A tibble: 4 x 2
#>   target          command                                               
#>   <chr>           <expr>                                                
#> 1 model_first_b1  my_fun("FirstLongVariableName", "LongVariableName_1") 
#> 2 model_first_b2  my_fun("FirstLongVariableName", "LongVariableName_2") 
#> 3 model_second_b1 my_fun("SecondLongVariableName", "LongVariableName_1")
#> 4 model_second_b2 my_fun("SecondLongVariableName", "LongVariableName_2")

^{Created on 2020-04-21 by the reprex package (v0.3.0)}

januz · 2020-04-22T23:24:24Z

Thank you very much for the example, @wlandau! This works, but I had to use case_when() to account for cases when the input is a vector and not a single character/number (like b2 in my example). I still think that just using named vectors would be an elegant solution and probably easier to read/understand, but I understand that you won't implement it. Thanks for your help!

maciejmotyka · 2020-04-23T08:57:12Z

Yes, thanks for these hacks @wlandau.
Just out of curiosity: why do you think custom names are a bad idea? Are they unnecessary, too time-consuming to implement, or maybe would somehow interfere with reproducibility?

wlandau · 2020-04-23T12:06:22Z

I let it sit for a while and changed my mind because implementation turned out to be trivially easy. Implemented in 312e124.

library(drake)
drake_plan(
  x = target(
    f(x),
    transform = map(x = !!seq_len(2), .names = c("a", "b"))
  ),
  y = target(
    f(w, x),
    transform = cross(
      w = !!seq_len(2),
      x,
      .names = c("aa", "ab", "ba", "bb")
    )
  ),
  z = target(
    g(y),
    transform = map(y)
  ),
  final = target(
    h(z),
    transform = combine(z, .by = x, .names = c("final1", "final2"))
  )
)
#> # A tibble: 12 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 final1 h(z_aa, z_ab)
#>  2 final2 h(z_ba, z_bb)
#>  3 a      f(1L)        
#>  4 b      f(2L)        
#>  5 aa     f(1L, a)     
#>  6 ab     f(2L, a)     
#>  7 ba     f(1L, b)     
#>  8 bb     f(2L, b)     
#>  9 z_aa   g(aa)        
#> 10 z_ab   g(ab)        
#> 11 z_ba   g(ba)        
#> 12 z_bb   g(bb)

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

I had considered this feature before, but I resisted on general principles. It breaks the abstraction of the interface, which is usually unwise from a software development perspective. I have made the mistake of being too eager to add flexibility over the last 3-ish years of developing drake. In other words, I was not lazy enough as a programmer, and I had little experience developing a tool this complicated.

If I could go back and develop the whole package from scratch, I would recommend clever workarounds for as long as possible before immediately implementing new features. For example, triggers was not a good idea in hindsight. The implementation was extremely complicated to create and even more burdensome to maintain, and I now strongly suspect the most important functionality can be completely covered with global objects, cancel(), and cancel_if(), which are much simpler.

For custom names, it took a go at implementation to see this, but it turned out to have less technical debt than I originally expected.

januz · 2020-04-23T17:41:09Z

Thank you so much for implementing custom names, @wlandau!! I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments. I guess one can apply the approach below and use named vectors (to have a clear value-name correspondence), then construct the names for the targets programmatically before the call to drake_plan() (I tried to do so within the call, but I didn't find a way to refer to the names resulting from a previous target). But even that might turn out to be error prone.

x_arg1 <- c(a = 1L, b = 2L)
x_arg2 <- c(c = 3L, d = 4L)
x_names <- paste0("xtarget_", paste(names(x_arg1), names(x_arg2), sep = "_"))
  
y_arg  <- c(e = 5L, f = 6L)
y_names <- paste0("ytarget_", do.call(paste, c(expand.grid(names(y_arg), x_names), sep = "_")))

final_names <- paste0("finaltarget_", x_names)

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2,
      .names = !!x_names
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget,
      .names = !!y_names
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget, .names = !!final_names)
  )
)

which gives the same result as the drake's standard naming convention, with the argument names switched out:

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget)
  )
)

wlandau · 2020-04-23T20:11:52Z

I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

I share your concerns, and this was part of my initial resistance. I recommend avoiding .names unless target names get truly unappealing or unmanageable.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments.

Fundamental changes to behavior get tricky because drake needs to be as back-compatible with previous versions as possible. And relying on vector/list names is another dimension to behavior that could easily get confusing and complicated for users to learn, even when explained properly. Also, it is too easy to forget when your vector has names and when it does not, and this would make plans brittle.

januz · 2020-04-24T15:56:13Z

This is all true! I think I will stay with using .data for now. Seems like the most robust solution. Thank you!!

maciejmotyka · 2020-04-24T17:05:12Z

I will keep using .names for now to test it out.
One inconvenience is as @januz noticed, that the names of the previous targets are not accessible inside the following transformations which would enable more elegant construction of new names.

Here, I'm trying to prepend the previous names:

plan <- drake::drake_plan(
    
    # Read haplotype seqs and clean their names 
    haps = target(
        read_haps(path = drake::file_in(path)),
        transform = map(
            path = c(
                "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
            ),
            .names = c("abqr_001",
                       "csnv_t5_tf001",
                       "csnv_t5_tf01",
                       "csnv_t10_tf01")
        )
    ), 
    
    # Align the haps using Muscle, ClustalO and ClustalW with default settings
    aln = target(
        msa::msa(inputSeqs = haps, method = methods),
        transform = cross(
            haps,
            methods = c("Muscle", "ClustalOmega", "ClustalW"),
            .names = outer(
                c("Muscle", "ClustO", "ClustW"),
                haps,
                paste,
                sep = "_"
            ) %>% as.vector()
        )
    )
)

#> Error in outer(c("Muscle", "ClustO", "ClustW"), haps, paste, sep = "_") : 
#> object 'haps' not found

Of course, I can just copy the first .names vector and paste it in place of haps inside the second target. However, if haps was of length 40 instead of 4 then pasting the whole vector again would make the plan less readable.

wlandau · 2020-04-24T21:12:27Z

You can work with external variables to support your plan, such as the hap_names vector below.

library(drake)

hap_names <- c(
  "abqr_001",
  "csnv_t5_tf001",
  "csnv_t5_tf01",
  "csnv_t10_tf01"
)

plan <- drake_plan(
  haps = target(
    read_haps(path = drake::file_in(path)),
    transform = map(
      path = c(
        "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
      ),
      .names = !!hap_names
    )
  ), 
  aln = target(
    msa::msa(inputSeqs = haps, method = methods),
    transform = cross(
      haps,
      methods = c("Muscle", "ClustalOmega", "ClustalW"),
      .names = as.vector(outer(
        c("Muscle", "ClustO", "ClustW"),
        !!hap_names,
        paste,
        sep = "_"
      ))
    )
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   Muscle_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "Muscle"),
#>   ClustO_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalOmega"),
#>   ClustW_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "Muscle"),
#>   ClustO_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "Muscle"),
#>   ClustO_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalW"),
#>   Muscle_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "Muscle"),
#>   ClustO_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalW"),
#>   abqr_001 = read_haps(path = drake::file_in("/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes")),
#>   csnv_t5_tf001 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes")),
#>   csnv_t5_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes")),
#>   csnv_t10_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"))
#> )

^{Created on 2020-04-24 by the reprex package (v0.3.0)}

maciejmotyka added the type: new feature label Apr 20, 2020

maciejmotyka assigned wlandau Apr 20, 2020

wlandau closed this as completed Apr 21, 2020

wlandau pushed a commit that referenced this issue Apr 23, 2020

Implement #1240

312e124

wlandau pushed a commit to ropensci-books/drake that referenced this issue Apr 23, 2020

Document ropensci/drake#1240

bf19ed3

wlandau pushed a commit to ropensci-books/drake that referenced this issue Apr 23, 2020

Finish documenting ropensci/drake#1240

17ae623

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow custom names in transformations #1240

Allow custom names in transformations #1240

maciejmotyka commented Apr 20, 2020 •

edited by wlandau

Loading

wlandau commented Apr 21, 2020

januz commented Apr 21, 2020

wlandau commented Apr 21, 2020

januz commented Apr 22, 2020

maciejmotyka commented Apr 23, 2020

wlandau commented Apr 23, 2020 •

edited

Loading

januz commented Apr 23, 2020

wlandau commented Apr 23, 2020 •

edited

Loading

januz commented Apr 24, 2020

maciejmotyka commented Apr 24, 2020

wlandau commented Apr 24, 2020 •

edited

Loading

Allow custom names in transformations #1240

Allow custom names in transformations #1240

Comments

maciejmotyka commented Apr 20, 2020 • edited by wlandau Loading

Prework

Proposal

wlandau commented Apr 21, 2020

januz commented Apr 21, 2020

wlandau commented Apr 21, 2020

januz commented Apr 22, 2020

maciejmotyka commented Apr 23, 2020

wlandau commented Apr 23, 2020 • edited Loading

januz commented Apr 23, 2020

wlandau commented Apr 23, 2020 • edited Loading

januz commented Apr 24, 2020

maciejmotyka commented Apr 24, 2020

wlandau commented Apr 24, 2020 • edited Loading

maciejmotyka commented Apr 20, 2020 •

edited by wlandau

Loading

wlandau commented Apr 23, 2020 •

edited

Loading

wlandau commented Apr 23, 2020 •

edited

Loading

wlandau commented Apr 24, 2020 •

edited

Loading