Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom names in transformations #1240

Closed
2 tasks done
maciejmotyka opened this issue Apr 20, 2020 · 11 comments
Closed
2 tasks done

Allow custom names in transformations #1240

maciejmotyka opened this issue Apr 20, 2020 · 11 comments
Assignees

Comments

@maciejmotyka
Copy link

maciejmotyka commented Apr 20, 2020

Prework

Proposal

I'd like to be able to explicitly specify names of targets created with map() or cross().
I don't want to have to rely on variable names (see #1220) or remember numbers (.id = FALSE), but simply supply a character vector with my own meaningful names.

Something like this:

plan <- drake_plan(
    my_fun(path = file_in(my_path)),
    transform = map(
        my_path = c(
            "/nothing/interesting/METHOD1/stuff/DATA1",
            "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
        ),
        .id = FALSE,
        .custom_names = c(
            "METHOD1_DATA1",
            "METHOD2_DATA2"
        )
    )
)

Having short meaningful names would help in downstream prototyping or exploratory analysis. Currently I have to do short_name <- loadd("a_la_la_la_la_long_long_li_long_long_name") for each target.

@wlandau
Copy link
Member

wlandau commented Apr 21, 2020

Sorry, I do not have plans to implement 100% custom names. You can kind of work around this by defining a custom grouping variable and feeding it to .id.

library(drake)
custom_ids <- c("method1_data1", "method2_data2")
plan <- drake_plan(
  x = target( # Start with a short name.
    my_fun(path = file_in(my_path)),
    transform = map(
      my_path = c(
        "/nothing/interesting/METHOD1/stuff/DATA1",
        "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
      ),
      custom_ids = !!custom_ids, # The !! is important.
      .id = custom_ids,
    )
  )
)

plan
#> # A tibble: 2 x 2
#>   target         command                                                        
#>   <chr>          <expr>                                                         
#> 1 x_method1_dat… my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA…
#> 2 x_method2_dat… my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/st…

drake_plan_source(plan)
#> drake_plan(
#>   x_method1_data1 = my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA1")),
#>   x_method2_data2 = my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"))
#> )

Created on 2020-04-21 by the reprex package (v0.3.0)

@wlandau wlandau closed this as completed Apr 21, 2020
@januz
Copy link

januz commented Apr 21, 2020

Hm, I was just about to write a similar feature request. The arguments I map() and cross() over aren't really suited to generate target names from (long variable names, vectors, ...). I wanted to suggest that one just supplies named vectors to the parameters that then are used in the target name for those parameters, i.e., something like

plan <- drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = cross(
      param1 = c(
        first = "FirstLongVariableName",
        second = "SecondLongVariableName"
      ),
      param2 = c(
        b1 = "LongVariableName_1",
        b2 = c("LongVariableName_1",  "LongVariableName_2")
      )
    )
  )
)

with resulting target names

model_first_b1
model_first_b2
model_second_b1
model_second_b2

If that won't be possible, would you mind explaining how to work around this for cross()? Your suggestion above only works for map(), right? Thank you!!

@wlandau
Copy link
Member

wlandau commented Apr 21, 2020

You can supply a .data argument to map() with the expanded grid.

library(drake)
library(tidyverse)

grid <- expand_grid(
  param1 = c("FirstLongVariableName", "SecondLongVariableName"),
  param2 = c("LongVariableName_1", "LongVariableName_2")
) %>%
  mutate(
    id1 = recode(
      param1,
      FirstLongVariableName = "first",
      SecondLongVariableName = "second"
    )
  ) %>%
  mutate(
    id2 = recode(
      param2,
      LongVariableName_1 = "b1",
      LongVariableName_2 = "b2"
    )
  )

grid
#> # A tibble: 4 x 4
#>   param1                 param2             id1    id2  
#>   <chr>                  <chr>              <chr>  <chr>
#> 1 FirstLongVariableName  LongVariableName_1 first  b1   
#> 2 FirstLongVariableName  LongVariableName_2 first  b2   
#> 3 SecondLongVariableName LongVariableName_1 second b1   
#> 4 SecondLongVariableName LongVariableName_2 second b2

drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = map(.data = !!grid, .id = c(id1, id2))
  )
)
#> # A tibble: 4 x 2
#>   target          command                                               
#>   <chr>           <expr>                                                
#> 1 model_first_b1  my_fun("FirstLongVariableName", "LongVariableName_1") 
#> 2 model_first_b2  my_fun("FirstLongVariableName", "LongVariableName_2") 
#> 3 model_second_b1 my_fun("SecondLongVariableName", "LongVariableName_1")
#> 4 model_second_b2 my_fun("SecondLongVariableName", "LongVariableName_2")

Created on 2020-04-21 by the reprex package (v0.3.0)

@januz
Copy link

januz commented Apr 22, 2020

Thank you very much for the example, @wlandau! This works, but I had to use case_when() to account for cases when the input is a vector and not a single character/number (like b2 in my example). I still think that just using named vectors would be an elegant solution and probably easier to read/understand, but I understand that you won't implement it. Thanks for your help!

@maciejmotyka
Copy link
Author

Yes, thanks for these hacks @wlandau.
Just out of curiosity: why do you think custom names are a bad idea? Are they unnecessary, too time-consuming to implement, or maybe would somehow interfere with reproducibility?

wlandau pushed a commit that referenced this issue Apr 23, 2020
@wlandau
Copy link
Member

wlandau commented Apr 23, 2020

I let it sit for a while and changed my mind because implementation turned out to be trivially easy. Implemented in 312e124.

library(drake)
drake_plan(
  x = target(
    f(x),
    transform = map(x = !!seq_len(2), .names = c("a", "b"))
  ),
  y = target(
    f(w, x),
    transform = cross(
      w = !!seq_len(2),
      x,
      .names = c("aa", "ab", "ba", "bb")
    )
  ),
  z = target(
    g(y),
    transform = map(y)
  ),
  final = target(
    h(z),
    transform = combine(z, .by = x, .names = c("final1", "final2"))
  )
)
#> # A tibble: 12 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 final1 h(z_aa, z_ab)
#>  2 final2 h(z_ba, z_bb)
#>  3 a      f(1L)        
#>  4 b      f(2L)        
#>  5 aa     f(1L, a)     
#>  6 ab     f(2L, a)     
#>  7 ba     f(1L, b)     
#>  8 bb     f(2L, b)     
#>  9 z_aa   g(aa)        
#> 10 z_ab   g(ab)        
#> 11 z_ba   g(ba)        
#> 12 z_bb   g(bb)

Created on 2020-04-23 by the reprex package (v0.3.0)

I had considered this feature before, but I resisted on general principles. It breaks the abstraction of the interface, which is usually unwise from a software development perspective. I have made the mistake of being too eager to add flexibility over the last 3-ish years of developing drake. In other words, I was not lazy enough as a programmer, and I had little experience developing a tool this complicated.

If I could go back and develop the whole package from scratch, I would recommend clever workarounds for as long as possible before immediately implementing new features. For example, triggers was not a good idea in hindsight. The implementation was extremely complicated to create and even more burdensome to maintain, and I now strongly suspect the most important functionality can be completely covered with global objects, cancel(), and cancel_if(), which are much simpler.

For custom names, it took a go at implementation to see this, but it turned out to have less technical debt than I originally expected.

wlandau pushed a commit to ropensci-books/drake that referenced this issue Apr 23, 2020
wlandau pushed a commit to ropensci-books/drake that referenced this issue Apr 23, 2020
@januz
Copy link

januz commented Apr 23, 2020

Thank you so much for implementing custom names, @wlandau!! I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments. I guess one can apply the approach below and use named vectors (to have a clear value-name correspondence), then construct the names for the targets programmatically before the call to drake_plan() (I tried to do so within the call, but I didn't find a way to refer to the names resulting from a previous target). But even that might turn out to be error prone.

x_arg1 <- c(a = 1L, b = 2L)
x_arg2 <- c(c = 3L, d = 4L)
x_names <- paste0("xtarget_", paste(names(x_arg1), names(x_arg2), sep = "_"))
  
y_arg  <- c(e = 5L, f = 6L)
y_names <- paste0("ytarget_", do.call(paste, c(expand.grid(names(y_arg), x_names), sep = "_")))

final_names <- paste0("finaltarget_", x_names)

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2,
      .names = !!x_names
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget,
      .names = !!y_names
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget, .names = !!final_names)
  )
)

which gives the same result as the drake's standard naming convention, with the argument names switched out:

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget)
  )
) 

@wlandau
Copy link
Member

wlandau commented Apr 23, 2020

I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

I share your concerns, and this was part of my initial resistance. I recommend avoiding .names unless target names get truly unappealing or unmanageable.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments.

Fundamental changes to behavior get tricky because drake needs to be as back-compatible with previous versions as possible. And relying on vector/list names is another dimension to behavior that could easily get confusing and complicated for users to learn, even when explained properly. Also, it is too easy to forget when your vector has names and when it does not, and this would make plans brittle.

@januz
Copy link

januz commented Apr 24, 2020

This is all true! I think I will stay with using .data for now. Seems like the most robust solution. Thank you!!

@maciejmotyka
Copy link
Author

I will keep using .names for now to test it out.
One inconvenience is as @januz noticed, that the names of the previous targets are not accessible inside the following transformations which would enable more elegant construction of new names.

Here, I'm trying to prepend the previous names:

plan <- drake::drake_plan(
    
    # Read haplotype seqs and clean their names 
    haps = target(
        read_haps(path = drake::file_in(path)),
        transform = map(
            path = c(
                "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
            ),
            .names = c("abqr_001",
                       "csnv_t5_tf001",
                       "csnv_t5_tf01",
                       "csnv_t10_tf01")
        )
    ), 
    
    # Align the haps using Muscle, ClustalO and ClustalW with default settings
    aln = target(
        msa::msa(inputSeqs = haps, method = methods),
        transform = cross(
            haps,
            methods = c("Muscle", "ClustalOmega", "ClustalW"),
            .names = outer(
                c("Muscle", "ClustO", "ClustW"),
                haps,
                paste,
                sep = "_"
            ) %>% as.vector()
        )
    )
)

#> Error in outer(c("Muscle", "ClustO", "ClustW"), haps, paste, sep = "_") : 
#> object 'haps' not found

Of course, I can just copy the first .names vector and paste it in place of haps inside the second target. However, if haps was of length 40 instead of 4 then pasting the whole vector again would make the plan less readable.

@wlandau
Copy link
Member

wlandau commented Apr 24, 2020

You can work with external variables to support your plan, such as the hap_names vector below.

library(drake)

hap_names <- c(
  "abqr_001",
  "csnv_t5_tf001",
  "csnv_t5_tf01",
  "csnv_t10_tf01"
)

plan <- drake_plan(
  haps = target(
    read_haps(path = drake::file_in(path)),
    transform = map(
      path = c(
        "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
      ),
      .names = !!hap_names
    )
  ), 
  aln = target(
    msa::msa(inputSeqs = haps, method = methods),
    transform = cross(
      haps,
      methods = c("Muscle", "ClustalOmega", "ClustalW"),
      .names = as.vector(outer(
        c("Muscle", "ClustO", "ClustW"),
        !!hap_names,
        paste,
        sep = "_"
      ))
    )
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   Muscle_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "Muscle"),
#>   ClustO_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalOmega"),
#>   ClustW_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "Muscle"),
#>   ClustO_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "Muscle"),
#>   ClustO_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalW"),
#>   Muscle_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "Muscle"),
#>   ClustO_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalW"),
#>   abqr_001 = read_haps(path = drake::file_in("/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes")),
#>   csnv_t5_tf001 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes")),
#>   csnv_t5_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes")),
#>   csnv_t10_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"))
#> )

Created on 2020-04-24 by the reprex package (v0.3.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants