renaming grouping functions? #51

julianesiebourg · 2024-09-02T14:46:50Z

We have some functions that suggest shuffling "grouped" data that for me are a bit confusing:

designit::shuffle_grouped_data(batch_container,  allocate_var,  keep_together_vars = c(),  keep_separate_vars = c(),  n_min = NA,  
                                                   n_max = NA,  n_ideal = NA,  subgroup_var_name = NULL,  report_grouping_as_attribute = FALSE,  
                                                   prefer_big_groups = FALSE,  strict = TRUE,  fullTree = FALSE, maxCalls = 1e+06)

designit::mk_subgroup_shuffling_function(subgroup_vars, restrain_on_subgroup_levels = c(), n_swaps = 1)

designit::shuffle_with_subgroup_formation(subgroup_object,  subgroup_allocations,  keep_separate_vars = c(),
                                                                     report_grouping_as_attribute = FALSE)

I guess they come from the invivo example and are tailored to that.

Maybe I didn't fully get them, but for a simple grouping problem I had, none of them were working.

I had ~50 patients, 25with one, 25 with two measurements.
I wanted to put them in batches such that samples of the same patient were put in the same batch.

I thought a function "shuffle_grouped_data" would do this, given the variable names that form the groups (in my case Patient ID).

Iakov helped with a solution for that particular case, which, a bit more generalized, could be part of the package:

# not parametrized...
keep_groups_together <- function(bc, i) {
  d <- bc$get_samples(include_id = TRUE) |>
    mutate(location_id = row_number())
  # select random src location
  src_id <- d |>
    # exclude empty locations
    filter(!is.na(.sample_id)) |>
    sample_n(1) |>
    pull(location_id)
  stopifnot(length(src_id) == 1)

  # find all samples with matching `Subject ID` and timepoint
  all_src_id <- d |>
    filter(
      # exclude empty locations
      !is.na(.sample_id),
      # we are searching for matching samples
      `Subject ID` == d$`Subject ID`[src_id]
    ) |>
    pull(location_id)

  dst_id <- d |>
    filter(
      # we don't want source locations
      !location_id %in% all_src_id
    ) |>
    group_by(`Subject ID`) |>
    # we only choose empty or location of "lonely" samples
    filter(is.na(.sample_id) | n() == 1) |>
    # find suitable Run with enough space
    group_by(Run) |>
    filter(n_distinct(location_id) >= length(all_src_id)) |>
    ungroup() |>
    # choose destination Run
    filter(Run == sample(unique(Run), 1)) |>
    sample_n(length(all_src_id)) |>
    pull(location_id)
  list(
    src = c(all_src_id, dst_id),
    dst = c(dst_id, all_src_id)
  )
}

But then I wonder, should we discuss about the namings of all those functions so that its clearer what they do?

@ingitwetrust and @idavydov what are your thoughts?

The text was updated successfully, but these errors were encountered:

julianesiebourg · 2025-01-10T11:38:25Z

Just to mention, I need this function again as for the newest ranomization I again have subjects with different number of samples that need to be placed together on a plate.

idavydov · 2025-01-15T10:08:31Z

Thanks, @julianesiebourg . Do you think you would be able to make a merge request?

Here's what ChatGPT and me came up with so far

#' Shuffle samples while keeping groups together
#'
#' This function returns another function, which when called (with a \code{bc} 
#' object and iteration \code{i}), will shuffle the samples by grouping 
#' them based on the column you specify (e.g., \code{Subject ID}), and also
#' allows you to specify which column represents the batch/run/plate (e.g. \code{Run}). 
#'
#' @param group_id A tidy-evaluation style argument (unquoted column name) 
#'   representing the grouping column (e.g. \code{Subject ID}).
#' @param batch_col A tidy-evaluation style argument (unquoted column name)
#'   representing the batch variable (e.g. \code{Run} or \code{plate}).
#'
#' @return A function of the form \code{function(bc, i) \{\dots\}} that returns
#'   a list containing two vectors: \code{src} and \code{dst}, each defining
#'   the "from" and "to" location indices, respectively. If no valid shuffle 
#'   is possible, both \code{src} and \code{dst} will be empty integer vectors.
#'
#' @export
#'
#' @example man/examples/shuffle.R
## gpt-generated
shuffle_groups <- function(group_id, batch_var) {
  function(bc, i) {
    d <- bc$get_samples(include_id = TRUE) |>
      dplyr::mutate(location_id = dplyr::row_number())

    # Select one random source location (must be occupied).
    src_id <- d |>
      dplyr::filter(!is.na(.data$.sample_id)) |>
      dplyr::sample_n(1) |>
      dplyr::pull(.data$location_id)
    stopifnot(length(src_id) == 1)

    # Find all samples belonging to the same group as src:
    #   We use {{ group_id }} to refer to the grouping column
    #   and pull(d, {{ group_id }})[src_id] to get the group value at src_id
    all_src_id <- d |>
      dplyr::filter(
        !is.na(.data$.sample_id),
        {{ group_id }} == dplyr::pull(d, {{ group_id }})[src_id]
      ) |>
      dplyr::pull(.data$location_id)

    # Potential destination candidates:
    #   1) Group by the same grouping column
    #   2) Only keep empty or "lonely" locations (n()==1)
    #   3) Then group by batch_col and require enough space
    #   4) Pick one batches at random
    dst_candidates <- d |>
      dplyr::group_by({{ group_id }}) |>
      dplyr::filter(is.na(.data$.sample_id) | dplyr::n() == 1) |>
      dplyr::group_by({{ batch_col }}) |>
      dplyr::filter(dplyr::n_distinct(.data$location_id) >= length(all_src_id)) |>
      dplyr::ungroup()

    possible_batches <- dst_candidates |>
      dplyr::pull({{ batch_col }}) |>
      unique()

    # If there are no valid batches to move these samples to, return empty lists
    if (length(possible_batches) == 0) {
      return(list(src = integer(0), dst = integer(0)))
    }

    chosen_batch <- sample(possible_batches, 1)
    dst_candidates <- dst_candidates |>
      dplyr::filter({{ batch_col }} == chosen_batch)

    # If we don't have enough rows to move all group samples, return empty
    if (nrow(dst_candidates) < length(all_src_id)) {
      return(list(src = integer(0), dst = integer(0)))
    }

    # Now we can sample exactly as many rows as the group we need to relocate
    dst_id <- dst_candidates |>
      dplyr::sample_n(length(all_src_id)) |>
      dplyr::pull(.data$location_id)

    # Return a list of "src" and "dst" location vectors
    list(
      src = c(all_src_id, dst_id),
      dst = c(dst_id, all_src_id)
    )
  }
}

but that would probably require some tests and maybe we need to make sure that optimize_design() accepts list(src = integer(0), dst = integer(0)).

idavydov · 2025-01-21T14:29:37Z

Hi @julianesiebourg ,

this is minimally tested but not optimal by any means, what do you think?

#' Create a Shuffle-Function that Places Each Group into Exactly One Batch
#'
#' This function returns another function (a "shuffle-proposal function") that
#' can be used in `optimize_design()`. The returned function will assign each group
#' (e.g., each "Litter") so that it stays on a single batch (e.g., a specific "plate").
#'
#' Internally, it sorts groups by descending size and places them in batches
#' that have enough *unoccupied* capacity. If a group cannot fit, it retries a random
#' assignment from scratch. If it fails after `max_retries`, it stops with an error.
#'
#' @param batch_var Column in the container locations that defines the batch
#'   (e.g. `"plate"`).
#' @param group_id Column in the sample data that defines the group
#'   (e.g. `"Litter"`).
#' @param max_retries How many times to retry if a group doesn't fit anywhere.
#' @param quiet Whether to suppress progress messages about retries.
#'
#' @return A function `(batch_container, iteration) -> integer vector`
#'   suitable as `shuffle_proposal_func` in `optimize_design()`.
#'   The integer vector (length = # of locations) is the new `.sample_id` assignment
#'   per location row.
#'
#' @examples
#' library(designit)
#'
#' data("invivo_study_samples")
#' bc <- BatchContainer$new(dimensions = c("plate" = 4, "column" = 5, "row" = 3)) |>
#' # we sort by Strain to ensure poor assignment
#"  assign_in_order(dplyr::arrange(invivo_study_samples, Strain))
#'
#' # Shuffle so each Litter remains on a single plate, up to 5 random tries
#' shuffle_func <- mk_group_batch_shuffler("plate", "Litter")
#'
#' set.seed(43)
#' bc_opt <- optimize_design(
#'   batch_container = bc,
#'   scoring = osat_score_generator("plate", "Strain"),
#'   shuffle_proposal_func = shuffle_func,
#'   max_iter = 50
#' )
#' @export
mk_group_batch_shuffler <- function(
    batch_var = "plate",
    group_id = "Litter",
    max_retries = 100L,
    quiet = FALSE) {
  ## gpt-generated
  # Fix these parameters in the returned function
  force(batch_var)
  force(group_id)
  force(max_retries)
  force(quiet)

  # The function that optimize_design() will call each iteration
  function(batch_container, iteration) {
    # 1) Get location info & find how many FREE slots each batch has
    loc_df <- batch_container$get_locations() |>
      dplyr::mutate(.location_id = dplyr::row_number())
    n_loc <- nrow(loc_df)
    stopifnot(batch_container$has_samples)

    if (!batch_var %in% colnames(loc_df)) {
      stop("'", batch_var, "' not found in the container locations.")
    }

    # For each location, see if it's free:
    #   assignment[i] is NA -> location i is free.
    current_assign <- batch_container$assignment
    capacity_df <- batch_container$get_locations() |>
      dplyr::group_by(.data[[batch_var]]) |>
      dplyr::summarize(
        capacity = dplyr::n(),
        .groups = "drop"
      )
    batch_levels <- capacity_df[[batch_var]]

    # 2) Gather sample data to see how many assigned samples are in each group
    samp_df <- batch_container$get_samples(
      include_id = TRUE,
      remove_empty_locations = TRUE,
      as_tibble = FALSE
    )

    if (!group_id %in% colnames(samp_df)) {
      stop("'", group_id, "' not found among the sample columns.")
    }

    # Only consider actually assigned samples (with .sample_id != NA)
    group_sizes <- samp_df |>
      dplyr::group_by(.data[[group_id]]) |>
      dplyr::summarise(count = dplyr::n(), .groups = "drop") |>
      dplyr::arrange(dplyr::desc(.data[["count"]]))

    # If no assigned samples, just return the existing assignment
    stopifnot(nrow(group_sizes) > 0L)

    # 3) A helper that tries one random assignment of group->batch
    #    returning a df with (group, batch) or NULL if fail
    try_assign_once <- function() {
      # local copy of free capacity
      free_cap <- capacity_df$capacity

      out <- vector("list", nrow(group_sizes))
      for (i in seq_len(nrow(group_sizes))) {
        g_size <- group_sizes[["count"]][i]
        # feasible batches?
        feasible_batches <- which(free_cap >= g_size)

        if (length(feasible_batches) == 0L) {
          return(NULL)
        }
        # randomly pick 1 feasible batch
        # note: sample(vec, 1) returns incorrect results when vector is a single
        # integer
        chosen_idx <- feasible_batches[sample.int(length(feasible_batches), 1L)]
        # reduce capacity
        free_cap[chosen_idx] <- free_cap[chosen_idx] - g_size

        out[[i]] <- data.frame(
          group = group_sizes[[group_id]][i],
          batch = batch_levels[chosen_idx],
          stringsAsFactors = FALSE
        )
      }
      dplyr::bind_rows(out)
    }

    # 4) Try up to max_retries. If we get a valid group->batch map, proceed
    group_assign <- NULL
    attempt <- 1L
    for (attempt in seq_len(max_retries)) {
      ga <- try_assign_once()
      if (!is.null(ga)) {
        group_assign <- ga
        break
      }
    }
    if (is.null(group_assign)) {
      warning("Could not place all groups into batches after ", max_retries, " tries.")
      return(NULL)
    } else if (!quiet && attempt > 1) {
      # message("Groups assigned successfully on attempt #", attempt)
    }

    # 5) Build new assignment vector. We set sample assignment
    #    according to the new group->batch mapping (in order).
    new_assign <- rep(NA_integer_, n_loc)

    # note: sample_assign tells us each sample's group => batch
    sample_assign <- samp_df |>
      dplyr::select(
        ".sample_id",
        group_id = dplyr::all_of(group_id)
      ) |>
      dplyr::left_join(group_assign, by = c("group_id" = "group"))

    # Split location rows by batch
    loc_split <- split(loc_df, loc_df[[batch_var]])
    # Also split the sample df by assigned batch
    sample_split <- split(sample_assign, sample_assign[["batch"]])

    # For each batch level, fill its free slots with the assigned samples
    # in a consistent order.
    for (b_lev in names(loc_split)) {
      sub_loc <- loc_split[[b_lev]]
      # which row indices in the original loc_df are these?
      idx_in_loc_df <- sub_loc[[".location_id"]]

      # The samples assigned to this b_lev (may be NULL if no group assigned)
      smp_sub <- sample_split[[b_lev]]
      if (!is.null(smp_sub)) {
        n_smp <- nrow(smp_sub)
        n_slot <- nrow(sub_loc)
        if (n_smp > n_slot) {
          # This shouldn't happen if capacity checks are correct.
          # We treat it as a "failed attempt" => return NULL
          stop("Internal error: more samples than slots in batch ", b_lev)
        }
        # fill them in order
        new_assign[idx_in_loc_df[seq_len(n_smp)]] <- smp_sub[[".sample_id"]]
      }
    }

    # If we made it this far, new_assign is good.
    # Return the new assignment vector to the optimizer.
    new_assign
  }
}

idavydov · 2025-01-22T08:53:46Z

One more though here related to @ingitwetrust's idea of enumerating all possible. If we assume for a second that every plate has an infinite capacity, then the number of possible assignments is something like $n^k$, where $n$ is the number of plates and $k$ is the of groups. What we could potentially do is estimate this number (upper bound). If it's less than 1 million then we enumerate all possible assignments, e.g. using dynamic programming.

Complexity is (at least) $O(n^k)$ in the worst case unfortunately. Easy to prove: consider a case when $c_i\ge\sum_{j=1}^{k} w_j$ for each $i\in 1..n$, where $c_i$ is capacity of the plate $i$ and $w_j$ is size of the group $j$. Number of possible solutions is $n^k$; the complexity cannot be lower than the number of solutions.

ingitwetrust · 2025-01-22T09:37:38Z

Hi @davydov, Iakov ***@***.***> , yes n^k is the upper bound, but actually possible solutions satisfying plate size constraints as well as the n! divider for permuted solutions on a plate level limit the number down to a fraction of this upper bound. We can also set an upper limit of solutions to be returned, avoiding further recursive invocations of the function that generates solutions. How about the following procedure: 1. Create a list of group sizes Gi and plate sizes Sj. i=1...k, j= 1..n Sort G by size, in decreasing order. Create an Assignment list A with n elements (plates), each a list of groups assigned to that plate. Initially, no assignment. A has methods to determine the number of allocated positions and number of assigned groups for each plate easily. Set up an empty list V to store valid assignments. 2. Invoke shuffling generator function with G, A and upper limit of solutions L 3. In function, return immediately if length of V is greater than L 4. Look at first element of G and find out to how many elements (plates) in A it could be assigned 4. If there are more than 1 plates with the same number of filled positions as well as groups, remove all but one for follow-up. (To avoid trivial permuted solutions as well as equivalent solutions). 5. Rank possible assignments by increasing by number of already filled positions and number of already assigned groups, in that order. 6. If length(G)==1, add those solutions to global list V and return 7. Chop off the first element of G 8. For all possible assignments in the ranking order, invoke function again, i,.e. step 3 with updated G and A 9 return So this should in theory provide a list of (roughly, could be just very few more) solutions, provided there exist any. Favorable assignments with more balanced group numbers across plates should appear more towards the top of the list. Regarding termination... probably one should add an iteration depth counter as well to the function parameters to allow another termination in case we run unexpectedly deep into the recursion, based on a reasonable criterion on n and k. OK, has to be tested in practice I guess. :)

…

On Wed, Jan 22, 2025 at 9:54 AM Iakov Davydov ***@***.***> wrote: One more though here related to @ingitwetrust <https://github.com/ingitwetrust>'s idea of enumerating all possible. If we assume for a second that every plate has an infinite capacity, then the number of possible assignments is something like $n^k$, where $n$ is the number of plates and $k$ is the of groups. What we could potentially do is estimate this number (upper bound). If it's less than 1 million then we enumerate all possible assignments, e.g. using dynamic programming. Complexity is (at least) $O(n^k)$ in the worst case unfortunately (easy to prove, consider a case when $c_i\ge\sum_{j=1}^{k} w_j$ for each $i\in 1..n$, where $c_i$ is capacity of the plate $i$ and $w_j$ is size of the group $j$. — Reply to this email directly, view it on GitHub <#51 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/A3SZLUM632WIURGQEXZ5EN32L5MDBAVCNFSM6AAAAABNQPMFYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMBWGYZTIMZTGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Guido Steiner, PhD Senior Principal Scientist Pharmaceutical Sciences Pharmaceutical Sciences - Predictive Modeling and Data Analysis (PMDA) Roche Pharmaceutical Research and Early Development (pRED) Roche Innovation Center Basel F. Hoffmann-La Roche AG Building 93/4.01 Grenzacher Strasse CH-4070 Basel Phone +41 61 688 3329

-- Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or privileged information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited.

julianesiebourg added help wanted Extra attention is needed question Further information is requested labels Sep 2, 2024

idavydov mentioned this issue Sep 4, 2024

group shuffle helpers #52

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

renaming grouping functions? #51

renaming grouping functions? #51

julianesiebourg commented Sep 2, 2024

julianesiebourg commented Jan 10, 2025

idavydov commented Jan 15, 2025

idavydov commented Jan 21, 2025 •

edited

Loading

idavydov commented Jan 22, 2025 •

edited

Loading

ingitwetrust commented Jan 22, 2025 via email

renaming grouping functions? #51

renaming grouping functions? #51

Comments

julianesiebourg commented Sep 2, 2024

julianesiebourg commented Jan 10, 2025

idavydov commented Jan 15, 2025

idavydov commented Jan 21, 2025 • edited Loading

idavydov commented Jan 22, 2025 • edited Loading

ingitwetrust commented Jan 22, 2025 via email

idavydov commented Jan 21, 2025 •

edited

Loading

idavydov commented Jan 22, 2025 •

edited

Loading