Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

renaming grouping functions? #51

Open
julianesiebourg opened this issue Sep 2, 2024 · 5 comments
Open

renaming grouping functions? #51

julianesiebourg opened this issue Sep 2, 2024 · 5 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@julianesiebourg
Copy link
Collaborator

We have some functions that suggest shuffling "grouped" data that for me are a bit confusing:

designit::shuffle_grouped_data(batch_container,  allocate_var,  keep_together_vars = c(),  keep_separate_vars = c(),  n_min = NA,  
                                                   n_max = NA,  n_ideal = NA,  subgroup_var_name = NULL,  report_grouping_as_attribute = FALSE,  
                                                   prefer_big_groups = FALSE,  strict = TRUE,  fullTree = FALSE, maxCalls = 1e+06)

designit::mk_subgroup_shuffling_function(subgroup_vars, restrain_on_subgroup_levels = c(), n_swaps = 1)

designit::shuffle_with_subgroup_formation(subgroup_object,  subgroup_allocations,  keep_separate_vars = c(),
                                                                     report_grouping_as_attribute = FALSE)

I guess they come from the invivo example and are tailored to that.

Maybe I didn't fully get them, but for a simple grouping problem I had, none of them were working.

  • I had ~50 patients, 25with one, 25 with two measurements.
  • I wanted to put them in batches such that samples of the same patient were put in the same batch.

I thought a function "shuffle_grouped_data" would do this, given the variable names that form the groups (in my case Patient ID).

Iakov helped with a solution for that particular case, which, a bit more generalized, could be part of the package:

# not parametrized...
keep_groups_together <- function(bc, i) {
  d <- bc$get_samples(include_id = TRUE) |>
    mutate(location_id = row_number())
  # select random src location
  src_id <- d |>
    # exclude empty locations
    filter(!is.na(.sample_id)) |>
    sample_n(1) |>
    pull(location_id)
  stopifnot(length(src_id) == 1)

  # find all samples with matching `Subject ID` and timepoint
  all_src_id <- d |>
    filter(
      # exclude empty locations
      !is.na(.sample_id),
      # we are searching for matching samples
      `Subject ID` == d$`Subject ID`[src_id]
    ) |>
    pull(location_id)

  dst_id <- d |>
    filter(
      # we don't want source locations
      !location_id %in% all_src_id
    ) |>
    group_by(`Subject ID`) |>
    # we only choose empty or location of "lonely" samples
    filter(is.na(.sample_id) | n() == 1) |>
    # find suitable Run with enough space
    group_by(Run) |>
    filter(n_distinct(location_id) >= length(all_src_id)) |>
    ungroup() |>
    # choose destination Run
    filter(Run == sample(unique(Run), 1)) |>
    sample_n(length(all_src_id)) |>
    pull(location_id)
  list(
    src = c(all_src_id, dst_id),
    dst = c(dst_id, all_src_id)
  )
}

But then I wonder, should we discuss about the namings of all those functions so that its clearer what they do?

@ingitwetrust and @idavydov what are your thoughts?

@julianesiebourg julianesiebourg added help wanted Extra attention is needed question Further information is requested labels Sep 2, 2024
@julianesiebourg
Copy link
Collaborator Author

Just to mention, I need this function again as for the newest ranomization I again have subjects with different number of samples that need to be placed together on a plate.

@idavydov
Copy link
Collaborator

Thanks, @julianesiebourg . Do you think you would be able to make a merge request?

Here's what ChatGPT and me came up with so far

#' Shuffle samples while keeping groups together
#'
#' This function returns another function, which when called (with a \code{bc} 
#' object and iteration \code{i}), will shuffle the samples by grouping 
#' them based on the column you specify (e.g., \code{Subject ID}), and also
#' allows you to specify which column represents the batch/run/plate (e.g. \code{Run}). 
#'
#' @param group_id A tidy-evaluation style argument (unquoted column name) 
#'   representing the grouping column (e.g. \code{Subject ID}).
#' @param batch_col A tidy-evaluation style argument (unquoted column name)
#'   representing the batch variable (e.g. \code{Run} or \code{plate}).
#'
#' @return A function of the form \code{function(bc, i) \{\dots\}} that returns
#'   a list containing two vectors: \code{src} and \code{dst}, each defining
#'   the "from" and "to" location indices, respectively. If no valid shuffle 
#'   is possible, both \code{src} and \code{dst} will be empty integer vectors.
#'
#' @export
#'
#' @example man/examples/shuffle.R
## gpt-generated
shuffle_groups <- function(group_id, batch_var) {
  function(bc, i) {
    d <- bc$get_samples(include_id = TRUE) |>
      dplyr::mutate(location_id = dplyr::row_number())

    # Select one random source location (must be occupied).
    src_id <- d |>
      dplyr::filter(!is.na(.data$.sample_id)) |>
      dplyr::sample_n(1) |>
      dplyr::pull(.data$location_id)
    stopifnot(length(src_id) == 1)

    # Find all samples belonging to the same group as src:
    #   We use {{ group_id }} to refer to the grouping column
    #   and pull(d, {{ group_id }})[src_id] to get the group value at src_id
    all_src_id <- d |>
      dplyr::filter(
        !is.na(.data$.sample_id),
        {{ group_id }} == dplyr::pull(d, {{ group_id }})[src_id]
      ) |>
      dplyr::pull(.data$location_id)

    # Potential destination candidates:
    #   1) Group by the same grouping column
    #   2) Only keep empty or "lonely" locations (n()==1)
    #   3) Then group by batch_col and require enough space
    #   4) Pick one batches at random
    dst_candidates <- d |>
      dplyr::group_by({{ group_id }}) |>
      dplyr::filter(is.na(.data$.sample_id) | dplyr::n() == 1) |>
      dplyr::group_by({{ batch_col }}) |>
      dplyr::filter(dplyr::n_distinct(.data$location_id) >= length(all_src_id)) |>
      dplyr::ungroup()

    possible_batches <- dst_candidates |>
      dplyr::pull({{ batch_col }}) |>
      unique()

    # If there are no valid batches to move these samples to, return empty lists
    if (length(possible_batches) == 0) {
      return(list(src = integer(0), dst = integer(0)))
    }

    chosen_batch <- sample(possible_batches, 1)
    dst_candidates <- dst_candidates |>
      dplyr::filter({{ batch_col }} == chosen_batch)

    # If we don't have enough rows to move all group samples, return empty
    if (nrow(dst_candidates) < length(all_src_id)) {
      return(list(src = integer(0), dst = integer(0)))
    }

    # Now we can sample exactly as many rows as the group we need to relocate
    dst_id <- dst_candidates |>
      dplyr::sample_n(length(all_src_id)) |>
      dplyr::pull(.data$location_id)

    # Return a list of "src" and "dst" location vectors
    list(
      src = c(all_src_id, dst_id),
      dst = c(dst_id, all_src_id)
    )
  }
}

but that would probably require some tests and maybe we need to make sure that optimize_design() accepts list(src = integer(0), dst = integer(0)).

@idavydov
Copy link
Collaborator

idavydov commented Jan 21, 2025

Hi @julianesiebourg ,

this is minimally tested but not optimal by any means, what do you think?

#' Create a Shuffle-Function that Places Each Group into Exactly One Batch
#'
#' This function returns another function (a "shuffle-proposal function") that
#' can be used in `optimize_design()`. The returned function will assign each group
#' (e.g., each "Litter") so that it stays on a single batch (e.g., a specific "plate").
#'
#' Internally, it sorts groups by descending size and places them in batches
#' that have enough *unoccupied* capacity. If a group cannot fit, it retries a random
#' assignment from scratch. If it fails after `max_retries`, it stops with an error.
#'
#' @param batch_var Column in the container locations that defines the batch
#'   (e.g. `"plate"`).
#' @param group_id Column in the sample data that defines the group
#'   (e.g. `"Litter"`).
#' @param max_retries How many times to retry if a group doesn't fit anywhere.
#' @param quiet Whether to suppress progress messages about retries.
#'
#' @return A function `(batch_container, iteration) -> integer vector`
#'   suitable as `shuffle_proposal_func` in `optimize_design()`.
#'   The integer vector (length = # of locations) is the new `.sample_id` assignment
#'   per location row.
#'
#' @examples
#' library(designit)
#'
#' data("invivo_study_samples")
#' bc <- BatchContainer$new(dimensions = c("plate" = 4, "column" = 5, "row" = 3)) |>
#' # we sort by Strain to ensure poor assignment
#"  assign_in_order(dplyr::arrange(invivo_study_samples, Strain))
#'
#' # Shuffle so each Litter remains on a single plate, up to 5 random tries
#' shuffle_func <- mk_group_batch_shuffler("plate", "Litter")
#'
#' set.seed(43)
#' bc_opt <- optimize_design(
#'   batch_container = bc,
#'   scoring = osat_score_generator("plate", "Strain"),
#'   shuffle_proposal_func = shuffle_func,
#'   max_iter = 50
#' )
#' @export
mk_group_batch_shuffler <- function(
    batch_var = "plate",
    group_id = "Litter",
    max_retries = 100L,
    quiet = FALSE) {
  ## gpt-generated
  # Fix these parameters in the returned function
  force(batch_var)
  force(group_id)
  force(max_retries)
  force(quiet)

  # The function that optimize_design() will call each iteration
  function(batch_container, iteration) {
    # 1) Get location info & find how many FREE slots each batch has
    loc_df <- batch_container$get_locations() |>
      dplyr::mutate(.location_id = dplyr::row_number())
    n_loc <- nrow(loc_df)
    stopifnot(batch_container$has_samples)

    if (!batch_var %in% colnames(loc_df)) {
      stop("'", batch_var, "' not found in the container locations.")
    }

    # For each location, see if it's free:
    #   assignment[i] is NA -> location i is free.
    current_assign <- batch_container$assignment
    capacity_df <- batch_container$get_locations() |>
      dplyr::group_by(.data[[batch_var]]) |>
      dplyr::summarize(
        capacity = dplyr::n(),
        .groups = "drop"
      )
    batch_levels <- capacity_df[[batch_var]]

    # 2) Gather sample data to see how many assigned samples are in each group
    samp_df <- batch_container$get_samples(
      include_id = TRUE,
      remove_empty_locations = TRUE,
      as_tibble = FALSE
    )

    if (!group_id %in% colnames(samp_df)) {
      stop("'", group_id, "' not found among the sample columns.")
    }

    # Only consider actually assigned samples (with .sample_id != NA)
    group_sizes <- samp_df |>
      dplyr::group_by(.data[[group_id]]) |>
      dplyr::summarise(count = dplyr::n(), .groups = "drop") |>
      dplyr::arrange(dplyr::desc(.data[["count"]]))

    # If no assigned samples, just return the existing assignment
    stopifnot(nrow(group_sizes) > 0L)

    # 3) A helper that tries one random assignment of group->batch
    #    returning a df with (group, batch) or NULL if fail
    try_assign_once <- function() {
      # local copy of free capacity
      free_cap <- capacity_df$capacity

      out <- vector("list", nrow(group_sizes))
      for (i in seq_len(nrow(group_sizes))) {
        g_size <- group_sizes[["count"]][i]
        # feasible batches?
        feasible_batches <- which(free_cap >= g_size)

        if (length(feasible_batches) == 0L) {
          return(NULL)
        }
        # randomly pick 1 feasible batch
        # note: sample(vec, 1) returns incorrect results when vector is a single
        # integer
        chosen_idx <- feasible_batches[sample.int(length(feasible_batches), 1L)]
        # reduce capacity
        free_cap[chosen_idx] <- free_cap[chosen_idx] - g_size

        out[[i]] <- data.frame(
          group = group_sizes[[group_id]][i],
          batch = batch_levels[chosen_idx],
          stringsAsFactors = FALSE
        )
      }
      dplyr::bind_rows(out)
    }

    # 4) Try up to max_retries. If we get a valid group->batch map, proceed
    group_assign <- NULL
    attempt <- 1L
    for (attempt in seq_len(max_retries)) {
      ga <- try_assign_once()
      if (!is.null(ga)) {
        group_assign <- ga
        break
      }
    }
    if (is.null(group_assign)) {
      warning("Could not place all groups into batches after ", max_retries, " tries.")
      return(NULL)
    } else if (!quiet && attempt > 1) {
      # message("Groups assigned successfully on attempt #", attempt)
    }

    # 5) Build new assignment vector. We set sample assignment
    #    according to the new group->batch mapping (in order).
    new_assign <- rep(NA_integer_, n_loc)

    # note: sample_assign tells us each sample's group => batch
    sample_assign <- samp_df |>
      dplyr::select(
        ".sample_id",
        group_id = dplyr::all_of(group_id)
      ) |>
      dplyr::left_join(group_assign, by = c("group_id" = "group"))

    # Split location rows by batch
    loc_split <- split(loc_df, loc_df[[batch_var]])
    # Also split the sample df by assigned batch
    sample_split <- split(sample_assign, sample_assign[["batch"]])

    # For each batch level, fill its free slots with the assigned samples
    # in a consistent order.
    for (b_lev in names(loc_split)) {
      sub_loc <- loc_split[[b_lev]]
      # which row indices in the original loc_df are these?
      idx_in_loc_df <- sub_loc[[".location_id"]]

      # The samples assigned to this b_lev (may be NULL if no group assigned)
      smp_sub <- sample_split[[b_lev]]
      if (!is.null(smp_sub)) {
        n_smp <- nrow(smp_sub)
        n_slot <- nrow(sub_loc)
        if (n_smp > n_slot) {
          # This shouldn't happen if capacity checks are correct.
          # We treat it as a "failed attempt" => return NULL
          stop("Internal error: more samples than slots in batch ", b_lev)
        }
        # fill them in order
        new_assign[idx_in_loc_df[seq_len(n_smp)]] <- smp_sub[[".sample_id"]]
      }
    }

    # If we made it this far, new_assign is good.
    # Return the new assignment vector to the optimizer.
    new_assign
  }
}

@idavydov
Copy link
Collaborator

idavydov commented Jan 22, 2025

One more though here related to @ingitwetrust's idea of enumerating all possible. If we assume for a second that every plate has an infinite capacity, then the number of possible assignments is something like $n^k$, where $n$ is the number of plates and $k$ is the of groups. What we could potentially do is estimate this number (upper bound). If it's less than 1 million then we enumerate all possible assignments, e.g. using dynamic programming.

Complexity is (at least) $O(n^k)$ in the worst case unfortunately. Easy to prove: consider a case when $c_i\ge\sum_{j=1}^{k} w_j$ for each $i\in 1..n$, where $c_i$ is capacity of the plate $i$ and $w_j$ is size of the group $j$. Number of possible solutions is $n^k$; the complexity cannot be lower than the number of solutions.

@ingitwetrust
Copy link
Collaborator

ingitwetrust commented Jan 22, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants