-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
renaming grouping functions? #51
Comments
Just to mention, I need this function again as for the newest ranomization I again have subjects with different number of samples that need to be placed together on a plate. |
Thanks, @julianesiebourg . Do you think you would be able to make a merge request? Here's what ChatGPT and me came up with so far #' Shuffle samples while keeping groups together
#'
#' This function returns another function, which when called (with a \code{bc}
#' object and iteration \code{i}), will shuffle the samples by grouping
#' them based on the column you specify (e.g., \code{Subject ID}), and also
#' allows you to specify which column represents the batch/run/plate (e.g. \code{Run}).
#'
#' @param group_id A tidy-evaluation style argument (unquoted column name)
#' representing the grouping column (e.g. \code{Subject ID}).
#' @param batch_col A tidy-evaluation style argument (unquoted column name)
#' representing the batch variable (e.g. \code{Run} or \code{plate}).
#'
#' @return A function of the form \code{function(bc, i) \{\dots\}} that returns
#' a list containing two vectors: \code{src} and \code{dst}, each defining
#' the "from" and "to" location indices, respectively. If no valid shuffle
#' is possible, both \code{src} and \code{dst} will be empty integer vectors.
#'
#' @export
#'
#' @example man/examples/shuffle.R
## gpt-generated
shuffle_groups <- function(group_id, batch_var) {
function(bc, i) {
d <- bc$get_samples(include_id = TRUE) |>
dplyr::mutate(location_id = dplyr::row_number())
# Select one random source location (must be occupied).
src_id <- d |>
dplyr::filter(!is.na(.data$.sample_id)) |>
dplyr::sample_n(1) |>
dplyr::pull(.data$location_id)
stopifnot(length(src_id) == 1)
# Find all samples belonging to the same group as src:
# We use {{ group_id }} to refer to the grouping column
# and pull(d, {{ group_id }})[src_id] to get the group value at src_id
all_src_id <- d |>
dplyr::filter(
!is.na(.data$.sample_id),
{{ group_id }} == dplyr::pull(d, {{ group_id }})[src_id]
) |>
dplyr::pull(.data$location_id)
# Potential destination candidates:
# 1) Group by the same grouping column
# 2) Only keep empty or "lonely" locations (n()==1)
# 3) Then group by batch_col and require enough space
# 4) Pick one batches at random
dst_candidates <- d |>
dplyr::group_by({{ group_id }}) |>
dplyr::filter(is.na(.data$.sample_id) | dplyr::n() == 1) |>
dplyr::group_by({{ batch_col }}) |>
dplyr::filter(dplyr::n_distinct(.data$location_id) >= length(all_src_id)) |>
dplyr::ungroup()
possible_batches <- dst_candidates |>
dplyr::pull({{ batch_col }}) |>
unique()
# If there are no valid batches to move these samples to, return empty lists
if (length(possible_batches) == 0) {
return(list(src = integer(0), dst = integer(0)))
}
chosen_batch <- sample(possible_batches, 1)
dst_candidates <- dst_candidates |>
dplyr::filter({{ batch_col }} == chosen_batch)
# If we don't have enough rows to move all group samples, return empty
if (nrow(dst_candidates) < length(all_src_id)) {
return(list(src = integer(0), dst = integer(0)))
}
# Now we can sample exactly as many rows as the group we need to relocate
dst_id <- dst_candidates |>
dplyr::sample_n(length(all_src_id)) |>
dplyr::pull(.data$location_id)
# Return a list of "src" and "dst" location vectors
list(
src = c(all_src_id, dst_id),
dst = c(dst_id, all_src_id)
)
}
} but that would probably require some tests and maybe we need to make sure that |
Hi @julianesiebourg , this is minimally tested but not optimal by any means, what do you think? #' Create a Shuffle-Function that Places Each Group into Exactly One Batch
#'
#' This function returns another function (a "shuffle-proposal function") that
#' can be used in `optimize_design()`. The returned function will assign each group
#' (e.g., each "Litter") so that it stays on a single batch (e.g., a specific "plate").
#'
#' Internally, it sorts groups by descending size and places them in batches
#' that have enough *unoccupied* capacity. If a group cannot fit, it retries a random
#' assignment from scratch. If it fails after `max_retries`, it stops with an error.
#'
#' @param batch_var Column in the container locations that defines the batch
#' (e.g. `"plate"`).
#' @param group_id Column in the sample data that defines the group
#' (e.g. `"Litter"`).
#' @param max_retries How many times to retry if a group doesn't fit anywhere.
#' @param quiet Whether to suppress progress messages about retries.
#'
#' @return A function `(batch_container, iteration) -> integer vector`
#' suitable as `shuffle_proposal_func` in `optimize_design()`.
#' The integer vector (length = # of locations) is the new `.sample_id` assignment
#' per location row.
#'
#' @examples
#' library(designit)
#'
#' data("invivo_study_samples")
#' bc <- BatchContainer$new(dimensions = c("plate" = 4, "column" = 5, "row" = 3)) |>
#' # we sort by Strain to ensure poor assignment
#" assign_in_order(dplyr::arrange(invivo_study_samples, Strain))
#'
#' # Shuffle so each Litter remains on a single plate, up to 5 random tries
#' shuffle_func <- mk_group_batch_shuffler("plate", "Litter")
#'
#' set.seed(43)
#' bc_opt <- optimize_design(
#' batch_container = bc,
#' scoring = osat_score_generator("plate", "Strain"),
#' shuffle_proposal_func = shuffle_func,
#' max_iter = 50
#' )
#' @export
mk_group_batch_shuffler <- function(
batch_var = "plate",
group_id = "Litter",
max_retries = 100L,
quiet = FALSE) {
## gpt-generated
# Fix these parameters in the returned function
force(batch_var)
force(group_id)
force(max_retries)
force(quiet)
# The function that optimize_design() will call each iteration
function(batch_container, iteration) {
# 1) Get location info & find how many FREE slots each batch has
loc_df <- batch_container$get_locations() |>
dplyr::mutate(.location_id = dplyr::row_number())
n_loc <- nrow(loc_df)
stopifnot(batch_container$has_samples)
if (!batch_var %in% colnames(loc_df)) {
stop("'", batch_var, "' not found in the container locations.")
}
# For each location, see if it's free:
# assignment[i] is NA -> location i is free.
current_assign <- batch_container$assignment
capacity_df <- batch_container$get_locations() |>
dplyr::group_by(.data[[batch_var]]) |>
dplyr::summarize(
capacity = dplyr::n(),
.groups = "drop"
)
batch_levels <- capacity_df[[batch_var]]
# 2) Gather sample data to see how many assigned samples are in each group
samp_df <- batch_container$get_samples(
include_id = TRUE,
remove_empty_locations = TRUE,
as_tibble = FALSE
)
if (!group_id %in% colnames(samp_df)) {
stop("'", group_id, "' not found among the sample columns.")
}
# Only consider actually assigned samples (with .sample_id != NA)
group_sizes <- samp_df |>
dplyr::group_by(.data[[group_id]]) |>
dplyr::summarise(count = dplyr::n(), .groups = "drop") |>
dplyr::arrange(dplyr::desc(.data[["count"]]))
# If no assigned samples, just return the existing assignment
stopifnot(nrow(group_sizes) > 0L)
# 3) A helper that tries one random assignment of group->batch
# returning a df with (group, batch) or NULL if fail
try_assign_once <- function() {
# local copy of free capacity
free_cap <- capacity_df$capacity
out <- vector("list", nrow(group_sizes))
for (i in seq_len(nrow(group_sizes))) {
g_size <- group_sizes[["count"]][i]
# feasible batches?
feasible_batches <- which(free_cap >= g_size)
if (length(feasible_batches) == 0L) {
return(NULL)
}
# randomly pick 1 feasible batch
# note: sample(vec, 1) returns incorrect results when vector is a single
# integer
chosen_idx <- feasible_batches[sample.int(length(feasible_batches), 1L)]
# reduce capacity
free_cap[chosen_idx] <- free_cap[chosen_idx] - g_size
out[[i]] <- data.frame(
group = group_sizes[[group_id]][i],
batch = batch_levels[chosen_idx],
stringsAsFactors = FALSE
)
}
dplyr::bind_rows(out)
}
# 4) Try up to max_retries. If we get a valid group->batch map, proceed
group_assign <- NULL
attempt <- 1L
for (attempt in seq_len(max_retries)) {
ga <- try_assign_once()
if (!is.null(ga)) {
group_assign <- ga
break
}
}
if (is.null(group_assign)) {
warning("Could not place all groups into batches after ", max_retries, " tries.")
return(NULL)
} else if (!quiet && attempt > 1) {
# message("Groups assigned successfully on attempt #", attempt)
}
# 5) Build new assignment vector. We set sample assignment
# according to the new group->batch mapping (in order).
new_assign <- rep(NA_integer_, n_loc)
# note: sample_assign tells us each sample's group => batch
sample_assign <- samp_df |>
dplyr::select(
".sample_id",
group_id = dplyr::all_of(group_id)
) |>
dplyr::left_join(group_assign, by = c("group_id" = "group"))
# Split location rows by batch
loc_split <- split(loc_df, loc_df[[batch_var]])
# Also split the sample df by assigned batch
sample_split <- split(sample_assign, sample_assign[["batch"]])
# For each batch level, fill its free slots with the assigned samples
# in a consistent order.
for (b_lev in names(loc_split)) {
sub_loc <- loc_split[[b_lev]]
# which row indices in the original loc_df are these?
idx_in_loc_df <- sub_loc[[".location_id"]]
# The samples assigned to this b_lev (may be NULL if no group assigned)
smp_sub <- sample_split[[b_lev]]
if (!is.null(smp_sub)) {
n_smp <- nrow(smp_sub)
n_slot <- nrow(sub_loc)
if (n_smp > n_slot) {
# This shouldn't happen if capacity checks are correct.
# We treat it as a "failed attempt" => return NULL
stop("Internal error: more samples than slots in batch ", b_lev)
}
# fill them in order
new_assign[idx_in_loc_df[seq_len(n_smp)]] <- smp_sub[[".sample_id"]]
}
}
# If we made it this far, new_assign is good.
# Return the new assignment vector to the optimizer.
new_assign
}
} |
One more though here related to @ingitwetrust's idea of enumerating all possible. If we assume for a second that every plate has an infinite capacity, then the number of possible assignments is something like Complexity is (at least) |
Hi @davydov, Iakov ***@***.***> , yes n^k is the upper bound,
but actually possible solutions satisfying plate size constraints as well
as the n! divider for permuted solutions
on a plate level limit the number down to a fraction of this upper bound.
We can also set an upper limit of solutions to be returned, avoiding
further recursive
invocations of the function that generates solutions. How about the
following procedure:
1. Create a list of group sizes Gi and plate sizes Sj. i=1...k, j= 1..n
Sort G by size, in decreasing order.
Create an Assignment list A with n elements (plates), each a list of groups
assigned to that plate. Initially, no assignment.
A has methods to determine the number of allocated positions and number of
assigned groups for each plate easily.
Set up an empty list V to store valid assignments.
2. Invoke shuffling generator function with G, A and upper limit of
solutions L
3. In function, return immediately if length of V is greater than L
4. Look at first element of G and find out to how many elements (plates) in
A it could be assigned
4. If there are more than 1 plates with the same number of filled positions
as well as groups, remove all but one
for follow-up. (To avoid trivial permuted solutions as well as
equivalent solutions).
5. Rank possible assignments by increasing by number of already filled
positions and number of already assigned
groups, in that order.
6. If length(G)==1, add those solutions to global list V and return
7. Chop off the first element of G
8. For all possible assignments in the ranking order, invoke function
again, i,.e. step 3 with updated G and A
9 return
So this should in theory provide a list of (roughly, could be just very few
more) solutions, provided there exist any.
Favorable assignments with more balanced group numbers across plates should
appear more towards the top
of the list.
Regarding termination... probably one should add an iteration depth counter
as well to the function parameters to allow
another termination in case we run unexpectedly deep into the recursion,
based on a reasonable criterion on n and k.
OK, has to be tested in practice I guess. :)
…On Wed, Jan 22, 2025 at 9:54 AM Iakov Davydov ***@***.***> wrote:
One more though here related to @ingitwetrust
<https://github.com/ingitwetrust>'s idea of enumerating all possible. If
we assume for a second that every plate has an infinite capacity, then the
number of possible assignments is something like $n^k$, where $n$ is the
number of plates and $k$ is the of groups. What we could potentially do
is estimate this number (upper bound). If it's less than 1 million then we
enumerate all possible assignments, e.g. using dynamic programming.
Complexity is (at least) $O(n^k)$ in the worst case unfortunately (easy
to prove, consider a case when $c_i\ge\sum_{j=1}^{k} w_j$ for each $i\in
1..n$, where $c_i$ is capacity of the plate $i$ and $w_j$ is size of the
group $j$.
—
Reply to this email directly, view it on GitHub
<#51 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A3SZLUM632WIURGQEXZ5EN32L5MDBAVCNFSM6AAAAABNQPMFYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMBWGYZTIMZTGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Guido Steiner, PhD
Senior Principal Scientist
Pharmaceutical Sciences
Pharmaceutical Sciences - Predictive Modeling and Data Analysis (PMDA)
Roche Pharmaceutical Research and Early Development (pRED)
Roche Innovation Center Basel
F. Hoffmann-La Roche AG
Building 93/4.01
Grenzacher Strasse
CH-4070 Basel
Phone +41 61 688 3329
--
Confidentiality Note: This message is intended only for the use of the
named recipient(s) and may contain confidential and/or privileged
information. If you are not the intended recipient, please contact the
sender and delete this message. Any unauthorized use of the information
contained in this message is prohibited.
|
We have some functions that suggest shuffling "grouped" data that for me are a bit confusing:
I guess they come from the invivo example and are tailored to that.
Maybe I didn't fully get them, but for a simple grouping problem I had, none of them were working.
I thought a function "shuffle_grouped_data" would do this, given the variable names that form the groups (in my case Patient ID).
Iakov helped with a solution for that particular case, which, a bit more generalized, could be part of the package:
But then I wonder, should we discuss about the namings of all those functions so that its clearer what they do?
@ingitwetrust and @idavydov what are your thoughts?
The text was updated successfully, but these errors were encountered: