Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

albertz · 2025-01-29T14:46:45Z

We currently do this in Torch Engine.forward_with_callback:

...
elif dataset.supports_seq_order_sorting():
    # We can sort it. Sort it in reverse to make sure that we have enough memory right at the beginning.
    print("Dataset supports sorting, i.e. it will be sorted for optimal performance.", file=log.v3)
    dataset.seq_ordering = "sorted_reverse"

MetaDataset.supports_seq_order_sorting is this:

def supports_seq_order_sorting(self) -> bool:
    """supports sorting"""
    if self.seq_order_control_dataset:
        return self.datasets[self.seq_order_control_dataset].supports_seq_order_sorting()
    if self._seq_lens or self._seq_order_seq_lens_file:
        return True
    return False

Here a seq_order_control_dataset was used, to supports_seq_order_sorting returns True.

However, then MetaDataset will completely ignore the seq_ordering. It will just leave the sorting, and any options completely up to seq_order_control_dataset.

Many other meta datasets, e.g. MultiProcDataset, PostprocessingDataset, are behaving just in the same way.

I'm not really sure what code to blame, or how to fix this. Maybe setting

dataset.seq_ordering = "sorted_reverse"

is just wrong? We also cannot easily change that without suddenly changing many existing setups in maybe unexpected ways.

Even fixing this now, i.e. specifically that a meta dataset is now sorted in forward, even by itself is already a change which might require a new behavior version, or option? It definitely will influence the results with most of our existing models (due to batching and padding).

(cc @NeoLegends @dorian-K)

The text was updated successfully, but these errors were encountered:

dorian-K · 2025-01-31T13:13:23Z

I believe that supports_seq_order_sorting should return False in this case, which is accurate because the MetaDataset itself does not support sorting its data, only the control dataset can do that

albertz · 2025-01-31T14:21:44Z

That depends on how you define what supports_seq_order_sorting is supposed to return. If it returns whether setting dataset.seq_ordering works or not, then yes, it should return False.

But that's not really the point here. The situation is, we can easily support this. The question is, should we somehow make setting dataset.seq_ordering work for the MetaDataset? Or some other mechanism? And then supports_seq_order_sorting would say whether this other mechanism works or not, which is does once we implement it. I see the issue in the fact that it doesn't work, so we should change/fix that, and supports_seq_order_sorting can stay as it is.

dorian-K · 2025-01-31T15:05:55Z

I'm a bit hesitant, because the purpose of setting seq_ordering in this case is to have the biggest batches right at the beginning. When MetaDataset simply forwards the seq_ordering to its control dataset, then this might not lead to the intended behaviour if the "real" data comes from another dataset (a dataset that is much more relevant for memory consumption relative to the control dataset), and the control dataset is just some constant size sequence (or similar).

albertz · 2025-01-31T15:58:43Z

What data key is used for the seq length for the sorting (and/or heuristic for approx seq length) is anyway up to the dataset. But usually e.g. the input sequence lengths are very correlated to the output sequence lengths, so this was not so much an issue. In case of MetaDataset, in all cases that we have had so far, just leaving sorting to the control dataset (as we anyway do now) is totally fine.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

albertz commented Jan 29, 2025

dorian-K commented Jan 31, 2025

albertz commented Jan 31, 2025

dorian-K commented Jan 31, 2025 •

edited

Loading

albertz commented Jan 31, 2025

Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

Comments

albertz commented Jan 29, 2025

dorian-K commented Jan 31, 2025

albertz commented Jan 31, 2025

dorian-K commented Jan 31, 2025 • edited Loading

albertz commented Jan 31, 2025

dorian-K commented Jan 31, 2025 •

edited

Loading