Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically sorting dataset does not work with Torch engine forward + MetaDatasets #1683

Open
albertz opened this issue Jan 29, 2025 · 4 comments

Comments

@albertz
Copy link
Member

albertz commented Jan 29, 2025

We currently do this in Torch Engine.forward_with_callback:

...
elif dataset.supports_seq_order_sorting():
    # We can sort it. Sort it in reverse to make sure that we have enough memory right at the beginning.
    print("Dataset supports sorting, i.e. it will be sorted for optimal performance.", file=log.v3)
    dataset.seq_ordering = "sorted_reverse"

MetaDataset.supports_seq_order_sorting is this:

def supports_seq_order_sorting(self) -> bool:
    """supports sorting"""
    if self.seq_order_control_dataset:
        return self.datasets[self.seq_order_control_dataset].supports_seq_order_sorting()
    if self._seq_lens or self._seq_order_seq_lens_file:
        return True
    return False

Here a seq_order_control_dataset was used, to supports_seq_order_sorting returns True.

However, then MetaDataset will completely ignore the seq_ordering. It will just leave the sorting, and any options completely up to seq_order_control_dataset.

Many other meta datasets, e.g. MultiProcDataset, PostprocessingDataset, are behaving just in the same way.

I'm not really sure what code to blame, or how to fix this. Maybe setting

dataset.seq_ordering = "sorted_reverse"

is just wrong? We also cannot easily change that without suddenly changing many existing setups in maybe unexpected ways.

Even fixing this now, i.e. specifically that a meta dataset is now sorted in forward, even by itself is already a change which might require a new behavior version, or option? It definitely will influence the results with most of our existing models (due to batching and padding).

(cc @NeoLegends @dorian-K)

@dorian-K
Copy link
Contributor

I believe that supports_seq_order_sorting should return False in this case, which is accurate because the MetaDataset itself does not support sorting its data, only the control dataset can do that

@albertz
Copy link
Member Author

albertz commented Jan 31, 2025

That depends on how you define what supports_seq_order_sorting is supposed to return. If it returns whether setting dataset.seq_ordering works or not, then yes, it should return False.

But that's not really the point here. The situation is, we can easily support this. The question is, should we somehow make setting dataset.seq_ordering work for the MetaDataset? Or some other mechanism? And then supports_seq_order_sorting would say whether this other mechanism works or not, which is does once we implement it. I see the issue in the fact that it doesn't work, so we should change/fix that, and supports_seq_order_sorting can stay as it is.

@dorian-K
Copy link
Contributor

dorian-K commented Jan 31, 2025

I'm a bit hesitant, because the purpose of setting seq_ordering in this case is to have the biggest batches right at the beginning. When MetaDataset simply forwards the seq_ordering to its control dataset, then this might not lead to the intended behaviour if the "real" data comes from another dataset (a dataset that is much more relevant for memory consumption relative to the control dataset), and the control dataset is just some constant size sequence (or similar).

@albertz
Copy link
Member Author

albertz commented Jan 31, 2025

What data key is used for the seq length for the sorting (and/or heuristic for approx seq length) is anyway up to the dataset. But usually e.g. the input sequence lengths are very correlated to the output sequence lengths, so this was not so much an issue. In case of MetaDataset, in all cases that we have had so far, just leaving sorting to the control dataset (as we anyway do now) is totally fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants