
RoPE fixes for 1.5, bfloat16 support in prepare_dataset, gradient_accumulation grad norm undefined fix #107

Merged: 5 commits merged into main from multiple-fixes on Dec 2, 2024

Conversation

a-r-r-o-w
Owner

No description provided.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review December 1, 2024 16:06
@a-r-r-o-w a-r-r-o-w requested a review from sayakpaul December 1, 2024 16:06
@Leojc Leojc mentioned this pull request Dec 2, 2024
@sayakpaul sayakpaul (Collaborator) left a comment

Thank you!

Comment on lines +401 to +402
RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
Collaborator

In the case of DeepSpeed, we won't be able to access transformer.config. Let's use model_config?

Owner Author

We can access it at this point because it is still not wrapped by DeepSpeed.
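To illustrate the point above, the RoPE base resolution is derived from static config values cached before the model is wrapped (e.g. by DeepSpeed), while plain attribute access to `transformer.config` still works. The sketch below stubs the model with a dummy config; the `sample_height`/`sample_width` values and `VAE_SCALE_FACTOR_SPATIAL = 8` are illustrative assumptions, not values confirmed by this PR.

```python
# Sketch: compute RoPE base resolution from static config values *before*
# the model is wrapped, so plain attribute access still works.
VAE_SCALE_FACTOR_SPATIAL = 8  # assumed spatial downscale factor of the VAE


class DummyConfig:
    sample_height = 60  # latent-space height (illustrative value)
    sample_width = 90   # latent-space width (illustrative value)


class DummyTransformer:
    config = DummyConfig()


transformer = DummyTransformer()
# Cache the pixel-space base resolution once, up front.
RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
print(RoPE_BASE_HEIGHT, RoPE_BASE_WIDTH)  # -> 480 720
```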

@@ -828,6 +833,10 @@ def load_model_hook(models, input_dir):
gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
logs.update({
Collaborator

Should we not consider if we're using DeepSpeed and log it accordingly?

I'm referring to the following comment:

# gradnorm + deepspeed: https://github.com/microsoft/DeepSpeed/issues/4555

@a-r-r-o-w a-r-r-o-w (Owner Author) Dec 2, 2024

We cannot calculate grad_norm for DeepSpeed, so this will be NaN or 0, I think. This is because the gradient step is handled internally in DeepSpeed, and the gradients are cleared before we have access to them. Getting access would require a backward hook on the last module, but we can skip that for now and revisit it in the future, since it didn't work before these changes either.
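For reference, a global gradient norm like the one `get_gradient_norm` logs is just the L2 norm over all parameter gradients, flattened. A minimal pure-Python sketch of that computation (the real helper operates on `torch` parameters, so the name and input shape here are assumptions):

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradients, flattened across parameters.

    `grads` is a list of per-parameter gradient lists. Under DeepSpeed the
    gradients are already cleared by the time this would be called, so the
    result degenerates to 0.0 (or NaN upstream), which is why logging it
    there is not meaningful.
    """
    total = sum(g * g for param_grads in grads for g in param_grads)
    return math.sqrt(total)


# Two "parameters" with gradients [3.0] and [4.0] -> global norm 5.0
print(global_grad_norm([[3.0], [4.0]]))  # -> 5.0
```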

Comment on lines 289 to 292
 def save_image(image: torch.Tensor, path: pathlib.Path) -> None:
-    image = to_pil_image(image)
+    image = image.to(dtype=torch.float32).clamp(-1, 1)
+    image = to_pil_image(image.float())
     image.save(path)
Collaborator

Could you explain this change?

Owner Author

F.ToPILImage does not support bfloat16, so if someone were to do precomputation in bfloat16 and then try to save the images/videos in a viewable format, it would error out. Casting to fp32 fixes this.
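The failure comes from the NumPy conversion that `to_pil_image` performs internally: bfloat16 has no NumPy dtype equivalent. A small sketch of the failure and the fix, using `.numpy()` directly as a stand-in for the torchvision call (shapes and values are illustrative):

```python
import torch

img = torch.rand(3, 4, 4).to(torch.bfloat16)

# bfloat16 has no NumPy equivalent, so the conversion that torchvision's
# to_pil_image performs internally raises a TypeError.
try:
    img.numpy()
    failed = False
except TypeError:
    failed = True

# Casting to float32 first (as this PR does) makes the conversion succeed.
arr = img.to(torch.float32).numpy()
print(failed, arr.shape)
```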

@a-r-r-o-w a-r-r-o-w merged commit 6c41984 into main Dec 2, 2024
@a-r-r-o-w a-r-r-o-w deleted the multiple-fixes branch December 2, 2024 09:42