
RoPE fixes for 1.5, bfloat16 support in prepare_dataset, gradient_accumulation grad norm undefined fix #107

Merged: 5 commits merged into main from multiple-fixes on Dec 2, 2024

Conversation

a-r-r-o-w
Owner

No description provided.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review December 1, 2024 16:06
@a-r-r-o-w a-r-r-o-w requested a review from sayakpaul December 1, 2024 16:06
@Leojc Leojc mentioned this pull request Dec 2, 2024
@sayakpaul sayakpaul (Collaborator) left a comment

Thank you!

Comment on lines +401 to +402
RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
Collaborator

In the case of DeepSpeed, we won't be able to access transformer.config. Let's use model_config?

Owner Author

We can access it at this point because it is still not wrapped by DeepSpeed.
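To illustrate the point above, the RoPE base resolution is derived from static config values cached before the model is wrapped (e.g. by DeepSpeed), while plain attribute access to `transformer.config` still works. The sketch below stubs the model with a dummy config; the `sample_height`/`sample_width` values and `VAE_SCALE_FACTOR_SPATIAL = 8` are illustrative assumptions, not values confirmed by this PR.

```python
# Sketch: compute RoPE base resolution from static config values *before*
# the model is wrapped, so plain attribute access still works.
VAE_SCALE_FACTOR_SPATIAL = 8  # assumed spatial downscale factor of the VAE


class DummyConfig:
    sample_height = 60  # latent-space height (illustrative value)
    sample_width = 90   # latent-space width (illustrative value)


class DummyTransformer:
    config = DummyConfig()


transformer = DummyTransformer()
# Cache the pixel-space base resolution once, up front.
RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
print(RoPE_BASE_HEIGHT, RoPE_BASE_WIDTH)  # -> 480 720
```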

@@ -828,6 +833,10 @@ def load_model_hook(models, input_dir):
gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
logs.update({
Collaborator

Should we not consider if we're using DeepSpeed and log it accordingly?

I'm referring to the following comment:

# gradnorm + deepspeed: https://github.com/microsoft/DeepSpeed/issues/4555

@a-r-r-o-w a-r-r-o-w (Owner Author) Dec 2, 2024

We cannot calculate grad_norm for DeepSpeed, so this will be NaN or 0, I think. This is because the gradient step is handled internally in DeepSpeed, and the gradients are cleared before we have access to them. Getting access would require a backward hook on the last module, but we can skip that for now and revisit it in the future, since it didn't work before these changes either.
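For reference, a global gradient norm like the one `get_gradient_norm` logs is just the L2 norm over all parameter gradients, flattened. A minimal pure-Python sketch of that computation (the real helper operates on `torch` parameters, so the name and input shape here are assumptions):

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradients, flattened across parameters.

    `grads` is a list of per-parameter gradient lists. Under DeepSpeed the
    gradients are already cleared by the time this would be called, so the
    result degenerates to 0.0 (or NaN upstream), which is why logging it
    there is not meaningful.
    """
    total = sum(g * g for param_grads in grads for g in param_grads)
    return math.sqrt(total)


# Two "parameters" with gradients [3.0] and [4.0] -> global norm 5.0
print(global_grad_norm([[3.0], [4.0]]))  # -> 5.0
```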

Comment on lines 289 to 292
 def save_image(image: torch.Tensor, path: pathlib.Path) -> None:
-    image = to_pil_image(image)
+    image = image.to(dtype=torch.float32).clamp(-1, 1)
+    image = to_pil_image(image.float())
     image.save(path)
Collaborator

Could you explain this change?

Owner Author

F.ToPILImage does not support bfloat16, so if someone were to do precomputation in bfloat16 and then try to save the images/videos in a viewable format, it would error out. Casting to fp32 fixes this.
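The failure comes from the NumPy conversion that `to_pil_image` performs internally: bfloat16 has no NumPy dtype equivalent. A small sketch of the failure and the fix, using `.numpy()` directly as a stand-in for the torchvision call (shapes and values are illustrative):

```python
import torch

img = torch.rand(3, 4, 4).to(torch.bfloat16)

# bfloat16 has no NumPy equivalent, so the conversion that torchvision's
# to_pil_image performs internally raises a TypeError.
try:
    img.numpy()
    failed = False
except TypeError:
    failed = True

# Casting to float32 first (as this PR does) makes the conversion succeed.
arr = img.to(torch.float32).numpy()
print(failed, arr.shape)
```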

@a-r-r-o-w a-r-r-o-w merged commit 6c41984 into main Dec 2, 2024
@a-r-r-o-w a-r-r-o-w deleted the multiple-fixes branch December 2, 2024 09:42