wandb tracker runs into scheduling problems during the training initiation and training stages #100

Merged

merged 1 commit into main from wandb-tracker-fix on Nov 29, 2024

Conversation

glide-the (Collaborator)

wandb tracker runs into scheduling problems during the training initiation and training stages

Login is not performed as expected during startup.
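
A minimal sketch of the intended startup behavior, assuming the accelerate-based setup used in this repo (the project name and call sites are illustrative, not the actual fix in this PR):

```python
# Minimal sketch (assumed setup): log in to wandb on the main process at
# startup so the tracker is ready before training begins.
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")

if accelerator.is_main_process:
    import wandb

    wandb.login()  # explicit login during startup, main process only

# Safe to call on every rank; accelerate only initializes the tracker
# on the main process.
accelerator.init_trackers(project_name="cogvideox-distillation")
```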

sayakpaul merged commit d5cc7c6 into main on Nov 29, 2024
sayakpaul deleted the wandb-tracker-fix branch on November 29, 2024 09:38
glide-the added a commit that referenced this pull request Nov 30, 2024
```
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 1005, in <module>
[rank1]:     main(args)
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 884, in main
[rank1]:     "gradient_norm_before_clip": gradient_norm_before_clip,
[rank1]: UnboundLocalError: local variable 'gradient_norm_before_clip' referenced before assignment
```
ref1: #84
ref2: #100
It seems there is a bug involving `accelerator.is_main_process` and `accelerator.distributed_type`: the `is_main_process` check runs into scheduling problems during the training initiation and training stages.
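
One way to avoid the error, sketched below with assumed names from the training loop (`transformer`, `args.max_grad_norm`, `loss`, and `global_step` are illustrative), is to give the variable a default before the conditional clipping branch and only log it when it was actually computed:

```python
# Sketch of a training-loop fragment; variable names are assumptions.
# Give `gradient_norm_before_clip` a default so logging it never raises
# UnboundLocalError on steps/ranks where clipping did not run.
gradient_norm_before_clip = None

if accelerator.sync_gradients and args.max_grad_norm > 0:
    # clip_grad_norm_ returns the total norm before clipping
    gradient_norm_before_clip = accelerator.clip_grad_norm_(
        transformer.parameters(), args.max_grad_norm
    )

logs = {"loss": loss.detach().item()}
if gradient_norm_before_clip is not None:
    logs["gradient_norm_before_clip"] = gradient_norm_before_clip
accelerator.log(logs, step=global_step)
```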
glide-the mentioned this pull request Nov 30, 2024
a-r-r-o-w pushed a commit that referenced this pull request Nov 30, 2024
* there is an error
```
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 1005, in <module>
[rank1]:     main(args)
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 884, in main
[rank1]:     "gradient_norm_before_clip": gradient_norm_before_clip,
[rank1]: UnboundLocalError: local variable 'gradient_norm_before_clip' referenced before assignment
```
ref1: #84
ref2: #100
It seems there is a bug involving `accelerator.is_main_process` and `accelerator.distributed_type`: the `is_main_process` check runs into scheduling problems during the training initiation and training stages.

* fix