add I2V sft and fix an error #97

Merged
merged 5 commits on Dec 5, 2024
Conversation

jiashenggu
Contributor

@jiashenggu jiashenggu commented Nov 26, 2024

I am not sure how to set `ofs` in training; I just followed the inference pipeline setting.
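For context, the I2V inference pipeline constructs an `ofs` tensor only when the transformer config defines an ofs embedding, filled with a fixed value of 2.0. A minimal sketch of that behavior (function name and shapes are assumptions, not the repo's actual code):

```python
import torch

def make_ofs(latents: torch.Tensor, ofs_embed_dim):
    # Hypothetical helper mirroring the inference-time setting: models
    # without an ofs embedding (e.g. older checkpoints) get None.
    if ofs_embed_dim is None:
        return None
    # One scalar per batch element, matching the latents' dtype/device.
    return latents.new_full((latents.shape[0],), fill_value=2.0)

latents = torch.randn(2, 13, 16, 60, 90)
print(make_ofs(latents, ofs_embed_dim=512))   # tensor([2., 2.])
print(make_ofs(latents, ofs_embed_dim=None))  # None
```

Reusing the same fixed value during training keeps the conditioning consistent with what the model sees at inference.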

@sayakpaul
Collaborator

Thanks! Could you provide some example results as well?

@sayakpaul sayakpaul requested review from sayakpaul and a-r-r-o-w and removed request for sayakpaul November 29, 2024 09:42

@sayakpaul sayakpaul left a comment


I left some preliminary comments. I will let @a-r-r-o-w comment on the new script first.

LEARNING_RATES=("1e-4")
LR_SCHEDULES=("cosine_with_restarts")
OPTIMIZERS=("adamw")
MAX_TRAIN_STEPS=("20000")
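A sketch of how arrays like these are typically consumed in a sweep script, mirroring the loop structure of train_text_to_video_sft.sh (the launch command and flag names in the comment are assumptions):

```shell
LEARNING_RATES=("1e-4")
LR_SCHEDULES=("cosine_with_restarts")
OPTIMIZERS=("adamw")
MAX_TRAIN_STEPS=("20000")

# Iterate over every combination of hyperparameters, one run each.
for lr in "${LEARNING_RATES[@]}"; do
  for schedule in "${LR_SCHEDULES[@]}"; do
    for optimizer in "${OPTIMIZERS[@]}"; do
      for steps in "${MAX_TRAIN_STEPS[@]}"; do
        # The real script would launch the trainer here, e.g. via
        # accelerate launch, passing --learning_rate, --lr_scheduler,
        # --optimizer, and --max_train_steps.
        echo "run: lr=$lr schedule=$schedule optimizer=$optimizer steps=$steps"
      done
    done
  done
done
```

With single-element arrays this produces exactly one run; adding values to any array grows the sweep combinatorially.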
Collaborator


20000 steps!?

Contributor Author


I just followed the settings from train_text_to_video_sft.sh.

Collaborator


Have you performed any experiments yourself?

@@ -799,12 +799,15 @@ def load_model_hook(models, input_dir):
# (this is the forward diffusion process)
noisy_video_latents = scheduler.add_noise(video_latents, noise, timesteps)
noisy_model_input = torch.cat([noisy_video_latents, image_latents], dim=2)

model_config.patch_size_t if hasattr(model_config, "patch_size_t") else None,
Collaborator


This seems wrong. No assignment to a variable. Is this expected?

Contributor Author


Forgot to delete an earlier edit. Fixed it.
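For reference, the stray expression in the diff presumably becomes an assignment in the fix; a minimal sketch, using a hypothetical stand-in for the real transformer config:

```python
from types import SimpleNamespace

# Hypothetical stand-in for the real transformer model config.
model_config = SimpleNamespace(patch_size=2)

# The stray expression turned into an assignment; getattr's default
# replaces the `x if hasattr(...) else None` pattern from the diff.
patch_size_t = getattr(model_config, "patch_size_t", None)
print(patch_size_t)  # None

# Configs that do define a temporal patch size yield it directly.
model_config.patch_size_t = 2
patch_size_t = getattr(model_config, "patch_size_t", None)
print(patch_size_t)  # 2
```

Falling back to `None` keeps older model configs without a temporal patch size working unchanged.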

Owner

@a-r-r-o-w a-r-r-o-w left a comment


Thank you for the script and OFS fixes! Do you have any publicly available training runs to verify correctness? For example, we did a 20000-step run on T2V SFT before sharing the script, so something similar to verify I2V would be nice (even with a lower number of steps).

@jiashenggu
Contributor Author

jiashenggu commented Dec 4, 2024

Thank you for the script and OFS fixes! Do you have any publicly available training runs to verify correctness? For example, we did a 20000-step run on T2V SFT before sharing the script, so something similar to verify I2V would be nice (even with a lower number of steps).

I did a 24000-step run, but I'm not sure it met expectations. It seems somewhat better.

valid prompt:
A black-and-white animated scene featuring three characters in a static setting. Mickey Mouse-like character stands on one leg, hands on hips, with a playful expression. Center character has an exaggerated open mouth, caught in mid-motion, suggesting singing or surprise. Female character in a tutu and flower-adorned hat dances, arms raised. Background features a plain wall with scattered musical notes. The characters maintain their positions and expressions, with no changes in lighting, environment, or camera perspective, focusing on their interaction within this continuous moment.

valid image: [image attached]
base model output:

output_base.mp4

24000 step run model output:

output_24000.mp4

@jiashenggu jiashenggu requested a review from a-r-r-o-w December 4, 2024 06:53
@sayakpaul
Collaborator

Seems definitely better to me. At least it learned semantics and better motion (IMO).

@a-r-r-o-w
Owner

@jiashenggu Looks good to merge! Could you rebase against the main branch? It looks like some changes that are already in main were also made here.

@a-r-r-o-w
Owner

Thank you so much for this! I've verified that it works. Please feel free to open PRs for speedups or other suggestions for improvements :)

@a-r-r-o-w a-r-r-o-w merged commit 80d1150 into a-r-r-o-w:main Dec 5, 2024