[feat] support DeepSpeed. #139
Conversation
@@ -33,7 +33,7 @@ def main():
     trainer.prepare_for_training()
     trainer.prepare_trackers()
     trainer.train()
-    trainer.evaluate()
+    # trainer.evaluate()
For now, I have just made it explicit that evaluate() is not implemented.
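As a rough illustration (an assumption about the implementation, not the actual diff: the PR may simply comment the call out), making it explicit could mean the stub fails loudly instead of silently doing nothing:

class Trainer:
    # Hypothetical stub: fail loudly until evaluation support is added.
    def evaluate(self) -> None:
        raise NotImplementedError("Trainer.evaluate() is not implemented yet.")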
I have addressed the feedback. Will do the testing and the rest of the TODOs and request another review.
@a-r-r-o-w I think there's some problem with validation; the run gets stuck right after this:

DEBUG:finetrainers:Validation artifacts on process 0: ['image', 'video', 'artifact_0']█████████████████████| 50/50 [00:12<00:00, 4.16it/s]
DEBUG:finetrainers:Saving video to /raid/.cache/huggingface/sayak/ltx-video/ltxv_disney/validation-10-0-afkx-A-black-and-white-an.mp4

To quickly reproduce:

Command

export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG
GPU_IDS="0,1"
DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/ltx-video/ltxv_disney"
# Model arguments
model_cmd="--model_name ltx_video \
--pretrained_model_name_or_path Lightricks/LTX-Video"
# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token BW_STYLE \
--video_resolution_buckets 49x512x768 \
--caption_dropout_p 0.05"
# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"
# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"
# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--mixed_precision bf16 \
--batch_size 1 \
--train_steps 10 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 5 \
--checkpointing_limit 2 \
--resume_from_checkpoint=latest \
--enable_slicing \
--enable_tiling"
# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"
# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
--num_validation_videos 1 \
--validation_steps 100"
# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb --push_to_hub"
cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$validation_cmd \
$miscellaneous_cmd"
echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

Is this expected?
Nope, not expected. All my runs have completed successfully so far. I can take a look after the diffusers patch-related things are wrapped up.
With DeepSpeed for Hunyuan Video, there's a separate problem:

ERROR:finetrainers:Traceback (most recent call last):
File "/fsx/sayak/finetrainers/train.py", line 35, in main
trainer.train()
File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 667, in train
accelerator.backward(loss)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
self.engine.backward(loss, **kwargs)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

So, I think for now, we could raise an error when initializing the
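A minimal sketch of what such a guard could look like, assuming the intent is to refuse DeepSpeed for hunyuan_video at init time (the function name and attribute paths here are illustrative, not the actual finetrainers ones):

def maybe_reject_deepspeed(args, accelerator) -> None:
    # Hypothetical guard: DeepSpeed + hunyuan_video currently crashes in backward,
    # so fail fast with a clear message instead of an illegal-memory-access error.
    if accelerator.state.deepspeed_plugin is not None and args.model_name == "hunyuan_video":
        raise NotImplementedError("DeepSpeed training is not supported for hunyuan_video yet.")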
@sayakpaul Taking a look now, sorry for the delay. When you tried, do you notice
Just tested and seems to be hanging with/without deepspeed. Looking through the changes again, as this was not the case before. Hopefully I can find what's wrong quickly 🤞
LGTM! Feel free to merge after adding the note about DeepSpeed you mentioned in the description. Do we have a minimal run with DeepSpeed training to verify that it works? If not, I can queue up one with LTXV
I think the final todos are:
Would prefer the readme-related things to go in a single PR. WDYT?
Oh, sorry I missed the Hunyuan comment. Just the readme and the check are remaining, but could I have maybe 20 mins to understand what's happening in Hunyuan? If we can't figure it out, it's okay to look into it in the future.
I have runs here https://wandb.ai/sayakpaul/finetrainers-ltxv/runs/id150mwa but I didn't complete a minimal run to check the quality.
Yeah, works for me.
Have never required that for my other projects. LTX-V is working as expected without that. But I will try with that too.
It's not needed in your projects probably because you create a custom scheduler manually (which is what we do too). But many projects use a scheduler defined in the DeepSpeed config file, so it is important to respect that. We already handle the DummyOptim case, so this would make it near ideal. Okay to take in a separate PR, but just providing context on why it is important.
Will handle it. But I doubt it will make a difference, especially because LTX-Video is working.
No, I don't think it will make a difference because (from the example):
The DummyScheduler is only initialized if we added a scheduler type to the deepspeed config. Currently, our deepspeed config is very simple and does not use a custom DS-provided scheduler/optimizer. So we will never end up creating a Dummy one and will instead use our own optimizer/scheduler defined from the CLI args.
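For context, this is the usual pattern (a sketch based on the accelerate DeepSpeed examples, not the exact finetrainers code): fall back to the Dummy placeholders only when the DeepSpeed config itself declares an optimizer/scheduler.

import torch
from accelerate.utils import DummyOptim, DummyScheduler
from diffusers.optimization import get_scheduler  # illustrative; any scheduler factory works

def build_optimizer_and_scheduler(accelerator, params, args):
    ds_plugin = accelerator.state.deepspeed_plugin
    ds_config = ds_plugin.deepspeed_config if ds_plugin is not None else {}

    # Use a real optimizer unless the DeepSpeed config defines one; then DummyOptim.
    if "optimizer" in ds_config:
        optimizer = DummyOptim(params, lr=args.lr)
    else:
        optimizer = torch.optim.AdamW(params, lr=args.lr, weight_decay=args.weight_decay)

    # Same logic for the LR scheduler.
    if "scheduler" in ds_config:
        lr_scheduler = DummyScheduler(
            optimizer, total_num_steps=args.train_steps, warmup_num_steps=args.lr_warmup_steps
        )
    else:
        lr_scheduler = get_scheduler(
            args.lr_scheduler,
            optimizer=optimizer,
            num_warmup_steps=args.lr_warmup_steps,
            num_training_steps=args.train_steps,
        )
    return optimizer, lr_scheduler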
Grad Norm tracking in DeepSpeed
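Since DeepSpeed clips gradients itself (via gradient_clipping in the config), one way to track the grad norm is to read it back from the engine. A sketch that assumes the object returned by accelerator.prepare() is the DeepSpeedEngine; it may differ from what this PR ends up doing:

from accelerate.utils import DistributedType

# After accelerator.backward(loss) and the optimizer step in the training loop:
if accelerator.distributed_type == DistributedType.DEEPSPEED:
    # May be None until the first optimizer step has completed.
    grad_norm = transformer.get_global_grad_norm()
else:
    grad_norm = accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
logs = {"grad_norm": grad_norm}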
Now I am hit with a different error when trying to do HunyuanVideo with DeepSpeed. My environment:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.0
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.4.4
- xFormers version: 0.0.27.post2
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

deepspeed config

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

training command

export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG
GPU_IDS="0,1"
DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/hunyuan_video/hunyuan_disney"
# Model arguments
model_cmd="--model_name hunyuan_video \
--pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"
# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token afkx \
--video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
--caption_dropout_p 0.05"
# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"
# Diffusion arguments
diffusion_cmd=""
# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--mixed_precision bf16 \
--batch_size 1 \
--train_steps 50 \
--rank 4 \
--lora_alpha 4 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 5 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling"
# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 2e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"
# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
--num_validation_videos 1 \
--validation_steps 100"
# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"
cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS --main_process_port 29501 train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$validation_cmd \
$miscellaneous_cmd"
echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

Error

ERROR:finetrainers:Traceback (most recent call last):
File "/home/sayak/finetrainers/train.py", line 35, in main
trainer.train()
File "/home/sayak/finetrainers/finetrainers/trainer.py", line 717, in train
accelerator.backward(loss)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2240, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 246, in backward
self.engine.backward(loss, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Non-DS leads me to an OOM right at the first step (no validation):

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.84 GiB. GPU 0 has a total capacity of 79.15 GiB of which 4.51 GiB is free. Including non-PyTorch memory, this process has 74.63 GiB memory in use. Of the allocated memory 69.83 GiB is allocated by PyTorch, and 4.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
dtype = (
    torch.float16
    if self.args.mixed_precision == "fp16"
    else torch.bfloat16
    if self.args.mixed_precision == "bf16"
    else torch.float32
)
self.transformer = self.transformer.to(dtype)
transformer should already be in the right data type. If we want users to upcast before getting the final state dict, that is a different concern.
@sayakpaul Taking a look at the OOM. I am going to start with 1x512x768 and see how far we can go on a single 80 GB GPU, and we can update the README accordingly. The 129x512x768 in the config was just an example. We report memory requirements for 49x512x768 only (which takes under 48 GB without precomputation, and about 43 GB with precomputation).
Sounds good! I will try a couple of different environments for Hunyuan DS support and investigate further. But I guess that doesn't block this PR.
For DeepSpeed, resolution buckets as
@sayakpaul Could you verify if these are the numbers you see as well? Or do you get OOM for this too? I'm on PT 2.5.1+cu124 as well, but both DeepSpeed and normal seem to work for me.
Checking shortly.
And the same, but non-deepspeed (uncompiled_2.yaml config):
I was able to match the numbers, yes. Thanks for all the help! I have also added a small change in 84c1756. PTAL.

There's a problem we face when the shapes look like this:

noisy_latents.shape=torch.Size([2, 16, 21, 64, 96])
query.shape=torch.Size([2, 24, 32512, 128]), key.shape=torch.Size([2, 24, 32512, 128]), attention_mask.shape=torch.Size([2, 32512, 32512])
ERROR:finetrainers:An error occurred during training: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1. Target sizes: [2, 24, 32512, 32512]. Tensor sizes: [2, 32512, 32512]
ERROR:finetrainers:Traceback (most recent call last):
File "/home/sayak/finetrainers/train.py", line 35, in main
trainer.train()
File "/home/sayak/finetrainers/finetrainers/trainer.py", line 709, in train
pred = self.model_config["forward_pass"](
File "/home/sayak/finetrainers/finetrainers/hunyuan_video/hunyuan_video_lora.py", line 225, in forward_pass
denoised_latents = transformer(
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 742, in forward
hidden_states, encoder_hidden_states = torch.utils.checkpoint.checkpoint(
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
ret = function(*args, **kwargs)
File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 735, in custom_forward
return module(*inputs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 480, in forward
attn_output, context_attn_output = self.attn(
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/diffusers/src/diffusers/models/attention_processor.py", line 588, in forward
return self.processor(
File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 118, in __call__
hidden_states = F.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1. Target sizes: [2, 24, 32512, 32512]. Tensor sizes: [2, 32512, 32512]

Can be reproduced easily:

import torch
import torch.nn.functional as F
q = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
attn_mask = torch.ones(2, 32512, 32512, device="cuda").bool()
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
print(out.shape)

But I guess this is independent of the PR, so it could be tackled in a different PR? I will work on 8-bit optimizers and Cog next.
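For reference, a minimal sketch of one way to make that pattern broadcast (an assumption about the fix, not necessarily how diffusers will address it): give the mask an explicit head dimension. Toy sizes are used so it runs anywhere.

import torch
import torch.nn.functional as F

# Same layout as the repro above, with small shapes.
q = torch.randn(2, 24, 128, 64)
k = torch.randn(2, 24, 128, 64)
v = torch.randn(2, 24, 128, 64)
attn_mask = torch.ones(2, 128, 128).bool()

# [B, S, S] -> [B, 1, S, S] so it broadcasts against [B, heads, S, S].
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask.unsqueeze(1))
print(out.shape)  # torch.Size([2, 24, 128, 64])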
Thanks, LGTM! Just one doubt: the change in your commit increases the ndim of sigmas, but the error seems to come from the attention mask. Are the two related?
Not related. I changed it to suit broadcasting; otherwise, the error happens in the flow noise-interpolation step (finetrainers/finetrainers/trainer.py, line 700 at 84c1756).
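To illustrate the broadcasting point, a sketch with assumed shapes (not the exact trainer code): sigmas come out as a 1-D per-sample tensor while video latents are 5-D, so the flow interpolation only broadcasts once sigma gets trailing singleton dims.

import torch

latents = torch.randn(2, 16, 21, 64, 96)  # [B, C, F, H, W], as in the logs above
noise = torch.randn_like(latents)
sigmas = torch.rand(2)                     # one sigma per sample, shape [B]

sigmas = sigmas.view(-1, 1, 1, 1, 1)       # [B, 1, 1, 1, 1] so the product broadcasts
noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
print(noisy_latents.shape)                 # torch.Size([2, 16, 21, 64, 96])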
There are additional things that I have clubbed into this PR. LMK your thoughts. Some comments are in-line.
To test:
command
TODOs