[feat] support DeepSpeed. #139

Merged · 20 commits · Dec 27, 2024
Conversation

@sayakpaul (Collaborator) commented Dec 24, 2024

There are a few additional things I have bundled into this PR. LMK your thoughts; some comments are in-line.

To test:

Command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/ltx-video/ltxv_disney"

# Model arguments
model_cmd="--model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token BW_STYLE \
  --video_resolution_buckets 49x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 50 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

TODOs

  • Update the README to include a DeepSpeed support note
  • Test with HunyuanVideo

@sayakpaul requested a review from a-r-r-o-w on December 24, 2024, 05:31
@@ -33,7 +33,7 @@ def main():
     trainer.prepare_for_training()
     trainer.prepare_trackers()
     trainer.train()
-    trainer.evaluate()
+    # trainer.evaluate()
@sayakpaul (Collaborator, Author) commented on the diff

For now, I have just made it explicit that evaluate() is not implemented.

@sayakpaul (Collaborator, Author)

I have addressed the feedback. Will do the testing and the rest of the TODOs, then request another review.

@sayakpaul (Collaborator, Author)

@a-r-r-o-w I think there's some problem with accelerator.end_training(). The process is stuck forever after:

DEBUG:finetrainers:Validation artifacts on process 0: ['image', 'video', 'artifact_0']█████████████████████| 50/50 [00:12<00:00,  4.16it/s]
DEBUG:finetrainers:Saving video to /raid/.cache/huggingface/sayak/ltx-video/ltxv_disney/validation-10-0-afkx-A-black-and-white-an.mp4

To quickly reproduce:

Command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/ltx-video/ltxv_disney"

# Model arguments
model_cmd="--model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token BW_STYLE \
  --video_resolution_buckets 49x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 10 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb --push_to_hub"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

Is this expected?

@a-r-r-o-w (Owner)

Nope, not expected. All my runs have completed successfully so far. I can take a look after diffusers patch-related things are wrapped up.

@sayakpaul (Collaborator, Author)

With DeepSpeed for Hunyuan Video, there's a separate problem:

ERROR:finetrainers:Traceback (most recent call last):
  File "/fsx/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 667, in train
    accelerator.backward(loss)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
    self.engine.backward(loss, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

So, I think for now, we could raise an error when initializing the Trainer if we detect that the model type is Hunyuan and the DistributedType is DeepSpeed. In a follow-up PR, I am going to add support for 8-bit optimizers from bitsandbytes so that the memory requirements can be lowered further. This will be especially beneficial for HunyuanVideo.
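
For reference, a minimal sketch of such a guard, assuming a hypothetical helper inside the Trainer (the function and argument names are illustrative, not the repo's actual API):

import torch
from accelerate.utils import DistributedType

# Hedged sketch: fail fast when HunyuanVideo is combined with DeepSpeed, since
# backward currently crashes with an illegal memory access in that combination.
# `_validate_distributed_setup` and `model_name` are hypothetical names.
def _validate_distributed_setup(accelerator, model_name: str) -> None:
    if model_name == "hunyuan_video" and accelerator.state.distributed_type == DistributedType.DEEPSPEED:
        raise ValueError(
            "HunyuanVideo training is currently not supported with DeepSpeed; "
            "please use a non-DeepSpeed accelerate config until this is resolved."
        )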

@a-r-r-o-w (Owner)

@sayakpaul Taking a look now, sorry for the delay.

When you tried, did you notice the accelerator.end_training() process getting stuck forever when not using DeepSpeed?

@a-r-r-o-w (Owner)

Just tested, and it seems to be hanging with/without DeepSpeed. Looking through the changes again, as this was not the case before. Hopefully I can find what's wrong quickly 🤞

@a-r-r-o-w (Owner) left a comment

LGTM! Feel free to merge after adding the note about DeepSpeed you mentioned in the description. Do we have a minimal run with DeepSpeed training to verify that it works? If not, I can queue one up with LTXV.

@sayakpaul (Collaborator, Author)

I think the final todos are:

  • Add a check for DeepSpeed and Hunyuan as mentioned here.

Would prefer the README-related things to go in a single PR.

WDYT?

@a-r-r-o-w (Owner)

Oh, sorry I missed the Hunyuan comment. Just the README and the check remain, but could I have maybe 20 minutes to understand what's happening in Hunyuan? If we can't figure it out, it's okay to look into it in the future.

@sayakpaul (Collaborator, Author)

Do we have a minimal run with DeepSpeed training to verify that it works? If not, I can queue up one with LTXV

I have runs here https://wandb.ai/sayakpaul/finetrainers-ltxv/runs/id150mwa but I didn't complete a minimal run to check the quality.

@sayakpaul (Collaborator, Author)

Just the readme and the check is remaining - but could I have maybe 20 mins to understand what's happening in Hunyuan? If we can't figure it out, okay to look into it in the future

Yeah works for me.

@a-r-r-o-w (Owner) left a comment

@sayakpaul (Collaborator, Author)

I have never required that for my other projects. LTX-V is working as expected without that. But I will try with that too.

@a-r-r-o-w (Owner)

It's not needed in your projects probably because you create a custom scheduler manually (which is what we do too). But many projects use a scheduler defined in the DeepSpeed config file, so it is important to respect that. We already handle the DummyOptim case, so this would make it near ideal. Okay to take it in a separate PR, but just providing context on why it is important.
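
For context, a minimal sketch of the usual accelerate-style DummyOptim handling, with hypothetical argument names (this is not the repo's exact code):

import torch
from accelerate.utils import DummyOptim

# Hedged sketch: pick DeepSpeed's DummyOptim when the optimizer is defined in the
# DeepSpeed config, otherwise build the optimizer from CLI args as usual.
def get_optimizer(accelerator, params, args):
    use_deepspeed_optimizer = (
        accelerator.state.deepspeed_plugin is not None
        and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
    )
    if use_deepspeed_optimizer:
        # DeepSpeed builds the real optimizer from its config; DummyOptim is only a placeholder.
        return DummyOptim(params, lr=args.lr, weight_decay=args.weight_decay)
    return torch.optim.AdamW(
        params, lr=args.lr, betas=(args.beta1, args.beta2), weight_decay=args.weight_decay, eps=args.epsilon
    )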

@sayakpaul (Collaborator, Author)

Will handle it. But I doubt it will make a difference, especially because LTX-Video is working.

@a-r-r-o-w (Owner)

No, I don't think it will make a difference because (from the example):

 if (
     accelerator.state.deepspeed_plugin is None
     or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
 ):
     lr_scheduler = get_scheduler(
         name=args.lr_scheduler_type,
         optimizer=optimizer,
         num_warmup_steps=args.num_warmup_steps,
         num_training_steps=args.max_train_steps,
     )
 else:
     lr_scheduler = DummyScheduler(
         optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
     )

The DummyScheduler is only initialized if we added a scheduler type to the DeepSpeed config. Currently, our DeepSpeed config is very simple and does not use a custom DS-provided scheduler/optimizer. So, we will never end up creating a dummy one and will instead use our own optimizer/scheduler defined from the CLI args.

@a-r-r-o-w (Owner)

What is your env?

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.26.2
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.13.3.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

@sayakpaul (Collaborator, Author) commented Dec 26, 2024

Now I am hit with a different error when trying to do HunyuanVideo with DeepSpeed.

My diffusers-cli env on the DGX:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.0
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.4.4
- xFormers version: 0.0.27.post2
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
DeepSpeed config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Training command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/hunyuan_video/hunyuan_disney"

# Model arguments
model_cmd="--model_name hunyuan_video \
  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token afkx \
  --video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 50 \
  --rank 4 \
  --lora_alpha 4 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 2e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS --main_process_port 29501 train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
Error
ERROR:finetrainers:Traceback (most recent call last):
  File "/home/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/home/sayak/finetrainers/finetrainers/trainer.py", line 717, in train
    accelerator.backward(loss)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2240, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 246, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Non-DS leads me to an OOM right at the first step (no validation):

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.84 GiB. GPU 0 has a total capacity of 79.15 GiB of which 4.51 GiB is free. Including non-PyTorch memory, this process has 74.63 GiB memory in use. Of the allocated memory 69.83 GiB is allocated by PyTorch, and 4.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Comment on lines -729 to -736
dtype = (
    torch.float16
    if self.args.mixed_precision == "fp16"
    else torch.bfloat16
    if self.args.mixed_precision == "bf16"
    else torch.float32
)
self.transformer = self.transformer.to(dtype)
@sayakpaul (Collaborator, Author) commented on the diff

The transformer should already be in the right data type. If we want users to upcast before getting the final state dict, that is a different concern.
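
For illustration, a hedged sketch of that separate concern, i.e. upcasting the trainable (LoRA) parameters to float32 before extracting the final state dict; the helper name is hypothetical:

import torch

# Hedged sketch: upcast only the trainable LoRA parameters to fp32 so the saved
# state dict is full precision, independent of the training compute dtype.
def upcast_trainable_params(transformer: torch.nn.Module) -> None:
    for param in transformer.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)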

@sayakpaul requested a review from a-r-r-o-w on December 26, 2024, 15:46
@a-r-r-o-w (Owner)

@sayakpaul Taking a look at the OOM. I am going to start with 1x512x768 and see how far we can go on a single 80 GB GPU, and we can update the README accordingly.

The 129x512x768 in the config came as an example. We report memory requirements for 49x512x768 only (which takes under 48 GB without precomputation, and about 43 GB with precomputation).

@sayakpaul (Collaborator, Author)

Sounds good! I will try a couple of different environments for Hunyuan DS support and investigate further. But I guess that doesn’t block this PR.

@a-r-r-o-w (Owner)

For DeepSpeed, with resolution buckets set to 81x512x768:

Memory after epoch 2: {
    "memory_allocated": 39.284,
    "memory_reserved": 62.9,
    "max_memory_allocated": 59.832,
    "max_memory_reserved": 62.9
}
Memory after validation end: {
    "memory_allocated": 24.581,
    "memory_reserved": 26.424,
    "max_memory_allocated": 67.129,
    "max_memory_reserved": 70.771
}

@sayakpaul Could you verify whether these are the numbers you see as well, or do you get an OOM for this too? I'm on PT 2.5.1+cu124 as well, but both DeepSpeed and normal seem to work for me.

@sayakpaul (Collaborator, Author)

Checking shortly.

@a-r-r-o-w (Owner)

And the same, but non-DeepSpeed (uncompiled_2.yaml config):

Memory after epoch 2: {
    "memory_allocated": 40.202,
    "memory_reserved": 67.895,
    "max_memory_allocated": 60.761,
    "max_memory_reserved": 67.895
}
Memory after validation end: {
    "memory_allocated": 25.499,
    "memory_reserved": 28.771,
    "max_memory_allocated": 68.056,
    "max_memory_reserved": 71.977
}

@sayakpaul (Collaborator, Author)

@a-r-r-o-w

I was able to match the numbers, yes. Thanks for all the help! I have also added a small change in 84c1756. PTAL.

There's a problem we face when batch_size is specified to be > 1.

noisy_latents.shape=torch.Size([2, 16, 21, 64, 96])
query.shape=torch.Size([2, 24, 32512, 128]), key.shape=torch.Size([2, 24, 32512, 128]), attention_mask.shape=torch.Size([2, 32512, 32512])
ERROR:finetrainers:An error occurred during training: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1.  Target sizes: [2, 24, 32512, 32512].  Tensor sizes: [2, 32512, 32512]
ERROR:finetrainers:Traceback (most recent call last):
  File "/home/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/home/sayak/finetrainers/finetrainers/trainer.py", line 709, in train
    pred = self.model_config["forward_pass"](
  File "/home/sayak/finetrainers/finetrainers/hunyuan_video/hunyuan_video_lora.py", line 225, in forward_pass
    denoised_latents = transformer(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 742, in forward
    hidden_states, encoder_hidden_states = torch.utils.checkpoint.checkpoint(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
    ret = function(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 735, in custom_forward
    return module(*inputs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 480, in forward
    attn_output, context_attn_output = self.attn(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/attention_processor.py", line 588, in forward
    return self.processor(
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 118, in __call__
    hidden_states = F.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1.  Target sizes: [2, 24, 32512, 32512].  Tensor sizes: [2, 32512, 32512]

Can be reproduced easily:

import torch 
import torch.nn.functional as F

q = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
attn_mask = torch.ones(2, 32512, 32512, device="cuda").bool()

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
print(out.shape)
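
The snippet above raises the RuntimeError. For illustration, a small-shape sketch of the same broadcasting rule, showing a possible workaround (an assumption, not necessarily the fix that will land): giving the mask an explicit head dimension makes it broadcast against the [batch, heads, q_len, k_len] attention scores.

import torch
import torch.nn.functional as F

# Hedged sketch with small shapes: a 3-D mask [batch, q_len, k_len] broadcasts to
# [1, batch, q_len, k_len] and clashes with the head dim, while [batch, 1, q_len, k_len] works.
q = torch.randn(2, 24, 128, 64)
k = torch.randn(2, 24, 128, 64)
v = torch.randn(2, 24, 128, 64)

attn_mask = torch.ones(2, 128, 128).bool()
# F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # RuntimeError: size 2 vs 24 at dim 1

attn_mask = attn_mask.unsqueeze(1)  # [2, 128, 128] -> [2, 1, 128, 128]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([2, 24, 128, 64])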

But I guess this is independent of the PR, so it could be tackled in a different PR? I will work on 8-bit optimizers and Cog next.

@a-r-r-o-w (Owner)

Thanks, LGTM!

Just one doubt: the change in your commit increases the ndim of sigmas, but the error seems to come from the attention mask. Are the two related?

@sayakpaul (Collaborator, Author)

Not related. I changed it so broadcasting works; otherwise, the error happens in the noise-flow interpolation step:

noisy_latents = (1.0 - sigmas) * latent_conditions["latents"] + sigmas * noise
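
For context, a minimal standalone sketch of that broadcasting issue, with shapes assumed from the log above (variable names are illustrative):

import torch

# Hedged sketch: per-example sigmas are 1-D ([batch]) while latents are 5-D
# ([batch, channels, frames, height, width]), so sigmas needs trailing singleton
# dims before the interpolation above can broadcast.
latents = torch.randn(2, 16, 21, 64, 96)
noise = torch.randn_like(latents)
sigmas = torch.rand(2)                                 # one sigma per example
sigmas = sigmas.view(-1, *([1] * (latents.ndim - 1)))  # [2] -> [2, 1, 1, 1, 1]

noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
print(noisy_latents.shape)  # torch.Size([2, 16, 21, 64, 96])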
