[feat] support DeepSpeed. #139

Merged · 20 commits · Dec 27, 2024
Conversation

@sayakpaul (Collaborator) commented Dec 24, 2024

There are a few additional things I have bundled into this PR. LMK your thoughts; some comments are in-line.

To test:

Command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/ltx-video/ltxv_disney"

# Model arguments
model_cmd="--model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token BW_STYLE \
  --video_resolution_buckets 49x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 50 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

TODOs

  • Update the README to include a DeepSpeed support note
  • Test with HunyuanVideo

@sayakpaul requested a review from a-r-r-o-w on December 24, 2024, 05:31
@@ -33,7 +33,7 @@ def main():
     trainer.prepare_for_training()
     trainer.prepare_trackers()
     trainer.train()
-    trainer.evaluate()
+    # trainer.evaluate()
@sayakpaul (Collaborator, Author) commented on the diff

For now, I have just made it explicit that evaluate() is not implemented.

@sayakpaul (Collaborator, Author)

I have addressed the feedback. Will do the testing and the rest of the TODOs, then request another review.

@sayakpaul (Collaborator, Author)

@a-r-r-o-w I think there's some problem with accelerator.end_training(). The process is stuck forever after:

DEBUG:finetrainers:Validation artifacts on process 0: ['image', 'video', 'artifact_0']█████████████████████| 50/50 [00:12<00:00,  4.16it/s]
DEBUG:finetrainers:Saving video to /raid/.cache/huggingface/sayak/ltx-video/ltxv_disney/validation-10-0-afkx-A-black-and-white-an.mp4

To quickly reproduce:

Command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/ltx-video/ltxv_disney"

# Model arguments
model_cmd="--model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token BW_STYLE \
  --video_resolution_buckets 49x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 10 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb --push_to_hub"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

Is this expected?

@a-r-r-o-w (Owner)

Nope, not expected. All my runs have completed successfully so far. I can take a look after diffusers patch-related things are wrapped up.

@sayakpaul (Collaborator, Author)

With DeepSpeed for Hunyuan Video, there's a separate problem:

ERROR:finetrainers:Traceback (most recent call last):
  File "/fsx/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 667, in train
    accelerator.backward(loss)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
    self.engine.backward(loss, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

So, I think for now, we could raise an error when initializing the Trainer if we detect that the model type is Hunyuan and the DistributedType is DeepSpeed. In a follow-up PR, I am going to add support for 8-bit optimizers from bitsandbytes so that the memory requirements can be lowered further. This will be especially beneficial for HunyuanVideo.
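
For reference, a minimal sketch of such a guard, assuming a hypothetical helper inside the Trainer (the function and argument names are illustrative, not the repo's actual API):

import torch
from accelerate.utils import DistributedType

# Hedged sketch: fail fast when HunyuanVideo is combined with DeepSpeed, since
# backward currently crashes with an illegal memory access in that combination.
# `_validate_distributed_setup` and `model_name` are hypothetical names.
def _validate_distributed_setup(accelerator, model_name: str) -> None:
    if model_name == "hunyuan_video" and accelerator.state.distributed_type == DistributedType.DEEPSPEED:
        raise ValueError(
            "HunyuanVideo training is currently not supported with DeepSpeed; "
            "please use a non-DeepSpeed accelerate config until this is resolved."
        )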

@a-r-r-o-w (Owner)

@sayakpaul Taking a look now, sorry for the delay.

When you tried, did you notice the accelerator.end_training() process getting stuck forever when not using DeepSpeed?

@a-r-r-o-w (Owner)

Just tested, and it seems to be hanging with/without DeepSpeed. Looking through the changes again, as this was not the case before. Hopefully I can find what's wrong quickly 🤞

@a-r-r-o-w (Owner) left a comment

LGTM! Feel free to merge after adding the note about DeepSpeed you mentioned in the description. Do we have a minimal run with DeepSpeed training to verify that it works? If not, I can queue one up with LTXV.

@sayakpaul (Collaborator, Author)

I think the final todos are:

  • Add a check for DeepSpeed and Hunyuan as mentioned here.

Would prefer the README-related things to go in a single PR.

WDYT?

@a-r-r-o-w (Owner)

Oh, sorry I missed the Hunyuan comment. Just the README and the check remain, but could I have maybe 20 minutes to understand what's happening in Hunyuan? If we can't figure it out, it's okay to look into it in the future.

@sayakpaul (Collaborator, Author)

Do we have a minimal run with DeepSpeed training to verify that it works? If not, I can queue up one with LTXV

I have runs here https://wandb.ai/sayakpaul/finetrainers-ltxv/runs/id150mwa but I didn't complete a minimal run to check the quality.

@sayakpaul (Collaborator, Author)

Just the readme and the check is remaining - but could I have maybe 20 mins to understand what's happening in Hunyuan? If we can't figure it out, okay to look into it in the future

Yeah works for me.

@a-r-r-o-w (Owner) left a comment

@sayakpaul (Collaborator, Author)

I have never required that for my other projects. LTX-V is working as expected without that. But I will try with that too.

@a-r-r-o-w (Owner)

It's not needed in your projects probably because you create a custom scheduler manually (which is what we do too). But many projects use a scheduler defined in the DeepSpeed config file, so it is important to respect that. We already handle the DummyOptim case, so this would make it near ideal. Okay to take it in a separate PR, but just providing context on why it is important.
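
For context, a minimal sketch of the usual accelerate-style DummyOptim handling, with hypothetical argument names (this is not the repo's exact code):

import torch
from accelerate.utils import DummyOptim

# Hedged sketch: pick DeepSpeed's DummyOptim when the optimizer is defined in the
# DeepSpeed config, otherwise build the optimizer from CLI args as usual.
def get_optimizer(accelerator, params, args):
    use_deepspeed_optimizer = (
        accelerator.state.deepspeed_plugin is not None
        and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
    )
    if use_deepspeed_optimizer:
        # DeepSpeed builds the real optimizer from its config; DummyOptim is only a placeholder.
        return DummyOptim(params, lr=args.lr, weight_decay=args.weight_decay)
    return torch.optim.AdamW(
        params, lr=args.lr, betas=(args.beta1, args.beta2), weight_decay=args.weight_decay, eps=args.epsilon
    )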

@sayakpaul (Collaborator, Author)

Will handle it. But I doubt it will make a difference, especially because LTX-Video is working.

@a-r-r-o-w (Owner)

No, I don't think it will make a difference because (from the example):

 if (
     accelerator.state.deepspeed_plugin is None
     or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
 ):
     lr_scheduler = get_scheduler(
         name=args.lr_scheduler_type,
         optimizer=optimizer,
         num_warmup_steps=args.num_warmup_steps,
         num_training_steps=args.max_train_steps,
     )
 else:
     lr_scheduler = DummyScheduler(
         optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
     )

The DummyScheduler is only initialized if we added a scheduler type to the DeepSpeed config. Currently, our DeepSpeed config is very simple and does not use a custom DS-provided scheduler/optimizer. So, we will never end up creating a dummy one and will instead use our own optimizer/scheduler defined from the CLI args.

@a-r-r-o-w (Owner)

What is your env?

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.26.2
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.13.3.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

@sayakpaul (Collaborator, Author) commented Dec 26, 2024

Now I am hit with a different error when trying to do HunyuanVideo with DeepSpeed.

My diffusers-cli env on the DGX:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.0
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.4.4
- xFormers version: 0.0.27.post2
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
DeepSpeed config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Training command
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/hunyuan_video/hunyuan_disney"

# Model arguments
model_cmd="--model_name hunyuan_video \
  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token afkx \
  --video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 50 \
  --rank 4 \
  --lora_alpha 4 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 2e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS --main_process_port 29501 train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
Error
ERROR:finetrainers:Traceback (most recent call last):
  File "/home/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/home/sayak/finetrainers/finetrainers/trainer.py", line 717, in train
    accelerator.backward(loss)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/accelerator.py", line 2240, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 246, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Non-DS leads me to an OOM right at the first step (no validation):

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.84 GiB. GPU 0 has a total capacity of 79.15 GiB of which 4.51 GiB is free. Including non-PyTorch memory, this process has 74.63 GiB memory in use. Of the allocated memory 69.83 GiB is allocated by PyTorch, and 4.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Comment on lines -729 to -736
dtype = (
    torch.float16
    if self.args.mixed_precision == "fp16"
    else torch.bfloat16
    if self.args.mixed_precision == "bf16"
    else torch.float32
)
self.transformer = self.transformer.to(dtype)
@sayakpaul (Collaborator, Author) commented on the diff

The transformer should already be in the right data type. If we want users to upcast before getting the final state dict, that is a different concern.
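
For illustration, a hedged sketch of that separate concern, i.e. upcasting the trainable (LoRA) parameters to float32 before extracting the final state dict; the helper name is hypothetical:

import torch

# Hedged sketch: upcast only the trainable LoRA parameters to fp32 so the saved
# state dict is full precision, independent of the training compute dtype.
def upcast_trainable_params(transformer: torch.nn.Module) -> None:
    for param in transformer.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)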

@sayakpaul requested a review from a-r-r-o-w on December 26, 2024, 15:46
@a-r-r-o-w (Owner)

@sayakpaul Taking a look at the OOM. I am going to start with 1x512x768 and see how far we can go on a single 80 GB GPU, and we can update the README accordingly.

The 129x512x768 in the config came as an example. We report memory requirements for 49x512x768 only (which takes under 48 GB without precomputation, and about 43 GB with precomputation).

@sayakpaul (Collaborator, Author)

Sounds good! I will try a couple of different environments for Hunyuan DS support and investigate further. But I guess that doesn’t block this PR.

@a-r-r-o-w (Owner)

For DeepSpeed, with resolution buckets set to 81x512x768:

Memory after epoch 2: {
    "memory_allocated": 39.284,
    "memory_reserved": 62.9,
    "max_memory_allocated": 59.832,
    "max_memory_reserved": 62.9
}
Memory after validation end: {
    "memory_allocated": 24.581,
    "memory_reserved": 26.424,
    "max_memory_allocated": 67.129,
    "max_memory_reserved": 70.771
}

@sayakpaul Could you verify whether these are the numbers you see as well, or do you get an OOM for this too? I'm on PT 2.5.1+cu124 as well, but both DeepSpeed and normal seem to work for me.

@sayakpaul (Collaborator, Author)

Checking shortly.

@a-r-r-o-w (Owner)

And the same, but non-DeepSpeed (uncompiled_2.yaml config):

Memory after epoch 2: {
    "memory_allocated": 40.202,
    "memory_reserved": 67.895,
    "max_memory_allocated": 60.761,
    "max_memory_reserved": 67.895
}
Memory after validation end: {
    "memory_allocated": 25.499,
    "memory_reserved": 28.771,
    "max_memory_allocated": 68.056,
    "max_memory_reserved": 71.977
}

@sayakpaul (Collaborator, Author)

@a-r-r-o-w

I was able to match the numbers, yes. Thanks for all the help! I have also added a small change in 84c1756. PTAL.

There's a problem we face when batch_size is specified to be > 1.

noisy_latents.shape=torch.Size([2, 16, 21, 64, 96])
query.shape=torch.Size([2, 24, 32512, 128]), key.shape=torch.Size([2, 24, 32512, 128]), attention_mask.shape=torch.Size([2, 32512, 32512])
ERROR:finetrainers:An error occurred during training: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1.  Target sizes: [2, 24, 32512, 32512].  Tensor sizes: [2, 32512, 32512]
ERROR:finetrainers:Traceback (most recent call last):
  File "/home/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/home/sayak/finetrainers/finetrainers/trainer.py", line 709, in train
    pred = self.model_config["forward_pass"](
  File "/home/sayak/finetrainers/finetrainers/hunyuan_video/hunyuan_video_lora.py", line 225, in forward_pass
    denoised_latents = transformer(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 742, in forward
    hidden_states, encoder_hidden_states = torch.utils.checkpoint.checkpoint(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
    ret = function(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 735, in custom_forward
    return module(*inputs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 480, in forward
    attn_output, context_attn_output = self.attn(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/attention_processor.py", line 588, in forward
    return self.processor(
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_hunyuan_video.py", line 118, in __call__
    hidden_states = F.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 1.  Target sizes: [2, 24, 32512, 32512].  Tensor sizes: [2, 32512, 32512]

Can be reproduced easily:

import torch 
import torch.nn.functional as F

q = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 24, 32512, 128, device="cuda", dtype=torch.bfloat16)
attn_mask = torch.ones(2, 32512, 32512, device="cuda").bool()

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
print(out.shape)
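
The snippet above raises the RuntimeError. For illustration, a small-shape sketch of the same broadcasting rule, showing a possible workaround (an assumption, not necessarily the fix that will land): giving the mask an explicit head dimension makes it broadcast against the [batch, heads, q_len, k_len] attention scores.

import torch
import torch.nn.functional as F

# Hedged sketch with small shapes: a 3-D mask [batch, q_len, k_len] broadcasts to
# [1, batch, q_len, k_len] and clashes with the head dim, while [batch, 1, q_len, k_len] works.
q = torch.randn(2, 24, 128, 64)
k = torch.randn(2, 24, 128, 64)
v = torch.randn(2, 24, 128, 64)

attn_mask = torch.ones(2, 128, 128).bool()
# F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # RuntimeError: size 2 vs 24 at dim 1

attn_mask = attn_mask.unsqueeze(1)  # [2, 128, 128] -> [2, 1, 128, 128]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([2, 24, 128, 64])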

But I guess this is independent of the PR, so it could be tackled in a different PR? I will work on 8-bit optimizers and Cog next.

@a-r-r-o-w (Owner)

Thanks, LGTM!

Just one doubt: the change in your commit increases the ndim of sigmas, but the error seems to come from the attention mask. Are the two related?

@sayakpaul (Collaborator, Author)

Not related. I changed it so broadcasting works; otherwise, the error happens in the noise-flow interpolation step:

noisy_latents = (1.0 - sigmas) * latent_conditions["latents"] + sigmas * noise
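
For context, a minimal standalone sketch of that broadcasting issue, with shapes assumed from the log above (variable names are illustrative):

import torch

# Hedged sketch: per-example sigmas are 1-D ([batch]) while latents are 5-D
# ([batch, channels, frames, height, width]), so sigmas needs trailing singleton
# dims before the interpolation above can broadcast.
latents = torch.randn(2, 16, 21, 64, 96)
noise = torch.randn_like(latents)
sigmas = torch.rand(2)                                 # one sigma per example
sigmas = sigmas.view(-1, *([1] * (latents.ndim - 1)))  # [2] -> [2, 1, 1, 1, 1]

noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
print(noisy_latents.shape)  # torch.Size([2, 16, 21, 64, 96])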
