[🐛BUG] Training with multiple GPUs does not speed up per-epoch training time #2154

Open
ChuningShi opened this issue Mar 11, 2025 · 0 comments
Labels
bug Something isn't working

@ChuningShi

Describe the bug
Hi, when pretraining S3Rec in RecBole I tried training with 2 GPUs, but the per-epoch training time is almost the same as with 1 GPU.
With 2 GPUs:

[screenshots: per-epoch training progress with 2 GPUs]

With 1 GPU:

[screenshot: per-epoch training progress with 1 GPU]

In addition, when I use 2 GPUs, if worker is not set to 0, the following error is raised (my current workaround is noted after the traceback):

Train     0:   0%|                                                          | 0/548 [00:00<?, ?it/s]/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = amp.GradScaler(enabled=self.enable_scaler)
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /data/user_data/chunings/RecSys-Benchmark/RecBole/wandb/run-20250309_183259-locyxzg8
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run s3rec_amzn-sports
wandb: ⭐️ View project at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark
wandb: 🚀 View run at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark/runs/locyxzg8
/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = amp.GradScaler(enabled=self.enable_scaler)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
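
The only workaround I have found so far is to disable the dataloader workers in the RecBole config. A minimal sketch of the relevant config line, assuming RecBole's standard worker key for the number of dataloader workers (the rest of my config is unchanged):

# Workaround: load data in the main process so the crash above does not occur;
# this avoids the error but gives up parallel data loading.
worker: 0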

How to reproduce
Steps to reproduce the bug:
The GPU-related settings in my bash file are:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2

The run command also sets --nproc 2; a sketch of the full launch line is below.
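
For reference, the launch line in the same bash file looks roughly like the following. Only --nproc 2 is multi-GPU related; the dataset name is taken from the wandb run name above, and the model/dataset flags are illustrative rather than my exact command:

# --nproc 2 matches the two GPUs requested via --gres=gpu:2
python run_recbole.py --model=S3Rec --dataset=amzn-sports --nproc=2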

Expected behavior
Training S3Rec with multiple GPUs is expected to be faster than with a single GPU. Right now one epoch takes about 3.5 hours whether I use one GPU or two.

@ChuningShi ChuningShi added the bug Something isn't working label Mar 11, 2025