[🐛BUG] Training with multiple GPUs does not speed up per-epoch training time #2154

Open
ChuningShi opened this issue Mar 11, 2025 · 0 comments
Labels
bug Something isn't working

@ChuningShi

Describe the bug
Hi, when pretraining S3Rec in RecBole I tried training with 2 GPUs, but the per-epoch training time is almost the same as with 1 GPU.
With 2 GPUs:

[screenshots: per-epoch training progress with 2 GPUs]

With 1 GPU:

[screenshot: per-epoch training progress with 1 GPU]

In addition, when I use 2 GPUs, if worker is not set to 0, the following error is raised (my current workaround is noted after the traceback):

Train     0:   0%|                                                          | 0/548 [00:00<?, ?it/s]/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = amp.GradScaler(enabled=self.enable_scaler)
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /data/user_data/chunings/RecSys-Benchmark/RecBole/wandb/run-20250309_183259-locyxzg8
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run s3rec_amzn-sports
wandb: ⭐️ View project at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark
wandb: 🚀 View run at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark/runs/locyxzg8
/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = amp.GradScaler(enabled=self.enable_scaler)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
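
The only workaround I have found so far is to disable the dataloader workers in the RecBole config. A minimal sketch of the relevant config line, assuming RecBole's standard worker key for the number of dataloader workers (the rest of my config is unchanged):

# Workaround: load data in the main process so the crash above does not occur;
# this avoids the error but gives up parallel data loading.
worker: 0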

How to reproduce
Steps to reproduce the bug:
The GPU-related settings in my bash file are:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2

The run command also sets --nproc 2; a sketch of the full launch line is below.
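
For reference, the launch line in the same bash file looks roughly like the following. Only --nproc 2 is multi-GPU related; the dataset name is taken from the wandb run name above, and the model/dataset flags are illustrative rather than my exact command:

# --nproc 2 matches the two GPUs requested via --gres=gpu:2
python run_recbole.py --model=S3Rec --dataset=amzn-sports --nproc=2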

Expected behavior
Training S3Rec with multiple GPUs is expected to be faster than with a single GPU. Right now one epoch takes about 3.5 hours whether I use one GPU or two.

@ChuningShi ChuningShi added the bug Something isn't working label Mar 11, 2025