Train 0: 0%| | 0/548 [00:00<?, ?it/s]/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = amp.GradScaler(enabled=self.enable_scaler)
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /data/user_data/chunings/RecSys-Benchmark/RecBole/wandb/run-20250309_183259-locyxzg8
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run s3rec_amzn-sports
wandb: ⭐️ View project at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark
wandb: 🚀 View run at https://wandb.ai/chunings-carnegie-mellon-university/RecSys-Benchmark/runs/locyxzg8
/data/user_data/chunings/RecSys-Benchmark/RecBole/recbole/trainer/trainer.py:235: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = amp.GradScaler(enabled=self.enable_scaler)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/chunings/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 546, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size: Invalid argument (22)
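As an aside, the FutureWarning at the top of the log comes from the trainer building the scaler through the deprecated torch.cuda.amp path. A minimal sketch of the device-qualified constructor the warning asks for, assuming a recent PyTorch that exposes torch.amp.GradScaler (enable_scaler below stands in for the trainer's self.enable_scaler flag):

import torch

# deprecated form reported by the warning (recbole/trainer/trainer.py:235):
#   scaler = torch.cuda.amp.GradScaler(enabled=self.enable_scaler)
# replacement suggested by the FutureWarning: name the device explicitly
enable_scaler = True  # stand-in for self.enable_scaler
scaler = torch.amp.GradScaler("cuda", enabled=enable_scaler)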
Describe the bug
Hi, when pretraining S3Rec on RecBole I tried training with 2 GPUs, but the training speed per epoch is almost the same as with 1 GPU.
With 2 GPUs:
With 1 GPU:
Also, when I use 2 GPUs, if worker is not set to 0, the run fails with the error shown in the traceback at the top of this report.
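For context, worker here is RecBole's config option for the number of dataloader worker processes. A rough sketch of the toggle, assuming the standard recbole.quick_start.run_recbole entry point (the dataset name is taken from my run; the config keys are only for illustration):

from recbole.quick_start import run_recbole

# worker=0 keeps data loading in the main process and the run trains (slowly);
# any worker > 0 spawns dataloader subprocesses, which crash with the
# "unable to resize file" RuntimeError above when I launch with 2 GPUs.
run_recbole(
    model="S3Rec",
    dataset="amzn-sports",
    config_dict={"worker": 0},
)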
To Reproduce
Steps to reproduce the bug:
The GPU-related settings in my bash file look like this:
The run command also sets --nproc 2.
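For completeness, the launch is roughly equivalent to the following, assuming a recent RecBole where recbole.quick_start.run accepts an nproc argument (the single-GPU baseline is the same call without nproc; the gpu_id and worker values mirror my setup and are only illustrative):

from recbole.quick_start import run

# illustrative 2-GPU launch; each of the nproc processes gets one of the GPUs in gpu_id
run(
    "S3Rec",
    "amzn-sports",
    config_dict={"gpu_id": "0,1", "worker": 0},
    nproc=2,
)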
Expected behavior
I expected training S3Rec with multiple GPUs to be faster than with a single GPU. Right now one epoch of S3Rec training takes about 3.5 hours whether I use a single GPU or multiple GPUs.