Thanks very much for your efforts. I use RunIt on two servers and find it really useful.
However, I ran into an issue when using the commit from two years ago in your repo.
When I run RunIt, it fails with the error "RuntimeError: CUDA error: invalid device ordinal".
No matter how I change my script in config.txt, the error persists.
When I do the same thing on my other server, it runs properly.
This is my command: python /home/user/RunIt/run_it.py --interpreter python --verbose --gpu-pool 0 1 --max-workers 2 --cmd-pool /home/user/RunIt/ProjCompressor/config.txt
This is the output:
Namespace(cmd_pool='/home/user/RunIt/ProjCompressor/config.txt', gpu_pool=[0], interpreter='python', max_used_ratio=0.5, max_workers=2, verbose=True)
[YOUR CMDS]
/home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[CREATE PROCESS OBJECTS]
[ID 0 INFO] NEW PROCESS SLOT ON GPU 0 IS CREATED!
[ID 0 INFO] /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[NEW TASK PID: 16243] CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[ID: 1/1 GPU: 0] Error executing job with overrides: ['epochs=50', 'dividor_value=100000', 'dgroup_id=0', 'dgroups=2', 'model.n_channels=80', 'model.n_blocks=21', 'batch_size=6']
[ID: 1/1 GPU: 0] Traceback (most recent call last):
[ID: 1/1 GPU: 0] File "/home/user/Compressor/main.py", line 33, in main
[ID: 1/1 GPU: 0] torch.cuda.set_device(f'cuda:{list(cfg.DDP.gpu)[0]}')
[ID: 1/1 GPU: 0] File "/data/user/miniconda3/envs/compressor/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
[ID: 1/1 GPU: 0] torch._C._cuda_setDevice(device)
[ID: 1/1 GPU: 0] RuntimeError: CUDA error: invalid device ordinal
[ID: 1/1 GPU: 0] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[ID: 1/1 GPU: 0] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[ID: 1/1 GPU: 0] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[ID: 1/1 GPU: 0]
[ID: 1/1 GPU: 0]
[ID: 1/1 GPU: 0] Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NO MORE COMMANDS, DELETE THE PROCESS SLOT!]
[ALL COMMANDS HAVE BEEN COMPLETED!]
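A possible cause, not confirmed here: the [NEW TASK] line shows the job being launched with CUDA_VISIBLE_DEVICES=0, so inside that process only one GPU is visible and its logical index is 0. If cfg.DDP.gpu in main.py holds a physical GPU id other than 0 on this server, torch.cuda.set_device would be asked for an ordinal that does not exist in the process, which would match the error. The sketch below illustrates that mapping in plain Python. The helper name logical_device_index is hypothetical (it is not part of RunIt or PyTorch), and it assumes CUDA_VISIBLE_DEVICES contains comma-separated integer ids rather than GPU UUIDs.

```python
import os


def logical_device_index(physical_id, visible=None):
    """Map a physical GPU id to the index a process sees under CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only. Assumes the variable holds
    comma-separated integer ids (not GPU UUIDs).
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: logical and physical indices coincide.
        return physical_id
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    if physical_id not in ids:
        # This is the situation that would surface as "invalid device ordinal".
        raise ValueError(
            f"GPU {physical_id} is not visible (CUDA_VISIBLE_DEVICES={visible!r})"
        )
    return ids.index(physical_id)


if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES=0, only physical GPU 0 maps to a valid index.
    print(logical_device_index(0, "0"))
    try:
        logical_device_index(1, "0")
    except ValueError as exc:
        print(exc)
```

If this is what is happening, comparing the value of cfg.DDP.gpu on the two servers would be the first thing to check.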