RuntimeError: CUDA error: invalid device ordinal #2

BitCalSaul · 2024-02-27T03:24:45Z

Thanks very much for your efforts. I used RunIt on my two servers and found it really useful.
But I encountered an issue when using the commit from two years ago in your repo.
When I run RunIt, I get the error "RuntimeError: CUDA error: invalid device ordinal".
No matter how I change my script in config.txt, the error persists.
But when I do the same thing on my other server, it runs properly.
This is my command:
python /home/user/RunIt/run_it.py --interpreter python --verbose --gpu-pool 0 1 --max-workers 2 --cmd-pool /home/user/RunIt/ProjCompressor/config.txt
This is the output:

Namespace(cmd_pool='/home/user/RunIt/ProjCompressor/config.txt', gpu_pool=[0], interpreter='python', max_used_ratio=0.5, max_workers=2, verbose=True)
[YOUR CMDS]
/home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[CREATE PROCESS OBJECTS]
[ID 0 INFO] NEW PROCESS SLOT ON GPU 0 IS CREATED!
[ID 0 INFO] /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[NEW TASK PID: 16243] CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[ID: 1/1 GPU: 0] Error executing job with overrides: ['epochs=50', 'dividor_value=100000', 'dgroup_id=0', 'dgroups=2', 'model.n_channels=80', 'model.n_blocks=21', 'batch_size=6']
[ID: 1/1 GPU: 0] Traceback (most recent call last):
[ID: 1/1 GPU: 0]   File "/home/user/Compressor/main.py", line 33, in main
[ID: 1/1 GPU: 0]     torch.cuda.set_device(f'cuda:{list(cfg.DDP.gpu)[0]}')
[ID: 1/1 GPU: 0]   File "/data/user/miniconda3/envs/compressor/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
[ID: 1/1 GPU: 0]     torch._C._cuda_setDevice(device)
[ID: 1/1 GPU: 0] RuntimeError: CUDA error: invalid device ordinal
[ID: 1/1 GPU: 0] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[ID: 1/1 GPU: 0] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[ID: 1/1 GPU: 0] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NO MORE COMMANDS, DELETE THE PROCESS SLOT!]
[ALL COMMANDS HAVE BEEN COMPLETED!]

lartpang · 2024-02-28T01:21:10Z

@BitCalSaul

CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6

This is the actual command that is executed.

Perhaps in your code, you manually specified GPUs with non-zero index numbers.
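
For context, here is a hedged illustration of why a non-zero index can fail in this setup. When `CUDA_VISIBLE_DEVICES=0` is set, the process sees only one GPU, and CUDA renumbers the visible devices starting from 0, so `cuda:1` does not exist inside that process. The sketch below is plain Python and does not call PyTorch. The helper names `visible_device_count` and `is_valid_ordinal` are hypothetical, and the parsing is simplified: it assumes the variable holds a comma-separated list of valid device IDs.

```python
import os


def visible_device_count(env=None):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, meaning every physical GPU
    is visible and this sketch cannot tell how many there are.
    Simplifying assumption: every listed entry is a valid device ID.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def is_valid_ordinal(ordinal, env=None):
    """Check whether an in-process device index can exist.

    Visible devices are renumbered 0..count-1 inside the process,
    regardless of their physical IDs.
    """
    count = visible_device_count(env)
    if count is None:
        return True  # unknown device count; nothing to check against
    return 0 <= ordinal < count


if __name__ == "__main__":
    launcher_env = {"CUDA_VISIBLE_DEVICES": "0"}
    for index in (0, 1):
        print(index, is_valid_ordinal(index, launcher_env))
```

Under these assumptions, a script launched with a single visible device can only address index 0, whichever physical GPU that happens to be.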

BitCalSaul · 2024-02-29T18:24:13Z

Yeah, it seems I specified the GPU in the code. When I changed the index from [1] to [0], it ran properly. Thank you.

lartpang closed this as completed Mar 1, 2024
