RuntimeError: CUDA error: invalid device ordinal #2

Closed
BitCalSaul opened this issue Feb 27, 2024 · 2 comments

@BitCalSaul

Thanks very much for your efforts. I have used RunIt on two servers and found it really useful.
But I encountered an issue when using a commit from two years ago in your repo.
When I run RunIt, it fails with "RuntimeError: CUDA error: invalid device ordinal".
No matter how I change my script in config.txt, I still get this error.
But when I do the same thing on my other server, it runs properly.
This is my command:
python /home/user/RunIt/run_it.py --interpreter python --verbose --gpu-pool 0 1 --max-workers 2 --cmd-pool /home/user/RunIt/ProjCompressor/config.txt
This is output:

Namespace(cmd_pool='/home/user/RunIt/ProjCompressor/config.txt', gpu_pool=[0], interpreter='python', max_used_ratio=0.5, max_workers=2, verbose=True)
[YOUR CMDS]
/home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[CREATE PROCESS OBJECTS]
[ID 0 INFO] NEW PROCESS SLOT ON GPU 0 IS CREATED!
[ID 0 INFO] /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[NEW TASK PID: 16243] CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[ID: 1/1 GPU: 0] Error executing job with overrides: ['epochs=50', 'dividor_value=100000', 'dgroup_id=0', 'dgroups=2', 'model.n_channels=80', 'model.n_blocks=21', 'batch_size=6']
[ID: 1/1 GPU: 0] Traceback (most recent call last):
[ID: 1/1 GPU: 0]   File "/home/user/Compressor/main.py", line 33, in main
[ID: 1/1 GPU: 0]     torch.cuda.set_device(f'cuda:{list(cfg.DDP.gpu)[0]}')
[ID: 1/1 GPU: 0]   File "/data/user/miniconda3/envs/compressor/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
[ID: 1/1 GPU: 0]     torch._C._cuda_setDevice(device)
[ID: 1/1 GPU: 0] RuntimeError: CUDA error: invalid device ordinal
[ID: 1/1 GPU: 0] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[ID: 1/1 GPU: 0] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[ID: 1/1 GPU: 0] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NO MORE COMMANDS, DELETE THE PROCESS SLOT!]
[ALL COMMANDS HAVE BEEN COMPLETED!]
@lartpang
Owner

@BitCalSaul

CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6

This is the actual command that is executed.

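Perhaps your code manually specifies a GPU with a non-zero index. With CUDA_VISIBLE_DEVICES=0, the process sees only one GPU, and it is exposed as index 0, so any other index is an invalid device ordinal.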
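
To illustrate the explanation above: a launcher that pins each job to one GPU typically does so by setting CUDA_VISIBLE_DEVICES in the child process's environment. The following is a minimal sketch of that pattern, not RunIt's actual code. `launch_on_gpu` is a hypothetical helper, and the child here only prints the variable as a stand-in for a real training script.

```python
import os
import subprocess
import sys


def launch_on_gpu(gpu_id, argv):
    """Run argv in a child process that can only see the physical GPU gpu_id.

    Hypothetical helper for illustration; not taken from RunIt.
    """
    env = os.environ.copy()
    # The CUDA runtime in the child enumerates only the devices listed here
    # and renumbers them from 0, so the child must address this GPU as cuda:0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for the training script: report the mask the child process receives.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = launch_on_gpu(0, child)
visible = result.stdout.strip()
print(visible)
```

Because the child sees a single device, a hard-coded `cuda:1` inside the script has nothing to refer to, whichever physical GPU the launcher picked.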

@BitCalSaul
Author

Yeah, it seems I specified the GPU in my code. When I change the index from [1] to [0], it runs properly. Thank you.
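
A script that keeps a configurable GPU index can guard against this mismatch by checking the index against the mask before selecting a device. The sketch below is one possible way to do that, under simplifying assumptions: `check_device_index` is a hypothetical helper, and it assumes CUDA_VISIBLE_DEVICES, when set, is a plain comma-separated list of integer ids. In a PyTorch script, `torch.cuda.device_count()` reports the visible device count directly; it is left out here only to keep the example self-contained.

```python
import os


def check_device_index(index, env=None):
    """Return index if it is valid under CUDA_VISIBLE_DEVICES, else raise ValueError.

    Simplified: assumes the variable, when set, is a comma-separated list of
    integer GPU ids. UUID entries and other edge cases are not handled.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # No mask set: this sketch cannot know the GPU count, so accept the index.
        return index
    visible = [tok for tok in raw.split(",") if tok.strip()]
    if not 0 <= index < len(visible):
        raise ValueError(
            f"device index {index} is out of range: CUDA_VISIBLE_DEVICES={raw!r} "
            f"exposes {len(visible)} device(s), numbered from 0"
        )
    return index


# Under the launcher's mask from this issue, index 0 is valid and index 1 is not.
mask = {"CUDA_VISIBLE_DEVICES": "0"}
ok = check_device_index(0, env=mask)
try:
    check_device_index(1, env=mask)
    rejected = False
except ValueError:
    rejected = True
```

When a launcher already assigns one GPU per job, the simplest option is the one adopted here: always use index 0 inside the script and let the launcher decide which physical GPU that is.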

@lartpang lartpang closed this as completed Mar 1, 2024