CUDA Fail while replaying trajectory using gpu backend #908
It seems there may be something up with your GPU / NVIDIA drivers. Can you tell me what your OS is and what nvidia-smi reports? Another test is then to run torch.cuda.is_available() in Python.
The OS is Ubuntu 20.04. The output of nvidia-smi is:

(dp) ➜  ~ nvidia-smi
Fri Mar 7 09:03:59 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 D Off | 00000000:05:00.0 On | Off |
| 31% 36C P0 39W / 425W | 1029MiB / 24564MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1111 G /usr/lib/xorg/Xorg 90MiB |
| 0 N/A N/A 2426 G /usr/lib/xorg/Xorg 491MiB |
| 0 N/A N/A 2662 G /usr/bin/gnome-shell 77MiB |
| 0 N/A N/A 128129 G ...AAAAAAAACAAAAAAAAAA= --shared-files 33MiB |
| 0 N/A N/A 2427059 G ...seed-version=20250303-050056.241000 104MiB |
| 0 N/A N/A 3459530 G ...erProcess --variations-seed-version 184MiB |
+---------------------------------------------------------------------------------------+
And the output of the CUDA check in Python is:

(dp) ➜  ~ python
Python 3.9.21 (main, Dec 11 2024, 16:24:11)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>>
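A few additional PyTorch-side checks can help narrow down driver issues (a minimal sketch; device index 0 is assumed to be the RTX 4090 D):

import torch

# Versions of PyTorch and of the CUDA toolkit it was built against
print(torch.__version__)
print(torch.version.cuda)

# Name and compute capability of the first visible GPU (assumed to be index 0)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))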
Do any of the other demo scripts work, like creating a CPU or GPU sim environment?
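For example, a minimal check along these lines (a sketch assuming ManiSkill's gymnasium registration and the sim_backend keyword; PickCube-v1 is just an example task):

import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments

# CPU-backed sim: should work regardless of the GPU setup
env = gym.make("PickCube-v1", obs_mode="state", sim_backend="physx_cpu")
env.reset()
env.close()

# GPU-backed sim: this is where a broken CUDA / GPU PhysX setup would fail
env = gym.make("PickCube-v1", obs_mode="state", sim_backend="physx_cuda")
env.reset()
env.close()
print("CPU and GPU sim environments both created successfully")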
While replaying the trajectory generated by motion planning using the physx_cpu backend, it works:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PickCube-v1/motionplanning/trajectory.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200

While replaying the trajectory generated by RL using the physx_cuda backend, it fails:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PokeCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200
@fbxiang any idea? It seems the GPU PhysX system cannot be created.
This error seems to be produced by GPU PhysX when CUDA cannot be initialized. Maybe there is a GPU driver problem like an outdated driver or incomplete installation?
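One way to check this in isolation is to try creating the GPU PhysX system directly (a minimal sketch, assuming SAPIEN 3's sapien.physx.PhysxGpuSystem API; if this raises, the problem lies in the CUDA / driver setup rather than in the replay script):

import sapien

# Creating the GPU PhysX system exercises the same CUDA initialization path
# that the trajectory replay fails on when using the physx_cuda backend.
system = sapien.physx.PhysxGpuSystem()
print("GPU PhysX system created:", system)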
Actually, one more test I forgot to ask for, since sometimes torch says CUDA is available when it doesn't actually work. @lyxichigoichie Can you run

import torch
print(torch.tensor([1., 2., 3.]).cuda().mean())

If that doesn't work, then your driver is definitely not set up correctly and I recommend a complete reinstall of your NVIDIA drivers plus a restart of the computer.
Hi Dr. Stone, I executed the code and it works correctly:

(dp) ➜ dp_liyx git:(master) ✗ python
Python 3.9.21 (main, Dec 11 2024, 16:24:11)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.tensor([1., 2., 3.]).cuda().mean())
tensor(2., device='cuda:0')
>>>

I cannot restart my computer for a while to test the replay scripts again, since it is connected to the server for training. Oh, by the way, I can train using the GPU on my computer normally.
I will test the replay scripts on my partner's computer to check whether it is a problem with my graphics card.
Hi Dr. Stone @StoneT2000, I ran the replay scripts on a GPU server with a 3090 and it works, so it may be a problem with my computer or with the 4090D. I will try to reboot it and run it again. However, when I run the script on the server with the following command:

(dp) ➜ dp_liyx git:(master) ✗ python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
--use-first-env-state -b "cuda:5" \
-c pd_ee_delta_pose -o state \
--save-traj

it returns an error saying that 'cuda:5' is not a key of the sim_backend_name_mapping dictionary. From this I found a small bug in parse_sim_and_render_backend:

import sapien
import torch
# (BackendInfo is defined elsewhere in ManiSkill)

sim_backend_name_mapping = {
    "cpu": "physx_cpu",
    "cuda": "physx_cuda",
    "gpu": "physx_cuda",
    "physx_cpu": "physx_cpu",
    "physx_cuda": "physx_cuda",
}

render_backend_name_mapping = {
    "cpu": "sapien_cpu",
    "cuda": "sapien_cuda",
    "gpu": "sapien_cuda",
    "sapien_cpu": "sapien_cpu",
    "sapien_cuda": "sapien_cuda",
}
def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    # If we pass 'cuda:5' as sim_backend, it is looked up in
    # sim_backend_name_mapping first, but 'cuda:5' is not a key of that
    # dictionary, so this line raises a KeyError. The mapping should instead
    # happen after the `sim_backend[:4] == "cuda"` check below.
    sim_backend = sim_backend_name_mapping[sim_backend]
    render_backend = render_backend_name_mapping[render_backend]
    if sim_backend == "physx_cpu":
        device = torch.device("cpu")
        sim_device = sapien.Device("cpu")
    elif sim_backend == "physx_cuda":
        device = torch.device("cuda")
        sim_device = sapien.Device("cuda")
    elif sim_backend[:4] == "cuda":
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        raise ValueError(f"Invalid simulation backend: {sim_backend}")
    # TODO (stao): handle checking if system is mac, in which we must then use render_backend = "sapien_cpu"
    # determine render device
    if render_backend == "sapien_cuda":
        render_device = sapien.Device("cuda")
    elif render_backend == "sapien_cpu":
        render_device = sapien.Device("cpu")
    elif render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        # handle special cases such as AMD GPUs, where render_backend must be given as pci:... since cuda is not available
        render_device = sapien.Device(render_backend)
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend_name_mapping[sim_backend],
        render_device=render_device,
        render_backend=render_backend,
    )

I made the following change to fix the bug (moving the cuda check to the first position):

def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    if sim_backend[:4] == 'cuda':
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        try:
            sim_backend = sim_backend_name_mapping[sim_backend]
        except KeyError:
            raise ValueError(f"Invalid simulation backend: {sim_backend}")
        else:
            if sim_backend == "physx_cpu":
                device = torch.device("cpu")
                sim_device = sapien.Device("cpu")
            elif sim_backend == "physx_cuda":
                device = torch.device("cuda")
                sim_device = sapien.Device("cuda")
    if render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        try:
            render_backend = render_backend_name_mapping[render_backend]
        except KeyError:
            raise ValueError(f"Invalid render backend: {render_backend}")
        else:
            if render_backend == "sapien_cuda":
                render_device = sapien.Device("cuda")
            elif render_backend == "sapien_cpu":
                render_device = sapien.Device("cpu")
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend,
        render_device=render_device,
        render_backend=render_backend,
    )

I will make a pull request. Best wishes.
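For reference, a quick sanity check of the fixed behavior (a sketch; it assumes parse_sim_and_render_backend is imported from the ManiSkill module in which it is defined):

# assumes parse_sim_and_render_backend is already imported / in scope

# With the fix, a device-specific string such as "cuda:5" no longer raises a
# KeyError: it is handled by the sim_backend[:4] == "cuda" branch first.
info = parse_sim_and_render_backend("cuda:5", "sapien_cuda")
print(info.device)          # cuda:5
print(info.sim_backend)     # "cuda:5" is passed through unchanged by this fix

# The short aliases still go through the name mappings as before.
info = parse_sim_and_render_backend("gpu", "gpu")
print(info.sim_backend)     # "physx_cuda"
print(info.render_backend)  # "sapien_cuda"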
If you want to use GPU number 5, use
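One common way (given here as an assumption, not necessarily the originally suggested one) is to restrict CUDA's visible devices before any CUDA library is initialized:

import os

# Must be set before torch / sapien initialize CUDA, otherwise it has no effect
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch

print(torch.cuda.device_count())    # 1: only physical GPU 5 is visible
print(torch.cuda.current_device())  # 0: it appears as cuda:0 inside the process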
While replaying a trajectory using the following command (adapted from the documentation):

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PullCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
--use-first-env-state -b "physx_cuda" \
-c pd_joint_delta_pos -o state \
--save-traj

it returns the following error. I upgraded ManiSkill from GitHub and reran the command, but the error still occurs.
I am sorry that I can't provide any other information.
Best wishes