CUDA fails while replaying trajectory using GPU backend #908

Open · lyxichigoichie opened this issue Mar 6, 2025 · 11 comments

@lyxichigoichie

While replaying a trajectory using the following command (adapted from the documentation):

python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PullCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
  --use-first-env-state -b "physx_cuda" \
  -c pd_joint_delta_pos -o state \
  --save-traj

it returns the following error. I upgraded ManiSkill from GitHub and reran it, but the error still occurs.

0step [00:00, ?step/s]Traceback (most recent call last):                                                                       | 0/1024 [00:00<?, ?step/s]
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 618, in <module>
    main(parse_args())
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 548, in main
    _, replay_result = _main(
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 401, in _main
    env = gym.make(env_id, **env_kwargs)
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/gymnasium/envs/registration.py", line 802, in make
    env = env_creator(**env_spec_kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/utils/registration.py", line 182, in make
    env = env_spec.make(**kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/utils/registration.py", line 79, in make
    return self.cls(**_kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/tasks/tabletop/pull_cube.py", line 42, in __init__
    super().__init__(*args, robot_uids=robot_uids, **kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 309, in __init__
    obs, _ = self.reset(seed=[2022 + i for i in range(self.num_envs)], options=dict(reconfigure=True))
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 811, in reset
    self._reconfigure(options)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 647, in _reconfigure
    self._setup_scene()
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 1060, in _setup_scene
    physx_system = physx.PhysxGpuSystem(device=self._sim_device)
RuntimeError: CUDA failed
0step [00:00, ?step/s]
  0%|                                                                                                                          | 0/1024 [00:00<?, ?step/s]

I'm sorry that I can't provide any more information.

Best wishes

@StoneT2000
Member

It seems there may be something up with your GPU / nvidia drivers.

Can you tell me what your OS is and what nvidia-smi outputs?

Another test is then to run

import torch
print(torch.cuda.is_available())

@lyxichigoichie
Author

lyxichigoichie commented Mar 7, 2025

The OS is Ubuntu 20.04.

The nvidia-smi output is:

(dp) ➜  ~ nvidia-smi
Fri Mar  7 09:03:59 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off | 00000000:05:00.0  On |                  Off |
| 31%   36C    P0              39W / 425W |   1029MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1111      G   /usr/lib/xorg/Xorg                           90MiB |
|    0   N/A  N/A      2426      G   /usr/lib/xorg/Xorg                          491MiB |
|    0   N/A  N/A      2662      G   /usr/bin/gnome-shell                         77MiB |
|    0   N/A  N/A    128129      G   ...AAAAAAAACAAAAAAAAAA= --shared-files       33MiB |
|    0   N/A  N/A   2427059      G   ...seed-version=20250303-050056.241000      104MiB |
|    0   N/A  N/A   3459530      G   ...erProcess --variations-seed-version      184MiB |
+---------------------------------------------------------------------------------------+

and the output of the test code is:

(dp) ➜  ~ python
Python 3.9.21 (main, Dec 11 2024, 16:24:11) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> 

@StoneT2000
Member

Do any of the other demo scripts work, such as creating a CPU or GPU sim environment? A minimal sketch of such a test is below (assuming ManiSkill 3 is installed; the env id and backend names are the ones used in this thread).
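
import gymnasium as gym
import mani_skill.envs  # importing this registers the ManiSkill environments

# create a small GPU-simulated environment; swap "physx_cuda" for "physx_cpu"
# to test the CPU simulator instead
env = gym.make("PullCube-v1", num_envs=4, sim_backend="physx_cuda")
obs, _ = env.reset(seed=0)
print("reset ok, num_envs =", env.unwrapped.num_envs)
env.close()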

@lyxichigoichie
Author

Replaying a trajectory generated by motion planning using the physx_cpu backend works:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PickCube-v1/motionplanning/trajectory.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200 

Replaying a trajectory generated by RL using the physx_cuda backend fails:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PokeCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200 

@StoneT2000
Member

StoneT2000 commented Mar 7, 2025

@fbxiang any idea? It seems the GPU PhysX system cannot be created.

@fbxiang
Contributor

fbxiang commented Mar 7, 2025

This error seems to be produced by GPU PhysX when CUDA cannot be initialized. Maybe there is a GPU driver problem like an outdated driver or incomplete installation?
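
A minimal way to isolate this, mirroring the failing call from the traceback (physx.PhysxGpuSystem in mani_skill/envs/sapien_env.py) and assuming only that sapien is installed:

import sapien
import sapien.physx as physx

# try to create the GPU PhysX system directly; if this raises
# "RuntimeError: CUDA failed", CUDA/driver initialization is the problem,
# independent of the rest of ManiSkill
system = physx.PhysxGpuSystem(device=sapien.Device("cuda"))
print("PhysxGpuSystem created successfully")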

@StoneT2000
Member

Actually, one more test I forgot to ask about, since sometimes torch says CUDA is available when it doesn't actually work.

@lyxichigoichie Can you do

import torch
print(torch.tensor([1., 2., 3.]).cuda().mean())

If that doesn't work, then your driver is definitely not set up correctly, and I recommend a complete reinstall of your NVIDIA drivers plus a restart.

@lyxichigoichie
Author

Hi Dr. Stone, I executed the code and it works correctly:

(dp) ➜  dp_liyx git:(master) ✗ python                                                                                                                     
Python 3.9.21 (main, Dec 11 2024, 16:24:11) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.tensor([1., 2., 3.]).cuda().mean())
tensor(2., device='cuda:0')
>>> 

I can't restart my computer for a while to test the replay scripts again, since it is connected to the server for training. By the way, I can train using the GPU on my computer normally.

@lyxichigoichie
Author

I will test the replay scripts on my partner's computer to check whether the problem is my graphics card.

@lyxichigoichie
Author

Hi Dr. Stone @StoneT2000, I ran the replay scripts on a GPU server with a 3090 and it works, so it may be a problem with my computer or the 4090D. I will try rebooting and running it again.

But when I run the script on the server with the following command:

(dp) ➜  dp_liyx git:(master) ✗ python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b "cuda:5" \
  -c pd_ee_delta_pose -o state \
  --save-traj

It returns the following error:

(dp) ➜  dp_liyx git:(master) ✗ python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b "cuda:5" \
  -c pd_joint_delta_pos -o state \
  --save-traj
0step [00:00, ?step/s]Traceback (most recent call last):                                                                       | 0/1013 [00:00<?, ?step/s]
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 618, in <module>
    main(parse_args())
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 548, in main
    _, replay_result = _main(
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 401, in _main
    env = gym.make(env_id, **env_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/gymnasium/envs/registration.py", line 802, in make
    env = env_creator(**env_spec_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/utils/registration.py", line 182, in make
    env = env_spec.make(**kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/utils/registration.py", line 79, in make
    return self.cls(**_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/tasks/tabletop/pick_cube.py", line 46, in __init__
    super().__init__(*args, robot_uids=robot_uids, **kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/sapien_env.py", line 231, in __init__
    self.backend = parse_sim_and_render_backend(sim_backend, render_backend)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/utils/system/backend.py", line 43, in parse_sim_and_render_backend
    sim_backend = sim_backend_name_mapping[sim_backend]
KeyError: 'cuda:5'
0step [00:00, ?step/s]

This means that 'cuda:5' is not a key of the sim_backend_name_mapping dictionary. From this I found a small bug in mani_skill/envs/utils/system/backend.py, which I point out in the code comments below:

sim_backend_name_mapping = {
    "cpu": "physx_cpu",
    "cuda": "physx_cuda",
    "gpu": "physx_cuda",
    "physx_cpu": "physx_cpu",
    "physx_cuda": "physx_cuda",
}
render_backend_name_mapping = {
    "cpu": "sapien_cpu",
    "cuda": "sapien_cuda",
    "gpu": "sapien_cuda",
    "sapien_cpu": "sapien_cpu",
    "sapien_cuda": "sapien_cuda",
}


def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    # BUG: if we pass 'cuda:5' as sim_backend, it is looked up in the
    # sim_backend_name_mapping dictionary first, but 'cuda:5' is not a key of
    # that dictionary, so a KeyError is raised before the
    # `sim_backend[:4] == "cuda"` branch below can ever handle it.
    # The mapping lookup must therefore come after that check.
    sim_backend = sim_backend_name_mapping[sim_backend]
    render_backend = render_backend_name_mapping[render_backend]
    if sim_backend == "physx_cpu":
        device = torch.device("cpu")
        sim_device = sapien.Device("cpu")
    elif sim_backend == "physx_cuda":
        device = torch.device("cuda")
        sim_device = sapien.Device("cuda")
    elif sim_backend[:4] == "cuda":
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        raise ValueError(f"Invalid simulation backend: {sim_backend}")

    # TODO (stao): handle checking if system is mac, in which we must then use render_backend = "sapien_cpu"
    # determine render device
    if render_backend == "sapien_cuda":
        render_device = sapien.Device("cuda")
    elif render_backend == "sapien_cpu":
        render_device = sapien.Device("cpu")
    elif render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        # handle special cases such as for AMD gpus, render_backend must be defined as pci:... instead as cuda is not available.
        render_device = sapien.Device(render_backend)
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend_name_mapping[sim_backend],
        render_device=render_device,
        render_backend=render_backend,
    )
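
The failure is easy to reproduce in isolation (a minimal sketch using the sim_backend_name_mapping dictionary shown above):

from mani_skill.envs.utils.system.backend import sim_backend_name_mapping

# 'cuda:5' is not a key of the mapping, so this raises KeyError before the
# `sim_backend[:4] == "cuda"` branch in the function ever runs
sim_backend_name_mapping["cuda:5"]  # KeyError: 'cuda:5'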

I made the following changes to fix the bug (moving the CUDA check to the first position):

def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    if sim_backend[:4] == 'cuda':
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        try:
            sim_backend = sim_backend_name_mapping[sim_backend]
        except KeyError:
            raise ValueError(f"Invalid simulation backend: {sim_backend}")
        else:
            if sim_backend == "physx_cpu":
                device = torch.device("cpu")
                sim_device = sapien.Device("cpu")
            elif sim_backend == "physx_cuda":
                device = torch.device("cuda")
                sim_device = sapien.Device("cuda")
    if render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        try:
            render_backend = render_backend_name_mapping[render_backend]
        except KeyError:
            # keep the original fall-through: special backends such as
            # "pci:..." for AMD GPUs are passed to sapien.Device directly
            # instead of raising an error
            render_device = sapien.Device(render_backend)
        else:
            if render_backend == "sapien_cuda":
                render_device = sapien.Device("cuda")
            elif render_backend == "sapien_cpu":
                render_device = sapien.Device("cpu")
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend,
        render_device=render_device,
        render_backend=render_backend,
    )
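
As a quick sanity check of the fix (a hypothetical snippet; it assumes the patched backend.py and a machine where CUDA device index 5 exists):

from mani_skill.envs.utils.system.backend import parse_sim_and_render_backend

# device-specific strings are now handled before the mapping lookup
info = parse_sim_and_render_backend("cuda:5", "gpu")
print(info.sim_backend)  # cuda:5
print(info.device)       # cuda:5 (a torch.device)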

I will make a pull request.

Best wishes

@StoneT2000
Member

If you want to use GPU number 5, set CUDA_VISIBLE_DEVICES=5 when launching instead of passing -b "cuda:5".
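
For example, reusing the command from this thread (the inline form sets the variable for that single invocation; the process then sees the selected GPU as cuda:0):

CUDA_VISIBLE_DEVICES=5 python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b physx_cuda \
  -c pd_ee_delta_pose -o state \
  --save-traj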
