CUDA fails while replaying trajectory using GPU backend #908

Open · lyxichigoichie opened this issue Mar 6, 2025 · 11 comments

@lyxichigoichie

While replaying a trajectory using the following command (adapted from the documentation):

python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PullCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
  --use-first-env-state -b "physx_cuda" \
  -c pd_joint_delta_pos -o state \
  --save-traj

it returns the following error. I upgraded ManiSkill from GitHub and reran it, but the error still occurs.

0step [00:00, ?step/s]Traceback (most recent call last):                                                                       | 0/1024 [00:00<?, ?step/s]
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 618, in <module>
    main(parse_args())
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 548, in main
    _, replay_result = _main(
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/trajectory/replay_trajectory.py", line 401, in _main
    env = gym.make(env_id, **env_kwargs)
  File "/home/lyxichigoichie/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/gymnasium/envs/registration.py", line 802, in make
    env = env_creator(**env_spec_kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/utils/registration.py", line 182, in make
    env = env_spec.make(**kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/utils/registration.py", line 79, in make
    return self.cls(**_kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/tasks/tabletop/pull_cube.py", line 42, in __init__
    super().__init__(*args, robot_uids=robot_uids, **kwargs)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 309, in __init__
    obs, _ = self.reset(seed=[2022 + i for i in range(self.num_envs)], options=dict(reconfigure=True))
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 811, in reset
    self._reconfigure(options)
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 647, in _reconfigure
    self._setup_scene()
  File "/home/lyxichigoichie/3rdparty/ManiSkill/mani_skill/envs/sapien_env.py", line 1060, in _setup_scene
    physx_system = physx.PhysxGpuSystem(device=self._sim_device)
RuntimeError: CUDA failed
0step [00:00, ?step/s]
  0%|                                                                                                                          | 0/1024 [00:00<?, ?step/s]

I'm sorry that I can't provide any more information.

Best wishes

@StoneT2000
Member

It seems there may be something up with your GPU / nvidia drivers.

Can you tell me what your OS is and what nvidia-smi outputs?

Another test is then to run

import torch
print(torch.cuda.is_available())

@lyxichigoichie
Author

lyxichigoichie commented Mar 7, 2025

The OS is Ubuntu 20.04.

The nvidia-smi output is:

(dp) ➜  ~ nvidia-smi
Fri Mar  7 09:03:59 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off | 00000000:05:00.0  On |                  Off |
| 31%   36C    P0              39W / 425W |   1029MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1111      G   /usr/lib/xorg/Xorg                           90MiB |
|    0   N/A  N/A      2426      G   /usr/lib/xorg/Xorg                          491MiB |
|    0   N/A  N/A      2662      G   /usr/bin/gnome-shell                         77MiB |
|    0   N/A  N/A    128129      G   ...AAAAAAAACAAAAAAAAAA= --shared-files       33MiB |
|    0   N/A  N/A   2427059      G   ...seed-version=20250303-050056.241000      104MiB |
|    0   N/A  N/A   3459530      G   ...erProcess --variations-seed-version      184MiB |
+---------------------------------------------------------------------------------------+

and the output of the test code is:

(dp) ➜  ~ python
Python 3.9.21 (main, Dec 11 2024, 16:24:11) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> 

@StoneT2000
Member

Do any of the other demo scripts work, such as creating a CPU or GPU sim environment? A minimal sketch of such a test is below (assuming ManiSkill 3 is installed; the env id and backend names are the ones used in this thread).
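
import gymnasium as gym
import mani_skill.envs  # importing this registers the ManiSkill environments

# create a small GPU-simulated environment; swap "physx_cuda" for "physx_cpu"
# to test the CPU simulator instead
env = gym.make("PullCube-v1", num_envs=4, sim_backend="physx_cuda")
obs, _ = env.reset(seed=0)
print("reset ok, num_envs =", env.unwrapped.num_envs)
env.close()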

@lyxichigoichie
Author

Replaying a trajectory generated by motion planning using the physx_cpu backend works:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PickCube-v1/motionplanning/trajectory.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200 

Replaying a trajectory generated by RL using the physx_cuda backend fails:

python -m mani_skill.trajectory.replay_trajectory \
--traj-path data/PokeCube-v1/rl/trajectory.none.pd_joint_delta_pos.physx_cuda.h5 \
--use-first-env-state \
-c pd_joint_pos \
-o rgbd \
--save-traj \
--count 200 

@StoneT2000
Member

StoneT2000 commented Mar 7, 2025

@fbxiang any idea? It seems the GPU PhysX system cannot be created.

@fbxiang
Contributor

fbxiang commented Mar 7, 2025

This error seems to be produced by GPU PhysX when CUDA cannot be initialized. Maybe there is a GPU driver problem like an outdated driver or incomplete installation?
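
A minimal way to isolate this, mirroring the failing call from the traceback (physx.PhysxGpuSystem in mani_skill/envs/sapien_env.py) and assuming only that sapien is installed:

import sapien
import sapien.physx as physx

# try to create the GPU PhysX system directly; if this raises
# "RuntimeError: CUDA failed", CUDA/driver initialization is the problem,
# independent of the rest of ManiSkill
system = physx.PhysxGpuSystem(device=sapien.Device("cuda"))
print("PhysxGpuSystem created successfully")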

@StoneT2000
Member

Actually, one more test I forgot to ask about, since sometimes torch says CUDA is available when it doesn't actually work.

@lyxichigoichie Can you do

import torch
print(torch.tensor([1., 2., 3.]).cuda().mean())

If that doesn't work, then your driver is definitely not set up correctly, and I recommend a complete reinstall of your NVIDIA drivers plus a restart.

@lyxichigoichie
Author

Hi Dr. Stone, I executed the code and it works correctly:

(dp) ➜  dp_liyx git:(master) ✗ python                                                                                                                     
Python 3.9.21 (main, Dec 11 2024, 16:24:11) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.tensor([1., 2., 3.]).cuda().mean())
tensor(2., device='cuda:0')
>>> 

I can't restart my computer for a while to test the replay scripts again, since it is connected to the server for training. By the way, I can train using the GPU on my computer normally.

@lyxichigoichie
Author

I will test the replay scripts on my partner's computer to check whether the problem is my graphics card.

@lyxichigoichie
Author

Hi Dr. Stone @StoneT2000, I ran the replay scripts on a GPU server with a 3090 and it works, so it may be a problem with my computer or the 4090D. I will try rebooting and running it again.

But when I run the script on the server with the following command:

(dp) ➜  dp_liyx git:(master) ✗ python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b "cuda:5" \
  -c pd_ee_delta_pose -o state \
  --save-traj

It returns the following error:

(dp) ➜  dp_liyx git:(master) ✗ python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b "cuda:5" \
  -c pd_joint_delta_pos -o state \
  --save-traj
0step [00:00, ?step/s]Traceback (most recent call last):                                                                       | 0/1013 [00:00<?, ?step/s]
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 618, in <module>
    main(parse_args())
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 548, in main
    _, replay_result = _main(
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/trajectory/replay_trajectory.py", line 401, in _main
    env = gym.make(env_id, **env_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/gymnasium/envs/registration.py", line 802, in make
    env = env_creator(**env_spec_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/utils/registration.py", line 182, in make
    env = env_spec.make(**kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/utils/registration.py", line 79, in make
    return self.cls(**_kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/tasks/tabletop/pick_cube.py", line 46, in __init__
    super().__init__(*args, robot_uids=robot_uids, **kwargs)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/sapien_env.py", line 231, in __init__
    self.backend = parse_sim_and_render_backend(sim_backend, render_backend)
  File "/home/liyx/3rdparty/miniforge3/envs/dp/lib/python3.9/site-packages/mani_skill/envs/utils/system/backend.py", line 43, in parse_sim_and_render_backend
    sim_backend = sim_backend_name_mapping[sim_backend]
KeyError: 'cuda:5'
0step [00:00, ?step/s]

This means that 'cuda:5' is not a key of the sim_backend_name_mapping dictionary. From this I found a small bug in mani_skill/envs/utils/system/backend.py, which I point out in the code comments below:

sim_backend_name_mapping = {
    "cpu": "physx_cpu",
    "cuda": "physx_cuda",
    "gpu": "physx_cuda",
    "physx_cpu": "physx_cpu",
    "physx_cuda": "physx_cuda",
}
render_backend_name_mapping = {
    "cpu": "sapien_cpu",
    "cuda": "sapien_cuda",
    "gpu": "sapien_cuda",
    "sapien_cpu": "sapien_cpu",
    "sapien_cuda": "sapien_cuda",
}


def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    # BUG: if we pass 'cuda:5' as sim_backend, it is looked up in the
    # sim_backend_name_mapping dictionary first, but 'cuda:5' is not a key of
    # that dictionary, so a KeyError is raised before the
    # `sim_backend[:4] == "cuda"` branch below can ever handle it.
    # The mapping lookup must therefore come after that check.
    sim_backend = sim_backend_name_mapping[sim_backend]
    render_backend = render_backend_name_mapping[render_backend]
    if sim_backend == "physx_cpu":
        device = torch.device("cpu")
        sim_device = sapien.Device("cpu")
    elif sim_backend == "physx_cuda":
        device = torch.device("cuda")
        sim_device = sapien.Device("cuda")
    elif sim_backend[:4] == "cuda":
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        raise ValueError(f"Invalid simulation backend: {sim_backend}")

    # TODO (stao): handle checking if system is mac, in which we must then use render_backend = "sapien_cpu"
    # determine render device
    if render_backend == "sapien_cuda":
        render_device = sapien.Device("cuda")
    elif render_backend == "sapien_cpu":
        render_device = sapien.Device("cpu")
    elif render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        # handle special cases such as for AMD gpus, render_backend must be defined as pci:... instead as cuda is not available.
        render_device = sapien.Device(render_backend)
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend_name_mapping[sim_backend],
        render_device=render_device,
        render_backend=render_backend,
    )
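
The failure is easy to reproduce in isolation (a minimal sketch using the sim_backend_name_mapping dictionary shown above):

from mani_skill.envs.utils.system.backend import sim_backend_name_mapping

# 'cuda:5' is not a key of the mapping, so this raises KeyError before the
# `sim_backend[:4] == "cuda"` branch in the function ever runs
sim_backend_name_mapping["cuda:5"]  # KeyError: 'cuda:5'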

I made the following changes to fix the bug (moving the CUDA check to the first position):

def parse_sim_and_render_backend(sim_backend: str, render_backend: str) -> BackendInfo:
    if sim_backend[:4] == 'cuda':
        device = torch.device(sim_backend)
        sim_device = sapien.Device(sim_backend)
    else:
        try:
            sim_backend = sim_backend_name_mapping[sim_backend]
        except KeyError:
            raise ValueError(f"Invalid simulation backend: {sim_backend}")
        else:
            if sim_backend == "physx_cpu":
                device = torch.device("cpu")
                sim_device = sapien.Device("cpu")
            elif sim_backend == "physx_cuda":
                device = torch.device("cuda")
                sim_device = sapien.Device("cuda")
    if render_backend[:4] == "cuda":
        render_device = sapien.Device(render_backend)
    else:
        try:
            render_backend = render_backend_name_mapping[render_backend]
        except KeyError:
            # keep the original fall-through: special backends such as
            # "pci:..." for AMD GPUs are passed to sapien.Device directly
            # instead of raising an error
            render_device = sapien.Device(render_backend)
        else:
            if render_backend == "sapien_cuda":
                render_device = sapien.Device("cuda")
            elif render_backend == "sapien_cpu":
                render_device = sapien.Device("cpu")
    return BackendInfo(
        device=device,
        sim_device=sim_device,
        sim_backend=sim_backend,
        render_device=render_device,
        render_backend=render_backend,
    )
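
As a quick sanity check of the fix (a hypothetical snippet; it assumes the patched backend.py and a machine where CUDA device index 5 exists):

from mani_skill.envs.utils.system.backend import parse_sim_and_render_backend

# device-specific strings are now handled before the mapping lookup
info = parse_sim_and_render_backend("cuda:5", "gpu")
print(info.sim_backend)  # cuda:5
print(info.device)       # cuda:5 (a torch.device)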

I will make a pull request.

Best wishes

@StoneT2000
Member

If you want to use GPU number 5, set CUDA_VISIBLE_DEVICES=5 when launching instead of passing -b "cuda:5".
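
For example, reusing the command from this thread (the inline form sets the variable for that single invocation; the process then sees the selected GPU as cuda:0):

CUDA_VISIBLE_DEVICES=5 python -m mani_skill.trajectory.replay_trajectory \
  --traj-path data/PickCube-v1/rl/trajectory.none.pd_ee_delta_pose.physx_cuda.h5 \
  --use-first-env-state -b physx_cuda \
  -c pd_ee_delta_pose -o state \
  --save-traj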
