[Feature]: Support torch.distributed as the runtime for multi-node inference #12511

gaocegege opened this issue Jan 28, 2025 · 12 comments

@gaocegege
Contributor

gaocegege commented Jan 28, 2025

🚀 The feature, motivation and pitch

We currently support Ray-based distributed inference, which requires setting up a Ray cluster. This issue requests support for multi-node inference with torch.distributed as the runtime instead.

Usage Example:

# Server 1 (rank 0)
vllm serve model_tag --nnodes 2 --rank 0 --dist-init-addr 192.168.0.1:5000

# Server 2 (rank 1) -- both nodes point at rank 0's init address
vllm serve model_tag --nnodes 2 --rank 1 --dist-init-addr 192.168.0.1:5000

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@gaocegege gaocegege added the feature request New feature or request label Jan 28, 2025
@plops655

plops655 commented Feb 2, 2025

Why not simply use TorchTrainer in the Ray Train library?

@gaocegege
Contributor Author

Why not simply use TorchTrainer in the Ray Train library?

I aim to simplify the deployment of multi-node inference using the vLLM Production Stack instead of configuring a Ray cluster on Kubernetes. I'm concerned that TorchTrainer may not be beneficial for this purpose.

@tsaoyu

tsaoyu commented Feb 10, 2025

I am in favor of this proposal: the Ray setup demands a lot of knowledge whenever anything goes wrong with it. Providing a Ray-free version for users who just want inference, and a Ray-based SPMD version for advanced users such as OpenRLHF, is a valid split.

@Jeffwan
Contributor

Jeffwan commented Feb 10, 2025

Yeah, this is reasonable. I raised a similar issue earlier: #3902

@gaocegege
Contributor Author

I’ll give it a try, though I don’t have much time to dedicate to it. We could adopt a design similar to this PR. The key difference is that workers (excluding rank 0) should enter a loop and wait for inputs from the driver (rank 0 worker).
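A minimal sketch of that worker loop, assuming an already-initialized torch.distributed process group and a hypothetical worker object with callable methods (this does not reflect vLLM's actual worker internals):

# Non-rank-0 workers block in a loop, waiting for the driver (rank 0) to
# broadcast the next method call; a None payload tells them to shut down.
import torch.distributed as dist

def worker_loop(worker):
    while True:
        payload = [None]
        # Rank 0 fills the list before broadcasting; the other ranks receive it.
        dist.broadcast_object_list(payload, src=0)
        if payload[0] is None:  # sentinel from the driver
            break
        method, args, kwargs = payload[0]
        getattr(worker, method)(*args, **kwargs)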

@plops655

I am working on this and have a question. The main PyTorch API for distributing a model across multiple nodes is FSDP. However, FSDP shards model data across GPUs by taking the full model as input.

Having to manually shard models seems to be orthogonal to our current implementation of multi-node inference using Ray and multiprocessing (for single-node).

I have not looked at the Ray distributed executor yet, but when looking over the mp_distributed_executor I noticed that sharding of the model happens at a much lower level: _init_executor calls _run_workers("load_model", ...), which calls load_model in gpu_model_runner.py, which in turn calls load_model in (for example) the ShardedStateLoader class in loader.py.

I am assuming we want to use FSDP for multi-node inference, but the architecture will be very different than Ray-based distributed inference.

Am I overthinking this?

@gaocegege
Contributor Author

From my perspective, I do not think we can use FSDP, since we already have the workers load the model themselves.

@gaocegege
Contributor Author

gaocegege commented Feb 23, 2025

After chatting with @youkaichao, we agreed that torchrun might be a better fit for launching the processes than having vLLM launch them itself.

The torchrun usage would look like #3902 (comment):

# single node, multi-gpu
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args

torchrun has a robust ecosystem and is a well-established launcher. For instance, it supports different backends like c10d and etcd as rdzv backends, making it highly versatile.
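For illustration, here is a rough sketch (not vLLM's actual startup code) of what each torchrun-launched process would see: torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so the entrypoint could join the process group directly from the environment:

# Sketch: read the rendezvous info torchrun exports and join the process group.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
# "env://" tells torch.distributed to read MASTER_ADDR/MASTER_PORT from the env.
dist.init_process_group(backend="nccl", init_method="env://",
                        rank=rank, world_size=world_size)

if rank == 0:
    pass  # only rank 0 would expose the HTTP API server; the other ranks act as workers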

@jeffreyjeffreywang

jeffreyjeffreywang commented Mar 7, 2025

Hey @gaocegege, I'm new to vLLM and took a quick stab at this issue. I was able to launch the vLLM server with torchrun and --nproc-per-node set, without setting --tensor-parallel-size greater than 1. I wanted to get some clarity on the potential conflicts between the processes created by torchrun and those created internally by vLLM.

  • When both torchrun's --nproc-per-node and vLLM's --tensor-parallel-size are set, how many GPUs should the model run on?

    If we respect tensor-parallel-size, and nproc-per-node is greater than tensor-parallel-size, we reuse the processes created by torchrun and skip vLLM's internal process creation, while still going through distributed group setup in MQLLMEngine or AsyncLLMEngine. On the other hand, if tensor-parallel-size is larger than nproc-per-node, vLLM needs to create additional processes to satisfy the parallelism. Does this assumption sound reasonable?

  • By default, does the vLLM server run on the process with rank 0? Or should we allow users to specify which process the server runs on?

@gaocegege
Contributor Author

Hi Jeffrey,

When both torchrun's --nproc-per-node and vLLM's --tensor-parallel-size are set, how many GPUs should the model run on?

For now, I think they should be equal, e.g. --nproc-per-node=2 --tensor-parallel-size=2.

By default, does the vLLM server run on the process with rank 0? Or should we allow users to specify which process the server runs on?

Currently, vLLM itself launches N processes, not only rank 0. With torchrun, vLLM should launch only a single process itself, and torchrun is responsible for launching the N vLLM processes.
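As a tiny sketch of that equality constraint (a hypothetical check, not existing vLLM code), the server could compare the WORLD_SIZE that torchrun exports against --tensor-parallel-size at startup:

# Hypothetical startup check: the number of torchrun processes must match the TP size.
import os

def check_torchrun_world_size(tensor_parallel_size: int) -> None:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size != tensor_parallel_size:
        raise ValueError(
            f"torchrun launched {world_size} processes but "
            f"--tensor-parallel-size is {tensor_parallel_size}; they must match.")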

@jeffreyjeffreywang

Hey @gaocegege, circling back after digging a bit deeper. Let's first focus on the single-node scenario. For offline inference with torchrun (introduced by #12071), each torchrun process creates its own LLMEngine, receives the same prompt, and produces the same outputs. TP kicks in automatically when llm.generate() is called on each rank. To align with the offline inference approach, here's a proposed solution:

Goal

Maintain a single HTTP endpoint (on rank 0) while ensuring all ranks process each request and engage in TP.

Proposed Solution

  • Introduce a DistributedEngineClient wrapper around the existing engine client.
  • Use ZeroMQ pub-sub for request propagation from rank 0 to other ranks.
  • Ensure each rank maintains its own LLMEngine.

Request Flow

  • Client sends HTTP request to the API server on rank 0.
  • Rank 0 processes the request and broadcasts it to other ranks via ZMQ.
  • All ranks receive the request and invoke their local LLMEngine.
  • TP naturally occurs during model execution.
  • Rank 0 responds to the client with the final output.

Am I overthinking? Would love to hear your thoughts on this approach or if you see any pitfalls!
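A minimal sketch of the pub-sub propagation described above, using pyzmq; the DistributedEngineClient itself is omitted, and the endpoint address is a placeholder:

# Sketch of the proposed fan-out: rank 0 publishes each request, every other
# rank subscribes and hands it to its local LLMEngine.
import pickle
import zmq

ENDPOINT = "tcp://127.0.0.1:5557"  # placeholder; would come from config

def make_publisher():  # on rank 0, next to the API server
    sock = zmq.Context.instance().socket(zmq.PUB)
    sock.bind(ENDPOINT)
    return sock

def broadcast_request(sock, request):
    sock.send(pickle.dumps(request))

def subscriber_loop(handle_request):  # on every non-zero rank
    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.connect(ENDPOINT)
    sock.setsockopt(zmq.SUBSCRIBE, b"")  # receive everything
    while True:
        handle_request(pickle.loads(sock.recv()))  # e.g. feed the local LLMEngine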

@gaocegege
Contributor Author

Thanks for the proposal.

Rank 0 processes the request and broadcasts it to other ranks via ZMQ.

I am not sure we need ZMQ here; I think we could use torch.distributed (NCCL) to broadcast instead.

Introduce a DistributedEngineClient wrapper around the existing engine client.

Is it used to launch the API server only on rank 0?

Others LGTM
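For comparison, a sketch of the torch.distributed alternative suggested above, using broadcast_object_list over the already-initialized process group (whether NCCL or Gloo carries these small control messages is an open design choice):

# Rank 0 broadcasts each incoming request to all ranks via torch.distributed.
import torch.distributed as dist

def broadcast_request(request=None):
    """Rank 0 passes the request; the other ranks call with None and receive it."""
    payload = [request]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]

# rank 0:      req = broadcast_request(incoming_request)
# other ranks: req = broadcast_request()  # blocks until rank 0 broadcasts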
