[Feature]: Support torch.distributed as the runtime for multi-node inference #12511

gaocegege opened this issue Jan 28, 2025 · 12 comments

@gaocegege
Contributor

gaocegege commented Jan 28, 2025

🚀 The feature, motivation and pitch

We currently support Ray-based distributed inference, which requires setting up a Ray cluster. This issue requests support for multi-node inference with torch.distributed as the runtime instead.

Usage Example:

# Server 1 (rank 0)
vllm serve model_tag --nnodes 2 --rank 0 --dist-init-addr 192.168.0.1:5000

# Server 2 (rank 1) -- both nodes point at rank 0's init address
vllm serve model_tag --nnodes 2 --rank 1 --dist-init-addr 192.168.0.1:5000

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@gaocegege gaocegege added the feature request New feature or request label Jan 28, 2025
@plops655

plops655 commented Feb 2, 2025

Why not simply use TorchTrainer in the Ray Train library?

@gaocegege
Contributor Author

Why not simply use TorchTrainer in the Ray Train library?

I aim to simplify the deployment of multi-node inference using the vLLM Production Stack instead of configuring a Ray cluster on Kubernetes. I'm concerned that TorchTrainer may not be beneficial for this purpose.

@tsaoyu

tsaoyu commented Feb 10, 2025

I am in favor of this proposal: the Ray setup demands a lot of knowledge whenever anything goes wrong with it. Providing a Ray-free version for users who just want inference, and a Ray-based SPMD version for advanced users such as OpenRLHF, is a valid split.

@Jeffwan
Contributor

Jeffwan commented Feb 10, 2025

Yeah, this is reasonable. I raised a similar issue earlier: #3902

@gaocegege
Contributor Author

I’ll give it a try, though I don’t have much time to dedicate to it. We could adopt a design similar to this PR. The key difference is that workers (excluding rank 0) should enter a loop and wait for inputs from the driver (rank 0 worker).
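A minimal sketch of that worker loop, assuming an already-initialized torch.distributed process group and a hypothetical worker object with callable methods (this does not reflect vLLM's actual worker internals):

# Non-rank-0 workers block in a loop, waiting for the driver (rank 0) to
# broadcast the next method call; a None payload tells them to shut down.
import torch.distributed as dist

def worker_loop(worker):
    while True:
        payload = [None]
        # Rank 0 fills the list before broadcasting; the other ranks receive it.
        dist.broadcast_object_list(payload, src=0)
        if payload[0] is None:  # sentinel from the driver
            break
        method, args, kwargs = payload[0]
        getattr(worker, method)(*args, **kwargs)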

@plops655

I am working on this and have a question. The main PyTorch API for distributing a model across multiple nodes is FSDP. However, FSDP shards model data across GPUs by taking the full model as input.

Having to manually shard models seems to be orthogonal to our current implementation of multi-node inference using Ray and multiprocessing (for single-node).

I have not looked at the Ray distributed executor yet, but when looking over the mp_distributed_executor I noticed that sharding of the model happens at a much lower level: _init_executor calls _run_workers("load_model", ...), which calls load_model in gpu_model_runner.py, which in turn calls load_model in (for example) the ShardedStateLoader class in loader.py.

I am assuming we want to use FSDP for multi-node inference, but the architecture will be very different than Ray-based distributed inference.

Am I overthinking this?

@gaocegege
Contributor Author

From my perspective, I do not think we can use FSDP, since we already have the workers load the model themselves.

@gaocegege
Contributor Author

gaocegege commented Feb 23, 2025

After chatting with @youkaichao, we agreed that torchrun might be a better fit for launching the processes than having vLLM launch them itself.

The torchrun usage would look like #3902 (comment):

# single node, multi-gpu
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args

torchrun has a robust ecosystem and is a well-established launcher. For instance, it supports different backends like c10d and etcd as rdzv backends, making it highly versatile.
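For illustration, here is a rough sketch (not vLLM's actual startup code) of what each torchrun-launched process would see: torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so the entrypoint could join the process group directly from the environment:

# Sketch: read the rendezvous info torchrun exports and join the process group.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
# "env://" tells torch.distributed to read MASTER_ADDR/MASTER_PORT from the env.
dist.init_process_group(backend="nccl", init_method="env://",
                        rank=rank, world_size=world_size)

if rank == 0:
    pass  # only rank 0 would expose the HTTP API server; the other ranks act as workers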

@jeffreyjeffreywang

jeffreyjeffreywang commented Mar 7, 2025

Hey @gaocegege, I'm new to vLLM and took a quick stab at this issue. I was able to launch the vLLM server with torchrun and --nproc-per-node set, without setting --tensor-parallel-size greater than 1. I wanted to get some clarity on the potential conflicts between the processes created by torchrun and those created internally by vLLM.

  • When both torchrun's --nproc-per-node and vLLM's --tensor-parallel-size are set, how many GPUs should the model run on?

    If we respect tensor-parallel-size, and nproc-per-node is greater than tensor-parallel-size, we reuse the processes created by torchrun and skip vLLM's internal process creation, while still going through distributed group setup in MQLLMEngine or AsyncLLMEngine. On the other hand, if tensor-parallel-size is larger than nproc-per-node, vLLM needs to create additional processes to satisfy the parallelism. Does this assumption sound reasonable?

  • By default, does the vLLM server run on the process with rank 0? Or should we allow users to specify which process the server runs on?

@gaocegege
Contributor Author

Hi Jeffrey,

When both torchrun's --nproc-per-node and vLLM's --tensor-parallel-size are set, how many GPUs should the model run on?

For now, I think they should be equal, e.g. --nproc-per-node=2 --tensor-parallel-size=2.

By default, does the vLLM server run on the process with rank 0? Or should we allow users to specify which process the server runs on?

Currently, vLLM itself launches N processes, not only rank 0. With torchrun, vLLM should launch only a single process itself, and torchrun is responsible for launching the N vLLM processes.
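As a tiny sketch of that equality constraint (a hypothetical check, not existing vLLM code), the server could compare the WORLD_SIZE that torchrun exports against --tensor-parallel-size at startup:

# Hypothetical startup check: the number of torchrun processes must match the TP size.
import os

def check_torchrun_world_size(tensor_parallel_size: int) -> None:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size != tensor_parallel_size:
        raise ValueError(
            f"torchrun launched {world_size} processes but "
            f"--tensor-parallel-size is {tensor_parallel_size}; they must match.")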

@jeffreyjeffreywang

Hey @gaocegege, circling back after digging a bit deeper. Let's first focus on the single-node scenario. For offline inference with torchrun (introduced by #12071), each torchrun process creates its own LLMEngine, receives the same prompt, and produces the same outputs. TP kicks in automatically when llm.generate() is called on each rank. To align with the offline inference approach, here's a proposed solution:

Goal

Maintain a single HTTP endpoint (on rank 0) while ensuring all ranks process each request and engage in TP.

Proposed Solution

  • Introduce a DistributedEngineClient wrapper around the existing engine client.
  • Use ZeroMQ pub-sub for request propagation from rank 0 to other ranks.
  • Ensure each rank maintains its own LLMEngine.

Request Flow

  • Client sends HTTP request to the API server on rank 0.
  • Rank 0 processes the request and broadcasts it to other ranks via ZMQ.
  • All ranks receive the request and invoke their local LLMEngine.
  • TP naturally occurs during model execution.
  • Rank 0 responds to the client with the final output.

Am I overthinking? Would love to hear your thoughts on this approach or if you see any pitfalls!
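A minimal sketch of the pub-sub propagation described above, using pyzmq; the DistributedEngineClient itself is omitted, and the endpoint address is a placeholder:

# Sketch of the proposed fan-out: rank 0 publishes each request, every other
# rank subscribes and hands it to its local LLMEngine.
import pickle
import zmq

ENDPOINT = "tcp://127.0.0.1:5557"  # placeholder; would come from config

def make_publisher():  # on rank 0, next to the API server
    sock = zmq.Context.instance().socket(zmq.PUB)
    sock.bind(ENDPOINT)
    return sock

def broadcast_request(sock, request):
    sock.send(pickle.dumps(request))

def subscriber_loop(handle_request):  # on every non-zero rank
    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.connect(ENDPOINT)
    sock.setsockopt(zmq.SUBSCRIBE, b"")  # receive everything
    while True:
        handle_request(pickle.loads(sock.recv()))  # e.g. feed the local LLMEngine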

@gaocegege
Contributor Author

Thanks for the proposal.

Rank 0 processes the request and broadcasts it to other ranks via ZMQ.

I am not sure we need ZMQ here; I think we could use torch.distributed (NCCL) to broadcast instead.

Introduce a DistributedEngineClient wrapper around the existing engine client.

Is it used to launch the API server only on rank 0?

Others LGTM
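For comparison, a sketch of the torch.distributed alternative suggested above, using broadcast_object_list over the already-initialized process group (whether NCCL or Gloo carries these small control messages is an open design choice):

# Rank 0 broadcasts each incoming request to all ranks via torch.distributed.
import torch.distributed as dist

def broadcast_request(request=None):
    """Rank 0 passes the request; the other ranks call with None and receive it."""
    payload = [request]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]

# rank 0:      req = broadcast_request(incoming_request)
# other ranks: req = broadcast_request()  # blocks until rank 0 broadcasts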
