Commit 9907fc4

[Docs] Data Parallel deployment documentation (#20768)
Signed-off-by: Nick Hill <[email protected]>
1 parent d47661f commit 9907fc4

File tree: 6 files changed (+118, -2 lines)


README.md

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ vLLM is flexible and easy to use with:
 
 - Seamless integration with popular Hugging Face models
 - High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
-- Tensor parallelism and pipeline parallelism support for distributed inference
+- Tensor, pipeline, data and expert parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
 - Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ vLLM is flexible and easy to use with:
 
 - Seamless integration with popular HuggingFace models
 - High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
-- Tensor parallelism and pipeline parallelism support for distributed inference
+- Tensor, pipeline, data and expert parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
 - Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
Two new image assets added (84.1 KB and 67.7 KB): docs/assets/deployment/dp_internal_lb.png and docs/assets/deployment/dp_external_lb.png, the load-balancing diagrams referenced below.
docs/serving/data_parallel_deployment.md

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
# Data Parallel Deployment

vLLM supports Data Parallel deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests.

This will work with both dense and MoE models.

For MoE models, particularly those like DeepSeek that employ MLA (Multi-head Latent Attention), it can be advantageous to use data parallel for the attention layers and expert or tensor parallel (EP or TP) for the expert layers.

In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even when there are fewer requests to be processed than DP ranks.

The expert layers will by default form a (DP x TP) sized tensor parallel group. To enable expert parallelism, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case).
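As a minimal illustrative sketch (the parallel sizes here are placeholders, not a recommendation), a single-node launch combining DP and TP attention with expert-parallel MoE layers might look like:

```bash
# Illustrative sketch: DP=2 x TP=2 attention (4 GPUs), with MoE layers run in expert-parallel mode
vllm serve $MODEL --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel
```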
In vLLM, each DP rank is deployed as a separate "core engine" process that communicates with front-end process(es) via ZMQ sockets. Data Parallel attention can be combined with Tensor Parallel attention, in which case each DP engine owns a number of per-GPU worker processes equal to the configured TP size.

For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).

In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and its KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.

This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class); for an example, see <gh-file:examples/offline_inference/data_parallel.py>.
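A hypothetical invocation of that example script is sketched below; the flag names are an assumption, so consult the script's `--help` output for its actual arguments:

```bash
# Hypothetical offline DP+EP run (flag names assumed; check the script's --help)
python examples/offline_inference/data_parallel.py --model=$MODEL --dp-size=2 --tp-size=2
```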
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or external per-rank process deployment and load balancing.

## Internal Load Balancing

vLLM supports "self-contained" data parallel deployments that expose a single API endpoint.

It can be configured by simply including e.g. `--data-parallel-size=4` in the `vllm serve` command-line arguments. This will require 4 GPUs. It can be combined with tensor parallel, for example `--data-parallel-size=4 --tensor-parallel-size=2`, which would require 8 GPUs.

Running a single data parallel deployment across multiple nodes requires a different `vllm serve` to be run on each node, specifying which DP ranks should run on that node. In this case, there will still be a single HTTP entrypoint - the API server(s) will run only on one node, but they don't necessarily need to be co-located with the DP ranks.
This will run DP=4, TP=2 on a single 8-GPU node:

```bash
vllm serve $MODEL --data-parallel-size 4 --tensor-parallel-size 2
```

This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:

```bash
# Node 0 (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 1
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 \
                  --data-parallel-start-rank 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
```
This will run DP=4 with only the API server on the first node and all engines on the second node:

```bash
# Node 0 (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 1
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
```

This DP mode can also be used with Ray, in which case only a single launch command is needed irrespective of the number of nodes:

```bash
vllm serve $MODEL --data-parallel-size 16 --tensor-parallel-size 2 --data-parallel-backend=ray
```
Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in the future by incorporating KV-cache-aware logic.

When deploying large DP sizes using this method, the API server process can become a bottleneck. In this case, the orthogonal `--api-server-count` command line option can be used to scale this out (for example `--api-server-count=4`). This is transparent to users - a single HTTP endpoint / port is still exposed. Note that this API server scale-out is "internal" and still confined to the "head" node.
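For example, a launch combining a larger DP size with scaled-out API servers might look like the following sketch (the sizes are placeholders; a single HTTP port is still exposed):

```bash
# Illustrative sketch: 8-way data parallel fronted by 4 API server processes on the head node
vllm serve $MODEL --data-parallel-size 8 --api-server-count 4
```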
<figure markdown="1">
  ![DP Internal LB Diagram](../assets/deployment/dp_internal_lb.png)
</figure>

## External Load Balancing

Especially for larger-scale deployments, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.

In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
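As a sketch of what such telemetry might look like, a router could poll each replica's `/metrics` endpoint for its queue depth; the specific metric names below are assumptions, so verify them against the actual `/metrics` output of your deployment:

```bash
# Illustrative sketch: inspect running/waiting request counts on one replica
# (metric names assumed; check the server's /metrics output)
curl -s http://10.99.48.128:8000/metrics | grep -E "num_requests_(running|waiting)"
```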
This can already be done trivially for non-MoE models, since each deployed server is fully independent. No data parallel CLI options need to be used for this.
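For instance, two fully independent single-GPU replicas of a dense model could be launched as below (a sketch; the external router that spreads traffic across ports 8000 and 8001 is not shown):

```bash
# Illustrative sketch: two independent replicas, no data parallel flags required
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL --port 8000
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL --port 8001
```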
We support an equivalent topology for MoE DP+EP, which can be configured via the following CLI arguments.

If DP ranks are co-located (same node / ip address), a default RPC port is used, but a different HTTP server port must be specified for each rank:

```bash
# Rank 0
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
                                         --port 8000
# Rank 1
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
                                         --port 8001
```
For multi-node cases, the address/port of rank 0 must also be specified:

```bash
# Rank 0 (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Rank 1
vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
```

The coordinator process also runs in this scenario, co-located with the DP rank 0 engine.

<figure markdown="1">
  ![DP External LB Diagram](../assets/deployment/dp_external_lb.png)
</figure>

In the above diagram, each of the dotted boxes corresponds to a separate launch of `vllm serve` - these could be separate Kubernetes pods, for example.

docs/serving/distributed_serving.md

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ After adding enough GPUs and nodes to hold the model, you can run vLLM first, wh
 !!! note
     There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
 
+### Distributed serving of MoE (Mixture of Experts) models
+
+It is often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. See the page on [Data Parallel Deployment](data_parallel_deployment.md) for more information.
+
 ## Running vLLM on a single node
 
 vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inference currently requires Ray.
