Elastic Expert Parallel Initial Support #20775
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @ruisearch42, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces initial support for Elastic Expert Parallelism (EEP) in vLLM, enabling dynamic scaling of data parallel (DP) engine cores, particularly beneficial for Mixture-of-Experts (MoE) models. The changes encompass re-initializing distributed environments, managing KV cache states across scaling events, and orchestrating expert weight reshuffling to adapt to the new parallel configurations. A new API endpoint is added to trigger these scaling operations, with mechanisms to handle in-flight requests during transitions.
Highlights
- Dynamic Data Parallel Scaling: Introduces core functionality to dynamically scale up and down the number of data parallel (DP) engine cores, enabling flexible resource allocation for vLLM deployments.
- Elastic Expert Parallelism (EPLB) Integration: Implements logic to rebalance and reshuffle expert weights across the new set of available GPUs during scaling operations, ensuring efficient utilization for Mixture-of-Experts (MoE) models.
- Distributed Environment Re-initialization: Adds mechanisms to gracefully tear down and re-initialize PyTorch distributed process groups and related communication states across engine cores during dynamic scaling events.
- KV Cache State Management: Enables newly added engine cores to initialize their KV cache based on the available memory information from existing engine cores, optimizing startup time during scale-up.
- API for Scaling Control: Exposes a new /scale API endpoint on the vLLM server, allowing external systems to programmatically trigger scale-up or scale-down operations (see the sketch after this list).
- Traffic Handling During Scaling: Incorporates a traffic draining and dropping strategy during scaling transitions to minimize disruption to in-flight requests, ensuring service continuity.
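For illustration, triggering a scaling operation through that endpoint might look roughly like the sketch below. The /scale path is named in this PR; the JSON field name and response handling are assumptions made for the example.

```python
import requests  # any HTTP client works; `requests` is assumed here

# Hypothetical request body: the field name is an assumption, not the PR's exact schema.
resp = requests.post(
    "http://localhost:8006/scale",
    json={"new_data_parallel_size": 5},
    timeout=300,  # scaling currently takes tens of seconds (see discussion below)
)
print(resp.status_code, resp.text)
```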
Code Review
This pull request introduces a major new feature: elastic expert parallelism. This is a substantial change that touches many parts of the vLLM engine, from the low-level distributed communication and worker management to the high-level API server. The implementation seems well thought out, with a multi-phase approach to scaling to handle the complexities of a distributed system. The use of a scaling flag to gracefully handle traffic during scaling is a good design choice.
My review focuses on a few key areas:
- Correctness: I found a type hint mismatch that should be fixed. I also pointed out a commented-out assertion that might hide potential issues.
- Maintainability & Robustness: I've suggested improvements for a magic number and a custom communication protocol to make the code more robust and easier to maintain.
Overall, this is a great step towards elastic inference in vLLM. The changes are complex, and I appreciate the effort that went into this.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @abmfy, could you help review the EPLB part? Thanks!
```bash
PORT=8006
DATA_PARALLEL_SIZE=4
REDUNDANT_EXPERTS=0
MODEL_NAME="/models/models--deepseek-ai--DeepSeek-V2-Lite/snapshots/604d5664dddd88a0433dbae533b7fe9472482de0"
```
I've been recommending Qwen/Qwen3-30B-A3B-FP8 for a small example EP+DP model. It's very strong, plus it is a good stand-in for DeepSeek models since they use the same quantization scheme.
Thanks for the suggestion.
In the initial EEP support we assume the presence of EPLB, which is not supported in Qwen3 right now, so I guess we still need DeepSeek V2 Lite for now.
FYI @mnicely - this PR is important for autoscaling large-scale distributed MoE inference. It would be great to upstream any changes necessary for changing the world_size
Thanks for the ping. I'll bring it to the team.
```bash
# Download and unpack the NVSHMEM 3.2.5 source.
wget https://developer.download.nvidia.com/compute/redist/nvshmem/3.2.5/source/nvshmem_src_3.2.5-1.txz
tar -xvf nvshmem_src_3.2.5-1.txz -C nvshmem_src --strip-components=1
pushd nvshmem_src
# Apply DeepEP's NVSHMEM patch, then the EEP-specific patch on top.
wget https://github.com/deepseek-ai/DeepEP/raw/main/third-party/nvshmem.patch
git init
git apply -vvv nvshmem.patch
git apply --reject --whitespace=fix ../../eep_nvshmem.patch
else
```
Could you upgrade to 3.3.9, since it has the performance improvements from the DeepEP patch? (BTW, please double-check performance as well, if you have the bandwidth to do so.)
Thanks. Can we do it as a follow-up?
Right now, this initial PR only supports PPLX, and version 3.2.5-1 is consistent with the current DeepEP installation script.
The DeepEP nvshmem.patch is applied now for a few reasons: 1) we will support DeepEP eventually; 2) it is consistent with the current DeepEP installation script; 3) it removes the need for GDRCOPY; without the patch, the nvshmem compilation fails.
Currently, we only need our nvshmem patch that clears out all global communication states during nvshmem_finalize, so we can create a new communication group with a new set of participant GPUs.
We need to get nvshmem + deepep built into the vLLM image.
Thanks, can we do it as a follow-up?
Yeah, that was just a side note, not something for this PR.
```python
class ScalingMiddleware:
    """
    Middleware that checks if the model is currently scaling and
    returns a 503 Service Unavailable response if it is.

    This middleware applies to all HTTP requests and prevents
    processing when the model is in a scaling state.
    """
```
How long does this take typically? Would it be better to allow requests to queue?
Also we should add an API to return whether the vLLM instance is currently unavailable due to autoscaling, so that external routers can take this into account.
Added an is_scaling_elastic_ep API.
Right now, scaling up 4->5 takes ~55 seconds and scaling down 5->4 takes ~40 seconds. At this stage we are using a simple strategy of dropping requests, since this interruption time is expected to be minimized when we optimize in Milestone 2. Maybe better to revisit at that stage?
I think the idea is good though. Were you thinking about buffering requests at the API server or at the scheduler?
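For an external router, consuming such an availability signal might look roughly like the sketch below; the route path, response shape, and helper name are assumptions based only on the API name mentioned above.

```python
import requests  # any HTTP client works; `requests` is assumed here


def instance_available(base_url: str) -> bool:
    """Return True only if the instance is reachable and not currently scaling.

    Hypothetical sketch: the actual route and response format of the
    is_scaling_elastic_ep API in this PR may differ.
    """
    try:
        resp = requests.get(f"{base_url}/is_scaling_elastic_ep", timeout=2)
        # Assumes the endpoint returns a bare JSON boolean.
        return resp.ok and resp.json() is False
    except requests.RequestException:
        return False
```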
Buffering requests in the API server seems more natural, but I haven't thought about it too hard.
Any idea how far you'll be able to optimize it?
Had some ideas to reduce this to a few seconds, which require changes to the communicator reinit, cudagraph, etc. Will work on it next.
The ideal target would be very minimal or zero. Will experiment with how far these techniques can help us.
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/engine/async_llm.py (outdated):
```python
if new_data_parallel_size > old_data_parallel_size:
    await self.engine_core.scale_up_elastic_ep(
        new_data_parallel_size)
else:
    await self.engine_core.scale_down_elastic_ep(
        new_data_parallel_size)
```
Why have separate scale_up vs scale_down calls?
We have different logic for scale-up vs scale-down in the backend.
For scale-up: allocate new GPUs -> start new workers -> reinit comm -> reshard experts.
For scale-down: reshard experts -> shut down workers -> reinit comm.
We can definitely unify the frontend API, so that only the interaction between EngineCore/workers and CoreClient keeps separate logic.
I think it makes sense and is cleaner to have a single API for CoreClient. I've updated the code. We can later refine the implementations.
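A rough sketch of what that unified entry point could look like, built from the outdated snippet above; the wrapper name scale_elastic_ep and the way the old size is read from the config are assumptions, not necessarily the merged code.

```python
async def scale_elastic_ep(self, new_data_parallel_size: int) -> None:
    """Unified scaling entry point (illustrative sketch, not the merged code)."""
    # Assumption: the current DP size is available on the parallel config.
    old_data_parallel_size = (
        self.vllm_config.parallel_config.data_parallel_size)
    if new_data_parallel_size == old_data_parallel_size:
        return  # nothing to do
    if new_data_parallel_size > old_data_parallel_size:
        # Scale up: allocate new GPUs -> start new workers ->
        # reinit comm -> reshard experts.
        await self.engine_core.scale_up_elastic_ep(new_data_parallel_size)
    else:
        # Scale down: reshard experts -> shut down workers -> reinit comm.
        await self.engine_core.scale_down_elastic_ep(new_data_parallel_size)
```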
Purpose
This corresponds to Milestone 1 of #20323.
Co-authored with @libertyeagle
Supported functionality:
TODO for this PR:
Follow-ups after this PR
Test Plan
Test with PPLX kernel and DeepSeek-V2-Lite
Test Result
Can alternate scale-up and scale-down multiple times (e.g., 4->5->6->7->8->7->6->5->4) and drain/drop traffic.
(Optional) Documentation Update