[2/N] Elastic EP Milestone 2: Integrating NIXL-EP#29630
libertyeagle wants to merge 2 commits into vllm-project:main
Conversation
Force-push history (commit messages, each Signed-off-by: Yongji Wu <wuyongji317@gmail.com>):
- support request serving during scaling up/down
- misc fixes
- minor fix (×2)
- scaling test: 2->4->2
- tiny fix
- rebase fix (×10)
- small fix (×3)
Force-pushed from 8ba94c2 to 297bec9.
💡 Codex Review: https://github.com/vllm-project/vllm/blob/8ba94c2ec34f40b9b03752287e21c0e6baec2d00/vllm/distributed/stateless_coordinator.py#L285-L289 — When a stateless group receives tensors on the CPU path, the data is dropped.
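The issue flagged above (received data being dropped on the CPU path) follows a common bug pattern: deserializing into a temporary that is never copied back into the caller-visible buffer. The sketch below is a minimal illustration with hypothetical names; it is not the actual `stateless_coordinator.py` code.

```python
import struct


def recv_into(out: list[float], payload: bytes) -> None:
    """Deserialize received bytes into the caller's buffer.

    The bug pattern is to unpack into a local temporary and return,
    leaving `out` untouched. The fix is the in-place copy below.
    """
    values = struct.unpack(f"{len(out)}d", payload)
    out[:] = values  # copy into the caller-visible buffer (the fix)
```

Without the `out[:] = values` line, the function would succeed silently while the caller's tensor stays zero-filled, which matches the symptom described in the review.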
Code Review
This pull request introduces a significant and complex feature: elastic scaling for expert parallelism (EP) by integrating the NIXL-EP kernel. The changes are extensive, touching many core components of vLLM's distributed infrastructure, including communication primitives, model execution, and configuration management. The core of this feature is the introduction of stateless communication groups, which allows for dynamic reconfiguration of the cluster topology without requiring a full restart. A state machine has been implemented to orchestrate the scaling operations (both up and down), which is a robust approach for such a complex distributed process. The implementation also includes optimizations for new worker startup, where they receive model weights from peers instead of loading from disk. Overall, the changes appear well-architected and the logic is consistent across the various components. I have found one high-severity issue related to a debug print statement that should be removed.
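The state machine mentioned in the review is not shown in this conversation; as a rough illustration of the pattern (the states, transitions, and class names below are hypothetical, not vLLM's actual implementation):

```python
from enum import Enum, auto


class ScaleState(Enum):
    STABLE = auto()         # normal serving, fixed world size
    PREPARING = auto()      # joining/departing ranks announced, weights transferring
    DRAINING = auto()       # in-flight requests finish on the old topology
    RECONFIGURING = auto()  # stateless groups rebuilt for the new topology


# Allowed transitions for one scale-up or scale-down cycle (hypothetical).
TRANSITIONS = {
    ScaleState.STABLE: {ScaleState.PREPARING},
    ScaleState.PREPARING: {ScaleState.DRAINING},
    ScaleState.DRAINING: {ScaleState.RECONFIGURING},
    ScaleState.RECONFIGURING: {ScaleState.STABLE},
}


class ScalingCoordinator:
    """Tracks the cluster's scaling phase and rejects illegal transitions."""

    def __init__(self) -> None:
        self.state = ScaleState.STABLE

    def advance(self, target: ScaleState) -> None:
        if target not in TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

A full scaling cycle walks STABLE → PREPARING → DRAINING → RECONFIGURING → STABLE; encoding the legal transitions explicitly is what makes the orchestration robust against out-of-order events.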
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @libertyeagle, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Purpose
This is the 2nd PR towards milestone 2 of elastic EP. The 1st PR is #26278.
This PR integrates the NIXL-EP kernel.
NIXL-EP is an EP kernel built on NIXL's device API. It provides elastic scaling: processes (ranks) can be added or removed dynamically at runtime, without destroying and recreating communicators during scale-up/down.
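To see why elasticity matters for EP, note that when the world size changes (e.g. 2 → 4 → 2 ranks, as in the test below), each rank's expert assignment must be recomputed even though, with an elastic kernel, the surviving ranks keep their communicator alive. The helper below is a simplified illustration of such a remapping, not the NIXL-EP API.

```python
def assign_experts(num_experts: int, world_size: int) -> dict[int, list[int]]:
    """Contiguous expert-to-rank mapping (illustrative; real EP layouts vary).

    Assumes num_experts is divisible by world_size.
    """
    per_rank = num_experts // world_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(world_size)
    }
```

For 8 experts, scaling from 2 to 4 ranks halves each rank's share from 4 experts to 2; the elastic kernel's job is to apply such a remapping in place rather than tearing the whole group down.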
Test Plan
Performance testing script:
Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 model on 8xH100 with EP=8.

```
vllm bench serve \
  --model $MODEL_NAME \
  --host $HOST \
  --port $PORT \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 512 \
  --num-prompts 512
```

Test Result
CC List
@ruisearch42 @tlrmchlsmth @kouroshHakha