
[Feature] Integrate NIXL-EP into SGLang #17605

Closed

zackyoray wants to merge 28 commits into sgl-project:main from zackyoray:nixl_ep_moe

Conversation

@zackyoray zackyoray commented Jan 22, 2026

Overview

This PR introduces support for the NIXL-EP MoE backend in SGLang, enabling efficient expert parallelism through NVIDIA's NIXL framework. The implementation leverages the elastic expert parallelism infrastructure being developed as part of the Elastic EP Support roadmap (PR #8961) and builds on top of PR #11837 [Draft] (Elastic EP support deepep backend).

What is NIXL-EP?

NIXL-EP is a complete implementation of expert-parallel communication for Mixture of Experts (MoE) models built on top of NIXL's device API. It provides elastic scaling and fault tolerance, enabling dynamic addition and removal of processes (ranks) during runtime without disrupting existing connections, and leverages NIXL's RDMA and NVLink support for optimal performance.

Testing & Performance

The implementation has been validated with DeepSeek-V3-Lite using the standard `python -m sglang.bench_serving` benchmark tool.
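For reference, a benchmark run matching the configuration below might look like the following. This is a hypothetical invocation (flag names may differ across SGLang versions) and assumes a server is already listening on the default port:

```shell
# Hypothetical bench_serving invocation matching the test configuration
# (4096 prompts, 128-token input/output, 256 max concurrency).
# Assumes an SGLang server is already running on localhost:30000.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 4096 \
  --random-input-len 128 \
  --random-output-len 128 \
  --max-concurrency 256
```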

Test Configuration

| Parameter | 1 Node | 2 Nodes |
|---|---|---|
| Model | DeepSeek-V3-Lite (FP8) | DeepSeek-V3-Lite (FP8) |
| Tensor Parallelism | 8 | 16 |
| Data Parallelism | 8 | 8 |
| Max Concurrency | 256 | 256 |
| Number of Prompts | 4096 | 4096 |
| Input Length | 128 tokens | 128 tokens |
| Output Length | 128 tokens | 128 tokens |
| Redundant Experts | 24 | 24 |
| Memory Fraction | 0.78 | 0.78 |
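A 1-node server launch matching the table above might be sketched as follows. The model path is a placeholder, and the flag names beyond `--tp-size`/`--dp-size`/`--mem-fraction-static` are assumptions inferred from this PR's description (`--moe-a2a-backend`, `--elastic-ep-backend`) and may differ in the merged form:

```shell
# Hypothetical 1-node (8 GPU) launch for the configuration above.
# Model path and several flags are placeholders/assumptions, not verified
# against the final PR.
python -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3-Lite-FP8 \
  --tp-size 8 \
  --dp-size 8 \
  --moe-a2a-backend nixl \
  --elastic-ep-backend nixl \
  --ep-num-redundant-experts 24 \
  --mem-fraction-static 0.78
```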

Performance Results

1 Node (8 GPUs)

| Backend | TTFT Mean (ms) | TTFT Median (ms) | E2E Latency Mean (ms) | E2E Latency Median (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| nixl | 288.32 | 171.68 | 14,587.05 | 9,359.68 | 12.97 |
| deepep | 278.12 | 169.55 | 14,297.31 | 8,839.91 | 13.07 |
| mooncake | 591.25 | 359.19 | 23,178.43 | 14,894.46 | 9.02 |

2 Nodes (16 GPUs)

| Backend | TTFT Mean (ms) | TTFT Median (ms) | E2E Latency Mean (ms) | E2E Latency Median (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| nixl | 309.36 | 197.24 | 15,933.40 | 10,051.18 | 11.70 |
| deepep | 305.58 | 194.37 | 15,678.13 | 10,009.55 | 11.95 |
| mooncake | 586.30 | 333.80 | 24,503.35 | 15,325.52 | 8.33 |

Additional testing across different model scales and cluster configurations is ongoing.

Related Work


Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added documentation Improvements or additions to documentation deepseek labels Jan 22, 2026
@gemini-code-assist

Summary of Changes

Hello @zackyoray, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's Mixture of Experts (MoE) capabilities by integrating the NIXL-EP backend. This integration provides robust elastic expert parallelism, allowing for dynamic scaling and improved fault tolerance. The changes also introduce an elasticity-aware load balancing algorithm and update the system's configuration and documentation to support these advanced features, ensuring more efficient and adaptable MoE model serving.

Highlights

  • NIXL-EP Backend Integration: Introduced support for the NIXL-EP MoE backend, leveraging NVIDIA's NIXL framework for efficient expert parallelism, including elastic scaling and fault tolerance.
  • Elasticity-Aware Load Balancing: Added a new elasticity_aware algorithm for Expert Parallelism Load Balancing (EPLB) to dynamically manage expert distribution based on active ranks.
  • Server Argument and Documentation Updates: Updated server arguments (--elastic-ep-backend and --moe-a2a-backend) and corresponding documentation to include 'deepep' and 'nixl' options, enhancing configurability.
  • Elastic EP State Management: Implemented ElasticEPStateManager to centralize the management of active ranks and state for elastic expert parallelism across different backends (Mooncake, DeepEP, NIXL).
  • Refactored MoE Dispatchers and Model Loading: Integrated the new NIXL-EP dispatcher into the MoE layer and refactored model weight updating to support dynamic expert location changes in elastic EP scenarios.
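The "elasticity-aware" EPLB idea in the highlights above can be illustrated with a small sketch. All names here (`place_experts`, its parameters) are hypothetical; the real algorithm in the PR is considerably more involved, but the core idea is the same: expert replicas are placed only on ranks that the active-rank mask marks as healthy.

```python
# Hedged sketch of an elasticity-aware expert placement step.
# Hypothetical names; the real EPLB algorithm in the PR is more involved.
# Idea: only ranks marked active (1) in the mask receive expert replicas.

def place_experts(expert_loads, active_ranks, slots_per_rank):
    """Greedily assign the hottest experts to the least-loaded active rank.

    expert_loads: list of (expert_id, load) pairs.
    active_ranks: 0/1 mask, one entry per rank.
    slots_per_rank: physical expert slots available on each rank.
    Returns {rank: [expert_id, ...]} covering only active ranks.
    """
    active = [r for r, ok in enumerate(active_ranks) if ok]
    placement = {r: [] for r in active}
    rank_load = {r: 0 for r in active}
    for expert_id, load in sorted(expert_loads, key=lambda x: -x[1]):
        # Pick the active rank that still has spare slots and the lowest
        # accumulated load so far.
        candidates = [r for r in active if len(placement[r]) < slots_per_rank]
        target = min(candidates, key=lambda r: rank_load[r])
        placement[target].append(expert_id)
        rank_load[target] += load
    return placement


# Rank 1 is inactive, so its slots are never used:
print(place_experts([(0, 50), (1, 30), (2, 20), (3, 10)], [1, 0, 1], 2))
```

When a rank is removed at runtime, re-running placement with the updated mask redistributes its experts across the remaining active ranks.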


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the NIXL-EP MoE backend in SGLang, which is a significant feature for enabling efficient and elastic expert parallelism. The changes are well-structured, adding a new manager for elastic EP state, a corresponding load balancing algorithm, and the NIXL token dispatcher. The integration into the existing MoE framework seems solid. I have found one critical issue that would cause a runtime error under certain configurations. My feedback is focused on fixing this issue to ensure the stability of this new feature.

The review comment is anchored on this hunk:

```python
active_ranks=(
    ElasticEPStateManager.instance()._active_ranks
    if ElasticEPStateManager.instance() is not None
    else ElasticEPStateManager.healthy_rank_state()
)
```
critical

The method ElasticEPStateManager.healthy_rank_state() does not exist. This will raise an AttributeError if the elasticity_aware algorithm is used when ElasticEPStateManager is not initialized (i.e., no elastic EP backend is active).

To prevent this crash, you should provide a fallback that creates a tensor indicating all ranks are healthy. A tensor of ones with a shape corresponding to the number of GPUs would be a suitable default.

Suggested change:

```diff
-else ElasticEPStateManager.healthy_rank_state()
+else torch.ones(num_physical_experts // num_local_physical_experts, dtype=torch.int32)
```
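The fallback pattern the reviewer suggests can be sketched in a minimal, torch-free form. `ElasticEPStateManager` below is a simplified stub (the real class lives in SGLang), and a plain list of ints stands in for the `torch.ones(..., dtype=torch.int32)` tensor:

```python
# Sketch of the reviewer's suggested fallback: when the elastic-EP state
# manager was never initialized, treat every rank as healthy.
# NOTE: ElasticEPStateManager is a simplified stub of the class in SGLang;
# a plain list of ints stands in for the torch.ones int32 tensor.

class ElasticEPStateManager:
    _instance = None

    def __init__(self, active_ranks):
        self._active_ranks = active_ranks

    @classmethod
    def instance(cls):
        # Returns None when no elastic EP backend is active.
        return cls._instance


def resolve_active_ranks(num_physical_experts, num_local_physical_experts):
    """Mirror of the PR's expression: use the manager's ranks when it is
    initialized, otherwise fall back to an all-healthy mask with one entry
    per rank (num_physical_experts // num_local_physical_experts ranks)."""
    mgr = ElasticEPStateManager.instance()
    if mgr is not None:
        return mgr._active_ranks
    num_ranks = num_physical_experts // num_local_physical_experts
    return [1] * num_ranks  # all ranks marked healthy


# Without an initialized manager, the all-ones fallback is returned:
print(resolve_active_ranks(288, 18))  # 16 entries, all 1
```

This avoids the `AttributeError` on the nonexistent `healthy_rank_state()` while preserving the same semantics when the manager is present.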

Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
@zackyoray closed this Feb 24, 2026