
[Feature] Integrate NIXL-EP into SGLang #17605

Closed

zackyoray wants to merge 28 commits into sgl-project:main from zackyoray:nixl_ep_moe

Conversation

@zackyoray zackyoray commented Jan 22, 2026

Overview

This PR introduces support for the NIXL-EP MoE backend in SGLang, enabling efficient expert parallelism through NVIDIA's NIXL framework. The implementation leverages the elastic expert parallelism infrastructure being developed as part of the Elastic EP Support roadmap (PR #8961) and builds on top of PR #11837 [Draft] (Elastic EP support deepep backend).

What is NIXL-EP?

NIXL-EP is a complete implementation of expert-parallel communication for Mixture of Experts (MoE) models built on top of NIXL's device API. It provides elastic scaling and fault tolerance, enabling dynamic addition and removal of processes (ranks) during runtime without disrupting existing connections, and leverages NIXL's RDMA and NVLink support for optimal performance.

Testing & Performance

The implementation has been validated with DeepSeek-V3-Lite using the standard `python -m sglang.bench_serving` benchmark tool.
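For reference, a benchmark run matching the configuration below might look like the following. This is a hypothetical invocation (flag names may differ across SGLang versions) and assumes a server is already listening on the default port:

```shell
# Hypothetical bench_serving invocation matching the test configuration
# (4096 prompts, 128-token input/output, 256 max concurrency).
# Assumes an SGLang server is already running on localhost:30000.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 4096 \
  --random-input-len 128 \
  --random-output-len 128 \
  --max-concurrency 256
```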

Test Configuration

| Parameter | 1 Node | 2 Nodes |
|---|---|---|
| Model | DeepSeek-V3-Lite (FP8) | DeepSeek-V3-Lite (FP8) |
| Tensor Parallelism | 8 | 16 |
| Data Parallelism | 8 | 8 |
| Max Concurrency | 256 | 256 |
| Number of Prompts | 4096 | 4096 |
| Input Length | 128 tokens | 128 tokens |
| Output Length | 128 tokens | 128 tokens |
| Redundant Experts | 24 | 24 |
| Memory Fraction | 0.78 | 0.78 |
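A 1-node server launch matching the table above might be sketched as follows. The model path is a placeholder, and the flag names beyond `--tp-size`/`--dp-size`/`--mem-fraction-static` are assumptions inferred from this PR's description (`--moe-a2a-backend`, `--elastic-ep-backend`) and may differ in the merged form:

```shell
# Hypothetical 1-node (8 GPU) launch for the configuration above.
# Model path and several flags are placeholders/assumptions, not verified
# against the final PR.
python -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3-Lite-FP8 \
  --tp-size 8 \
  --dp-size 8 \
  --moe-a2a-backend nixl \
  --elastic-ep-backend nixl \
  --ep-num-redundant-experts 24 \
  --mem-fraction-static 0.78
```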

Performance Results

1 Node (8 GPUs)

| Backend | TTFT Mean (ms) | TTFT Median (ms) | E2E Latency Mean (ms) | E2E Latency Median (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| nixl | 288.32 | 171.68 | 14,587.05 | 9,359.68 | 12.97 |
| deepep | 278.12 | 169.55 | 14,297.31 | 8,839.91 | 13.07 |
| mooncake | 591.25 | 359.19 | 23,178.43 | 14,894.46 | 9.02 |

2 Nodes (16 GPUs)

| Backend | TTFT Mean (ms) | TTFT Median (ms) | E2E Latency Mean (ms) | E2E Latency Median (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| nixl | 309.36 | 197.24 | 15,933.40 | 10,051.18 | 11.70 |
| deepep | 305.58 | 194.37 | 15,678.13 | 10,009.55 | 11.95 |
| mooncake | 586.30 | 333.80 | 24,503.35 | 15,325.52 | 8.33 |

Additional testing across different model scales and cluster configurations is ongoing.

Related Work


Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added documentation Improvements or additions to documentation deepseek labels Jan 22, 2026
@gemini-code-assist

Summary of Changes

Hello @zackyoray, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's Mixture of Experts (MoE) capabilities by integrating the NIXL-EP backend. This integration provides robust elastic expert parallelism, allowing for dynamic scaling and improved fault tolerance. The changes also introduce an elasticity-aware load balancing algorithm and update the system's configuration and documentation to support these advanced features, ensuring more efficient and adaptable MoE model serving.

Highlights

  • NIXL-EP Backend Integration: Introduced support for the NIXL-EP MoE backend, leveraging NVIDIA's NIXL framework for efficient expert parallelism, including elastic scaling and fault tolerance.
  • Elasticity-Aware Load Balancing: Added a new elasticity_aware algorithm for Expert Parallelism Load Balancing (EPLB) to dynamically manage expert distribution based on active ranks.
  • Server Argument and Documentation Updates: Updated server arguments (--elastic-ep-backend and --moe-a2a-backend) and corresponding documentation to include 'deepep' and 'nixl' options, enhancing configurability.
  • Elastic EP State Management: Implemented ElasticEPStateManager to centralize the management of active ranks and state for elastic expert parallelism across different backends (Mooncake, DeepEP, NIXL).
  • Refactored MoE Dispatchers and Model Loading: Integrated the new NIXL-EP dispatcher into the MoE layer and refactored model weight updating to support dynamic expert location changes in elastic EP scenarios.
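The "elasticity-aware" EPLB idea in the highlights above can be illustrated with a small sketch. All names here (`place_experts`, its parameters) are hypothetical; the real algorithm in the PR is considerably more involved, but the core idea is the same: expert replicas are placed only on ranks that the active-rank mask marks as healthy.

```python
# Hedged sketch of an elasticity-aware expert placement step.
# Hypothetical names; the real EPLB algorithm in the PR is more involved.
# Idea: only ranks marked active (1) in the mask receive expert replicas.

def place_experts(expert_loads, active_ranks, slots_per_rank):
    """Greedily assign the hottest experts to the least-loaded active rank.

    expert_loads: list of (expert_id, load) pairs.
    active_ranks: 0/1 mask, one entry per rank.
    slots_per_rank: physical expert slots available on each rank.
    Returns {rank: [expert_id, ...]} covering only active ranks.
    """
    active = [r for r, ok in enumerate(active_ranks) if ok]
    placement = {r: [] for r in active}
    rank_load = {r: 0 for r in active}
    for expert_id, load in sorted(expert_loads, key=lambda x: -x[1]):
        # Pick the active rank that still has spare slots and the lowest
        # accumulated load so far.
        candidates = [r for r in active if len(placement[r]) < slots_per_rank]
        target = min(candidates, key=lambda r: rank_load[r])
        placement[target].append(expert_id)
        rank_load[target] += load
    return placement


# Rank 1 is inactive, so its slots are never used:
print(place_experts([(0, 50), (1, 30), (2, 20), (3, 10)], [1, 0, 1], 2))
```

When a rank is removed at runtime, re-running placement with the updated mask redistributes its experts across the remaining active ranks.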


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the NIXL-EP MoE backend in SGLang, which is a significant feature for enabling efficient and elastic expert parallelism. The changes are well-structured, adding a new manager for elastic EP state, a corresponding load balancing algorithm, and the NIXL token dispatcher. The integration into the existing MoE framework seems solid. I have found one critical issue that would cause a runtime error under certain configurations. My feedback is focused on fixing this issue to ensure the stability of this new feature.

The review comment is anchored on this hunk:

```python
active_ranks=(
    ElasticEPStateManager.instance()._active_ranks
    if ElasticEPStateManager.instance() is not None
    else ElasticEPStateManager.healthy_rank_state()
)
```
critical

The method ElasticEPStateManager.healthy_rank_state() does not exist. This will raise an AttributeError if the elasticity_aware algorithm is used when ElasticEPStateManager is not initialized (i.e., no elastic EP backend is active).

To prevent this crash, you should provide a fallback that creates a tensor indicating all ranks are healthy. A tensor of ones with a shape corresponding to the number of GPUs would be a suitable default.

Suggested change:

```diff
-else ElasticEPStateManager.healthy_rank_state()
+else torch.ones(num_physical_experts // num_local_physical_experts, dtype=torch.int32)
```
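The fallback pattern the reviewer suggests can be sketched in a minimal, torch-free form. `ElasticEPStateManager` below is a simplified stub (the real class lives in SGLang), and a plain list of ints stands in for the `torch.ones(..., dtype=torch.int32)` tensor:

```python
# Sketch of the reviewer's suggested fallback: when the elastic-EP state
# manager was never initialized, treat every rank as healthy.
# NOTE: ElasticEPStateManager is a simplified stub of the class in SGLang;
# a plain list of ints stands in for the torch.ones int32 tensor.

class ElasticEPStateManager:
    _instance = None

    def __init__(self, active_ranks):
        self._active_ranks = active_ranks

    @classmethod
    def instance(cls):
        # Returns None when no elastic EP backend is active.
        return cls._instance


def resolve_active_ranks(num_physical_experts, num_local_physical_experts):
    """Mirror of the PR's expression: use the manager's ranks when it is
    initialized, otherwise fall back to an all-healthy mask with one entry
    per rank (num_physical_experts // num_local_physical_experts ranks)."""
    mgr = ElasticEPStateManager.instance()
    if mgr is not None:
        return mgr._active_ranks
    num_ranks = num_physical_experts // num_local_physical_experts
    return [1] * num_ranks  # all ranks marked healthy


# Without an initialized manager, the all-ones fallback is returned:
print(resolve_active_ranks(288, 18))  # 16 entries, all 1
```

This avoids the `AttributeError` on the nonexistent `healthy_rank_state()` while preserving the same semantics when the manager is present.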

Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
@zackyoray closed this Feb 24, 2026