[Feature] Integrate NIXL-EP into SGLang #17605
zackyoray wants to merge 28 commits into sgl-project:main
Conversation
Summary of Changes

Hello @zackyoray, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances SGLang's Mixture of Experts (MoE) capabilities by integrating the NIXL-EP backend. This integration provides elastic expert parallelism, allowing for dynamic scaling and improved fault tolerance. The changes also introduce an elasticity-aware load-balancing algorithm and update the system's configuration and documentation to support these features, enabling more efficient and adaptable MoE model serving.
Code Review
This pull request introduces support for the NIXL-EP MoE backend in SGLang, which is a significant feature for enabling efficient and elastic expert parallelism. The changes are well-structured, adding a new manager for elastic EP state, a corresponding load balancing algorithm, and the NIXL token dispatcher. The integration into the existing MoE framework seems solid. I have found one critical issue that would cause a runtime error under certain configurations. My feedback is focused on fixing this issue to ensure the stability of this new feature.
```python
active_ranks=(
    ElasticEPStateManager.instance()._active_ranks
    if ElasticEPStateManager.instance() is not None
    else ElasticEPStateManager.healthy_rank_state()
)
```
The method ElasticEPStateManager.healthy_rank_state() does not exist. This will raise an AttributeError if the elasticity_aware algorithm is used when ElasticEPStateManager is not initialized (i.e., no elastic EP backend is active).
To prevent this crash, you should provide a fallback that creates a tensor indicating all ranks are healthy. A tensor of ones with a shape corresponding to the number of GPUs would be a suitable default.
```diff
- else ElasticEPStateManager.healthy_rank_state()
+ else torch.ones(num_physical_experts // num_local_physical_experts, dtype=torch.int32)
```
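The fallback semantics can be sketched without torch: when no elastic EP state manager has been initialized, every rank is treated as healthy, i.e. an all-ones vector with one entry per rank. The sketch below is illustrative only — `get_active_ranks` and the stub `ElasticEPStateManager` are hypothetical stand-ins, not the actual SGLang API, and a plain Python list stands in for the int32 tensor.

```python
# Illustrative sketch (not the actual SGLang API): the elasticity-aware
# load balancer needs a per-rank health vector. When the elastic EP state
# manager was never initialized, fall back to "all ranks healthy".

class ElasticEPStateManager:
    """Hypothetical stand-in for SGLang's singleton state manager."""
    _instance = None

    def __init__(self, active_ranks):
        self._active_ranks = active_ranks

    @classmethod
    def instance(cls):
        # Returns None when no elastic EP backend is active.
        return cls._instance


def get_active_ranks(num_ranks):
    """Return the per-rank health vector, defaulting to all-healthy."""
    mgr = ElasticEPStateManager.instance()
    if mgr is not None:
        return mgr._active_ranks
    # Fallback suggested in the review: a vector of ones, one per rank
    # (torch.ones(num_ranks, dtype=torch.int32) in the real code).
    return [1] * num_ranks


print(get_active_ranks(4))  # [1, 1, 1, 1] -- manager not initialized
```

This mirrors the reviewed expression: the manager's `_active_ranks` when one exists, otherwise an all-ones default so the `elasticity_aware` algorithm cannot crash with an AttributeError.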
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Force-pushed from cba42a4 to 96653d3.
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Overview
This PR introduces support for the NIXL-EP MoE backend in SGLang, enabling efficient expert parallelism through NVIDIA's NIXL framework. This implementation leverages the elastic expert parallelism infrastructure being developed as part of the Elastic EP Support roadmap (PR #8961), and is based on top of PR #11837 [Draft] (Elastic EP support deepep backend).
What is NIXL-EP?
NIXL-EP is a complete implementation of expert-parallel communication for Mixture of Experts (MoE) models built on top of NIXL's device API. It provides elastic scaling and fault tolerance, enabling dynamic addition and removal of processes (ranks) during runtime without disrupting existing connections, and leverages NIXL's RDMA and NVLink support for optimal performance.
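As a rough illustration of the elastic-membership idea described above — ranks can be added or removed at runtime without disrupting existing connections — the sketch below models a minimal rank registry. All names (`ElasticGroup`, `join`, `leave`) are hypothetical and do not correspond to the actual NIXL device API.

```python
# Hypothetical sketch of elastic rank membership -- not the NIXL API.
# It only illustrates the idea that ranks can join and leave at runtime
# while connections among the remaining ranks stay untouched.

class ElasticGroup:
    def __init__(self):
        self.ranks = {}  # rank id -> connection handle (stub)

    def join(self, rank_id):
        # A joining rank only adds a new connection; existing ones persist.
        self.ranks[rank_id] = f"conn-{rank_id}"

    def leave(self, rank_id):
        # A leaving (or failed) rank is dropped without tearing down peers.
        self.ranks.pop(rank_id, None)

    def active(self):
        return sorted(self.ranks)


group = ElasticGroup()
for r in range(4):
    group.join(r)
group.leave(2)          # simulate a failed rank
group.join(4)           # scale up with a new rank
print(group.active())   # [0, 1, 3, 4]
```

In the real backend this membership change is what the elasticity-aware load balancer consumes: experts are rebalanced across whatever set of ranks is currently active.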
Testing & Performance
The implementation has been validated with DeepSeek-V3-Lite using the standard python -m sglang.bench_serving benchmark tool.

Test Configuration
Performance Results
1 Node (8 GPUs)
2 Nodes (16 GPUs)
Additional testing across different model scales and cluster configurations is ongoing.
Related Work
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci