[Feature] Integrate Elastic NIXL-EP into SGLang #19248
ShangmingCai merged 13 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @zackyoray, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances SGLang's Mixture of Experts (MoE) capabilities by integrating the Elastic NIXL-EP backend, providing a new, high-performance option for expert parallelism that builds on the existing elastic EP infrastructure. The changes add NIXL as a recognized communication backend, implement its dispatcher logic, and establish a coordination mechanism for distributed operations, expanding the framework's flexibility and efficiency for large-scale MoE model serving.

Highlights
Code Review
This PR introduces support for the NIXL-EP MoE backend, which is a significant feature addition. The changes are well-structured, adding nixl as a new backend option and integrating it across the serving stack, from server arguments to the MoE token dispatcher. The new NixlEPDispatcher implementation seems to correctly follow the patterns of existing dispatchers. The addition of a global TCP store for coordination is also handled cleanly. My review has a couple of minor suggestions for code cleanup, but overall the implementation looks solid.
# Create a global TCPStore for coordination (used by NIXL)
_create_global_tcp_store(rank, world_size)
Maybe we should call this only when nixl-ep is enabled.
Makes sense, added an if before.
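The fix discussed above, gating store creation on the active backend, could look roughly like the sketch below. This is an illustration only: `maybe_create_store`, the `SGLANG_TCP_STORE_PORT` variable, and the backend-name check are hypothetical, and the real helper in the PR may differ. It assumes `torch.distributed.TCPStore`, where rank 0 hosts the store and other ranks connect as clients.

```python
import os

import torch.distributed as dist


def _create_global_tcp_store(rank: int, world_size: int) -> dist.TCPStore:
    # Rank 0 hosts the store; all other ranks connect to it as clients.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("SGLANG_TCP_STORE_PORT", "29600"))  # hypothetical env var
    return dist.TCPStore(host, port, world_size, is_master=(rank == 0))


def maybe_create_store(moe_a2a_backend: str, rank: int, world_size: int):
    # Only the NIXL backend needs the global store for coordination,
    # so skip the (blocking) store setup for every other backend.
    if moe_a2a_backend == "nixl":
        return _create_global_tcp_store(rank, world_size)
    return None
```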
from sglang.srt.compilation.compilation_config import register_split_op
from sglang.srt.compilation.piecewise_context_manager import is_in_piecewise_cuda_graph
from sglang.srt.distributed.utils import set_global_tcp_store
from sglang.srt.environ import envs
self.num_max_dispatch_tokens_per_rank = get_int_env_var(
    "SGLANG_NIXL_EP_NUM_MAX_DISPATCH_TOKENS_PER_RANK", 128
)
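The dispatch-buffer size above is read from an environment variable with a default of 128 tokens per rank. A minimal sketch of what a helper like `get_int_env_var` typically does (the actual SGLang implementation may differ):

```python
import os


def get_int_env_var(name: str, default: int) -> int:
    # Fall back to the default when the variable is unset or blank.
    value = os.environ.get(name, "").strip()
    return int(value) if value else default


# The NIXL-EP dispatcher sizes its per-rank dispatch buffers with this knob.
num_max_dispatch_tokens_per_rank = get_int_env_var(
    "SGLANG_NIXL_EP_NUM_MAX_DISPATCH_TOKENS_PER_RANK", 128
)
```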
hidden_states: torch.Tensor,
topk_idx: torch.Tensor,
):
use_fp8 = not get_bool_env_var("SGLANG_NIXL_EP_BF16_DISPATCH")
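The line above makes FP8 the default dispatch precision, with `SGLANG_NIXL_EP_BF16_DISPATCH` as an opt-out to BF16. A rough sketch of how such a boolean env-var helper is commonly written (the real `get_bool_env_var` in SGLang may accept a different set of truthy strings):

```python
import os


def get_bool_env_var(name: str, default: str = "false") -> bool:
    # Treat common truthy strings as True; everything else as False.
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes", "on")


# FP8 dispatch unless BF16 dispatch is explicitly requested via the env var.
use_fp8 = not get_bool_env_var("SGLANG_NIXL_EP_BF16_DISPATCH")
```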
ShangmingCai left a comment
Please fix the comments. Others LGTM. Please ping @ch-wan for review and another approval. He is the core maintainer of the EP module.
Signed-off-by: Barak Biber <bbiber@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Thanks @ShangmingCai, fixed the comments and rebased.
qq: is this related to elastic ep as indicated in the title?
Yes, this PR integrates NIXL-EP, an elastic EP communication library designed with elasticity as its core feature: it natively supports fault tolerance (with rank recovery) and dynamic scale-up/scale-down.
ch-wan left a comment
Can we add some CI (or manual) tests for NIXL a2a and elastic EP?
@ch-wan first of all, thanks for your review. I added manual tests in test/manual/ep/test_nixl_ep.py covering:
/tag-and-rerun-ci
Need to fix lint with
Thanks, fixed that.
/rerun-failed-ci
Thanks @ShangmingCai @ch-wan. @ShangmingCai, what should be the next step for merging this PR? Should I expect another round of review?
ShangmingCai left a comment
LGTM. You can ping @ch-wan for a final check.
ch-wan left a comment
LGTM. Please fix the conflict.
Signed-off-by: Barak Biber <bbiber@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Barak Biber <bbiber@nvidia.com>
Overview
This PR introduces support for the NIXL-EP MoE backend in SGLang, enabling efficient expert parallelism through NVIDIA's NIXL framework. This implementation leverages the elastic expert parallelism infrastructure being developed as part of the Elastic EP Support roadmap (PR #8961).
What is NIXL-EP?
NIXL-EP is a complete implementation of expert-parallel communication for Mixture of Experts (MoE) models built on top of NIXL's device API. It provides elastic scaling and fault tolerance, enabling dynamic addition and removal of processes (ranks) during runtime without disrupting existing connections, and leverages NIXL's RDMA and NVLink support for optimal performance.
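As a rough illustration of how a new MoE a2a backend like this is typically selected at launch time — the flag names and values below are assumptions for illustration, not confirmed by this PR; consult the merged diff or `--help` for the actual options:

```shell
# Hypothetical invocation; exact flag names may differ in the merged PR.
python -m sglang.launch_server \
  --model-path <model> \
  --moe-a2a-backend nixl \
  --ep-size 8
```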
Testing & Performance
The implementation has been validated with DeepSeek-V3-Lite using the standard python -m sglang.bench_serving benchmark tool.

Test Configuration
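For reference, a typical `sglang.bench_serving` invocation looks like the following. The parameter values here are illustrative only, not the exact configuration behind the numbers below:

```shell
# Illustrative benchmark run against an already-running SGLang server.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 1000
```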
Performance Results
1 Node (8 GPUs)
2 Nodes (16 GPUs)
Additional testing across different model scales and cluster configurations is ongoing.
Related Work
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci