[JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) #19880
BBuf merged 20 commits into sgl-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new JIT-compiled custom all-reduce implementation (version 2) for SGLang, designed to enhance distributed communication performance, particularly for intra-node GPU setups. It provides a flexible and optimized alternative to existing all-reduce methods, leveraging CUDA IPC and custom kernels for efficient data exchange and synchronization.

Highlights
Code Review
This pull request introduces a JIT-compiled custom all-reduce implementation (v2) as an opt-in feature, including new Python bindings, CUDA kernels, and host-side control logic. It supports both one-shot and two-shot algorithms, integrates with CUDA graph capturing, and is enabled via the SGLANG_USE_JIT_ALL_REDUCE environment variable. However, a significant security vulnerability was identified: memory offsets used for CUDA graph registration are truncated from 64-bit to 32-bit integers during inter-process communication. This could lead to incorrect memory access, memory corruption, or information disclosure on the GPU. It is strongly recommended to use 64-bit integers for all memory-related offsets. Additionally, the code review focused on improving clarity, maintainability, and configuration flexibility, with suggestions for clarifying comments, removing unused code, and making hardcoded parameters configurable.
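The truncation hazard the review describes can be illustrated in isolation. The sketch below is a hypothetical serialization helper (not SGLang's actual IPC code): when a 64-bit GPU memory offset is packed as a signed 32-bit integer for inter-process exchange, any offset past 2 GiB silently wraps, so the receiving process would register the wrong address.

```python
import struct

def pack_offset_unsafe(offset: int) -> bytes:
    # BUG pattern: serializing a 64-bit memory offset as a signed
    # 32-bit int silently truncates anything above 2**31 - 1.
    return struct.pack("<i", offset & 0xFFFFFFFF)

def pack_offset_safe(offset: int) -> bytes:
    # Fix suggested by the review: always use 64-bit integers
    # for memory-related offsets.
    return struct.pack("<q", offset)

offset = 5 * (1 << 30)  # an offset 5 GiB into a large allocation

truncated = struct.unpack("<i", pack_offset_unsafe(offset))[0]
restored = struct.unpack("<q", pack_offset_safe(offset))[0]
```

Here `truncated` comes back as 1 GiB rather than 5 GiB, which is exactly the kind of wrong-address access the review warns about.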
/tag-and-rerun-ci
cc @BBuf @yuan-luo @HydraQYH. For now we implement a push-mode 1-shot all-reduce and a normal pull-mode 1/2-shot all-reduce, which can be significantly faster than the AOT custom all-reduce. Currently the
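The push-mode vs. pull-mode distinction mentioned above can be sketched with a toy rank-by-rank simulation (plain Python lists stand in for the IPC-shared GPU buffers; none of these names are SGLang's real API). In pull mode every rank loads all peers' input buffers and reduces locally; in push mode every rank stores its contribution into each peer's accumulator, trading a gather of loads for one store per peer.

```python
WORLD = 4
inputs = [[float(r + 1)] * 8 for r in range(WORLD)]  # per-rank input data

def pull_mode_all_reduce(inputs):
    # Pull mode: every rank reads all peers' buffers and sums them.
    outputs = []
    for _rank in range(len(inputs)):
        out = [0.0] * len(inputs[0])
        for peer in inputs:            # one load pass per peer
            for i, v in enumerate(peer):
                out[i] += v
        outputs.append(out)
    return outputs

def push_mode_all_reduce(inputs):
    # Push mode: every rank writes its data into all peers' accumulators.
    outputs = [[0.0] * len(inputs[0]) for _ in inputs]
    for data in inputs:                # one store pass per peer
        for out in outputs:
            for i, v in enumerate(data):
                out[i] += v
    return outputs
```

Both modes produce identical results; the difference on real hardware is in memory-traffic direction and synchronization cost, which is why push mode can win at small message sizes.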
Some performance results for TP=4 on H200/B200 (H200 and B200 benchmark tables omitted here).
- Regex: '^<sgl_kernel/.*\.h>$'
  Priority: 0
- Regex: '^<sgl_kernel/impl/.*>$'
- Regex: '^<sgl_kernel/.*/.*>$'
Why do we need to update this regex?
Because there are more secondary headers in the JIT kernel. In this PR we introduce <sgl_kernel/distributed/xxx.cuh>. This rule covers all of them and doesn't break existing code.
graph = torch.cuda.CUDAGraph()
graph_inp = torch.zeros((TEST_LAYERS, size), dtype=dtype, device=device)
out_jits = []
with comm.capture():
How do you plan to handle CUDA graph compatibility for the pull-based custom all-reduce path in real LLM runs? It seems this path depends on the extra comm.capture() address-registration flow, so I’m not sure what the intended graph capture / recapture lifecycle is here.
The AOT custom all reduce already uses a similar graph registration method; we just follow their design. The buffer memory usage is
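The `comm.capture()` address-registration flow being discussed can be sketched as a small context manager (a toy stand-in, not SGLang's real communicator API): during CUDA graph capture, buffer addresses are recorded rather than registered eagerly, and on exit they are exchanged/registered once so graph replay can reuse them with no further host-side work.

```python
from contextlib import contextmanager

class FakeComm:
    """Toy stand-in for the communicator's graph-capture flow;
    all names here are illustrative assumptions."""

    def __init__(self):
        self._capturing = False
        self._pending = []       # buffer addresses seen during capture
        self.registered = set()  # addresses registered for graph replay

    @contextmanager
    def capture(self):
        # While capturing, record buffer addresses instead of
        # registering them eagerly.
        self._capturing = True
        try:
            yield
        finally:
            # On exit, register all recorded addresses in one batch,
            # mirroring the IPC handle exchange done after capture.
            self._capturing = False
            self.registered.update(self._pending)
            self._pending.clear()

    def all_reduce(self, buffer_addr: int):
        if self._capturing:
            self._pending.append(buffer_addr)
        # ... kernel launch elided ...

comm = FakeComm()
with comm.capture():
    comm.all_reduce(0x1000)
    comm.all_reduce(0x2000)
```

After the `with` block, both buffer addresses are registered and the pending list is empty, which is the state a captured graph needs before replay.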
/rerun-failed-ci
Could you paste the benchmark results for this new CA kernel? That would be great.
benchmark result here @yuan-luo
@DarkSharpness Awesome benchmark result. Could we put it in the PR description's Benchmarking and Profiling section?
"""
# HARDCODED: opt-in flag for v2 JIT all-reduce.
# Set SGLANG_USE_JIT_ALL_REDUCE=1 to enable.
if _is_cuda and get_bool_env_var("SGLANG_USE_JIT_ALL_REDUCE", default="false"):
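For context, the likely semantics of a boolean env-var helper like the one used above can be sketched as follows. This is an assumption about `get_bool_env_var`'s behavior (common truthy strings map to `True`), not a copy of SGLang's implementation.

```python
import os

def get_bool_env_var(name: str, default: str = "false") -> bool:
    # Sketch of the helper's assumed semantics: treat common truthy
    # strings ("1", "true", "yes") as True, everything else as False.
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

# Opt in to the v2 JIT all-reduce path, as the diff above describes.
os.environ["SGLANG_USE_JIT_ALL_REDUCE"] = "1"
enabled = get_bool_env_var("SGLANG_USE_JIT_ALL_REDUCE", default="false")
```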
If it is stable and outperforms the other ARs, can we set it to default true?
I will enable this in another PR. This PR is already too large and involves some cleanup in the parallel states; we should ensure the correctness of that part first.
# NOTE: This result is based on benchmarks on H200 GPUs
THRESHOLD_2_SHOT_MAP = {
    2: ModeConfig(2 * MB, INF),
    3: ModeConfig(512 * MB, 512 * KB),
Is this 512 * MB expected? It seems to differ too much.
Should be KB
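With the KB fix applied, the threshold table and its dispatch logic can be sketched like this. The field meanings of `ModeConfig` (overall size cap for the custom kernel, and the one-shot/two-shot cutoff) and the `pick_algo` helper are assumptions for illustration, not SGLang's actual definitions.

```python
from typing import NamedTuple

KB, MB = 1 << 10, 1 << 20
INF = float("inf")

class ModeConfig(NamedTuple):
    # Assumed fields: largest message the custom kernel handles at
    # all, and the size below which one-shot beats two-shot.
    max_size: float
    one_shot_threshold: float

# NOTE: This result is based on benchmarks on H200 GPUs
THRESHOLD_2_SHOT_MAP = {
    2: ModeConfig(2 * MB, INF),         # 2 ranks: one-shot for everything
    3: ModeConfig(512 * KB, 512 * KB),  # KB, per the fix in this thread
}

def pick_algo(world_size: int, nbytes: int) -> str:
    # Hypothetical dispatch: fall back (e.g. to NCCL) when the world
    # size is unknown or the message exceeds the custom kernel's cap.
    cfg = THRESHOLD_2_SHOT_MAP.get(world_size)
    if cfg is None or nbytes > cfg.max_size:
        return "fallback"
    return "one_shot" if nbytes <= cfg.one_shot_threshold else "two_shot"
```

Under this reading, a 512 MB cap for 3 ranks would have routed huge messages to the custom kernel, which is why the reviewer flagged it.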
…gl-project#19880) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Motivation
Modifications
This PR implements a clean version of custom all reduce which is highly configurable (we can set the number of SMs and the recommended CTA size). We also integrate post-Hopper features like PDL into the custom all-reduce, which improves latency by up to 40% at small batch sizes.
We also implement a push-mode 1-shot all reduce, which is significantly faster than pull mode at small batch sizes.
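The 1-shot vs. 2-shot trade-off behind these modifications can be sketched with a toy simulation (plain lists stand in for GPU buffers; the function names are illustrative). One-shot does a single phase where every rank reduces all peers' full buffers; two-shot does reduce-scatter over per-rank slices followed by an all-gather, which removes redundant reduction work and pays off at larger message sizes.

```python
WORLD = 4
N = 8  # elements per rank; divisible by WORLD

def one_shot(inputs):
    # One-shot: each rank reads every peer's full buffer and reduces.
    # Single communication phase; best for small messages.
    return [[sum(col) for col in zip(*inputs)] for _ in inputs]

def two_shot(inputs):
    # Two-shot: reduce-scatter (each rank reduces only its 1/WORLD
    # slice), then all-gather the reduced slices.
    world, n = len(inputs), len(inputs[0])
    chunk = n // world
    slices = []
    for rank in range(world):
        lo, hi = rank * chunk, (rank + 1) * chunk
        slices.append([sum(col) for col in zip(*(x[lo:hi] for x in inputs))])
    gathered = [v for s in slices for v in s]  # all-gather phase
    return [list(gathered) for _ in range(world)]

inputs = [[float(r + i) for i in range(N)] for r in range(WORLD)]
```

Both algorithms agree on the result; the crossover point between them is exactly what the threshold map discussed in the review encodes.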
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci