[Dependency] Flashinfer 0.6.8post1 -> 0.6.11#24452
Conversation
|
/rerun-stage stage-c-test-4-gpu-h100 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
✅ Triggered |
<!-- .github/pull_request_template.md -->

## 📌 Description

Caused by #2955.

Currently, it's causing a bug in SGLang: when the `group=` parameter is missing (for example, with 4 devices and world size = 2), the rendezvous expects all 4 to respond, which causes a hang in warmup.

<!-- What does this PR do? Briefly describe the changes and why they’re needed. -->

## 🔍 Related Issues

#2955

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

sgl-project/sglang#24452

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added optional process group parameter for AllReduce fusion on TRTLLM backends, enabling users to configure symmetric memory rendezvous behavior.
* **Documentation**
  * Updated documentation to describe the new parameter and its default behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
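To make the hang described above concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions: `rendezvous_participants` and `rendezvous_completes` are hypothetical helpers, not FlashInfer or SGLang APIs, and the assumption is that a rendezvous without an explicit `group` waits on every rank in the default group.

```python
# Toy model of the rendezvous hang (hypothetical helpers, not a real API).
# Assumption: without an explicit group, the rendezvous waits on every rank
# in the default group; with one, it waits only on that group's ranks.

def rendezvous_participants(default_group, group=None):
    """Return the ranks the rendezvous waits for."""
    return list(default_group if group is None else group)


def rendezvous_completes(arrived, expected):
    """The rendezvous finishes only once every expected rank has arrived."""
    return set(expected) <= set(arrived)


ALL_DEVICES = [0, 1, 2, 3]  # 4 devices visible
TP_GROUP = [0, 1]           # world size = 2 actually taking part
arrived = TP_GROUP          # only these ranks ever reach the rendezvous

# No group passed: waits on ranks 2 and 3 forever, i.e. the warmup hang.
hangs_without_group = not rendezvous_completes(
    arrived, rendezvous_participants(ALL_DEVICES)
)

# Group passed: only ranks 0 and 1 are expected, so it completes.
completes_with_group = rendezvous_completes(
    arrived, rendezvous_participants(ALL_DEVICES, group=TP_GROUP)
)
```

In this model, passing the group narrows the expected set to the ranks that actually participate, which is the behavior the new optional parameter is meant to enable.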
|
Thanks Alex @aleozlx. Just bumped it; I think we should be good to go after NV CI passes. |
5d0e28a to 3760bb6
|
github.com/sgl-project/sglang/actions/runs/25711744448/job/75567004924?pr=24452 H20 is failing on main |
|
@b8zhong H20 failing test is unrelated. Let me merge this PR |
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Root cause: d5f3254 (sgl-project#24452, Flashinfer 0.6.8.post1 -> 0.6.11) introduced a strict shape check on `globalScale` in `flashinfer.fp4_quantize` that rejects the per-expert tensor sglang's `compressed_tensors_w4a4_nvfp4_moe.apply_weights` passes at line 315.

Confirms regression by local A/B: 13afe8a (flashinfer 0.6.8.post1, sgl-kernel 0.4.1.post1+cu130, torch 2.9.1+cu130) passes with gsm8k 0.951; 34c0029 (flashinfer 0.6.11.post1, torch 2.11.0+cu130) fails with "RuntimeError: shape '[1]' is invalid for input of size 128" in nvfp4_quantize_cute_dsl.

Patching fp4_utils.py to backend="cuda" reveals the underlying flashinfer-side assertion ("globalScale should have shape [1] or [num_tokens]"), confirming the kernel - not the cute-dsl wrapper - is the true gate.

Refutes the previous session's hypothesis (51a9403, 28758d3): 51a9403 is a patch-version bump (0.6.11 -> 0.6.11.post1) with no relevant kernel change; 28758d3 is SM90+MXFP4 only and never runs on B200 NVFP4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
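The shape rule quoted in that assertion can be sketched in plain Python. This is an illustration only: `global_scale_shape_ok` is a hypothetical helper, not flashinfer code, and the 128-element per-expert tensor and the token count of 16 are assumed values chosen to match the size in the reported error.

```python
# Sketch of the rule "globalScale should have shape [1] or [num_tokens]".
# Hypothetical helper for illustration; not part of flashinfer.

def global_scale_shape_ok(shape, num_tokens):
    """Accept only a single scalar scale or one scale per token."""
    return tuple(shape) in {(1,), (num_tokens,)}


num_tokens = 16            # assumed batch of tokens for the example
per_expert_shape = (128,)  # assumed per-expert scale tensor, as in "size 128"

scalar_ok = global_scale_shape_ok((1,), num_tokens)
per_token_ok = global_scale_shape_ok((num_tokens,), num_tokens)
per_expert_ok = global_scale_shape_ok(per_expert_shape, num_tokens)
```

Under these assumptions the scalar and per-token shapes pass while the per-expert shape is rejected, which is consistent with the failure described above.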

Commits of interest:
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.10
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.9
Note: if this is not cherry-picked, we should probably just use 0.6.11, because otherwise main will break and this feature will perform poorly.
Confirming that it does not fall back from TRTLLM allreduce fusion: