[Dependency] Flashinfer 0.6.8post1 -> 0.6.11 #24452

Merged
Fridge003 merged 9 commits into main from dep/flashinfer-0610 on May 12, 2026

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented May 5, 2026

Commits of interest:

https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.10
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.9

Note: if this is not cherry-picked, maybe just use 0.6.11, because otherwise it will break main and this feature will perform badly.

Screenshot 2026-05-06 at 3 48 23 PM

Confirming that we are not falling back from TRTLLM allreduce fusion:

CUDA_VISIBLE_DEVICES=4,5,6,7 pytest test/registered/4-gpu-models/test_qwen3_30b.py
========================================================================= test session starts ==========================================================================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0
rootdir: /sgl-workspace/sglang/test
configfile: pytest.ini
plugins: anyio-4.13.0, typeguard-4.5.1
collected 2 items                                                                                                                                                      

test/registered/4-gpu-models/test_qwen3_30b.py ..                                                                                                                [100%]

=========================================================================== warnings summary ===========================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================== 2 passed, 3 warnings in 244.72s (0:04:04) ===============================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

@b8zhong b8zhong changed the title from "[Dependency] Flashinfer 0.6.10" to "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10" May 5, 2026
@github-actions github-actions Bot added the dependencies label (Pull requests that update a dependency file) May 5, 2026
@b8zhong b8zhong self-assigned this May 6, 2026
@b8zhong
Collaborator Author

b8zhong commented May 6, 2026

Screenshot 2026-05-06 at 12 09 18 PM

It's due to a hang in Flashinfer allreduce fusion. This test passes with --enforce-disable-flashinfer-allreduce-fusion

@b8zhong
Collaborator Author

b8zhong commented May 6, 2026

/rerun-stage stage-c-test-4-gpu-h100

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies). View workflow run

@b8zhong
Collaborator Author

b8zhong commented May 6, 2026

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@b8zhong b8zhong marked this pull request as draft May 7, 2026 04:06
aleozlx pushed a commit to flashinfer-ai/flashinfer that referenced this pull request May 7, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

Caused by #2955
Currently, it's causing a bug in SGLang. Because the `group=` parameter is
missing, in a scenario with 4 devices and world size = 2 the rendezvous
expects all 4 to respond, which causes a hang in warmup.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

#2955

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

sgl-project/sglang#24452


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added optional process group parameter for AllReduce fusion on TRTLLM
backends, enabling users to configure symmetric memory rendezvous
behavior.

* **Documentation**
* Updated documentation to describe the new parameter and its default
behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
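
The hang described in the commit message above can be illustrated with a minimal sketch in plain Python. It makes no FlashInfer or torch calls: the helper names `expected_participants` and `rendezvous_completes` are hypothetical, and the model simply assumes that a rendezvous with no explicit group waits for every process. The real symmetric-memory rendezvous is more involved.

```python
from typing import Optional, Sequence, Set


def expected_participants(
    num_processes: int, group: Optional[Sequence[int]] = None
) -> Set[int]:
    """Ranks the rendezvous waits for (assumed model, not FlashInfer code).

    With no explicit group, every process is expected to respond.
    """
    if group is None:
        return set(range(num_processes))
    return set(group)


def rendezvous_completes(
    arriving: Sequence[int],
    num_processes: int,
    group: Optional[Sequence[int]] = None,
) -> bool:
    """The rendezvous finishes only if every expected rank arrives."""
    return expected_participants(num_processes, group) <= set(arriving)


# Scenario from the commit message: 4 devices, but only 2 ranks take part.
tp_group = [0, 1]

# Without group=, all 4 ranks are expected, so the 2 callers wait forever.
print(rendezvous_completes(tp_group, num_processes=4))  # False

# With group= passed through, only the 2 members are expected.
print(rendezvous_completes(tp_group, num_processes=4, group=tp_group))  # True
```

The point of the sketch is only that the set of expected participants has to match the set of ranks that actually call in; passing the process group is what narrows it.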
@b8zhong b8zhong marked this pull request as ready for review May 7, 2026 11:06
@b8zhong b8zhong changed the title from "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10" to "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10post1" May 7, 2026
@b8zhong
Collaborator Author

b8zhong commented May 7, 2026

Thanks Alex @aleozlx. Just bumped it; I think we should be good to go after NV CI passes.

@Swipe4057
Contributor

@b8zhong b8zhong changed the title from "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10post1" to "[Dependency] Flashinfer 0.6.8post1 -> 0.6.11" May 9, 2026
@b8zhong b8zhong changed the title from "[Dependency] Flashinfer 0.6.8post1 -> 0.6.11" to "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10.post1" May 9, 2026
@b8zhong b8zhong changed the title from "[Dependency] Flashinfer 0.6.8post1 -> 0.6.10.post1" to "[Dependency] Flashinfer 0.6.8post1 -> 0.6.11" May 9, 2026
@b8zhong b8zhong requested review from AniZpZ and FlamingoPg as code owners May 11, 2026 23:49
@b8zhong b8zhong force-pushed the dep/flashinfer-0610 branch from 5d0e28a to 3760bb6 Compare May 11, 2026 23:58
@b8zhong
Collaborator Author

b8zhong commented May 12, 2026

github.com/sgl-project/sglang/actions/runs/25711744448/job/75567004924?pr=24452 H20 is failing on main

@Fridge003
Collaborator

@b8zhong H20 failing test is unrelated. Let me merge this PR

@Fridge003 Fridge003 merged commit d5f3254 into main May 12, 2026
249 of 273 checks passed
@Fridge003 Fridge003 deleted the dep/flashinfer-0610 branch May 12, 2026 21:38
xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
Root cause: d5f3254 (sgl-project#24452, Flashinfer 0.6.8.post1 -> 0.6.11) introduced a
strict shape check on `globalScale` in `flashinfer.fp4_quantize` that rejects
the per-expert tensor sglang's
`compressed_tensors_w4a4_nvfp4_moe.apply_weights` passes at line 315.

Confirms regression by local A/B: 13afe8a (flashinfer 0.6.8.post1, sgl-kernel
0.4.1.post1+cu130, torch 2.9.1+cu130) passes with gsm8k 0.951; 34c0029
(flashinfer 0.6.11.post1, torch 2.11.0+cu130) fails with
"RuntimeError: shape '[1]' is invalid for input of size 128" in
nvfp4_quantize_cute_dsl. Patching fp4_utils.py to backend="cuda" reveals the
underlying flashinfer-side assertion ("globalScale should have shape [1] or
[num_tokens]"), confirming the kernel - not the cute-dsl wrapper - is the
true gate.

Refutes the previous session's hypothesis (51a9403, 28758d3): 51a9403 is a
patch-version bump (0.6.11 -> 0.6.11.post1) with no relevant kernel change;
28758d3 is SM90+MXFP4 only and never runs on B200 NVFP4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
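
As a rough illustration of the shape constraint quoted in the commit message above, here is a small sketch in plain Python. The function `check_global_scale_shape` is a hypothetical stand-in, not the FlashInfer kernel check. The value 128 comes from the quoted error message; treating it as a per-expert count and using 16 tokens are assumptions made only for the example.

```python
from typing import Sequence


def check_global_scale_shape(scale_shape: Sequence[int], num_tokens: int) -> None:
    """Hypothetical stand-in for the quoted assertion.

    Accepts a scalar scale of shape [1] or a per-token scale of shape
    [num_tokens]; anything else is rejected.
    """
    allowed = ((1,), (num_tokens,))
    if tuple(scale_shape) not in allowed:
        raise ValueError(
            "globalScale should have shape [1] or [num_tokens], "
            f"got {list(scale_shape)} with num_tokens={num_tokens}"
        )


num_tokens = 16  # assumed batch size, for illustration only

check_global_scale_shape((1,), num_tokens)           # scalar scale: accepted
check_global_scale_shape((num_tokens,), num_tokens)  # per-token scale: accepted

# A per-expert scale with 128 entries matches neither allowed shape.
rejected = False
try:
    check_global_scale_shape((128,), num_tokens)
except ValueError as err:
    rejected = True
    print(err)
```

Under these assumptions a per-expert tensor only passes by coincidence, when its length happens to equal the number of tokens, which is consistent with the regression surfacing as a shape error.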
Labels

bypass-fastfail, dependencies (Pull requests that update a dependency file), high priority, run-ci

4 participants