
Update flashinfer to 0.6.1#15551

Merged
Fridge003 merged 10 commits into main from upd-flashinfer
Jan 16, 2026
Conversation

@Fridge003
Collaborator

@Fridge003 Fridge003 commented Dec 21, 2025

Motivation

flashinfer -> 0.6.1
flashinfer-cubin -> 0.6.1

PRs dependent on this upgrade:
#15546
#15422
#15514
#15347
#14668
#16232
#16279
#16892
#16534
#12787
...

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the dependencies label Dec 21, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the FlashInfer library to version 0.6.0 across the project's configuration and dependency management files. This essential upgrade ensures that the project is aligned with the latest FlashInfer release, which is a prerequisite for several other pending pull requests and helps maintain compatibility and leverage new optimizations from the library.

Highlights

  • FlashInfer Version Update: The FLASHINFER_VERSION argument in the Dockerfile has been updated from 0.5.3 to 0.6.0.
  • Python Dependency Alignment: The flashinfer_python and flashinfer_cubin dependencies in pyproject.toml have been upgraded to version 0.6.0 to match the new FlashInfer release.
  • Engine Version Assertion: The version assertion for flashinfer_python in engine.py has been updated to 0.6.0 to ensure compatibility checks are current.
  • CI Script Update: The FLASHINFER_VERSION variable in the ci_install_dependency.sh script has been updated to 0.6.0.
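The runtime version assertion mentioned above can be sketched roughly as follows. This is an illustrative stand-alone sketch, not the actual code in engine.py; the names `REQUIRED_FLASHINFER`, `parse_version`, and `assert_flashinfer_version` are hypothetical, and the parser assumes a plain `X.Y.Z` version string.

```python
# Minimal sketch of a flashinfer_python minimum-version guard,
# assuming the required floor is 0.6.0. Names are illustrative only.
REQUIRED_FLASHINFER = (0, 6, 0)

def parse_version(v: str) -> tuple:
    """Parse '0.6.1' -> (0, 6, 1); drops any local suffix after '+'."""
    core = v.split("+")[0]
    return tuple(int(x) for x in core.split("."))

def assert_flashinfer_version(installed: str) -> None:
    """Raise if the installed version is older than the required floor."""
    if parse_version(installed) < REQUIRED_FLASHINFER:
        required = ".".join(map(str, REQUIRED_FLASHINFER))
        raise RuntimeError(
            f"flashinfer_python {installed} is too old; need >= {required}"
        )
```

Tuple comparison gives the expected lexicographic ordering here, e.g. `(0, 5, 3) < (0, 6, 0)`, so `assert_flashinfer_version("0.5.3")` raises while `assert_flashinfer_version("0.6.1")` passes.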



@Fridge003
Collaborator Author

cc @DarkSharpness

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the flashinfer dependency to version 0.6.0. The changes are consistent across the Dockerfile, Python project dependencies, runtime version assertions, and CI installation scripts. My review includes one suggestion to improve maintainability by centralizing the version number.

@Fridge003
Collaborator Author

Fridge003 commented Jan 8, 2026

/tag-and-rerun-ci again

@github-actions github-actions bot added the run-ci label Jan 8, 2026
@Fridge003 Fridge003 changed the title from "Update flashinfer version to 0.6.0" to "Update flashinfer/cutedsl version" Jan 8, 2026
Contributor

@elvischenv elvischenv left a comment


The FlashInfer MoE API breaking changes (removal of tile_tokens_dim) should land together with this PR; otherwise, the tests using FlashInfer MoE will fail.

@Fridge003
Collaborator Author

We are currently blocked on the FA4 update (#15182); it seems the FA kernels need to be upgraded first.

@Fridge003 Fridge003 changed the title from "Update flashinfer to 0.6.0" to "Update flashinfer to 0.6.1" Jan 15, 2026
@Fridge003
Collaborator Author

Fridge003 commented Jan 16, 2026

Just checking this failing case: https://github.com/sgl-project/sglang/actions/runs/21020274512/job/60599048636?pr=15551 (seems to be the only non-flaky test).

Seems the behavior of BatchDecodeWithPagedKVCacheWrapper in the flashinfer backend changed when upgrading from 0.5.3 to 0.6.1:

In 0.5.3, it is dispatched to BatchPrefillWithPagedKVCacheKernel under CUDA graph:

[screenshot: profiler trace on 0.5.3]

While in 0.6.1, it is dispatched to PrefillWithKVCacheKernel instead, which is about 2x slower on the same workload:

[screenshot: profiler trace on 0.6.1]

As a workaround, I will set fa3 as the default attention backend for the Mixtral model to unblock CI.
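The workaround above amounts to a conditional backend choice. A minimal sketch of that logic, assuming a hypothetical helper (`pick_attention_backend` is not sglang code; the inputs are just the model path and the installed flashinfer version string):

```python
def pick_attention_backend(model_path: str, flashinfer_version: str) -> str:
    """Hypothetical sketch: prefer fa3 for Mixtral on flashinfer >= 0.6,
    where BatchDecodeWithPagedKVCacheWrapper dispatches to the slower
    PrefillWithKVCacheKernel under CUDA graph; otherwise keep flashinfer."""
    major_minor = tuple(int(x) for x in flashinfer_version.split(".")[:2])
    if "Mixtral" in model_path and major_minor >= (0, 6):
        return "fa3"
    return "flashinfer"
```

For example, `pick_attention_backend("mistralai/Mixtral-8x7B-Instruct-v0.1", "0.6.1")` selects `"fa3"`, while the same model on `"0.5.3"` stays on `"flashinfer"`.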

My commands for profiling:

```shell
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 --tp 2 --device cuda --host 127.0.0.1 --port 21000

python -m sglang.bench_serving --backend sglang --model mistralai/Mixtral-8x7B-Instruct-v0.1 --num-prompts 10 --sharegpt-output-len 10 --profile --port 21000
```

@Fridge003
Collaborator Author

/rerun-stage performance-test-2-gpu

@github-actions
Contributor

✅ Triggered performance-test-2-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003
Collaborator Author

/rerun-stage unit-test-backend-4-gpu-gb200

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu-gb200 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003 Fridge003 merged commit a046758 into main Jan 16, 2026
119 of 171 checks passed
@Fridge003 Fridge003 deleted the upd-flashinfer branch January 16, 2026 16:48
@Swipe4057
Contributor

> As a workaround, I will set fa3 as the default backend for Mixtral model to unblock CI.

Not only Mixtral: Qwen3-235B bandwidth also collapsed on the flashinfer backend.

@Edenzzzz
Contributor

Edenzzzz commented Jan 21, 2026

@Fridge003 PrefillWithKVCacheKernel is the FA3 paged attention kernel in FlashInfer. I tested a while ago that it's not faster than FA2 (flashinfer-ai/flashinfer#1340), and @zihaoye is working on optimizing FA3 (flashinfer-ai/flashinfer#2192).
I'm not sure why the upgrade leads to calling FA3, though; here it's fixed to FA2.
