
Update flashinfer to 0.6.1#15551

Merged
Fridge003 merged 10 commits into main from upd-flashinfer
Jan 16, 2026
Conversation

@Fridge003
Collaborator

@Fridge003 Fridge003 commented Dec 21, 2025

Motivation

flashinfer -> 0.6.1
flashinfer-cubin -> 0.6.1

PRs dependent on this upgrade:
#15546
#15422
#15514
#15347
#14668
#16232
#16279
#16892
#16534
#12787
...

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the dependencies label Dec 21, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the FlashInfer library to version 0.6.0 across the project's configuration and dependency management files. This essential upgrade ensures that the project is aligned with the latest FlashInfer release, which is a prerequisite for several other pending pull requests and helps maintain compatibility and leverage new optimizations from the library.

Highlights

  • FlashInfer Version Update: The FLASHINFER_VERSION argument in the Dockerfile has been updated from 0.5.3 to 0.6.0.
  • Python Dependency Alignment: The flashinfer_python and flashinfer_cubin dependencies in pyproject.toml have been upgraded to version 0.6.0 to match the new FlashInfer release.
  • Engine Version Assertion: The version assertion for flashinfer_python in engine.py has been updated to 0.6.0 to ensure compatibility checks are current.
  • CI Script Update: The FLASHINFER_VERSION variable in the ci_install_dependency.sh script has been updated to 0.6.0.
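The runtime version assertion mentioned above can be sketched roughly as follows. This is an illustrative stand-alone sketch, not the actual code in engine.py; the names `REQUIRED_FLASHINFER`, `parse_version`, and `assert_flashinfer_version` are hypothetical, and the parser assumes a plain `X.Y.Z` version string.

```python
# Minimal sketch of a flashinfer_python minimum-version guard,
# assuming the required floor is 0.6.0. Names are illustrative only.
REQUIRED_FLASHINFER = (0, 6, 0)

def parse_version(v: str) -> tuple:
    """Parse '0.6.1' -> (0, 6, 1); drops any local suffix after '+'."""
    core = v.split("+")[0]
    return tuple(int(x) for x in core.split("."))

def assert_flashinfer_version(installed: str) -> None:
    """Raise if the installed version is older than the required floor."""
    if parse_version(installed) < REQUIRED_FLASHINFER:
        required = ".".join(map(str, REQUIRED_FLASHINFER))
        raise RuntimeError(
            f"flashinfer_python {installed} is too old; need >= {required}"
        )
```

Tuple comparison gives the expected lexicographic ordering here, e.g. `(0, 5, 3) < (0, 6, 0)`, so `assert_flashinfer_version("0.5.3")` raises while `assert_flashinfer_version("0.6.1")` passes.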



@Fridge003
Collaborator Author

cc @DarkSharpness

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the flashinfer dependency to version 0.6.0. The changes are consistent across the Dockerfile, Python project dependencies, runtime version assertions, and CI installation scripts. My review includes one suggestion to improve maintainability by centralizing the version number.

@Fridge003
Collaborator Author

Fridge003 commented Jan 8, 2026

/tag-and-rerun-ci again

@github-actions github-actions bot added the run-ci label Jan 8, 2026
@Fridge003 Fridge003 changed the title from "Update flashinfer version to 0.6.0" to "Update flashinfer/cutedsl version" Jan 8, 2026
Contributor

@elvischenv elvischenv left a comment


The FlashInfer MoE API breaking changes (removal of tile_tokens_dim) should land together with this PR; otherwise, the tests using FlashInfer MoE will fail.

@Fridge003
Collaborator Author

We are currently blocked on the FA4 update (#15182); it seems the FA kernels need to be upgraded first.

@Fridge003 Fridge003 changed the title from "Update flashinfer to 0.6.0" to "Update flashinfer to 0.6.1" Jan 15, 2026
@Fridge003
Collaborator Author

Fridge003 commented Jan 16, 2026

Just checking this failing case: https://github.com/sgl-project/sglang/actions/runs/21020274512/job/60599048636?pr=15551 (seems to be the only non-flaky test).

Seems the behavior of BatchDecodeWithPagedKVCacheWrapper in the flashinfer backend changed when upgrading from 0.5.3 to 0.6.1:

In 0.5.3, it is dispatched to BatchPrefillWithPagedKVCacheKernel under CUDA graph:

[screenshot: profiler trace on 0.5.3]

While in 0.6.1, it is dispatched to PrefillWithKVCacheKernel instead, which is about 2x slower on the same workload:

[screenshot: profiler trace on 0.6.1]

As a workaround, I will set fa3 as the default attention backend for the Mixtral model to unblock CI.
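The workaround above amounts to a conditional backend choice. A minimal sketch of that logic, assuming a hypothetical helper (`pick_attention_backend` is not sglang code; the inputs are just the model path and the installed flashinfer version string):

```python
def pick_attention_backend(model_path: str, flashinfer_version: str) -> str:
    """Hypothetical sketch: prefer fa3 for Mixtral on flashinfer >= 0.6,
    where BatchDecodeWithPagedKVCacheWrapper dispatches to the slower
    PrefillWithKVCacheKernel under CUDA graph; otherwise keep flashinfer."""
    major_minor = tuple(int(x) for x in flashinfer_version.split(".")[:2])
    if "Mixtral" in model_path and major_minor >= (0, 6):
        return "fa3"
    return "flashinfer"
```

For example, `pick_attention_backend("mistralai/Mixtral-8x7B-Instruct-v0.1", "0.6.1")` selects `"fa3"`, while the same model on `"0.5.3"` stays on `"flashinfer"`.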

My commands for profiling:

```shell
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 --tp 2 --device cuda --host 127.0.0.1 --port 21000

python -m sglang.bench_serving --backend sglang --model mistralai/Mixtral-8x7B-Instruct-v0.1 --num-prompts 10 --sharegpt-output-len 10 --profile --port 21000
```

@Fridge003
Collaborator Author

/rerun-stage performance-test-2-gpu

@github-actions
Contributor

✅ Triggered performance-test-2-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003
Collaborator Author

/rerun-stage unit-test-backend-4-gpu-gb200

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu-gb200 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003 Fridge003 merged commit a046758 into main Jan 16, 2026
119 of 171 checks passed
@Fridge003 Fridge003 deleted the upd-flashinfer branch January 16, 2026 16:48
@Swipe4057
Contributor

> As a workaround, I will set fa3 as the default backend for Mixtral model to unblock CI.

Not only Mixtral: Qwen3-235B bandwidth also collapsed on the flashinfer backend.

@Edenzzzz
Contributor

Edenzzzz commented Jan 21, 2026

@Fridge003 PrefillWithKVCacheKernel is the FA3 paged attention kernel in FlashInfer. I tested a while ago that it's not faster than FA2 (flashinfer-ai/flashinfer#1340), and @zihaoye is working on optimizing FA3 (flashinfer-ai/flashinfer#2192).
I'm not sure why the upgrade leads to calling FA3, though; here it's fixed to FA2.
