Summary of Changes

Hello @nvjullin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the FlashInfer library dependency to version 0.6.4. The upgrade ensures the project uses the latest optimizations and features provided by FlashInfer, a high-performance attention kernel library. Benchmarking results indicate that this update maintains accuracy and generally stable performance, with minor improvements in some concurrency scenarios.
Code Review
This pull request updates the flashinfer dependency from version 0.6.3 to 0.6.4. The changes are applied consistently across all relevant files, including the Dockerfile, Python dependencies in pyproject.toml, version checks in the source code, and CI scripts. The author has also included comprehensive benchmark and accuracy test results, which show that the new version maintains or improves performance without regressions. The changes are clear, correct, and well-tested. This is a solid update.
Motivation
Modifications
Changed all flashinfer version references from 0.6.3 to 0.6.4.
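The version checks mentioned in the review can follow a pattern like the sketch below. This is an illustration only, not sglang's actual code: the distribution name `flashinfer-python`, the helper names, and the `REQUIRED` constant are assumptions.

```python
# Hypothetical sketch of a minimum-version guard of the kind this PR bumps.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = (0, 6, 4)  # assumed minimum flashinfer version after this PR

def parse(v: str) -> tuple:
    """Parse a plain 'X.Y.Z' version string into an integer tuple."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def flashinfer_ok() -> bool:
    """Return True if an installed flashinfer meets the required version."""
    try:
        return parse(version("flashinfer-python")) >= REQUIRED
    except PackageNotFoundError:
        return False
```

Tuple comparison keeps the check correct across multi-digit components (e.g. 0.10.0 > 0.6.4), which naive string comparison would get wrong.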
Accuracy Tests
Tests were run on B200 with CUDA 13 in a pip-installed environment. The server model is DeepSeek-R1; for example, the TEP8 launch command is:

```
python3 -m sglang.launch_server --port 8080 --model deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code --kv-cache-dtype fp8_e4m3 --tensor-parallel-size 8 \
  --data-parallel-size 1 --expert-parallel-size 8 --enable-dp-lm-head \
  --max-running-requests 256 --cuda-graph-max-bs 256 --mem-fraction-static 0.85 \
  --chunked-prefill-size 32768 --max-prefill-tokens 70000 \
  --enable-flashinfer-allreduce-fusion --disable-radix-cache --quantization fp8 \
  --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm \
  --model-loader-extra-config '{"enable_multithread_load": true}' --stream-interval 30
```

I'll update with more configs once they finish. Done. Everything looks normal.

GPQA Accuracy
Benchmarking and Profiling
In the tables below, the numerical column names are concurrency levels. The client is `bench_serving` with ISL=4096, OSL=512.

Median TTFT (ms)
Median Output TPS/user (1000 / TPOT)
Output TPS/GPU
Total TPS/GPU
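The metric names above relate as follows. This is a sketch under assumed definitions (TPOT as median ms per output token per user; per-GPU throughput as aggregate tokens/sec divided by GPU count), not bench_serving's actual accounting:

```python
def tps_per_user(tpot_ms: float) -> float:
    """Per-user decode throughput: 1000 / TPOT, with TPOT in ms per output token."""
    return 1000.0 / tpot_ms

def tps_per_gpu(total_tps: float, num_gpus: int = 8) -> float:
    """Aggregate tokens/sec divided across GPUs (8 for the TEP8 config above)."""
    return total_tps / num_gpus
```

For example, a median TPOT of 20 ms corresponds to 50 tokens/sec per user, and 8000 total tokens/sec on 8 GPUs is 1000 TPS/GPU.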
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`