Update flashinfer to 0.6.4 #19238

Closed
nvjullin wants to merge 1 commit into sgl-project:main from nvjullin:update-flashinfer

Conversation

@nvjullin (Contributor) commented Feb 24, 2026

Motivation

Modifications

Changed all occurrences of flashinfer version 0.6.3 to 0.6.4.

Accuracy Tests

Tests were run on B200 (CUDA 13) in a pip-installed environment. The server is DeepSeek-R1; for example, the TEP8 launch command is:

python3 -m sglang.launch_server --port 8080 --model deepseek-ai/DeepSeek-R1-0528 --trust-remote-code --kv-cache-dtype fp8_e4m3 --tensor-parallel-size 8 --data-parallel-size 1 --expert-parallel-size 8 --enable-dp-lm-head --max-running-requests 256 --cuda-graph-max-bs 256 --mem-fraction-static 0.85 --chunked-prefill-size 32768 --max-prefill-tokens 70000 --enable-flashinfer-allreduce-fusion --disable-radix-cache --quantization fp8 --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --model-loader-extra-config '{"enable_multithread_load": true}' --stream-interval 30

Update: all configs have finished; everything looks normal.

GPQA Accuracy

| flashinfer | parallelism | mtp | mean | std | scores |
|---|---|---|---|---|---|
| 0.6.3 | dep8 | off | 0.802 | 0.412 | 0.808, 0.808, 0.793, 0.793, 0.828, 0.818, 0.788, 0.783 |
| 0.6.4 | dep8 | off | 0.801 | 0.402 | 0.818, 0.818, 0.803, 0.788, 0.773, 0.788, 0.818, 0.798 |
| 0.6.3 | dep8 | on | 0.814 | 0.394 | 0.823, 0.823, 0.808, 0.823, 0.803, 0.803, 0.823, 0.808 |
| 0.6.4 | dep8 | on | 0.799 | 0.405 | 0.788, 0.833, 0.793, 0.798, 0.788, 0.803, 0.798, 0.793 |
| 0.6.3 | tep8 | off | 0.797 | 0.405 | 0.788, 0.793, 0.798, 0.798, 0.808, 0.803, 0.793, 0.793 |
| 0.6.4 | tep8 | off | 0.798 | 0.398 | 0.793, 0.793, 0.778, 0.798, 0.803, 0.823, 0.793, 0.803 |
| 0.6.3 | tep8 | on | 0.798 | 0.398 | 0.758, 0.808, 0.793, 0.828, 0.798, 0.788, 0.808, 0.803 |
| 0.6.4 | tep8 | on | 0.783 | 0.390 | 0.778, 0.793, 0.818, 0.763, 0.768, 0.768, 0.768, 0.813 |
| 0.6.3 | tp8 | off | 0.807 | 0.398 | 0.803, 0.803, 0.823, 0.793, 0.828, 0.798, 0.803, 0.803 |
| 0.6.4 | tp8 | off | 0.801 | 0.416 | 0.793, 0.808, 0.813, 0.793, 0.798, 0.808, 0.813, 0.778 |
| 0.6.3 | tp8 | on | 0.806 | 0.390 | 0.783, 0.823, 0.823, 0.833, 0.823, 0.758, 0.793, 0.813 |
| 0.6.4 | tp8 | on | 0.802 | 0.368 | 0.783, 0.788, 0.798, 0.798, 0.803, 0.808, 0.798, 0.838 |
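As a sanity check on the table above, each row's mean is the average of its eight run scores; a minimal sketch using the 0.6.3 / dep8 / mtp-off row (note the std column is much larger than the spread of the eight run means, so it is presumably computed over per-question correctness rather than over runs):

```python
from statistics import mean

# Eight GPQA run scores from the 0.6.3 / dep8 / mtp-off row above.
scores = [0.808, 0.808, 0.793, 0.793, 0.828, 0.818, 0.788, 0.783]

row_mean = mean(scores)
print(round(row_mean, 3))  # 0.802, matching the table's mean column
```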

Benchmarking and Profiling

In the tables below, the numeric column headers denote concurrency. The client is bench_serving with ISL=4096 and OSL=512.

Median TTFT (ms)

| flashinfer | parallelism | mtp | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6.3 | dep8 | off | 507.3 | 886.0 | 1107.0 | 1775.5 | 2904.3 | 4949.0 | 7481.7 | 12324.5 | 22043.6 |
| 0.6.4 | dep8 | off | 512.3 | 887.4 | 1111.9 | 1424.3 | 2902.2 | 4915.6 | 7560.9 | 12261.1 | 22025.7 |
| 0.6.3 | dep8 | on | 526.3 | 547.5 | 549.5 | 553.6 | 562.2 | 637.9 | 896.0 | 1050.2 | 2473.6 |
| 0.6.4 | dep8 | on | 531.9 | 549.5 | 552.2 | 555.8 | 562.4 | 676.1 | 912.4 | 1040.8 | 2544.3 |
| 0.6.3 | tep8 | off | 222.9 | 431.3 | 794.8 | 1519.4 | 2805.6 | 4870.5 | 8478.6 | 14276.3 | 26214.6 |
| 0.6.4 | tep8 | off | 223.2 | 388.8 | 794.7 | 1518.5 | 2804.2 | 4873.2 | 7840.9 | 14107.8 | 26063.3 |
| 0.6.3 | tep8 | on | 230.9 | 250.3 | 244.3 | 252.4 | 263.1 | 332.6 | 476.3 | 637.2 | 825.2 |
| 0.6.4 | tep8 | on | 233.7 | 251.5 | 244.7 | 252.3 | 264.4 | 330.5 | 471.3 | 617.4 | 826.3 |
| 0.6.3 | tp8 | off | 199.0 | 380.3 | 684.4 | 1301.1 | 2407.5 | 4186.9 | 7216.8 | 12094.3 | 22187.2 |
| 0.6.4 | tp8 | off | 198.1 | 380.6 | 681.4 | 1297.7 | 2397.5 | 4166.5 | 7328.2 | 11903.5 | 21983.4 |
| 0.6.3 | tp8 | on | 204.5 | 218.6 | 226.0 | 222.1 | 232.4 | 292.3 | 416.0 | 574.5 | 719.9 |
| 0.6.4 | tp8 | on | 205.6 | 220.1 | 227.8 | 222.8 | 233.8 | 293.2 | 418.9 | 553.4 | 706.0 |

Median Output TPS/user (1000 / TPOT)

| flashinfer | parallelism | mtp | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6.3 | dep8 | off | 64.5 | 62.8 | 61.1 | 59.1 | 51.8 | 43.8 | 33.2 | 23.3 | 15.1 |
| 0.6.4 | dep8 | off | 64.1 | 62.9 | 61.0 | 55.0 | 51.8 | 44.0 | 33.1 | 23.1 | 15.1 |
| 0.6.3 | dep8 | on | 99.7 | 85.0 | 75.2 | 55.9 | 38.3 | 24.1 | 14.8 | 8.8 | 5.1 |
| 0.6.4 | dep8 | on | 94.4 | 86.8 | 73.3 | 55.1 | 38.7 | 24.8 | 15.0 | 8.8 | 5.0 |
| 0.6.3 | tep8 | off | 123.4 | 108.4 | 97.9 | 83.9 | 67.2 | 49.3 | 32.3 | 20.4 | 12.4 |
| 0.6.4 | tep8 | off | 123.5 | 108.8 | 97.9 | 84.2 | 67.3 | 49.3 | 32.3 | 20.7 | 12.3 |
| 0.6.3 | tep8 | on | 190.7 | 159.2 | 127.4 | 90.9 | 60.4 | 38.0 | 23.1 | 13.1 | 7.3 |
| 0.6.4 | tep8 | on | 188.8 | 160.8 | 127.6 | 90.3 | 60.0 | 37.5 | 23.2 | 13.0 | 7.3 |
| 0.6.3 | tp8 | off | 133.8 | 124.1 | 115.0 | 96.5 | 76.3 | 53.6 | 36.1 | 22.5 | 13.1 |
| 0.6.4 | tp8 | off | 134.3 | 125.0 | 115.9 | 96.7 | 76.5 | 53.7 | 36.2 | 21.6 | 13.2 |
| 0.6.3 | tp8 | on | 221.0 | 180.1 | 138.3 | 97.9 | 64.9 | 41.4 | 25.0 | 14.1 | 8.0 |
| 0.6.4 | tp8 | on | 214.5 | 172.6 | 139.0 | 98.5 | 65.0 | 41.3 | 24.9 | 14.0 | 8.0 |
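The per-user throughput above is derived from median TPOT as the table title states; a minimal sketch of the conversion (the sample TPOT value is hypothetical):

```python
def tps_per_user(tpot_ms: float) -> float:
    """Convert median time-per-output-token (ms) to output tokens/s per user."""
    return 1000.0 / tpot_ms

# e.g. a median TPOT of 15.5 ms corresponds to ~64.5 output tokens/s per user
print(round(tps_per_user(15.5), 1))  # 64.5
```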

Output TPS/GPU

| flashinfer | parallelism | mtp | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6.3 | dep8 | off | 7.6 | 14.2 | 27.0 | 49.1 | 79.7 | 122.9 | 177.9 | 236.7 | 291.0 |
| 0.6.4 | dep8 | off | 7.6 | 14.2 | 27.0 | 48.2 | 79.5 | 123.1 | 177.8 | 237.2 | 291.6 |
| 0.6.3 | dep8 | on | 11.1 | 20.2 | 33.6 | 53.3 | 73.3 | 96.9 | 122.1 | 150.5 | 167.7 |
| 0.6.4 | dep8 | on | 10.7 | 19.7 | 33.3 | 52.4 | 73.6 | 98.6 | 123.2 | 147.9 | 165.7 |
| 0.6.3 | tep8 | off | 14.7 | 24.7 | 42.5 | 67.4 | 98.5 | 134.2 | 171.9 | 210.6 | 242.7 |
| 0.6.4 | tep8 | off | 14.7 | 25.1 | 42.6 | 67.5 | 98.5 | 134.4 | 173.2 | 211.1 | 242.3 |
| 0.6.3 | tep8 | on | 21.9 | 37.0 | 58.0 | 85.1 | 113.0 | 147.6 | 182.1 | 210.3 | 235.4 |
| 0.6.4 | tep8 | on | 21.8 | 37.6 | 57.1 | 86.1 | 115.2 | 146.3 | 181.0 | 209.5 | 235.5 |
| 0.6.3 | tp8 | off | 15.9 | 28.2 | 50.0 | 77.6 | 112.3 | 149.3 | 191.8 | 235.1 | 267.9 |
| 0.6.4 | tp8 | off | 16.0 | 28.4 | 50.3 | 77.9 | 112.8 | 149.4 | 191.3 | 233.4 | 269.2 |
| 0.6.3 | tp8 | on | 26.9 | 41.4 | 64.6 | 93.4 | 125.4 | 162.3 | 196.6 | 223.9 | 255.9 |
| 0.6.4 | tp8 | on | 25.1 | 41.6 | 63.9 | 93.7 | 125.8 | 161.0 | 195.9 | 223.9 | 256.6 |

Total TPS/GPU

| flashinfer | parallelism | mtp | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6.3 | dep8 | off | 68.2 | 128.1 | 242.9 | 442.3 | 717.0 | 1106.4 | 1601.3 | 2130.2 | 2618.8 |
| 0.6.4 | dep8 | off | 68.0 | 128.0 | 243.1 | 433.5 | 715.4 | 1107.9 | 1599.9 | 2134.8 | 2624.2 |
| 0.6.3 | dep8 | on | 99.8 | 182.1 | 302.2 | 479.5 | 659.6 | 871.7 | 1098.7 | 1354.2 | 1508.9 |
| 0.6.4 | dep8 | on | 96.7 | 177.3 | 299.9 | 471.8 | 662.0 | 887.3 | 1108.5 | 1330.8 | 1491.6 |
| 0.6.3 | tep8 | off | 132.0 | 222.4 | 382.9 | 606.6 | 886.2 | 1207.9 | 1547.5 | 1895.4 | 2184.1 |
| 0.6.4 | tep8 | off | 132.1 | 226.2 | 383.6 | 607.7 | 886.8 | 1210.0 | 1559.1 | 1899.9 | 2180.5 |
| 0.6.3 | tep8 | on | 196.7 | 333.3 | 521.6 | 765.7 | 1017.0 | 1328.6 | 1639.3 | 1893.1 | 2118.6 |
| 0.6.4 | tep8 | on | 196.5 | 338.7 | 513.5 | 774.9 | 1036.5 | 1316.5 | 1629.4 | 1885.8 | 2119.8 |
| 0.6.3 | tp8 | off | 143.3 | 254.0 | 449.7 | 698.5 | 1010.9 | 1343.3 | 1726.6 | 2116.2 | 2411.3 |
| 0.6.4 | tp8 | off | 143.9 | 255.6 | 452.7 | 700.7 | 1014.8 | 1344.2 | 1721.5 | 2100.9 | 2423.0 |
| 0.6.3 | tp8 | on | 242.0 | 372.2 | 581.8 | 840.9 | 1128.4 | 1460.4 | 1769.6 | 2014.7 | 2303.0 |
| 0.6.4 | tp8 | on | 226.1 | 374.0 | 574.9 | 842.9 | 1132.6 | 1448.7 | 1763.2 | 2015.0 | 2309.0 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @nvjullin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the FlashInfer library dependency to version 0.6.4. This upgrade ensures the project utilizes the latest optimizations and features provided by FlashInfer, a high-performance attention kernel library. Benchmarking results indicate that this update maintains similar accuracy and generally stable performance, with minor improvements in some concurrency scenarios.

Highlights

  • FlashInfer Version Update: The flashinfer dependency has been upgraded from version 0.6.3 to 0.6.4 across the project's Dockerfile, Python dependencies, and runtime version checks.


Changelog
  • docker/Dockerfile
    • Updated the FLASHINFER_VERSION build argument to 0.6.4.
  • python/pyproject.toml
    • Updated flashinfer_python and flashinfer_cubin dependencies to version 0.6.4.
  • python/sglang/srt/entrypoints/engine.py
    • Updated the required flashinfer_python version in the assertion check to 0.6.4.
  • python/sglang/srt/server_args.py
    • Adjusted the version check for enabling flashinfer_trtllm as the MoE runner backend to require FlashInfer version 0.6.4 or newer.
  • python/sglang/srt/utils/common.py
    • Updated the example min_version in the docstring for check_pkg_version_at_least to 0.6.4.
  • scripts/ci/cuda/ci_install_dependency.sh
    • Updated the FLASHINFER_VERSION environment variable used in CI scripts to 0.6.4.
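The runtime version gates touched by this PR (e.g. the `check_pkg_version_at_least` helper named in the changelog) can be sketched roughly as follows; this is a hypothetical re-implementation for illustration, not the actual sglang code:

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(v: str) -> tuple:
    """Keep only leading numeric components ("0.6.4rc1" -> (0, 6))."""
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)


def check_pkg_version_at_least(pkg: str, min_version: str) -> bool:
    """Return True iff `pkg` is installed at version >= `min_version` (sketch)."""
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return False  # not installed at all
    return parse_version(installed) >= parse_version(min_version)


# e.g. gate a backend on the new release:
# assert check_pkg_version_at_least("flashinfer_python", "0.6.4")
```

Tuple comparison handles multi-digit components correctly (0.10.0 > 0.6.4), which naive string comparison would get wrong.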
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/release-docker-cu13-framework.yml
Activity
  • Accuracy tests were performed on B200 cuda13 with a pip-installed environment, showing comparable GPQA accuracy between FlashInfer 0.6.3 and 0.6.4.
  • Benchmarking and profiling were conducted to compare median TTFT, median output TPS/user, output TPS/GPU, and total TPS/GPU across various concurrency levels for both FlashInfer versions.
  • The pull request includes a checklist for formatting, unit tests, documentation, and benchmark results, indicating the author's progress and awareness of contribution guidelines.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request updates the flashinfer dependency from version 0.6.3 to 0.6.4. The changes are applied consistently across all relevant files, including the Dockerfile, Python dependencies in pyproject.toml, version checks in the source code, and CI scripts. The author has also included comprehensive benchmark and accuracy test results, which show that the new version maintains or improves performance without regressions. The changes are clear, correct, and well-tested. This is a solid update.

@b8zhong (Collaborator) commented Feb 27, 2026

@nvjullin Hi, thanks for the comprehensive performance analysis. However, this should already be covered by the existing PR here:

#19005

I've added you as a co-author on that PR.

@b8zhong b8zhong closed this Feb 27, 2026
@nvjullin nvjullin deleted the update-flashinfer branch March 19, 2026 08:36

Labels

dependencies Pull requests that update a dependency file
