Summary of Changes

Hello @nvjullin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the FlashInfer library dependency to version 0.6.4. The upgrade ensures the project uses the latest optimizations and features provided by FlashInfer, a high-performance attention kernel library. Benchmarking results indicate that this update maintains accuracy and generally stable performance, with minor improvements in some concurrency scenarios.
Code Review
This pull request updates the flashinfer dependency from version 0.6.3 to 0.6.4. The changes are applied consistently across all relevant files, including the Dockerfile, Python dependencies in pyproject.toml, version checks in the source code, and CI scripts. The author has also included comprehensive benchmark and accuracy test results, which show that the new version maintains or improves performance without regressions. The changes are clear, correct, and well-tested. This is a solid update.
Motivation
Modifications
Changed all flashinfer version references from 0.6.3 to 0.6.4.
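The version checks mentioned in the review can follow a pattern like the sketch below. This is an illustration only, not sglang's actual code: the distribution name `flashinfer-python`, the helper names, and the `REQUIRED` constant are assumptions.

```python
# Hypothetical sketch of a minimum-version guard of the kind this PR bumps.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = (0, 6, 4)  # assumed minimum flashinfer version after this PR

def parse(v: str) -> tuple:
    """Parse a plain 'X.Y.Z' version string into an integer tuple."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def flashinfer_ok() -> bool:
    """Return True if an installed flashinfer meets the required version."""
    try:
        return parse(version("flashinfer-python")) >= REQUIRED
    except PackageNotFoundError:
        return False
```

Tuple comparison keeps the check correct across multi-digit components (e.g. 0.10.0 > 0.6.4), which naive string comparison would get wrong.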
Accuracy Tests
Tests were run on B200 with CUDA 13 in a pip-installed environment. The server model is DeepSeek-R1; for example, the TEP8 launch command is:

```
python3 -m sglang.launch_server --port 8080 --model deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code --kv-cache-dtype fp8_e4m3 --tensor-parallel-size 8 \
  --data-parallel-size 1 --expert-parallel-size 8 --enable-dp-lm-head \
  --max-running-requests 256 --cuda-graph-max-bs 256 --mem-fraction-static 0.85 \
  --chunked-prefill-size 32768 --max-prefill-tokens 70000 \
  --enable-flashinfer-allreduce-fusion --disable-radix-cache --quantization fp8 \
  --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm \
  --model-loader-extra-config '{"enable_multithread_load": true}' --stream-interval 30
```

I'll update with more configs once they finish. Done. Everything looks normal.

GPQA Accuracy
Benchmarking and Profiling
In the tables below, the numerical column names are concurrency levels. The client is `bench_serving` with ISL=4096, OSL=512.

Median TTFT (ms)
Median Output TPS/user (1000 / TPOT)
Output TPS/GPU
Total TPS/GPU
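The metric names above relate as follows. This is a sketch under assumed definitions (TPOT as median ms per output token per user; per-GPU throughput as aggregate tokens/sec divided by GPU count), not bench_serving's actual accounting:

```python
def tps_per_user(tpot_ms: float) -> float:
    """Per-user decode throughput: 1000 / TPOT, with TPOT in ms per output token."""
    return 1000.0 / tpot_ms

def tps_per_gpu(total_tps: float, num_gpus: int = 8) -> float:
    """Aggregate tokens/sec divided across GPUs (8 for the TEP8 config above)."""
    return total_tps / num_gpus
```

For example, a median TPOT of 20 ms corresponds to 50 tokens/sec per user, and 8000 total tokens/sec on 8 GPUs is 1000 TPS/GPU.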
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`