[Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate#33771
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request correctly reverts two previous commits that introduced performance regressions. The changes are consistent across all affected files, removing the problematic code related to on-device attention metadata computation and a temporary workaround for MoE layer compilation. The detailed description, including benchmark data, clearly justifies the revert. This is a solid contribution to restore performance and improve codebase maintainability. I find no issues with the changes.
Force-pushed from 227c9f3 to c517eca
Hi @aabbccddwasd, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Force-pushed from c517eca to 7e72880
Force-pushed from 7e72880 to d637e75
LucasWilkinson left a comment:
Please restrict the revert to flashinfer.py; we are still planning to deprecate these properties.
Force-pushed from 7059c9c to 72f74d2
Fixed
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with seq_lens.cpu() in the FlashInfer backend. The property access provides better performance by avoiding unnecessary D2H transfers when the cached value is already available. Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with MTP (Multi-Token Prediction) enabled, where throughput dropped from 95 to 77 tps. Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
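To illustrate why the property access is faster, here is a minimal, self-contained sketch — not vLLM's actual implementation; `FakeTensor` and `AttentionMetadata` are hypothetical stand-ins — showing how a host-side copy cached at metadata-build time avoids the per-step device-to-host (D2H) transfer that calling `.cpu()` on every access would incur:

```python
# Minimal sketch (hypothetical classes, not vLLM code) of the difference
# between a cached seq_lens_cpu property and calling seq_lens.cpu()
# on every decode step.

class FakeTensor:
    """Stand-in for a device tensor; counts simulated D2H transfers."""
    d2h_transfers = 0

    def __init__(self, data):
        self.data = data

    def cpu(self):
        # Each call simulates a device sync plus a device-to-host copy.
        FakeTensor.d2h_transfers += 1
        return list(self.data)


class AttentionMetadata:
    """Hypothetical metadata object with a cached host-side copy."""

    def __init__(self, seq_lens):
        self.seq_lens = FakeTensor(seq_lens)
        # Host copy made once when the metadata is built.
        self._seq_lens_cpu = list(seq_lens)

    @property
    def seq_lens_cpu(self):
        # Reuses the cached value: no extra D2H transfer per access.
        return self._seq_lens_cpu


meta = AttentionMetadata([8, 16, 32])

# Pattern reverted by this PR: .cpu() on every decode step.
for _ in range(100):
    _ = meta.seq_lens.cpu()
print(FakeTensor.d2h_transfers)  # → 100

# Pattern restored by this PR: cached property access.
FakeTensor.d2h_transfers = 0
for _ in range(100):
    _ = meta.seq_lens_cpu
print(FakeTensor.d2h_transfers)  # → 0
```

With 100 decode steps, the `.cpu()` pattern pays 100 simulated transfers while the cached property pays none, which matches the direction of the regression described above.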
Head branch was pushed to by a user without write access
Force-pushed from 72f74d2 to 5397ea8
@LucasWilkinson I'm sorry, but my Claude Code accidentally disabled auto-merge when I tried to resolve conflicts. Could you help me merge it?
I'm curious: why not the NVFP4 version on the RTX 6000 Pro?
1. Bad performance (around 60 tps); it may improve after #33417, but I don't think it can beat optimized Marlin GPTQ.
…cceptance rate (vllm-project#33771) Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
Purpose
Fix performance regressions introduced by two recent commits:
- e0327c9db ([Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties, #31773): decode performance regression
- 654a71fc3 ([torch.compile] Improve Cold Start for MoEs, #32805): MTP (Multi-Token Prediction) acceptance rate regression

Affected model: GLM-4.7-GPTQ-INT4-INT8MIX
Test Plan
Performance Impact
The following benchmark data was collected on 4×RTX PRO 6000 BLACKWELL GPUs:
Validation
Test Result
✅ Performance recovered: Throughput restored from 77 tps back to 95 tps
✅ MTP acceptance rate restored: ~30% improvement
✅ Kernel selection corrected: Unquantized MTP now correctly uses fused_triton kernel
Note: Performance data was measured without optimized Triton configuration files. Additional performance gains are expected with proper Triton optimizations.
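As a quick sanity check on the reported figures, the relative throughput recovery can be computed directly from the 77 tps and 95 tps values above (plain arithmetic, no vLLM code involved):

```python
# Relative throughput recovery implied by the reported numbers:
# the revert restores decode throughput from 77 tps to 95 tps.
before_tps = 77
after_tps = 95
recovery = (after_tps - before_tps) / before_tps
print(f"Throughput recovery: {recovery:.1%}")  # → Throughput recovery: 23.4%
```

So the revert recovers roughly 23% of decode throughput on this model, on top of the separately reported ~30% MTP acceptance-rate restoration.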
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.