[Bugfix] Fix compressed-tensors fp8 block assert and FlashInfer scale propagation #34863
Conversation
…propagation: Allow block fp8 weight-only configs where `is_static_input_scheme` can be unset, and propagate loaded scalar q/k/v scales to `*_scale_float`. This ensures FlashInfer attention/cache paths use calibrated scales instead of default 1.0 values. Signed-off-by: Elias Oenal <git@eliasoenal.com>
Code Review
This pull request introduces two important bug fixes for the compressed-tensors and FlashInfer integration. First, it correctly relaxes a strict assertion on is_static_input_scheme to handle None values in weight-only FP8 quantization, preventing incorrect assertion failures. Second, it ensures that loaded scalar Q/K/V scales are propagated to the host float fields used by FlashInfer, fixing a silent mis-scaling bug. Both changes are well-justified and correctly implemented, improving the robustness and correctness of mixed-quantization deployments. The code is clear and effectively resolves the identified issues.
Pull request overview
This PR fixes two correctness issues in the compressed-tensors + FlashInfer integration path, primarily affecting FP8 block quant handling and FP8 KV/Q scaling after checkpoint load.
Changes:
- Relaxes a strict assertion in the FP8 W8A16 block strategy to allow `is_static_input_scheme` to be unset (None) for weight-only configs.
- Propagates loaded scalar `q`/`k`/`v` scales into the host float `*_scale_float` fields used by FlashInfer attention/cache paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py` | Makes block-strategy assertion accept None as "non-static" for weight-only FP8 configs. |
| `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py` | Copies loaded scalar q/k/v scales into host float fields consumed by FlashInfer. |
```diff
 # TODO(rob): refactor block quant into separate class.
 if self.strategy == QuantizationStrategy.BLOCK:
-    assert self.is_static_input_scheme is False
+    assert not self.is_static_input_scheme
```
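The relaxed form matters because `None` fails an identity comparison against `False` yet is still falsy. A minimal illustration:

```python
# Identity vs. truthiness: why `assert x is False` rejects None
# while `assert not x` accepts it.
x = None                   # is_static_input_scheme left unset (weight-only)
print(x is False)          # prints False -> the old assertion would fail
print(not x)               # prints True  -> the relaxed assertion passes
```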
is_static_input_scheme is effectively treated as tri-state in this codepath (bool or None). In CompressedTensorsConfig._get_scheme_from_parts, is_static_input_scheme is computed as input_quant and not input_quant.dynamic, which evaluates to None when input_quant is missing (weight-only). Consider updating the type hints for CompressedTensorsW8A16Fp8.__init__ / the attribute to bool | None (or normalizing to a strict bool before storing) so the intended contract is explicit and static type checking doesn’t flag this.
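One way to make that contract explicit, sketched here with a hypothetical stub rather than the real vLLM class, is to accept `Optional[bool]` and normalize to a strict bool before storing:

```python
from typing import Optional

class W8A16Fp8Stub:
    """Illustrative stub of the CompressedTensorsW8A16Fp8 __init__ contract;
    the real signature in vLLM may differ."""

    def __init__(self, strategy: str,
                 is_static_input_scheme: Optional[bool] = None):
        self.strategy = strategy
        # Normalize the tri-state value (True / False / None) to a strict
        # bool so downstream identity checks and type checkers agree.
        self.is_static_input_scheme = bool(is_static_input_scheme)
```

After normalization, `self.is_static_input_scheme is False` and `not self.is_static_input_scheme` are equivalent for an unset input scheme.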
```python
# Also set float scales used by FlashInfer for attention/cache paths.
if layer.k_scale.numel() == 1:
    layer._k_scale_float = layer.k_scale.item()
if layer.v_scale.numel() == 1:
    layer._v_scale_float = layer.v_scale.item()
if layer.q_scale.numel() == 1:
    layer._q_scale_float = layer.q_scale.item()
```
This change fixes a real runtime correctness issue for FlashInfer, but there’s no automated coverage asserting that loaded q/k/v scales are propagated into the host *_scale_float fields. Please add a unit/integration test that constructs an Attention layer (or minimal stub) with CompressedTensorsKVCacheMethod, sets k_scale/v_scale/q_scale to non-1.0 scalars, runs process_weights_after_loading, and verifies _k_scale_float/_v_scale_float/_q_scale_float match (and are not left at the default 1.0).
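A test along those lines might look like the following sketch; the fake scale object and the standalone `propagate_scale_floats` helper are stand-ins for the real tensors and `process_weights_after_loading`, which this sketch does not import:

```python
import types

class FakeScale:
    """Minimal stand-in for a one-element tensor (only numel/item)."""
    def __init__(self, value):
        self._value = value
    def numel(self):
        return 1
    def item(self):
        return self._value

def propagate_scale_floats(layer):
    # Mirrors the guarded propagation added in this PR.
    if layer.k_scale.numel() == 1:
        layer._k_scale_float = layer.k_scale.item()
    if layer.v_scale.numel() == 1:
        layer._v_scale_float = layer.v_scale.item()
    if layer.q_scale.numel() == 1:
        layer._q_scale_float = layer.q_scale.item()

def test_scale_float_propagation():
    # Simulate an attention layer right after checkpoint load: scalar
    # scales set, host floats still at their 1.0 defaults.
    layer = types.SimpleNamespace(
        k_scale=FakeScale(0.25), v_scale=FakeScale(0.5),
        q_scale=FakeScale(0.125),
        _k_scale_float=1.0, _v_scale_float=1.0, _q_scale_float=1.0,
    )
    propagate_scale_floats(layer)
    assert layer._k_scale_float == 0.25
    assert layer._v_scale_float == 0.5
    assert layer._q_scale_float == 0.125
```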
@mgoin @robertgshaw2-redhat would you like me to change anything before this can be merged?
@tlrmchlsmth @yewentao256 @pavanimajety Friendly ping on this PR. It is a small, minimally invasive fix for two correctness issues in the compressed-tensors + FlashInfer path. |
Mark PR vllm-project#33303 as applied. Add additional MiniMax-specific PRs:
- vllm-project#34863: compressed-tensors FP8 scale propagation
- vllm-project#32232: structural_tag support
- vllm-project#35358: reasoning-end detection fix
Purpose
Fix two issues in the compressed-tensors + FlashInfer integration:

1. In `CompressedTensorsW8A16Fp8`, the block strategy currently asserts `self.is_static_input_scheme is False`. In weight-only configs, `is_static_input_scheme` may be unset (None), which is semantically non-static but fails the identity comparison. The assertion is relaxed to `assert not self.is_static_input_scheme`.
2. In `CompressedTensorsKVCacheMethod.process_weights_after_loading`, loaded `k_scale`/`v_scale`/`q_scale` are assigned to tensor fields (`_k_scale`/`_v_scale`/`_q_scale`), but the corresponding host float fields remain at the default `1.0`. The fix also sets `_k_scale_float`, `_v_scale_float`, and `_q_scale_float` (guarded by `numel() == 1`).

This fixes load/runtime correctness for mixed-quant deployments (AWQ + FP8 attn/KV) on Ampere.
Test Plan
- Load `EliasOenal/MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3` and verify the model loads despite `is_static_input_scheme` being unset.
- Verify `_k_scale_float`/`_v_scale_float`/`_q_scale_float` reflect loaded checkpoint scales (not the default `1.0`).

Test Result
- Model loads without assertion failure on `is_static_input_scheme` in weight-only config.

Release Notes Update