[Bugfix] Fix compressed-tensors fp8 block assert and FlashInfer scale propagation#34863

Open
EliasOenal wants to merge 1 commit into vllm-project:main from EliasOenal:bugfix/compressed-tensors-fp8-flashinfer-scales

Conversation

@EliasOenal EliasOenal commented Feb 19, 2026

Purpose

Fix two issues in compressed-tensors + FlashInfer integration:

  1. Block FP8 assertion is too strict
    • In CompressedTensorsW8A16Fp8, block strategy currently asserts:
      self.is_static_input_scheme is False
  • For weight-only FP8 configs, is_static_input_scheme may be unset (None), which is semantically non-static but fails the identity comparison against False.
    • This PR changes it to:
      assert not self.is_static_input_scheme
  2. KV/Q host float scales are not propagated after load
    • In CompressedTensorsKVCacheMethod.process_weights_after_loading, loaded k_scale/v_scale/q_scale are assigned to tensor fields (_k_scale/_v_scale/_q_scale) but corresponding host float fields remain at default 1.0.
    • FlashInfer attention/cache paths read these host float scales.
    • This PR propagates scalar loaded values to:
      _k_scale_float, _v_scale_float, _q_scale_float (guarded by numel() == 1).
      This fixes load/runtime correctness for mixed-quant deployments (AWQ + FP8 attn/KV) on Ampere.
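
The identity-check pitfall behind fix 1 can be illustrated with a minimal, self-contained sketch (plain Python, not vLLM's actual classes):

```python
# Why `assert x is False` is too strict: `is` tests object identity, so an
# unset value of None fails the check even though it is semantically
# non-static. A truthiness check (`not x`) accepts both None and False.
is_static_input_scheme = None  # weight-only FP8 config leaves this unset

# Old check: identity comparison against False -> fails for None.
old_check_passes = is_static_input_scheme is False
# New check: truthiness -> None and False are both treated as non-static.
new_check_passes = not is_static_input_scheme

assert old_check_passes is False  # the old assert would have fired here
assert new_check_passes is True   # the relaxed assert accepts the unset case
```

Note that `not x` also accepts other falsy values (0, empty string), which is harmless here since the field only ever holds a bool or None.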

Test Plan

  • Reproduce with the following model on Ampere using FlashInfer backend:
    • EliasOenal/MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3
  • Validate:
    1. Model initialization no longer fails in block FP8 path due to is_static_input_scheme being unset.
    2. After weight loading, per-layer _k_scale_float/_v_scale_float/_q_scale_float reflect loaded checkpoint scales (not default 1.0).
    3. End-to-end inference runs correctly in production serving flow.

Test Result

  • ✅ Validated in production deployment on Ampere (4x RTX A6000) with the model above.
  • ✅ Block FP8 load path works with unset is_static_input_scheme in weight-only config.
  • ✅ FlashInfer uses propagated calibrated float scales for attention/cache paths.
  • ✅ Model serves inference successfully with expected behavior.

Release Notes Update

  • Fixed compressed-tensors FP8 block weight-only handling by relaxing a strict is_static_input_scheme identity check so unset/non-static cases are accepted.
  • Fixed FlashInfer KV-cache scale propagation for compressed-tensors by copying loaded scalar q/k/v scales into host *_scale_float fields used in attention/cache paths.
  • Impact: avoids load-time assertion failures and prevents silent mis-scaling from default 1.0 host scales in mixed AWQ + FP8 attention/KV deployments.

… propagation

Allow block fp8 weight-only configs where is_static_input_scheme can be unset, and propagate loaded scalar q/k/v scales to *_scale_float. This ensures FlashInfer attention/cache paths use calibrated scales instead of default 1.0 values.

Signed-off-by: Elias Oenal <git@eliasoenal.com>
Copilot AI review requested due to automatic review settings February 19, 2026 01:51
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the bug Something isn't working label Feb 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important bug fixes for the compressed-tensors and FlashInfer integration. First, it correctly relaxes a strict assertion on is_static_input_scheme to handle None values in weight-only FP8 quantization, preventing incorrect assertion failures. Second, it ensures that loaded scalar Q/K/V scales are propagated to the host float fields used by FlashInfer, fixing a silent mis-scaling bug. Both changes are well-justified and correctly implemented, improving the robustness and correctness of mixed-quantization deployments. The code is clear and effectively resolves the identified issues.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes two correctness issues in the compressed-tensors + FlashInfer integration path, primarily affecting FP8 block quant handling and FP8 KV/Q scaling after checkpoint load.

Changes:

  • Relaxes a strict assertion in the FP8 W8A16 block strategy to allow is_static_input_scheme to be unset (None) for weight-only configs.
  • Propagates loaded scalar q/k/v scales into the host float *_scale_float fields used by FlashInfer attention/cache paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py Makes block-strategy assertion accept None as “non-static” for weight-only FP8 configs.
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py Copies loaded scalar q/k/v scales into host float fields consumed by FlashInfer.


Comment on lines 112 to +114
  # TODO(rob): refactor block quant into separate class.
  if self.strategy == QuantizationStrategy.BLOCK:
-     assert self.is_static_input_scheme is False
+     assert not self.is_static_input_scheme

Copilot AI Feb 19, 2026


is_static_input_scheme is effectively treated as tri-state in this codepath (bool or None). In CompressedTensorsConfig._get_scheme_from_parts, is_static_input_scheme is computed as input_quant and not input_quant.dynamic, which evaluates to None when input_quant is missing (weight-only). Consider updating the type hints for CompressedTensorsW8A16Fp8.__init__ / the attribute to bool | None (or normalizing to a strict bool before storing) so the intended contract is explicit and static type checking doesn’t flag this.
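
A possible normalization along the lines the reviewer suggests might look like the following sketch (hypothetical, not part of this PR; `Quant` is a stand-in for the parsed input-quant config):

```python
def normalize_static_flag(input_quant):
    # `input_quant and not input_quant.dynamic` yields None when
    # input_quant is missing (weight-only config); wrapping it in
    # bool() stores a strict bool, making the contract explicit.
    return bool(input_quant and not input_quant.dynamic)

class Quant:
    """Stand-in for the parsed input-quant config (hypothetical)."""
    def __init__(self, dynamic):
        self.dynamic = dynamic

assert normalize_static_flag(None) is False           # weight-only: unset
assert normalize_static_flag(Quant(dynamic=True)) is False
assert normalize_static_flag(Quant(dynamic=False)) is True
```

With the flag normalized at construction time, the original `is False` assertion would also have been safe, though `assert not ...` remains the more defensive form.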

Comment on lines +1096 to +1102
  # Also set float scales used by FlashInfer for attention/cache paths.
  if layer.k_scale.numel() == 1:
      layer._k_scale_float = layer.k_scale.item()
  if layer.v_scale.numel() == 1:
      layer._v_scale_float = layer.v_scale.item()
  if layer.q_scale.numel() == 1:
      layer._q_scale_float = layer.q_scale.item()

Copilot AI Feb 19, 2026


This change fixes a real runtime correctness issue for FlashInfer, but there’s no automated coverage asserting that loaded q/k/v scales are propagated into the host *_scale_float fields. Please add a unit/integration test that constructs an Attention layer (or minimal stub) with CompressedTensorsKVCacheMethod, sets k_scale/v_scale/q_scale to non-1.0 scalars, runs process_weights_after_loading, and verifies _k_scale_float/_v_scale_float/_q_scale_float match (and are not left at the default 1.0).
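
A minimal sketch of such a test, using stubs rather than vLLM's real Attention layer and CompressedTensorsKVCacheMethod (all names here are hypothetical stand-ins for illustration):

```python
class FakeTensor:
    """Stand-in for a 1-element scale tensor (hypothetical)."""
    def __init__(self, value):
        self._v = value
    def numel(self):
        return 1
    def item(self):
        return self._v

class FakeAttnLayer:
    """Minimal stub of an attention layer with loaded scale tensors."""
    def __init__(self, k, v, q):
        self.k_scale, self.v_scale, self.q_scale = (
            FakeTensor(k), FakeTensor(v), FakeTensor(q))
        # Defaults that FlashInfer would read if propagation were missing.
        self._k_scale_float = self._v_scale_float = self._q_scale_float = 1.0

def process_weights_after_loading(layer):
    # Mirrors the PR's propagation logic for scalar scales.
    if layer.k_scale.numel() == 1:
        layer._k_scale_float = layer.k_scale.item()
    if layer.v_scale.numel() == 1:
        layer._v_scale_float = layer.v_scale.item()
    if layer.q_scale.numel() == 1:
        layer._q_scale_float = layer.q_scale.item()

layer = FakeAttnLayer(k=0.02, v=0.03, q=0.5)
process_weights_after_loading(layer)
assert (layer._k_scale_float, layer._v_scale_float,
        layer._q_scale_float) == (0.02, 0.03, 0.5)  # not left at 1.0
```

The real test would build the layer through vLLM's quantization config so the actual process_weights_after_loading path is exercised.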

@EliasOenal
Author

@mgoin @robertgshaw2-redhat would you like me to change anything before this can be merged?

@EliasOenal
Author

@tlrmchlsmth @yewentao256 @pavanimajety Friendly ping on this PR. It is a small, minimally invasive fix for two correctness issues in the compressed-tensors + FlashInfer path.

scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Mark PR vllm-project#33303 as applied. Add additional MiniMax-specific PRs:
- vllm-project#34863: compressed-tensors FP8 scale propagation
- vllm-project#32232: structural_tag support
- vllm-project#35358: reasoning-end detection fix
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026

Labels

bug Something isn't working

2 participants