[Bugfix] Fix compressed-tensors fp8 block assert and FlashInfer scale propagation#34863

Open
EliasOenal wants to merge 1 commit into vllm-project:main from EliasOenal:bugfix/compressed-tensors-fp8-flashinfer-scales

Conversation

@EliasOenal EliasOenal commented Feb 19, 2026

Purpose

Fix two issues in compressed-tensors + FlashInfer integration:

  1. Block FP8 assertion is too strict
    • In CompressedTensorsW8A16Fp8, block strategy currently asserts:
      self.is_static_input_scheme is False
  • For weight-only FP8 configs, is_static_input_scheme may be unset (None), which is semantically non-static but fails the identity comparison against False.
    • This PR changes it to:
      assert not self.is_static_input_scheme
  2. KV/Q host float scales are not propagated after load
    • In CompressedTensorsKVCacheMethod.process_weights_after_loading, loaded k_scale/v_scale/q_scale are assigned to tensor fields (_k_scale/_v_scale/_q_scale) but corresponding host float fields remain at default 1.0.
    • FlashInfer attention/cache paths read these host float scales.
    • This PR propagates scalar loaded values to:
      _k_scale_float, _v_scale_float, _q_scale_float (guarded by numel() == 1).
      This fixes load/runtime correctness for mixed-quant deployments (AWQ + FP8 attn/KV) on Ampere.
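
The identity-check pitfall behind fix 1 can be illustrated with a minimal, self-contained sketch (plain Python, not vLLM's actual classes):

```python
# Why `assert x is False` is too strict: `is` tests object identity, so an
# unset value of None fails the check even though it is semantically
# non-static. A truthiness check (`not x`) accepts both None and False.
is_static_input_scheme = None  # weight-only FP8 config leaves this unset

# Old check: identity comparison against False -> fails for None.
old_check_passes = is_static_input_scheme is False
# New check: truthiness -> None and False are both treated as non-static.
new_check_passes = not is_static_input_scheme

assert old_check_passes is False  # the old assert would have fired here
assert new_check_passes is True   # the relaxed assert accepts the unset case
```

Note that `not x` also accepts other falsy values (0, empty string), which is harmless here since the field only ever holds a bool or None.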

Test Plan

  • Reproduce with the following model on Ampere using FlashInfer backend:
    • EliasOenal/MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3
  • Validate:
    1. Model initialization no longer fails in block FP8 path due to is_static_input_scheme being unset.
    2. After weight loading, per-layer _k_scale_float/_v_scale_float/_q_scale_float reflect loaded checkpoint scales (not default 1.0).
    3. End-to-end inference runs correctly in production serving flow.

Test Result

  • ✅ Validated in production deployment on Ampere (4x RTX A6000) with the model above.
  • ✅ Block FP8 load path works with unset is_static_input_scheme in weight-only config.
  • ✅ FlashInfer uses propagated calibrated float scales for attention/cache paths.
  • ✅ Model serves inference successfully with expected behavior.

Release Notes Update

  • Fixed compressed-tensors FP8 block weight-only handling by relaxing a strict is_static_input_scheme identity check so unset/non-static cases are accepted.
  • Fixed FlashInfer KV-cache scale propagation for compressed-tensors by copying loaded scalar q/k/v scales into host *_scale_float fields used in attention/cache paths.
  • Impact: avoids load-time assertion failures and prevents silent mis-scaling from default 1.0 host scales in mixed AWQ + FP8 attention/KV deployments.

… propagation

Allow block fp8 weight-only configs where is_static_input_scheme can be unset, and propagate loaded scalar q/k/v scales to *_scale_float. This ensures FlashInfer attention/cache paths use calibrated scales instead of default 1.0 values.

Signed-off-by: Elias Oenal <git@eliasoenal.com>
Copilot AI review requested due to automatic review settings February 19, 2026 01:51
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the bug Something isn't working label Feb 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important bug fixes for the compressed-tensors and FlashInfer integration. First, it correctly relaxes a strict assertion on is_static_input_scheme to handle None values in weight-only FP8 quantization, preventing incorrect assertion failures. Second, it ensures that loaded scalar Q/K/V scales are propagated to the host float fields used by FlashInfer, fixing a silent mis-scaling bug. Both changes are well-justified and correctly implemented, improving the robustness and correctness of mixed-quantization deployments. The code is clear and effectively resolves the identified issues.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes two correctness issues in the compressed-tensors + FlashInfer integration path, primarily affecting FP8 block quant handling and FP8 KV/Q scaling after checkpoint load.

Changes:

  • Relaxes a strict assertion in the FP8 W8A16 block strategy to allow is_static_input_scheme to be unset (None) for weight-only configs.
  • Propagates loaded scalar q/k/v scales into the host float *_scale_float fields used by FlashInfer attention/cache paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py Makes block-strategy assertion accept None as “non-static” for weight-only FP8 configs.
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py Copies loaded scalar q/k/v scales into host float fields consumed by FlashInfer.


Comment on lines 112 to +114
  # TODO(rob): refactor block quant into separate class.
  if self.strategy == QuantizationStrategy.BLOCK:
-     assert self.is_static_input_scheme is False
+     assert not self.is_static_input_scheme

Copilot AI Feb 19, 2026


is_static_input_scheme is effectively treated as tri-state in this codepath (bool or None). In CompressedTensorsConfig._get_scheme_from_parts, is_static_input_scheme is computed as input_quant and not input_quant.dynamic, which evaluates to None when input_quant is missing (weight-only). Consider updating the type hints for CompressedTensorsW8A16Fp8.__init__ / the attribute to bool | None (or normalizing to a strict bool before storing) so the intended contract is explicit and static type checking doesn’t flag this.
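
A possible normalization along the lines the reviewer suggests might look like the following sketch (hypothetical, not part of this PR; `Quant` is a stand-in for the parsed input-quant config):

```python
def normalize_static_flag(input_quant):
    # `input_quant and not input_quant.dynamic` yields None when
    # input_quant is missing (weight-only config); wrapping it in
    # bool() stores a strict bool, making the contract explicit.
    return bool(input_quant and not input_quant.dynamic)

class Quant:
    """Stand-in for the parsed input-quant config (hypothetical)."""
    def __init__(self, dynamic):
        self.dynamic = dynamic

assert normalize_static_flag(None) is False           # weight-only: unset
assert normalize_static_flag(Quant(dynamic=True)) is False
assert normalize_static_flag(Quant(dynamic=False)) is True
```

With the flag normalized at construction time, the original `is False` assertion would also have been safe, though `assert not ...` remains the more defensive form.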

Comment on lines +1096 to +1102
  # Also set float scales used by FlashInfer for attention/cache paths.
  if layer.k_scale.numel() == 1:
      layer._k_scale_float = layer.k_scale.item()
  if layer.v_scale.numel() == 1:
      layer._v_scale_float = layer.v_scale.item()
  if layer.q_scale.numel() == 1:
      layer._q_scale_float = layer.q_scale.item()

Copilot AI Feb 19, 2026


This change fixes a real runtime correctness issue for FlashInfer, but there’s no automated coverage asserting that loaded q/k/v scales are propagated into the host *_scale_float fields. Please add a unit/integration test that constructs an Attention layer (or minimal stub) with CompressedTensorsKVCacheMethod, sets k_scale/v_scale/q_scale to non-1.0 scalars, runs process_weights_after_loading, and verifies _k_scale_float/_v_scale_float/_q_scale_float match (and are not left at the default 1.0).
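
A minimal sketch of such a test, using stubs rather than vLLM's real Attention layer and CompressedTensorsKVCacheMethod (all names here are hypothetical stand-ins for illustration):

```python
class FakeTensor:
    """Stand-in for a 1-element scale tensor (hypothetical)."""
    def __init__(self, value):
        self._v = value
    def numel(self):
        return 1
    def item(self):
        return self._v

class FakeAttnLayer:
    """Minimal stub of an attention layer with loaded scale tensors."""
    def __init__(self, k, v, q):
        self.k_scale, self.v_scale, self.q_scale = (
            FakeTensor(k), FakeTensor(v), FakeTensor(q))
        # Defaults that FlashInfer would read if propagation were missing.
        self._k_scale_float = self._v_scale_float = self._q_scale_float = 1.0

def process_weights_after_loading(layer):
    # Mirrors the PR's propagation logic for scalar scales.
    if layer.k_scale.numel() == 1:
        layer._k_scale_float = layer.k_scale.item()
    if layer.v_scale.numel() == 1:
        layer._v_scale_float = layer.v_scale.item()
    if layer.q_scale.numel() == 1:
        layer._q_scale_float = layer.q_scale.item()

layer = FakeAttnLayer(k=0.02, v=0.03, q=0.5)
process_weights_after_loading(layer)
assert (layer._k_scale_float, layer._v_scale_float,
        layer._q_scale_float) == (0.02, 0.03, 0.5)  # not left at 1.0
```

The real test would build the layer through vLLM's quantization config so the actual process_weights_after_loading path is exercised.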

@EliasOenal
Author

@mgoin @robertgshaw2-redhat would you like me to change anything before this can be merged?

@EliasOenal
Author

@tlrmchlsmth @yewentao256 @pavanimajety Friendly ping on this PR. It is a small, minimally invasive fix for two correctness issues in the compressed-tensors + FlashInfer path.

scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Mark PR vllm-project#33303 as applied. Add additional MiniMax-specific PRs:
- vllm-project#34863: compressed-tensors FP8 scale propagation
- vllm-project#32232: structural_tag support
- vllm-project#35358: reasoning-end detection fix
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026

Labels

bug Something isn't working

2 participants