@@ -1093,6 +1093,14 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
layer._v_scale = layer.v_scale
layer._q_scale = layer.q_scale

# Also set float scales used by FlashInfer for attention/cache paths.
if layer.k_scale.numel() == 1:
layer._k_scale_float = layer.k_scale.item()
if layer.v_scale.numel() == 1:
layer._v_scale_float = layer.v_scale.item()
if layer.q_scale.numel() == 1:
layer._q_scale_float = layer.q_scale.item()
Comment on lines +1096 to +1102

Copilot AI Feb 19, 2026
This change fixes a real runtime correctness issue for FlashInfer, but there’s no automated coverage asserting that loaded q/k/v scales are propagated into the host *_scale_float fields. Please add a unit/integration test that constructs an Attention layer (or minimal stub) with CompressedTensorsKVCacheMethod, sets k_scale/v_scale/q_scale to non-1.0 scalars, runs process_weights_after_loading, and verifies _k_scale_float/_v_scale_float/_q_scale_float match (and are not left at the default 1.0).
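A minimal sketch of the test the reviewer is asking for. To keep it self-contained it uses stand-in classes (`_ScalarStub`, `_LayerStub`, `propagate_float_scales` are all hypothetical names, not vLLM APIs) that mimic only the two `torch.Tensor` methods the diff relies on (`numel()` and `item()`); a real test would construct the actual Attention layer with `CompressedTensorsKVCacheMethod` and call `process_weights_after_loading` directly.

```python
class _ScalarStub:
    """Stand-in for a 0-d torch.Tensor holding a single scale value."""

    def __init__(self, value: float):
        self._value = value

    def numel(self) -> int:
        return 1

    def item(self) -> float:
        return self._value


class _LayerStub:
    """Stand-in for an Attention layer right after weight loading."""

    def __init__(self, k: float, v: float, q: float):
        self.k_scale = _ScalarStub(k)
        self.v_scale = _ScalarStub(v)
        self.q_scale = _ScalarStub(q)
        # Host-side defaults that the pre-fix code left untouched.
        self._k_scale_float = 1.0
        self._v_scale_float = 1.0
        self._q_scale_float = 1.0


def propagate_float_scales(layer) -> None:
    """Mirrors the logic added in this diff: copy scalar tensor scales
    into the host-side *_scale_float fields used by FlashInfer."""
    if layer.k_scale.numel() == 1:
        layer._k_scale_float = layer.k_scale.item()
    if layer.v_scale.numel() == 1:
        layer._v_scale_float = layer.v_scale.item()
    if layer.q_scale.numel() == 1:
        layer._q_scale_float = layer.q_scale.item()


# Non-1.0 scalars so a no-op would be caught by the assertions.
layer = _LayerStub(k=0.25, v=0.5, q=2.0)
propagate_float_scales(layer)
assert layer._k_scale_float == 0.25
assert layer._v_scale_float == 0.5
assert layer._q_scale_float == 2.0
```

The key property under test is that the floats no longer sit at their default of 1.0 after loading, which is exactly the failure mode the comment describes.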

# Discard all placeholders.
del layer.k_scale
del layer.v_scale
@@ -111,7 +111,7 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
size_k_first = True
# TODO(rob): refactor block quant into separate class.
if self.strategy == QuantizationStrategy.BLOCK:
assert self.is_static_input_scheme is False
assert not self.is_static_input_scheme
Comment on lines 112 to +114

Copilot AI Feb 19, 2026

is_static_input_scheme is effectively treated as tri-state in this codepath (bool or None). In CompressedTensorsConfig._get_scheme_from_parts, is_static_input_scheme is computed as input_quant and not input_quant.dynamic, which evaluates to None when input_quant is missing (weight-only). Consider updating the type hints for CompressedTensorsW8A16Fp8.__init__ / the attribute to bool | None (or normalizing to a strict bool before storing) so the intended contract is explicit and static type checking doesn’t flag this.

size_k_first = False
weight, weight_scale = process_fp8_weight_block_strategy(
weight, weight_scale