Cherry-pick: Update fix for regression in Mistral-Large-3-675B (#1304) for v0.19.0 (#1374)
Conversation
🚧 CI Blocked: The main CI workflow was not started.
Rename FP8 blockwise compressed tensors scales to match HPU ops

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Force-pushed from 1e099b3 to 61ae8c6.
Pull request overview
Cherry-pick to v0.19.0 that fixes an HPU FP8 regression affecting Mistral-Large-3-675B (and similar) by aligning blockwise FP8 scale tensor naming/handling with HPU ops and ensuring the relevant post-processing paths behave correctly.
Changes:
- Add a post-load scale "alias/rename" helper and update FP8 dequant/apply paths to prefer `*_scale_inv` where required by HPU ops (Linear + MoE).
- Adjust block FP8 weight handling to route through HPU block FP8 linear application and avoid shape/layout mismatches.
- Add/extend unit tests covering FP8 block-quantized Linear and MoE flows; make MLA kv_b_proj weights contiguous in the MLA HPU path.
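As background for the bullets above, here is a minimal pure-Python sketch (not the vllm_gaudi implementation, which operates on HPU tensors) of how blockwise FP8 dequantization pairs each block of the quantized weight with a single scale entry. This pairing is why the kernel and the checkpoint must agree on the scale tensor's name (e.g. `weight_scale` vs `weight_scale_inv`) and layout; the block size and values below are made up for illustration:

```python
BLOCK = 2  # hypothetical block size; real block-FP8 checkpoints often use 128

def dequant_blockwise(q, scales, block=BLOCK):
    """Dequantize a 2-D quantized weight: every (block x block) tile of q
    shares the single scale entry scales[i // block][j // block]."""
    rows, cols = len(q), len(q[0])
    return [[q[i][j] * scales[i // block][j // block]
             for j in range(cols)] for i in range(rows)]

# 2x4 quantized weight -> a 1x2 grid of per-block scales
q = [[1, 1, 2, 2],
     [1, 1, 2, 2]]
scales = [[0.5, 0.25]]
w = dequant_blockwise(q, scales)
assert w == [[0.5, 0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5, 0.5]]
```

If the op looks up the scales under one name while the load path registered them under another, this multiplication silently runs with the wrong (or missing) scale grid, which is the class of regression this PR addresses.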
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `vllm_gaudi/ops/hpu_compressed_tensors.py` | Introduces scale aliasing to `*_scale_inv`, updates FP8 block/channel execution paths, and adds MoE quant-config plumbing for renamed scale attributes. |
| `vllm_gaudi/attention/oot_mla.py` | Ensures dequantized/transposed kv_b_proj weights are contiguous to avoid runtime overhead/issues. |
| `tests/unit_tests/ops/test_hpu_compressed_tensors.py` | Adds new unit tests for FP8 block-quantized Linear and MoE, focused on the new scale naming + post-processing behavior. |
```python
scale = scale.data if isinstance(scale, torch.nn.Parameter) else scale
layer.register_parameter(hpu_scale_name, torch.nn.Parameter(scale, requires_grad=False))
```
`_hpu_weight_scale_alias` uses `.data` to extract the underlying tensor before re-wrapping it in a new `torch.nn.Parameter`. Using `.data` is discouraged and can bypass autograd safety checks; since these scales are already non-trainable parameters, it's safer to either move/rename the existing `Parameter` (preserving its subclass/metadata) or use `detach()` when creating the new `Parameter`.
Suggested change:

```diff
-scale = scale.data if isinstance(scale, torch.nn.Parameter) else scale
-layer.register_parameter(hpu_scale_name, torch.nn.Parameter(scale, requires_grad=False))
+if isinstance(scale, torch.nn.Parameter):
+    layer.register_parameter(hpu_scale_name, scale)
+else:
+    aliased_scale = scale.detach() if isinstance(scale, torch.Tensor) else scale
+    layer.register_parameter(
+        hpu_scale_name,
+        torch.nn.Parameter(aliased_scale, requires_grad=False),
+    )
```
```python
weight_fp32 = torch.randn(output_size, input_size, dtype=torch.bfloat16, device="hpu")
weight_fp8 = weight_fp32.to(torch.float8_e4m3fn)
```
In this test, weight_fp32 is created with dtype=torch.bfloat16 (and then cast to FP8). Renaming the variable (or adjusting the dtype) would avoid confusion about what precision the tensor actually represents.
Suggested change:

```diff
-weight_fp32 = torch.randn(output_size, input_size, dtype=torch.bfloat16, device="hpu")
-weight_fp8 = weight_fp32.to(torch.float8_e4m3fn)
+weight_bf16 = torch.randn(output_size, input_size, dtype=torch.bfloat16, device="hpu")
+weight_fp8 = weight_bf16.to(torch.float8_e4m3fn)
```
```python
# Execute layer with synthetic input
x = torch.randn(1, 4, input_size, dtype=torch.bfloat16, device="hpu")
out = oot_op.scheme.apply_weights(oot_op, x)
assert out.shape == (1, 4, output_size)
assert out.dtype == torch.bfloat16
```
This new block-quantized Linear test only asserts shape and dtype. Given the PR changes affect dequantization / scale naming, it would be stronger to also validate numerical correctness (e.g., compare against a reference computed from a dequantized BF16 weight and a BF16 linear/matmul).
Suggested change:

```diff
-# Execute layer with synthetic input
-x = torch.randn(1, 4, input_size, dtype=torch.bfloat16, device="hpu")
-out = oot_op.scheme.apply_weights(oot_op, x)
-assert out.shape == (1, 4, output_size)
-assert out.dtype == torch.bfloat16
+# Execute layer with deterministic input and validate numerical correctness
+# against a BF16 reference computed from the dequantized FP8 weight.
+x = torch.ones(1, 4, input_size, dtype=torch.bfloat16, device="hpu")
+out = oot_op.scheme.apply_weights(oot_op, x)
+assert out.shape == (1, 4, output_size)
+assert out.dtype == torch.bfloat16
+ref_weight = weight_fp8.to(torch.bfloat16)
+ref_out = torch.matmul(x, ref_weight.transpose(0, 1))
+assert torch.allclose(out, ref_out, atol=1e-2, rtol=1e-2)
```
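The reference-comparison pattern in the suggestion can be illustrated without torch: dequantize the stored weights, run a plain matmul as the reference, and compare the fused quantized path against it. This toy sketch uses made-up integer "quantized" weights and a single scalar scale purely for illustration:

```python
def dequant(q, scale):
    # Reference dequantization: elementwise scale of the quantized weight.
    return [[v * scale for v in row] for row in q]

def matmul(x, w_t):
    # x: (m, k), w_t: (n, k) stored row-major like a Linear weight -> (m, n)
    return [[sum(a * b for a, b in zip(row, col)) for col in w_t] for row in x]

q = [[2, 4], [6, 8]]   # "quantized" weights, shape (n=2, k=2)
scale = 0.5
w = dequant(q, scale)  # reference dequantized weights [[1.0, 2.0], [3.0, 4.0]]
x = [[1.0, 1.0]]       # deterministic input, shape (1, k)
ref = matmul(x, w)     # reference output
assert ref == [[3.0, 7.0]]
# A real test would now compare the fused quantized kernel's output to
# `ref` within a tolerance, e.g. torch.allclose(out, ref, atol=1e-2).
```

A deterministic input (ones, or a fixed seed) keeps the tolerance meaningful; with `torch.randn` and no seed, a marginal tolerance can flake.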
```python
out = oot_op.runner.forward_impl(oot_op, hidden_states, router_logits, hidden_states)

assert out.shape == hidden_states.shape
assert out.dtype == torch.bfloat16
```
This new block-quantized MoE test currently verifies only output shape and dtype. To better cover the regression being fixed, consider adding a correctness assertion (or at least a stronger sanity check like finite outputs) against a reference implementation for the same weights/scales.
Suggested change:

```diff
 assert out.dtype == torch.bfloat16
+assert torch.isfinite(
+    out).all(), "block-quantized MoE output should not contain NaN or Inf values"
```
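The finiteness sanity check suggested above can be sketched in plain Python; the real test would simply use `torch.isfinite(out).all()`, and the helper name here is hypothetical:

```python
import math

def all_finite(t):
    # Recursively check a nested list of floats for NaN/Inf, mirroring
    # what torch.isfinite(out).all() asserts on a tensor.
    if isinstance(t, list):
        return all(all_finite(v) for v in t)
    return math.isfinite(t)

assert all_finite([[1.0, 2.0], [3.0, 4.0]])
assert not all_finite([[1.0, float("nan")]])
assert not all_finite([[float("inf")]])
```

This is a weaker check than comparing against a reference implementation, but it catches the most common symptom of a mismatched or missing scale tensor: NaN/Inf propagating through the dequantized matmul.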
✅ CI Passed: All checks passed successfully.

Merged commit 2e3ef72 into vllm-project:releases/v0.19.0.
Renames FP8 blockwise compressed tensors scales to match HPU ops. Fixes a regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 introduced by #1220 and #1053.