Cherry-pick: Fix regression in Mistral-Large-3-675B (#1304) for v0.19.0 #1345
Closed
skavulya wants to merge 8 commits into vllm-project:releases/v0.19.0 from
Conversation
Rename FP8 blockwise compressed tensors scales to match HPU ops Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
c0070da to f9758c7
Contributor
Pull request overview
Cherry-pick for v0.19.0 to fix an FP8 loading/runtime regression for blockwise-compressed tensors on HPU (notably affecting Mistral-Large-3-675B), primarily by aligning scale tensor attribute naming with what HPU FP8 ops expect.
Changes:
- Add scale-attribute aliasing (`*_weight_scale` → `*_weight_scale_inv`) for FP8 block-quantized Linear and MoE paths to match HPU op conventions.
- Route FP8 block Linear execution through an HPU block FP8 helper and normalize FP8 linear input/output reshaping.
- Add unit tests covering FP8 block-quantized CompressedTensors Linear and MoE weight post-processing and execution.
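The aliasing in the first bullet can be sketched as follows. This is a hedged illustration only: `FakeLinear` and `alias_block_scales` are hypothetical stand-ins, not the actual vllm-gaudi helpers, but they show the idea of exposing the same scale tensor under the `*_weight_scale_inv` name that the HPU FP8 ops look up:

```python
class FakeLinear:
    """Stand-in for an FP8 block-quantized Linear layer (illustrative only)."""
    def __init__(self, scale):
        # Attribute name as produced by compressed-tensors checkpoint loading.
        self.weight_scale = scale

def alias_block_scales(layer):
    """Add a '*_inv' alias for every attribute ending in 'weight_scale'."""
    for name, value in list(vars(layer).items()):
        if name.endswith("weight_scale"):
            # Same underlying object, second name: HPU ops resolve
            # '*_weight_scale_inv' while the loader keeps '*_weight_scale'.
            setattr(layer, name + "_inv", value)
    return layer

layer = alias_block_scales(FakeLinear([0.5, 0.25]))
assert layer.weight_scale_inv is layer.weight_scale
```

Because the alias points at the same object rather than a copy, any later in-place update to the scale stays visible under both names.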
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vllm_gaudi/ops/hpu_compressed_tensors.py` | Adds scale aliasing for block quant, updates dequant/apply paths, and adds a MoE quant-config helper. |
| `vllm_gaudi/attention/oot_mla.py` | Ensures dequantized MLA kv projection weights are contiguous to avoid runtime overhead/issues. |
| `tests/unit_tests/ops/test_hpu_compressed_tensors.py` | Adds new unit tests for FP8 block CompressedTensors Linear and MoE flows. |
✅ CI Passed. All checks passed successfully against the following vllm commit:
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Collaborator
We want to release tomorrow. This model is not of the highest priority, and this PR hasn't gone through full QA validation yet.
Cherry-pick of #1304 for v0.19.0
Renames FP8 blockwise compressed tensors scales to match HPU ops. Fixes a regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 introduced by #1220 and #1053.
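For context, blockwise FP8 schemes store one scale per tile of the weight matrix, and dequantization expands each per-block scale over its tile. The sketch below illustrates that layout with NumPy; the block size, the `scale_inv` name, and the multiply convention are assumptions for illustration, not vllm-gaudi's exact implementation:

```python
import numpy as np

BLOCK = 2  # illustrative block size; real schemes often use 128x128 tiles

# Stand-in for FP8-quantized weight values.
q = np.ones((4, 4), dtype=np.float32)

# One scale per BLOCK x BLOCK tile (2x2 grid of blocks here).
scale_inv = np.array([[0.5, 2.0],
                      [1.0, 4.0]], dtype=np.float32)

# Expand each per-block scale over its tile, then apply it elementwise.
expanded = np.kron(scale_inv, np.ones((BLOCK, BLOCK), dtype=np.float32))
w = q * expanded

assert w[0, 0] == 0.5   # top-left tile scaled by 0.5
assert w[3, 3] == 4.0   # bottom-right tile scaled by 4.0
```

Because the op only ever sees the scale tensor through its attribute name, a mismatch like `weight_scale` vs. `weight_scale_inv` breaks loading even when the values themselves are correct, which is why the rename alone fixes the regression.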