
Cherry-pick: Fix regression in Mistral-Large-3-675B (#1304) for v0.19.0 #1345

Closed

skavulya wants to merge 8 commits into vllm-project:releases/v0.19.0 from skavulya:skavulya/mistral3-rename-scales-0.19.0

Conversation

@skavulya
Contributor

Cherry-pick of #1304 for v0.19.0.
Renames FP8 blockwise compressed-tensors scales to match HPU ops. Fixes a regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 caused by #1220 and #1053.

Rename FP8 blockwise compressed tensors scales to match HPU ops

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Copilot AI review requested due to automatic review settings April 13, 2026 17:23
@skavulya skavulya force-pushed the skavulya/mistral3-rename-scales-0.19.0 branch from c0070da to f9758c7 Compare April 13, 2026 17:26
Contributor

Copilot AI left a comment


Pull request overview

Cherry-pick for v0.19.0 to fix an FP8 loading/runtime regression for blockwise-compressed tensors on HPU (notably affecting Mistral-Large-3-675B), primarily by aligning scale tensor attribute naming with what HPU FP8 ops expect.

Changes:

  • Add scale-attribute aliasing (*_weight_scale → *_weight_scale_inv) for FP8 block-quantized Linear and MoE paths to match HPU op conventions.
  • Route FP8 block Linear execution through an HPU block FP8 helper and normalize FP8 linear input/output reshaping.
  • Add unit tests covering FP8 block-quantized CompressedTensors Linear and MoE weight post-processing and execution.
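The aliasing mechanism described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual vllm_gaudi implementation; `FakeLinear` and `alias_block_scales` are hypothetical names, and a real module would hold torch Parameters rather than plain lists:

```python
class FakeLinear:
    """Stand-in for a layer holding FP8 blockwise-quantized weights,
    whose scales were loaded under the checkpoint name `weight_scale`."""

    def __init__(self):
        # Per-block reciprocal scales (values are illustrative).
        self.weight_scale = [0.5, 0.25]


def alias_block_scales(mod):
    """Expose each `*weight_scale` attribute under the additional name
    `*weight_scale_inv`, so code using the HPU naming convention finds
    the same underlying object without copying any data."""
    for name in list(vars(mod)):
        if name.endswith("weight_scale") and not hasattr(mod, name + "_inv"):
            setattr(mod, name + "_inv", getattr(mod, name))


mod = FakeLinear()
alias_block_scales(mod)
# Both names now refer to the very same object.
assert mod.weight_scale_inv is mod.weight_scale
```

Because the alias points at the same object rather than a copy, any in-place update to the scales remains visible under both names.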

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description

  • vllm_gaudi/ops/hpu_compressed_tensors.py — Adds scale aliasing for block quant, updates dequant/apply paths, and adds a MoE quant-config helper.
  • vllm_gaudi/attention/oot_mla.py — Ensures dequantized MLA kv projection weights are contiguous to avoid runtime overhead/issues.
  • tests/unit_tests/ops/test_hpu_compressed_tensors.py — Adds new unit tests for FP8 block CompressedTensors Linear and MoE flows.
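The contiguity concern behind the oot_mla.py change can be illustrated with NumPy (the actual code presumably operates on torch tensors, e.g. via `.contiguous()`; this sketch only shows the general issue, where a transpose yields a strided view rather than contiguous memory):

```python
import numpy as np

# Dequantized projection weights often come out of a transpose,
# which in NumPy (as in torch) produces a non-contiguous view.
w = np.arange(12, dtype=np.float32).reshape(3, 4)
w_t = w.T
assert not w_t.flags["C_CONTIGUOUS"]

# Copying into contiguous memory up front avoids repeated implicit
# copies (or outright failures) in kernels that require dense layouts.
w_t = np.ascontiguousarray(w_t)
assert w_t.flags["C_CONTIGUOUS"]
```

Doing this once at weight post-processing time moves the copy out of the hot path.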

Comment thread vllm_gaudi/ops/hpu_compressed_tensors.py
Comment thread vllm_gaudi/ops/hpu_compressed_tensors.py Outdated
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
2a69949bdadf0e8942b7a1619b229cb475beef20

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
2a69949bdadf0e8942b7a1619b229cb475beef20

@mgawarkiewicz-intel
Collaborator

We want to release tomorrow. This model is not of the highest priority, and this PR has not gone through full QA validation yet.
It is too late to merge this PR.
