[HPU] Fix FP8 block-to-channel conversion breaking MLA weight loading #1220
Conversation
🚧 CI Blocked: The main CI workflow was not started.
When `VLLM_HPU_FORCE_CHANNEL_FP8=True` (default), block-quantized FP8 weights are converted to channel-wise FP8 by `hpu_fp8.py`. This makes `weight_scale_inv` 1D `[N_out]`, but `weight_block_size` is not cleared. The upstream `MLAAttention.process_weights_after_loading` then calls `scaled_dequantize` with `group_shape=weight_block_size`, causing:

```
ValueError: 1D scale with shape [4096] cannot be broadcast to x with shape [4096, 512], group_shape=(128, 128)
```

Fix: `HPUMLAAttention` now handles FP8 `kv_b_proj` dequantization directly for both channel-wise (1D) and block (2D) scale formats.

Tested with DeepSeek-R1 FP8 on g2.l (8x Gaudi2, TP=8) using calibration `step-2-measure-scales.py` with both mlperf and NeelNanda/pile-10k datasets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
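A toy sketch of the mismatch, with shapes shrunk from the real `[4096, 512]` weight and `[128, 128]` blocks (numpy stand-ins, not the actual vLLM code):

```python
import numpy as np

w = np.ones((8, 4), dtype=np.float32)   # stand-in for the [4096, 512] FP8 weight
group_shape = (2, 2)                    # stand-in for the stale (128, 128)

# Block dequant expects one scale per (2, 2) tile, i.e. a 2D grid of
# shape (8/2, 4/2) = (4, 2).
expected = (w.shape[0] // group_shape[0], w.shape[1] // group_shape[1])

# After block->channel conversion the scale is per-channel: 1D [8].
scale_1d = np.full(w.shape[0], 0.5, dtype=np.float32)

# The 1D per-channel scale cannot fill the tile grid, hence the ValueError.
assert scale_1d.shape != expected

# The correct per-channel dequant instead broadcasts over the input dim:
deq = w * scale_1d[:, None]
assert deq.shape == w.shape
```

The fix in this PR takes exactly that second path for 1D scales instead of letting the stale `weight_block_size` route the weight through block dequantization.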
Pull request overview
This PR addresses two HPU startup/runtime blockers for MLA + FP8 block-quantized models (e.g., DeepSeek-R1): a kv_b_proj dequantization failure when block-FP8 is converted to channel-wise FP8, and an eager torchaudio import crash during model registry initialization.
Changes:
- Override MLA weight post-processing on HPU to directly dequantize `kv_b_proj` correctly for both channel-wise and block FP8 scales.
- Add a safe model-registration helper to avoid vLLM startup crashes when optional model dependencies (e.g., `torchaudio`) are unavailable.
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `vllm_gaudi/models/__init__.py` | Adds `_safe_register()` to skip (with warning) model registrations that fail due to missing optional deps. |
| `vllm_gaudi/attention/oot_mla.py` | Adds HPU-specific `kv_b_proj` dequantization logic to avoid scale shape mismatch after block→channel FP8 conversion. |
✅ CI Passed: All checks passed successfully.
Finding 1 🟡 Medium · The root cause is in `fp8_block_linear_postprocess_weights`, not in MLA

The real bug is that the block-to-channel conversion leaves `weight_block_size` set to its stale `[128, 128]` value after `weight_scale_inv` has already become per-channel. Fixing this in `oot_mla.py` alone only covers the MLA path; any other consumer of `weight_block_size` would hit the same shape mismatch.

Suggestion: Consider also adding a one-liner to `fp8_block_linear_postprocess_weights`:

```python
layer.weight_block_size = None  # scale is now per-channel, not per-block
```

This could be done in a follow-up PR or alongside this one. At minimum, add a comment noting that the block metadata is stale after conversion.
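The effect of the suggested one-liner can be sketched with a minimal stand-in (the `Layer` class, `convert_block_to_channel`, and `pick_dequant_path` below are hypothetical simplifications, not the real `ops.py` code):

```python
# Why clearing weight_block_size matters: downstream code branches on it
# to pick a dequantization path.
class Layer:
    def __init__(self):
        self.weight_block_size = [128, 128]  # block-quantized checkpoint
        self.scale_ndim = 2                  # 2D block scales

def convert_block_to_channel(layer):
    # ... the real code rescales weights to channel-wise FP8 here ...
    layer.scale_ndim = 1                     # scale is now 1D [N_out]
    layer.weight_block_size = None           # the fix: drop the stale hint

def pick_dequant_path(layer):
    # Mimics downstream logic that trusts weight_block_size as group_shape.
    return "block" if layer.weight_block_size is not None else "channel"

layer = Layer()
convert_block_to_channel(layer)
assert pick_dequant_path(layer) == "channel"
```

Without the `weight_block_size = None` line, `pick_dequant_path` would still return `"block"` while the scale is 1D, which is precisely the mismatch behind the `ValueError`.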
Finding 2 🟡 Medium · Block FP8 path uses hardcoded `bfloat16` dtype

In the channel-wise FP8 branch, the dequantization correctly uses the activation dtype:

```python
ws = weight_scale_inv.view(-1, 1).to(act_dtype)
kv_b_proj_weight = (weight.to(act_dtype) * ws).T
```

But in the block FP8 branch, the output dtype is left to the helper's default:

```python
kv_b_proj_weight = dequant_block_fp8_weight_naive(
    weight,
    weight_scale_inv,
    kv_b_proj.weight_block_size,
    original_M=orig_M,
    original_N=orig_N,
    do_unpad=(orig_M is not None),
).T
```

If `dequant_block_fp8_weight_naive` defaults to `bfloat16`, the dequantized weight will not match `act_dtype` when the model runs in a different precision.

Suggestion: Pass `dtype=act_dtype` explicitly:

```python
kv_b_proj_weight = dequant_block_fp8_weight_naive(
    weight,
    weight_scale_inv,
    kv_b_proj.weight_block_size,
    dtype=act_dtype,
    original_M=orig_M,
    original_N=orig_N,
    do_unpad=(orig_M is not None),
).T
```
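The dtype concern can be illustrated with a small stand-in (numpy instead of torch, `np.float16` standing in for the helper's assumed `bfloat16` default; `dequant_naive` is hypothetical):

```python
import numpy as np

def dequant_naive(w, scale, dtype=np.float16):
    """Toy dequant helper with a hardcoded default output dtype."""
    return (w.astype(np.float32) * scale).astype(dtype)

act_dtype = np.float32  # the dtype the rest of the model actually runs in

# Default path: output silently comes back in float16, not act_dtype.
out_default = dequant_naive(np.ones((2, 2), dtype=np.int8), 0.5)
assert out_default.dtype == np.float16

# Explicit dtype: output matches the activations, as the review suggests.
out_explicit = dequant_naive(np.ones((2, 2), dtype=np.int8), 0.5, dtype=act_dtype)
assert out_explicit.dtype == act_dtype
```

The same reasoning applies to the real helper: passing `dtype=act_dtype` keeps the dequantized `kv_b_proj` weight consistent with the channel-wise branch.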
- Pass `dtype=act_dtype` to `dequant_block_fp8_weight_naive` in the block FP8 path to avoid the hardcoded `bfloat16` assumption.
- Clear `layer.weight_block_size = None` in `fp8_block_linear_postprocess_weights` after block-to-channel conversion, fixing the root cause for all layers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Addressed review findings from @adobrzyn and Copilot reviewer in commit cc70fef.

Also updated PR title and description to reflect both changes.
✅ CI Passed: All checks passed successfully.
…vllm-project#1220)

## Summary

Fix FP8 block-to-channel conversion that leaves stale `weight_block_size`, crashing MLA weight loading for DeepSeek-R1 and similar models on HPU.

## Problem

When `VLLM_HPU_FORCE_CHANNEL_FP8=True` (default), `fp8_block_linear_postprocess_weights` in `ops.py` converts block-quantized FP8 weights to channel-wise FP8. After conversion, `weight_scale_inv` becomes 1D `[N_out]` (per-channel), but **`weight_block_size` is not cleared** (remains `[128, 128]`). Any downstream code that uses `weight_block_size` as a `group_shape` for dequantization (e.g. `MLAAttention.process_weights_after_loading` → `scaled_dequantize`) fails:

```
ValueError: 1D scale with shape torch.Size([4096]) cannot be broadcast to x with shape torch.Size([4096, 512]), group_shape=(128, 128)
```

## Fix

Two changes:

1. **Root cause** (`vllm_gaudi/extension/ops.py`): Clear `layer.weight_block_size = None` after block→channel conversion in `fp8_block_linear_postprocess_weights`. This fixes the issue for all layers, not just MLA.
2. **MLA-specific handling** (`vllm_gaudi/attention/oot_mla.py`): `HPUMLAAttention.process_weights_after_loading` now handles FP8 `kv_b_proj` dequantization directly for both:
   - **Channel-wise (1D scale)**: broadcast multiply with `act_dtype`
   - **Block (2D scale)**: `dequant_block_fp8_weight_naive` with explicit `dtype=act_dtype`

Non-FP8 models are unaffected (falls through to upstream logic).

## Testing

Tested on g2.l pod (8× Gaudi2 HL-225, 98304 MiB each) with:

- **vllm** 0.18.1rc1 (commit `aec2dc6c0`)
- **vllm-gaudi** main branch
- **Model:** DeepSeek-R1 (FP8 block quant, `weight_block_size=[128,128]`)
- **TP=8**, `gpu_memory_utilization=0.9`

### Results

- ✅ `step-2-measure-scales.py` with mlperf dataset: completed successfully
- ✅ `step-2-measure-scales.py` with NeelNanda/pile-10k (512 samples): completed without OOM
- ✅ Model loads and runs inference correctly

## Affected models

Any model using MLA attention + FP8 block quantization on HPU:

- DeepSeek-R1
- DeepSeek-V3 (FP8 variants)
- Other DeepSeek V2/V3 architecture models with FP8 checkpoints

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
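The two dequant branches described under Fix can be sketched roughly as follows (numpy stand-ins with toy shapes; `dequant_kv_b_proj` is illustrative, not the actual HPU code):

```python
import numpy as np

def dequant_kv_b_proj(weight, weight_scale_inv, block_size, act_dtype=np.float32):
    """Dequantize kv_b_proj for both scale layouts (illustrative only)."""
    if weight_scale_inv.ndim == 1:
        # Channel-wise: 1D scale [N_out], broadcast over the input dim,
        # mirroring weight_scale_inv.view(-1, 1).to(act_dtype) in the PR.
        ws = weight_scale_inv.reshape(-1, 1).astype(act_dtype)
        return (weight.astype(act_dtype) * ws).T
    # Block: 2D scale [N_out/b0, N_in/b1]; expand each tile's scale to
    # full size (the real helper is dequant_block_fp8_weight_naive).
    b0, b1 = block_size
    ws = np.kron(weight_scale_inv, np.ones((b0, b1))).astype(act_dtype)
    return (weight.astype(act_dtype) * ws).T

w = np.ones((4, 6), dtype=np.float32)              # toy [N_out, N_in] weight
out_channel = dequant_kv_b_proj(w, np.full(4, 2.0), None)        # 1D scale
out_block = dequant_kv_b_proj(w, np.full((2, 3), 2.0), (2, 2))   # 2D scale
assert out_channel.shape == out_block.shape == (6, 4)
```

Both branches return the transposed weight in the activation dtype, so downstream consumers see the same layout regardless of which scale format the checkpoint used.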
…or v0.19.0 (#1374)

Renames FP8 blockwise compressed tensors scales to match HPU ops. Fixes regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 due to #1220 and #1053.

---------

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>