
Fix Llama4 shape mismatch for 32k+ context window #842

Merged
mgawarkiewicz-intel merged 3 commits into releases/v0.13.0 from dev/afierka/fix-llama4-shape-mismatch on Jan 21, 2026
Conversation

@afierka-intel (Collaborator) commented on Jan 20, 2026

For `max_model_len > 32k`, Llama4 enables temperature adjustment: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719. With the adjustment enabled, the tensor `q` changes shape from 2D to 3D: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307. This tensor is passed to `UnquantizedFusedMoEMethod.forward` (https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163), where it causes invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor.

The bug was introduced by the following PRs: #680 and #684.
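To make the failure mode concrete, here is a hypothetical minimal repro of this class of shape error in plain PyTorch. The shapes and the `torch.relu` stand-in for the expert computation are illustrative only, not the actual `hpu_fused_moe` code:

```python
import torch

hidden = 128
q = torch.randn(2, 16, hidden)        # q becomes 3D (batch, seq, hidden) once temperature adjustment is enabled
out = torch.relu(q.view(-1, hidden))  # the expert math runs on the flattened 2D (32, 128) tensor
out.view(q.size(0), hidden)           # bug: q.size(0) is 2, not the token count 32, so the view fails
# RuntimeError: shape '[2, 128]' is invalid for input of size 4096
```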

Copilot AI review requested due to automatic review settings January 20, 2026 12:06
Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes a tensor shape mismatch in Llama4 when using context windows larger than 32k tokens. The problem occurs because the temperature adjustment enabled for large contexts changes the query tensor from 2D to 3D, while the output reshaping logic assumed a 2D tensor.

Changes:

  • Updated the output tensor reshaping logic in forward_oot to handle both 2D and 3D input shapes correctly (see the sketch below)
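
A minimal sketch of a shape-agnostic reshape, assuming the expert computation runs on a flattened 2D tensor; the `moe_forward` name and the `gelu` placeholder are hypothetical stand-ins, not the actual `forward_oot` implementation:

```python
import torch

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """Accept 2D (tokens, hidden) or 3D (batch, seq, hidden) input,
    run the expert math on a flattened 2D view, and restore the
    caller's original shape on the way out."""
    orig_shape = x.shape                   # capture the incoming shape, 2D or 3D
    x2d = x.reshape(-1, orig_shape[-1])    # collapse all leading dims into one token axis
    out2d = torch.nn.functional.gelu(x2d)  # placeholder for the fused-MoE computation
    return out2d.view(orig_shape)          # element counts match for both 2D and 3D callers
```

Deriving the final `view` from `x.shape` rather than from an assumed rank keeps the same code path correct whether or not temperature adjustment reshapes `q`.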


@afierka-intel afierka-intel force-pushed the dev/afierka/fix-llama4-shape-mismatch branch from 202275d to 2ed6db6 Compare January 20, 2026 12:29
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
72506c98349d6bcd32b4e33eec7b5513453c1502


mgawarkiewicz-intel merged commit 906abe3 into releases/v0.13.0 on Jan 21, 2026
72 of 74 checks passed
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Jan 21, 2026
adobrzyn pushed a commit that referenced this pull request Jan 26, 2026
rsmyrek pushed a commit to rsmyrek/vllm-gaudi that referenced this pull request Jan 26, 2026
hlahkar pushed a commit to hlahkar/vllm-gaudi that referenced this pull request Jan 27, 2026
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026