DP: dispatch tensor in FusedMoEMethod #680
Conversation
Pull request overview
This PR refactors the data parallel (DP) dispatching logic for mixture of experts (MoE) models by moving the tensor dispatch operation from the HPU communicator into the FusedMoEMethod implementations. The key change is that instead of dispatching `router_logits`, the code now dispatches the already-computed `topk_ids` and `topk_weights` tensors directly in the FusedMoE forward pass.
Key changes:
- Modified `HPUDPMetadata` to store `topk_ids_across_dp` and `topk_weights_across_dp` instead of `router_logits_across_dp`
- Added a new `dispatch_tensor` utility function to handle all-gather operations across DP ranks (a sketch follows this list)
- Updated FusedMoE implementations to perform tensor dispatching inline rather than in the communicator
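For reference, a minimal sketch of what an all-gather-based `dispatch_tensor` could look like; this is an illustration only, and the exact signature and group handling in `hpu_dp_utils.py` may differ (`dp_group` is an assumed process-group handle):

```python
import torch
import torch.distributed as dist

def dispatch_tensor(x: torch.Tensor, dp_group: dist.ProcessGroup) -> torch.Tensor:
    """All-gather a per-rank tensor across data-parallel ranks along dim 0."""
    dp_size = dist.get_world_size(group=dp_group)
    if dp_size == 1:
        return x  # single DP rank: nothing to gather
    # Allocate the gathered buffer: dim 0 grows by the DP world size.
    gathered = torch.empty((dp_size * x.shape[0], *x.shape[1:]),
                           dtype=x.dtype, device=x.device)
    dist.all_gather_into_tensor(gathered, x, group=dp_group)
    return gathered
```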
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_dp_utils.py | Refactored metadata to store topk tensors instead of router logits; added dispatch_tensor utility function |
| vllm_gaudi/ops/hpu_fused_moe.py | Integrated dispatch logic for topk_ids and topk_weights directly in forward pass |
| vllm_gaudi/ops/hpu_fp8.py | Integrated dispatch logic for topk_ids and topk_weights in FP8 variant |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Removed dispatch implementation as it's now handled in FusedMoE methods |
Force-pushed from ad8b9cc to 1aeb69b
✅ CI Passed: All checks passed successfully against the following vllm commit:
xuechendi left a comment
Looks good to me; the PR makes the dispatch logic clearer as well.
Since the last CI run is a bit old, @xinyu-intel, please rebase and address the comments.
BTW, please also add a description explaining the PR and its expected benefit, for future reference.
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Force-pushed from f4cdff3 to a931b50
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
@xinyu-intel, sorry, there is a conflict. Hmm, let me resolve it and rerun CI.
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
depends on #680

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
For `max_model_len > 32k`, Llama4 enables temperature adjustment (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719). With the adjustment enabled, the tensor `q` changes shape from 2D to 3D (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307). This tensor is passed to `UnquantizedFusedMoEMethod.forward` (https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163), causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor. The bug was introduced by #680 and #684.

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
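A hypothetical minimal repro of that failure mode (shapes are illustrative only; the real code paths are in the links above):

```python
import torch

x = torch.randn(2, 4, 16)             # 3D `q` produced by Llama4 temperature adjustment
num_tokens = x.shape[0]               # buggy assumption: dim 0 is the token count (only true for 2D)
output = torch.randn(num_tokens, 16)  # 2D output sized under that assumption
output.view(x.shape)                  # RuntimeError: shape '[2, 4, 16]' is invalid for input of size 32
```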
This PR mainly moves the dispatch logic from vllm to vllm-gaudi so that we can do more ninja optimizations. E.g.:
- we can dispatch the topk weights and ids instead of `router_logits`, because topk performance is not good when the sequence length is long (see the sketch below)
- we can dispatch the fp8 `hidden_states` after quantization for a smaller message size; this will be addressed in #684

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
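A rough sketch of the first optimization, assuming a `dispatch_tensor` helper like the one sketched earlier (`router_logits`, `top_k`, and `dp_group` are placeholder names here, not the exact code):

```python
import torch

# Placeholder local inputs; in the real forward pass these come from the router.
router_logits = torch.randn(8, 64)  # (local_num_tokens, num_experts)
top_k = 2

# Run topk on the local batch only, then all-gather the much smaller results,
# instead of gathering router_logits and running topk over the full DP batch.
topk_weights, topk_ids = torch.topk(router_logits.softmax(dim=-1), k=top_k, dim=-1)
topk_ids_across_dp = dispatch_tensor(topk_ids, dp_group)        # see earlier sketch
topk_weights_across_dp = dispatch_tensor(topk_weights, dp_group)
```

Gathering `topk_ids`/`topk_weights` of shape `(tokens, top_k)` instead of `router_logits` of shape `(tokens, num_experts)` shrinks the all-gather payload whenever `top_k < num_experts`, and each rank runs topk only over its local tokens.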