
DP: dispatch tensor in FusedMoEMethod #680

Merged
xuechendi merged 2 commits into vllm-project:main from xinyu-intel:dev/xinyu/hpu-dispatch-tensor on Dec 12, 2025

Conversation

@xinyu-intel (Contributor) commented Dec 4, 2025

This PR is mainly to move the dispatch logic from vllm to vllm-gaudi so that we can do more ninja optimizations. E.g.:

  • we can dispatch the topk weights and ids instead of router_logits, because topk performance degrades when the sequence length is long (see the sketch after this list).
  • we can dispatch the fp8 hidden_states after quantization for a smaller message size. This will be addressed in "DP: dispatch fp8 hidden_states in INC" #684.
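A minimal sketch of the first bullet, assuming torch.distributed with one process per DP rank and an equal token count on every rank; the function name and the renormalizing softmax are illustrative assumptions, not the actual vllm-gaudi code:

```python
import torch
import torch.distributed as dist

def dispatch_topk(router_logits: torch.Tensor, top_k: int,
                  dp_group: dist.ProcessGroup):
    """Illustrative only: compute topk locally, then all-gather the small
    [num_tokens, top_k] tensors across DP ranks instead of the full
    [num_tokens, num_experts] router_logits."""
    topk_weights, topk_ids = torch.topk(router_logits, top_k, dim=-1)
    topk_weights = torch.softmax(topk_weights, dim=-1)  # renormalization is an assumption

    world = dist.get_world_size(dp_group)
    ids_across_dp = [torch.empty_like(topk_ids) for _ in range(world)]
    weights_across_dp = [torch.empty_like(topk_weights) for _ in range(world)]
    dist.all_gather(ids_across_dp, topk_ids, group=dp_group)
    dist.all_gather(weights_across_dp, topk_weights, group=dp_group)
    return torch.cat(ids_across_dp), torch.cat(weights_across_dp)
```

Each rank runs topk only over its own tokens, and the gathered message shrinks from num_experts to top_k columns per token, which is where the long-sequence win described above comes from.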

Copilot AI left a comment

Pull request overview

This PR refactors the data parallel (DP) dispatching logic for mixture of experts (MoE) models by moving the tensor dispatch operation from the HPU communicator into the FusedMoEMethod implementations. The key change is that instead of dispatching router_logits, the code now dispatches the already-computed topk_ids and topk_weights tensors directly in the FusedMoE forward pass.

Key changes:

  • Modified HPUDPMetadata to store topk_ids_across_dp and topk_weights_across_dp instead of router_logits_across_dp
  • Added a new dispatch_tensor utility function to handle all-gather operations across DP ranks (a sketch follows this list)
  • Updated FusedMoE implementations to perform tensor dispatching inline rather than in the communicator
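A hedged sketch of what such a utility might look like; the name dispatch_tensor matches the summary above, but the signature and the padding scheme (padding to the max token count across ranks) are assumptions, not the actual vllm_gaudi/v1/worker/hpu_dp_utils.py implementation:

```python
import torch
import torch.distributed as dist

def dispatch_tensor(t: torch.Tensor,
                    max_tokens_across_dp: int,
                    dp_group: dist.ProcessGroup) -> torch.Tensor:
    """All-gather a per-rank [num_tokens, ...] tensor across DP ranks,
    padding to the largest token count so all shapes match."""
    num_tokens = t.shape[0]
    if num_tokens < max_tokens_across_dp:
        pad = t.new_zeros((max_tokens_across_dp - num_tokens, *t.shape[1:]))
        t = torch.cat([t, pad], dim=0)
    world = dist.get_world_size(dp_group)
    out = [torch.empty_like(t) for _ in range(world)]
    dist.all_gather(out, t, group=dp_group)
    return torch.cat(out, dim=0)  # [world * max_tokens, ...]
```

Under these assumptions, the FusedMoE forward pass would call this on topk_ids and topk_weights right after the local topk, matching the inline dispatching described in the list above.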

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Changed files:

  • vllm_gaudi/v1/worker/hpu_dp_utils.py: refactored metadata to store topk tensors instead of router logits; added the dispatch_tensor utility function
  • vllm_gaudi/ops/hpu_fused_moe.py: integrated dispatch logic for topk_ids and topk_weights directly in the forward pass
  • vllm_gaudi/ops/hpu_fp8.py: integrated dispatch logic for topk_ids and topk_weights in the FP8 variant
  • vllm_gaudi/distributed/device_communicators/hpu_communicator.py: removed the dispatch implementation, as it is now handled in the FusedMoE methods


Comment threads: vllm_gaudi/v1/worker/hpu_dp_utils.py (outdated), vllm_gaudi/ops/hpu_fused_moe.py, vllm_gaudi/ops/hpu_fp8.py
github-actions bot commented Dec 5, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
1b7c7f5159484063af28cb47809d79e83d3301ec

@xuechendi (Collaborator) left a comment


Looks good to me; the PR makes the dispatch logic clearer as well.

@xuechendi (Collaborator)

Since the last CI run is a bit old, @xinyu-intel, please rebase and address the comments.
I'll go ahead and merge tomorrow.

@xuechendi (Collaborator)

BTW, please also add the necessary description to explain the PR and its expected benefit, for future reference.

@xuechendi xuechendi self-assigned this Dec 11, 2025
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@xinyu-intel force-pushed the dev/xinyu/hpu-dispatch-tensor branch 2 times, most recently from f4cdff3 to a931b50 on December 12, 2025 01:56
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
7618dc973dd1e56a46162bc7bd6e7625143bead0

@xuechendi (Collaborator) commented Dec 12, 2025

@xinyu-intel, sorry, there is a conflict. Hmm, let me resolve it and rerun CI.

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
91401c7a266450e332e88c3b569e93aeecca9a89

@xuechendi xuechendi merged commit 8293752 into vllm-project:main Dec 12, 2025
47 checks passed
xuechendi pushed a commit that referenced this pull request Dec 17, 2025
depends on #680

---------

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
lkk12014402 pushed a commit to lkk12014402/vllm-gaudi that referenced this pull request Dec 17, 2025
mgawarkiewicz-intel pushed a commit that referenced this pull request Jan 21, 2026
Llama4 enables temperature adjustment for `max_model_len > 32k`:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719.
The enabled adjustment changes the shape of tensor `q` from 2D to 3D:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307.
This tensor is passed to `UnquantizedFusedMoEMethod -> forward`:
https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163
causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor.

Found that the following PRs introduced the bug: #680 and #684

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
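The commit message above describes a 2D/3D shape mismatch. A minimal sketch of the shape-recording pattern that avoids this class of bug, assuming only that the fused-MoE math wants a 2D (num_tokens, hidden) input; `moe_forward` and its body are illustrative, not the actual hpu_fused_moe.py fix:

```python
import torch

def moe_forward(hidden_states: torch.Tensor) -> torch.Tensor:
    # Llama4's temperature adjustment can make hidden_states 3D, so record
    # the caller's shape before flattening for the 2D MoE computation.
    orig_shape = hidden_states.shape
    x = hidden_states.reshape(-1, orig_shape[-1])  # 2D (num_tokens, hidden)
    output = x * 2.0  # stand-in for the actual fused-MoE computation
    # Viewing back through the recorded shape restores 2D or 3D alike;
    # viewing a hard-coded 3D shape onto a tensor sized for a 2D input is
    # exactly the mismatch the commit message describes.
    return output.view(orig_shape)
```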
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Jan 21, 2026
adobrzyn pushed a commit that referenced this pull request Jan 26, 2026
Cherry-picked from `releases/v0.13.0`.
rsmyrek pushed a commit to rsmyrek/vllm-gaudi that referenced this pull request Jan 26, 2026
hlahkar pushed a commit to hlahkar/vllm-gaudi that referenced this pull request Jan 27, 2026
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
