DP: dispatch fp8 hidden_states in INC #684
Conversation
Pull request overview
This PR implements FP8 hidden-state dispatching in INC (Intel Neural Compressor) for data-parallel (DP) execution. The main purpose is to optimize MoE (Mixture of Experts) layer communication by dispatching FP8-quantized hidden states and routing information across DP ranks rather than full-precision tensors.
Key changes:
- Replaces router logits dispatching with topk IDs and weights dispatching for more efficient communication
- Adds FP8 dtype support for hidden states when INC quantization is enabled
- Introduces dispatch functions that are passed to MoE operators for flexible tensor distribution (a sketch of the FP8 dispatch idea follows this list)
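A minimal, self-contained sketch of the FP8 dispatch idea, not the PR's actual code: `quantize_fp8` and the replication stand-in for the DP all-gather below are hypothetical, and the real per-tensor scales come from INC.

```python
import torch

def quantize_fp8(hidden_states: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Per-tensor scaling into the fp8 e4m3 range (max magnitude 448),
    # in the spirit of INC-style quantization.
    return (hidden_states / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

def dispatch_hidden_states(hidden_states: torch.Tensor, scale: torch.Tensor,
                           dp_world_size: int) -> torch.Tensor:
    # Stand-in for an all-gather across DP ranks: quantize first, so each
    # rank sends 1 byte per element instead of 2 (bf16).
    fp8 = quantize_fp8(hidden_states, scale)
    gathered = [fp8 for _ in range(dp_world_size)]  # replication mimics all_gather
    return torch.cat(gathered, dim=0)

x = torch.randn(4, 8, dtype=torch.bfloat16)
scale = x.abs().amax().float() / 448.0
dispatched = dispatch_hidden_states(x, scale, dp_world_size=2)
print(x.element_size(), dispatched.element_size())  # 2 bytes/elem vs 1 byte/elem
```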
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_dp_utils.py | Adds dispatch_tensor and dispatch_hidden_states functions; updates HPUDPMetadata to store topk_ids and topk_weights instead of router_logits; adds FP8 dtype detection for INC quantization |
| vllm_gaudi/ops/hpu_fused_moe.py | Integrates dispatch functions into unquantized MoE processing; dispatches topk_ids and topk_weights when DP is enabled |
| vllm_gaudi/ops/hpu_fp8.py | Integrates dispatch functions into FP8 MoE processing; dispatches topk_ids and topk_weights when DP is enabled |
| vllm_gaudi/extension/ops.py | Adds dispatch_fn parameter to VllmMixtureOfExpertsOp, VllmMixtureOfExpertsOpFP8, and VllmMixtureOfExpertsOpFP8PerChannel constructors with _get_dispatch_func accessor method |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Removes dispatch implementation, delegating to plugin FusedMoEMethod for better performance |
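Based on the file summary above, the constructor wiring plausibly looks like the following sketch. The class and method names come from the table; the bodies are assumptions, not the PR's code.

```python
from typing import Callable, Optional
import torch

class VllmMixtureOfExpertsOp(torch.nn.Module):
    # Name follows the summary above; the body is a hypothetical sketch.
    def __init__(self, num_experts: int, dispatch_fn: Optional[Callable] = None):
        super().__init__()
        self.num_experts = num_experts
        self._dispatch_fn = dispatch_fn

    def _get_dispatch_func(self) -> Callable:
        # Identity fallback keeps single-rank execution on the same code path.
        return self._dispatch_fn if self._dispatch_fn is not None else (lambda t: t)

    def forward(self, hidden_states, topk_ids, topk_weights):
        dispatch = self._get_dispatch_func()
        # With DP enabled, hidden_states (possibly already fp8), topk_ids and
        # topk_weights are gathered across ranks before expert computation.
        hidden_states = dispatch(hidden_states)
        topk_ids = dispatch(topk_ids)
        topk_weights = dispatch(topk_weights)
        return hidden_states  # expert computation elided
```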
🚧 CI Blocked: the main CI workflow was not started.
This PR is mainly to move the dispatch logic from vllm to vllm-gaudi so that we can do more ninja optimizations. E.g.:
- we can dispatch the topk weights and ids instead of router_logits, because topk performance is poor when the sequence length is long (a back-of-the-envelope sketch follows below);
- we can dispatch the fp8 hidden_states after quantization for a smaller message size. This will be addressed in #684.

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
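A back-of-the-envelope sketch of the first point, with illustrative shapes not taken from the PR: when num_experts is much larger than topk, sending topk ids and weights moves far less data than router_logits, and each rank runs `torch.topk` only over its own tokens.

```python
import torch

tokens, num_experts, topk = 1024, 64, 4
router_logits = torch.randn(tokens, num_experts)  # what used to be dispatched
topk_weights, topk_ids = torch.topk(router_logits, topk, dim=-1)

# Message-size comparison per DP rank for this shape.
logits_bytes = router_logits.numel() * router_logits.element_size()
topk_bytes = (topk_weights.numel() * topk_weights.element_size()
              + topk_ids.numel() * topk_ids.element_size())
print(logits_bytes, topk_bytes)  # 262144 vs 49152 bytes
```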
Force-pushed from b26ad13 to 10bba11.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from 10bba11 to fa9b3b9.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from fa9b3b9 to 90c902f.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from 7068e81 to 3993288.
✅ CI Passed: all checks passed successfully.
Force-pushed from 3993288 to d5d5436.
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
✅ CI Passed: all checks passed successfully.
@yiliu30, please help approve if you think the fix looks good.
depends on vllm-project#680

---------

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: lvkaokao <kaokao.lv@intel.com>
For `max_model_len > 32k`, Llama4 enables temperature adjustment: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719. The enabled adjustment changes the shape of tensor `q` from 2D to 3D: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307. This tensor is then passed to `UnquantizedFusedMoEMethod -> forward`: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163, causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor. The bug was introduced by #680 and #684 (fixed in vllm-project#855). A sketch of the usual guard for this class of bug follows below.

Cherry-picked from `releases/v0.13.0`

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
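This is not necessarily the fix that landed; below is a minimal sketch of the usual guard for this class of bug, with a hypothetical `moe_forward_2d_safe` wrapper standing in for the real forward.

```python
import torch

def moe_forward_2d_safe(moe_forward, hidden_states: torch.Tensor) -> torch.Tensor:
    # Llama4's temperature adjustment can hand the MoE a 3D (batch, seq, hidden)
    # tensor; flatten to the (num_tokens, hidden) layout the MoE kernels expect,
    # then restore the caller's original shape on the way out.
    orig_shape = hidden_states.shape
    flat = hidden_states.reshape(-1, orig_shape[-1])
    output = moe_forward(flat)
    return output.view(orig_shape)

x3d = torch.randn(2, 16, 128)                      # (batch, seq, hidden)
out = moe_forward_2d_safe(lambda t: t * 2.0, x3d)  # stand-in for the MoE forward
assert out.shape == x3d.shape
```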