
DP: dispatch fp8 hidden_states in INC#684

Merged
xuechendi merged 2 commits into vllm-project:main from xinyu-intel:dev/xinyu/dispatch-fp8-hidden
Dec 17, 2025

Conversation

@xinyu-intel
Contributor

depends on #680

Copilot AI review requested due to automatic review settings December 4, 2025 13:12
Contributor

Copilot AI left a comment


Pull request overview

This PR implements FP8 hidden state dispatching in INC (Intel Neural Compressor) for data parallel (DP) execution. The main purpose is to optimize MoE (Mixture of Experts) layer communication by dispatching FP8-quantized hidden states and routing information across DP ranks, rather than full precision tensors.

Key changes:

  • Replaces router logits dispatching with topk IDs and weights dispatching for more efficient communication
  • Adds FP8 dtype support for hidden states when INC quantization is enabled
  • Introduces dispatch functions that are passed to MoE operators for flexible tensor distribution (a rough sketch of the idea follows this list)
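A minimal sketch of the FP8 dispatch idea, not the PR's actual implementation: it assumes per-tensor FP8 scaling and a standard `torch.distributed` DP group, whereas the real code in `vllm_gaudi/v1/worker/hpu_dp_utils.py` targets HPU collectives. The function name below is illustrative.

```python
# Illustrative only: quantize hidden states to FP8 on each DP rank and
# all-gather the smaller payload. `dp_group` and the function name are
# assumptions, not the PR's API.
import torch
import torch.distributed as dist


def dispatch_hidden_states_fp8(hidden_states: torch.Tensor,
                               scale: torch.Tensor,
                               dp_group) -> torch.Tensor:
    # Per-tensor FP8 quantization (scale handling is deliberately simplified).
    hs_fp8 = (hidden_states / scale).to(torch.float8_e4m3fn)

    world_size = dist.get_world_size(group=dp_group)
    gathered = torch.empty((world_size * hs_fp8.shape[0], hs_fp8.shape[1]),
                           dtype=torch.uint8, device=hs_fp8.device)
    # Communicate raw bytes: FP8 is viewed as uint8 for the collective and
    # reinterpreted afterwards, roughly halving the bf16 traffic.
    dist.all_gather_into_tensor(gathered, hs_fp8.view(torch.uint8), group=dp_group)
    return gathered.view(torch.float8_e4m3fn)
```

The topk ids and weights are gathered with the same kind of collective, which is why the PR dispatches them directly instead of dispatching router logits and recomputing topk after the gather.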

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_dp_utils.py | Adds `dispatch_tensor` and `dispatch_hidden_states` functions; updates `HPUDPMetadata` to store `topk_ids` and `topk_weights` instead of `router_logits`; adds FP8 dtype detection for INC quantization |
| vllm_gaudi/ops/hpu_fused_moe.py | Integrates dispatch functions into unquantized MoE processing; dispatches `topk_ids` and `topk_weights` when DP is enabled |
| vllm_gaudi/ops/hpu_fp8.py | Integrates dispatch functions into FP8 MoE processing; dispatches `topk_ids` and `topk_weights` when DP is enabled |
| vllm_gaudi/extension/ops.py | Adds a `dispatch_fn` parameter to the `VllmMixtureOfExpertsOp`, `VllmMixtureOfExpertsOpFP8`, and `VllmMixtureOfExpertsOpFP8PerChannel` constructors, with a `_get_dispatch_func` accessor method (see the sketch after this table) |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Removes the dispatch implementation, delegating to the plugin's FusedMoEMethod for better performance |
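A rough sketch of the `dispatch_fn` plumbing the summary describes. The class name, constructor signature, and forward flow here are hypothetical; only the `dispatch_fn` parameter and `_get_dispatch_func` accessor names come from the file summary above.

```python
# Sketch of wiring a dispatch function into a MoE op; not the repository's
# actual class. With DP enabled, the FusedMoEMethod hands in a function that
# knows how to gather (possibly FP8-quantized) tensors across DP ranks.
from typing import Callable, Optional

import torch


class MoEOpSketch:
    def __init__(self,
                 dispatch_fn: Optional[Callable[[torch.Tensor], torch.Tensor]] = None):
        self._dispatch_fn = dispatch_fn

    def _get_dispatch_func(self) -> Callable[[torch.Tensor], torch.Tensor]:
        # Fall back to identity when no DP dispatch is configured.
        return self._dispatch_fn if self._dispatch_fn is not None else (lambda t: t)

    def forward(self, hidden_states, topk_ids, topk_weights):
        dispatch = self._get_dispatch_func()
        # Gather the (quantized) hidden states and the locally computed routing
        # decisions from every DP rank before running the experts.
        hidden_states = dispatch(hidden_states)
        topk_ids = dispatch(topk_ids)
        topk_weights = dispatch(topk_weights)
        # ... expert computation over the gathered tokens would follow ...
        return hidden_states
```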

Comment thread vllm_gaudi/v1/worker/hpu_dp_utils.py Outdated
Comment thread vllm_gaudi/extension/ops.py
Comment thread vllm_gaudi/extension/ops.py Outdated
Comment thread vllm_gaudi/extension/ops.py Outdated
@github-actions

github-actions Bot commented Dec 4, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

xuechendi added a commit that referenced this pull request Dec 12, 2025
This PR is mainly to move the dispatch logic from vllm to vllm-gaudi so that we can do more ninja optimizations, e.g.:

- we can dispatch the topk weights and ids instead of router_logits, because topk performance degrades when the sequence length is long.
- we can dispatch the fp8 hidden_states after quantization for a smaller message size. This will be addressed in #684

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
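To make the message-size argument concrete, here is a rough, purely hypothetical estimate (numbers are not from the PR) for a single DP dispatch of one rank's hidden states:

```python
# Hypothetical numbers only: per-dispatch payload for one DP rank's hidden
# states, bf16 vs fp8.
tokens_per_rank, hidden_size = 4096, 7168

bf16_bytes = tokens_per_rank * hidden_size * 2   # 2 bytes per element
fp8_bytes = tokens_per_rank * hidden_size * 1    # 1 byte per element, plus a tiny scale

print(f"bf16: {bf16_bytes / 2**20:.1f} MiB, fp8: {fp8_bytes / 2**20:.1f} MiB")
# bf16: 56.0 MiB, fp8: 28.0 MiB
```

The topk point is separate: computing topk locally before dispatch means each rank runs topk only over its own tokens rather than over the full gathered sequence.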
@xinyu-intel xinyu-intel force-pushed the dev/xinyu/dispatch-fp8-hidden branch from b26ad13 to 10bba11 on December 14, 2025 11:49
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@xinyu-intel xinyu-intel force-pushed the dev/xinyu/dispatch-fp8-hidden branch from 10bba11 to fa9b3b9 on December 14, 2025 11:53
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@xinyu-intel xinyu-intel force-pushed the dev/xinyu/dispatch-fp8-hidden branch from fa9b3b9 to 90c902f on December 15, 2025 02:35
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
e2ed238885be6af358be1851cd43105b7d036c49

@xuechendi xuechendi self-assigned this Dec 15, 2025
Collaborator

@xuechendi xuechendi left a comment


LGTM. @yiliu30, could you help review as well?

Comment thread vllm_gaudi/v1/worker/hpu_dp_utils.py
@xinyu-intel xinyu-intel force-pushed the dev/xinyu/dispatch-fp8-hidden branch from 3993288 to d5d5436 on December 16, 2025 11:57
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
17fec3af0942da83bcebe2ca0cb4f6ae81c634d8

@xuechendi
Collaborator

@yiliu30, please help approve if you think the fix looks good

@xinyu-intel xinyu-intel requested a review from yiliu30 December 17, 2025 00:33
@xuechendi xuechendi merged commit b1f11cd into vllm-project:main Dec 17, 2025
47 checks passed
lkk12014402 pushed a commit to lkk12014402/vllm-gaudi that referenced this pull request Dec 17, 2025
depends on vllm-project#680

---------

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: lvkaokao <kaokao.lv@intel.com>
mgawarkiewicz-intel pushed a commit that referenced this pull request Jan 21, 2026
Llama4 enables temperature adjustment for `max_model_len > 32k`:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719.
The enabled adjustment changes the shape of tensor `q` from 2D to 3D:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307.
This tensor is passed to `UnquantizedFusedMoEMethod -> forward`:
https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163
causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor.

The bug was introduced by the following PRs: #680 and #684

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
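The reshape issue described above boils down to MoE code that assumes 2D (num_tokens, hidden) input while Llama4's temperature adjustment hands it a 3D `q`. The following is an illustrative pattern only, not the repository's actual fix: flatten on entry and restore the caller's shape on exit.

```python
# Illustrative pattern only, not the repository's fix: MoE kernels typically
# expect 2D (num_tokens, hidden) input, so a 3D `q` coming from Llama4's
# temperature adjustment must be flattened on entry and restored on exit.
import torch


def moe_forward_shape_safe(hidden_states: torch.Tensor) -> torch.Tensor:
    orig_shape = hidden_states.shape
    x = hidden_states.reshape(-1, orig_shape[-1])  # (tokens, hidden), works for 2D or 3D
    out = x * 1.0                                  # placeholder for the expert computation
    return out.view(orig_shape)                    # restore the caller's shape


print(moe_forward_shape_safe(torch.randn(2, 4, 8)).shape)  # torch.Size([2, 4, 8])
print(moe_forward_shape_safe(torch.randn(6, 8)).shape)     # torch.Size([6, 8])
```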
afierka-intel added a commit to afierka-intel/vllm-gaudi that referenced this pull request Jan 21, 2026
adobrzyn pushed a commit that referenced this pull request Jan 26, 2026
rsmyrek pushed a commit to rsmyrek/vllm-gaudi that referenced this pull request Jan 26, 2026
hlahkar pushed a commit to hlahkar/vllm-gaudi that referenced this pull request Jan 27, 2026
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
