DP: dispatch fp8 hidden_states in INC #684
Conversation
Pull request overview
This PR implements FP8 hidden-state dispatching in INC (Intel Neural Compressor) for data-parallel (DP) execution. The main purpose is to optimize MoE (Mixture of Experts) layer communication by dispatching FP8-quantized hidden states and routing information across DP ranks rather than full-precision tensors.
Key changes:
- Replaces router logits dispatching with topk IDs and weights dispatching for more efficient communication
- Adds FP8 dtype support for hidden states when INC quantization is enabled
- Introduces dispatch functions that are passed to MoE operators for flexible tensor distribution (a sketch of the FP8 dispatch idea follows this list)
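A minimal, self-contained sketch of the FP8 dispatch idea, not the PR's actual code: `quantize_fp8` and the replication stand-in for the DP all-gather below are hypothetical, and the real per-tensor scales come from INC.

```python
import torch

def quantize_fp8(hidden_states: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Per-tensor scaling into the fp8 e4m3 range (max magnitude 448),
    # in the spirit of INC-style quantization.
    return (hidden_states / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

def dispatch_hidden_states(hidden_states: torch.Tensor, scale: torch.Tensor,
                           dp_world_size: int) -> torch.Tensor:
    # Stand-in for an all-gather across DP ranks: quantize first, so each
    # rank sends 1 byte per element instead of 2 (bf16).
    fp8 = quantize_fp8(hidden_states, scale)
    gathered = [fp8 for _ in range(dp_world_size)]  # replication mimics all_gather
    return torch.cat(gathered, dim=0)

x = torch.randn(4, 8, dtype=torch.bfloat16)
scale = x.abs().amax().float() / 448.0
dispatched = dispatch_hidden_states(x, scale, dp_world_size=2)
print(x.element_size(), dispatched.element_size())  # 2 bytes/elem vs 1 byte/elem
```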
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_dp_utils.py | Adds dispatch_tensor and dispatch_hidden_states functions; updates HPUDPMetadata to store topk_ids and topk_weights instead of router_logits; adds FP8 dtype detection for INC quantization |
| vllm_gaudi/ops/hpu_fused_moe.py | Integrates dispatch functions into unquantized MoE processing; dispatches topk_ids and topk_weights when DP is enabled |
| vllm_gaudi/ops/hpu_fp8.py | Integrates dispatch functions into FP8 MoE processing; dispatches topk_ids and topk_weights when DP is enabled |
| vllm_gaudi/extension/ops.py | Adds dispatch_fn parameter to VllmMixtureOfExpertsOp, VllmMixtureOfExpertsOpFP8, and VllmMixtureOfExpertsOpFP8PerChannel constructors with _get_dispatch_func accessor method |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Removes dispatch implementation, delegating to plugin FusedMoEMethod for better performance |
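Based on the file summary above, the constructor wiring plausibly looks like the following sketch. The class and method names come from the table; the bodies are assumptions, not the PR's code.

```python
from typing import Callable, Optional
import torch

class VllmMixtureOfExpertsOp(torch.nn.Module):
    # Name follows the summary above; the body is a hypothetical sketch.
    def __init__(self, num_experts: int, dispatch_fn: Optional[Callable] = None):
        super().__init__()
        self.num_experts = num_experts
        self._dispatch_fn = dispatch_fn

    def _get_dispatch_func(self) -> Callable:
        # Identity fallback keeps single-rank execution on the same code path.
        return self._dispatch_fn if self._dispatch_fn is not None else (lambda t: t)

    def forward(self, hidden_states, topk_ids, topk_weights):
        dispatch = self._get_dispatch_func()
        # With DP enabled, hidden_states (possibly already fp8), topk_ids and
        # topk_weights are gathered across ranks before expert computation.
        hidden_states = dispatch(hidden_states)
        topk_ids = dispatch(topk_ids)
        topk_weights = dispatch(topk_weights)
        return hidden_states  # expert computation elided
```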
🚧 CI Blocked: the main CI workflow was not started.
This PR is mainly to move the dispatch logic from vllm to vllm-gaudi so that we can do more ninja optimizations. E.g.:
- we can dispatch the topk weights and ids instead of router_logits, because topk performance is poor when the sequence length is long (a back-of-the-envelope sketch follows below);
- we can dispatch the fp8 hidden_states after quantization for a smaller message size. This will be addressed in #684.

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
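A back-of-the-envelope sketch of the first point, with illustrative shapes not taken from the PR: when num_experts is much larger than topk, sending topk ids and weights moves far less data than router_logits, and each rank runs `torch.topk` only over its own tokens.

```python
import torch

tokens, num_experts, topk = 1024, 64, 4
router_logits = torch.randn(tokens, num_experts)  # what used to be dispatched
topk_weights, topk_ids = torch.topk(router_logits, topk, dim=-1)

# Message-size comparison per DP rank for this shape.
logits_bytes = router_logits.numel() * router_logits.element_size()
topk_bytes = (topk_weights.numel() * topk_weights.element_size()
              + topk_ids.numel() * topk_ids.element_size())
print(logits_bytes, topk_bytes)  # 262144 vs 49152 bytes
```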
Force-pushed from b26ad13 to 10bba11.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from 10bba11 to fa9b3b9.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from fa9b3b9 to 90c902f.
🚧 CI Blocked: the main CI workflow was not started.
Force-pushed from 7068e81 to 3993288.
✅ CI Passed: all checks passed successfully.
Force-pushed from 3993288 to d5d5436.
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
✅ CI Passed: all checks passed successfully.
@yiliu30, please help approve if you think the fix looks good.
depends on vllm-project#680

---------

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: lvkaokao <kaokao.lv@intel.com>
For `max_model_len > 32k`, Llama4 enables temperature adjustment: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719. The enabled adjustment changes the shape of tensor `q` from 2D to 3D: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307. This tensor is then passed to `UnquantizedFusedMoEMethod -> forward`: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163, causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor. The bug was introduced by #680 and #684 (fixed in vllm-project#855). A sketch of the usual guard for this class of bug follows below.

Cherry-picked from `releases/v0.13.0`

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
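This is not necessarily the fix that landed; below is a minimal sketch of the usual guard for this class of bug, with a hypothetical `moe_forward_2d_safe` wrapper standing in for the real forward.

```python
import torch

def moe_forward_2d_safe(moe_forward, hidden_states: torch.Tensor) -> torch.Tensor:
    # Llama4's temperature adjustment can hand the MoE a 3D (batch, seq, hidden)
    # tensor; flatten to the (num_tokens, hidden) layout the MoE kernels expect,
    # then restore the caller's original shape on the way out.
    orig_shape = hidden_states.shape
    flat = hidden_states.reshape(-1, orig_shape[-1])
    output = moe_forward(flat)
    return output.view(orig_shape)

x3d = torch.randn(2, 16, 128)                      # (batch, seq, hidden)
out = moe_forward_2d_safe(lambda t: t * 2.0, x3d)  # stand-in for the MoE forward
assert out.shape == x3d.shape
```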