GPT OSS Integration Code #771
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
Pull request overview
This PR integrates support for the GPT OSS model type, adding model-specific expert routing logic, bias support in MoE layers, and attention sink mechanisms for improved inference.
- Adds GPT OSS-specific expert routing and softmax handling in the MoE forward pass
- Implements bias support throughout the MoE pipeline
- Introduces attention sink functionality across attention backends and operations
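For context, here is a minimal sketch of top-k-then-softmax expert routing of the kind the overview describes: experts are selected first, and softmax is applied only over the selected logits. Whether this matches the PR's exact routing logic is an assumption, and all names below are illustrative, not the PR's actual identifiers.

```python
# Sketch of GPT OSS-style routing (assumed: top-k selection, then softmax
# over only the selected router logits). Illustrative names throughout.
import torch


def gpt_oss_route(router_logits: torch.Tensor, top_k: int):
    """router_logits: [num_tokens, num_experts] -> (weights, expert_ids)."""
    # Pick the k highest-scoring experts per token.
    topk_logits, topk_ids = torch.topk(router_logits, top_k, dim=-1)
    # Softmax over the selected logits only, unlike the common
    # softmax-then-topk ordering.
    topk_weights = torch.softmax(topk_logits, dim=-1)
    return topk_weights, topk_ids


# Example: 4 tokens routed across 8 experts, 2 experts per token.
logits = torch.randn(4, 8)
weights, ids = gpt_oss_route(logits, top_k=2)
assert torch.allclose(weights.sum(-1), torch.ones(4))
```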
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Increases sliding window block size calculation by 1 |
| vllm_gaudi/ops/hpu_fused_moe.py | Adds GPT OSS model type detection, bias handling in MoE operations, and model-specific expert routing |
| vllm_gaudi/extension/utils.py | Adds sinks parameter support to forward pass |
| vllm_gaudi/extension/ops.py | Implements sink attention mechanism in pipelined and naive attention functions, adds bias support to MoE operations |
| vllm_gaudi/attention/backends/hpu_attn.py | Adds sinks parameter and dtype consistency checks in attention implementation |
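As a reference point for the ops changes listed above, the following is a hedged sketch of how an attention sink can be folded into a naive attention implementation: a learned per-head sink logit joins the softmax (absorbing attention mass) but contributes no value vector. This is a sketch under assumed shapes, not the actual `vllm_gaudi/extension/ops.py` code.

```python
# Illustrative sink attention: one "virtual" key position per head whose
# probability mass is discarded after the softmax. Names are assumptions.
import torch


def naive_attention_with_sinks(q, k, v, sinks, scale):
    """q: [heads, q_len, d], k/v: [heads, kv_len, d], sinks: [heads]."""
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale  # [h, q_len, kv_len]
    # Append one sink logit per head as an extra key column.
    sink_col = sinks.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    # Drop the sink column: its mass simply dampens the remaining weights.
    return torch.matmul(probs[..., :-1], v)


h, ql, kl, d = 2, 3, 5, 4
out = naive_attention_with_sinks(
    torch.randn(h, ql, d), torch.randn(h, kl, d),
    torch.randn(h, kl, d), torch.zeros(h), d ** -0.5)
assert out.shape == (h, ql, d)
```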
🚧 CI Blocked: The main CI workflow was not started for the following reason:

/run-gaudi-tests
/run-gaudi-tests

Only codeowners and testowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, adobrzyn, mgawarkiewicz-intel, afierka-intel, michalkuligowski, iboiko-habana, kamil-kaczor, ksmusz, PatrykWo, kfojcik-intel, wuxun-zhang, attafosu, ulivne, Kacper-Pietkun, jkaniecki, jbyczkow, wpyszka
🚧 CI Blocked: The main CI workflow was not started for the following reason:

/run-gaudi-tests
…llm-project#855) For `max_model_len > 32k`, Llama4 enables temperature adjustment: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719. The enabled adjustment changes tensor `q` from 2D to 3D: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307. This tensor is passed to `UnquantizedFusedMoEMethod -> forward`: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163, causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor. Found that the following PRs introduced the bug: vllm-project#680 and vllm-project#684. Cherry-picked from `releases/v0.13.0` --------- Signed-off-by: Artur Fierka <artur.fierka@intel.com>
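To make the shape bug concrete, here is a minimal sketch of the flatten-to-2D-then-restore pattern that avoids the invalid `output.view` when the caller passes a 3D tensor; `run_fused_moe` is a hypothetical stand-in, not the real HPU fused-MoE kernel.

```python
# Shape-safe MoE forward sketch (assumed pattern, illustrative names):
# flatten any leading dims to 2D for the kernel, then restore the
# caller's original shape so view() is valid for 2D and 3D inputs alike.
import torch


def run_fused_moe(x: torch.Tensor) -> torch.Tensor:
    return x  # placeholder for the real fused-MoE computation


def moe_forward(hidden_states: torch.Tensor) -> torch.Tensor:
    orig_shape = hidden_states.shape                 # may be 2D or 3D
    x2d = hidden_states.reshape(-1, orig_shape[-1])  # [tokens, hidden]
    out2d = run_fused_moe(x2d)                       # kernel sees 2D only
    return out2d.view(orig_shape)                    # restore caller shape


x3d = torch.randn(2, 3, 8)  # e.g. [batch, seq, hidden] from Llama4
assert moe_forward(x3d).shape == x3d.shape
```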
…vllm-project#852) Signed-off-by: Dudi Lester <dlester@habana.ai> Co-authored-by: Kamil Kaczor <kamil.kaczor@intel.com>
Reverts vllm-project#780 --------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Force-pushed from 8416f2f to f3a4560.
🚧 CI Blocked: The main CI workflow was not started for the following reason:

/run-gaudi-tests
PR is taken care of through #887.
No description provided.