GPT OSS Integration Code #887
Conversation
Pull request overview
This PR integrates support for the GPT OSS model type, adding specialized handling for routing logic in MoE layers, bias support throughout the MoE pipeline, and attention sink mechanisms to improve inference performance.
Changes:
- Adds GPT OSS-specific expert routing in the MoE forward pass, with top-k selection applied before softmax (the reverse of the usual softmax-then-top-k ordering)
- Implements bias support across MoE operations (w13_bias and w2_bias) with conditional bias application
- Introduces attention sink functionality across multiple attention backends (pipelined, naive, and FSDPA) to enhance attention computation
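The routing difference in the first bullet can be sketched as follows. This is a minimal illustration of the ordering change, not the PR's actual `hpu_fused_moe.py` code; the function names are hypothetical and plain Python stands in for the tensor ops:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_standard(logits, k):
    # Common MoE routing: softmax over ALL expert logits first,
    # then keep the top-k probabilities (which sum to less than 1).
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return top, [probs[i] for i in top]

def route_gpt_oss(logits, k):
    # Reversed ordering: top-k on the raw logits first, then softmax
    # over ONLY the selected experts, so the k weights sum to 1.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return top, softmax([logits[i] for i in top])
```

Both orderings select the same experts, but the resulting weights differ: the reversed ordering renormalizes over the selected experts only.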
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Adjusts sliding window block size calculation with +1 offset |
| vllm_gaudi/ops/hpu_fused_moe.py | Adds GPT OSS routing logic and bias support to MoE operations |
| vllm_gaudi/extension/utils.py | Extends forward signature to accept sinks parameter |
| vllm_gaudi/extension/ops.py | Implements attention sink mechanisms in pipelined and prompt attention functions |
| vllm_gaudi/attention/backends/hpu_attn.py | Adds sink support to attention implementations with dtype conversions |
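The attention-sink changes listed above can be illustrated with a minimal sketch. This assumes the sink acts as an extra per-head logit that joins the softmax denominator but receives no value contribution, so real-token weights sum to less than 1; the function name is hypothetical and does not appear in the PR:

```python
import math

def attention_weights_with_sink(scores, sink_logit):
    # Softmax over [scores..., sink_logit]; the sink's probability mass
    # is discarded, leaving the real-token weights summing to < 1.
    all_logits = scores + [sink_logit]
    m = max(all_logits)
    exps = [math.exp(x - m) for x in all_logits]
    denom = sum(exps)
    return [e / denom for e in exps[:-1]]
```

Driving `sink_logit` toward negative infinity recovers ordinary softmax attention, which is a quick sanity check on the formulation.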
Comments suppressed due to low confidence (1)
vllm_gaudi/ops/hpu_fused_moe.py:1
- Variable `i` is undefined in this context. The variable `i` is used from the loop that starts at line 660, but this code at line 634 executes before that loop. Use `experts_range[0]` or iterate through `experts_range` to access bias attributes.
from functools import partial
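As an illustration of the conditional bias application the review refers to, here is a minimal sketch. Only the `w13_bias`/`w2_bias` names come from the PR; the helper functions and shapes are hypothetical, with plain Python lists standing in for tensors:

```python
def matvec(w, x):
    """Row-major matrix-vector product for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def apply_expert(x, w13, w2, act, w13_bias=None, w2_bias=None):
    # Conditional bias application: biases are added only when present,
    # so bias-free expert configurations keep the original code path.
    h = matvec(w13, x)
    if w13_bias is not None:
        h = [hi + bi for hi, bi in zip(h, w13_bias)]
    h = act(h)
    y = matvec(w2, h)
    if w2_bias is not None:
        y = [yi + bi for yi, bi in zip(y, w2_bias)]
    return y
```

Per the review comment, any per-expert bias lookup must happen inside the expert loop (or via `experts_range[0]`), not before the loop where the index is still undefined.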
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
Fixes Accuracy Issue in GPTOSS: #887. Updates `apply_monolithic` introduced in #876 to handle gptoss.

Signed-off-by: Rohit kumar Singh <rksingh@habana.ai>
Signed-off-by: Rohit Kumar Singh <9626333+SKRohit@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>