Fix mamba_type comparison for GDN hybrid cache allocation #1449
Upstream vllm commit 5536fc0c0 changed MambaSpec.mamba_type from str to MambaAttentionBackendEnum. The hybrid cache allocation in hpu_model_runner.py still compared against str literals, causing GDN layers to fall through to the Mamba2 shared-buffer path. This created mixed-dtype views (bf16 conv_state + fp32 ssm_state) on the same storage, triggering an aot_autograd assertion error during compilation.

Use a module-level _GDN_MAMBA_TYPES tuple that includes both enum values and string literals for backward compatibility with older upstream versions.

Signed-off-by: Seunghyuk Park <separk@habana.ai>
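The failure mode described above can be reproduced in isolation. The sketch below is a minimal illustration, not the code from hpu_model_runner.py: `StandInBackendEnum` is a hypothetical stand-in for `MambaAttentionBackendEnum`, and its member values are made up for the example. It shows that a plain `Enum` member never matches a tuple of string literals, so a str-only check silently reports "not GDN".

```python
from enum import Enum


class StandInBackendEnum(Enum):
    """Hypothetical stand-in for MambaAttentionBackendEnum (values are illustrative)."""
    GDN_ATTN = "gdn_attention"
    LINEAR = "linear_attention"
    MAMBA2 = "mamba2"


# Old-style check: string literals only.
LEGACY_GDN_TYPES = ("gdn_attention", "linear_attention")

# New-style check: enum members plus the legacy strings.
GDN_MAMBA_TYPES = (StandInBackendEnum.GDN_ATTN, StandInBackendEnum.LINEAR) + LEGACY_GDN_TYPES


def is_gdn_legacy(mamba_type):
    """Membership test against string literals only (the pre-fix behavior)."""
    return mamba_type in LEGACY_GDN_TYPES


def is_gdn(mamba_type):
    """Membership test that accepts both enum members and legacy strings."""
    return mamba_type in GDN_MAMBA_TYPES


# A plain Enum member does not compare equal to its string value,
# so the legacy check misses it and the caller takes the wrong branch.
print(is_gdn_legacy(StandInBackendEnum.GDN_ATTN))  # False
print(is_gdn(StandInBackendEnum.GDN_ATTN))         # True
print(is_gdn("linear_attention"))                  # True
print(is_gdn(StandInBackendEnum.MAMBA2))           # False
```

Keeping the string literals in the tuple is what lets the same check work against older upstream versions where mamba_type is still a str.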
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates GDN/linear Mamba type detection to support both legacy string identifiers and the newer MambaAttentionBackendEnum (when available), reducing hard-coded string checks.
Changes:
- Add a guarded import of `MambaAttentionBackendEnum` with a fallback to string identifiers.
- Centralize GDN/linear Mamba type identifiers in `_GDN_MAMBA_TYPES`.
- Replace repeated `("gdn_attention", "linear_attention")` membership checks with `_GDN_MAMBA_TYPES`.
    _GDN_MAMBA_TYPES = (MambaAttentionBackendEnum.GDN_ATTN, MambaAttentionBackendEnum.LINEAR, "gdn_attention",
                        "linear_attention")
except (ImportError, AttributeError):
    _GDN_MAMBA_TYPES = ("gdn_attention", "linear_attention")
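The guarded-import pattern in this excerpt can be sketched as a standalone function. This is an illustration under stated assumptions: `build_gdn_mamba_types` and its `module_name` parameter are hypothetical helpers introduced here so the example runs without vllm installed, and the real module path of the enum is deliberately not assumed.

```python
import importlib

LEGACY_GDN_TYPES = ("gdn_attention", "linear_attention")


def build_gdn_mamba_types(module_name, enum_name="MambaAttentionBackendEnum"):
    """Return the identifiers that mark a GDN/linear Mamba layer.

    Hypothetical helper: tries to load the enum from `module_name` and
    falls back to the legacy string identifiers if the module or the
    expected attributes are missing.
    """
    try:
        module = importlib.import_module(module_name)
        enum_cls = getattr(module, enum_name)
        # Enum members first, legacy strings kept for older upstream versions.
        return (enum_cls.GDN_ATTN, enum_cls.LINEAR) + LEGACY_GDN_TYPES
    except (ImportError, AttributeError):
        # ImportError: the module does not exist in this version.
        # AttributeError: the module exists but lacks the enum or a member.
        return LEGACY_GDN_TYPES


# Module missing entirely: falls back to strings.
print(build_gdn_mamba_types("no_such_module_for_this_sketch"))
# Module present but without the enum: also falls back to strings.
print(build_gdn_mamba_types("math"))
```

Catching `AttributeError` alongside `ImportError` covers the case where the module imports cleanly but the enum or one of its members has been renamed.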
  1 for g in kv_cache_config.kv_cache_groups
- if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in ("gdn_attention",
-                                                                              "linear_attention"))
+ if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in _GDN_MAMBA_TYPES)
    _GDN_MAMBA_TYPES = (MambaAttentionBackendEnum.GDN_ATTN, MambaAttentionBackendEnum.LINEAR, "gdn_attention",
                        "linear_attention")
Finding: This PR only patches Particularly dangerous are the Suggestion: apply the same |
Finding 1 🔴 Critical · The PR fixes string-literal Particularly dangerous are the Suggestion: Apply the same [- Reviewed by Awesome ChlOpus] |
Proper fix for Qwen3.5 compilation (mamba_type Enum comparison) is in PR vllm-project#1449. The enforce_eager workaround causes performance degradation and is unnecessary once vllm-project#1449 merges. Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com> Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
✅ CI Passed. All checks passed successfully against the following vllm commit:
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing `notify_kv_transfer_request_rejected` abstract method (upstream PR vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read `enforce_eager` from model card config (with env var override), enabling per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and `enable_eplb` parameter from `patched_create_fused_moe_router` after upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has been removed. The root cause (mamba_type str-vs-Enum comparison in hybrid cache allocation) is properly fixed by #1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>