Fix mamba_type comparison for GDN hybrid cache allocation by shepark · Pull Request #1449 · vllm-project/vllm-gaudi

shepark · 2026-05-14T21:55:59Z

Upstream vllm commit 5536fc0c0 changed MambaSpec.mamba_type from str to MambaAttentionBackendEnum. The hybrid cache allocation in hpu_model_runner.py still compared against str literals, causing GDN layers to fall through to the Mamba2 shared-buffer path. This created mixed-dtype views (bf16 conv_state+fp32 ssm_state) on the same storage, triggering an aot_autograd assertion error during compilation.

Use a module-level _GDN_MAMBA_TYPES tuple that includes both enum values and string literals for backward compatibility with older upstream versions.
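The failure mode described above can be reproduced in isolation. The sketch below is illustrative only: `FakeMambaBackendEnum` is a hypothetical stand-in for upstream's `MambaAttentionBackendEnum`, assumed here to be a plain `Enum` (not a `str` mixin), and its member values and the path labels returned by `pick_allocation_path` are made up for the example.

```python
from enum import Enum


class FakeMambaBackendEnum(Enum):
    # Hypothetical stand-in for MambaAttentionBackendEnum; values are illustrative.
    MAMBA2 = "mamba2"
    GDN_ATTN = "gdn_attention"
    LINEAR = "linear_attention"


# What the allocation code compared against before the fix.
LEGACY_GDN_TYPES = ("gdn_attention", "linear_attention")

# The shape of the fix: accept both enum members and legacy strings.
COMPAT_GDN_TYPES = (
    FakeMambaBackendEnum.GDN_ATTN,
    FakeMambaBackendEnum.LINEAR,
    "gdn_attention",
    "linear_attention",
)


def pick_allocation_path(mamba_type, gdn_types):
    """Return an illustrative label for the branch a layer would take."""
    return "gdn_path" if mamba_type in gdn_types else "mamba2_shared_buffer_path"


# A plain Enum member never compares equal to a str, so the legacy check misses it.
legacy_result = pick_allocation_path(FakeMambaBackendEnum.GDN_ATTN, LEGACY_GDN_TYPES)
compat_result = pick_allocation_path(FakeMambaBackendEnum.GDN_ATTN, COMPAT_GDN_TYPES)
old_upstream_result = pick_allocation_path("gdn_attention", COMPAT_GDN_TYPES)
```

With the legacy tuple the GDN member takes the Mamba2 branch without raising any error, which is why the problem only surfaced later as a compilation assertion.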

Signed-off-by: Seunghyuk Park <separk@habana.ai>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates GDN/linear Mamba type detection to support both legacy string identifiers and the newer MambaAttentionBackendEnum (when available), reducing hard-coded string checks.

Changes:

Add a guarded import of MambaAttentionBackendEnum with a fallback to string identifiers.
Centralize GDN/linear Mamba type identifiers in _GDN_MAMBA_TYPES.
Replace repeated ("gdn_attention", "linear_attention") membership checks with _GDN_MAMBA_TYPES.

+    _GDN_MAMBA_TYPES = (MambaAttentionBackendEnum.GDN_ATTN, MambaAttentionBackendEnum.LINEAR, "gdn_attention",
+                        "linear_attention")
+except (ImportError, AttributeError):
+    _GDN_MAMBA_TYPES = ("gdn_attention", "linear_attention")


                1 for g in kv_cache_config.kv_cache_groups
-                if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in ("gdn_attention",
-                                                                                             "linear_attention"))
+                if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in _GDN_MAMBA_TYPES)
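For context, the changed line is part of a generator expression that counts GDN/linear cache groups. The sketch below assumes the expression is wrapped in `sum(...)`, which the diff does not show. `MambaSpec`, `OtherSpec`, and `KVCacheGroup` are simplified stand-ins defined here, not the real vLLM classes.

```python
from dataclasses import dataclass


@dataclass
class MambaSpec:
    # Stand-in for vLLM's MambaSpec; only the field used here.
    mamba_type: object


@dataclass
class OtherSpec:
    # Stand-in for any non-Mamba KV cache spec.
    pass


@dataclass
class KVCacheGroup:
    # Stand-in for an entry of kv_cache_config.kv_cache_groups.
    kv_cache_spec: object


# String-only fallback tuple, as in the except branch of the PR.
_GDN_MAMBA_TYPES = ("gdn_attention", "linear_attention")


def count_gdn_groups(kv_cache_groups) -> int:
    """Count groups whose spec is a MambaSpec with a GDN/linear mamba_type."""
    return sum(
        1
        for g in kv_cache_groups
        if isinstance(g.kv_cache_spec, MambaSpec)
        and g.kv_cache_spec.mamba_type in _GDN_MAMBA_TYPES
    )


groups = [
    KVCacheGroup(MambaSpec("gdn_attention")),
    KVCacheGroup(MambaSpec("mamba2")),
    KVCacheGroup(OtherSpec()),
    KVCacheGroup(MambaSpec("linear_attention")),
]
gdn_group_count = count_gdn_groups(groups)
```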




adobrzyn · 2026-05-15T07:19:50Z

Finding: hpu_worker.py has the same broken string-literal comparisons (not fixed)

This PR only patches hpu_model_runner.py, but vllm_gaudi/v1/worker/hpu_worker.py has 4 more instances of the identical mamba_type in ("gdn_attention", "linear_attention") pattern (lines 445, 448, 466, 488). These will break in exactly the same way after the upstream enum change.

Particularly dangerous are the not in checks (lines 448, 488) — with the enum change, GDN/linear layers will erroneously match the "standard Mamba2" path because MambaAttentionBackendEnum.GDN_ATTN not in ("gdn_attention", "linear_attention") evaluates to True. This could lead to incorrect memory calculations for hybrid models.

Suggestion: apply the same _GDN_MAMBA_TYPES pattern (or import it from a shared location) in hpu_worker.py as well.
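The negated check is the easier one to get wrong, because a stale tuple makes every enum member look like "not GDN". A minimal sketch follows. `FakeMambaBackendEnum` is a hypothetical plain-`Enum` stand-in for the upstream enum, and `takes_standard_mamba2_path` is an illustrative name, not a function in `hpu_worker.py`.

```python
from enum import Enum


class FakeMambaBackendEnum(Enum):
    # Hypothetical stand-in; member values are illustrative.
    MAMBA2 = "mamba2"
    GDN_ATTN = "gdn_attention"


STRING_ONLY = ("gdn_attention", "linear_attention")
WITH_ENUM = (FakeMambaBackendEnum.GDN_ATTN,) + STRING_ONLY


def takes_standard_mamba2_path(mamba_type, gdn_types) -> bool:
    """Mirror of a `mamba_type not in (...)` check guarding the Mamba2 branch."""
    return mamba_type not in gdn_types


# Stale tuple: the GDN enum member is wrongly routed to the Mamba2 branch.
stale = takes_standard_mamba2_path(FakeMambaBackendEnum.GDN_ATTN, STRING_ONLY)

# Tuple that also lists enum members: GDN is excluded, Mamba2 still matches.
fixed_gdn = takes_standard_mamba2_path(FakeMambaBackendEnum.GDN_ATTN, WITH_ENUM)
fixed_mamba2 = takes_standard_mamba2_path(FakeMambaBackendEnum.MAMBA2, WITH_ENUM)
```

Defining the tuple once in a shared module and importing it from both files would keep the positive and negated checks consistent.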

adobrzyn · 2026-05-15T08:21:34Z

Finding 1 🔴 Critical · vllm_gaudi/v1/worker/hpu_worker.py:L445-L488

The PR fixes string-literal mamba_type comparisons in hpu_model_runner.py but misses the same pattern in hpu_worker.py (lines 445, 448, 466, 488). These will break identically after the upstream enum change.

Particularly dangerous are the not in checks (lines 448, 488) — with the enum change, MambaAttentionBackendEnum.GDN_ATTN not in ("gdn_attention", "linear_attention") evaluates to True, causing GDN/linear layers to erroneously match the "standard Mamba2" path. This could lead to incorrect memory calculations for hybrid models.

Suggestion: Apply the same _GDN_MAMBA_TYPES pattern to hpu_worker.py — either duplicate the guarded import + tuple definition, or extract it into a shared location both files can import.

[- Reviewed by Awesome ChlOpus]

Proper fix for Qwen3.5 compilation (mamba_type Enum comparison) is in PR vllm-project#1449. The enforce_eager workaround causes performance degradation and is unnecessary once vllm-project#1449 merges. Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com> Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Signed-off-by: Seunghyuk Park <separk@habana.ai>

github-actions · 2026-05-15T18:54:09Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d


…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing `notify_kv_transfer_request_rejected` abstract method (upstream PR vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read `enforce_eager` from model card config (with env var override), enabling per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and `enable_eplb` parameter from `patched_create_fused_moe_router` after upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has been removed; the root cause (mamba_type str-vs-Enum comparison in hybrid cache allocation) is properly fixed by #1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
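Item 2 of the commit message above describes reading `enforce_eager` from the model card config with an environment variable override. One way such a lookup could be written is sketched below. The function name, the `ENFORCE_EAGER` variable name, and the accepted truthy strings are all assumptions for illustration; the actual `test_common.py` logic is not shown on this page.

```python
import os


def resolve_enforce_eager(model_card: dict, env=None, env_var: str = "ENFORCE_EAGER") -> bool:
    """Hypothetical resolver: an environment override wins, else the model card value."""
    env = os.environ if env is None else env
    raw = env.get(env_var)
    if raw is not None:
        # Treat common truthy spellings as True; anything else as False.
        return raw.strip().lower() in ("1", "true", "yes")
    # Default to compiled mode when the model card does not set the key.
    return bool(model_card.get("enforce_eager", False))


# Model card requests eager mode; no override present.
from_card = resolve_enforce_eager({"enforce_eager": True}, env={})

# Override forces compilation even though the card requests eager mode.
overridden = resolve_enforce_eager({"enforce_eager": True}, env={"ENFORCE_EAGER": "0"})
```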

Copilot AI review requested due to automatic review settings May 14, 2026 21:55

shepark requested review from PatrykWo, adobrzyn, afierka-intel, iboiko-habana, jbyczkow, kamil-kaczor, ksmusz, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners May 14, 2026 21:56

Copilot AI reviewed May 14, 2026


shepark mentioned this pull request May 14, 2026

[FIX_FOR_VLLM_CUSTOM=dcacdf9a8860a86401127d1c8f93ebf3cfbfd026] Fix MultiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring #1436

Merged

github-actions Bot mentioned this pull request May 14, 2026

🚦 Team Review Dashboard #701

Open

shepark added 3 commits May 15, 2026 07:34

Merge branch 'main' into shepark/fix_mamba_type_enum

9df6dda

Fix pre-commit failure

451b652

Signed-off-by: Seunghyuk Park <separk@habana.ai>

Fix more mamba_type enum changes

5049b30

Signed-off-by: Seunghyuk Park <separk@habana.ai>

iboiko-habana approved these changes May 18, 2026


iboiko-habana merged commit e5b23b2 into vllm-project:main May 18, 2026
2 checks passed

iboiko-habana merged 4 commits into vllm-project:main from shepark:shepark/fix_mamba_type_enum


4 participants
