
Fix mamba_type comparison for GDN hybrid cache allocation #1449

Merged
iboiko-habana merged 4 commits into
vllm-project:mainfrom
shepark:shepark/fix_mamba_type_enum
May 18, 2026

@shepark
Contributor

@shepark shepark commented May 14, 2026

Upstream vllm commit 5536fc0c0 changed MambaSpec.mamba_type from str to MambaAttentionBackendEnum. The hybrid cache allocation in hpu_model_runner.py still compared against str literals, causing GDN layers to fall through to the Mamba2 shared-buffer path. This created mixed-dtype views (bf16 conv_state+fp32 ssm_state) on the same storage, triggering an aot_autograd assertion error during compilation.

Use a module-level _GDN_MAMBA_TYPES tuple that includes both enum values and string literals for backward compatibility with older upstream versions.
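
A minimal sketch of the tuple construction described above, not the actual vllm-gaudi code. `StandInBackendEnum`, its member values, and `build_gdn_mamba_types` are hypothetical names chosen so the snippet runs without vLLM installed; the real change imports `MambaAttentionBackendEnum` from upstream inside a `try` block and falls back to the string literals.

```python
from enum import Enum


class StandInBackendEnum(Enum):
    # Hypothetical stand-in for upstream MambaAttentionBackendEnum.
    # The member values here are made up for illustration.
    MAMBA2 = "mamba2_backend"
    GDN_ATTN = "gdn_backend"
    LINEAR = "linear_backend"


# String identifiers used before the upstream enum change.
LEGACY_GDN_TYPES = ("gdn_attention", "linear_attention")


def build_gdn_mamba_types(enum_cls=None):
    """Return every identifier that should be treated as GDN/linear.

    enum_cls=None simulates an older upstream where the enum cannot be
    imported; a class without the expected members raises AttributeError.
    """
    try:
        if enum_cls is None:
            raise ImportError("enum not available in this upstream version")
        return (enum_cls.GDN_ATTN, enum_cls.LINEAR) + LEGACY_GDN_TYPES
    except (ImportError, AttributeError):
        return LEGACY_GDN_TYPES


new_types = build_gdn_mamba_types(StandInBackendEnum)  # newer upstream
old_types = build_gdn_mamba_types(None)                # older upstream
```

Keeping the string literals in the tuple even when the enum is available means a spec that still carries a `str` value is classified the same way as one carrying the enum member.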

Signed-off-by: Seunghyuk Park <separk@habana.ai>
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates GDN/linear Mamba type detection to support both legacy string identifiers and the newer MambaAttentionBackendEnum (when available), reducing hard-coded string checks.

Changes:

  • Add a guarded import of MambaAttentionBackendEnum with a fallback to string identifiers.
  • Centralize GDN/linear Mamba type identifiers in _GDN_MAMBA_TYPES.
  • Replace repeated ("gdn_attention", "linear_attention") membership checks with _GDN_MAMBA_TYPES.

Comment on lines +55 to +58
_GDN_MAMBA_TYPES = (MambaAttentionBackendEnum.GDN_ATTN, MambaAttentionBackendEnum.LINEAR, "gdn_attention",
"linear_attention")
except (ImportError, AttributeError):
_GDN_MAMBA_TYPES = ("gdn_attention", "linear_attention")
  1 for g in kv_cache_config.kv_cache_groups
- if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in ("gdn_attention",
-                                                                                "linear_attention"))
+ if isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.mamba_type in _GDN_MAMBA_TYPES)
Comment on lines +55 to +56
_GDN_MAMBA_TYPES = (MambaAttentionBackendEnum.GDN_ATTN, MambaAttentionBackendEnum.LINEAR, "gdn_attention",
"linear_attention")
@adobrzyn
Collaborator

Finding: hpu_worker.py has the same broken string-literal comparisons (not fixed)

This PR only patches hpu_model_runner.py, but vllm_gaudi/v1/worker/hpu_worker.py has 4 more instances of the identical mamba_type in ("gdn_attention", "linear_attention") pattern (lines 445, 448, 466, 488). These will break in exactly the same way after the upstream enum change.

Particularly dangerous are the not in checks (lines 448, 488) — with the enum change, GDN/linear layers will erroneously match the "standard Mamba2" path because MambaAttentionBackendEnum.GDN_ATTN not in ("gdn_attention", "linear_attention") evaluates to True. This could lead to incorrect memory calculations for hybrid models.

Suggestion: apply the same _GDN_MAMBA_TYPES pattern (or import it from a shared location) in hpu_worker.py as well.
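
The `not in` failure mode can be reproduced in isolation. This is a sketch under the assumption that the upstream enum is a plain `Enum` without a `str` mixin, so its members never compare equal to string literals; `StandInBackendEnum` and `pick_path` are hypothetical names, not code from `hpu_worker.py`.

```python
from enum import Enum


class StandInBackendEnum(Enum):
    # Hypothetical stand-in; a plain Enum member is not equal to any str.
    MAMBA2 = "mamba2"
    GDN_ATTN = "gdn"


LEGACY = ("gdn_attention", "linear_attention")


def pick_path(mamba_type, gdn_types):
    """Mimic a 'not in' branch: anything not recognized as GDN/linear
    is routed to the standard Mamba2 path."""
    return "mamba2" if mamba_type not in gdn_types else "gdn"


# String-only tuple: the enum member is not found, so the GDN layer is
# misrouted to the Mamba2 path.
broken = pick_path(StandInBackendEnum.GDN_ATTN, LEGACY)

# Tuple that also contains the enum member: routed correctly.
fixed = pick_path(StandInBackendEnum.GDN_ATTN,
                  (StandInBackendEnum.GDN_ATTN,) + LEGACY)

print(broken, fixed)  # mamba2 gdn
```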

@adobrzyn
Collaborator

Finding 1 🔴 Critical · vllm_gaudi/v1/worker/hpu_worker.py:L445-L488

The PR fixes string-literal mamba_type comparisons in hpu_model_runner.py but misses the same pattern in hpu_worker.py (lines 445, 448, 466, 488). These will break identically after the upstream enum change.

Particularly dangerous are the not in checks (lines 448, 488) — with the enum change, MambaAttentionBackendEnum.GDN_ATTN not in ("gdn_attention", "linear_attention") evaluates to True, causing GDN/linear layers to erroneously match the "standard Mamba2" path. This could lead to incorrect memory calculations for hybrid models.

Suggestion: Apply the same _GDN_MAMBA_TYPES pattern to hpu_worker.py — either duplicate the guarded import + tuple definition, or extract it into a shared location both files can import.
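
One way the shared-location option could look, as a rough sketch only. The names `StandInBackendEnum`, `StandInSpec`, `GDN_MAMBA_TYPES`, and `is_gdn_mamba_type` are hypothetical, and the snippet does not reflect the actual layout of vllm-gaudi; the point is that both the `in` and `not in` call sites go through a single predicate.

```python
from dataclasses import dataclass
from enum import Enum


class StandInBackendEnum(Enum):
    # Hypothetical stand-in for the upstream enum.
    MAMBA2 = "mamba2"
    GDN_ATTN = "gdn"
    LINEAR = "linear"


# Defined once in a shared module; both files would import it from there.
GDN_MAMBA_TYPES = (
    StandInBackendEnum.GDN_ATTN,
    StandInBackendEnum.LINEAR,
    "gdn_attention",
    "linear_attention",
)


def is_gdn_mamba_type(mamba_type) -> bool:
    """Single predicate used by every 'in' and 'not in' call site."""
    return mamba_type in GDN_MAMBA_TYPES


@dataclass
class StandInSpec:
    # Hypothetical stand-in for a spec object carrying a mamba_type field.
    mamba_type: object


specs = [
    StandInSpec(StandInBackendEnum.GDN_ATTN),  # new enum-style value
    StandInSpec("linear_attention"),           # legacy string value
    StandInSpec(StandInBackendEnum.MAMBA2),    # standard Mamba2
]

num_gdn = sum(1 for s in specs if is_gdn_mamba_type(s.mamba_type))
num_mamba2 = sum(1 for s in specs if not is_gdn_mamba_type(s.mamba_type))
```

With a single predicate, the negated checks cannot drift out of sync with the positive ones when upstream changes the type again.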


[- Reviewed by Awesome ChlOpus]

pawel-olejniczak added a commit to pawel-olejniczak/vllm-gaudi that referenced this pull request May 15, 2026
Proper fix for Qwen3.5 compilation (mamba_type Enum comparison)
is in PR vllm-project#1449. The enforce_eager workaround causes performance
degradation and is unnecessary once vllm-project#1449 merges.

Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
shepark added 3 commits May 15, 2026 07:34
Signed-off-by: Seunghyuk Park <separk@habana.ai>
Signed-off-by: Seunghyuk Park <separk@habana.ai>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@iboiko-habana iboiko-habana merged commit e5b23b2 into vllm-project:main May 18, 2026
2 checks passed
iboiko-habana pushed a commit that referenced this pull request May 18, 2026
Signed-off-by: Seunghyuk Park <separk@habana.ai>
iboiko-habana added a commit that referenced this pull request May 19, 2026
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing
`notify_kv_transfer_request_rejected` abstract method (upstream PR
vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read
`enforce_eager` from model card config (with env var override), enabling
per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and
`enable_eplb` parameter from `patched_create_fused_moe_router` after
upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has
been removed — the root cause (mamba_type str-vs-Enum comparison in
hybrid cache allocation) is properly fixed by #1449, which should merge
first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed
tensors).

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>