fix: allow HMA with KV events when explicitly enabled #39269

Open
sara4dev wants to merge 1 commit into vllm-project:main from sara4dev:fix/enable-hma-with-kv-events

@sara4dev sara4dev commented Apr 8, 2026

Summary

Allow the Hybrid Memory Allocator (HMA) to work alongside KV events when explicitly enabled via --no-disable-hybrid-kv-cache-manager. This unblocks disaggregated serving for hybrid models (Mamba + attention) like NVIDIA Nemotron-H.

The problem: HMA is unconditionally disabled when kv_events_config is set (lines 1226-1228 of vllm/config/vllm.py), even when the user explicitly passes --no-disable-hybrid-kv-cache-manager. This blocks disaggregated serving for hybrid models, even though NixlConnector already implements SupportsHMA with full Mamba support (_has_mamba detection, MAMBA2_ATTN backend handling).

The fix: Only auto-disable HMA for KV events when the user hasn't explicitly opted in. When --no-disable-hybrid-kv-cache-manager is passed (disable_hybrid_kv_cache_manager=False), respect it.
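
The intended behavior can be sketched as a small decision function. This is a minimal illustration, not the actual code in vllm/config/vllm.py: the name `resolve_disable_hma` and its parameters are hypothetical, and it assumes the setting is tri-state (`None` when the user passed neither flag), which is what would make an explicit `False` distinguishable from the default. It also models only the KV events condition and ignores any other reason HMA might be disabled.

```python
from typing import Optional


def resolve_disable_hma(
    user_setting: Optional[bool],
    kv_events_enabled: bool,
) -> bool:
    """Return True if the hybrid KV cache manager (HMA) should be disabled.

    Hypothetical helper for illustration only; not vLLM's real API.
    `user_setting` mirrors disable_hybrid_kv_cache_manager and is assumed
    to be None when the user passed neither flag.
    """
    if user_setting is not None:
        # Explicit choice (--no-disable-hybrid-kv-cache-manager gives False):
        # respect it, even when KV events are configured.
        return user_setting
    # No explicit choice: keep the conservative default and auto-disable
    # HMA only when KV events are configured.
    return kv_events_enabled


# Explicit opt-in with KV events configured: HMA stays enabled.
print(resolve_disable_hma(False, True))  # False
# No explicit choice with KV events configured: HMA is auto-disabled.
print(resolve_disable_hma(None, True))   # True
```

The key design point is that the guard only fires on the "unset" path, so existing deployments that never pass the flag keep the previous behavior.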

Changes

  • vllm/config/vllm.py: Add check for explicit user opt-in before disabling HMA due to KV events config (+8 lines, -2 lines)

Test Plan

Tested with:

  • NVIDIA Nemotron-3-Super-120B-A12B-NVFP4 (hybrid Mamba + attention, NVFP4 quantization)
  • Disaggregated prefill/decode via NixlConnector with CUDA-aware UCX
  • 200/200 requests successful in production benchmark at concurrency=[1, 32, 128, 512]

Context

NixlConnector already:

  • Subclasses SupportsHMA (kv_connector/v1/base.py)
  • Detects Mamba layers via _has_mamba (nixl_connector.py:576)
  • Handles MAMBA2_ATTN attention backend (nixl_connector.py:1216)
  • Initializes NIXL scheduler for hybrid models (nixl_connector.py:581)

The config guard at line 1226 was added conservatively but creates a conflict: disaggregated mode auto-enables KV events, which then blocks HMA, making hybrid disagg impossible even though the connector supports it.
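
To make the conflict concrete, the pre-fix guard as described above can be modeled in a few lines. This is a sketch under stated assumptions, not the code at line 1226: `old_guard_disables_hma` is a hypothetical name, and the tri-state `user_setting` (`None` meaning "flag not passed") is assumed for illustration.

```python
from typing import Optional


def old_guard_disables_hma(
    user_setting: Optional[bool],
    kv_events_enabled: bool,
) -> bool:
    """Pre-fix behavior as described in this PR (hypothetical helper).

    Returns True when HMA ends up disabled.
    """
    if kv_events_enabled:
        # The user's explicit setting is never consulted on this path.
        return True
    return bool(user_setting)


# Per the PR description, disaggregated mode turns KV events on.
DISAGG_KV_EVENTS_ENABLED = True

# Try every possible user setting: unset, explicit opt-in, explicit opt-out.
outcomes = {
    setting: old_guard_disables_hma(setting, DISAGG_KV_EVENTS_ENABLED)
    for setting in (None, False, True)
}
print(outcomes)  # {None: True, False: True, True: True}
```

Under this model, no value of the user setting leaves HMA enabled once KV events are on, which is the dead end the PR describes for hybrid models in disaggregated mode.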

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the configuration logic in vllm/config/vllm.py to allow the hybrid KV cache manager to be used with KV events if explicitly enabled via the scheduler configuration. Previously, the hybrid KV cache manager was automatically disabled whenever KV events were configured. I have no feedback to provide.

@sara4dev sara4dev marked this pull request as ready for review April 8, 2026 06:15
@sara4dev sara4dev force-pushed the fix/enable-hma-with-kv-events branch from b13133f to 944b63e Compare April 8, 2026 06:16
The Hybrid Memory Allocator (HMA) is unconditionally disabled when
kv_events_config is set, even when the user explicitly passes
--no-disable-hybrid-kv-cache-manager. This blocks disaggregated
serving for hybrid models (Mamba+attention) like Nemotron-H, despite
NixlConnector already implementing SupportsHMA with full Mamba support
(_has_mamba detection, MAMBA2_ATTN backend handling).

Fix: Only auto-disable HMA for KV events when the user has not
explicitly opted in. When --no-disable-hybrid-kv-cache-manager is
passed (disable_hybrid_kv_cache_manager=False), respect it.

Tested with:
- NVIDIA Nemotron-3-Super-120B-A12B-NVFP4 (hybrid Mamba+attention)
- Disaggregated prefill/decode via NixlConnector with CUDA-aware UCX
- 200/200 requests successful in production benchmark

Signed-off-by: sara4dev <saravanakumar.periyasamy@gmail.com>
@sara4dev sara4dev force-pushed the fix/enable-hma-with-kv-events branch from 944b63e to c5b75ec Compare April 8, 2026 06:19

mergify Bot commented Apr 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sara4dev.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot commented May 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sara4dev.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 15, 2026