fix: allow HMA with KV events when explicitly enabled by sara4dev · Pull Request #39269 · vllm-project/vllm

sara4dev · 2026-04-08T06:01:39Z

Summary

Allow the Hybrid Memory Allocator (HMA) to work alongside KV events when explicitly enabled via --no-disable-hybrid-kv-cache-manager. This unblocks disaggregated serving for hybrid models (Mamba + attention) like NVIDIA Nemotron-H.

The problem: HMA is unconditionally disabled whenever kv_events_config is set (lines 1226-1228 of vllm/config/vllm.py), even when the user explicitly passes --no-disable-hybrid-kv-cache-manager. This blocks disaggregated serving for hybrid models, even though NixlConnector already implements SupportsHMA with full Mamba support (_has_mamba detection, MAMBA2_ATTN backend handling).

The fix: Only auto-disable HMA for KV events when the user hasn't explicitly opted in. When --no-disable-hybrid-kv-cache-manager is passed (disable_hybrid_kv_cache_manager=False), respect it.
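The difference between the old and new behavior can be sketched in a few lines of Python. This is an illustration only, not the actual diff: `old_guard` and `new_guard` are hypothetical helper names, and the sketch assumes the flag is tri-state (`None` when the user passed nothing, `False` when --no-disable-hybrid-kv-cache-manager was passed, `True` when HMA was explicitly disabled).

```python
from typing import Optional


def old_guard(user_disable: Optional[bool], kv_events_enabled: bool) -> bool:
    """Pre-fix behavior as described above (hypothetical helper).

    Returns True when HMA ends up disabled. KV events always win,
    so the user's explicit opt-in is ignored.
    """
    if kv_events_enabled:
        return True
    return bool(user_disable)


def new_guard(user_disable: Optional[bool], kv_events_enabled: bool) -> bool:
    """Post-fix behavior as described above (hypothetical helper).

    Returns True when HMA ends up disabled. An explicit user choice is
    respected; the conservative auto-disable only applies when the
    user expressed no preference (assumed here to be None).
    """
    if user_disable is not None:
        return user_disable
    return kv_events_enabled
```

With KV events enabled and an explicit opt-in (`user_disable=False`), `old_guard` still reports HMA as disabled while `new_guard` leaves it enabled. When the user passes nothing, both guards keep the conservative default.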

Changes

vllm/config/vllm.py: Add a check for explicit user opt-in before disabling HMA because of the KV events config (+8 lines, -2 lines)

Test Plan

Tested with:

NVIDIA Nemotron-3-Super-120B-A12B-NVFP4 (hybrid Mamba + attention, NVFP4 quantization)
Disaggregated prefill/decode via NixlConnector with CUDA-aware UCX
200/200 requests successful in production benchmark at concurrency=[1, 32, 128, 512]
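For readers who want to reproduce a similar sweep, a concurrency-limited request loop might look like the sketch below. It is not the benchmark used for this PR: `fake_request` is a hypothetical stand-in for a real call to the serving endpoint, and sending 200 requests at every concurrency level is an assumption, since the test plan does not say how the 200 requests were distributed.

```python
import asyncio


async def fake_request(i: int) -> bool:
    """Hypothetical stand-in for one request; always reports success."""
    await asyncio.sleep(0)  # yield to the event loop like a real I/O call
    return True


async def run_level(total: int, concurrency: int) -> int:
    """Send `total` requests with at most `concurrency` in flight.

    Returns the number of successful requests.
    """
    sem = asyncio.Semaphore(concurrency)

    async def one(i: int) -> bool:
        async with sem:
            return await fake_request(i)

    results = await asyncio.gather(*(one(i) for i in range(total)))
    return sum(results)


def sweep(levels: list[int], total: int = 200) -> dict[int, int]:
    """Run one batch per concurrency level; map level -> success count."""
    return {c: asyncio.run(run_level(total, c)) for c in levels}
```

Replacing `fake_request` with a real client call and comparing each count against `total` gives a pass/fail summary per concurrency level.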

Context

NixlConnector already:

Subclasses SupportsHMA (kv_connector/v1/base.py)
Detects Mamba layers via _has_mamba (nixl_connector.py:576)
Handles MAMBA2_ATTN attention backend (nixl_connector.py:1216)
Initializes NIXL scheduler for hybrid models (nixl_connector.py:581)
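The first point, declaring HMA support by subclassing a marker class, can be illustrated with a minimal sketch. All classes here are hypothetical stand-ins rather than vLLM's real definitions; only the name SupportsHMA comes from the PR description.

```python
class SupportsHMA:
    """Hypothetical marker class: subclassing it declares HMA support."""


class BaseConnector:
    """Hypothetical stand-in for a KV connector base class."""


class HybridAwareConnector(BaseConnector, SupportsHMA):
    """Stand-in for a connector that opts in to HMA."""


class LegacyConnector(BaseConnector):
    """Stand-in for a connector that does not opt in."""


def connector_supports_hma(connector_cls: type) -> bool:
    """Capability check based purely on the class hierarchy."""
    return issubclass(connector_cls, SupportsHMA)
```

A check of this shape lets configuration code decide per connector whether HMA can stay enabled, instead of disabling it for every connector.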

The config guard at line 1226 was added conservatively, but it creates a conflict: disaggregated mode auto-enables KV events, which in turn disables HMA, making hybrid disaggregated serving impossible even though the connector supports it.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request modifies the configuration logic in vllm/config/vllm.py to allow the hybrid KV cache manager to be used with KV events if explicitly enabled via the scheduler configuration. Previously, the hybrid KV cache manager was automatically disabled whenever KV events were configured. I have no feedback to provide.

The Hybrid Memory Allocator (HMA) is unconditionally disabled when kv_events_config is set, even when the user explicitly passes --no-disable-hybrid-kv-cache-manager. This blocks disaggregated serving for hybrid models (Mamba+attention) like Nemotron-H, despite NixlConnector already implementing SupportsHMA with full Mamba support (_has_mamba detection, MAMBA2_ATTN backend handling).

Fix: Only auto-disable HMA for KV events when the user has not explicitly opted in. When --no-disable-hybrid-kv-cache-manager is passed (disable_hybrid_kv_cache_manager=False), respect it.

Tested with:
- NVIDIA Nemotron-3-Super-120B-A12B-NVFP4 (hybrid Mamba+attention)
- Disaggregated prefill/decode via NixlConnector with CUDA-aware UCX
- 200/200 requests successful in production benchmark

Signed-off-by: sara4dev <saravanakumar.periyasamy@gmail.com>

mergify · 2026-04-13T03:45:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sara4dev.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-15T11:08:09Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sara4dev.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist Bot reviewed Apr 8, 2026

sara4dev marked this pull request as ready for review April 8, 2026 06:15

sara4dev requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners April 8, 2026 06:15

sara4dev force-pushed the fix/enable-hma-with-kv-events branch from b13133f to 944b63e Compare April 8, 2026 06:16

sara4dev force-pushed the fix/enable-hma-with-kv-events branch from 944b63e to c5b75ec Compare April 8, 2026 06:19

mergify Bot added the needs-rebase label Apr 13, 2026

mgoin mentioned this pull request Apr 15, 2026

[Spec Decode] Support hybrid attention models in extract_hidden_states #39949

Merged

4 tasks

v1b3coder mentioned this pull request May 6, 2026

[Feature]: --kv-transfer-config unconditionally disables HMA, ignoring SupportsHMA on the connector #41830

Closed

chfeng-cs mentioned this pull request May 6, 2026

[KV Transfer] Enable HMA by default for connectors that support it #41847

Merged

mergify Bot removed the needs-rebase label May 15, 2026

mergify Bot added the needs-rebase label May 15, 2026

sara4dev wants to merge 1 commit into vllm-project:main from sara4dev:fix/enable-hma-with-kv-events

1 participant
