[KVConnector] scheduler: Add HMA support for KV load recovery #35223

orozery wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request extends the KV load recovery mechanism to support Hybrid Memory Allocation (HMA) by handling multiple KV cache groups. It also introduces a change where models with sliding window attention will fail on error instead of recomputing, which is a safer default.
The changes in the tests and scheduler logic for handling num_computed_tokens in async loading scenarios are consistent. However, I've found a critical bug in the _update_requests_with_invalid_blocks function where the accounting for affected tokens is incorrect when multiple cache groups are present. Please see my detailed comment for a fix.
/gemini review
Code Review
This pull request extends the KV load recovery mechanism to support the Hybrid Memory Allocator (HMA) by refactoring how requests with invalid KV blocks are handled. The changes correctly disable recovery for models with sliding windows. However, I've found a critical issue in the implementation for handling invalid blocks with HMA. The logic for calculating the number of affected tokens is placed inside a loop over KV cache groups, which will lead to incorrect accounting when a request is affected by failures in multiple groups. I've provided a comment with details on how to fix this.
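The fix the review is asking for can be sketched as follows. This is an illustrative Python sketch, not vLLM's actual scheduler code: `usable_prefix_tokens`, its parameters, and the data shapes are all hypothetical; the point is that the usable token count must be the minimum valid prefix across all KV cache groups, accumulated outside any single group's loop rather than recomputed per group.

```python
def usable_prefix_tokens(group_block_ids, invalid_block_ids, group_block_sizes):
    """Return the number of tokens that are valid in *every* KV cache group.

    group_block_ids:   list of block-id lists, one per KV cache group
    invalid_block_ids: set of block ids whose KV load failed
    group_block_sizes: tokens per block for each group (may differ under HMA)
    """
    usable = float("inf")
    for block_ids, block_size in zip(group_block_ids, group_block_sizes):
        valid_blocks = 0
        for block_id in block_ids:
            if block_id in invalid_block_ids:
                # everything at and after the first invalid block is unusable
                break
            valid_blocks += 1
        # the request can only keep tokens valid in ALL groups,
        # so take the minimum across groups, not a per-group value
        usable = min(usable, valid_blocks * block_size)
    return int(usable)
```

With two groups of block size 16 and an invalid block in the second group's second slot, only 16 tokens survive even though the first group has 48 valid tokens.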
Force-pushed from 4e47781 to 7e55abc
Hi @orozery, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from 7e55abc to 8a8e519
Force-pushed from 8a8e519 to 4e87cfe
Force-pushed from 4e87cfe to 1cb34bc
Force-pushed from 1cb34bc to f8112ad
To make things clear, it is possible to enable re-computation even for sliding window.
Forgot cc @sdavidbd in case you're still around :)
tlrmchlsmth left a comment:

lgtm but @heheda12345 and/or @NickLucche should approve as well
Force-pushed from 8f47716 to 8c011ae
Relaxed the check a bit to allow sliding-window re-computation if HMA is off.
Force-pushed from d994c2f to f02a5c8
NickLucche left a comment:

Thanks for the work @orozery.
Can you check whether this works with MLA models?
@tlrmchlsmth I found quite a few issues (bugs) with my previous implementation.
/gemini review
Code Review
This pull request extends the KV load recovery mechanism to support Hybrid Memory Allocation (HMA). The key change is in the scheduler's handling of invalid KV cache blocks. The logic now differentiates between models with only full-attention layers and those with other types, such as sliding-window attention or SSMs. For the latter, any KV load failure triggers a full recomputation of the request. For full-attention models, a more granular partial recomputation is performed. The implementation correctly handles multiple KV cache groups with varying block sizes, a core requirement for HMA. The changes to how num_computed_tokens is managed for asynchronous loads and the updates to the test suite are consistent and appropriate. The overall implementation appears solid and I did not find any issues of high or critical severity.
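The recovery policy the review describes can be summarized in a small sketch. The function name and parameters are hypothetical, assuming the behavior described above: full recomputation when any non-full-attention layer is present, partial recomputation down to the valid prefix otherwise.

```python
def plan_recovery(has_non_full_attention_layer: bool,
                  num_valid_tokens: int,
                  num_computed_tokens: int) -> int:
    """Return how many computed tokens a request may keep after a KV
    load failure (illustrative sketch, not vLLM's actual code)."""
    if has_non_full_attention_layer:
        # Sliding-window / SSM layers: state outside the window or the
        # recurrent state is lost, so a partial prefix cannot be trusted.
        # Recompute the request from scratch.
        return 0
    # Full attention only: the valid prefix remains usable, so keep it
    # (capped by what was actually computed).
    return min(num_valid_tokens, num_computed_tokens)
```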
ok, thanks for the update - sounds like I'll need to take a closer look this time 😄
tlrmchlsmth left a comment:

Can we add some tests to cover the failure cases we're trying to handle?
```python
# Count the number of prefix cached tokens.
if request.num_cached_tokens < 0:
    request.num_cached_tokens = request.num_computed_tokens
```

Why wasn't this needed earlier?
It was a bug: num_cached_tokens was not set for async-loaded requests.
```python
else:
    # Sync loading. num_computed_tokens includes new tokens
    req_num_computed_tokens = request.num_cached_tokens
```

I don't fully understand how we're covering this case.
This is another bug I'm fixing here:
req_num_computed_tokens is supposed to be local_gpu_tokens + num_external_tokens.
For the sync case, it was taken from request.num_cached_tokens.
But this does not work for requests resuming from preemption, since request.num_cached_tokens is only set once in the request's lifetime.
So instead of using request.num_cached_tokens we use request.num_computed_tokens - num_scheduled_tokens to get local_gpu_tokens + num_external_tokens.
This also simplifies the code, which previously had different handling for sync and async.
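A toy sketch of the accounting issue discussed above (attribute names mirror the discussion; the class is a stand-in, not the real scheduler `Request`):

```python
class Req:
    """Minimal stand-in for a scheduler request (illustrative only)."""
    def __init__(self):
        self.num_cached_tokens = -1   # set once, on first schedule
        self.num_computed_tokens = 0


def computed_before_this_step(request: Req, num_scheduled_tokens: int) -> int:
    # local_gpu_tokens + num_external_tokens, valid even after preemption
    return request.num_computed_tokens - num_scheduled_tokens


req = Req()
# First schedule: 100 cached tokens plus 20 newly scheduled tokens.
req.num_cached_tokens = 100
req.num_computed_tokens = 120
assert computed_before_this_step(req, 20) == 100  # agrees with num_cached_tokens

# After preemption and resume, 120 tokens were recomputed plus 10 new ones,
# but num_cached_tokens still holds its stale first-lifetime value of 100.
req.num_computed_tokens = 130
assert computed_before_this_step(req, 10) == 120  # num_cached_tokens would say 100
```

The subtraction stays correct across preemptions because both operands are refreshed every scheduling step, while `num_cached_tokens` is not.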
vllm/v1/core/sched/scheduler.py (outdated):

```python
req_num_computed_tokens = request.num_computed_tokens
...
is_affected = True
# iterate request blocks by group
```
nit:

```python
# iterate request blocks by group
assert len(self.kv_cache_config.kv_cache_groups) == len(req_block_id_groups)
```
vllm/v1/core/sched/scheduler.py (outdated):

```python
if is_affected and self.has_non_full_attention_layer:
    break
```
I don't think I'm following: if we have more than one group, that implies they're not all full attention.
Why would we still loop then?
As discussed offline, I don't see an assertion for that (that multiple groups imply non-full attention) in kv_cache_utils.py, so I prefer not to assume it, even for the sake of future compatibility.
I prefer to add an assert for this assumption here. I can't imagine a case where all layers are full attention but we create different groups for them. Even if we have multiple attention layers with different hidden sizes, we will merge them into one group (e.g., deepseek 3.2). I'd prefer to revisit this path if a new case forces us to create multiple groups for full attention, as the reason for it may be strange.
```python
# on a full prompt hit, we need to re-compute the last token
# in order to be able to sample the next token
if request.num_computed_tokens >= request.num_tokens:
    request.num_computed_tokens = request.num_tokens - 1
```

Why do we do this after `self.kv_cache_manager.cache_blocks(request, request.num_computed_tokens)`?
The reason we decrease by 1 is that the model runner cannot run with num_scheduled_tokens = 0 per request.
So this is basically a model-runner limitation.
cache_blocks has nothing to do with the model runner; it does not have the limitation that requires us to decrease by 1.
Decreasing by 1 before calling cache_blocks means we don't cache the last block.
This block is valid, so why not cache it?
For sync connectors we do cache the last block when we call allocate_slots:

```python
num_tokens_to_cache = min(
    total_computed_tokens + num_new_tokens,
    request.num_tokens,
)
self.coordinator.cache_blocks(request, num_tokens_to_cache)
```

Why should the behavior be different for async connectors?
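The ordering being argued for above can be sketched in a hypothetical helper (`finalize_async_load` and the `cache_blocks` callback are illustrative names, not the scheduler's API): cache the full valid prefix first, then back off one token only for scheduling purposes.

```python
def finalize_async_load(num_computed_tokens: int,
                        num_prompt_tokens: int,
                        cache_blocks) -> int:
    """Cache the full valid prefix, then back off one token for sampling.

    Illustrative sketch: cache_blocks is a callback taking the number
    of tokens to cache.
    """
    # The whole loaded prefix is valid, including the last block,
    # so cache it before any adjustment.
    cache_blocks(min(num_computed_tokens, num_prompt_tokens))

    # The model runner cannot schedule 0 tokens per request, so on a
    # full prompt hit re-compute the final token to produce logits.
    if num_computed_tokens >= num_prompt_tokens:
        num_computed_tokens = num_prompt_tokens - 1
    return num_computed_tokens
```

Reversing the two steps would cache one block fewer on a full prompt hit, even though that block's contents are valid.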
Force-pushed from f02a5c8 to 0000209
I've made some changes:
Force-pushed from 90a9f27 to af56820
This commit extends the KV load recovery flow to support HMA.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Force-pushed from af56820 to 3fe5f8a
This PR extends the KV load recovery flow to support HMA.
Models using sliding window will fail on error instead of recomputing.
Depends on #34616.