Lazily initialize null_block only when it is needed#26939
Lazily initialize null_block only when it is needed#26939elaineyz wants to merge 7 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
…ee blocks exist Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
…block as null block Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
|
I think lazy initialization is a bit magic. Can you take |
Thanks @heheda12345. I agree that
In addition, I think it should also account for:
However, those validations seem orthogonal to this PR. The goal of this PR is purely to introduce lazy initialization of null_block to avoid unnecessary early allocations. I would like to:
Let me know your thoughts. |
|
Closing as not planned. Discussed offline with @heheda12345, the conclusion is that lazy initialization here will make the logic overly complicated, and the preference is to keep the code base simple. vLLM does not plan to make every feature backwards compatible with v0. Separate work on making sure there are no silent failures due to this will be carried out in #27238 |
Purpose
After PR #14097 we now always subtract 1 usable block from the block pool and designate it as the
null_block, even when sliding window attention is not used. This creates an edge case when users specify the minimum required number of blocks for their model & config via--num-gpu-blocks-overridebut then vllm comes and deducts 1 from that (if user specifies 1 block it's effectively 0).This PR proposes to lazily-initialize the null_block only when it is called (mainly in
remove_skipped_blocks). As a result, for kv cache managers like FullAttentionManager, thenull_blocknever materializes and all blocks are available for allocation.Test Plan
Added unit tests.
Also tested the change E2E locally on Neuron device with the following server command and request:
Test Result
Without the change, the vllm scheduler is unable to schedule the request since it only sees 3 blocks available. It will repeatedly send an empty SchedulerOutput to the model runner and the server hangs indefinitely.
After the change, the request goes through successfully.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.