Lazily initialize null_block only when it is needed by elaineyz · Pull Request #26939 · vllm-project/vllm

elaineyz · 2025-10-15T19:30:49Z

Purpose

After PR #14097 we now always subtract 1 usable block from the block pool and designate it as the null_block, even when sliding window attention is not used. This creates an edge case when users specify the minimum required number of blocks for their model & config via --num-gpu-blocks-override but then vllm comes and deducts 1 from that (if user specifies 1 block it's effectively 0).

This PR proposes to lazily-initialize the null_block only when it is called (mainly in remove_skipped_blocks). As a result, for kv cache managers like FullAttentionManager, the null_block never materializes and all blocks are available for allocation.

Test Plan

Added unit tests.

Also tested the change E2E locally on Neuron device with the following server command and request:

python3 -m vllm.entrypoints.openai.api_server \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --max-num-seqs 1 \
  --max-model-len 128 \
  --tensor-parallel-size 64 \
  --enable-prefix-caching \
  --block-size 32 \
  --num-gpu-blocks-override 4 \
  --additional-config '{
    "override_neuron_config": {
        "is_prefix_caching": true,
        "is_block_kv_layout": true,
        "pa_num_blocks": 4,
        "pa_block_size": 32,
    }
  }' \
  --port 8000

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "prompt": ["You are an expert software engineer and DevOps specialist. Tell me a story."],
  "min_tokens": 109,
  "max_tokens": 109,
  "temperature": 0
}'

Test Result

Without the change, the vllm scheduler is unable to schedule the request since it only sees 3 blocks available. It will repeatedly send an empty SchedulerOutput to the model runner and the server hangs indefinitely.

After the change, the request goes through successfully.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

vllm/v1/core/block_pool.py

…ee blocks exist Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

vllm/v1/core/single_type_kv_cache_manager.py

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

vllm/v1/core/single_type_kv_cache_manager.py

…block as null block Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

heheda12345 · 2025-10-20T04:06:19Z

I think lazy initialization is a bit magic. Can you take null_block and --num-gpu-blocks-override into consideration in check_enough_kv_cache_memory of vllm/v1/core/kv_cache_utils.py so that vLLM can raise an error if it can't schedule one request with max-model-len tokens.

elaineyz · 2025-10-21T18:24:28Z

I think lazy initialization is a bit magic. Can you take null_block and --num-gpu-blocks-override into consideration in check_enough_kv_cache_memory of vllm/v1/core/kv_cache_utils.py so that vLLM can raise an error if it can't schedule one request with max-model-len tokens.

Thanks @heheda12345.

I agree that check_enough_kv_cache_memory should be updated to account for:

--num-gpu-blocks-override, which may artificially limit the usable KV cache (now tracked under issue 27181, thanks!)

In addition, I think it should also account for:

the presence (or lazy creation) of null_block, which could reduce the effective number of available blocks.

However, those validations seem orthogonal to this PR. The goal of this PR is purely to introduce lazy initialization of null_block to avoid unnecessary early allocations. I would like to:

Land this PR to keep the lazy initialization logic self-contained. I understand that lazy initialization can feel a bit “magic,” but the intent here is purely to defer allocation of null_block until it’s actually needed, which reduces memory footprint and improves startup performance without changing functional correctness. In fact, it is more "logically correct" for paths like full attention where the creation of a null_block doesn't even make sense.
In a separate PR, extend check_enough_kv_cache_memory to explicitly consider both num_gpu_blocks_override and null_block so vLLM can fail early if the configuration can’t handle a full max_model_len request. I see that v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check #27238 is already created. Happy to support that effort by @khaled-wsa.

Let me know your thoughts.

elaineyz · 2025-10-31T00:00:24Z

Closing as not planned. Discussed offline with @heheda12345, the conclusion is that lazy initialization here will make the logic overly complicated, and the preference is to keep the code base simple. vLLM does not plan to make every feature backwards compatible with v0.

Separate work on making sure there are no silent failures due to this will be carried out in #27238

lazily initialize null_block

8c96e0a

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

mergify bot added the v1 label Oct 15, 2025

Fix mypy

66c4d99

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

elaineyz marked this pull request as ready for review October 15, 2025 19:54

elaineyz requested review from ApostaC, WoosukKwon, alexm-redhat, comaniac, heheda12345, njhill, robertgshaw2-redhat and ywang96 as code owners October 15, 2025 19:54

clean up unused code

fbec44c

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

chatgpt-codex-connector bot reviewed Oct 15, 2025

View reviewed changes

vllm/v1/core/block_pool.py Show resolved Hide resolved

handle an edge case where we need to allocate a null_block when no fr…

71a3cf6

…ee blocks exist Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

chatgpt-codex-connector bot reviewed Oct 15, 2025

View reviewed changes

vllm/v1/core/single_type_kv_cache_manager.py Show resolved Hide resolved

vllm/v1/core/single_type_kv_cache_manager.py Show resolved Hide resolved

bug fix

c659801

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

elaineyz force-pushed the null_block branch from 5a7ac72 to c659801 Compare October 15, 2025 21:43

chatgpt-codex-connector bot reviewed Oct 15, 2025

View reviewed changes

vllm/v1/core/single_type_kv_cache_manager.py Outdated Show resolved Hide resolved

elaineyz added 2 commits October 15, 2025 22:19

check for _null_block in block pool before repurposing first removed …

405f204

…block as null block Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

fix indentation

d93f7ee

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

heheda12345 mentioned this pull request Oct 20, 2025

[Bug]: check_enough_kv_cache_memory didn't consider num_gpu_blocks_override #27181

Closed

1 task

elaineyz mentioned this pull request Oct 21, 2025

v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check #27238

Closed

elaineyz closed this Oct 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Lazily initialize null_block only when it is needed#26939

Lazily initialize null_block only when it is needed#26939
elaineyz wants to merge 7 commits intovllm-project:mainfrom
aws-neuron:null_block

elaineyz commented Oct 15, 2025 •

edited by github-actions bot

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

heheda12345 commented Oct 20, 2025

Uh oh!

elaineyz commented Oct 21, 2025 •

edited

Loading

Uh oh!

elaineyz commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

elaineyz commented Oct 15, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

heheda12345 commented Oct 20, 2025

Uh oh!

elaineyz commented Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elaineyz commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

elaineyz commented Oct 15, 2025 •

edited by github-actions bot

Loading

elaineyz commented Oct 21, 2025 •

edited

Loading