[bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py#5132
Conversation
…only
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Code Review
This pull request aims to optimize the encoder cache manager by operating on embeddings instead of tokens, which should reduce memory consumption. The change in vllm_ascend/core/recompute_scheduler.py reflects this by updating how the encoder compute budget is restored upon preemption. While this change aligns with the PR's goal, it introduces a potential critical issue due to a unit mismatch. The encoder compute budget is restored using a count of embeddings, but it appears to be initialized and used elsewhere within the same file as a count of tokens. This inconsistency could lead to incorrect resource allocation and scheduling behavior.
```diff
-        num_tokens_to_restore = sum(
-            preempted_req.get_num_encoder_tokens(i)
-            for i in preempted_encoder_inputs)
-        encoder_compute_budget += num_tokens_to_restore
+        num_embeds_to_restore = sum(
+            preempted_req.get_num_encoder_embeds(i)
+            for i in preempted_encoder_inputs)
+        encoder_compute_budget += num_embeds_to_restore
```
There appears to be a potential unit mismatch in the accounting of `encoder_compute_budget`. This change restores the budget using the number of embeddings (`get_num_encoder_embeds`), which implies the budget is tracked in units of embeddings.
However, `encoder_compute_budget` is initialized with `self.max_num_encoder_input_tokens` on line 116. This same configuration value is also used as `num_encoder_tokens` when calling `self.kv_cache_manager.allocate_slots` on lines 451-458. This suggests the budget is still measured in tokens.
If one embedding does not correspond to exactly one token, this will lead to incorrect budget calculations, potentially causing scheduling failures or overallocation.
For example:
- If `encoder_compute_budget` is in tokens, then restoring it with the number of embeddings is incorrect; it should be restored with the number of tokens corresponding to the preempted embeddings.
- If `encoder_compute_budget` is meant to be in embeddings, then the configuration `max_num_encoder_input_tokens` is misnamed, and its usage as `num_encoder_tokens` in `allocate_slots` is likely incorrect, since that method probably expects a token count for allocating cross-attention KV cache.
This inconsistency could lead to critical issues. Please verify the units used for `encoder_compute_budget` throughout the scheduler and ensure they are consistent.
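To make the mismatch concrete, here is a minimal runnable sketch; `FakeRequest` and all numbers are invented for illustration and are not vLLM code:

```python
# Hedged illustration of the unit mismatch described above. Suppose one
# multimodal input produces 256 encoder embeddings that expand to 1024
# placeholder tokens in the prompt; the two counts need not be equal.
class FakeRequest:
    def __init__(self, embeds_per_input, tokens_per_input):
        self._embeds = embeds_per_input
        self._tokens = tokens_per_input

    def get_num_encoder_embeds(self, i: int) -> int:
        return self._embeds[i]

    def get_num_encoder_tokens(self, i: int) -> int:
        return self._tokens[i]

req = FakeRequest(embeds_per_input=[256], tokens_per_input=[1024])
budget = 2048                             # initialized in token units
budget -= req.get_num_encoder_tokens(0)   # spend in tokens -> 1024 left
budget += req.get_num_encoder_embeds(0)   # restore in embeds -> 1280, not 2048
assert budget == 1280                     # 768 tokens of budget leaked
```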
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Please make sure both vllm main and vllm v0.13.0 work. Thanks.
@wangxiyuan Thank you for the reminder. I will resolve it soon.
@wangxiyuan Hi, I have confirmed that the changes in this PR are fully compatible with the latest versions: vllm main, vllm v0.13.0rc2, v0.13.0rc3, v0.13.0rc4, v0.13.0, v0.14.0rc0, etc.
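For reference, one common way to keep a plugin working across such an upstream rename is a `hasattr` guard. This is only a hypothetical sketch, not code from this PR, and the helper name is invented:

```python
# Hypothetical compatibility shim for the upstream rename of
# get_num_encoder_tokens() to get_num_encoder_embeds(). Not part of this PR.
def num_encoder_embeds_compat(request, input_id: int) -> int:
    if hasattr(request, "get_num_encoder_embeds"):
        # vLLM versions after vllm-project/vllm#30475
        return request.get_num_encoder_embeds(input_id)
    # Older vLLM releases still expose the token-based accessor.
    return request.get_num_encoder_tokens(input_id)
```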
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (24 commits)
  add dispath_ffn_combine_bf16 (vllm-project#5866)
  [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] (vllm-project#5932)
  [1/N][Feat] Xlite Qwen3 MoE Support (vllm-project#5951)
  [Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5945)
  [bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py (vllm-project#5132)
  [Bugfix] fix pcp qwen full graph FIA bug (vllm-project#6037)
  [Bugfix]Fixed precision issues caused by pooled request pooling (vllm-project#6049)
  【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (vllm-project#6045)
  [main][Bugfix] Fixed an problem related to embeddings sharing (vllm-project#5967)
  [Feature]refactor the npugraph_ex config, support online-infer with static kernel (vllm-project#5775)
  [CI][Lint] Show lint diff on failure (vllm-project#5956)
  [CI] Add wait logic for each individual case (vllm-project#6036)
  [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (vllm-project#4633)
  model runner v2 support triton of penalty (vllm-project#5854)
  [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (vllm-project#6034)
  [Tests] move qwen3 performance test from nightly to e2e (vllm-project#5980)
  [Bugfix] fix bug of pcp+mtp+async scheduler (vllm-project#5994)
  [Main2Main] Upgrade vllm commit to releases/v0.14.0 (vllm-project#5988)
  [Ops] Add layernorm for qwen3Next (vllm-project#5765)
  [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (vllm-project#5921)
  ...
…in recompute_schedule.py (vllm-project#5132)

### What this PR does / why we need it?
adapt to: vllm-project/vllm#30475. just change get_num_encoder_tokens() to get_num_encoder_embeds() in recompute_schedule.py, which seems that it is currently not in use. The get_num_encoder_tokens() function in VLLM no longer exists.

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: huangning1995 <huangning12@huawei.com>
…_embeds in recompute_schedule.py (vllm-project#5132)"

This reverts commit 2dc93a4.
What this PR does / why we need it?
adapt to: vllm-project/vllm#30475.
Just change get_num_encoder_tokens() to get_num_encoder_embeds() in recompute_schedule.py, which appears to be currently unused. The get_num_encoder_tokens() function no longer exists in vLLM.
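For context, a minimal sketch of the touched logic after this change, using the names visible in the diff above; the function wrapper and everything not shown in the diff are assumptions, not the actual vllm-ascend code:

```python
# Sketch of the budget-restore path in recompute_schedule.py after this PR.
# `preempted_req`, `preempted_encoder_inputs`, and `encoder_compute_budget`
# come from the diff above; the wrapper itself is illustrative only.
def restore_encoder_budget(preempted_req, preempted_encoder_inputs,
                           encoder_compute_budget: int) -> int:
    # Count the cached encoder embeddings freed by preempting this request.
    num_embeds_to_restore = sum(
        preempted_req.get_num_encoder_embeds(i)
        for i in preempted_encoder_inputs)
    # Return the freed capacity to the shared encoder compute budget.
    return encoder_compute_budget + num_embeds_to_restore
```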