
Fix scheduler yield on arm #30228

Open

wangxiyuan wants to merge 8 commits into vllm-project:main from wangxiyuan:fix_yield

Conversation

@wangxiyuan
Contributor

@wangxiyuan wangxiyuan commented Dec 8, 2025

Purpose

On Arm systems, os.sched_yield does not take effect, so the GIL (Global Interpreter Lock) is never relinquished and the worker processes become CPU bound. The process should execute time.sleep(0) instead to release the GIL.
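A minimal sketch of the shape of the change (the PR touches vllm/distributed/utils.py; the import paths are assumed here and the existing Linux/Python-version checks are simplified, so the real code may differ slightly):

```python
# Sketch only: gate the busy-wait yield on CPU architecture so that Arm
# falls back to time.sleep(0), which reliably releases the GIL.
import os
import sys
import time

from vllm.platforms import CpuArchEnum, Platform  # assumed import location

# os.sched_yield is preferred on Linux (existing version checks simplified
# here), but on Arm it does not relinquish the GIL, so polling loops spin.
USE_SCHED_YIELD = (
    sys.platform.startswith("linux")
    and sys.version_info >= (3, 10)
    and Platform.get_cpu_architecture() != CpuArchEnum.ARM
)


def sched_yield() -> None:
    """Yield inside busy-wait loops; time.sleep(0) releases the GIL."""
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)
```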

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Ensures polling yields the GIL on ARM.

  • Update vllm/distributed/utils.py: USE_SCHED_YIELD now also checks Platform.get_cpu_architecture() and disables os.sched_yield on ARM, falling back to time.sleep(0)
  • Add imports for CpuArchEnum and Platform; update comments accordingly

Written by Cursor Bugbot for commit 0274e03.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a critical issue on ARM systems where os.sched_yield fails to relinquish the Global Interpreter Lock (GIL), leading to CPU-bound performance problems. The change correctly modifies the USE_SCHED_YIELD logic to fall back to time.sleep(0) on ARM architectures, ensuring proper GIL release and improving system responsiveness. The addition of the CpuArchEnum and Platform imports is appropriate for this detection. The code is clear and directly resolves the described problem.

@robertgshaw2-redhat
Collaborator

Does this fix: #29369?

@heheda12345
Collaborator

@tlrmchlsmth can you check this on gb200?

@tlrmchlsmth
Member

I or someone on my team will look into this, but I'm not sure what we should look out for.

What should we expect to see if the os.sched_yield isn't taking effect?

@amohoste

amohoste commented Dec 18, 2025

We encountered a similar issue running P2P KV Cache sharing through vllm-ascend + LMCache-Ascend on ARM, Python 3.11.13. In this scenario, there is an async transfer function to load prefix caches while the main thread continues to do other work. When os.sched_yield is used, the async transfer function is typically starved for 100ms+ before the transfer operations are submitted to the device.


When applying the patch to use time.sleep(0) instead, the async_batched_write function that submits the transfer operations to the device completes within 1.2 ms, as expected.

@wangxiyuan
Contributor Author

wangxiyuan commented Dec 29, 2025

@heheda12345 @robertgshaw2-redhat @tlrmchlsmth Sorry for the late reply. When running vLLM with world size > 1 on an Arm machine, the worker processes always use 100% CPU after the server starts.

Reproduce command:
vllm serve Qwen/Qwen3-0.6B --tensor-parallel-size 2

Then the top result is:

I think this can be reproduced on GH200 as well.

@wangxiyuan
Contributor Author

@tlrmchlsmth would you mind taking a look at this one? Thanks.
