
Fix scheduler yield on arm #30228

Open

wangxiyuan wants to merge 8 commits into vllm-project:main from wangxiyuan:fix_yield

Conversation

@wangxiyuan
Contributor

@wangxiyuan wangxiyuan commented Dec 8, 2025

Purpose

On Arm systems, os.sched_yield does not take effect, so the GIL (Global Interpreter Lock) is never relinquished and the worker processes become CPU bound. The process should execute time.sleep(0) instead to release the GIL.
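A minimal sketch of the shape of the change (the PR touches vllm/distributed/utils.py; the import paths are assumed here and the existing Linux/Python-version checks are simplified, so the real code may differ slightly):

```python
# Sketch only: gate the busy-wait yield on CPU architecture so that Arm
# falls back to time.sleep(0), which reliably releases the GIL.
import os
import sys
import time

from vllm.platforms import CpuArchEnum, Platform  # assumed import location

# os.sched_yield is preferred on Linux (existing version checks simplified
# here), but on Arm it does not relinquish the GIL, so polling loops spin.
USE_SCHED_YIELD = (
    sys.platform.startswith("linux")
    and sys.version_info >= (3, 10)
    and Platform.get_cpu_architecture() != CpuArchEnum.ARM
)


def sched_yield() -> None:
    """Yield inside busy-wait loops; time.sleep(0) releases the GIL."""
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)
```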

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Ensures polling yields the GIL on ARM.

  • Update vllm/distributed/utils.py: USE_SCHED_YIELD now also checks Platform.get_cpu_architecture() and disables os.sched_yield on ARM, falling back to time.sleep(0)
  • Add imports for CpuArchEnum and Platform; update comments accordingly

Written by Cursor Bugbot for commit 0274e03.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a critical issue on ARM systems where os.sched_yield fails to relinquish the Global Interpreter Lock (GIL), leading to CPU-bound performance problems. The change correctly modifies the USE_SCHED_YIELD logic to fall back to time.sleep(0) on ARM architectures, ensuring proper GIL release and improving system responsiveness. The addition of the CpuArchEnum and Platform imports is appropriate for this detection. The code is clear and directly resolves the described problem.

@robertgshaw2-redhat
Collaborator

Does this fix: #29369?

@heheda12345
Collaborator

@tlrmchlsmth can you check this on gb200?

@tlrmchlsmth
Member

I or someone on my team will look into this, but I'm not sure what we should look out for.

What should we expect to see if the os.sched_yield isn't taking effect?

@amohoste

amohoste commented Dec 18, 2025

We encountered a similar issue running P2P KV Cache sharing through vllm-ascend + LMCache-Ascend on ARM, Python 3.11.13. In this scenario, there is an async transfer function to load prefix caches while the main thread continues to do other work. When os.sched_yield is used, the async transfer function is typically starved for 100ms+ before the transfer operations are submitted to the device.


When applying the patch to use time.sleep(0) instead, the async_batched_write function that submits the transfer operations to the device completes within 1.2 ms, as expected.

@wangxiyuan
Contributor Author

wangxiyuan commented Dec 29, 2025

@heheda12345 @robertgshaw2-redhat @tlrmchlsmth Sorry for the late reply. When running vLLM with world size > 1 on an Arm machine, the worker processes always use 100% CPU after the server starts.

Reproduce command:
vllm serve Qwen/Qwen3-0.6B --tensor-parallel-size 2

Then the top result is:

I think this can be reproduced on GH200 as well.

@wangxiyuan
Contributor Author

@tlrmchlsmth would you mind taking a look at this one? Thanks.
