
[MISC][Bugfix] Use less CPU when message queue has been empty for some time #16226

Merged
njhill merged 2 commits into vllm-project:main from p12tic:reduce-spinning-when-no-activity
Jun 5, 2025

Conversation

@p12tic
Contributor

@p12tic p12tic commented Apr 8, 2025

In setups which have long inactivity periods it is desirable to reduce system power consumption when vllm does nothing. This would lead to more CPU thermal headroom when a request eventually comes, especially when multiple GPUs are connected as each GPU would otherwise pin one thread at 100% CPU usage.

I didn't include any configuration knobs because I couldn't think of anyone who would be adversely impacted by this change. It introduces a maximum additional latency of 100ms when vllm workers have been inactive for 10 seconds or more. It seems like someone with an inactive vllm wouldn't care about a small amount of additional latency, but please let me know if you think otherwise.

EDIT: Final version of the PR contains new environment variable VLLM_SLEEP_WHEN_IDLE to enable CPU usage reduction. Set VLLM_SLEEP_WHEN_IDLE=1 to reduce power consumption at the expense of small amount of additional latency.

FIX #14799, #16968, #16660
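The idle-backoff idea described above can be sketched as follows. This is a minimal illustration, not vLLM's actual implementation; `try_get`, the 10-second threshold, and the 100ms sleep are stand-ins for the PR's real internals:

```python
import time

def poll_with_backoff(try_get, idle_threshold=10.0, idle_sleep=0.1):
    """Poll try_get() until it yields a message.

    While messages are flowing, spin as before (lowest latency).
    Once the queue has been empty for idle_threshold seconds, sleep
    idle_sleep between polls so an idle worker stops pinning a core;
    the worst-case extra latency is idle_sleep.
    """
    last_activity = time.monotonic()
    while True:
        msg = try_get()  # non-blocking; returns None when the queue is empty
        if msg is not None:
            return msg
        if time.monotonic() - last_activity >= idle_threshold:
            time.sleep(idle_sleep)
```

While the queue is active this behaves exactly like the original spin loop; only a sustained idle period changes the polling rate, which is why the latency cost is bounded and only paid when nobody is waiting.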

@github-actions

github-actions bot commented Apr 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 6d50290 to 24bf316 Compare April 8, 2025 02:48
@p12tic p12tic changed the title from "[MISC] Use less CPU when message queue has been empty for some time" to "[MISC][Bugfix] Use less CPU when message queue has been empty for some time" Apr 8, 2025
@MattJCroteau

I was experiencing 100% CPU load on 0.8.3, applied this PR and it's now fixed. Thank you.

@zacksiri

Any chance this can get merged?

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 24bf316 to 2eea42d Compare April 21, 2025 19:07
@p12tic
Contributor Author

p12tic commented Apr 21, 2025

Rebased on latest main

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 2eea42d to 5aa28ba Compare April 21, 2025 19:14
@p12tic
Contributor Author

p12tic commented Apr 21, 2025

@youkaichao, @WoosukKwon, @mgoin Could you please take a look? I cannot request reviews, thus pinging directly. Thank you very much.

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 5aa28ba to 447a8f4 Compare April 21, 2025 19:24
@RawthiL

RawthiL commented Apr 28, 2025

+1 to this.

I'm using vllm:6.2 in my local environment just because of this; it's really annoying...

@p12tic
Contributor Author

p12tic commented Apr 29, 2025

@RawthiL FYI you can apply this PR to a recent release using docker as follows:

Dockerfile:

    FROM docker.io/vllm/vllm-openai:v0.8.5
    ADD vllm.patch /vllm.patch
    RUN bash -c "cd /usr/local/lib/python3.12/dist-packages/ && patch -p1 < /vllm.patch"

vllm.patch can be generated by checking out this branch and running git diff HEAD^^..HEAD > vllm.patch, or just use the file attached to this comment. vllm.patch must be placed in the Docker build context.

In docker-compose add the following for the vllm service:

    build:
      context: .

EDIT: Attached the patch as a file, since the command to generate it is now more complex. --sleep-on-idle needs to be passed to vllm to turn on the feature. sleep-on-idle.txt

@ahmeteminkocal

+1

@RomainBrault

RomainBrault commented May 13, 2025

any update on this @youkaichao @WoosukKwon @mgoin ?

@nOneKzero

+1

@lingej

lingej commented May 15, 2025

+1

@bengsV

bengsV commented May 15, 2025

+1

@Jollokim

+1

@mgoin
Member

mgoin commented May 15, 2025

Could you change the behavior so it is off by default, preserving the original behavior? I don't think we want to disturb existing deployments that rely on fast responses regardless of continuous load. I think it is okay to add an opt-in config.

@youkaichao
Member

Agree with @mgoin that we can provide an option to enable the behavior, but we should not use it by default, since normally we expect the server to be always busy (if it is idle for quite a long time, an upper-level monitor should shut down the server).

@njhill
Member

njhill commented May 15, 2025

I was wondering whether we can just use zmq instead of the shm spin there? (It already supports both and falls back to zmq for large messages or remote cases anyhow.) I suspect the perf difference may not be noticeable. And there could also be a config toggle for that in case there is any difference in latency.

@xanadu-3g

@youkaichao Consider our use case: the vllm server gets a few thousand queries a day, mostly during working hours, and the average inference takes 5 seconds, so a few milliseconds of extra latency is irrelevant. Starting vllm takes more than a minute, so shutting it down is not an option. Saturating 2 CPU cores increases the system's power usage by 2.5x.

@chclaus

chclaus commented May 16, 2025

+1

@p12tic
Contributor Author

p12tic commented May 16, 2025

Could you change the behavior so it is off by default, preserving original behavior?

Will do, thanks.

@p12tic
Contributor Author

p12tic commented May 16, 2025

@njhill

I was wondering whether we can just use zmq instead of the shm spin there?

Using zmq does not necessarily mean that there won't be a busy loop. I fixed a very similar issue in sglang in their zmq code path (sgl-project/sglang#6026).

@njhill
Member

njhill commented May 16, 2025

@njhill

I was wondering whether we can just use zmq instead of the shm spin there?

Using zmq does not necessarily mean that there won't be busy loop. I fixed very similar issue in sglang in their zmq code path (sgl-project/sglang#6026).

@p12tic there won't be any busy loop if zmq is used properly. Maybe they were reading non-blocking in a busy loop, but that isn't what we do anywhere. I think we should just switch to that unless there's a measurable performance difference vs. using shm.
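For reference, the blocking pattern described here looks roughly like the following in pyzmq. This is a minimal sketch, not vLLM's code; the inproc address and the PUSH/PULL pairing are illustrative:

```python
import threading
import zmq

ctx = zmq.Context.instance()
pull = ctx.socket(zmq.PULL)
pull.bind("inproc://mq-demo")

received = []

def consumer():
    # A plain blocking recv parks the thread inside zmq's own poller
    # until a message arrives, so an idle consumer uses ~0% CPU,
    # unlike recv(zmq.NOBLOCK) called repeatedly in a busy loop.
    received.append(pull.recv_string())

t = threading.Thread(target=consumer)
t.start()

push = ctx.socket(zmq.PUSH)
push.connect("inproc://mq-demo")
push.send_string("hello")
t.join()

push.close()
pull.close()
```

The point is that blocking on the socket hands the waiting over to the OS, so no polling loop (and no sleep tuning) is needed at all; the trade-off discussed in the thread is whether zmq's wakeup latency matches the shm spin's.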

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 447a8f4 to dcae20d Compare May 22, 2025 08:11
@p12tic p12tic requested a review from WoosukKwon as a code owner May 22, 2025 08:11
@mergify mergify bot added the needs-rebase label Jun 2, 2025
@njhill
Member

njhill commented Jun 2, 2025

@p12tic are you ok with updating this to use an env var instead of CLI arg?

@p12tic
Contributor Author

p12tic commented Jun 4, 2025

@p12tic are you ok with updating this to use an env var instead of CLI arg?

Sorry, I only added a reaction to your comment.

In setups which have long inactivity periods it is desirable to reduce
system power consumption when vllm does nothing. The simplest approach is to
reduce polling frequency when there has been no activity for a certain period
of time.

Fixes: vllm-project#14799
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 58d4caa to 219f47a Compare June 4, 2025 20:25
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
@p12tic
Contributor Author

p12tic commented Jun 4, 2025

@njhill Sorry for the delay again, I've implemented and tested an env-variable-based solution.

@p12tic p12tic requested a review from njhill June 4, 2025 21:35
Member

@njhill njhill left a comment


Thanks @p12tic

@njhill
Member

njhill commented Jun 4, 2025

Actually @p12tic would you be ok to add a test for this? It can just be a copy of one of the existing simple tests but with the env var set; otherwise we're not exercising that path at all.

@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 219f47a to 62eb078 Compare June 5, 2025 07:58
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
@p12tic p12tic force-pushed the reduce-spinning-when-no-activity branch from 62eb078 to 2a1d615 Compare June 5, 2025 08:08
@p12tic
Contributor Author

p12tic commented Jun 5, 2025

@njhill Done, let me know if I should have added the test in a more appropriate place.

@zacksiri

zacksiri commented Jun 5, 2025

@p12tic thank you so much for this patch. I've been following this thread, and your patience and persistence are greatly appreciated.

Thanks to the vllm team for merging this patch, I believe it will be beneficial for a lot of users who have smaller deployments.

Member

@njhill njhill left a comment


Thanks again @p12tic for your work and patience!

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2025
@njhill njhill enabled auto-merge (squash) June 5, 2025 15:00
@njhill njhill merged commit 85e2b7b into vllm-project:main Jun 5, 2025
79 checks passed
@p12tic p12tic deleted the reduce-spinning-when-no-activity branch June 5, 2025 18:24
leoli1208 pushed a commit to leoli1208/vllm that referenced this pull request Jul 22, 2025
[MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <povilas@radix.lt>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: v0.7.4 dev version CPU usage remains at 100% even when no requests are being processed.