[MISC][Bugfix] Use less CPU when message queue has been empty for some time#16226
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
6d50290 to
24bf316
Compare
I was experiencing 100% CPU load on 0.8.3; after applying this PR it's now fixed. Thank you.
Any chance this can get merged?
24bf316 to
2eea42d
Compare
Rebased on latest
2eea42d to
5aa28ba
Compare
@youkaichao, @WoosukKwon, @mgoin Could you please take a look? I cannot request reviews, so I'm pinging directly. Thank you very much.
5aa28ba to
447a8f4
Compare
+1 to this. I'm using
@RawthiL FYI you can apply this PR to a recent release using Docker as follows. Dockerfile: vllm.patch can be generated by checking out this branch and doing In docker-compose, add the following for the vllm service: EDIT: Attached the patch as a file, as the command to generate it is now more complex.
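For illustration only (the actual attachment from this comment is not reproduced here), a patched image could be built along these lines; the base image tag, install-path lookup, and patch strip level are assumptions, not the commenter's exact files:

```dockerfile
# Illustrative sketch; base image tag and install path are assumptions.
FROM vllm/vllm-openai:v0.8.3

# vllm.patch: produced from the PR branch, e.g. `git diff main...HEAD > vllm.patch`
COPY vllm.patch /tmp/vllm.patch

# Apply the patch to the installed vllm package (path and strip level may differ).
RUN cd "$(python3 -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')" \
    && patch -p2 < /tmp/vllm.patch
```

In docker-compose, the vllm service would then use `build: .` (or an image built from this Dockerfile) instead of pulling the stock image.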
+1
any update on this @youkaichao @WoosukKwon @mgoin?
+1 |
3 similar comments
+1
+1
+1
Could you change the behavior so it is off by default, preserving the original behavior? I don't think we want to disturb existing deployments that rely on fast responses regardless of continuous load. I think it is okay to add an opt-in config
Agree with @mgoin that we can provide an option to enable the behavior, but we should not use it by default, since normally we expect the server to be always busy (if it is idle for quite a long time, an upper-level monitor should shut down the server).
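A minimal sketch of the opt-in idle backoff under discussion; the function name, threshold, and sleep interval below are illustrative, not the PR's actual code:

```python
import time

def recv_with_idle_backoff(try_recv, *, idle_threshold_s=10.0, max_sleep_s=0.1):
    """Poll try_recv() until it yields a message.

    While messages are arriving, spin exactly as before. Once the queue
    has been empty for longer than idle_threshold_s, sleep max_sleep_s
    between polls, so the extra wake-up latency is bounded by max_sleep_s.
    """
    idle_since = time.monotonic()
    while True:
        msg = try_recv()
        if msg is not None:
            return msg
        if time.monotonic() - idle_since > idle_threshold_s:
            time.sleep(max_sleep_s)
```

With the defaults above, a fully idle worker wakes at most ten times per second instead of spinning at 100% CPU, while a busy worker never sleeps.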
I was wondering whether we can just use zmq instead of the shm spin there? (It already supports both and falls back to zmq for large messages or remote cases anyhow.) I suspect the perf difference may not be noticeable. And there could also be a config toggle for that in case there is any difference in latency.
@youkaichao Consider our use case: the vLLM server gets a few thousand queries a day, mostly in working hours, and the average inference takes 5 seconds, so a few milliseconds more latency has no relevance. Starting vLLM takes more than a minute, so shutting it down is not an option. Saturating 2 CPU cores increases the system's power usage by 2.5x.
+1 |
Will do, thanks. |
Using zmq does not necessarily mean that there won't be a busy loop. I fixed a very similar issue in sglang's zmq code path (sgl-project/sglang#6026).
@p12tic there won't be any busy loop if zmq is used properly. Maybe they were reading nonblocking in a busy loop but that isn't what we do anywhere. I think we should just switch to that unless there's a measurable performance difference to using shm. |
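To illustrate the point being made here: a properly blocking recv parks the waiting thread in the kernel instead of spinning. The sketch below uses a stdlib socketpair rather than zmq, but a blocking `zmq.Socket.recv()` behaves the same way, consuming no CPU while idle:

```python
import socket
import threading
import time

# A blocking recv consumes no CPU while waiting for data, unlike a
# nonblocking recv retried in a tight loop (the busy wait this PR addresses).
a, b = socket.socketpair()

def sender() -> None:
    time.sleep(0.05)  # simulate a producer that is idle for a while
    a.sendall(b"ready")

threading.Thread(target=sender, daemon=True).start()
data = b.recv(1024)  # parks in the kernel until data arrives; no spinning
```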
447a8f4 to
dcae20d
Compare
@p12tic are you ok with updating this to use an env var instead of a CLI arg?
Sorry, I only added a reaction to your comment.
In setups with long inactivity periods it is desirable to reduce system power consumption when vllm is doing nothing. The simplest approach is to reduce the polling frequency when there has been no activity for a certain period of time. Fixes: vllm-project#14799 Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
58d4caa to
219f47a
Compare
@njhill Sorry for the delay again; I've implemented and tested the env-variable-based solution.
Actually @p12tic would you be ok with adding a test for this? It can just be a copy of one of the existing simple tests but with the env var set; otherwise we're not exercising that path at all.
219f47a to
62eb078
Compare
Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
62eb078 to
2a1d615
Compare
@njhill Done, let me know if I should have added the test in a more appropriate place.
@p12tic thank you so much for this patch. I've been following this thread, and your patience and persistence are greatly appreciated. Thanks to the vLLM team for merging this patch; I believe it will be beneficial for a lot of users who have smaller deployments.
[MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <povilas@radix.lt>
In setups with long inactivity periods it is desirable to reduce system power consumption when vLLM is doing nothing. This would leave more CPU thermal headroom when a request eventually comes, especially when multiple GPUs are connected, as each GPU would otherwise pin one thread at 100% CPU usage.
I didn't include any configuration knobs because I couldn't think of anyone who would be adversely impacted by this change. It introduces a maximum additional latency of 100 ms when vLLM workers are inactive for 10 seconds or more. Someone running an idle vLLM instance likely wouldn't care about a small amount of additional latency. But please let me know if you think otherwise. EDIT: The final version of the PR adds a new environment variable, VLLM_SLEEP_WHEN_IDLE, to enable the CPU usage reduction. Set VLLM_SLEEP_WHEN_IDLE=1 to reduce power consumption at the expense of a small amount of additional latency.
FIX #14799, #16968, #16660
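Conceptually, the opt-in works like the sketch below; the constant values and helper name are hypothetical, and only the VLLM_SLEEP_WHEN_IDLE variable name comes from the PR:

```python
import os
import time

# Hypothetical constants; the merged implementation may use different values.
IDLE_THRESHOLD_S = 10.0  # start sleeping only after this much inactivity
MAX_SLEEP_S = 0.1        # bounds the extra wake-up latency to ~100 ms

def maybe_sleep(idle_for_s: float) -> None:
    """Sleep between queue polls only when opted in and idle long enough."""
    if os.getenv("VLLM_SLEEP_WHEN_IDLE", "0") == "1" and idle_for_s > IDLE_THRESHOLD_S:
        time.sleep(MAX_SLEEP_S)
```

With the variable unset, the polling loop behaves exactly as before, matching the maintainers' request that the new behavior be off by default.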