[Core] Add sleep level 0 mode with enqueue/wait pattern #33195
zhuohan123 merged 5 commits into vllm-project:main
Conversation
Code Review
The pull request introduces a new 'sleep level 0' mode, which allows pausing the engine's scheduling without offloading model weights or KV cache from GPU memory. This is implemented by introducing a scheduling_paused flag in EngineCore and modifying the step, sleep, wake_up, run_busy_loop, and _process_input_queue methods to respect this state. The LLM.generate method is refactored to use enqueue and wait_for_completion for a more flexible request handling pattern. The changes appear to correctly implement the intended functionality, providing a new mechanism for fine-grained control over engine activity without incurring the overhead of full memory offload. No critical or high-severity issues were identified.
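The pause/resume flow described in the review can be sketched with a toy model. The names below loosely mirror the PR description (scheduling_paused, step, sleep, wake_up); the class and its bodies are invented for illustration and are not vLLM's actual implementation:

```python
from collections import deque

class MiniEngineCore:
    """Toy model of an engine core with a 'sleep level 0' pause flag."""

    def __init__(self):
        self.scheduling_paused = False
        self.waiting = deque()   # requests accepted but not yet scheduled
        self.finished = []

    def add_request(self, req):
        # Requests are always accepted, even while paused.
        self.waiting.append(req)

    def sleep(self, level=1):
        if level == 0:
            # Level 0: pause scheduling only; no weight/KV-cache offload.
            self.scheduling_paused = True
        else:
            raise NotImplementedError("levels 1/2 offload memory (not modeled)")

    def wake_up(self, tags=None):
        if tags is None or "scheduling" in tags:
            self.scheduling_paused = False

    def step(self):
        # While paused, step() is a no-op: nothing gets scheduled.
        if self.scheduling_paused or not self.waiting:
            return None
        req = self.waiting.popleft()
        self.finished.append(req)
        return req

engine = MiniEngineCore()
engine.sleep(level=0)
engine.add_request("r1")
engine.add_request("r2")
assert engine.step() is None       # paused: requests queue up, nothing runs
engine.wake_up(tags=["scheduling"])
while engine.step():               # drain the whole queue as one batch
    pass
print(engine.finished)             # ['r1', 'r2']
```

The key property the review calls out is visible here: unlike levels 1/2, nothing is offloaded, so sleeping and waking carry no memory-movement overhead.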
Hi @jaewonlee-fb, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Force-pushed from e6a63aa to 63a8b06
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 63a8b06 to 7937cde
    CPU memory pressure.
    """
    self.reset_prefix_cache()
    if level > 0:
What's the behavior of level 0?
Will this cause any breakage if users used level 0 before?
    return self.wait_for_completion(use_tqdm=use_tqdm)
    def enqueue(
where do we expect to call this function?
houseroad left a comment:
Could you explain the usage of these new functions?
Force-pushed from 014e33a to 611888e
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 86d66c1 to f5a74b1
Force-pushed from f5a74b1 to f1ece8b
Force-pushed from b2ac14a to 65fca50
Force-pushed from 7dfe329 to cc187a3
Add level 0 sleep mode that pauses scheduling without touching GPU memory. This enables batched inference patterns where all requests are queued first, then processed together. Also adds enqueue() and wait_for_completion() methods to the LLM class for explicit control over request scheduling.

Level 0 sleep:
- Pauses scheduling but keeps accepting requests
- No GPU memory changes (unlike level 1/2)
- Wake up with tags=["scheduling"] to resume

Also adds a profile_prefix parameter to start_profile() for custom trace naming.

Signed-off-by: Jaewon Lee <jaewon@meta.com>
Signed-off-by: Jaewon Lee <jaewon@meta.com>
Level 0 sleep should only pause scheduling without any side effects. The sync path (llm.py) correctly guards reset_prefix_cache with `if level > 0:`, but the async path was missing this check. Signed-off-by: Jaewon Lee <jaewon@meta.com>
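The fix described above can be illustrated with a toy stand-in (invented class and counters; only the `if level > 0:` guard around `reset_prefix_cache` comes from the commit message):

```python
class SleepPathDemo:
    """Toy illustration of guarding side effects behind the sleep level."""

    def __init__(self):
        self.cache_resets = 0
        self.paused = False

    def reset_prefix_cache(self):
        self.cache_resets += 1

    def sleep(self, level: int = 1):
        # Level 0 must have no side effects beyond pausing scheduling,
        # so the prefix-cache reset is guarded in both sync and async paths.
        if level > 0:
            self.reset_prefix_cache()
        self.paused = True

d = SleepPathDemo()
d.sleep(level=0)
assert d.cache_resets == 0   # level 0: scheduling paused, cache untouched
d.sleep(level=1)
assert d.cache_resets == 1   # level >= 1: cache reset before offload
```

Without the guard on the async path, a level 0 sleep would silently drop the prefix cache even though no memory was offloaded, which is exactly the asymmetry the commit removes.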
Signed-off-by: Jaewon Lee <jaewon@meta.com>
Force-pushed from cc187a3 to b5fcb4c
…#33195) Signed-off-by: Jaewon Lee <jaewon@meta.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
Summary

- Adds a level 0 sleep mode that pauses scheduling without touching GPU memory
- Adds enqueue() and wait_for_completion() methods to the offline LLM class for explicit request scheduling control

Level 0 Sleep

- Pauses scheduling but keeps accepting requests
- No GPU memory changes (unlike level 1/2)
- Wake up with tags=["scheduling"] to resume

Use Case

Enables batched inference patterns where all requests are queued first, then processed together.
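The enqueue-then-wait pattern can be sketched with a stub in place of the real LLM class. The method names (enqueue, wait_for_completion, generate) follow the PR summary; the bodies here are invented, since the real methods drive the engine rather than transform strings:

```python
class StubLLM:
    """Toy stand-in for the offline LLM class, showing how generate()
    decomposes into enqueue() + wait_for_completion()."""

    def __init__(self):
        self._pending = []

    def enqueue(self, prompts):
        # Queue requests without running the engine.
        self._pending.extend(prompts)

    def wait_for_completion(self, use_tqdm=False):
        # Process everything queued so far as one batch.
        outputs = [p.upper() for p in self._pending]  # fake "generation"
        self._pending.clear()
        return outputs

    def generate(self, prompts, use_tqdm=False):
        # The refactored generate() is just enqueue + wait.
        self.enqueue(prompts)
        return self.wait_for_completion(use_tqdm=use_tqdm)

llm = StubLLM()
llm.enqueue(["hello"])        # queue from multiple call sites...
llm.enqueue(["world"])
print(llm.wait_for_completion())   # ['HELLO', 'WORLD']
```

This split is what makes the pattern useful with level 0 sleep: requests can accumulate while scheduling is paused, then run as a single batch after wake_up.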
Test plan
No-op by default; could be used by the offline inference LLM class as a start.