[Core] Fix livelock in shm_broadcast under high concurrent load #29813

kitaekatt wants to merge 1 commit into
Conversation
Add SpinBackoffTimer to prevent spinlock livelock in the shared memory broadcast mechanism. Under sustained high concurrent load, pure `sched_yield()` in the spin loops can cause all worker processes to livelock while waiting for read/write access to the ring buffer. The fix introduces a new `SpinBackoffTimer` class that adds a small periodic sleep (1ms every 1000 spins) to break potential livelock patterns. It is used in both the `acquire_write()` and `acquire_read()` spin loops.

Root cause analysis:
- The V1 multiprocess executor uses shared memory ring buffers for IPC between EngineCore and workers
- Under high concurrency (e.g., `max_num_seqs=62` with sustained batch requests), all processes can enter their spin loops simultaneously
- Pure `sched_yield()` allows the OS scheduler to immediately reschedule the same process, creating a livelock where no process makes progress
- The periodic backoff sleep breaks this pattern by ensuring processes yield long enough for others to acquire the shared resource

Symptoms before the fix:
- Server freeze after ~492 concurrent items processed
- All worker threads blocked on `futex_wait_queue`
- No error messages; a complete hang

Testing:
- Verified the fix with a stress test: 120 batches x 12 concurrent requests (1440 items) completed without freeze
- The previous failure point was ~41 batches (~492 items)
- Ran multiple iterations to confirm stability
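The diff itself is not reproduced in this conversation. As a rough illustration of the described mechanism, a minimal sketch of such a timer, assuming `spins_per_sleep`/`sleep_s` as parameter names and `spin()` as the wait method (only `record_activity()` is named explicitly in the review below):

```python
import os
import time


class SpinBackoffTimer:
    """Spin-wait helper: yield the CPU on every iteration, but sleep
    briefly every N spins so that competing processes cannot livelock."""

    def __init__(self, spins_per_sleep: int = 1000,
                 sleep_s: float = 0.001) -> None:
        self.spins_per_sleep = spins_per_sleep  # 1000 spins between sleeps
        self.sleep_s = sleep_s                  # 1 ms backoff
        self._spins = 0

    def spin(self) -> None:
        self._spins += 1
        if self._spins % self.spins_per_sleep == 0:
            # Periodic backoff: sleep long enough that another process
            # gets scheduled and can acquire the shared ring buffer.
            time.sleep(self.sleep_s)
        else:
            # Fast path, same as the original SpinTimer (Unix-only call).
            os.sched_yield()

    def record_activity(self) -> None:
        # Reset the counter after successful progress to maintain
        # low latency during normal (uncongested) operation.
        self._spins = 0
```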
Code Review
This pull request addresses a livelock issue in the shared memory broadcast mechanism under high concurrent load. The root cause is identified as the use of pure `sched_yield()` in spin loops. The solution introduces a `SpinBackoffTimer` class that implements a backoff strategy by periodically adding a small sleep. This new timer is now used in the spin loops for both writers (`acquire_write`) and readers (`acquire_read`). The changes are well-contained, logical, and directly address the described problem. The implementation of `SpinBackoffTimer` is clear and the integration into `MessageQueue` is correct. I have one point of feedback regarding the usage of the new timer in the writer path, which appears inconsistent and may have minor performance implications.
```diff
  self._is_remote_reader = False
- self._read_spin_timer = SpinTimer()
+ self._read_spin_timer = SpinBackoffTimer()
+ self._write_spin_timer = SpinBackoffTimer()
```
While `_write_spin_timer` is correctly initialized here to use the new backoff strategy, its usage in `acquire_write` appears to be incomplete. The `record_activity()` method of the timer is never called after a successful write operation. This is inconsistent with the `acquire_read` method, which does call `record_activity()` on its timer after a successful read.

The docstring for `record_activity` in `SpinBackoffTimer` states it is for 'maintain[ing] low latency during normal ops'. By not calling it, the writer's spin counter is never reset on success. This leads to periodic sleeps even during normal, non-congested operation, which may introduce a small performance overhead and contradicts the PR's stated goal to 'preserve low latency during normal operations'.

To ensure consistency and optimal performance in non-congested scenarios, consider calling `self._write_spin_timer.record_activity()` in `acquire_write` after a write succeeds, before the loop is broken. This would align the writer's behavior with the reader's.
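Concretely, the suggestion amounts to something like the following sketch. The loop body is simplified and the helpers `_can_write()` / `_mark_written()` are placeholders, not the actual `MessageQueue` internals:

```python
from contextlib import contextmanager


@contextmanager
def acquire_write(self):
    while True:
        if not self._can_write():           # placeholder readiness check
            self._write_spin_timer.spin()   # yield, with periodic backoff
            continue
        yield self._buffer                  # caller writes into the block
        self._mark_written()                # placeholder: publish the write
        # Suggested addition: reset the backoff counter on success so an
        # uncongested writer never reaches the periodic 1 ms sleep.
        self._write_spin_timer.record_activity()
        break
```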
Closing this PR - the sleep-based backoff is a workaround, not a proper fix. The root cause is missing memory barriers in the shared memory protocol. Will submit a new PR with a proper memory fence implementation.
Thanks @kitaekatt!!
Summary
This PR fixes a livelock bug in the shared memory broadcast mechanism (`shm_broadcast.py`) that causes the V1 engine to freeze under sustained high concurrent load.

Problem
When running with high concurrency settings (e.g., `max_num_seqs=62`) under sustained batch inference load, the V1 multiprocess executor freezes completely after processing ~492 items. All worker threads become blocked on `futex_wait_queue` with no error messages.

Root cause: The spin loops in `acquire_write()` and `acquire_read()` use pure `sched_yield()`, which can cause livelock when multiple processes are spinning simultaneously on shared memory. The OS scheduler can immediately reschedule the same process, creating a cycle where no process makes progress.
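In other words, the failing pattern reduces to a loop like this (illustrative only, not the actual `shm_broadcast` code):

```python
import os


def wait_for_slot(slot_ready) -> None:
    # Every waiting process runs this loop. sched_yield() returns
    # immediately if the scheduler picks the same process again, so under
    # saturation the spinners can starve whichever process would make
    # slot_ready() true -- a livelock, not a deadlock.
    while not slot_ready():
        os.sched_yield()
```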
Solution

Introduce a new `SpinBackoffTimer` class that adds periodic backoff sleeps (1ms every 1000 spins) to break potential livelock patterns. This is a minimal change that:

- keeps the existing `sched_yield()` spin on the fast path
- preserves low latency during normal operations

Changes

- Add `SpinBackoffTimer` class with a periodic backoff mechanism
- Replace `SpinTimer` with `SpinBackoffTimer` for the default (non-`VLLM_SLEEP_WHEN_IDLE`) case
- Add `_write_spin_timer` for the `acquire_write()` spin loop

Testing
Before fix: Server freezes at ~492 items (batch 41 of 12-concurrent request batches)
After fix: Successfully completed multiple runs of 120 batches × 12 concurrent requests (1440 items each) without freeze
Test configuration: `max_num_seqs=62`, `max_model_len=4096`, `gpu_memory_utilization=0.68`

Test script
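The actual script is collapsed in the original PR. As a rough stand-in, a hypothetical stress test matching the numbers above (the endpoint URL, model name, prompt, and payload are all assumptions, not taken from the PR) might look like:

```python
# Hypothetical stress test: 120 batches of 12 concurrent completion
# requests (1440 items total) against a local vLLM server.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"   # assumed vLLM endpoint
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 32}


def one_request(i: int) -> int:
    req = urllib.request.Request(
        URL,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    # A generous timeout so a livelocked server shows up as a hang here.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.status


def main() -> None:
    done = 0
    with ThreadPoolExecutor(max_workers=12) as pool:
        for batch in range(120):
            statuses = list(pool.map(one_request, range(12)))
            done += len(statuses)
            print(f"batch {batch + 1}/120 ok, {done} items total")


if __name__ == "__main__":
    main()
```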