[Core] Fix livelock in shm_broadcast under high concurrent load #29813

Closed

kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix-shm-broadcast-livelock

Conversation

@kitaekatt
Contributor

Summary

This PR fixes a livelock bug in the shared memory broadcast mechanism (shm_broadcast.py) that causes the V1 engine to freeze under sustained high concurrent load.

Problem

When running with high concurrency settings (e.g., max_num_seqs=62) under sustained batch inference load, the V1 multiprocess executor freezes completely after processing ~492 items. All worker threads become blocked on futex_wait_queue with no error messages.

Root cause: The spin loops in acquire_write() and acquire_read() use pure sched_yield(), which can cause livelock when multiple processes are spinning simultaneously on shared memory. The OS scheduler can immediately reschedule the same process, creating a cycle where no process makes progress.

Solution

Introduce a new SpinBackoffTimer class that adds periodic backoff sleeps (1ms every 1000 spins) to break potential livelock patterns. This is a minimal change (sketched after the list below) that:

  • Preserves low latency during normal operations (most spins still use sched_yield())
  • Breaks livelock patterns by ensuring processes yield long enough for others to acquire resources
  • Is configurable via constructor parameters if tuning is needed
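A minimal sketch of the idea, assuming the spin()/record_activity() interface implied by the review discussion below (the actual class in the PR may differ; constructor parameter names are illustrative):

import os
import time

class SpinBackoffTimer:
    """Spin with sched_yield(), sleeping 1ms every 1000 spins.

    Sketch only: the thresholds follow the PR description, not
    necessarily the merged code.
    """

    def __init__(self, spins_per_backoff: int = 1000, backoff_s: float = 0.001):
        self.spins_per_backoff = spins_per_backoff  # spins between sleeps
        self.backoff_s = backoff_s                  # 1ms backoff sleep
        self._spins = 0

    def spin(self) -> None:
        self._spins += 1
        if self._spins % self.spins_per_backoff == 0:
            # Forced sleep guarantees other spinning processes get CPU
            # time, breaking the yield-only livelock cycle.
            time.sleep(self.backoff_s)
        else:
            # Fast path: plain yield preserves low latency when uncontended.
            os.sched_yield()

    def record_activity(self) -> None:
        # Reset after a successful acquire so steady-state traffic
        # rarely hits the backoff sleep.
        self._spins = 0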

Changes

  • Add SpinBackoffTimer class with periodic backoff mechanism
  • Replace SpinTimer with SpinBackoffTimer for the default (non-VLLM_SLEEP_WHEN_IDLE) case
  • Add _write_spin_timer for acquire_write() spin loop

Testing

Before fix: Server freezes at ~492 items (batch 41 of 12-concurrent request batches)

After fix: Successfully completed multiple runs of 120 batches × 12 concurrent requests (1440 items each) without freeze

Test configuration:

  • Model: Qwen2.5-32B-Instruct-AWQ
  • GPU: NVIDIA RTX 5090 (32GB)
  • Settings: max_num_seqs=62, max_model_len=4096, gpu_memory_utilization=0.68
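For reference, a server launch along these lines matches that configuration (the exact command is not given in the PR; the model path and port 8008 are assumptions taken from the test script below):

vllm serve Qwen2.5-32B-Instruct-AWQ \
    --port 8008 \
    --max-num-seqs 62 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.68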

Test script

#!/usr/bin/env python3
"""Aggressive stress test - 120 batches to ensure freeze trigger."""
import asyncio
import aiohttp

SERVER_URL = "http://127.0.0.1:8008"

async def make_request(session, i):
    """Send one chat completion request and classify the outcome."""
    try:
        async with session.post(
            f"{SERVER_URL}/v1/chat/completions",
            json={
                "model": "Qwen2.5-32B-Instruct-AWQ",
                "messages": [{"role": "user", "content": f"What is {i} + {i}?"}],
                "max_tokens": 20
            },
            timeout=aiohttp.ClientTimeout(total=60)
        ) as resp:
            return "ok" if resp.status == 200 else f"err:{resp.status}"
    except asyncio.TimeoutError:
        return "timeout"
    except aiohttp.ClientError as e:
        return f"client_err:{type(e).__name__}"

async def run_batch(batch_num, concurrency=12):
    """Fire `concurrency` requests at once and wait for all of them."""
    async with aiohttp.ClientSession() as session:
        tasks = [make_request(session, batch_num*concurrency + i) for i in range(concurrency)]
        return await asyncio.gather(*tasks)

async def main():
    for batch in range(120):
        results = await run_batch(batch, 12)
        ok_count = sum(1 for r in results if r == "ok")
        # A fully failed batch (all timeouts/errors) signals the engine freeze.
        if ok_count == 0:
            print(f"FREEZE DETECTED at batch {batch+1}")
            return
    print("SUCCESS: Completed 120 batches without freeze!")

if __name__ == "__main__":
    asyncio.run(main())

Commit message

Add SpinBackoffTimer to prevent spinlock livelock in the shared memory
broadcast mechanism. Under sustained high concurrent load, pure
sched_yield() in the spin loops can cause all worker processes to
livelock when waiting for read/write access to the ring buffer.

The fix introduces a new SpinBackoffTimer class that adds a small
periodic sleep (1ms every 1000 spins) to break potential livelock
patterns. This is used in both acquire_write() and acquire_read()
spin loops.

Root cause analysis:
- The V1 multiprocess executor uses shared memory ring buffers for
  IPC between EngineCore and workers
- Under high concurrency (e.g., max_num_seqs=62 with sustained batch
  requests), all processes can enter spin loops simultaneously
- Pure sched_yield() allows the OS scheduler to immediately reschedule
  the same process, creating a livelock where no process makes progress
- The periodic backoff sleep breaks this pattern by ensuring processes
  yield long enough for others to acquire the shared resource

Symptoms before fix:
- Server freeze after ~492 concurrent items processed
- All worker threads blocked on futex_wait_queue
- No error messages - complete hang

Testing:
- Verified fix with stress test: 120 batches x 12 concurrent requests
  (1440 items) completed without freeze
- Previous failure point was ~41 batches (~492 items)
- Ran multiple iterations to confirm stability
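
The root-cause analysis above reduces to a small difference in the spin loop. A schematic contrast, with the readiness predicate as a hypothetical stand-in for the actual ring-buffer flag checks:

import os
import time

def spin_until(slot_ready):
    # Yield-only spinning: the scheduler may immediately reschedule this
    # same process, so mutually waiting processes can starve each other.
    while not slot_ready():
        os.sched_yield()

def spin_until_with_backoff(slot_ready, spins_per_backoff=1000, backoff_s=0.001):
    # Same loop, but a 1ms sleep every 1000 spins forces a descheduling
    # long enough for another process to make progress.
    spins = 0
    while not slot_ready():
        spins += 1
        if spins % spins_per_backoff == 0:
            time.sleep(backoff_s)
        else:
            os.sched_yield()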


github-actions Bot commented Dec 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist Bot left a comment


Code Review

This pull request addresses a livelock issue in the shared memory broadcast mechanism under high concurrent load. The root cause is identified as the use of pure sched_yield() in spin loops. The solution introduces a SpinBackoffTimer class that implements a backoff strategy by adding a small sleep periodically. This new timer is now used in the spin loops for both writers (acquire_write) and readers (acquire_read). The changes are well-contained, logical, and directly address the described problem. The implementation of SpinBackoffTimer is clear and the integration into MessageQueue is correct. I have one point of feedback regarding the usage of the new timer in the writer path, which appears inconsistent and may have minor performance implications.

  self._is_remote_reader = False
- self._read_spin_timer = SpinTimer()
+ self._read_spin_timer = SpinBackoffTimer()
+ self._write_spin_timer = SpinBackoffTimer()

Severity: high

While _write_spin_timer is correctly initialized here to use the new backoff strategy, its usage in acquire_write appears to be incomplete. The record_activity() method of the timer is never called after a successful write operation. This is inconsistent with the acquire_read method, which does call record_activity() on its timer after a successful read.

The docstring for record_activity in SpinBackoffTimer states it is for 'maintain[ing] low latency during normal ops'. By not calling it, the writer's spin counter is never reset on success. This leads to periodic sleeps even during normal, non-congested operations, which may introduce a small performance overhead and contradicts a stated goal of the PR to 'preserve low latency during normal operations'.

To ensure consistency and optimal performance in non-congested scenarios, consider calling self._write_spin_timer.record_activity() in the acquire_write method after a write operation succeeds, before the loop is broken. This would align the writer's behavior with the reader's.
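
A schematic of the suggested change (the real acquire_write is a context manager over the shared-memory ring buffer; the claim helper here is hypothetical):

def acquire_write(self):
    while True:
        if self._try_claim_write_block():  # hypothetical predicate
            # Reset the spin counter on success, mirroring acquire_read,
            # so uncontended writes never hit the backoff sleep.
            self._write_spin_timer.record_activity()
            return
        self._write_spin_timer.spin()  # yield, with periodic 1ms backoff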

@kitaekatt
Contributor Author

Closing this PR - the sleep-based backoff is a workaround, not a proper fix. The root cause is missing memory barriers in the shared memory protocol. Will submit a new PR with proper memory fence implementation.

@kitaekatt closed this Dec 1, 2025
@njhill
Member

njhill commented Dec 1, 2025

Thanks @kitaekatt!!
