[Core] Fix livelock in shm_broadcast under high concurrent load #29813

Closed

kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix-shm-broadcast-livelock

Conversation

@kitaekatt
Contributor

Summary

This PR fixes a livelock bug in the shared memory broadcast mechanism (shm_broadcast.py) that causes the V1 engine to freeze under sustained high concurrent load.

Problem

When running with high concurrency settings (e.g., max_num_seqs=62) under sustained batch inference load, the V1 multiprocess executor freezes completely after processing ~492 items. All worker threads become blocked on futex_wait_queue with no error messages.

Root cause: The spin loops in acquire_write() and acquire_read() use pure sched_yield(), which can cause livelock when multiple processes are spinning simultaneously on shared memory. The OS scheduler can immediately reschedule the same process, creating a cycle where no process makes progress.

Solution

Introduce a new SpinBackoffTimer class that adds periodic backoff sleeps (1ms every 1000 spins) to break potential livelock patterns. This is a minimal change (sketched after the list below) that:

  • Preserves low latency during normal operations (most spins still use sched_yield())
  • Breaks livelock patterns by ensuring processes yield long enough for others to acquire resources
  • Is configurable via constructor parameters if tuning is needed
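A minimal sketch of the idea, assuming the spin()/record_activity() interface implied by the review discussion below (the actual class in the PR may differ; constructor parameter names are illustrative):

import os
import time

class SpinBackoffTimer:
    """Spin with sched_yield(), sleeping 1ms every 1000 spins.

    Sketch only: the thresholds follow the PR description, not
    necessarily the merged code.
    """

    def __init__(self, spins_per_backoff: int = 1000, backoff_s: float = 0.001):
        self.spins_per_backoff = spins_per_backoff  # spins between sleeps
        self.backoff_s = backoff_s                  # 1ms backoff sleep
        self._spins = 0

    def spin(self) -> None:
        self._spins += 1
        if self._spins % self.spins_per_backoff == 0:
            # Forced sleep guarantees other spinning processes get CPU
            # time, breaking the yield-only livelock cycle.
            time.sleep(self.backoff_s)
        else:
            # Fast path: plain yield preserves low latency when uncontended.
            os.sched_yield()

    def record_activity(self) -> None:
        # Reset after a successful acquire so steady-state traffic
        # rarely hits the backoff sleep.
        self._spins = 0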

Changes

  • Add SpinBackoffTimer class with periodic backoff mechanism
  • Replace SpinTimer with SpinBackoffTimer for the default (non-VLLM_SLEEP_WHEN_IDLE) case
  • Add _write_spin_timer for acquire_write() spin loop

Testing

Before fix: Server freezes at ~492 items (batch 41 of 12-concurrent request batches)

After fix: Successfully completed multiple runs of 120 batches × 12 concurrent requests (1440 items each) without freeze

Test configuration:

  • Model: Qwen2.5-32B-Instruct-AWQ
  • GPU: NVIDIA RTX 5090 (32GB)
  • Settings: max_num_seqs=62, max_model_len=4096, gpu_memory_utilization=0.68
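For reference, a server launch along these lines matches that configuration (the exact command is not given in the PR; the model path and port 8008 are assumptions taken from the test script below):

vllm serve Qwen2.5-32B-Instruct-AWQ \
    --port 8008 \
    --max-num-seqs 62 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.68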

Test script

#!/usr/bin/env python3
"""Aggressive stress test - 120 batches to ensure freeze trigger."""
import asyncio
import aiohttp

SERVER_URL = "http://127.0.0.1:8008"

async def make_request(session, i):
    """Send one chat completion request and classify the outcome."""
    try:
        async with session.post(
            f"{SERVER_URL}/v1/chat/completions",
            json={
                "model": "Qwen2.5-32B-Instruct-AWQ",
                "messages": [{"role": "user", "content": f"What is {i} + {i}?"}],
                "max_tokens": 20
            },
            timeout=aiohttp.ClientTimeout(total=60)
        ) as resp:
            return "ok" if resp.status == 200 else f"err:{resp.status}"
    except asyncio.TimeoutError:
        return "timeout"
    except aiohttp.ClientError as e:
        return f"client_err:{type(e).__name__}"

async def run_batch(batch_num, concurrency=12):
    """Fire `concurrency` requests at once and wait for all of them."""
    async with aiohttp.ClientSession() as session:
        tasks = [make_request(session, batch_num*concurrency + i) for i in range(concurrency)]
        return await asyncio.gather(*tasks)

async def main():
    for batch in range(120):
        results = await run_batch(batch, 12)
        ok_count = sum(1 for r in results if r == "ok")
        # A fully failed batch (all timeouts/errors) signals the engine freeze.
        if ok_count == 0:
            print(f"FREEZE DETECTED at batch {batch+1}")
            return
    print("SUCCESS: Completed 120 batches without freeze!")

if __name__ == "__main__":
    asyncio.run(main())

Commit message

Add SpinBackoffTimer to prevent spinlock livelock in the shared memory
broadcast mechanism. Under sustained high concurrent load, pure
sched_yield() in the spin loops can cause all worker processes to
livelock when waiting for read/write access to the ring buffer.

The fix introduces a new SpinBackoffTimer class that adds a small
periodic sleep (1ms every 1000 spins) to break potential livelock
patterns. This is used in both acquire_write() and acquire_read()
spin loops.

Root cause analysis:
- The V1 multiprocess executor uses shared memory ring buffers for
  IPC between EngineCore and workers
- Under high concurrency (e.g., max_num_seqs=62 with sustained batch
  requests), all processes can enter spin loops simultaneously
- Pure sched_yield() allows the OS scheduler to immediately reschedule
  the same process, creating a livelock where no process makes progress
- The periodic backoff sleep breaks this pattern by ensuring processes
  yield long enough for others to acquire the shared resource

Symptoms before fix:
- Server freeze after ~492 concurrent items processed
- All worker threads blocked on futex_wait_queue
- No error messages - complete hang

Testing:
- Verified fix with stress test: 120 batches x 12 concurrent requests
  (1440 items) completed without freeze
- Previous failure point was ~41 batches (~492 items)
- Ran multiple iterations to confirm stability
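
The root-cause analysis above reduces to a small difference in the spin loop. A schematic contrast, with the readiness predicate as a hypothetical stand-in for the actual ring-buffer flag checks:

import os
import time

def spin_until(slot_ready):
    # Yield-only spinning: the scheduler may immediately reschedule this
    # same process, so mutually waiting processes can starve each other.
    while not slot_ready():
        os.sched_yield()

def spin_until_with_backoff(slot_ready, spins_per_backoff=1000, backoff_s=0.001):
    # Same loop, but a 1ms sleep every 1000 spins forces a descheduling
    # long enough for another process to make progress.
    spins = 0
    while not slot_ready():
        spins += 1
        if spins % spins_per_backoff == 0:
            time.sleep(backoff_s)
        else:
            os.sched_yield()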


github-actions Bot commented Dec 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist Bot left a comment


Code Review

This pull request addresses a livelock issue in the shared memory broadcast mechanism under high concurrent load. The root cause is identified as the use of pure sched_yield() in spin loops. The solution introduces a SpinBackoffTimer class that implements a backoff strategy by adding a small sleep periodically. This new timer is now used in the spin loops for both writers (acquire_write) and readers (acquire_read). The changes are well-contained, logical, and directly address the described problem. The implementation of SpinBackoffTimer is clear and the integration into MessageQueue is correct. I have one point of feedback regarding the usage of the new timer in the writer path, which appears inconsistent and may have minor performance implications.

  self._is_remote_reader = False
- self._read_spin_timer = SpinTimer()
+ self._read_spin_timer = SpinBackoffTimer()
+ self._write_spin_timer = SpinBackoffTimer()

Severity: high

While _write_spin_timer is correctly initialized here to use the new backoff strategy, its usage in acquire_write appears to be incomplete. The record_activity() method of the timer is never called after a successful write operation. This is inconsistent with the acquire_read method, which does call record_activity() on its timer after a successful read.

The docstring for record_activity in SpinBackoffTimer states it is for 'maintain[ing] low latency during normal ops'. By not calling it, the writer's spin counter is never reset on success. This leads to periodic sleeps even during normal, non-congested operations, which may introduce a small performance overhead and contradicts a stated goal of the PR to 'preserve low latency during normal operations'.

To ensure consistency and optimal performance in non-congested scenarios, consider calling self._write_spin_timer.record_activity() in the acquire_write method after a write operation succeeds, before the loop is broken. This would align the writer's behavior with the reader's.
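
A schematic of the suggested change (the real acquire_write is a context manager over the shared-memory ring buffer; the claim helper here is hypothetical):

def acquire_write(self):
    while True:
        if self._try_claim_write_block():  # hypothetical predicate
            # Reset the spin counter on success, mirroring acquire_read,
            # so uncontended writes never hit the backoff sleep.
            self._write_spin_timer.record_activity()
            return
        self._write_spin_timer.spin()  # yield, with periodic 1ms backoff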

@kitaekatt
Contributor Author

Closing this PR - the sleep-based backoff is a workaround, not a proper fix. The root cause is missing memory barriers in the shared memory protocol. Will submit a new PR with proper memory fence implementation.

@kitaekatt closed this Dec 1, 2025
@njhill
Member

njhill commented Dec 1, 2025

Thanks @kitaekatt!!
