fix(shm): Add memory barriers for cross-process shared memory visibility #29819
kitaekatt wants to merge 2 commits
Conversation
Code Review
This pull request correctly identifies and fixes a critical race condition in the shared memory IPC mechanism by introducing memory barriers. The use of threading.Lock for this purpose is a standard and portable approach. The analysis in the pull request description is excellent. I have a couple of suggestions to improve the robustness and documentation of the new memory_fence function.
Thanks @kitaekatt! I'm just a bit surprised that this hasn't been encountered more often if it's so easy to reproduce. Are you pinning CPUs / setting some particular NUMA config?
@kitaekatt ping :) (I think the changes look very reasonable, just keen to understand the practical cases where this may arise)
Hi Nick! Excited to contribute some value here back to vLLM.

Usage pattern that triggered this: I'm trying to maximize parallel inference bandwidth for batch workloads, using benchmarking (IFEval, GSM8K, HumanEval, MMLU) as a way to simulate sustained load. Running batch_size=12, concurrency=12 across multiple models, with continuous high-throughput requests and no pauses between batches. The freeze would occur at a consistent threshold for each configuration, suggesting a deterministic trigger once enough state accumulated.

Hardware/config: RTX 5090 (Blackwell sm_120, 32GB), Ubuntu with kernel 6.14. No CPU pinning, no NUMA configuration - completely vanilla setup.

Why others likely haven't hit this: when I added diagnostic instrumentation (just time.monotonic() calls and periodic logger.info() in core.py and multiproc_executor.py), the freeze completely disappeared; the run processed 3x the normal freeze threshold without any issue. The micro-delays broke the precise timing the race condition requires. So anyone trying to debug this with profiling tools, extra logging, or py-spy would accidentally mask it. Interactive users also don't sustain high-frequency message passing long enough to hit the threshold. You basically need: (1) sustained batch load without pauses, (2) no observation overhead, and (3) hardware where the timing happens to line up.

The memory barrier fix directly addresses the visibility issue rather than relying on accidental timing changes.
Thank you for the review feedback. I've addressed both suggestions:
Hi @kitaekatt, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Thanks for the review! Both suggestions have been addressed:
Force-pushed from 5b55b7f to bbe218d
Pushed additional commit addressing the pre-commit failures:
This pull request has merge conflicts that must be resolved before it can be merged.
@kitaekatt it looks like the commits in the branch got a bit messed up. Did you intend to introduce e3ba546 (the previous livelock fix)? And it looks like there's an unrelated commit showing up too. I'd like to do a quick benchmark of this but otherwise it looks good to me once cleaned up.
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte writes to signal between writer and reader processes. On multi-core systems, these writes may stay in CPU store buffers and not be visible to other processes running on different cores, causing indefinite spinning/freeze under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release (which provides full barrier semantics per POSIX.1-2008) at four critical points:

- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag

The memory barrier ensures that:

1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+ requests) where both writer and readers would spin indefinitely waiting for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
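For reference, a minimal sketch of what a lock-based fence like this can look like (illustrative only; the helper name memory_fence comes from the review comment above, and the actual implementation in the PR may differ):

```python
import threading

# Lock used purely for its barrier side effect; the critical section is
# intentionally empty. Acquiring and releasing a pthread mutex
# synchronizes memory per POSIX.1-2008.
_fence_lock = threading.Lock()


def memory_fence() -> None:
    """Full memory barrier: stores issued before the fence become
    globally visible, and loads after the fence see the latest values."""
    with _fence_lock:
        pass
```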
…ager
Signed-off-by: Christina <truffle@gmail.com>
Force-pushed from bbe218d to 0f27680
@njhill Thanks for catching that! I've cleaned up the branch - it was accidentally polluted with commits from a separate PR (#29813, which I closed in favor of this approach). The branch now contains only:
The SpinBackoffTimer/livelock fix from PR #29813 has been removed - that was a workaround, not the proper solution.
Thanks @kitaekatt
FWIW those comments were from gemini, not me :)
njhill left a comment
@kitaekatt did you close this on purpose? I had just benchmarked it and was about to approve/merge :)
Performance actually looks slightly better with this!
Oh I see you opened another PR, will move to that one.
Summary
Fixes freeze/hang during sustained concurrent batch inference caused by missing memory barriers in the shared memory ring buffer protocol.
Root Cause
The shm_broadcast.py shared memory IPC uses plain byte writes (metadata_buffer[0] = 1) to signal between writer and reader processes. On multi-core systems, these writes can stay in CPU store buffers and may not be visible to other processes running on different cores. This causes indefinite spinning: both the writer and readers wait for flags that were updated but never became visible across cores. A sketch of the problematic pattern follows.
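To make the failure mode concrete, here is a hypothetical simplification (not the actual shm_broadcast.py code) of the unsynchronized flag protocol:

```python
import time

# Stand-in for the shared-memory metadata region; in the real code this
# buffer is mapped into both the writer and reader processes.
metadata_buffer = bytearray(1)


def writer_signal() -> None:
    # Plain byte write: on a multi-core system this store can sit in the
    # writing core's store buffer without becoming visible to a reader
    # running on another core.
    metadata_buffer[0] = 1


def reader_spin() -> None:
    # Without a barrier there is no guarantee the reader's loads ever
    # observe the writer's store, so this loop can spin indefinitely.
    while metadata_buffer[0] != 1:
        time.sleep(0)  # yield and retry
```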
The Fix
Add explicit memory barriers using the threading.Lock acquire/release pattern (which provides full memory barrier semantics per POSIX.1-2008) at four critical points:
- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag
A simplified sketch of the write path follows.
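A minimal sketch of how the write-path fences can be placed (hypothetical simplification; the flag layout and class name are illustrative, not the actual shm_broadcast.py code):

```python
import threading

_fence_lock = threading.Lock()


def memory_fence() -> None:
    # Full barrier via lock acquire/release (see "Why threading.Lock?").
    with _fence_lock:
        pass


class RingSketch:
    # Illustrative flag layout: byte 0 is the "written" flag, the
    # remaining bytes are per-reader "read" flags.
    def __init__(self, metadata_buffer: bytearray):
        self.metadata_buffer = metadata_buffer

    def acquire_write(self) -> None:
        while True:
            memory_fence()  # barrier BEFORE reading the reader flags
            if all(b == 1 for b in self.metadata_buffer[1:]):
                # ... payload is written to the data buffer here ...
                self.metadata_buffer[0] = 1  # set the "written" flag
                memory_fence()  # barrier AFTER setting the written flag
                return
```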
Why threading.Lock?
On POSIX systems, pthread_mutex_lock/unlock provides sequentially consistent memory barrier semantics. The lock acquire/release pattern (~20 ns overhead) is the most portable and well-defined way to get memory barriers in Python without requiring platform-specific code.
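A quick way to sanity-check the quoted per-fence overhead on a given machine (the lock and timing harness here are illustrative; absolute numbers vary with hardware and interpreter, and include Python call overhead):

```python
import threading
import timeit

_fence_lock = threading.Lock()


def memory_fence() -> None:
    with _fence_lock:
        pass


# Measure the uncontended acquire/release cost; the result is an upper
# bound on the raw fence cost since it also counts the function call.
n = 1_000_000
elapsed = timeit.timeit(memory_fence, number=n)
print(f"~{elapsed / n * 1e9:.0f} ns per memory_fence() call")
```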
Test Results
Before fix: Freeze at batch ~41 (~492 concurrent requests)
After fix: Successfully completed 120 batches (1440 requests) without freeze
Test configuration:
Test Plan