
Conversation

@NickLucche (Collaborator) commented Nov 7, 2025

PR #28012 breaks PD (prefill/decode disaggregated) deployments with TP>1; this PR reverts it. Repro:

```bash
# Spin up P (prefill) with TP=2
vllm serve Qwen/Qwen3-0.6B --port $(just port 8100) --enforce-eager --enable-log-requests --tensor-parallel-size 2 --gpu-memory-utilization 0.4 --trust-remote-code --max-model-len 32768 --block-size 128 --data_parallel_size 1 --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
# Spin up toy_proxy_server in the background...

# Send a test request
curl -X POST http://localhost:$(just port 8192)/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{{MODEL}}",
    "prompt": "Can you complete this latin sentence: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
    "max_tokens": 150,
    "temperature": 0.2
  }'
```

Observe the request getting stuck:

```
(APIServer pid=2822821) INFO 11-07 11:08:44 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2822821) INFO 11-07 11:08:44 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2822821) INFO:     Started server process [2822821]
(APIServer pid=2822821) INFO:     Waiting for application startup.
(APIServer pid=2822821) INFO:     Application startup complete.
(APIServer pid=2822821) INFO 11-07 11:10:17 [logger.py:47] Received request cmpl-700daa72-8cc7-4a65-9bff-cb83c40b16a7-0: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args={'kv_transfer_params': {'do_remote_decode': True, 'do_remote_prefill': False, 'remote_engine_id': None, 'remote_block_ids': None, 'remote_host': None, 'remote_port': None}}), lora_request: None.
(APIServer pid=2822821) INFO 11-07 11:10:17 [async_llm.py:343] Added request cmpl-700daa72-8cc7-4a65-9bff-cb83c40b16a7-0.
```

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request reverts a previous performance-related change (#28012) that was causing issues in deployments with tensor parallelism greater than one. The revert re-introduces a dedicated I/O thread in the multiprocessor executor to handle worker responses, removing the custom FutureWrapper and its associated logic. The changes touch several components, including executors, KV connector utilities, and tests, to align with the restored asynchronous execution model. My review identified a critical issue with an incorrect type hint that could lead to runtime errors, and a high-severity thread-safety concern in the asynchronous aggregation logic that, while not causing a bug with the current configuration, is fragile and should be addressed to prevent future issues.
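
For context, the pattern being restored can be pictured as follows: the blocking "collect one response per worker" step runs on a single dedicated I/O thread, and the scheduler gets back a plain `concurrent.futures.Future` instead of a custom `FutureWrapper`. The snippet below is an illustrative toy sketch only; names such as `TinyExecutor`, `DummyWorker`, and `collect_worker_outputs` are hypothetical, not vLLM's actual classes.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class DummyWorker:
    """Hypothetical stand-in for a worker process handle (not vLLM code)."""

    def get_response(self):
        time.sleep(0.01)  # pretend to wait on the worker's response queue
        return "output"


def collect_worker_outputs(workers):
    # Blocking step: read one response per worker, in order.
    return [w.get_response() for w in workers]


class TinyExecutor:
    """Toy executor illustrating the restored pattern."""

    def __init__(self, workers):
        self.workers = workers
        # A single dedicated I/O thread; responses are collected serially on it.
        self.io_thread_pool = ThreadPoolExecutor(max_workers=1)

    def execute_model(self, non_block: bool = False):
        if not non_block:
            return collect_worker_outputs(self.workers)
        # Non-blocking path: hand the scheduler a plain concurrent.futures.Future
        # instead of a custom FutureWrapper.
        return self.io_thread_pool.submit(collect_worker_outputs, self.workers)


if __name__ == "__main__":
    executor = TinyExecutor([DummyWorker(), DummyWorker()])
    future = executor.execute_model(non_block=True)
    print(future.result())  # ['output', 'output']
```

Because the pool has exactly one worker thread, completion callbacks and aggregation run serially on it, which is the single-thread assumption the aggregation code discussed below relies on.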

```diff
 @torch.inference_mode()
 def sample_tokens(
-    self, grammar_output: "GrammarOutput | None"
+    self, grammar_output: "GrammarOutput"
```

critical

The type hint for grammar_output has been changed to GrammarOutput, but callers of this method (e.g., in vllm/v1/executor/abstract.py) can pass None. This creates a discrepancy between the type hint and the actual usage, and can lead to a runtime AttributeError if None is passed and its attributes are accessed downstream. The type hint should be reverted to GrammarOutput | None to accurately reflect that None is a valid value.

Suggested change:

```diff
-    self, grammar_output: "GrammarOutput"
+    self, grammar_output: "GrammarOutput | None"
```
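
As a toy illustration of the concern (hypothetical stand-ins, not the real vLLM types): narrowing the annotation does not stop existing callers from passing `None`; it only hides the case from type checkers until an attribute access fails at runtime.

```python
from dataclasses import dataclass, field


@dataclass
class GrammarOutput:  # toy stand-in, not vLLM's GrammarOutput
    token_bitmask: list[int] = field(default_factory=list)


def sample_tokens(grammar_output: "GrammarOutput") -> int:
    # With the narrowed hint, nothing here guards against None.
    return len(grammar_output.token_bitmask)


def sample_tokens_optional(grammar_output: "GrammarOutput | None") -> int:
    # The wider hint makes the None branch explicit for callers and checkers.
    if grammar_output is None:
        return 0
    return len(grammar_output.token_bitmask)


print(sample_tokens_optional(None))  # 0
# print(sample_tokens(None))         # AttributeError: 'NoneType' object has no attribute 'token_bitmask'
```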

Comment on lines +231 to +255:

```python
        outputs: list[ModelRunnerOutput | None] = [None] * len(output_futures)
        remaining = len(output_futures)

        def make_callback(idx):
            def callback(fut):
                if result_future.done():
                    return

                try:
                    outputs[idx] = fut.result()
                except CancelledError:
                    result_future.cancel()
                except Exception as e:
                    result_future.set_exception(e)

                # this check assumes io_thread_pool uses a single thread
                nonlocal remaining
                remaining -= 1
                if not remaining:
                    result_future.set_result(self.aggregate(outputs, output_rank))

            return callback

        for i, output_future in enumerate(output_futures):
            output_future.add_done_callback(make_callback(i))
```

high

The current implementation of async_aggregate is not thread-safe. The remaining counter is accessed and modified without a lock. While the comment on line 246 correctly points out the assumption of a single-threaded I/O pool, this design is fragile. If the ThreadPoolExecutor in MultiprocExecutor is ever configured with more than one worker, this will introduce a race condition, leading to incorrect behavior. To make this implementation robust and thread-safe, a lock should be used to protect the shared remaining counter.

```python
        from threading import Lock
        outputs: list[ModelRunnerOutput | None] = [None] * len(output_futures)
        remaining = len(output_futures)
        lock = Lock()

        def make_callback(idx):
            def callback(fut):
                if result_future.done():
                    return

                try:
                    outputs[idx] = fut.result()
                except CancelledError:
                    result_future.cancel()
                except Exception as e:
                    result_future.set_exception(e)

                with lock:
                    # This check is now thread-safe.
                    nonlocal remaining
                    remaining -= 1
                    if not remaining:
                        if not result_future.done():
                            result_future.set_result(self.aggregate(outputs, output_rank))

            return callback

        for i, output_future in enumerate(output_futures):
            output_future.add_done_callback(make_callback(i))
```


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +144 to +156:

```diff
     def __init__(self, refs, aggregator: KVOutputAggregator | None = None):
         super().__init__()
-        self.ref_or_refs = ref_or_refs
+        self.refs = refs
         self.aggregator = aggregator

     def result(self, timeout=None):
         if timeout is not None:
             raise NotImplementedError("timeout is not supported")

-        outputs = ray.get(self.ref_or_refs, timeout=timeout)
         if self.aggregator is None:
-            return outputs
+            return self.refs[0].get()

+        outputs = [ref.get() for ref in self.refs]
```


P0: Ray futures retrieved via nonexistent `ObjectRef.get`

The new FutureWrapper.result() calls self.refs[0].get() and [ref.get() for ref in self.refs]. Ray object references don’t expose a .get() method; they must be resolved with ray.get(ref) or ray.get(refs). This means any non-blocking path that returns a FutureWrapper will immediately raise AttributeError when the scheduler awaits the result, breaking Ray execution entirely.
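
For reference, a minimal sketch of the distinction (assuming a local Ray runtime; this is not the vLLM code path): object refs are resolved through the module-level `ray.get`, which accepts either a single `ObjectRef` or a list of them.

```python
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
def square(i: int) -> int:
    return i * i


refs = [square.remote(i) for i in range(4)]  # list of ObjectRef

print(ray.get(refs[0]))  # 0            -- resolve a single ref
print(ray.get(refs))     # [0, 1, 4, 9] -- resolve a list of refs

# ObjectRef has no .get() method; the line below would raise AttributeError.
# refs[0].get()

ray.shutdown()
```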


Comment on lines 435 to +445:

```diff
             # When PP is used, we return a FutureWrapper immediately so that
             # the scheduler can yield to the next batch.
-            return FutureWrapper(refs[0])
+            return FutureWrapper(refs)

         # Get output from all workers when connector is present
         assert self.kv_output_aggregator is not None
         if not non_block:
             # Block and get results from all workers
-            return self.kv_output_aggregator.aggregate(ray.get(refs))
+            outputs = [ref.get() for ref in refs]
+            return self.kv_output_aggregator.aggregate(outputs)
```


P0: Blocking Ray sampling uses `.get()` on `ObjectRef`

RayDistributedExecutor.sample_tokens now resolves Ray outputs with refs[0].get() and [ref.get() for ref in refs]. ObjectRef does not provide a get() API, so any synchronous sampling (with or without a KV connector) will fail with AttributeError before a response is returned. Use ray.get(refs) to retrieve the values.


@NickLucche added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Nov 7, 2025
@DarkLight1337 enabled auto-merge (squash) on November 7, 2025, 14:36
@DarkLight1337 merged commit 68a72a5 into vllm-project:main on Nov 7, 2025
54 checks passed
@NickLucche deleted the fix-nixl-thread branch on November 7, 2025, 15:08

@njhill (Member) commented Nov 7, 2025

Here is the change again with a fix: #28319

Thanks @NickLucche and @DarkLight1337

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels: `kv-connector`, `ready` (ONLY add when PR is ready to merge/full CI is needed), `v1`
