[P/D] rework mooncake connector and introduce its bootstrap server (#31034)
vllm-bot merged 15 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--31034.org.readthedocs.build/en/31034/
Code Review
This pull request is a significant and well-structured rework of the mooncake connector, introducing a central bootstrap server to improve performance and enable future features. The refactoring to an asynchronous design is a great improvement. My review focuses on the robustness of the new distributed communication patterns. I've identified a few high-severity issues related to error handling and idempotency that could impact the system's reliability, particularly under failure conditions.
```python
except Exception as e:
    err_msg = (
        e.response.text if isinstance(e, httpx.HTTPStatusError) else str(e)
    )
    logger.error(
        "Failed to register request %s with bootstrap server: %s",
        req_id,
        err_msg,
    )
```
The failure to register a request with the bootstrap server is handled by logging an error, but the failure is not propagated. Since register_req_with_bootstrap is called in a fire-and-forget manner, the scheduler remains unaware of the failure. This will likely cause the request to hang on the decoder side until it times out, which is not a clean failure mode and makes debugging difficult. A more robust solution would be to communicate this failure back to the scheduler (e.g., via a queue) so it can abort the request immediately with a clear error reason.
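As a rough illustration of the suggestion (hypothetical names throughout, not the connector's actual API), the fire-and-forget task could push failures onto a queue that the scheduler side drains:

```python
import asyncio

# Hypothetical sketch: report registration failures back to the scheduler
# through a queue instead of only logging them.
failed_reqs: asyncio.Queue = asyncio.Queue()

async def register_req_with_bootstrap(req_id: str) -> None:
    try:
        # Stand-in for the real HTTP call to the bootstrap server.
        raise ConnectionError("bootstrap unreachable")
    except Exception as e:
        # Surface the failure instead of swallowing it.
        await failed_reqs.put((req_id, str(e)))

async def main():
    await register_req_with_bootstrap("req-1")
    # The scheduler loop would drain this queue and abort the request
    # immediately with a clear error reason.
    req_id, reason = await failed_reqs.get()
    return req_id, reason

result = asyncio.run(main())
```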
I think this is a valid concern, we should at least add a TODO.
Re: disconnects I am assuming the proxy forwards to both instances
```python
except Exception as e:
    err_msg = (
        e.response.text if isinstance(e, httpx.HTTPStatusError) else str(e)
    )
    logger.error(
        "Failed to query bootstrap server for %d requests: %s",
        len(req_ids),
        err_msg,
    )
    return {}
```
The except block in batch_query_requests catches all exceptions and returns an empty dictionary. The calling function, handle_bootstrap_group, does not correctly handle this case, leading to the requests in the failed batch being effectively dropped. They are not retried or explicitly failed, which will likely cause them to hang until the scheduler times them out. This can make the system brittle, especially if the bootstrap server is temporarily unavailable. The client should implement retries for transient errors (e.g., 5xx status codes, connection errors) and handle non-retriable errors more explicitly.
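A minimal retry sketch for the transient-error case (the backoff values and the use of `ConnectionError` as the retriable class are illustrative, not from this PR):

```python
import asyncio

async def query_with_retries(do_query, max_attempts=3, base_delay=0.1):
    # Retry transient failures (e.g. 5xx / connection errors) with exponential
    # backoff; re-raise after the last attempt so callers can fail the batch
    # explicitly instead of silently dropping it.
    for attempt in range(max_attempts):
        try:
            return await do_query()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2**attempt)

calls = {"n": 0}

async def flaky_query():
    # Fails twice, then succeeds — simulates a briefly unavailable server.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return {"req-1": "ok"}

result = asyncio.run(query_with_retries(flaky_query))
```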
```python
if (reg := self.req_to_dp_rank.get(payload.req_id)) is not None:
    raise HTTPException(
        status_code=400,
        detail=f"Request '{payload.req_id}' already registered with rank {reg} "
        f"but still want to register with rank {payload.dp_rank}",
    )
```
The register_request endpoint is not idempotent. If a client retries a registration request (e.g., due to a transient network error), the second attempt will fail with a 400 error because the request ID is already present. This prevents the implementation of a robust retry mechanism on the client side. The endpoint should handle duplicate registrations for the same request and DP rank gracefully, for example, by returning a success response.
```python
if (reg := self.req_to_dp_rank.get(payload.req_id)) is not None:
    if reg == payload.dp_rank:
        # Request is already registered with the same dp_rank, treat as success.
        return {"status": "ok"}
    raise HTTPException(
        status_code=400,
        detail=f"Request '{payload.req_id}' already registered with rank {reg} "
        f"but still want to register with rank {payload.dp_rank}",
    )
```
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
```
why do we need a separate proxy for mooncake?
From what I've seen so far, the proxies used by vLLM's various connectors aren't unified. For example, disagg_proxy_demo.py doesn't support the nixl-specific "do_remote_decode" and "do_remote_prefill" parameters, which is why nixl has its own toy_proxy_server.py. So, in my view, it's common practice for each connector to have its own dedicated proxy.
To get back to the main point, we are using this specific proxy because it aligns with the central bootstrap server architecture we want to introduce. This is a prerequisite for supporting layerwise transfer in the future. (The other existing proxies only send data to D after receiving the full result from P, which is too late for our use case.)
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
await sock.send(encoded_data)
while True:
    ret_msg = await sock.recv()
    response = self._xfer_resp_decoder.decode(ret_msg)
```
KV pull waits indefinitely when prefiller never replies
The decoder’s receive_kv loop now waits on await sock.recv() with no timeout or cancellation. If the prefiller never responds (e.g., wrong bootstrap address or the producer crashes mid-transfer), this coroutine hangs forever and the request is never re-queued or marked finished, so the abort timeout VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT is never enforced and the decode side leaks the request until process shutdown. The previous implementation set a receive timeout; we should restore a bounded wait or explicit cancellation/retry.
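A bounded wait could be restored along these lines (the timeout value and the `None` return convention are illustrative, not the connector's actual handling):

```python
import asyncio

async def recv_with_timeout(sock, timeout_s: float):
    # Bound the wait so a silent prefiller cannot hang the decoder forever.
    try:
        return await asyncio.wait_for(sock.recv(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # caller can retry, re-queue, or abort the request

class FakeSock:
    # Simulates a prefiller that never answers in time.
    async def recv(self):
        await asyncio.sleep(10)

result = asyncio.run(recv_with_timeout(FakeSock(), timeout_s=0.05))
```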
Hi @dtcccc, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
NickLucche left a comment:
Thanks for the work once again @dtcccc !
After some thinking, I believe all in all these changes are already fine.
My main concern is the added complexity that this design brings in: we have a longer "failure-chain" to propagate through, and effectively an extra point of failure in the side fastapi server, which needs handling of failures for the registering/querying endpoints.
In particular, I feel like the extra "registering" call that each P worker now has to perform in its flow is the weakest link, so we really have to be sure the benefits from having D be aware of the req at step 0 outweigh this overhead.
To that extent it would be nice to have a broader benchmark sweep to confirm TTFT gains, possibly comparing with a prev version + "Refactored the sender thread using async coroutines" (in hindsight, this should've been a separate PR to help the review process here).
But I also understand that the alternative to the point above would be to "push" the dp_rank request-selection at the proxy level (or some Coordinator in front of the DP instances @robertgshaw2-redhat ) which would take away control from the connector and/or require more invasive changes. Therefore I am overall ok with this, but just wanted to bring up a few points for discussion.
One qq, what is currently stopping D from running query_requests before P registers the request (3/4)?
cc @wseaton for changes that are very similar to a past work of yours (+opinion on future failure handling?)
```python
http_log_level = logger.getEffectiveLevel()
# INFO logs of http are too noisy. Silence them.
# Setting vllm log level to DEBUG if we really want to see.
if http_log_level == logging.INFO:
    http_log_level = logging.WARNING
logging.getLogger("httpx").setLevel(http_log_level)
```
this is a bit arbitrary, we should probably either do it in the mooncakeconnector init or push a separate global change cc @markmc
Either way, we should log that we're "silencing" logs.
```diff
 class MooncakeConnectorMetadata(KVConnectorMetadata):
     def __init__(self):
-        self.reqs_to_recv: dict[ReqId, RecvReqMeta] = {}
+        self.reqs_to_recv: list[PullReqMeta] = []
```
why did we switch to a list here?
If we want to group the reqs by remote boostrap server, can't we just use a dict[server_addr, list[meta]] ?
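The grouping the comment suggests could be as simple as (the meta shape here is hypothetical):

```python
from collections import defaultdict

# Group request metadata by remote bootstrap server address, keeping a dict
# keyed by server addr rather than a flat list.
metas = [
    {"server_addr": "10.0.0.1:8000", "req_id": "a"},
    {"server_addr": "10.0.0.2:8000", "req_id": "b"},
    {"server_addr": "10.0.0.1:8000", "req_id": "c"},
]
reqs_to_recv: dict[str, list[dict]] = defaultdict(list)
for meta in metas:
    reqs_to_recv[meta["server_addr"]].append(meta)
```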
```python
class MooncakeBootstrapServer:
    """
    A centralized server running on the global rank 0 prefiller worker.
```
nit: could probably live in a separate mooncake_utils file
```python
        f"expected {self.tp_size}, got {payload.tp_size}"
    ),
)
```
logger debug with source payload info could help here
Cursor Bugbot has reviewed your changes and found 7 potential issues.
```python
        ready=asyncio.Event(),
    )
for p_req_id in metadata.reqs_not_processed:
    send_meta = self.reqs_need_send.pop(p_req_id)
```
KeyError from dict.pop without default argument
High Severity
The code calls self.reqs_need_send.pop(p_req_id) without a default value, which raises KeyError when p_req_id doesn't exist in the dictionary. Since TruncatingDict doesn't implement the pop method, it falls back to the base MutableMapping.pop, which requires the key to exist unless a default is provided. The code then checks if send_meta:, suggesting it expects None as a possible return value, but the current implementation will crash instead of returning None.
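The crash can be avoided by passing a default, matching the `if send_meta:` check that follows (a minimal sketch with stand-in data):

```python
reqs_need_send = {"p1": "meta-1"}
processed = []
for p_req_id in ["p1", "p2"]:
    # pop with a default returns None instead of raising KeyError for a
    # missing key, which is what the `if send_meta:` check expects.
    send_meta = reqs_need_send.pop(p_req_id, None)
    if send_meta:
        processed.append(send_meta)
```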
```python
if d_req_id not in self.reqs_need_send:
    # This req is not enqueued in P side yet, create it here.
    self.reqs_need_send[d_req_id] = SendBlockMeta(
        p_req_id="", local_block_ids=[], ready=asyncio.Event()
```
Race condition with asyncio.Event from wrong loop
High Severity
asyncio.Event() is created without specifying the event loop, defaulting to the current thread's event loop. At line 702 in send_kv_to_decode (which runs in sender_loop) and line 1211 in record_send_reqs (also in sender_loop), these events are created but may be used across different event loops. The event at line 702 is created when handling a decoder request, while the sender_loop is a separate background event loop. This creates a cross-loop event sharing issue that can cause race conditions or incorrect event signaling behavior.
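A general pattern for the cross-loop problem (a sketch, not the connector's code): an `asyncio.Event` should only be set and awaited on the loop that owns it; other threads or loops signal it via `call_soon_threadsafe`.

```python
import asyncio
import threading

def signal_from_other_thread(owner_loop: asyncio.AbstractEventLoop,
                             event: asyncio.Event) -> None:
    # Schedules event.set() on the loop the event belongs to, which is the
    # only thread-safe way to signal it from outside that loop.
    owner_loop.call_soon_threadsafe(event.set)

async def main() -> bool:
    event = asyncio.Event()  # created on (and owned by) this loop
    loop = asyncio.get_running_loop()
    t = threading.Thread(target=signal_from_other_thread, args=(loop, event))
    t.start()
    await asyncio.wait_for(event.wait(), timeout=5)
    t.join()
    return event.is_set()

result = asyncio.run(main())
```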
```python
while True:
    for prefill_client in prefill_clients:
        for i in range(prefill_client["dp_size"]):
            yield prefill_client, i
```
Infinite generator causes early prefiller cycling
Medium Severity
The prefiller_cycle generator is initialized at line 115 before get_prefiller_info completes, causing it to iterate with prefill_client["dp_size"] being undefined or zero. At line 38, range(prefill_client["dp_size"]) will be range(0) since dp_size is only set later in line 60 of get_prefiller_info. This means the first request(s) may skip prefiller workers or behave incorrectly until get_prefiller_info finishes and populates dp_size.
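One way to avoid iterating an empty range (field names are hypothetical) is to guard the generator until `dp_size` has been populated:

```python
# Sketch: skip clients whose dp_size has not been populated yet instead of
# silently yielding nothing for them. (A real implementation would also need
# to bail out or wait if no client is ready at all.)
def prefiller_cycle(prefill_clients):
    while True:
        for client in prefill_clients:
            for i in range(client.get("dp_size") or 0):
                yield client, i

clients = [{"name": "p0", "dp_size": 2}, {"name": "p1"}]  # p1 not ready yet
it = prefiller_cycle(clients)
picks = [next(it)[1] for _ in range(4)]
```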
```diff
 if block_ids:
     # Already gone through request_finished()
-    send_meta = self.reqs_need_send[req_id]
+    send_meta = self.reqs_need_send[p_req_id]
```
Potential KeyError accessing reqs_need_send without check
High Severity
The code accesses self.reqs_need_send[p_req_id] assuming the entry exists based on a comment saying "Already gone through request_finished()". However, there's a race condition where the scheduler's metadata with non-empty block_ids could arrive at the worker before the decoder's ZMQ request creates the entry in send_kv_to_decode (lines 698-704). Since reqs_need_send uses TruncatingDict which raises KeyError on missing keys, this can crash when metadata processing races ahead of decoder request handling.
```python
for remote_tp_rank in remote_tp_ranks:
    worker_addr = self._remote_agents[remote_engine_id][remote_dp_rank][
        remote_tp_rank
    ][0]
```
Missing validation causes KeyError for nested dict
High Severity
The code accesses a deeply nested dictionary self._remote_agents[remote_engine_id][remote_dp_rank][remote_tp_rank][0] without validating that remote_dp_rank and remote_tp_rank exist in the structure. While remote_engine_id is checked at line 1168, the specific dp_rank and tp_rank keys could be missing from the bootstrap server response, causing KeyError when attempting the nested access. The bootstrap server might not have registered workers for all expected rank combinations.
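A validating lookup could look like this (the structure and names are assumptions based on the excerpt above, not the actual connector API):

```python
# Sketch: validate each nesting level so a missing dp/tp rank raises a clear
# error instead of a bare KeyError deep inside the transfer path.
def lookup_worker_addr(remote_agents, engine_id, dp_rank, tp_rank):
    tp_map = remote_agents.get(engine_id, {}).get(dp_rank)
    if tp_map is None or tp_rank not in tp_map:
        raise ValueError(
            f"No registered worker for engine={engine_id}, "
            f"dp_rank={dp_rank}, tp_rank={tp_rank}"
        )
    return tp_map[tp_rank][0]

agents = {"engine-a": {0: {0: ("tcp://10.0.0.1:5555", None)}}}
addr = lookup_worker_addr(agents, "engine-a", 0, 0)
try:
    lookup_worker_addr(agents, "engine-a", 1, 0)
    missing_raised = False
except ValueError:
    missing_raised = True
```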
```python
"do_remote_decode": False,
"do_remote_prefill": True,
"remote_bootstrap_addr": prefill_client_info["bootstrap_addr"],
"remote_engine_id": prefill_client_info["dp_engine_id"][prefill_dp_rank],
```
KeyError from dp_engine_id with non-sequential ranks
High Severity
The prefiller_cycle generator yields ranks from 0 to dp_size-1 (line 38), but dp_engine_id dictionary is populated with actual dp_rank values from the bootstrap server (line 59), which may not be sequential from zero. When accessing prefill_client_info["dp_engine_id"][prefill_dp_rank] at line 301, if the bootstrap server returns non-sequential dp_ranks (e.g., ranks 2 and 3 instead of 0 and 1), the code will access rank 0 which doesn't exist in the dictionary, causing KeyError.
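Iterating the dp_ranks actually present in `dp_engine_id` avoids the sequential-rank assumption (a sketch with hypothetical field names):

```python
# Sketch: yield the dp_ranks that actually exist in dp_engine_id rather than
# range(dp_size), so non-sequential ranks (e.g. {2, 3}) are handled.
def dp_rank_cycle(client):
    while True:
        for dp_rank in sorted(client["dp_engine_id"]):
            yield dp_rank

client = {"dp_engine_id": {2: "eng-2", 3: "eng-3"}}
it = dp_rank_cycle(client)
ranks = [next(it) for _ in range(4)]
```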
```python
# Initialize round-robin iterators
app.state.prefill_iterator = prefiller_cycle(app.state.prefill_clients)
app.state.decode_iterator = itertools.cycle(range(len(app.state.decode_clients)))
```
Empty client lists cause StopIteration on requests
High Severity
When no --prefill or --decode arguments are provided, the iterators at lines 115-116 are created over empty collections. Line 115 creates a generator that never yields for empty prefill_clients, and line 116 creates itertools.cycle(range(0)) for empty decode_clients. When get_next_client calls next() on these iterators at lines 241 or 243, it raises StopIteration, crashing the request handler. There's no validation that at least one server of each type is configured.
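A startup check along these lines would surface the misconfiguration immediately (function and argument names are illustrative):

```python
# Sketch: fail fast at startup when either pool is empty, instead of hitting
# StopIteration on the first proxied request.
def validate_clients(prefill_clients, decode_clients):
    if not prefill_clients:
        raise ValueError("at least one --prefill server must be configured")
    if not decode_clients:
        raise ValueError("at least one --decode server must be configured")

validate_clients(["p0"], ["d0"])  # valid configuration passes silently
try:
    validate_clients([], ["d0"])
    raised = False
except ValueError:
    raised = True
```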
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Rework the mooncake connector to achieve better performance and prepare for more features in the future.

Introduce a central bootstrap server on P.
init phase:
- All P workers register their info (dp/tp/pp rank, zmq worker addr) with the bootstrap server.
- After all P workers have finished registering, the proxy and D workers can query it when they meet a new engine_id.
Note:
(deprecated)
Thanks to #33037 and #32937 we can drop all workarounds now.
This design is partially inspired by sglang, aiming to improve Time To First Token (TTFT) performance and to lay the groundwork for a future layerwise transfer feature.
With random dataset and max-concurrency = 1, TTFT on two A10 machines running Qwen2.5-7B-Instruct is improved:
random-input-len 128: 83.41ms -> 77.53ms
random-input-len 1024: 252.35ms -> 246.25ms
This result shows a TTFT win of about 6ms from P and D running simultaneously.
Other highlights of this PR:
- Fix `_reqs_to_process` leak on abort (#26012)

Test Plan
Test Result