feat(mocker): add offline disagg replay#7617
Conversation
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
WalkthroughThis pull request introduces a new Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/mocker/src/replay/router/offline.rs (1)
374-397:⚠️ Potential issue | 🟠 Major
track_prefill_tokensis still ignored during offline admission.
SequenceRequestnow carries the flag, butadmit_request()still computes candidate load withslots.potential_blocks_and_tokens(...), which always includes prompt-side tokens. That means offline replay keeps charging prompt-side load during worker selection even whenrouter_track_prefill_tokens=false(notably the decode pool in disagg mode). This path needs the samepotential_blocks_and_tokens_with_prefill_tracking(..., request.track_prefill_tokens)change thatlib/kv-router/src/scheduling/queue.rsgot.Suggested fix
fn admit_request(&mut self, request: PendingRequest) -> Result<usize> { - let (decode_blocks, prefill_tokens) = self.slots.potential_blocks_and_tokens( - request.token_seq.as_deref(), - request.isl_tokens, - request.overlaps.clone(), - ); + let (decode_blocks, prefill_tokens) = self + .slots + .potential_blocks_and_tokens_with_prefill_tracking( + request.token_seq.as_deref(), + request.isl_tokens, + request.overlaps.clone(), + request.track_prefill_tokens, + ); let scheduling_request = request.scheduling_request(decode_blocks, prefill_tokens);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/mocker/src/replay/router/offline.rs` around lines 374 - 397, The offline admit_request path currently calls self.slots.potential_blocks_and_tokens(...) which always counts prompt-side (prefill) tokens; change that call to use the prefill-aware API self.slots.potential_blocks_and_tokens_with_prefill_tracking(..., request.track_prefill_tokens) so worker selection respects the SequenceRequest.track_prefill_tokens flag. Update the call site in admit_request (where decode_blocks and prefill_tokens are computed) to pass request.track_prefill_tokens into potential_blocks_and_tokens_with_prefill_tracking and keep the rest of the scheduling flow (scheduling_request, selector.select_worker, SequenceRequest construction) unchanged.
🧹 Nitpick comments (3)
lib/mocker/src/replay/offline/runtime_utils.rs (1)
79-102: Refutable pattern match on single-variant enum may break if new event kinds are added.The
letbinding on lines 88-94 uses a refutable pattern that assumesSimulationEventKind::WorkerCompletionis the only variant. IfSimulationEventKindgains additional variants in the future, this will panic at runtime instead of producing a compile error.Consider using
matchwith an explicit exhaustive arm or adding a#[deny(irrefutable_let_patterns)]check if this assumption should remain stable.♻️ Optional: Use explicit match for future-proofing
- let SimulationEventKind::WorkerCompletion { - stage, - worker_idx, - completed_requests, - output_signals, - kv_events, - } = event.kind; + let (stage, worker_idx, completed_requests, output_signals, kv_events) = match event.kind { + SimulationEventKind::WorkerCompletion { + stage, + worker_idx, + completed_requests, + output_signals, + kv_events, + } => (stage, worker_idx, completed_requests, output_signals, kv_events), + };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/mocker/src/replay/offline/runtime_utils.rs` around lines 79 - 102, The current refutable pattern in pop_ready_worker_completion destructures event.kind as SimulationEventKind::WorkerCompletion and can panic if new variants are added; change the code to explicitly match event.kind (from the SimulationEvent returned by events.peek()/events.pop()) and handle the WorkerCompletion arm by constructing and returning WorkerCompletionPayload, while adding a wildcard arm that safely returns None (or logs/handles unexpected variants) to avoid runtime panics.lib/kv-router/src/protocols.rs (1)
1030-1056: Add a legacy-payload test, not just a round-trip.This only proves
falsesurvives when the field is present. The compatibility contract from the new serde default is that olderAddRequestpayloads withouttrack_prefill_tokensstill deserialize totrue, so please lock that in with a fixture that omits the field.Example regression test
+ #[test] + fn test_active_sequence_add_request_defaults_track_prefill_tokens_for_legacy_payloads() { + let legacy = r#"{"request_id":"req-123","worker":{"worker_id":7,"dp_rank":0},"data":{"AddRequest":{"token_sequence":[11,22],"isl":128,"overlap":1,"expected_output_tokens":32}},"router_id":9,"lora_name":null}"#; + let deserialized: ActiveSequenceEvent = serde_json::from_str(legacy).unwrap(); + + match deserialized.data { + ActiveSequenceEventData::AddRequest { + track_prefill_tokens, + .. + } => assert!(track_prefill_tokens), + _ => panic!("expected add request event"), + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/kv-router/src/protocols.rs` around lines 1030 - 1056, The current test only round-trips an event that includes track_prefill_tokens=false; add a regression test that deserializes a legacy JSON payload which omits the track_prefill_tokens field and asserts the resulting ActiveSequenceEvent (specifically ActiveSequenceEventData::AddRequest) yields track_prefill_tokens == true. Create a new test (or extend test_active_sequence_add_request_serialization_preserves_track_prefill_tokens) that constructs a JSON string representing an AddRequest without the track_prefill_tokens key, calls serde_json::from_str::<ActiveSequenceEvent>(), matches on ActiveSequenceEventData::AddRequest and asserts track_prefill_tokens is true to lock in the backward-compat default behavior.lib/bindings/python/tests/test_replay.py (1)
593-614: Consider adding a@pytest.mark.timeoutdecorator.This test exercises the full synthetic disagg replay pipeline, which involves worker execution. While the speedup ratio is high (1000.0), adding a timeout guard would be consistent with the subprocess tests and prevent CI hangs if something goes wrong.
🛡️ Suggested timeout decorator
+@pytest.mark.timeout(30) def test_run_synthetic_trace_replay_disagg_preserves_expected_output_tokens(): report = run_synthetic_trace_replay(As per coding guidelines: "add
@pytest.mark.timeout() for any test that may exceed 30s or uses polling/sleeps/subprocess waits"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/bindings/python/tests/test_replay.py` around lines 593 - 614, Add a pytest timeout decorator to the test_run_synthetic_trace_replay_disagg_preserves_expected_output_tokens function to prevent CI hangs; annotate the function with `@pytest.mark.timeout`(30) (or another appropriate seconds value) placed immediately above its def, and ensure pytest is imported in the test file if not already so the decorator resolves.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/benchmarks/mocker-trace-replay.md`:
- Around line 171-178: The docs currently present staged disagg args (flags
--prefill-engine-args and --decode-engine-args and staged JSON fields) as
conveniences; update the text to state them as required for offline disagg
replay and document the validator constraints: each --prefill-engine-args JSON
must include worker_type="prefill" and each --decode-engine-args JSON must
include worker_type="decode", both staged configs must set the same block_size,
and pool sizes are controlled via --num-prefill-workers and
--num-decode-workers; mention these requirements where the staged args are
described (references: flags --prefill-engine-args, --decode-engine-args, fields
worker_type and block_size, and flags --num-prefill-workers /
--num-decode-workers) so users won’t hit validation errors.
In `@docs/mocker/mocker.md`:
- Around line 131-134: The AIC guidance and any references to dynamo.replay must
be updated to mention split engine args for disaggregated offline replay: when
using `--replay-mode offline` with disagg you should use `--prefill-engine-args`
and `--decode-engine-args` (plus `--num-prefill-workers` and
`--num-decode-workers`) instead of `--extra-engine-args`; update the AIC section
text to conditionally show the aggregated example using `--extra-engine-args`
and a separate disagg example showing `--prefill-engine-args`,
`--decode-engine-args`, and the worker flags, and change any instructions that
currently tell `dynamo.replay` users to only use `--extra-engine-args` to
include the disagg path and flags.
In `@lib/bindings/c/src/lib.rs`:
- Around line 489-492: The bookkeeping path is not receiving the disagg decode
RouterConfigOverride, so decode_router.add_request is called with None and the
assume_kv_reuse/track_prefill_tokens overrides never reach
KvRouter::add_request; fix by creating or reusing the same RouterConfigOverride
(e.g., RouterConfigOverride { overlap_score_weight: Some(0.0), assume_kv_reuse:
Some(false), track_prefill_tokens: Some(false) }) used for query-time and pass
it into the bookkeeping call instead of None (i.e., replace the None argument to
decode_router.add_request(...) with Some(router_config_override) or propagate
the existing variable) so KvRouter::add_request sees the override.
In `@lib/bindings/python/rust/llm/replay.rs`:
- Around line 642-644: The code accepts num_prefill_workers and
num_decode_workers but ignores them when the replay path stays aggregated; add a
validation in the function that constructs/dispatches the replay (the function
handling num_workers, num_prefill_workers, num_decode_workers in replay.rs) to
reject (return Err or panic with a clear message) any non-default
num_prefill_workers or num_decode_workers when prefill_engine_args and
decode_engine_args are not provided (i.e., when choosing the aggregated arm); do
the same guard where the parameters are later processed (the block around the
alternative arm handling at the section covering lines ~656-673) to ensure
callers are informed rather than silently ignored.
In `@lib/bindings/python/src/dynamo/replay/main.py`:
- Around line 21-48: In _load_engine_args, json.loads(raw_args) can produce
non-dict values (e.g., list, null, scalar) so add an explicit validation right
after raw = json.loads(raw_args) that checks isinstance(raw, dict) and raises a
ValueError (e.g., "engine-args must be a JSON object") if not; keep all
subsequent logic that reads keys from raw (worker_type handling,
planner_profile_data resolution via resolve_planner_profile_data, and final
return via MockEngineArgs.from_json) unchanged.
---
Outside diff comments:
In `@lib/mocker/src/replay/router/offline.rs`:
- Around line 374-397: The offline admit_request path currently calls
self.slots.potential_blocks_and_tokens(...) which always counts prompt-side
(prefill) tokens; change that call to use the prefill-aware API
self.slots.potential_blocks_and_tokens_with_prefill_tracking(...,
request.track_prefill_tokens) so worker selection respects the
SequenceRequest.track_prefill_tokens flag. Update the call site in admit_request
(where decode_blocks and prefill_tokens are computed) to pass
request.track_prefill_tokens into
potential_blocks_and_tokens_with_prefill_tracking and keep the rest of the
scheduling flow (scheduling_request, selector.select_worker, SequenceRequest
construction) unchanged.
---
Nitpick comments:
In `@lib/bindings/python/tests/test_replay.py`:
- Around line 593-614: Add a pytest timeout decorator to the
test_run_synthetic_trace_replay_disagg_preserves_expected_output_tokens function
to prevent CI hangs; annotate the function with `@pytest.mark.timeout`(30) (or
another appropriate seconds value) placed immediately above its def, and ensure
pytest is imported in the test file if not already so the decorator resolves.
In `@lib/kv-router/src/protocols.rs`:
- Around line 1030-1056: The current test only round-trips an event that
includes track_prefill_tokens=false; add a regression test that deserializes a
legacy JSON payload which omits the track_prefill_tokens field and asserts the
resulting ActiveSequenceEvent (specifically ActiveSequenceEventData::AddRequest)
yields track_prefill_tokens == true. Create a new test (or extend
test_active_sequence_add_request_serialization_preserves_track_prefill_tokens)
that constructs a JSON string representing an AddRequest without the
track_prefill_tokens key, calls serde_json::from_str::<ActiveSequenceEvent>(),
matches on ActiveSequenceEventData::AddRequest and asserts track_prefill_tokens
is true to lock in the backward-compat default behavior.
In `@lib/mocker/src/replay/offline/runtime_utils.rs`:
- Around line 79-102: The current refutable pattern in
pop_ready_worker_completion destructures event.kind as
SimulationEventKind::WorkerCompletion and can panic if new variants are added;
change the code to explicitly match event.kind (from the SimulationEvent
returned by events.peek()/events.pop()) and handle the WorkerCompletion arm by
constructing and returning WorkerCompletionPayload, while adding a wildcard arm
that safely returns None (or logs/handles unexpected variants) to avoid runtime
panics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6378df3c-ba91-45aa-a8c8-1143db98f0a9
📒 Files selected for processing (42)
components/src/dynamo/common/configuration/groups/kv_router_args.pycomponents/src/dynamo/router/README.mdcomponents/src/dynamo/router/__main__.pydocs/benchmarks/mocker-trace-replay.mddocs/components/router/README.mddocs/components/router/router-guide.mddocs/mocker/mocker.mdlib/bench/kv_router/active_sequences_bench.rslib/bindings/c/src/lib.rslib/bindings/python/rust/llm/entrypoint.rslib/bindings/python/rust/llm/replay.rslib/bindings/python/src/dynamo/replay/api.pylib/bindings/python/src/dynamo/replay/main.pylib/bindings/python/tests/test_replay.pylib/kv-router/src/protocols.rslib/kv-router/src/scheduling/config.rslib/kv-router/src/scheduling/local.rslib/kv-router/src/scheduling/policy.rslib/kv-router/src/scheduling/queue.rslib/kv-router/src/scheduling/types.rslib/kv-router/src/sequences/multi_worker.rslib/kv-router/src/sequences/single.rslib/llm/src/kv_router.rslib/llm/src/kv_router/prefill_router.rslib/llm/src/kv_router/scheduler.rslib/llm/src/kv_router/sequence.rslib/mocker/src/replay/entrypoints.rslib/mocker/src/replay/mod.rslib/mocker/src/replay/offline/README.mdlib/mocker/src/replay/offline/disagg.rslib/mocker/src/replay/offline/events.rslib/mocker/src/replay/offline/mod.rslib/mocker/src/replay/offline/multi.rslib/mocker/src/replay/offline/runtime_utils.rslib/mocker/src/replay/offline/state.rslib/mocker/src/replay/router/offline.rslib/mocker/src/replay/router/online.rslib/mocker/src/replay/router/shared.rslib/mocker/src/replay/validate.rslib/mocker/src/scheduler/mod.rslib/mocker/src/scheduler/sglang/core.rslib/mocker/src/scheduler/vllm/core.rs
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
This reverts commit 6bd84f8.
Add an offline disaggregated replay path with separate prefill/decode worker pools, staged Python/CLI bindings, and replay docs updates. This also keeps the aggregated replay path intact while extending tests around disagg routing, metrics, and timing behavior.
Summary by CodeRabbit
Release Notes
New Features
--router-track-prefill-tokens/--no-router-track-prefill-tokens).Documentation