Conversation
📝 Walkthrough

Adds worker-only (direct/none) frontend support.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant DoSweep
    participant SetupHead
    participant Frontend
    participant Backend
    participant Health
    participant Worker
    User->>DoSweep: submit job (effective_frontend_type = "direct"/"none")
    alt effective_frontend_type == "direct" or "none"
        DoSweep->>SetupHead: skip head infra startup
        Note over SetupHead: NATS/etcd not started on head
    else
        DoSweep->>SetupHead: start head infra (NATS/etcd) bound to host_ip
    end
    DoSweep->>Frontend: start_frontend()
    alt effective_frontend_type == "direct" or "none"
        Frontend-->>DoSweep: return early (frontend disabled)
    else
        Frontend->>Frontend: configure router/nginx on frontend_port
    end
    DoSweep->>Backend: launch worker(s) (sglang.launch_server if sglang/direct/none)
    Backend->>Worker: start SGLang worker process
    DoSweep->>Health: wait_for_model()
    alt effective_frontend_type == "direct" or "none"
        Health->>Worker: poll /health
        Worker-->>Health: 200
        Health->>Worker: poll /v1/models
        Worker-->>Health: non-empty → Ready
    else
        Health->>Frontend: check via frontend/router
        Frontend-->>Health: Ready
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Force-pushed from 0d0462c to 76a89ed (Compare)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/srtctl/cli/setup_head.py`:
- Around line 134-146: The current NATS startup command binds to 0.0.0.0 (cmd
list built with "-a", "0.0.0.0"), which is overly broad; change the bind address
to the node's private interface (use the existing host_ip variable) so NATS only
listens on the cluster interface. Update the cmd construction in setup_head.py
to replace the literal "0.0.0.0" with host_ip (and ensure host_ip is computed
earlier), and confirm any port checks like wait_for_port(head_node, 4222) use
the same host_ip/hostname to match the bind address.
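A minimal sketch of the suggested change, assuming `nats-server`'s standard `-a`/`-p` flags and that the head node's hostname resolves to its cluster-facing interface (`build_nats_cmd` and `resolve_host_ip` are hypothetical helpers, not the actual setup_head.py code):

```python
import socket


def resolve_host_ip(hostname: str) -> str:
    """Resolve the node's private IP (assumes the hostname maps to the
    cluster interface, as the review comment implies)."""
    return socket.gethostbyname(hostname)


def build_nats_cmd(host_ip: str, port: int = 4222) -> list[str]:
    """NATS startup command bound to the cluster interface instead of
    the overly broad 0.0.0.0."""
    return ["nats-server", "-a", host_ip, "-p", str(port)]
```

Any `wait_for_port(head_node, 4222)` check should then target the same `host_ip` so the probe matches the bind address.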
🧹 Nitpick comments (2)
src/srtctl/cli/do_sweep.py (1)
82-91: Consider more robust signature detection.

Catching `TypeError` for backward compatibility is fragile: it could mask unrelated `TypeError` exceptions from within `endpoints_to_processes`. Consider using `inspect.signature` to check if `base_sys_port` is supported, or document this as a known limitation.

♻️ Suggested alternative using inspect

```diff
+import inspect
+
 @functools.cached_property
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
     # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
     # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    else:
         return self.backend.endpoints_to_processes(self.endpoints)
```

src/srtctl/core/runtime.py (1)
129-146: Port calculation logic is correct, but exception handling is overly broad.

The `sys_port_base` calculation correctly constrains ports to 10000–29900 (staying under the signed 16-bit max). However, using a bare `except Exception` when only `ValueError` is expected from `int()` could mask unexpected errors.

♻️ Narrow the exception type

```diff
     try:
         job_seed = int(job_id)
-    except Exception:
+    except ValueError:
         job_seed = abs(hash(job_id))
     sys_port_base = 10000 + (job_seed % 200) * 100
```
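The port math in the snippet above can be reproduced as a free function (illustrative only; in runtime.py this is inline code, not a standalone helper):

```python
def sys_port_base(job_id: str) -> int:
    """Derive a job-specific DYN_SYSTEM_PORT base in [10000, 29900].

    Numeric job IDs seed directly; anything else falls back to a hash,
    so two jobs only share a base when their seeds agree modulo 200."""
    try:
        job_seed = int(job_id)
    except ValueError:
        job_seed = abs(hash(job_id))
    return 10000 + (job_seed % 200) * 100
```

Every possible result is a multiple of 100 and stays below 32767, which matters for Dynamo's signed 16-bit port handling mentioned later in the thread.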
ishandhanani left a comment:
Not sure if we want frontend type none.
When we run dynamo we need the frontend and worker to both be on when it's agg single node. When running stock sglang, we just want the worker. This should be handled under the hood.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 67-71: The comment claiming the health timeout "matches 1:30:00
time limit" is incorrect because slurm.time_limit is 02:00:00; update the
comment and/or the values under health_check (max_attempts and/or
interval_seconds) so they correctly reflect the intended timeout relationship
with slurm.time_limit. Specifically, edit the comment above health_check and
adjust health_check.max_attempts (or interval_seconds) so max_attempts *
interval_seconds equals the desired timeout (e.g., 7200s for 2:00:00) and
reference the health_check, max_attempts, interval_seconds, and slurm.time_limit
symbols when making the change.
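The relationship the comment asks to keep consistent is simple arithmetic; a quick check (illustrative helper, not part of srtctl):

```python
def health_check_window_seconds(max_attempts: int, interval_seconds: int) -> int:
    """Total time the health check polls before giving up."""
    return max_attempts * interval_seconds


# 540 attempts at 10s is 5400s = 1:30:00, which must stay below the
# Slurm walltime (02:00:00 = 7200s) with margin left for load/warmup.
```

With `max_attempts: 720` and `interval_seconds: 10` the window is exactly 7200s, i.e. the full 2-hour walltime.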
In `@src/srtctl/core/runtime.py`:
- Around line 137-175: The code assumes only "direct" needs special handling:
update port selection and startup to treat "none" like "direct". In runtime.py
change the frontend_port logic to use 30000 when effective_frontend_type is
"direct" or "none" (i.e., frontend_port = 30000 if effective_frontend_type in
("direct", "none") else 8000). In FrontendStageMixin.start_frontend() update the
skip condition to return/skip startup when effective_frontend_type is "direct"
or "none" so it does not attempt to instantiate a non-existent frontend
implementation; ensure references to config.frontend.type or
effective_frontend_type are used consistently to detect "none".
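The port-selection and skip logic the comment describes reduces to two small predicates. This is a sketch of the requested behavior with the values named in the comment (30000 is SGLang's default worker port, 8000 the router/frontend port); the real code lives in runtime.py and `FrontendStageMixin.start_frontend()`:

```python
def frontend_port(effective_frontend_type: str) -> int:
    """In "direct"/"none" mode clients hit the worker's own OAI server
    directly; otherwise they go through the router/frontend."""
    return 30000 if effective_frontend_type in ("direct", "none") else 8000


def should_start_frontend(effective_frontend_type: str) -> bool:
    """start_frontend() should skip startup entirely for "direct"/"none"
    so it never instantiates a non-existent frontend implementation."""
    return effective_frontend_type not in ("direct", "none")
```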
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 64-68: This file sets a long health_check timeout (max_attempts:
540, interval_seconds: 10) but lacks an explicit Slurm time limit; add a
slurm.time_limit setting (e.g., time_limit: "02:00:00") to the YAML so the job
walltime exceeds the 90-minute health check window—place the time_limit under
the same slurm configuration block used in other variants (match the 1K_8K
pattern) to ensure the job isn't killed before health checks complete.
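Following the 1K_8K pattern, the missing block might look like the fragment below (key names and indentation are assumed to match the other variants; exact placement depends on the recipe schema):

```yaml
slurm:
  time_limit: "02:00:00"   # walltime > 540 * 10s = 1:30:00 health window

health_check:
  max_attempts: 540
  interval_seconds: 10
```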
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 69-71: The comment for health_check (max_attempts: 540,
interval_seconds: 10) incorrectly says "matches 1:30:00 time limit" while
slurm.time_limit is "02:00:00"; update the comment to accurately reflect that
540 attempts * 10s = 5400s (1:30:00) is a subset of the job time (2:00:00) or
change wording to state it equals 1:30:00 and is less than the slurm.time_limit;
adjust the inline comment next to health_check, max_attempts, or
interval_seconds accordingly to reference the correct durations and relationship
to slurm.time_limit.
In `@src/srtctl/core/runtime.py`:
- Around line 227-236: The cargo_env file is being created under configs_dir
(variable cargo_env) which is inside the source tree; change it to live in the
ephemeral log_dir instead: build the path as log_dir / "cargo_env" (instead of
configs_dir / "cargo_env"), ensure cargo_env.parent.mkdir(parents=True,
exist_ok=True) and cargo_env.touch(exist_ok=True) still run, and update the
container_mounts entry to mount cargo_env.resolve() to Path("/root/.cargo/env");
keep the try/except best-effort behavior and do not create or modify any
gitignore changes.
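A sketch of the suggested change, with `prepare_cargo_env` as a hypothetical helper standing in for the inline runtime.py code; it keeps the best-effort try/except and returns the mount pair for `container_mounts`:

```python
from pathlib import Path


def prepare_cargo_env(log_dir: Path) -> tuple[Path, tuple[Path, Path]]:
    """Create the cargo_env placeholder under the ephemeral log_dir
    (instead of configs_dir inside the source tree) and return the
    (host_path, container_path) mount pair."""
    cargo_env = log_dir / "cargo_env"
    try:
        cargo_env.parent.mkdir(parents=True, exist_ok=True)
        cargo_env.touch(exist_ok=True)
    except OSError:
        pass  # best-effort, matching the original behavior
    return cargo_env, (cargo_env.resolve(), Path("/root/.cargo/env"))
```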
🧹 Nitpick comments (3)
src/srtctl/core/runtime.py (1)
162-173: Port collision handling is reasonable, but collision probability should be understood.

With 200 possible base values (`job_seed % 200`), there's a ~5% chance of collision at ~14 concurrent jobs sharing the same nodes (birthday paradox). This is acceptable given:
- Ports are per-node, so only jobs sharing nodes can collide
- Each job gets a 100-port range for internal allocation headroom
- The constraint of staying under 32767 for Dynamo's i16 requirement limits options
Consider documenting this trade-off in a comment if cluster density is expected to be high.
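The quoted ~5% matches the per-job reading of that figure: a 14th job colliding with any of 13 existing bases is about 6%, while the any-pair birthday probability across all 14 jobs is considerably higher (roughly 37%). Both readings can be checked directly (hypothetical helpers, not srtctl code):

```python
def p_any_pair_collision(n_jobs: int, n_slots: int = 200) -> float:
    """Probability that at least two of n_jobs share a port base,
    assuming bases are uniform over n_slots values (birthday problem)."""
    p_all_distinct = 1.0
    for i in range(n_jobs):
        p_all_distinct *= (n_slots - i) / n_slots
    return 1.0 - p_all_distinct


def p_new_job_collides(n_existing: int, n_slots: int = 200) -> float:
    """Probability that one additional job lands on a base already
    taken by any of n_existing jobs."""
    return 1.0 - ((n_slots - 1) / n_slots) ** n_existing
```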
src/srtctl/core/health.py (1)
339-480: Consider extracting common polling infrastructure, but current structure is acceptable.

The `wait_for_model` function has some structural duplication between the direct/none branch and the sglang/dynamo branches (timeout handling, progress reporting, exception handling). However, the core logic differs enough that the current explicit structure is readable and maintainable.

src/srtctl/cli/do_sweep.py (1)
82-91: Consider catching a more specific exception or using introspection.

The `TypeError` catch is a pragmatic backward-compat approach, but it may inadvertently swallow unrelated `TypeError`s from bugs inside `endpoints_to_processes`. Consider using `inspect.signature` to check if `base_sys_port` is accepted, or at minimum, log when the fallback is triggered so issues are visible.

♻️ Optional: Use signature introspection instead of try/except

```diff
+import inspect
+
 @functools.cached_property
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
-    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
-    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
-        return self.backend.endpoints_to_processes(self.endpoints)
+    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
+    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    return self.backend.endpoints_to_processes(self.endpoints)
```
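The introspection pattern the suggestion relies on generalizes to any optional keyword. A self-contained sketch (the two backend functions are made-up stand-ins for real backends with and without `base_sys_port` support):

```python
import inspect


def call_with_optional_kwarg(fn, *args, **optional):
    """Pass each keyword only if fn's signature accepts it, instead of
    catching TypeError after the fact."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {k: v for k, v in optional.items() if accepts_var_kw or k in params}
    return fn(*args, **kwargs)


def old_backend(endpoints):
    """Legacy signature: no base_sys_port support."""
    return list(endpoints)


def new_backend(endpoints, base_sys_port=10000):
    """New signature: assigns each endpoint a port offset from the base."""
    return [(e, base_sys_port + i * 100) for i, e in enumerate(endpoints)]
```

Unlike the try/except version, a genuine `TypeError` raised inside the backend still propagates.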
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 66-68: The health_check block contains an invalid line ("\g<1>540
...") causing YAML parse failures; update the health_check mapping (the block
containing interval_seconds and the malformed line) by removing the invalid
token and adding a proper key/value pair max_attempts: 540 so the block reads a
valid mapping with max_attempts and interval_seconds.
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml`:
- Line 70: Trailing blank lines exist at the end of the YAML document; open the
YAML file and remove extra blank lines so the file ends with a single newline
only (no multiple empty lines at EOF), ensuring the final character is a single
newline to satisfy the linter for dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml.
- Around line 64-68: The health_check block contains a corrupted token
"\g<1>540" which breaks YAML; instead remove that token and add a proper
key/value such as timeout_seconds: 540 under the health_check mapping (preserve
indentation and the existing interval_seconds: 10), so the block reads with
valid keys health_check -> timeout_seconds and interval_seconds.
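After removing the corrupted `\g<1>540` token (apparently a leaked regex backreference from a scripted edit), the block should parse as a plain mapping, e.g.:

```yaml
health_check:
  max_attempts: 540      # 540 * 10s = 1:30:00 window
  interval_seconds: 10
```

The 8K_1K comment suggests `timeout_seconds: 540` instead of `max_attempts`; whichever key the recipe schema actually defines should be used consistently across the three variants.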
Force-pushed from 31b7537 to 74947de (Compare)
Updated PR description
1. Support worker-only mode (`frontend.type: none`)

What I added:

- `agg_workers=1`: launch only `sglang.launch_server` and run health + SA-bench directly against the worker OAI endpoint (no router).
- `agg_workers>1`: use the SGLang router.
- New recipes:
  - recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml
  - recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml (includes `slurm.time_limit: 02:00:00`)
  - recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml
- `sbatch_directives.comment` (JSON) for long model-load/warmup.
- NATS bound to the node's cluster interface instead of `0.0.0.0`.

Why
Stock SGLang aggregated with a single worker already exposes an OAI-compatible server; launching an extra router/frontend prevents hitting the worker directly. This change makes the correct behavior the default and keeps router semantics for multi-worker and disaggregated modes.
2. AGG DSR-1 SGLANG FP8 H100

1K/1K (ISL=1024, OSL=1024)
- Recipe: h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K
- Concurrency sweep: 1x 2x 4x 8x 16x 32x 64x 128x 256x 512
- Time limit: 02:00:00

8K/1K (ISL=8192, OSL=1024)
- Recipe: h100-dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K
- Concurrency sweep: 1x 2x 64

1K/8K (ISL=1024, OSL=8192)
- Recipe: h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K
- Concurrency sweep: 1x 2x 4x 8x 16x 32x 64x 128x 176x 256
- Time limit: 03:00:00

Key settings (common across these 3 recipes)
- Image: docker://lmsysorg/sglang:v0.5.8-cu130-runtime
- Model: served-model-name: deepseek-ai/DeepSeek-R1-0528, precision: fp8, trust-remote-code: true
- agg_workers: 1 (single aggregated worker group)
- Parallelism: tensor-parallel-size: 8, pipeline-parallel-size: 2, data-parallel-size: 1
- Attention backend: flashinfer
- sbatch_directives.comment includes OccupiedIdleGPUs JobReaper exemption (60 min) for long load/warmup
- health_check: interval_seconds=10, max_attempts=720 (wait up to ~2 hours for the worker to come up)

Summary by CodeRabbit
New Features
Improvements
Documentation & Configuration