sglang router profiling for PD disagg #60
wenscarl wants to merge 10 commits into ishandhanani:main from
Conversation
📝 Walkthrough

Adds a prefill-server readiness wait loop to the disaggregated SLURM job template, optionally mounts and installs local SGLang source into containers, threads a new `worker_idx` through GPU command construction and profiling paths, exposes …

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Job as Slurm Job Script (disagg)
participant Prefill as Prefill Leader Service
participant Prof as Profiler (nsys)
participant Router as SGLang Router
Job->>Prefill: Resolve PREFILL_LEADER_NODE/IP
loop Poll /health every 30s (max 600s)
Job->>Prefill: GET http://<leader_ip>:30000/health
Prefill-->>Job: 200 OK / not-ready
end
alt health OK
Job->>Prof: start profiler (background) for leader
Job->>Router: launch router pointing at leader IP
Router-->>Job: router ready
else timeout
Job-->>Job: log warning and proceed
Job->>Prof: optionally start profiler (best-effort)
Job->>Router: launch router anyway
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
scripts/worker_setup/worker.py (3)
114-122: File handles are not closed on error paths.

The file handles for `frontend_stdout` and `frontend_stderr` are opened but never closed. If `run_command` fails or an exception occurs later, these handles will leak. Consider using a context manager or storing the handles for later cleanup:

```diff
- frontend_stdout = open("/logs/frontend.out", "w")
- frontend_stderr = open("/logs/frontend.err", "w")
-
- frontend_cmd = "python3 -m dynamo.frontend --http-port=8000"
- frontend_process = run_command(frontend_cmd, background=True, stdout=frontend_stdout, stderr=frontend_stderr)
+ with open("/logs/frontend.out", "w") as frontend_stdout, \
+      open("/logs/frontend.err", "w") as frontend_stderr:
+     frontend_cmd = "python3 -m dynamo.frontend --http-port=8000"
+     frontend_process = run_command(frontend_cmd, background=True, stdout=frontend_stdout, stderr=frontend_stderr)
```

Note: If the background process requires the file handles to remain open, store them and close them during cleanup/shutdown instead.
241-251: Missing `worker_idx` argument will cause runtime error.

The `get_gpu_command` function now requires `worker_idx` as a positional argument (see `command.py` line 146), but `setup_aggregated_worker` does not pass it. This will raise a `TypeError` at runtime.

```diff
 cmd_to_run = get_gpu_command(
     worker_type="aggregated",
+    worker_idx=worker_idx,
     sglang_config_path=sglang_config_path,
     host_ip=leader_ip,
     port=DIST_INIT_PORT,
     total_nodes=nodes_per_worker,
     rank=local_rank,
     profiler=profiler,
     dump_config_path=dump_config_path,
 )
```
222-235: Same file handle leak issue in aggregated worker.

Similar to `setup_prefill_worker`, the frontend file handles are opened but never properly closed.

Additionally, unlike `setup_prefill_worker` (line 110), this block is missing the `and not use_sglang_router` condition:

```diff
- if need_frontend:
+ if need_frontend and not use_sglang_router:
```

This inconsistency means the frontend would still be started in router mode for aggregated workers.
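The intended gating reduces to a one-line predicate; this helper is hypothetical, shown only to make the condition explicit:

```python
# Hypothetical helper: the frontend should start only when it is needed
# AND the sglang router is not handling routing.
def should_start_frontend(need_frontend: bool, use_sglang_router: bool) -> bool:
    return need_frontend and not use_sglang_router
```

Applying the same predicate in both the prefill and aggregated paths keeps the two worker types consistent.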
🧹 Nitpick comments (6)
srtslurm.yaml.example (1)
6-7: Consider using placeholder values in the example config.

The example file now contains what appear to be real cluster-specific paths and credentials (e.g., `/mnt/lustre01/users/slurm-shared/shuw/...`). For an example configuration file that serves as documentation, consider using obvious placeholders like `/path/to/your/srt-slurm` to avoid:
- Leaking internal infrastructure details
- Confusion when users copy this file directly
Also applies to: 18-18, 21-21, 25-26
recipies/gb200-fp4/1k1k/low-latency.yaml (1)
122-127: Fix indentation of commented-out benchmark block.

The commented-out benchmark block has excessive indentation (8 spaces). If preserved for reference, it should align with other top-level blocks.

```diff
-        #benchmark:
-        #  type: "sa-bench"
-        #  isl: 1024
-        #  osl: 1024
-        #  concurrencies: "256"
-        #  req_rate: "inf"
+#benchmark:
+#  type: "sa-bench"
+#  isl: 1024
+#  osl: 1024
+#  concurrencies: "256"
+#  req_rate: "inf"
```

src/srtctl/backends/sglang.py (1)
145-146: Remove or complete the dangling comment.

The comment "Always pass disaggregation-mode so profiling runs in PD mode" appears to reference code that was removed or moved. Either remove this orphaned comment or add the corresponding implementation.

```diff
- # Always pass disaggregation-mode so profiling runs in PD mode
  if isinstance(value, bool):
```

scripts/profiling/profile.sh (1)
112-112: Move pip install to container setup or job initialization.

Installing packages at profiling time introduces latency and potential failure points (network issues, pip conflicts). Consider pre-installing `lm-eval` and `tenacity` in the container image or during the job setup phase.

scripts/worker_setup/command.py (1)
11-21: Docstring missing `worker_idx` parameter description.

The `worker_idx` parameter was added but not documented in the docstring's Args section.

```diff
 Args:
     worker_type: "prefill", "decode", or "aggregated"
+    worker_idx: Worker index (0-indexed) for this worker type
     sglang_config_path: Path to generated sglang_config.yaml
```

src/srtctl/core/schema.py (1)
327-338: Remove or restore the commented-out validation code.

The validation that prevented profiling with multiple prefill/decode workers has been disabled via `pass` statements with commented-out `raise` blocks. This leaves dead code that:
- Obscures the actual behavior
- May confuse future maintainers
Either remove the commented code entirely if multi-worker profiling is now intentionally supported:
```diff
 if is_disaggregated:
-    if self.resources.prefill_workers and self.resources.prefill_workers > 1:
-        pass
-#        raise ValueError(
-#            f"Profiling mode requires single worker only. "
-#            f"Got prefill_workers={self.resources.prefill_workers}"
-#        )
-    if self.resources.decode_workers and self.resources.decode_workers > 1:
-        pass
-#        raise ValueError(
-#            f"Profiling mode requires single worker only. "
-#            f"Got decode_workers={self.resources.decode_workers}"
-#        )
+    # Multi-worker profiling is supported in disaggregated mode
+    pass
```

Or restore the validation if this was unintentional.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- recipies/gb200-fp4/1k1k/low-latency.yaml (3 hunks)
- scripts/profiling/profile.sh (3 hunks)
- scripts/templates/job_script_template_disagg.j2 (8 hunks)
- scripts/worker_setup.py (4 hunks)
- scripts/worker_setup/command.py (4 hunks)
- scripts/worker_setup/worker.py (6 hunks)
- src/srtctl/backends/sglang.py (3 hunks)
- src/srtctl/core/schema.py (3 hunks)
- srtslurm.yaml.example (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
recipies/gb200-fp4/1k1k/low-latency.yaml (1)
tests/test_command_generation.py (1)
test_profiling_mode(232-268)
scripts/worker_setup/worker.py (2)
scripts/worker_setup/infrastructure.py (1)
setup_head_prefill_node (14-32)

scripts/worker_setup/utils.py (1)
wait_for_etcd(53-70)
🪛 Ruff (0.14.8)
scripts/worker_setup/worker.py
97-97: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
161-161: Avoid specifying long messages outside the exception class
(TRY003)
210-210: Avoid specifying long messages outside the exception class
(TRY003)
214-214: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (14)
recipies/gb200-fp4/1k1k/low-latency.yaml (1)
111-121: LGTM! Separate profiling steps for prefill and decode.

The new profiling structure with phase-specific `start_step`/`stop_step` values enables fine-grained control over when profiling captures each phase. The decode phase having a later step range (700-730 vs 60-70) appropriately accounts for the different timing characteristics of each phase.

scripts/worker_setup.py (2)
98-102: LGTM! Clean flag addition for router mode.

The new `--use-sglang-router` flag is well-documented and follows the existing argument patterns in the file.
184-227: LGTM! Flag correctly propagated to all worker types.

The `use_sglang_router` parameter is consistently passed to `setup_prefill_worker`, `setup_decode_worker`, and `setup_aggregated_worker`, ensuring uniform behavior across worker types.

src/srtctl/backends/sglang.py (3)
285-305: LGTM! Well-structured helper for environment string construction.

The `build_env_str` function cleanly separates shared parameters (isl/osl/concurrency) from phase-specific parameters (start_step/stop_step), with proper None checks for defensive coding.
307-324: LGTM! Backward-compatible phase config handling.

The fallback logic correctly supports both the new structure (separate `prefill`/`decode` blocks) and legacy configs with top-level `start_step`/`stop_step`. This ensures existing configurations continue to work.
349-349: LGTM! Router flag added to template variables.

The `use_sglang_router` flag is correctly sourced from `backend_config` with a sensible default of `False`.
45-56: LGTM! Correct conditional wait logic for PD disaggregation.

The logic correctly distinguishes between prefill (needs router + local server) and decode (only needs local server). This ensures the decode profiler doesn't block waiting for a router it doesn't need.
95-96: Verify: start_profile targets local server, not router.

The profiling trigger is now sent to the local server (`127.0.0.1:30000`) rather than the router. This is correct for per-worker profiling, but ensure the local server supports the `/start_profile` endpoint.

scripts/worker_setup/worker.py (1)
87-102: Router-based flow gating looks correct.

The conditional logic to skip `setup_head_prefill_node` and `wait_for_etcd` when `use_sglang_router=True` is well-structured. The router mode bypasses traditional infrastructure setup since the sglang router handles routing directly.

scripts/templates/job_script_template_disagg.j2 (1)
338-388: Router launch logic looks well-structured.

The implementation correctly:
- Waits for prefill server health with a timeout
- Collects leader IPs for both prefill and decode workers
- Assembles `ROUTER_ARGS` with `--pd-disaggregation` and endpoint URLs
- Launches the router on node 0

One minor observation: the wait loop increments by 10 but sleeps for 10 seconds, so `WAIT_COUNT` represents elapsed seconds. The warning at line 358 could be clearer about the timeout.

scripts/worker_setup/command.py (2)
63-63: Profile output path now includes worker_idx — good for multi-worker profiling.

This change enables per-worker profiling outputs when running multiple workers, preventing file collisions.
144-174: Function signature and delegation updated correctly.

The `worker_idx` parameter is properly threaded through to `build_sglang_command_from_yaml`. The docstring here is also missing the `worker_idx` description (same as above).

src/srtctl/core/schema.py (2)
184-193: Schema restructuring for per-phase profiling looks appropriate.

The changes properly add:

- Shared parameters (`isl`, `osl`, `concurrency`) at the top level
- Per-phase configs (`prefill`, `decode`) for phase-specific profiling settings
- Legacy fields for backward compatibility with deprecation notes
247-252: New `use_sglang_router` field is well-documented.

The field has a clear description and appropriate default (`False`). This integrates cleanly with the rest of the router-enabled workflow.
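The per-phase schema shape described above can be sketched with stdlib dataclasses (the real schema uses pydantic models; field names follow the review, while the defaults and the `phase` fallback helper are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProfilingPhaseConfig:
    start_step: Optional[int] = None
    stop_step: Optional[int] = None

@dataclass
class ProfilingConfig:
    # Shared parameters at the top level
    isl: int = 1024
    osl: int = 1024
    concurrency: int = 256
    # Per-phase configs
    prefill: Optional[ProfilingPhaseConfig] = None
    decode: Optional[ProfilingPhaseConfig] = None
    # Legacy top-level fields, kept for backward compatibility
    start_step: Optional[int] = None
    stop_step: Optional[int] = None

    def phase(self, name: str) -> ProfilingPhaseConfig:
        """Return the per-phase config, falling back to legacy fields."""
        cfg = getattr(self, name)
        if cfg is not None:
            return cfg
        return ProfilingPhaseConfig(start_step=self.start_step,
                                    stop_step=self.stop_step)
```

A legacy config with only top-level `start_step`/`stop_step` then resolves to the same steps for both phases, while a config with explicit `prefill`/`decode` sections keeps them independent.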
```
exit_code=$?
set +x
```
Exit code only captures lm_eval result, not bench_serving.
When PROFILING_MODE == "prefill", exit_code=$? captures only the exit status of lm_eval, not bench_serving. If bench_serving fails but lm_eval succeeds, the script exits successfully.
Consider capturing and checking both exit codes:
+ python3 -m sglang.bench_serving \
...
- --warmup-request 5
+ --warmup-request 5 || exit $?
pip install lm-eval tenacity > /dev/null
python -m lm_eval \
...
--limit 10
fi
exit_code=$?

Or use `set -e` at the script start if you want to fail on any error.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In scripts/profiling/profile.sh around lines 120-121, the script only records
exit_code=$? which captures lm_eval's exit status and misses bench_serving when
PROFILING_MODE=="prefill"; update the script to capture and check both commands'
exit codes (e.g., store the exit status of bench_serving into a variable right
after it runs, then after lm_eval set exit_code to a non-zero value if either
command failed) or alternatively enable fail-fast by adding set -e at the top of
the script; ensure the final exit_code reflects any failure from bench_serving
as well as lm_eval before exiting.
bd9f0e8 to edbba15
edbba15 to 19f5dcc
Actionable comments posted: 0
♻️ Duplicate comments (1)
scripts/templates/job_script_template_disagg.j2 (1)
445-464: Duplicate profiling block runs unconditionally — missing conditional guard.

This block (lines 445-464) duplicates the profiling logic from lines 411-443 but lacks the `{% if profiler != 'none' %}` guard. It will execute for every job, including those without profiling enabled, causing:
- Spurious "Starting profiler..." messages
- Unnecessary profiling
sruncommands that will likely fail or produce empty resultsThis was flagged in a previous review and marked as addressed, but the issue persists in the current code.
🔎 Wrap the block in the conditional guard or remove if it's a duplicate:
```diff
+{% if profiler != 'none' %}
 {% raw %}
 echo "Starting profiler..."

 # Get leader nodes for first prefill and decode workers
 PREFILL_LEADER_NODE=${nodes[${prefill_leaders[0]}]}
 DECODE_LEADER_NODE=${nodes[${decode_leaders[0]}]}

 echo "Prefill profiling will run on: $PREFILL_LEADER_NODE"
 echo "Decode profiling will run on: $DECODE_LEADER_NODE"

 # Run profiling on prefill leader (drives traffic via router)
 srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w $PREFILL_LEADER_NODE \
   --output=${LOG_DIR}/profile_prefill.out --overlap \
   bash -c "PROFILING_MODE=prefill {% endraw %}{% if profiler == 'torch' %}SGLANG_TORCH_PROFILER_DIR=/logs/profiles/prefill {% endif %}{{ prefill_profile_env }}{% raw %} /scripts/profiling/profile.sh $PREFILL_WORKERS $DECODE_WORKERS $PREFILL_GPUS $DECODE_GPUS $TOTAL_GPUS" &

 # Run profiling on decode leader (only triggers /start_profile, no extra traffic)
 srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w $DECODE_LEADER_NODE \
   --output=${LOG_DIR}/profile_decode.out --overlap \
   bash -c "PROFILING_MODE=decode {% endraw %}{% if profiler == 'torch' %}SGLANG_TORCH_PROFILER_DIR=/logs/profiles/decode {% endif %}{{ decode_profile_env }}{% raw %} /scripts/profiling/profile.sh $PREFILL_WORKERS $DECODE_WORKERS $PREFILL_GPUS $DECODE_GPUS $TOTAL_GPUS" &
 {% endraw %}
+{% endif %}
```

Alternatively, if the unified profiling block (lines 411-443) already handles all cases, this duplicate block should be removed entirely.
🧹 Nitpick comments (1)
src/srtctl/core/schema.py (1)
329-341: Remove or restore the commented-out validation code.

The commented-out validation (lines 332-335 and 338-341) is dead code. If multi-worker profiling is now intentionally supported in disaggregated mode, remove the commented code entirely. If this is temporary for debugging, add a TODO comment explaining when it should be restored.

Leaving `pass` statements with commented-out code creates maintenance burden and ambiguity about the intended behavior.

🔎 If multi-worker profiling is intentionally supported, apply this diff:
```diff
 if is_disaggregated:
     if self.resources.prefill_workers and self.resources.prefill_workers > 1:
-        pass
-#        raise ValueError(
-#            f"Profiling mode requires single worker only. "
-#            f"Got prefill_workers={self.resources.prefill_workers}"
-#        )
+        pass  # Multi-worker profiling supported in disaggregated mode
     if self.resources.decode_workers and self.resources.decode_workers > 1:
-        pass
-#        raise ValueError(
-#            f"Profiling mode requires single worker only. "
-#            f"Got decode_workers={self.resources.decode_workers}"
-#        )
+        pass  # Multi-worker profiling supported in disaggregated mode
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- recipies/gb200-fp4/1k1k/low-latency.yaml (3 hunks)
- scripts/templates/job_script_template_disagg.j2 (3 hunks)
- scripts/worker_setup/command.py (4 hunks)
- scripts/worker_setup/worker.py (3 hunks)
- src/srtctl/backends/sglang.py (1 hunks)
- src/srtctl/core/schema.py (2 hunks)
- srtslurm.yaml.example (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- scripts/worker_setup/worker.py
- srtslurm.yaml.example
- recipies/gb200-fp4/1k1k/low-latency.yaml
🔇 Additional comments (5)
scripts/worker_setup/command.py (2)
11-21: LGTM! Per-worker profiling context is properly threaded.

The `worker_idx` parameter is correctly added to enable per-worker profiling output differentiation. The nsys output path at line 63 now includes `{config_key}_{worker_idx}_{rank}`, which prevents file collisions when running multiple workers.

Minor: The docstring (lines 30-40) should document the new `worker_idx` parameter for completeness.
144-174: LGTM! Function signature and call site updated consistently.

The `worker_idx` parameter is correctly propagated from `get_gpu_command` to `build_sglang_command_from_yaml`. The docstring (lines 157-167) should also document `worker_idx` for completeness.

src/srtctl/core/schema.py (1)
176-196: LGTM! Clean per-phase profiling configuration.

The new `ProfilingPhaseConfig` model and the updated `ProfilingConfig` with separate `prefill`/`decode` fields provide a clean structure for phase-specific profiling. The legacy `start_step`/`stop_step` fields maintain backward compatibility.

src/srtctl/backends/sglang.py (1)
285-324: LGTM! Clean separation of phase-specific and shared profiling config.

The refactored `build_env_str` function correctly:

- Separates phase-specific parameters (`start_step`/`stop_step`) from shared parameters (`isl`/`osl`/`concurrency`)
- Maintains backward compatibility by deriving `prefill_phase_cfg` and `decode_phase_cfg` from legacy fields when explicit phase sections are absent
- Generates distinct environment strings for prefill and decode phases
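The fallback the bullets describe can be sketched as follows; the function name mirrors the review, but the env-var names and dict-based signature are assumptions, not the actual srtctl API:

```python
from typing import Optional

def build_env_str(shared: dict, phase_cfg: Optional[dict], legacy: dict) -> str:
    """Build a phase-specific env string: an explicit phase config wins,
    otherwise legacy top-level start/stop steps are used."""
    cfg = phase_cfg if phase_cfg is not None else legacy
    parts = []
    for key in ("isl", "osl", "concurrency"):   # shared across phases
        if shared.get(key) is not None:
            parts.append(f"PROFILE_{key.upper()}={shared[key]}")
    for key in ("start_step", "stop_step"):     # phase-specific
        if cfg.get(key) is not None:
            parts.append(f"PROFILE_{key.upper()}={cfg[key]}")
    return " ".join(parts)
```

Calling it once with the prefill section and once with the decode section yields two distinct env strings that share isl/osl/concurrency but differ in step ranges.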
scripts/templates/job_script_template_disagg.j2 (1)
343-362: LGTM! Proper readiness check before router launch.

The prefill server health check with timeout and fallback warning is a good safeguard. The 300-second timeout with 10-second polling intervals provides reasonable wait time, while the warning at line 361 ensures the job proceeds with a clear warning if the server doesn't become ready.
ishandhanani#68)

* support set prefill/decode profiling step separately and support agg profiling
* fix wrong arg in agg.
* fix review
901d60a to 362c476
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/worker_setup/worker.py (1)
290-301: Missing `worker_idx` argument in `setup_aggregated_worker` will cause runtime failure.

The `get_gpu_command` function signature now requires `worker_idx` as the second positional parameter (after `worker_type`), but `setup_aggregated_worker` does not pass it. This will raise a `TypeError` when an aggregated worker is started.

🔎 Proposed fix

```diff
 # Build and execute SGLang command from YAML config
 cmd_to_run = get_gpu_command(
     worker_type="aggregated",
+    worker_idx=worker_idx,
     sglang_config_path=sglang_config_path,
     host_ip=leader_ip,
     port=DIST_INIT_PORT,
     total_nodes=nodes_per_worker,
     rank=local_rank,
     profiler=profiler,
     dump_config_path=dump_config_path,
     use_sglang_router=use_sglang_router,
 )
```
🤖 Fix all issues with AI Agents
In @scripts/templates/job_script_template_disagg.j2:
- Around line 547-566: The duplicate profiling srun block that prints "Starting
profiler..." and launches /scripts/profiling/profile.sh using
PREFILL_LEADER_NODE / DECODE_LEADER_NODE (setting PROFILING_MODE=prefill/decode
and using prefill_profile_env / decode_profile_env) must be removed or wrapped
in the same guard as the earlier block; specifically ensure this block only runs
when profiler != 'none' (i.e., wrap it with {% if profiler != 'none' %}...{%
endif %}) or delete the duplicate so profiling is launched only once and the
unnecessary srun commands and messages do not run when profiler is 'none'.
🧹 Nitpick comments (2)
examples/fp4-disagg-nsys-profiling.yaml (1)
16-17: Consider using boolean instead of string for `use_sglang_router`.

The value `"true"` is a string, but boolean flags are typically represented as `true` (without quotes) in YAML. Verify that the schema parser handles this correctly, or change to a proper boolean.

🔎 Suggested fix

```diff
 backend:
-  use_sglang_router: "true"
+  use_sglang_router: true
```

scripts/worker_setup/command.py (1)
11-22: LGTM - `worker_idx` parameter correctly added to function signature.

The parameter is properly positioned and typed.

Consider updating the docstring to document the new `worker_idx` parameter:

```diff
 Args:
     worker_type: "prefill", "decode", or "aggregated"
+    worker_idx: Index of the worker group (0-based)
     sglang_config_path: Path to generated sglang_config.yaml
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/fp4-disagg-nsys-profiling.yaml
- scripts/templates/job_script_template_disagg.j2
- scripts/worker_setup/command.py
- scripts/worker_setup/worker.py
- src/srtctl/core/schema.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/srtctl/core/schema.py
🧰 Additional context used
🧬 Code graph analysis (3)
examples/fp4-disagg-nsys-profiling.yaml (1)
tests/test_command_generation.py (1)
test_profiling_mode(232-268)
scripts/worker_setup/command.py (2)
scripts/worker_setup.py (2)
_validate_args (167-197), _parse_command_line_args (33-164)

scripts/worker_setup/infrastructure.py (1)
setup_frontend_worker(46-65)
scripts/worker_setup/worker.py (1)
scripts/worker_setup/infrastructure.py (1)
setup_frontend_worker(46-65)
🔇 Additional comments (7)
scripts/worker_setup/worker.py (2)
129-141: LGTM - worker_idx correctly passed for prefill workers.

The `worker_idx` parameter is now properly forwarded to `get_gpu_command`, enabling per-worker profiling output paths.
175-187: LGTM - worker_idx correctly passed for decode workers.

Consistent with the prefill worker change.

examples/fp4-disagg-nsys-profiling.yaml (2)
examples/fp4-disagg-nsys-profiling.yaml (2)
1-112: Configuration structure looks good for disaggregated profiling example.

The configuration properly defines:
- Separate prefill/decode environments with appropriate timeout and disaggregation settings
- Distinct sglang_config sections for prefill and decode modes
- Reasonable profiling step ranges for each phase
113-123: Profiling configuration is well-structured.

The separate `start_step`/`stop_step` for prefill and decode enables targeted profiling of each phase independently, which aligns with the disaggregated architecture.

scripts/worker_setup/command.py (2)
67-67: LGTM - Per-worker profiling output path correctly constructed.

The nsys output path now includes `worker_idx` to differentiate profiles across workers: `/logs/profiles/{config_key}_{worker_idx}_{rank}`.
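A tiny sketch of the naming scheme; the helper itself is hypothetical, but the `/logs/profiles/{config_key}_{worker_idx}_{rank}` layout matches the review:

```python
def nsys_output_path(config_key: str, worker_idx: int, rank: int) -> str:
    # One file per (worker, rank) pair, so concurrent workers never collide.
    return f"/logs/profiles/{config_key}_{worker_idx}_{rank}"
```

Two prefill workers at rank 0 then write to `..._0_0` and `..._1_0` rather than overwriting the same file.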
148-181: LGTM - `get_gpu_command` correctly propagates `worker_idx`.

The parameter is added to the signature and forwarded to `build_sglang_command_from_yaml`. Note: The docstring should also be updated to document `worker_idx`.

scripts/templates/job_script_template_disagg.j2 (1)
343-362: Good addition of prefill server readiness check.

This ensures the router doesn't start before the prefill server is ready to accept connections. The implementation correctly:
- Polls the `/health` endpoint with curl
- Has a reasonable 600-second timeout with 30-second intervals
- Proceeds with a warning rather than failing hard on timeout
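The same poll-warn-proceed pattern can be sketched in Python (the template does this in bash with curl); the URL, timings, and the injectable `_sleep` hook are illustrative:

```python
import time
import urllib.error
import urllib.request

# Generic readiness poll mirroring the template's behavior: GET /health
# until HTTP 200 or until the timeout elapses, then warn and proceed
# rather than failing hard.
def wait_for_health(url: str, max_wait: float = 600, interval: float = 30,
                    _sleep=time.sleep) -> bool:
    waited = 0.0
    while waited < max_wait:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not ready yet
        _sleep(interval)
        waited += interval
    print(f"Warning: server not ready within {max_wait}s, proceeding anyway...")
    return False
```

A fail-fast variant, as suggested later in the review, would raise or `sys.exit(1)` instead of returning `False` after the timeout.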
Actionable comments posted: 2
🤖 Fix all issues with AI Agents
In @scripts/templates/job_script_template_disagg.j2:
- Around line 360-362: The current health-check timeout block logs a warning and
continues, which can cause router failures; update the block that checks
WAIT_COUNT and MAX_WAIT so it fails fast: replace the echo-only path with a
non-zero exit (e.g., echo an error and exit 1) or conditionally exit based on a
new env flag (e.g., PREFILL_FAIL_ON_TIMEOUT) so the script stops when the
prefill server never becomes ready; adjust the message to include WAIT_COUNT and
MAX_WAIT for clarity and ensure the router is not launched after this failure.
- Around line 345-346: PREFILL_LEADER_IP may be empty if get_node_ip returns an
empty string; before running the health-check curl (the URL using
http://$PREFILL_LEADER_IP:30000/health), validate PREFILL_LEADER_IP is non-empty
and a valid IP (or hostname) and bail or retry with a clear error if it is
empty; update the section that sets PREFILL_LEADER_IP (referring to get_node_ip
and the health-check curl) to check for emptiness, log an error mentioning
PREFILL_LEADER_NODE and SLURM_JOB_ID, and exit non-zero (or implement a retry
loop) so the malformed URL is never used.
🧹 Nitpick comments (1)
scripts/templates/job_script_template_disagg.j2 (1)
513-556: Consider consolidating duplicate profiler conditional checks.

The profiler logic is split across two separate `{% if profiler != 'none' %}` blocks (lines 513-545 and 547-556). While functionally correct, consolidating these into a single conditional block would improve readability and reduce redundancy.

🔎 Proposed refactor to consolidate conditionals
```diff
 {% if profiler != 'none' %}
 {% raw %}
 # Torch/NSYS profiling mode: run a single orchestrator that profiles all prefill and decode workers.
 echo "Starting unified profiler..."

 # Collect leader IPs for prefill and decode workers
 PREFILL_LEADER_IPS=()
 for idx in "${prefill_leaders[@]}"; do
   node_name=${nodes[$idx]}
   ip=$(get_node_ip "$node_name" "$SLURM_JOB_ID" "$NETWORK_INTERFACE")
   PREFILL_LEADER_IPS+=("$ip")
 done

 DECODE_LEADER_IPS=()
 for idx in "${decode_leaders[@]}"; do
   node_name=${nodes[$idx]}
   ip=$(get_node_ip "$node_name" "$SLURM_JOB_ID" "$NETWORK_INTERFACE")
   DECODE_LEADER_IPS+=("$ip")
 done

 PREFILL_LEADER_IPS_STR=$(IFS=,; echo "${PREFILL_LEADER_IPS[*]}")
 DECODE_LEADER_IPS_STR=$(IFS=,; echo "${DECODE_LEADER_IPS[*]}")

 # Use the first prefill leader as the orchestrator node
 PROFILE_ORCHESTRATOR_NODE=${nodes[${prefill_leaders[0]}]}
 echo "Unified profiling will run on orchestrator node: $PROFILE_ORCHESTRATOR_NODE"

 # Run a single profiling orchestrator that coordinates profiling across all leaders
 srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w $PROFILE_ORCHESTRATOR_NODE \
   --output=${LOG_DIR}/profile_all.out --overlap \
   bash -c "PROFILE_PREFILL_IPS=${PREFILL_LEADER_IPS_STR} PROFILE_DECODE_IPS=${DECODE_LEADER_IPS_STR} {% endraw %}{% if profiler == 'torch' %}SGLANG_TORCH_PROFILER_DIR=/logs/profiles {% endif %}{{ profiling_driver_env }} {{ prefill_profile_env }} {{ decode_profile_env }}{% raw %} /scripts/profiling/profile.sh $PREFILL_WORKERS $DECODE_WORKERS $PREFILL_GPUS $DECODE_GPUS $TOTAL_GPUS" &
-{% endraw %}
-{% endif %}
-
-{% if profiler != 'none' %}
-{% raw %}
+
 # Wait for all profiling scripts to complete (both prefill and decode)
 echo "Waiting for all profiling scripts to complete..."
 wait
 exit_code=$?
 echo "All profiling scripts finished at $(date) with exit code ${exit_code}"
 exit $exit_code
 {% endraw %}
 {% else %}
```
```
PREFILL_LEADER_NODE=${nodes[${prefill_leaders[0]}]}
PREFILL_LEADER_IP=$(get_node_ip "$PREFILL_LEADER_NODE" "$SLURM_JOB_ID" "$NETWORK_INTERFACE")
```
Validate PREFILL_LEADER_IP before use in health check.
If get_node_ip returns an empty string, the subsequent curl command will use a malformed URL (http://:30000/health), causing the health check to fail silently or produce misleading results.
🔎 Proposed fix to add IP validation
PREFILL_LEADER_NODE=${nodes[${prefill_leaders[0]}]}
PREFILL_LEADER_IP=$(get_node_ip "$PREFILL_LEADER_NODE" "$SLURM_JOB_ID" "$NETWORK_INTERFACE")
+if [ -z "$PREFILL_LEADER_IP" ]; then
+ echo "Error: Could not retrieve IP address for prefill leader node $PREFILL_LEADER_NODE"
+ exit 1
+fi
 MAX_WAIT=600

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
PREFILL_LEADER_NODE=${nodes[${prefill_leaders[0]}]}
PREFILL_LEADER_IP=$(get_node_ip "$PREFILL_LEADER_NODE" "$SLURM_JOB_ID" "$NETWORK_INTERFACE")
if [ -z "$PREFILL_LEADER_IP" ]; then
  echo "Error: Could not retrieve IP address for prefill leader node $PREFILL_LEADER_NODE"
  exit 1
fi
```
🤖 Prompt for AI Agents
In @scripts/templates/job_script_template_disagg.j2 around lines 345 - 346,
PREFILL_LEADER_IP may be empty if get_node_ip returns an empty string; before
running the health-check curl (the URL using
http://$PREFILL_LEADER_IP:30000/health), validate PREFILL_LEADER_IP is non-empty
and a valid IP (or hostname) and bail or retry with a clear error if it is
empty; update the section that sets PREFILL_LEADER_IP (referring to get_node_ip
and the health-check curl) to check for emptiness, log an error mentioning
PREFILL_LEADER_NODE and SLURM_JOB_ID, and exit non-zero (or implement a retry
loop) so the malformed URL is never used.
```
if [ $WAIT_COUNT -ge $MAX_WAIT ]; then
  echo "Warning: Prefill server did not become ready within ${MAX_WAIT}s, proceeding anyway..."
fi
```
Proceeding after health check timeout may cause router failures.
If the prefill server fails to become ready within 600 seconds, the script logs a warning but continues to launch the router. This could lead to router initialization failures or runtime errors when the router attempts to connect to an unavailable prefill server.
Consider exiting with an error instead of proceeding, or implement a more robust retry mechanism in the router itself.
🤖 Prompt for AI Agents
In @scripts/templates/job_script_template_disagg.j2 around lines 360 - 362, The
current health-check timeout block logs a warning and continues, which can cause
router failures; update the block that checks WAIT_COUNT and MAX_WAIT so it
fails fast: replace the echo-only path with a non-zero exit (e.g., echo an error
and exit 1) or conditionally exit based on a new env flag (e.g.,
PREFILL_FAIL_ON_TIMEOUT) so the script stops when the prefill server never
becomes ready; adjust the message to include WAIT_COUNT and MAX_WAIT for clarity
and ensure the router is not launched after this failure.
6b8c0f4 to e5ee24d
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/worker_setup/worker.py (1)
305-316: Missing required `worker_idx` parameter will cause runtime error.

The `get_gpu_command` function signature was updated to require `worker_idx` as the second positional parameter (see `scripts/worker_setup/command.py` line 208), but the call in `setup_aggregated_worker` does not pass this parameter. This will cause a `TypeError` at runtime.

🐛 Proposed fix

```diff
 cmd_to_run = get_gpu_command(
     worker_type="aggregated",
+    worker_idx=worker_idx,
     sglang_config_path=sglang_config_path,
     host_ip=leader_ip,
     port=DIST_INIT_PORT,
     total_nodes=nodes_per_worker,
     rank=local_rank,
     profiler=profiler,
     dump_config_path=dump_config_path,
     use_sglang_router=use_sglang_router,
 )
```
🤖 Fix all issues with AI agents
In @docs/sglang-router.md:
- Around line 140-145: The docs currently say your local sglang dir is mounted
to /ext-sglang-src/, but the worker setup actually takes the provided
sglang_src_dir and automatically appends "/python" before running pip install
-e; update the "How It Works" section to state that sglang_src_dir must point to
the SGLang repository root (not the python subdirectory) because the setup code
will resolve to the python/ subdirectory automatically prior to installation.
🧹 Nitpick comments (1)
scripts/worker_setup/command.py (1)
148-203: Improve exception handling in `install_sglang_from_source`.

The function correctly implements the source installation workflow, but the exception handling can be improved:

- Line 171: catching bare `Exception` is too broad; consider catching specific exceptions like `OSError` or `PermissionError`
- Line 172: use `logging.exception()` instead of `logging.error()` to include the stack trace
- Line 173: chain the exception with `raise ... from e` to preserve the exception context

Additionally, line 163 hardcodes the `'python'` subdirectory assumption. This should be documented in user-facing docs (already flagged in the docs/sglang-router.md review).

♻️ Proposed exception handling improvements

```diff
 # List directory contents for debugging
 try:
     contents = os.listdir(abs_path)
     logging.info(f"Directory contains {len(contents)} items: {contents[:20]}")
-except Exception as e:
-    logging.error(f"Cannot list directory {abs_path}: {e}")
-    raise RuntimeError(f"Cannot access sglang source directory: {abs_path}")
+except (OSError, PermissionError) as e:
+    logging.exception(f"Cannot list directory {abs_path}")
+    raise RuntimeError(f"Cannot access sglang source directory: {abs_path}") from e
```
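The `raise ... from e` recommendation can be checked in isolation. The sketch below is independent of the repository code — `list_dir_or_fail` and `failing_lister` are hypothetical stand-ins for the reviewed pattern — and shows that chaining preserves the original exception as `__cause__`:

```python
def list_dir_or_fail(path, lister):
    """Mimic the reviewed pattern: narrow except clause, chained re-raise."""
    try:
        return lister(path)
    except (OSError, PermissionError) as e:
        # `from e` records the low-level failure as __cause__ on the new error
        raise RuntimeError(f"Cannot access sglang source directory: {path}") from e

def failing_lister(path):
    raise PermissionError(f"denied: {path}")

try:
    list_dir_or_fail("/ext-sglang-src", failing_lister)
except RuntimeError as err:
    # Capture inside the except block: Python deletes `err` when the block ends
    cause = err.__cause__
    msg = str(err)
# cause is the original PermissionError, so tracebacks show both failures
```

Without `from e`, the original `PermissionError` would only survive as the implicit `__context__`, and an explicit `raise ... from None` would discard it entirely — which is why the lint rule (B904) flags the bare re-raise.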
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/sglang-router.md
- scripts/templates/job_script_template_disagg.j2
- scripts/worker_setup/command.py
- scripts/worker_setup/worker.py
- src/srtctl/core/backend.py
- src/srtctl/core/schema.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/srtctl/core/schema.py
🧰 Additional context used
🧬 Code graph analysis (4)
src/srtctl/core/backend.py (1)
tests/test_command_generation.py (1)
`test_profiling_mode` (232-268)
scripts/worker_setup/command.py (2)
scripts/worker_setup.py (2)
`_validate_args` (167-197), `_parse_command_line_args` (33-164)
tests/test_command_generation.py (1)
`test_profiling_mode` (232-268)
scripts/templates/job_script_template_disagg.j2 (1)
scripts/utils/check_server_health.py (1)
`check_sglang_router_health` (30-47)
scripts/worker_setup/worker.py (1)
scripts/worker_setup/command.py (3)
`get_gpu_command` (206-239), `install_dynamo_wheels` (121-145), `install_sglang_from_source` (148-203)
🪛 Ruff (0.14.10)
scripts/worker_setup/command.py
171-171: Do not catch blind exception: Exception
(BLE001)
172-172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
173-173: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
181-181: f-string without any placeholders
Remove extraneous f prefix
(F541)
183-188: Avoid specifying long messages outside the exception class
(TRY003)
194-194: Starting a process with a partial executable path
(S607)
201-201: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (12)
src/srtctl/core/backend.py (1)
234-234: LGTM! The addition of `sglang_src_dir` to template variables is correct and consistent with the existing pattern. The template will handle `None` values gracefully when the configuration is not provided.

scripts/worker_setup/worker.py (6)
10-14: LGTM! The import of `install_sglang_from_source` is correctly added alongside the existing imports from the command module.

109-112: LGTM! The conditional installation of SGLang from source when using sglang-router mode is correctly implemented with appropriate logging.

137-149: LGTM! The addition of the `worker_idx` parameter to `get_gpu_command` correctly threads the worker context through the prefill worker setup path.

174-176: LGTM! The conditional installation of SGLang from source in decode worker setup mirrors the prefill worker implementation correctly.

186-198: LGTM! The addition of the `worker_idx` parameter to `get_gpu_command` correctly threads the worker context through the decode worker setup path.

277-280: LGTM! The conditional installation of SGLang from source in aggregated worker setup is consistent with the prefill and decode implementations.
scripts/templates/job_script_template_disagg.j2 (2)
181-181: LGTM! The conditional mount for `sglang_src_dir` is correctly implemented using Jinja2 syntax. When `sglang_src_dir` is defined, it will be mounted to `/ext-sglang-src/` in the container; otherwise the mount directive is omitted cleanly.
343-362: Prefill server readiness check is functional but has validation gaps.

The new wait loop for prefill server readiness implements a reasonable polling strategy (30s intervals, 600s timeout). However, past review comments have identified valid concerns:

- IP validation missing (line 346): if `get_node_ip` returns empty, the curl will use a malformed URL
- Timeout handling (lines 360-362): proceeding with a warning after timeout may cause router initialization failures
These concerns remain unaddressed in the current implementation.
Given that these issues were flagged in previous reviews, please verify whether:
- IP validation will be added as a separate follow-up
- The timeout warning-only behavior is intentional for this use case
scripts/worker_setup/command.py (3)
11-22: LGTM! The addition of the `worker_idx` parameter to `build_sglang_command_from_yaml` correctly extends the function signature to support per-worker command construction. The parameter is appropriately positioned and typed.

67-67: LGTM! The updated nsys profiling output path now includes `worker_idx` in the filename (`/logs/profiles/{config_key}_w{worker_idx}_{rank}`), which properly isolates profiling data per worker. This prevents file collisions in multi-worker scenarios.

206-239: LGTM! The addition of the `worker_idx` parameter to `get_gpu_command` and its propagation to `build_sglang_command_from_yaml` correctly threads per-worker context through the command construction pipeline. The parameter is appropriately positioned as the second argument after `worker_type`.
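The per-worker naming scheme is plain string composition; a small sketch makes the collision argument concrete (the helper name and the `config_key`/index values below are illustrative, not from the repository):

```python
def profile_output_path(config_key, worker_idx, rank, base="/logs/profiles"):
    """Build the nsys output prefix, one file per (worker, rank) pair."""
    return f"{base}/{config_key}_w{worker_idx}_{rank}"

# Before worker_idx was added, two workers sharing a local rank would have
# written to the same path. With it, the _w segment keeps them distinct:
a = profile_output_path("deepseek_prefill", 0, 3)
b = profile_output_path("deepseek_prefill", 1, 3)
# a == "/logs/profiles/deepseek_prefill_w0_3"; b differs only in the _w segment
```

Encoding both indices in the filename also makes profiles self-describing when collected from many nodes into one directory.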
```markdown
### How It Works

1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
2. Before launching workers, the script runs: `pip install -e . --no-deps`
3. Workers use your local sglang code instead of the container's pre-installed version
```
Document the automatic 'python' subdirectory resolution.
The documentation states that the local directory is mounted to /ext-sglang-src/, but the implementation in scripts/worker_setup/command.py (line 163) automatically appends /python to this path before installation. Users should be informed that sglang_src_dir must point to the SGLang repository root (not the python subdirectory), as the code will automatically resolve to the python/ subdirectory within it.
📝 Suggested documentation update

```diff
 ### How It Works
-1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
-2. Before launching workers, the script runs: `pip install -e . --no-deps`
+1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
+2. The installation automatically uses the `python/` subdirectory within the mounted source
+3. Before launching workers, the script runs: `pip install -e . --no-deps` from `/ext-sglang-src/python/`
-3. Workers use your local sglang code instead of the container's pre-installed version
+4. Workers use your local sglang code instead of the container's pre-installed version
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```markdown
### How It Works
1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
2. Before launching workers, the script runs: `pip install -e . --no-deps`
3. Workers use your local sglang code instead of the container's pre-installed version
```

After:

```markdown
### How It Works
1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
2. The installation automatically uses the `python/` subdirectory within the mounted source
3. Before launching workers, the script runs: `pip install -e . --no-deps` from `/ext-sglang-src/python/`
4. Workers use your local sglang code instead of the container's pre-installed version
```
🤖 Prompt for AI Agents
In @docs/sglang-router.md around lines 140-145: the docs currently say the local sglang dir is mounted to /ext-sglang-src/, but the worker setup actually takes the provided sglang_src_dir and automatically appends "/python" before running `pip install -e`. Update the "How It Works" section to state that sglang_src_dir must point to the SGLang repository root (not the python subdirectory), because the setup code will resolve to the python/ subdirectory automatically prior to installation.
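The root-vs-`python/` distinction is easy to get wrong; a minimal sketch of the resolution step the review describes (the helper name `resolve_install_dir` is hypothetical — the real logic lives inline in `scripts/worker_setup/command.py`):

```python
import os

def resolve_install_dir(sglang_src_dir):
    """Return the directory pip should install from, given the SGLang repo root.

    Mirrors the behavior the review describes: the setup code appends
    'python' automatically, so callers must pass the repository root,
    not the python/ subdirectory itself.
    """
    return os.path.join(sglang_src_dir, "python")

# Passing the mount point (the repo root) yields the editable-install target:
resolve_install_dir("/ext-sglang-src")  # → "/ext-sglang-src/python"
```

Passing the `python/` subdirectory directly would produce a doubled `python/python` path and a failed `pip install -e`, which is the pitfall the documentation fix guards against.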
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/accuracy.md (1)
55-55: Clarify the approximation symbol in the accuracy guidance note.

Line 55 uses `~0.81` (tilde indicating "approximately"), but it's unclear whether this is an empirical observation or a rough estimate. For benchmark documentation, clarify whether this value was measured precisely in testing or is an approximate guideline.

If this is an empirical result, consider changing to a precise value or adding context like "approximately 0.81 based on testing" or "capped around 0.81". If it's a guideline, consider adding a brief explanation of the variance range (e.g., "typically in the 0.79–0.82 range").
docs/sglang-router.md (2)
234-240: Clarify the interaction between `--no-deps` and pre-installed dependencies.

Lines 234–240 note that `--no-deps` is used and the container must have compatible dependencies. However, this could be clearer:
- Does this mean all transitive dependencies must already exist in the container?
- What happens if a local sglang version requires a newer version of a dependency not in the container?
Consider adding an example scenario or a troubleshooting tip (e.g., "If you encounter import errors, ensure your container has dependencies >= X version").
212-232: Add guidance on verifying the `sglang_src_dir` path before deployment.

The example configuration (lines 212–232) shows `sglang_src_dir: "/home/username/projects/sglang"`, but the notes (line 237) say the directory "must exist on the host running srtctl." Consider adding a quick pre-flight check suggestion, such as:

```shell
# Verify the source exists before applying
srtctl apply -f recipes/debug-sglang-router.yaml --validate
```

Or add a note: "You can validate the path exists locally before applying: `test -d /home/username/projects/sglang && echo 'OK' || echo 'Path not found'`"

This helps users catch configuration errors early.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/accuracy.md
- docs/profiling.md
- docs/sglang-router.md
- src/srtctl/core/schema.py
✅ Files skipped from review due to trivial changes (2)
- docs/profiling.md
- src/srtctl/core/schema.py
🔇 Additional comments (1)
docs/sglang-router.md (1)
180-240: Verify implementation alignment with the `sglang_src_dir` documentation.

The new "Debugging with SGLang Source Code" section (lines 180–240) documents a feature for mounting and installing sglang from source. Key claims should be verified against the actual implementation:

- Mount location (line 196): does the implementation mount to `/ext-sglang-src/`?
- Installation command (line 197): does the code execute `pip install -e . --no-deps` exactly?
- Constraint enforcement (line 236): is the feature truly gated to `use_sglang_router: true` only?
- Scope (line 204): does the feature apply to all "prefill/decode/aggregated workers" as stated?
Additionally, the AI-generated summary mentions that this content "appears in two locations" within the file. Please confirm whether the section is duplicated elsewhere in the markdown; if so, consolidate to a single source of truth.
|
Too much has changed. If this feature is needed in the future we can rebase |
Add the following:
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.