add nsys profiling and swap to bench_serving instead of bench_one_batch #41
Conversation
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

Replace the boolean profiling flag with a string-based profiler ("none", "torch", "nsys"), add per-phase profiling config, propagate the profiler through the schema, backend rendering, worker CLI/command builders, SLURM templates, and profiling scripts, and enforce mutual exclusion with benchmarking.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as User CLI (srtctl)
    participant Scheduler as SLURM / job script
    participant Worker as Worker process
    participant Profiler as Profiler dir / Remote Profiler API
    CLI->>Scheduler: submit job with profiling.type (none|torch|nsys)
    Scheduler->>Worker: start worker with `--profiler` and phase envs
    alt profiling.type == "torch" and SGLANG_TORCH_PROFILER_DIR set
        Worker->>Profiler: create profiler dir
        Worker->>Worker: start python -m sglang.launch_server under torch profiler
    else profiling.type == "torch" and no local profiler dir
        Worker->>Profiler: HTTP POST /start_profile (remote trigger)
        Worker->>Worker: start python -m sglang.launch_server
    else profiling.type == "nsys"
        Worker->>Worker: start python prefixed with nsys CLI wrapper
    else profiling.type == "none"
        Worker->>Worker: start python -m dynamo.sglang (normal serving)
    end
    Worker->>Scheduler: report exit code and results path
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 2
🧹 Nitpick comments (5)
scripts/profiling/profile.sh (2)
48-48: Remove unused variable `PROFILE_STEPS_ARG`.

This variable is assigned but never used in the script.

```diff
 # Determine profiling parameters strictly from environment
-PROFILE_STEPS_ARG=""
 CLI_ARGS=""
```
49-50: Remove unused `CLI_ARGS` variable.

`CLI_ARGS` is built with `--batch-size` but never used in the `bench_serving` command. The concurrency is passed via `--max-concurrency` directly.

```diff
 # Determine profiling parameters strictly from environment
-CLI_ARGS=""
-[[ -n "${PROFILE_CONCURRENCY}" ]] && CLI_ARGS+=" --batch-size ${PROFILE_CONCURRENCY}"
 # Require ISL/OSL to be provided; do not pass them as CLI args here
```

src/srtctl/backends/sglang.py (1)
289-301: Consider simplifying the conditional checks.

The pattern `if "key" in cfg and cfg["key"] is not None` can be simplified to `if cfg.get("key") is not None`, since `dict.get()` returns `None` for missing keys.

```diff
-    if "isl" in cfg and cfg["isl"] is not None:
-        parts.append(f"PROFILE_ISL={cfg['isl']}")
-    if "osl" in cfg and cfg["osl"] is not None:
-        parts.append(f"PROFILE_OSL={cfg['osl']}")
-    if "concurrency" in cfg and cfg["concurrency"] is not None:
-        parts.append(f"PROFILE_CONCURRENCY={cfg['concurrency']}")
-    if "start_step" in cfg and cfg["start_step"] is not None:
-        parts.append(f"PROFILE_START_STEP={cfg['start_step']}")
-    if "stop_step" in cfg and cfg["stop_step"] is not None:
-        parts.append(f"PROFILE_STOP_STEP={cfg['stop_step']}")
+    for key, env_var in [
+        ("isl", "PROFILE_ISL"),
+        ("osl", "PROFILE_OSL"),
+        ("concurrency", "PROFILE_CONCURRENCY"),
+        ("start_step", "PROFILE_START_STEP"),
+        ("stop_step", "PROFILE_STOP_STEP"),
+    ]:
+        if cfg.get(key) is not None:
+            parts.append(f"{env_var}={cfg[key]}")
```

src/srtctl/core/schema.py (1)
303-306: Remove or clarify the no-op code block.

This block contains only a `pass` statement with a comment stating "This is fine - backend will handle disabling it". If no action is needed here, consider removing the entire `if` block to avoid confusion.

```diff
-        # Auto-disable config dump when profiling (already handled in backend, but validate here too)
-        if self.enable_config_dump:
-            # This is fine - backend will handle disabling it
-            pass
```

scripts/worker_setup/worker.py (1)
92-93: Pre-existing: File handles not explicitly closed.

The file handles for frontend logs are opened but never explicitly closed. While this is intentional for long-running background processes (the frontend runs until job completion), consider using a context manager or documenting this behavior for clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/installation.md (1 hunks)
- scripts/profiling/profile.sh (1 hunks)
- scripts/templates/job_script_template_agg.j2 (7 hunks)
- scripts/templates/job_script_template_disagg.j2 (8 hunks)
- scripts/worker_setup.py (4 hunks)
- scripts/worker_setup/command.py (6 hunks)
- scripts/worker_setup/worker.py (6 hunks)
- src/srtctl/backends/sglang.py (5 hunks)
- src/srtctl/core/schema.py (3 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/profiling/profile.sh
[warning] 48-48: PROFILE_STEPS_ARG appears unused. Verify use (or export if used externally).
(SC2034)
🔇 Additional comments (22)
docs/installation.md (1)
155-179: Clear and comprehensive profiling documentation.

The documentation effectively explains the new profiling configuration with good examples. The mutual exclusion note with benchmarking and the serving behavior differences are helpful for users.
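For context, a profiling section following this documentation might look like the sketch below. The `type` values and per-phase keys are drawn from this review's walkthrough and diffs; the exact nesting is an assumption, not the documented schema.

```yaml
# Hypothetical srtctl config excerpt -- nesting is an assumption.
profiling:
  type: nsys          # one of: none | torch | nsys
  decode:
    isl: 1024         # input sequence length for the profiling run
    osl: 128          # output sequence length
    concurrency: 8    # forwarded to bench_serving as --max-concurrency
    start_step: 3     # defaulted with a warning when unset
    stop_step: 8
```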
scripts/worker_setup.py (2)
105-111: Clean transition to string-based profiler argument.

The new `--profiler` argument with explicit choices provides clear validation and better extensibility compared to the previous boolean flag. The default of `"none"` maintains backward compatibility.
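The pattern being praised is plain argparse; a minimal standalone sketch (not the project's actual parser, with the argument name and choices taken from this review):

```python
import argparse

parser = argparse.ArgumentParser()
# String-based profiler selection replaces the old boolean flag;
# argparse rejects any value outside `choices` at parse time.
parser.add_argument(
    "--profiler",
    choices=["none", "torch", "nsys"],
    default="none",
    help="profiling backend to wrap the worker with",
)

args = parser.parse_args(["--profiler", "nsys"])
print(args.profiler)  # → nsys
```

Passing an unknown value such as `--profiler perf` exits with a usage error, which is exactly the validation the boolean flag could not provide.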
178-218: Consistent propagation of profiler to all worker types.

The `profiler` argument is correctly passed to `setup_prefill_worker`, `setup_decode_worker`, and `setup_aggregated_worker`, ensuring uniform profiling behavior across all worker types.

scripts/templates/job_script_template_agg.j2 (3)
193-195: Correct integration of profiler argument into worker commands.

The `--profiler {{ profiler }}` flag is properly added to `WORKER_ARGS`, ensuring all worker invocations receive the profiling configuration.
342-359: Appropriate wait behavior for profiling vs. non-profiling modes.

The wait logic correctly uses `wait` (all tasks) when profiling is active to ensure the profiling script completes, versus `wait -n` (first task) for normal operation.
323-340: Verify `decode_profile_env` is populated by the backend.

The profiling block correctly gates on `profiler != 'none'` and conditionally includes `SGLANG_TORCH_PROFILER_DIR` only for the torch profiler. However, `decode_profile_env` must be confirmed as populated in the backend template context with the environment variables (like `PROFILE_ISL`, `PROFILE_OSL`) expected by the profiling script.

scripts/profiling/profile.sh (1)
52-65: Good validation with helpful defaults.

The required parameter validation for `PROFILE_ISL` and `PROFILE_OSL` with early exit, combined with sensible defaults for step ranges with warnings, provides a robust configuration approach.

scripts/worker_setup/command.py (4)
13-21: Clean API update for string-based profiler.

The function signature change from a boolean to a string-based profiler, with clear type hints and a default value, provides better extensibility for future profiling methods.
63-69: Well-constructed nsys profiling integration.

The nsys prefix includes appropriate options for CUDA profiling (`cuda`, `nvtx`, `--cuda-graph-trace=node`, `--capture-range-end stop`) and uses an output path pattern consistent with the torch profiler.
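For illustration, a prefix with the flag set named in this review could be assembled like this. The helper name and output path are hypothetical; the flags mirror the ones listed above plus the `--force-overwrite true` mentioned in the later review round.

```python
import shlex

def nsys_prefix(output_path: str) -> list[str]:
    """Assemble an `nsys profile` wrapper: CUDA+NVTX tracing, node-level
    CUDA graph tracing, and a capture-range end so the trace stops
    cleanly; --force-overwrite avoids collisions with an older profile."""
    return [
        "nsys", "profile",
        "-t", "cuda,nvtx",
        "--cuda-graph-trace=node",
        "--capture-range-end", "stop",
        "--force-overwrite", "true",
        "-o", output_path,
    ]

# Prepend the prefix to the normal server launch command.
cmd = " ".join(shlex.quote(a) for a in nsys_prefix("/logs/profiles/worker")) \
      + " python -m sglang.launch_server"
print(cmd)
```

Building the prefix as a list and quoting with `shlex.quote` keeps the command safe even if the output path later contains spaces.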
71-84: Correct flag handling with appropriate filtering.

The iteration over config flags with type-aware handling (booleans, lists, scalars) and the explicit exclusion of `disaggregation-mode` for profiling mode is well implemented.
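The type-aware handling described here can be sketched roughly as follows. The function name and exact key set are hypothetical; only the skip of `disaggregation-mode` mirrors the review.

```python
def flags_from_config(cfg: dict, skip=frozenset({"disaggregation-mode"})) -> list[str]:
    """Type-aware flag rendering: booleans become bare flags when true,
    lists expand into repeated values, scalars become `--key value`.
    Keys in `skip` are dropped -- e.g. `disaggregation-mode`, which
    sglang.launch_server does not accept in profiling mode."""
    parts: list[str] = []
    for key, value in cfg.items():
        if key in skip or value is None:
            continue
        flag = f"--{key}"
        if isinstance(value, bool):
            if value:
                parts.append(flag)
        elif isinstance(value, list):
            parts.extend([flag, *map(str, value)])
        else:
            parts.extend([flag, str(value)])
    return parts

print(flags_from_config({"tp-size": 8, "enable-metrics": True, "disaggregation-mode": "prefill"}))
# → ['--tp-size', '8', '--enable-metrics']
```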
145-174: Consistent API update for `get_gpu_command`.

The function correctly mirrors the profiler parameter addition and passes it through to `build_sglang_command_from_yaml`.

src/srtctl/backends/sglang.py (3)
104-112: LGTM! Profiler type handling looks correct.

The conditional logic properly routes to the appropriate Python module and nsys prefix based on profiling type. The fallback to `dynamo.sglang` for non-profiling mode maintains backward compatibility.
139-147: LGTM! Correct handling of flag exclusion during profiling.

The logic appropriately skips the `disaggregation-mode` flag when profiling is active, as `sglang.launch_server` doesn't accept this flag.
338-340: LGTM! Template variables properly integrated.

The new profiling-related template variables are correctly added and will be available to the Jinja templates.
src/srtctl/core/schema.py (2)
165-189: LGTM! Well-structured profiling configuration schema.

The new `ProfilingType` enum, `ProfilingPhaseConfig`, and `ProfilingConfig` classes provide a clean, type-safe structure for profiling configuration. The enum values ("nsys", "torch", "none") align with usage throughout the codebase.
308-334: LGTM! Comprehensive validation logic.

The mutual exclusivity check with non-MANUAL benchmarks and the single-worker constraint validations are well-implemented with clear error messages.
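A simplified, self-contained sketch of the constraints being praised here (class and field names are hypothetical; the real schema uses richer pydantic-style models):

```python
from dataclasses import dataclass

@dataclass
class JobConfig:
    profiling_type: str = "none"     # "none" | "torch" | "nsys"
    benchmark_mode: str = "manual"   # only manual benchmarking coexists with profiling
    prefill_workers: int = 1
    decode_workers: int = 1

    def validate(self) -> None:
        if self.profiling_type == "none":
            return  # short-circuit: nothing to enforce
        if self.benchmark_mode != "manual":
            raise ValueError("profiling is mutually exclusive with non-manual benchmarking")
        if self.prefill_workers > 1 or self.decode_workers > 1:
            raise ValueError("profiling requires a single worker per phase")

JobConfig(profiling_type="torch").validate()  # passes
```

Raising `ValueError` with a specific message at validation time is what makes the misconfiguration visible at submit time rather than mid-job.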
scripts/templates/job_script_template_disagg.j2 (3)
193-195: LGTM! Profiler argument correctly added to worker args.

The profiler mode is properly injected into `WORKER_ARGS` for all worker types.
363-370: LGTM! Profiling environment correctly conditionally injected.

The template properly conditionalizes `SGLANG_TORCH_PROFILER_DIR` for the torch profiler only, while allowing the generic `prefill_profile_env`/`decode_profile_env` to be injected for all profiling modes.
374-390: LGTM! Correct wait behavior for profiling vs non-profiling modes.

The differentiated wait logic ensures profiling scripts complete before job exit, while benchmark mode uses the original `wait -n` behavior.

scripts/worker_setup/worker.py (3)
55-67: LGTM! Clean parameter refactor for setup_prefill_worker.

The change from the boolean `sglang_torch_profiler` to the string `profiler` with default `"none"` aligns with the new multi-profiler support and is properly propagated to `get_gpu_command`.
120-131: LGTM! Consistent parameter refactor for setup_decode_worker.
162-174: LGTM! Consistent parameter refactor for setup_aggregated_worker.
```bash
else
  curl -X POST http://${head_node}:${head_port}/start_profile -H "Content-Type: application/json" -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}"
  mkdir -p "/logs/profiles" 2>/dev/null || true
fi
```
Add error handling for the remote profiling start request.
If the curl request fails (network issue, server error), the script continues without profiling being enabled. Consider checking the response status.
```diff
 else
-  curl -X POST http://${head_node}:${head_port}/start_profile -H "Content-Type: application/json" -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}"
+  response=$(curl -s -w "\n%{http_code}" -X POST http://${head_node}:${head_port}/start_profile \
+    -H "Content-Type: application/json" \
+    -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}")
+  http_code=$(echo "$response" | tail -n1)
+  if [[ "$http_code" != "200" ]]; then
+    echo "Warning: Failed to start remote profiling (HTTP $http_code)"
+  fi
   mkdir -p "/logs/profiles" 2>/dev/null || true
 fi
```
```bash
  --model ${model_name} \
  --host ${head_node} --port ${head_port} \
  --dataset-name random \
  --max-concurrency $PROFILE_CONCURRENCY \
```
Add validation or default for PROFILE_CONCURRENCY.
PROFILE_CONCURRENCY is used directly without validation, unlike PROFILE_ISL and PROFILE_OSL. If unset, this will produce an invalid command.
Consider adding validation similar to ISL/OSL:
```bash
if [[ -z "${PROFILE_CONCURRENCY}" ]]; then
  echo "Warning: PROFILE_CONCURRENCY not set; defaulting to 1"
  PROFILE_CONCURRENCY=1
fi
```

🤖 Prompt for AI Agents
In scripts/profiling/profile.sh around line 93, PROFILE_CONCURRENCY is used
directly and can be empty, producing an invalid command; add a validation block
like the existing PROFILE_ISL/PROFILE_OSL checks that verifies
PROFILE_CONCURRENCY is set and a positive integer, echoing a warning and
defaulting it to 1 when unset or invalid (e.g., if empty or non-numeric), so the
subsequent --max-concurrency argument always receives a valid value.
Actionable comments posted: 0
🧹 Nitpick comments (2)
scripts/worker_setup/command.py (1)
18-18: Consider validating theprofilerparameter against allowed values.The
profilerparameter accepts any string but only "none", "torch", and "nsys" are valid. Invalid values would silently fall into the profiling code path (sinceprofiler != "none"would be True) but without proper setup.Add validation at the start of
build_sglang_command_from_yaml:VALID_PROFILERS = {"none", "torch", "nsys"} def build_sglang_command_from_yaml( # ... params ... ) -> str: if profiler not in VALID_PROFILERS: raise ValueError(f"Invalid profiler '{profiler}'. Must be one of: {VALID_PROFILERS}") # ... rest of functionAlso applies to: 36-36, 149-149, 161-161
scripts/worker_setup/worker.py (1)
92-100: Consider using context managers or explicit cleanup for file handles.

The file handles `frontend_stdout` and `frontend_stderr` are opened but never explicitly closed. While they remain open intentionally for the background process, if `run_command` fails before returning a process, the handles would leak (though Python will close them on exit).

This is a minor concern since the files are needed for the lifetime of the background process. If you want to be more explicit (note: the handles should be opened before the `try` so the cleanup path never references an unbound name):

```python
frontend_stdout = open("/logs/frontend.out", "w")
frontend_stderr = open("/logs/frontend.err", "w")
try:
    frontend_process = run_command(frontend_cmd, background=True, stdout=frontend_stdout, stderr=frontend_stderr)
    if not frontend_process:
        raise RuntimeError("Failed to start frontend")
except Exception:
    frontend_stdout.close()
    frontend_stderr.close()
    raise
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/installation.md (1 hunks)
- scripts/worker_setup/command.py (6 hunks)
- scripts/worker_setup/worker.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/installation.md
🔇 Additional comments (8)
scripts/worker_setup/command.py (4)
62-62: LGTM - nsys profile command construction is appropriate.

The nsys profile options are well-chosen for CUDA profiling with graph tracing. The `--force-overwrite true` option prevents profile file conflicts.
64-92: Profiling mode command construction looks correct.

The logic properly differentiates between nsys (wrapping with the nsys prefix) and torch profiling (inline python invocation), and correctly builds all SGLang flags from the config.
93-103: LGTM - non-profiling mode correctly uses config-based invocation.
142-171: LGTM - clean parameter propagation to command builder.

The function correctly passes the new `profiler` parameter through to `build_sglang_command_from_yaml`.

scripts/worker_setup/worker.py (4)
55-117: LGTM - prefill worker correctly updated to use string-based profiler.

The parameter change from boolean to string is correctly applied and propagated to `get_gpu_command`.
120-159: LGTM - decode worker correctly updated.

Consistent parameter change and propagation.
162-224: LGTM - aggregated worker correctly updated.

Consistent parameter change and propagation. Note the same file handle pattern exists here (lines 199-200) as discussed for the prefill worker.
103-104: Verify the GPU type check logic: the condition appears logically redundant.

The condition `gpu_type.startswith("gb200") and not gpu_type.startswith("gb300")` is redundant, since a string cannot simultaneously start with both "gb200" and "gb300". However, this may be intentional if GPU naming includes variants like "gb200" and "gb2000". Clarify whether this should be a different exclusion (e.g., excluding a specific GB200 variant) or if the condition should be restructured.
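The redundancy is easy to check mechanically: the prefixes "gb200" and "gb300" differ at the third character, so no string can start with both, and the second clause is implied by the first. A quick standalone check (not project code):

```python
def as_written(gpu_type: str) -> bool:
    # The condition as quoted in the review.
    return gpu_type.startswith("gb200") and not gpu_type.startswith("gb300")

def simplified(gpu_type: str) -> bool:
    return gpu_type.startswith("gb200")

# The two predicates agree for every input, including variants like
# "gb2000" that the review speculates about.
for name in ["gb200", "gb2000", "gb200-nvl72", "gb300", "h100", ""]:
    assert as_written(name) == simplified(name)
print("equivalent for all samples")
```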
Actionable comments posted: 0
♻️ Duplicate comments (2)
scripts/profiling/profile.sh (2)
84-84: Existing concern: curl error handling.

This review comment echoes a previous reviewer's concern about the lack of error handling for the remote profiling start request.

91-91: Existing concern: validate PROFILE_CONCURRENCY.

This review comment echoes a previous reviewer's concern about the lack of validation for `PROFILE_CONCURRENCY` before use.
🧹 Nitpick comments (1)
scripts/profiling/profile.sh (1)
48-48: Remove unused variable.
`PROFILE_STEPS_ARG` is assigned but never used in the script. Apply this diff:

```diff
-PROFILE_STEPS_ARG=""
 CLI_ARGS=""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/installation.md (1 hunks)
- scripts/profiling/profile.sh (1 hunks)
- scripts/templates/job_script_template_agg.j2 (7 hunks)
- scripts/templates/job_script_template_disagg.j2 (8 hunks)
- scripts/worker_setup.py (4 hunks)
- scripts/worker_setup/command.py (6 hunks)
- scripts/worker_setup/worker.py (6 hunks)
- src/srtctl/backends/sglang.py (5 hunks)
- src/srtctl/core/schema.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- docs/installation.md
- scripts/worker_setup/worker.py
- scripts/worker_setup/command.py
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/profiling/profile.sh
[warning] 48-48: PROFILE_STEPS_ARG appears unused. Verify use (or export if used externally).
(SC2034)
🔇 Additional comments (16)
scripts/worker_setup.py (2)
106-111: LGTM: Clean profiler argument design.

The string-based profiler choices provide clear options, and the default of "none" is safe.
179-218: LGTM: Consistent profiler propagation.

The profiler argument is correctly propagated to all worker setup functions (prefill, decode, aggregated).
scripts/profiling/profile.sh (3)
47-65: LGTM: Environment-driven profiling configuration.

The shift from mode-based to environment-driven parameter selection is cleaner. The validation for ISL/OSL with error exit, and the defaults for start/stop steps with warnings, are appropriate.
71-80: LGTM: Conditional profiling directory setup.

The logic to conditionally create profiler directories and set activities based on whether `SGLANG_TORCH_PROFILER_DIR` is provided is correct.
86-107: LGTM: Serving-based profiling and evaluation flow.

The replacement of `bench_one_batch_server` with `bench_serving` and the addition of `lm-eval` align with the PR objectives to support remote profiling.

src/srtctl/core/schema.py (2)
165-189: LGTM: Well-structured profiling configuration models.

The new `ProfilingType` enum, `ProfilingPhaseConfig`, and `ProfilingConfig` classes provide a clean, type-safe profiling configuration system with sensible defaults.
297-334: LGTM: Proper profiling validation constraints.

The validation logic correctly enforces:
- Mutual exclusion between profiling and benchmarking
- Single-worker constraint when profiling is enabled (for both disaggregated and aggregated modes)
The short-circuit when profiling is "none" avoids unnecessary checks.
src/srtctl/backends/sglang.py (3)
104-127: LGTM: Profiler-based command rendering.

The logic correctly selects:
- nsys prefix + `sglang.launch_server` for nsys profiling
- `sglang.launch_server` for torch profiling
- `dynamo.sglang` for normal operation

Safe dictionary access with `.get()` and a default of "none" prevents errors.
139-147: LGTM: Skip incompatible flag when profiling.

Correctly skips the `disaggregation-mode` flag for `sglang.launch_server`, which doesn't support it.
284-306: LGTM: Clean profiling environment construction.

The `build_env_str` helper cleanly converts profiling configuration into environment variable assignments for injection into templates.

scripts/templates/job_script_template_agg.j2 (3)
193-194: LGTM: Unified profiler argument propagation.

The profiler value from the config is correctly propagated to all worker setup commands through `WORKER_ARGS`.
323-340: LGTM: Conditional profiling execution.

The profiling block correctly:

- gates on `profiler != 'none'`
- conditionally sets `SGLANG_TORCH_PROFILER_DIR` only for torch profiling
- injects phase-specific environment variables via `decode_profile_env`
342-359: LGTM: Profiler-aware wait logic.

The wait logic correctly branches:

- when profiling is enabled: waits for profiling-script completion and propagates its exit code
- when profiling is disabled: uses `wait -n` to wait for first-task (benchmark) completion

scripts/templates/job_script_template_disagg.j2 (3)
193-194: LGTM: Unified profiler argument propagation.

Consistent with the aggregated template, the profiler value is correctly propagated through `WORKER_ARGS`.
350-372: LGTM: Disaggregated profiling execution.

The template correctly:

- launches separate profiling for prefill and decode workers
- conditionally sets `SGLANG_TORCH_PROFILER_DIR` only for torch profiling
- injects phase-specific environment variables (`prefill_profile_env`, `decode_profile_env`)
374-391: LGTM: Correct wait semantics for disaggregated profiling.

The wait logic properly:

- waits for all profiling scripts (both prefill and decode) when profiling is enabled
- falls back to `wait -n` for benchmark completion when profiling is disabled
now profiling can set
Besides, I merged the .err and .out files, because for these frameworks .err does not really contain errors, and keeping the streams separate loses the time-sequence information.
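In Python, merging the two streams while preserving their ordering is a one-liner with subprocess (a generic sketch, not the project's `run_command`):

```python
import os
import subprocess
import sys
import tempfile

# Redirect the child's stderr into the same stream as stdout, so a single
# combined log file preserves the relative ordering of both streams.
log_path = os.path.join(tempfile.mkdtemp(), "worker.log")
with open(log_path, "w") as log:
    subprocess.run(
        [sys.executable, "-u", "-c",
         "import sys; print('out'); print('err', file=sys.stderr)"],
        stdout=log,
        stderr=subprocess.STDOUT,  # the shell equivalent is `2>&1`
        check=True,
    )

with open(log_path) as fh:
    merged = fh.read()
print(merged)  # both lines, in emission order
```

`stderr=subprocess.STDOUT` makes the child's fd 2 a duplicate of fd 1, so both streams share one file offset and interleave in write order.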
Summary by CodeRabbit
New Features
Documentation
Validation