
add nsys profiling and swap to bench_serving instead of bench_one_batch #41

Merged
ishandhanani merged 3 commits into ishandhanani:main from weireweire:add-custom-mount
Dec 5, 2025

Conversation

@weireweire
Collaborator

@weireweire weireweire commented Nov 28, 2025

Profiling can now be configured:

profiling:
  type: "torch"
  prefill:
    isl: 128000
    osl: 16
    concurrency: 1
    start_step: 0 # sglang bench doesn't support a non-zero start_step yet.
    stop_step: 16
  decode:
    isl: 8
    osl: 16
    concurrency: 1
    start_step: 0
    stop_step: 16

I also merged the .err and .out files: for these frameworks, .err output is not actually errors, and keeping the streams separate loses the time-sequence information.
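The stream-merging approach described above can be sketched in Python; the log path and command here are hypothetical, not the actual worker code:

```python
import subprocess

# Redirect stderr into stdout so log lines from both streams stay
# interleaved in their original time order (hypothetical log path).
with open("/tmp/worker.out", "w") as log:
    proc = subprocess.Popen(
        ["echo", "hello"],
        stdout=log,
        stderr=subprocess.STDOUT,  # merge .err into .out
    )
    proc.wait()
```

Because both streams share one file descriptor, the kernel preserves write ordering, which is exactly the time-sequence information that separate .out/.err files lose.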

Summary by CodeRabbit

  • New Features

    • Added profiling support with selectable types: none, torch, nsys; serving behavior switches when profiling is active.
    • New CLI option to choose profiler and pass profiling settings to launches.
  • Documentation

    • New profiling section with YAML examples, phase-specific (prefill/decode) settings, and note that profiling and benchmarking cannot run concurrently.
  • Validation

    • Job validation updated to enforce profiling constraints and single-worker requirements where applicable.


@weireweire weireweire force-pushed the add-custom-mount branch 3 times, most recently from e219567 to 062f5a6 Compare December 4, 2025 06:08
@weireweire weireweire changed the title from "[draft] profiling param set. unify .out/.err" to "profiling param set. nsys profiling support. unify .out/.err" Dec 4, 2025
@ishandhanani ishandhanani changed the title from "profiling param set. nsys profiling support. unify .out/.err" to "add nsys profiling and swap to bench_serving instead of bench_one_batch" Dec 4, 2025
@ishandhanani
Owner

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025

Walkthrough

Replace the boolean profiling flag with a string-based profiler ("none", "torch", "nsys"), add per-phase profiling config, propagate the profiler through schema, backend rendering, worker CLI/command builders, SLURM templates and profiling scripts, and enforce mutual exclusion with benchmarking.

Changes

  • Schema & Config — src/srtctl/core/schema.py: Add ProfilingType, ProfilingPhaseConfig, ProfilingConfig; add profiling to JobConfig; remove enable_profiling from BackendConfig; update validation to use profiling.type and enforce profiling/benchmark mutual exclusion and worker-count constraints.
  • Backend command rendering — src/srtctl/backends/sglang.py: Replace boolean profiling with profiling.type; choose launcher (dynamo.sglang vs sglang.launch_server) and add NSYS prefix when appropriate; skip disaggregation/config-dump for profiling; expose profiler, prefill_profile_env, decode_profile_env to templates.
  • Worker CLI & setup — scripts/worker_setup.py, scripts/worker_setup/worker.py: Replace the --sglang-torch-profiler boolean CLI flag with --profiler (none, torch, nsys); update worker setup function signatures and call sites to accept and forward profiler: str.
  • Command builders — scripts/worker_setup/command.py: Change use_profiling: bool to profiler: str; build commands selecting module and env vars based on profiler; add nsys prefix handling and torch profiler env injection; preserve the config-based path when none.
  • SLURM templates — scripts/templates/job_script_template_agg.j2, scripts/templates/job_script_template_disagg.j2: Replace boolean gates with profiler checks (profiler != 'none', profiler == 'torch'); emit a unified --profiler flag; conditionally inject SGLANG_TORCH_PROFILER_DIR and phase env strings; adjust wait/exit behavior when profiling is active.
  • Profiling script — scripts/profiling/profile.sh: Move from mode-based to env-driven config (require PROFILE_ISL/PROFILE_OSL); default start/stop steps; create the profiler dir only if SGLANG_TORCH_PROFILER_DIR is set, otherwise use a remote HTTP POST to trigger profiling; switch to a serving-based profiling flow and add an lm-eval step.
  • Documentation — docs/installation.md: Add a top-level profiling section in job YAML (type, prefill/decode phase params), an example YAML, a note that profiling and benchmarking cannot run concurrently, and clarification of serving module selection based on profiler type.
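Based on the summary above, the new schema types might look roughly like this. This is a dataclass sketch for illustration only, not the actual srtctl source; field names are taken from the YAML example in the PR description:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ProfilingType(str, Enum):
    NONE = "none"
    TORCH = "torch"
    NSYS = "nsys"

@dataclass
class ProfilingPhaseConfig:
    isl: Optional[int] = None          # input sequence length
    osl: Optional[int] = None          # output sequence length
    concurrency: Optional[int] = None
    start_step: Optional[int] = None
    stop_step: Optional[int] = None

@dataclass
class ProfilingConfig:
    type: ProfilingType = ProfilingType.NONE  # safe default: profiling off
    prefill: Optional[ProfilingPhaseConfig] = None
    decode: Optional[ProfilingPhaseConfig] = None
```

A `ProfilingConfig()` with no arguments therefore behaves like the old "profiling disabled" boolean, while per-phase settings stay optional.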

Sequence Diagram(s)

sequenceDiagram
    participant CLI as User CLI (srtctl)
    participant Scheduler as SLURM / job script
    participant Worker as Worker process
    participant Profiler as Profiler dir / Remote Profiler API

    CLI->>Scheduler: submit job with profiling.type (none|torch|nsys)
    Scheduler->>Worker: start worker with `--profiler` and phase envs
    alt profiling.type == "torch" and SGLANG_TORCH_PROFILER_DIR set
        Worker->>Profiler: create profiler dir
        Worker->>Worker: start python -m sglang.launch_server under torch profiler
    else profiling.type == "torch" and no local profiler dir
        Worker->>Profiler: HTTP POST /start_profile (remote trigger)
        Worker->>Worker: start python -m sglang.launch_server
    else profiling.type == "nsys"
        Worker->>Worker: start python prefixed with nsys CLI wrapper
    else profiling.type == "none"
        Worker->>Worker: start python -m dynamo.sglang (normal serving)
    end
    Worker->>Scheduler: report exit code and results path

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to focus:
    • src/srtctl/core/schema.py: new schema models and validation interactions with benchmarking/disaggregation.
    • src/srtctl/backends/sglang.py: command rendering, NSYS prefix, env-building helpers and template variable injection.
    • scripts/worker_setup/command.py and scripts/worker_setup/worker.py: command construction across profiler modes and updated signatures.
    • scripts/profiling/profile.sh: switch to env-driven flow, remote-start vs local-dir handling, and integration with bench_serving.
    • SLURM templates (job_script_template_*.j2): ensure --profiler propagation and correct wait/exit semantics.

Poem

🐰
I swapped a bool for three bright ways,
"none", "torch", "nsys" — I danced and praised,
Templates, scripts, schema hopped in tune,
Profiling wakes beneath the moon,
A tiny rabbit cheers the new profiling day! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — docstring coverage is 71.43%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title accurately describes the two main changes: adding nsys profiling support and replacing bench_one_batch with bench_serving.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
scripts/profiling/profile.sh (2)

48-48: Remove unused variable PROFILE_STEPS_ARG.

This variable is assigned but never used in the script.

 # Determine profiling parameters strictly from environment 
-PROFILE_STEPS_ARG=""
 CLI_ARGS=""

49-50: Remove unused CLI_ARGS variable.

CLI_ARGS is built with --batch-size but never used in the bench_serving command. The concurrency is passed via --max-concurrency directly.

 # Determine profiling parameters strictly from environment 
-CLI_ARGS=""
-[[ -n "${PROFILE_CONCURRENCY}" ]] && CLI_ARGS+=" --batch-size ${PROFILE_CONCURRENCY}"
 # Require ISL/OSL to be provided; do not pass them as CLI args here
src/srtctl/backends/sglang.py (1)

289-301: Consider simplifying the conditional checks.

The pattern if "key" in cfg and cfg["key"] is not None can be simplified to if cfg.get("key") is not None since dict.get() returns None for missing keys.

-            if "isl" in cfg and cfg["isl"] is not None:
-                parts.append(f"PROFILE_ISL={cfg['isl']}")
-            if "osl" in cfg and cfg["osl"] is not None:
-                parts.append(f"PROFILE_OSL={cfg['osl']}")
-            if "concurrency" in cfg and cfg["concurrency"] is not None:
-                parts.append(f"PROFILE_CONCURRENCY={cfg['concurrency']}")
-            if "start_step" in cfg and cfg["start_step"] is not None:
-                parts.append(f"PROFILE_START_STEP={cfg['start_step']}")
-            if "stop_step" in cfg and cfg["stop_step"] is not None:
-                parts.append(f"PROFILE_STOP_STEP={cfg['stop_step']}")
+            for key, env_var in [
+                ("isl", "PROFILE_ISL"),
+                ("osl", "PROFILE_OSL"),
+                ("concurrency", "PROFILE_CONCURRENCY"),
+                ("start_step", "PROFILE_START_STEP"),
+                ("stop_step", "PROFILE_STOP_STEP"),
+            ]:
+                if cfg.get(key) is not None:
+                    parts.append(f"{env_var}={cfg[key]}")
src/srtctl/core/schema.py (1)

303-306: Remove or clarify the no-op code block.

This block contains only a pass statement with a comment stating "This is fine - backend will handle disabling it". If no action is needed here, consider removing the entire if block to avoid confusion.

-        # Auto-disable config dump when profiling (already handled in backend, but validate here too)
-        if self.enable_config_dump:
-            # This is fine - backend will handle disabling it
-            pass
scripts/worker_setup/worker.py (1)

92-93: Pre-existing: File handles not explicitly closed.

The file handles for frontend logs are opened but never explicitly closed. While this is intentional for long-running background processes (the frontend runs until job completion), consider using a context manager or documenting this behavior for clarity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8162742 and 062f5a6.

📒 Files selected for processing (9)
  • docs/installation.md (1 hunks)
  • scripts/profiling/profile.sh (1 hunks)
  • scripts/templates/job_script_template_agg.j2 (7 hunks)
  • scripts/templates/job_script_template_disagg.j2 (8 hunks)
  • scripts/worker_setup.py (4 hunks)
  • scripts/worker_setup/command.py (6 hunks)
  • scripts/worker_setup/worker.py (6 hunks)
  • src/srtctl/backends/sglang.py (5 hunks)
  • src/srtctl/core/schema.py (3 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/profiling/profile.sh

[warning] 48-48: PROFILE_STEPS_ARG appears unused. Verify use (or export if used externally).

(SC2034)

🔇 Additional comments (22)
docs/installation.md (1)

155-179: Clear and comprehensive profiling documentation.

The documentation effectively explains the new profiling configuration with good examples. The mutual exclusion note with benchmarking and the serving behavior differences are helpful for users.

scripts/worker_setup.py (2)

105-111: Clean transition to string-based profiler argument.

The new --profiler argument with explicit choices provides clear validation and better extensibility compared to the previous boolean flag. Default of "none" maintains backward compatibility.
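The CLI change described here can be illustrated with a minimal argparse sketch. The flag name and choices come from the review; the surrounding parser is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--profiler",
    choices=["none", "torch", "nsys"],  # invalid values are rejected by argparse
    default="none",                     # safe default: no profiling
)

args = parser.parse_args(["--profiler", "nsys"])
```

With `choices`, an invalid value like `--profiler foo` fails at argument parsing with a clear error, instead of silently falling through to the profiling code path.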


178-218: Consistent propagation of profiler to all worker types.

The profiler argument is correctly passed to setup_prefill_worker, setup_decode_worker, and setup_aggregated_worker, ensuring uniform profiling behavior across all worker types.

scripts/templates/job_script_template_agg.j2 (3)

193-195: Correct integration of profiler argument into worker commands.

The --profiler {{ profiler }} flag is properly added to WORKER_ARGS, ensuring all worker invocations receive the profiling configuration.


342-359: Appropriate wait behavior for profiling vs. non-profiling modes.

The wait logic correctly uses wait (all tasks) when profiling is active to ensure the profiling script completes, versus wait -n (first task) for normal operation.


323-340: Verify decode_profile_env is populated by the backend.

The profiling block correctly gates on profiler != 'none' and conditionally includes SGLANG_TORCH_PROFILER_DIR only for torch profiler. However, decode_profile_env must be confirmed as populated in the backend template context with environment variables (like PROFILE_ISL, PROFILE_OSL) expected by the profiling script.

scripts/profiling/profile.sh (1)

52-65: Good validation with helpful defaults.

The required parameter validation for PROFILE_ISL and PROFILE_OSL with early exit, combined with sensible defaults for step ranges with warnings, provides a robust configuration approach.

scripts/worker_setup/command.py (4)

13-21: Clean API update for string-based profiler.

The function signature change from boolean to string-based profiler with clear type hints and default value provides better extensibility for future profiling methods.


63-69: Well-constructed nsys profiling integration.

The nsys prefix includes appropriate options for CUDA profiling (cuda,nvtx, --cuda-graph-trace=node, --capture-range-end stop) and uses a consistent output path pattern with torch profiler.
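A sketch of how such an nsys prefix might be assembled in Python. The option list follows the review's description (`cuda,nvtx` tracing, `--cuda-graph-trace=node`, `--capture-range-end stop`, `--force-overwrite true`); the helper name and output path are hypothetical:

```python
def build_nsys_prefix(output_path: str) -> str:
    # Wrap the server launch in `nsys profile` so CUDA/NVTX activity
    # (including CUDA graph nodes) is captured into one report file.
    return (
        "nsys profile "
        "--trace=cuda,nvtx "
        "--cuda-graph-trace=node "
        "--capture-range-end=stop "
        f"--force-overwrite true -o {output_path} "
    )

# Hypothetical composed launch command:
cmd = build_nsys_prefix("/logs/profiles/worker") + "python -m sglang.launch_server"
```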


71-84: Correct flag handling with appropriate filtering.

The iteration over config flags with type-aware handling (booleans, lists, scalars) and explicit exclusion of disaggregation-mode for profiling mode is well implemented.


145-174: Consistent API update for get_gpu_command.

The function correctly mirrors the profiler parameter addition and passes it through to build_sglang_command_from_yaml.

src/srtctl/backends/sglang.py (3)

104-112: LGTM! Profiler type handling looks correct.

The conditional logic properly routes to the appropriate Python module and nsys prefix based on profiling type. The fallback to dynamo.sglang for non-profiling mode maintains backward compatibility.


139-147: LGTM! Correct handling of flag exclusion during profiling.

The logic appropriately skips the disaggregation-mode flag when profiling is active, as sglang.launch_server doesn't accept this flag.


338-340: LGTM! Template variables properly integrated.

The new profiling-related template variables are correctly added and will be available for the Jinja templates.

src/srtctl/core/schema.py (2)

165-189: LGTM! Well-structured profiling configuration schema.

The new ProfilingType enum, ProfilingPhaseConfig, and ProfilingConfig classes provide a clean, type-safe structure for profiling configuration. The enum values ("nsys", "torch", "none") align with usage throughout the codebase.


308-334: LGTM! Comprehensive validation logic.

The mutual exclusivity check with non-MANUAL benchmarks and the single-worker constraint validations are well-implemented with clear error messages.

scripts/templates/job_script_template_disagg.j2 (3)

193-195: LGTM! Profiler argument correctly added to worker args.

The profiler mode is properly injected into WORKER_ARGS for all worker types.


363-370: LGTM! Profiling environment correctly conditionally injected.

The template properly conditionalizes SGLANG_TORCH_PROFILER_DIR for torch profiler only, while allowing the generic prefill_profile_env/decode_profile_env to be injected for all profiling modes.


374-390: LGTM! Correct wait behavior for profiling vs non-profiling modes.

The differentiated wait logic ensures profiling scripts complete before job exit, while benchmark mode uses the original wait -n behavior.

scripts/worker_setup/worker.py (3)

55-67: LGTM! Clean parameter refactor for setup_prefill_worker.

The change from boolean sglang_torch_profiler to string profiler with default "none" aligns with the new multi-profiler support and is properly propagated to get_gpu_command.


120-131: LGTM! Consistent parameter refactor for setup_decode_worker.


162-174: LGTM! Consistent parameter refactor for setup_aggregated_worker.

Comment on lines +79 to +82
else
curl -X POST http://${head_node}:${head_port}/start_profile -H "Content-Type: application/json" -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}"
mkdir -p "/logs/profiles" 2>/dev/null || true
fi

⚠️ Potential issue | 🟡 Minor

Add error handling for the remote profiling start request.

If the curl request fails (network issue, server error), the script continues without profiling being enabled. Consider checking the response status.

 else
-    curl -X POST http://${head_node}:${head_port}/start_profile -H "Content-Type: application/json" -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}"
+    response=$(curl -s -w "\n%{http_code}" -X POST http://${head_node}:${head_port}/start_profile \
+        -H "Content-Type: application/json" \
+        -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}")
+    http_code=$(echo "$response" | tail -n1)
+    if [[ "$http_code" != "200" ]]; then
+        echo "Warning: Failed to start remote profiling (HTTP $http_code)"
+    fi
     mkdir -p "/logs/profiles" 2>/dev/null || true
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
else
curl -X POST http://${head_node}:${head_port}/start_profile -H "Content-Type: application/json" -d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}"
mkdir -p "/logs/profiles" 2>/dev/null || true
fi
else
response=$(curl -s -w "\n%{http_code}" -X POST http://${head_node}:${head_port}/start_profile \
-H "Content-Type: application/json" \
-d "{\"start_step\": \"$PROFILE_START_STEP\", \"num_steps\": $((PROFILE_STOP_STEP-PROFILE_START_STEP)), \"activities\": [\"CUDA_PROFILER\"]}")
http_code=$(echo "$response" | tail -n1)
if [[ "$http_code" != "200" ]]; then
echo "Warning: Failed to start remote profiling (HTTP $http_code)"
fi
mkdir -p "/logs/profiles" 2>/dev/null || true
fi

--model ${model_name} \
--host ${head_node} --port ${head_port} \
--dataset-name random \
--max-concurrency $PROFILE_CONCURRENCY \

⚠️ Potential issue | 🟡 Minor

Add validation or default for PROFILE_CONCURRENCY.

PROFILE_CONCURRENCY is used directly without validation, unlike PROFILE_ISL and PROFILE_OSL. If unset, this will produce an invalid command.

Consider adding validation similar to ISL/OSL:

if [[ -z "${PROFILE_CONCURRENCY}" ]]; then
    echo "Warning: PROFILE_CONCURRENCY not set; defaulting to 1"
    PROFILE_CONCURRENCY=1
fi
🤖 Prompt for AI Agents
In scripts/profiling/profile.sh around line 93, PROFILE_CONCURRENCY is used
directly and can be empty, producing an invalid command; add a validation block
like the existing PROFILE_ISL/PROFILE_OSL checks that verifies
PROFILE_CONCURRENCY is set and a positive integer, echoing a warning and
defaulting it to 1 when unset or invalid (e.g., if empty or non-numeric), so the
subsequent --max-concurrency argument always receives a valid value.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
scripts/worker_setup/command.py (1)

18-18: Consider validating the profiler parameter against allowed values.

The profiler parameter accepts any string but only "none", "torch", and "nsys" are valid. Invalid values would silently fall into the profiling code path (since profiler != "none" would be True) but without proper setup.

Add validation at the start of build_sglang_command_from_yaml:

VALID_PROFILERS = {"none", "torch", "nsys"}

def build_sglang_command_from_yaml(
    # ... params ...
) -> str:
    if profiler not in VALID_PROFILERS:
        raise ValueError(f"Invalid profiler '{profiler}'. Must be one of: {VALID_PROFILERS}")
    # ... rest of function

Also applies to: 36-36, 149-149, 161-161

scripts/worker_setup/worker.py (1)

92-100: Consider using context managers or explicit cleanup for file handles.

The file handles frontend_stdout and frontend_stderr are opened but never explicitly closed. While they remain open intentionally for the background process, if run_command fails before returning a process, the handles would leak (though Python will close them on exit).

This is a minor concern since the files are needed for the lifetime of the background process. If you want to be more explicit:

try:
    frontend_stdout = open("/logs/frontend.out", "w")
    frontend_stderr = open("/logs/frontend.err", "w")
    frontend_process = run_command(frontend_cmd, background=True, stdout=frontend_stdout, stderr=frontend_stderr)
    if not frontend_process:
        frontend_stdout.close()
        frontend_stderr.close()
        raise RuntimeError("Failed to start frontend")
except Exception:
    frontend_stdout.close()
    frontend_stderr.close()
    raise
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 062f5a6 and 1ca09e4.

📒 Files selected for processing (3)
  • docs/installation.md (1 hunks)
  • scripts/worker_setup/command.py (6 hunks)
  • scripts/worker_setup/worker.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/installation.md
🔇 Additional comments (8)
scripts/worker_setup/command.py (4)

62-62: LGTM - nsys profile command construction is appropriate.

The nsys profile options are well-chosen for CUDA profiling with graph tracing. The --force-overwrite true prevents profile file conflicts.


64-92: Profiling mode command construction looks correct.

The logic properly differentiates between nsys (wrapping with nsys prefix) and torch profiling (inline python invocation), and correctly builds all SGLang flags from the config.


93-103: LGTM - non-profiling mode correctly uses config-based invocation.


142-171: LGTM - clean parameter propagation to command builder.

The function correctly passes the new profiler parameter through to build_sglang_command_from_yaml.

scripts/worker_setup/worker.py (4)

55-117: LGTM - prefill worker correctly updated to use string-based profiler.

The parameter change from boolean to string is correctly applied and propagated to get_gpu_command.


120-159: LGTM - decode worker correctly updated.

Consistent parameter change and propagation.


162-224: LGTM - aggregated worker correctly updated.

Consistent parameter change and propagation. Note the same file handle pattern exists here (lines 199-200) as discussed for the prefill worker.


103-104: Verify the GPU type check logic—the condition appears logically contradictory.

The condition gpu_type.startswith("gb200") and not gpu_type.startswith("gb300") appears redundant since a string cannot simultaneously start with both "gb200" and "gb300". However, this may be intentional if GPU naming includes variants like "gb200" and "gb2000". Clarify whether this should be a different exclusion (e.g., excluding a specific GB200 variant) or if the condition should be restructured.

@ishandhanani ishandhanani merged commit b5517c7 into ishandhanani:main Dec 5, 2025
1 check was pending

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
scripts/profiling/profile.sh (2)

84-84: Existing concern: curl error handling.

This review comment echoes a previous reviewer's concern about the lack of error handling for the remote profiling start request.


91-91: Existing concern: validate PROFILE_CONCURRENCY.

This review comment echoes a previous reviewer's concern about the lack of validation for PROFILE_CONCURRENCY before use.

🧹 Nitpick comments (1)
scripts/profiling/profile.sh (1)

48-48: Remove unused variable.

PROFILE_STEPS_ARG is assigned but never used in the script.

Apply this diff:

-PROFILE_STEPS_ARG=""
 CLI_ARGS=""
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ca09e4 and 6917ba5.

📒 Files selected for processing (9)
  • docs/installation.md (1 hunks)
  • scripts/profiling/profile.sh (1 hunks)
  • scripts/templates/job_script_template_agg.j2 (7 hunks)
  • scripts/templates/job_script_template_disagg.j2 (8 hunks)
  • scripts/worker_setup.py (4 hunks)
  • scripts/worker_setup/command.py (6 hunks)
  • scripts/worker_setup/worker.py (6 hunks)
  • src/srtctl/backends/sglang.py (5 hunks)
  • src/srtctl/core/schema.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/installation.md
  • scripts/worker_setup/worker.py
  • scripts/worker_setup/command.py
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/profiling/profile.sh

[warning] 48-48: PROFILE_STEPS_ARG appears unused. Verify use (or export if used externally).

(SC2034)

🔇 Additional comments (16)
scripts/worker_setup.py (2)

106-111: LGTM: Clean profiler argument design.

The string-based profiler choices provide clear options and the default of "none" is safe.


179-218: LGTM: Consistent profiler propagation.

The profiler argument is correctly propagated to all worker setup functions (prefill, decode, aggregated).

scripts/profiling/profile.sh (3)

47-65: LGTM: Environment-driven profiling configuration.

The shift from mode-based to environment-driven parameter selection is cleaner. The validation for ISL/OSL with error exit and defaults for start/stop steps with warnings are appropriate.


71-80: LGTM: Conditional profiling directory setup.

The logic to conditionally create profiler directories and set activities based on whether SGLANG_TORCH_PROFILER_DIR is provided is correct.


86-107: LGTM: Serving-based profiling and evaluation flow.

The replacement of bench_one_batch_server with bench_serving and the addition of lm-eval align with the PR objectives to support remote profiling.

src/srtctl/core/schema.py (2)

165-189: LGTM: Well-structured profiling configuration models.

The new ProfilingType enum, ProfilingPhaseConfig, and ProfilingConfig classes provide a clean, type-safe profiling configuration system with sensible defaults.


297-334: LGTM: Proper profiling validation constraints.

The validation logic correctly enforces:

  • Mutual exclusion between profiling and benchmarking
  • Single-worker constraint when profiling is enabled (for both disaggregated and aggregated modes)

The short-circuit when profiling is "none" avoids unnecessary checks.
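The constraints listed above could be expressed with a validator along these lines. This is a standalone sketch with hypothetical parameter names; the actual schema uses its own model classes and error messages:

```python
def validate_profiling(profiling_type: str, benchmark_type: str, num_workers: int) -> None:
    """Reject configurations the review describes as invalid:
    profiling combined with a real benchmark, or multiple workers."""
    if profiling_type == "none":
        return  # short-circuit: nothing to check when profiling is off
    if benchmark_type != "manual":
        raise ValueError("profiling and benchmarking cannot run concurrently")
    if num_workers != 1:
        raise ValueError("profiling requires a single worker")

validate_profiling("torch", "manual", 1)  # valid combination, no error
```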

src/srtctl/backends/sglang.py (3)

104-127: LGTM: Profiler-based command rendering.

The logic correctly selects:

  • NSYS prefix + sglang.launch_server for nsys profiling
  • sglang.launch_server for torch profiling
  • dynamo.sglang for normal operation

Safe dictionary access with .get() and default "none" prevents errors.


139-147: LGTM: Skip incompatible flag when profiling.

Correctly skips the disaggregation-mode flag for sglang.launch_server, which doesn't support it.


284-306: LGTM: Clean profiling environment construction.

The build_env_str helper cleanly converts profiling configuration into environment variable assignments for injection into templates.
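A sketch of what such an env-string helper might look like, using the key/env-var pairs from the earlier nitpick diff; this is not the exact backend code:

```python
ENV_KEYS = [
    ("isl", "PROFILE_ISL"),
    ("osl", "PROFILE_OSL"),
    ("concurrency", "PROFILE_CONCURRENCY"),
    ("start_step", "PROFILE_START_STEP"),
    ("stop_step", "PROFILE_STOP_STEP"),
]

def build_env_str(cfg: dict) -> str:
    # Emit KEY=value pairs only for settings actually present in the
    # phase config, so profile.sh can fall back to its own defaults.
    parts = [f"{env}={cfg[key]}" for key, env in ENV_KEYS if cfg.get(key) is not None]
    return " ".join(parts)

env = build_env_str({"isl": 128000, "osl": 16, "concurrency": 1})
```

Skipping unset keys (rather than emitting empty values) is what lets the script-side validation and defaults, discussed in the profile.sh comments, take effect.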

scripts/templates/job_script_template_agg.j2 (3)

193-194: LGTM: Unified profiler argument propagation.

The profiler value from the config is correctly propagated to all worker setup commands through WORKER_ARGS.


323-340: LGTM: Conditional profiling execution.

The profiling block correctly:

  • Gates on profiler != 'none'
  • Conditionally sets SGLANG_TORCH_PROFILER_DIR only for torch profiling
  • Injects phase-specific environment variables via decode_profile_env

342-359: LGTM: Profiler-aware wait logic.

The wait logic correctly branches:

  • When profiling is enabled: waits for profiling script completion and propagates its exit code
  • When profiling is disabled: uses wait -n to wait for first task (benchmark) completion
scripts/templates/job_script_template_disagg.j2 (3)

193-194: LGTM: Unified profiler argument propagation.

Consistent with the aggregated template, the profiler value is correctly propagated through WORKER_ARGS.


350-372: LGTM: Disaggregated profiling execution.

The template correctly:

  • Launches separate profiling for prefill and decode workers
  • Conditionally sets SGLANG_TORCH_PROFILER_DIR only for torch profiling
  • Injects phase-specific environment variables (prefill_profile_env, decode_profile_env)

374-391: LGTM: Correct wait semantics for disaggregated profiling.

The wait logic properly:

  • Waits for all profiling scripts (both prefill and decode) when profiling is enabled
  • Falls back to wait -n for benchmark completion when profiling is disabled
