
Yeswanthk/dsr1 workeronly agg #99

Open
yeswanthk-26 wants to merge 6 commits into ishandhanani:main from yeswanthk-26:yeswanthk/dsr1-workeronly-agg

Conversation

@yeswanthk-26

@yeswanthk-26 yeswanthk-26 commented Jan 26, 2026

Updated PR description

1. Support I Added

What

  • Make stock SGLang behavior automatic (no user-facing frontend.type: none setting required):
    • Aggregated + agg_workers=1: launch only sglang.launch_server and run health + SA-bench directly against the worker OAI endpoint (no router).
    • Aggregated + agg_workers>1: use the SGLang router.
    • Disaggregated: use the SGLang router.
  • Update orchestrator stages to skip frontend bring-up when in direct-to-worker mode and route health/bench accordingly.
  • Add DeepSeek-R1 FP8 SGLang H100 aggregated recipes (TP=8, PP=2):
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml (includes slurm.time_limit: 02:00:00)
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml
  • Add Idle-GPU reaper exemption via sbatch_directives.comment (JSON) for long model-load/warmup.
  • Harden head infra: bind NATS to the node’s private IP instead of 0.0.0.0.

Why
Stock SGLang aggregated with a single worker already exposes an OAI-compatible server; launching an extra router/frontend prevents hitting the worker directly. This change makes the correct behavior the default and keeps router semantics for multi-worker and disaggregated modes.

2. AGG DSR-1 SGLANG FP8 H100

1K/1K (ISL=1024, OSL=1024)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x4x8x16x32x64x128x256x512
  • SLURM time limit: 02:00:00

8K/1K (ISL=8192, OSL=1024)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x64
  • SLURM time limit: inherits repo defaults (not set in recipe)

1K/8K (ISL=1024, OSL=8192)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x4x8x16x32x64x128x176x256
  • SLURM time limit: 03:00:00

Key settings (common across these 3 recipes)

  • Container: docker://lmsysorg/sglang:v0.5.8-cu130-runtime
  • Model: DeepSeek-R1 FP8 (served-model-name: deepseek-ai/DeepSeek-R1-0528, precision: fp8, trust-remote-code: true)
  • Cluster shape: 2 nodes × 8 H100/node = 16 GPUs, agg_workers: 1 (single aggregated worker group)
  • Parallelism: tensor-parallel-size: 8, pipeline-parallel-size: 2, data-parallel-size: 1
  • Attention backend: flashinfer
  • Idle GPU reaper exemption: sbatch_directives.comment includes OccupiedIdleGPUsJobReaper exemption (60 min) for long load/warmup
  • Readiness wait: health_check: interval_seconds=10, max_attempts=720 (wait up to ~2 hours for the worker to come up)
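The readiness wait above is just a bounded polling loop; a minimal sketch (the `check_ready` callable is a stand-in for the real worker probe, not srtctl's actual API):

```python
import time

def wait_for_ready(check_ready, interval_seconds=10, max_attempts=720):
    """Poll check_ready() until it returns True or the attempt budget runs out.

    With the recipe defaults (10 s x 720 attempts) the total wait is
    7200 s, i.e. ~2 hours, matching the 02:00:00 SLURM walltime.
    """
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return True
        if attempt < max_attempts:
            time.sleep(interval_seconds)
    return False
```

The effective timeout is simply `max_attempts * interval_seconds`, which is why the recipes scale `max_attempts` rather than the interval when multi-node weight loading is slow.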

Summary by CodeRabbit

  • New Features

    • Direct-to-worker deployment mode to run workers without an intermediate frontend.
    • Dynamic, job-specific port base and runtime-selected frontend port to reduce collisions.
  • Improvements

    • Longer health-check timeouts to accommodate multi-node weight loading and warmup.
    • Improved node IP detection preferring private network addresses.
    • Smarter startup flow that conditionally skips frontend/head services when appropriate.
  • Documentation & Configuration

    • Added new H100 FP8 deployment templates for the SGLang backend.

@coderabbitai
Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Walkthrough

Adds worker-only (direct/none) SGLang support and three H100 FP8 worker-only recipe YAMLs; introduces runtime fields (effective_frontend_type, frontend_port, sys_port_base), deterministic sys-port seeding, direct health-check path, sglang launch adjustments, NATS bind-to-host changes, and improved IP selection logic. (50 words)

Changes

Cohort / File(s) Summary
Worker-only recipe files
recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml, recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml, recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml
Add three H100 FP8 SGLang worker-only aggregation configs (TP=8, DP=1, PP=2) with model/tokenizer paths, aggregated environment, tuning flags, sa-bench benchmarks, and extended health_check timeouts for multi-node weight loading.
Backend: sglang launch logic
src/srtctl/backends/sglang.py
Treat direct and none as sglang.launch_server-equivalent; include them in profiling/selection checks; skip dump-config-to when frontend_type is in (sglang, direct, none); update docs.
Health checks: direct-to-worker
src/srtctl/core/health.py
Add branch for frontend_type in (direct, none) in wait_for_model that polls worker /health and /v1/models and requires a non-empty model list; includes timeout, stop_event handling, logging, and exceptions.
Runtime: ports, image, context fields
src/srtctl/core/runtime.py
Add public fields effective_frontend_type, frontend_port, sys_port_base; allow `container_image: str
CLI: sweep/startup/frontend/worker/benchmark
src/srtctl/cli/do_sweep.py, src/srtctl/cli/mixins/frontend_stage.py, src/srtctl/cli/mixins/worker_stage.py, src/srtctl/cli/mixins/benchmark_stage.py
Use runtime.effective_frontend_type and runtime.frontend_port instead of static config; skip frontend/nginx when effective type is direct/none; make head infra startup conditional for direct mode; update dynamic URLs/ports and sys_port_base usage.
Head setup: NATS binding and host IP
src/srtctl/cli/setup_head.py
start_nats now accepts host_ip and binds NATS to that host_ip and explicit port; wait-for-service checks and logs query host_ip instead of localhost.
IP utils: prefer private addresses
src/srtctl/core/ip_utils/__init__.py
Enhance get_node_ip to parse IPv4 candidates and prefer RFC1918 private addresses (10., 172.16–31., 192.168.); log chosen IP and fall back to cleaned raw output when no candidates.
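The private-address preference can be sketched standalone (function name and parsing details here are illustrative, not the actual `get_node_ip` implementation):

```python
import ipaddress
import re

def pick_node_ip(raw_output: str) -> str:
    """Pick a node IP from space-separated address output, preferring
    RFC1918 private IPv4 addresses (10.x, 172.16-31.x, 192.168.x)."""
    candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", raw_output)
    valid = []
    for c in candidates:
        try:
            valid.append(ipaddress.IPv4Address(c))
        except ipaddress.AddressValueError:
            continue  # e.g. "300.1.1.1" matched the regex but is not an IP
    for ip in valid:
        if ip.is_private and not ip.is_loopback:
            return str(ip)
    if valid:
        return str(valid[0])
    # Fall back to the cleaned raw output when nothing parses.
    return raw_output.strip()
```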

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant DoSweep
    participant SetupHead
    participant Frontend
    participant Backend
    participant Health
    participant Worker

    User->>DoSweep: submit job (effective_frontend_type = "direct"/"none")
    alt effective_frontend_type == "direct" or "none"
        DoSweep->>SetupHead: skip head infra startup
        Note over SetupHead: NATS/etcd not started on head
    else
        DoSweep->>SetupHead: start head infra (NATS/etcd) bound to host_ip
    end

    DoSweep->>Frontend: start_frontend()
    alt effective_frontend_type == "direct" or "none"
        Frontend-->>DoSweep: return early (frontend disabled)
    else
        Frontend->>Frontend: configure router/nginx on frontend_port
    end

    DoSweep->>Backend: launch worker(s) (sglang.launch_server if sglang/direct/none)
    Backend->>Worker: start SGLang worker process

    DoSweep->>Health: wait_for_model()
    alt effective_frontend_type == "direct" or "none"
        Health->>Worker: poll /health
        Worker-->>Health: 200
        Health->>Worker: poll /v1/models
        Worker-->>Health: non-empty → Ready
    else
        Health->>Frontend: check via frontend/router
        Frontend-->>Health: Ready
    end
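The direct-mode readiness branch in the diagram reduces to two worker probes; a sketch with an injected `http_get` (a hypothetical helper standing in for the real HTTP client, so the logic is testable):

```python
import json

def worker_is_ready(http_get, base_url: str) -> bool:
    """Direct-to-worker readiness: /health must return 200 and
    /v1/models must return a non-empty model list."""
    status, _ = http_get(f"{base_url}/health")
    if status != 200:
        return False
    status, body = http_get(f"{base_url}/v1/models")
    if status != 200:
        return False
    models = json.loads(body).get("data", [])
    return len(models) > 0
```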

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • ishandhanani
  • trevor-m

Poem

🐰 I hopped in with configs bright and new,
Worker-only roads and ports that stew.
I polled the health and found the model true,
Picked private IPs and seeded ports too.
A carrot deploy — hop, skip, hooray! 🥕🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title is vague and uses non-descriptive abbreviations (dsr1, agg, workeronly) without clear context about what changed or why it matters to the codebase. | Provide a more descriptive title such as 'Support direct-to-worker SGLang aggregation with automatic frontend routing' or 'Add DeepSeek-R1 FP8 SGLang H100 recipes with direct-to-worker mode'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@yeswanthk-26 yeswanthk-26 force-pushed the yeswanthk/dsr1-workeronly-agg branch from 0d0462c to 76a89ed on January 26, 2026 at 17:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/srtctl/cli/setup_head.py`:
- Around line 134-146: The current NATS startup command binds to 0.0.0.0 (cmd
list built with "-a", "0.0.0.0"), which is overly broad; change the bind address
to the node's private interface (use the existing host_ip variable) so NATS only
listens on the cluster interface. Update the cmd construction in setup_head.py
to replace the literal "0.0.0.0" with host_ip (and ensure host_ip is computed
earlier), and confirm any port checks like wait_for_port(head_node, 4222) use
the same host_ip/hostname to match the bind address.
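The suggested fix amounts to substituting the node IP into the command list; a sketch (the `-a`/`-p` flags are standard nats-server options; the helper function itself is illustrative, not the actual setup_head.py code):

```python
def build_nats_cmd(host_ip: str, port: int = 4222) -> list[str]:
    """Bind NATS to the node's private interface instead of 0.0.0.0,
    so it only listens on the cluster network."""
    return ["nats-server", "-a", host_ip, "-p", str(port)]
```

Any readiness probe (e.g. a `wait_for_port`-style check) should then target the same `host_ip`, since the server no longer answers on localhost or other interfaces.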
🧹 Nitpick comments (2)
src/srtctl/cli/do_sweep.py (1)

82-91: Consider more robust signature detection.

Catching TypeError for backward compatibility is fragile—it could mask unrelated TypeError exceptions from within endpoints_to_processes. Consider using inspect.signature to check if base_sys_port is supported, or document this as a known limitation.

♻️ Suggested alternative using inspect
+import inspect
+
 `@functools.cached_property`
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
     # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
     # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    else:
         return self.backend.endpoints_to_processes(self.endpoints)
src/srtctl/core/runtime.py (1)

129-146: Port calculation logic is correct, but exception handling is overly broad.

The sys_port_base calculation correctly constrains ports to 10000–29900 (staying under signed 16-bit max). However, using bare except Exception when only ValueError is expected from int() could mask unexpected errors.

♻️ Narrow the exception type
         try:
             job_seed = int(job_id)
-        except Exception:
+        except ValueError:
             job_seed = abs(hash(job_id))
         sys_port_base = 10000 + (job_seed % 200) * 100
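Extracted as a standalone helper (mirroring the diff above, with the narrowed ValueError catch):

```python
def compute_sys_port_base(job_id: str) -> int:
    """Derive a job-specific port base in [10000, 29900], stepping by 100,
    to avoid fixed DYN_SYSTEM_PORT collisions across jobs on shared nodes.

    The ceiling of 29900 + a 100-port range stays under the signed
    16-bit maximum (32767) that Dynamo requires.
    """
    try:
        job_seed = int(job_id)
    except ValueError:  # non-numeric job IDs fall back to hashing
        job_seed = abs(hash(job_id))
    return 10000 + (job_seed % 200) * 100
```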

Comment thread src/srtctl/cli/setup_head.py
Owner

@ishandhanani ishandhanani left a comment


Not sure if we want frontend type none.

When we run Dynamo, we need both the frontend and the worker on for single-node agg. When running stock SGLang, we just want the worker. This should be handled under the hood.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 67-71: The comment claiming the health timeout "matches 1:30:00
time limit" is incorrect because slurm.time_limit is 02:00:00; update the
comment and/or the values under health_check (max_attempts and/or
interval_seconds) so they correctly reflect the intended timeout relationship
with slurm.time_limit. Specifically, edit the comment above health_check and
adjust health_check.max_attempts (or interval_seconds) so max_attempts *
interval_seconds equals the desired timeout (e.g., 7200s for 2:00:00) and
reference the health_check, max_attempts, interval_seconds, and slurm.time_limit
symbols when making the change.

In `@src/srtctl/core/runtime.py`:
- Around line 137-175: The code assumes only "direct" needs special handling:
update port selection and startup to treat "none" like "direct". In runtime.py
change the frontend_port logic to use 30000 when effective_frontend_type is
"direct" or "none" (i.e., frontend_port = 30000 if effective_frontend_type in
("direct", "none") else 8000). In FrontendStageMixin.start_frontend() update the
skip condition to return/skip startup when effective_frontend_type is "direct"
or "none" so it does not attempt to instantiate a non-existent frontend
implementation; ensure references to config.frontend.type or
effective_frontend_type are used consistently to detect "none".

Comment thread src/srtctl/core/runtime.py
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 64-68: This file sets a long health_check timeout (max_attempts:
540, interval_seconds: 10) but lacks an explicit Slurm time limit; add a
slurm.time_limit setting (e.g., time_limit: "02:00:00") to the YAML so the job
walltime exceeds the 90-minute health check window—place the time_limit under
the same slurm configuration block used in other variants (match the 1K_8K
pattern) to ensure the job isn't killed before health checks complete.
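Under this fix, the recipe would pair the health-check window with an explicit walltime; a sketch of the intended YAML shape (key names taken from the review's description — exact placement within the recipe is assumed):

```yaml
slurm:
  time_limit: "02:00:00"   # walltime must exceed the health-check window

health_check:
  max_attempts: 540        # 540 * 10 s = 5400 s = 1:30:00 < 02:00:00
  interval_seconds: 10
```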

In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 69-71: The comment for health_check (max_attempts: 540,
interval_seconds: 10) incorrectly says "matches 1:30:00 time limit" while
slurm.time_limit is "02:00:00"; update the comment to accurately reflect that
540 attempts * 10s = 5400s (1:30:00) is a subset of the job time (2:00:00) or
change wording to state it equals 1:30:00 and is less than the slurm.time_limit;
adjust the inline comment next to health_check, max_attempts, or
interval_seconds accordingly to reference the correct durations and relationship
to slurm.time_limit.

In `@src/srtctl/core/runtime.py`:
- Around line 227-236: The cargo_env file is being created under configs_dir
(variable cargo_env) which is inside the source tree; change it to live in the
ephemeral log_dir instead: build the path as log_dir / "cargo_env" (instead of
configs_dir / "cargo_env"), ensure cargo_env.parent.mkdir(parents=True,
exist_ok=True) and cargo_env.touch(exist_ok=True) still run, and update the
container_mounts entry to mount cargo_env.resolve() to Path("/root/.cargo/env");
keep the try/except best-effort behavior and do not create or modify any
gitignore changes.
🧹 Nitpick comments (3)
src/srtctl/core/runtime.py (1)

162-173: Port collision handling is reasonable, but collision probability should be understood.

With 200 possible base values (job_seed % 200), there's a ~5% chance of collision at ~5 concurrent jobs sharing the same nodes, rising to roughly one in three by ~14 (birthday paradox). This is acceptable given:

  1. Ports are per-node, so only jobs sharing nodes can collide
  2. Each job gets a 100-port range for internal allocation headroom
  3. The constraint of staying under 32767 for Dynamo's i16 requirement limits options

Consider documenting this trade-off in a comment if cluster density is expected to be high.
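The birthday-style collision odds over the 200 base slots are easy to check numerically (a quick sketch, not part of the PR; assumes uniform, independent seeds):

```python
def collision_probability(n_jobs: int, n_slots: int = 200) -> float:
    """Probability that at least two of n_jobs land on the same
    sys_port_base slot (classic birthday-problem computation)."""
    p_no_collision = 1.0
    for i in range(n_jobs):
        p_no_collision *= (n_slots - i) / n_slots
    return 1.0 - p_no_collision

for n in (5, 14, 25):
    print(n, round(collision_probability(n), 3))
```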

src/srtctl/core/health.py (1)

339-480: Consider extracting common polling infrastructure, but current structure is acceptable.

The wait_for_model function has some structural duplication between the direct/none branch and sglang/dynamo branches (timeout handling, progress reporting, exception handling). However, the core logic differs enough that the current explicit structure is readable and maintainable.

src/srtctl/cli/do_sweep.py (1)

82-91: Consider catching a more specific exception or using introspection.

The TypeError catch is a pragmatic backward-compat approach, but it may inadvertently swallow unrelated TypeErrors from bugs inside endpoints_to_processes. Consider using inspect.signature to check if base_sys_port is accepted, or at minimum, log when the fallback is triggered so issues are visible.

♻️ Optional: Use signature introspection instead of try/except
+import inspect
+
 `@functools.cached_property`
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
-    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
-    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
-        return self.backend.endpoints_to_processes(self.endpoints)
+    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
+    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    return self.backend.endpoints_to_processes(self.endpoints)

Comment thread src/srtctl/core/runtime.py
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 66-68: The health_check block contains an invalid line ("\g<1>540
...") causing YAML parse failures; update the health_check mapping (the block
containing interval_seconds and the malformed line) by removing the invalid
token and adding a proper key/value pair max_attempts: 540 so the block reads a
valid mapping with max_attempts and interval_seconds.

In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml`:
- Line 70: Trailing blank lines exist at the end of the YAML document; open the
YAML file and remove extra blank lines so the file ends with a single newline
only (no multiple empty lines at EOF), ensuring the final character is a single
newline to satisfy the linter for dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml.
- Around line 64-68: The health_check block contains a corrupted token
"\g<1>540" which breaks YAML; instead remove that token and add a proper
key/value such as timeout_seconds: 540 under the health_check mapping (preserve
indentation and the existing interval_seconds: 10), so the block reads with
valid keys health_check -> timeout_seconds and interval_seconds.

@yeswanthk-26 yeswanthk-26 force-pushed the yeswanthk/dsr1-workeronly-agg branch from 31b7537 to 74947de on February 6, 2026 at 14:29
