
Yeswanthk/dsr1 workeronly agg #99

Open
yeswanthk-26 wants to merge 6 commits into ishandhanani:main from yeswanthk-26:yeswanthk/dsr1-workeronly-agg

Conversation

@yeswanthk-26

@yeswanthk-26 yeswanthk-26 commented Jan 26, 2026

Updated PR description

1. Support I Added

What

  • Make stock SGLang behavior automatic (no user-facing frontend.type: none setting required):
    • Aggregated + agg_workers=1: launch only sglang.launch_server and run health + SA-bench directly against the worker OAI endpoint (no router).
    • Aggregated + agg_workers>1: use the SGLang router.
    • Disaggregated: use the SGLang router.
  • Update orchestrator stages to skip frontend bring-up when in direct-to-worker mode and route health/bench accordingly.
  • Add DeepSeek-R1 FP8 SGLang H100 aggregated recipes (TP=8, PP=2):
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml (includes slurm.time_limit: 02:00:00)
    • recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml
  • Add Idle-GPU reaper exemption via sbatch_directives.comment (JSON) for long model-load/warmup.
  • Harden head infra: bind NATS to the node’s private IP instead of 0.0.0.0.

Why
Stock SGLang aggregated with a single worker already exposes an OAI-compatible server; launching an extra router/frontend prevents hitting the worker directly. This change makes the correct behavior the default and keeps router semantics for multi-worker and disaggregated modes.

2. AGG DSR-1 SGLANG FP8 H100

1K/1K (ISL=1024, OSL=1024)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x4x8x16x32x64x128x256x512
  • SLURM time limit: 02:00:00

8K/1K (ISL=8192, OSL=1024)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x64
  • SLURM time limit: inherits repo defaults (not set in recipe)

1K/8K (ISL=1024, OSL=8192)

| Config | Mode | Total Nodes | Total GPUs | TP | PP |
| --- | --- | --- | --- | --- | --- |
| h100-dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K | Aggregated SGLang (worker-only / direct-to-worker) | 2 | 16 | 8 | 2 |
  • Concurrency sweep: 1x2x4x8x16x32x64x128x176x256
  • SLURM time limit: 03:00:00

Key settings (common across these 3 recipes)

  • Container: docker://lmsysorg/sglang:v0.5.8-cu130-runtime
  • Model: DeepSeek-R1 FP8 (served-model-name: deepseek-ai/DeepSeek-R1-0528, precision: fp8, trust-remote-code: true)
  • Cluster shape: 2 nodes × 8 H100/node = 16 GPUs, agg_workers: 1 (single aggregated worker group)
  • Parallelism: tensor-parallel-size: 8, pipeline-parallel-size: 2, data-parallel-size: 1
  • Attention backend: flashinfer
  • Idle GPU reaper exemption: sbatch_directives.comment includes OccupiedIdleGPUsJobReaper exemption (60 min) for long load/warmup
  • Readiness wait: health_check: interval_seconds=10, max_attempts=720 (wait up to ~2 hours for the worker to come up)
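The readiness wait above is just a bounded polling loop; a minimal sketch (the `check_ready` callable is a stand-in for the real worker probe, not srtctl's actual API):

```python
import time

def wait_for_ready(check_ready, interval_seconds=10, max_attempts=720):
    """Poll check_ready() until it returns True or the attempt budget runs out.

    With the recipe defaults (10 s x 720 attempts) the total wait is
    7200 s, i.e. ~2 hours, matching the 02:00:00 SLURM walltime.
    """
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return True
        if attempt < max_attempts:
            time.sleep(interval_seconds)
    return False
```

The effective timeout is simply `max_attempts * interval_seconds`, which is why the recipes scale `max_attempts` rather than the interval when multi-node weight loading is slow.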

Summary by CodeRabbit

  • New Features

    • Direct-to-worker deployment mode to run workers without an intermediate frontend.
    • Dynamic, job-specific port base and runtime-selected frontend port to reduce collisions.
  • Improvements

    • Longer health-check timeouts to accommodate multi-node weight loading and warmup.
    • Improved node IP detection preferring private network addresses.
    • Smarter startup flow that conditionally skips frontend/head services when appropriate.
  • Documentation & Configuration

    • Added new H100 FP8 deployment templates for the SGLang backend.

@coderabbitai
Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Walkthrough

Adds worker-only (direct/none) SGLang support and three H100 FP8 worker-only recipe YAMLs; introduces runtime fields (effective_frontend_type, frontend_port, sys_port_base), deterministic sys-port seeding, direct health-check path, sglang launch adjustments, NATS bind-to-host changes, and improved IP selection logic. (50 words)

Changes

Cohort / File(s) Summary
Worker-only recipe files
recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml, recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml, recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml
Add three H100 FP8 SGLang worker-only aggregation configs (TP=8, DP=1, PP=2) with model/tokenizer paths, aggregated environment, tuning flags, sa-bench benchmarks, and extended health_check timeouts for multi-node weight loading.
Backend: sglang launch logic
src/srtctl/backends/sglang.py
Treat direct and none as sglang.launch_server-equivalent; include them in profiling/selection checks; skip dump-config-to when frontend_type is in (sglang, direct, none); update docs.
Health checks: direct-to-worker
src/srtctl/core/health.py
Add branch for frontend_type in (direct, none) in wait_for_model that polls worker /health and /v1/models and requires a non-empty model list; includes timeout, stop_event handling, logging, and exceptions.
Runtime: ports, image, context fields
src/srtctl/core/runtime.py
Add public fields effective_frontend_type, frontend_port, sys_port_base; allow `container_image: str
CLI: sweep/startup/frontend/worker/benchmark
src/srtctl/cli/do_sweep.py, src/srtctl/cli/mixins/frontend_stage.py, src/srtctl/cli/mixins/worker_stage.py, src/srtctl/cli/mixins/benchmark_stage.py
Use runtime.effective_frontend_type and runtime.frontend_port instead of static config; skip frontend/nginx when effective type is direct/none; make head infra startup conditional for direct mode; update dynamic URLs/ports and sys_port_base usage.
Head setup: NATS binding and host IP
src/srtctl/cli/setup_head.py
start_nats now accepts host_ip and binds NATS to that host_ip and explicit port; wait-for-service checks and logs query host_ip instead of localhost.
IP utils: prefer private addresses
src/srtctl/core/ip_utils/__init__.py
Enhance get_node_ip to parse IPv4 candidates and prefer RFC1918 private addresses (10., 172.16–31., 192.168.); log chosen IP and fall back to cleaned raw output when no candidates.
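The private-address preference can be sketched standalone (function name and parsing details here are illustrative, not the actual `get_node_ip` implementation):

```python
import ipaddress
import re

def pick_node_ip(raw_output: str) -> str:
    """Pick a node IP from space-separated address output, preferring
    RFC1918 private IPv4 addresses (10.x, 172.16-31.x, 192.168.x)."""
    candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", raw_output)
    valid = []
    for c in candidates:
        try:
            valid.append(ipaddress.IPv4Address(c))
        except ipaddress.AddressValueError:
            continue  # e.g. "300.1.1.1" matched the regex but is not an IP
    for ip in valid:
        if ip.is_private and not ip.is_loopback:
            return str(ip)
    if valid:
        return str(valid[0])
    # Fall back to the cleaned raw output when nothing parses.
    return raw_output.strip()
```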

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant DoSweep
    participant SetupHead
    participant Frontend
    participant Backend
    participant Health
    participant Worker

    User->>DoSweep: submit job (effective_frontend_type = "direct"/"none")
    alt effective_frontend_type == "direct" or "none"
        DoSweep->>SetupHead: skip head infra startup
        Note over SetupHead: NATS/etcd not started on head
    else
        DoSweep->>SetupHead: start head infra (NATS/etcd) bound to host_ip
    end

    DoSweep->>Frontend: start_frontend()
    alt effective_frontend_type == "direct" or "none"
        Frontend-->>DoSweep: return early (frontend disabled)
    else
        Frontend->>Frontend: configure router/nginx on frontend_port
    end

    DoSweep->>Backend: launch worker(s) (sglang.launch_server if sglang/direct/none)
    Backend->>Worker: start SGLang worker process

    DoSweep->>Health: wait_for_model()
    alt effective_frontend_type == "direct" or "none"
        Health->>Worker: poll /health
        Worker-->>Health: 200
        Health->>Worker: poll /v1/models
        Worker-->>Health: non-empty → Ready
    else
        Health->>Frontend: check via frontend/router
        Frontend-->>Health: Ready
    end
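The direct-mode readiness branch in the diagram reduces to two worker probes; a sketch with an injected `http_get` (a hypothetical helper standing in for the real HTTP client, so the logic is testable):

```python
import json

def worker_is_ready(http_get, base_url: str) -> bool:
    """Direct-to-worker readiness: /health must return 200 and
    /v1/models must return a non-empty model list."""
    status, _ = http_get(f"{base_url}/health")
    if status != 200:
        return False
    status, body = http_get(f"{base_url}/v1/models")
    if status != 200:
        return False
    models = json.loads(body).get("data", [])
    return len(models) > 0
```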

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • ishandhanani
  • trevor-m

Poem

🐰 I hopped in with configs bright and new,
Worker-only roads and ports that stew.
I polled the health and found the model true,
Picked private IPs and seeded ports too.
A carrot deploy — hop, skip, hooray! 🥕🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title is vague and uses non-descriptive abbreviations (dsr1, agg, workeronly) without clear context about what changed or why it matters to the codebase. | Provide a more descriptive title such as 'Support direct-to-worker SGLang aggregation with automatic frontend routing' or 'Add DeepSeek-R1 FP8 SGLang H100 recipes with direct-to-worker mode'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@yeswanthk-26 yeswanthk-26 force-pushed the yeswanthk/dsr1-workeronly-agg branch from 0d0462c to 76a89ed on January 26, 2026 at 17:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/srtctl/cli/setup_head.py`:
- Around line 134-146: The current NATS startup command binds to 0.0.0.0 (cmd
list built with "-a", "0.0.0.0"), which is overly broad; change the bind address
to the node's private interface (use the existing host_ip variable) so NATS only
listens on the cluster interface. Update the cmd construction in setup_head.py
to replace the literal "0.0.0.0" with host_ip (and ensure host_ip is computed
earlier), and confirm any port checks like wait_for_port(head_node, 4222) use
the same host_ip/hostname to match the bind address.
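The suggested fix amounts to substituting the node IP into the command list; a sketch (the `-a`/`-p` flags are standard nats-server options; the helper function itself is illustrative, not the actual setup_head.py code):

```python
def build_nats_cmd(host_ip: str, port: int = 4222) -> list[str]:
    """Bind NATS to the node's private interface instead of 0.0.0.0,
    so it only listens on the cluster network."""
    return ["nats-server", "-a", host_ip, "-p", str(port)]
```

Any readiness probe (e.g. a `wait_for_port`-style check) should then target the same `host_ip`, since the server no longer answers on localhost or other interfaces.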
🧹 Nitpick comments (2)
src/srtctl/cli/do_sweep.py (1)

82-91: Consider more robust signature detection.

Catching TypeError for backward compatibility is fragile—it could mask unrelated TypeError exceptions from within endpoints_to_processes. Consider using inspect.signature to check if base_sys_port is supported, or document this as a known limitation.

♻️ Suggested alternative using inspect
+import inspect
+
 `@functools.cached_property`
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
     # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
     # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    else:
         return self.backend.endpoints_to_processes(self.endpoints)
src/srtctl/core/runtime.py (1)

129-146: Port calculation logic is correct, but exception handling is overly broad.

The sys_port_base calculation correctly constrains ports to 10000–29900 (staying under signed 16-bit max). However, using bare except Exception when only ValueError is expected from int() could mask unexpected errors.

♻️ Narrow the exception type
         try:
             job_seed = int(job_id)
-        except Exception:
+        except ValueError:
             job_seed = abs(hash(job_id))
         sys_port_base = 10000 + (job_seed % 200) * 100
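Extracted as a standalone helper (mirroring the diff above, with the narrowed ValueError catch):

```python
def compute_sys_port_base(job_id: str) -> int:
    """Derive a job-specific port base in [10000, 29900], stepping by 100,
    to avoid fixed DYN_SYSTEM_PORT collisions across jobs on shared nodes.

    The ceiling of 29900 + a 100-port range stays under the signed
    16-bit maximum (32767) that Dynamo requires.
    """
    try:
        job_seed = int(job_id)
    except ValueError:  # non-numeric job IDs fall back to hashing
        job_seed = abs(hash(job_id))
    return 10000 + (job_seed % 200) * 100
```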

Comment thread src/srtctl/cli/setup_head.py
Owner

@ishandhanani ishandhanani left a comment


Not sure if we want frontend type none.

When we run Dynamo, we need both the frontend and the worker on for single-node agg. When running stock SGLang, we just want the worker. This should be handled under the hood.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 67-71: The comment claiming the health timeout "matches 1:30:00
time limit" is incorrect because slurm.time_limit is 02:00:00; update the
comment and/or the values under health_check (max_attempts and/or
interval_seconds) so they correctly reflect the intended timeout relationship
with slurm.time_limit. Specifically, edit the comment above health_check and
adjust health_check.max_attempts (or interval_seconds) so max_attempts *
interval_seconds equals the desired timeout (e.g., 7200s for 2:00:00) and
reference the health_check, max_attempts, interval_seconds, and slurm.time_limit
symbols when making the change.

In `@src/srtctl/core/runtime.py`:
- Around line 137-175: The code assumes only "direct" needs special handling:
update port selection and startup to treat "none" like "direct". In runtime.py
change the frontend_port logic to use 30000 when effective_frontend_type is
"direct" or "none" (i.e., frontend_port = 30000 if effective_frontend_type in
("direct", "none") else 8000). In FrontendStageMixin.start_frontend() update the
skip condition to return/skip startup when effective_frontend_type is "direct"
or "none" so it does not attempt to instantiate a non-existent frontend
implementation; ensure references to config.frontend.type or
effective_frontend_type are used consistently to detect "none".

Comment thread src/srtctl/core/runtime.py
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 64-68: This file sets a long health_check timeout (max_attempts:
540, interval_seconds: 10) but lacks an explicit Slurm time limit; add a
slurm.time_limit setting (e.g., time_limit: "02:00:00") to the YAML so the job
walltime exceeds the 90-minute health check window—place the time_limit under
the same slurm configuration block used in other variants (match the 1K_8K
pattern) to ensure the job isn't killed before health checks complete.
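Under this fix, the recipe would pair the health-check window with an explicit walltime; a sketch of the intended YAML shape (key names taken from the review's description — exact placement within the recipe is assumed):

```yaml
slurm:
  time_limit: "02:00:00"   # walltime must exceed the health-check window

health_check:
  max_attempts: 540        # 540 * 10 s = 5400 s = 1:30:00 < 02:00:00
  interval_seconds: 10
```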

In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_8K.yaml`:
- Around line 69-71: The comment for health_check (max_attempts: 540,
interval_seconds: 10) incorrectly says "matches 1:30:00 time limit" while
slurm.time_limit is "02:00:00"; update the comment to accurately reflect that
540 attempts * 10s = 5400s (1:30:00) is a subset of the job time (2:00:00) or
change wording to state it equals 1:30:00 and is less than the slurm.time_limit;
adjust the inline comment next to health_check, max_attempts, or
interval_seconds accordingly to reference the correct durations and relationship
to slurm.time_limit.

In `@src/srtctl/core/runtime.py`:
- Around line 227-236: The cargo_env file is being created under configs_dir
(variable cargo_env) which is inside the source tree; change it to live in the
ephemeral log_dir instead: build the path as log_dir / "cargo_env" (instead of
configs_dir / "cargo_env"), ensure cargo_env.parent.mkdir(parents=True,
exist_ok=True) and cargo_env.touch(exist_ok=True) still run, and update the
container_mounts entry to mount cargo_env.resolve() to Path("/root/.cargo/env");
keep the try/except best-effort behavior and do not create or modify any
gitignore changes.
🧹 Nitpick comments (3)
src/srtctl/core/runtime.py (1)

162-173: Port collision handling is reasonable, but collision probability should be understood.

With 200 possible base values (job_seed % 200), there's a ~5% chance of collision at ~5 concurrent jobs sharing the same nodes, rising to roughly one in three by ~14 (birthday paradox). This is acceptable given:

  1. Ports are per-node, so only jobs sharing nodes can collide
  2. Each job gets a 100-port range for internal allocation headroom
  3. The constraint of staying under 32767 for Dynamo's i16 requirement limits options

Consider documenting this trade-off in a comment if cluster density is expected to be high.
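The birthday-style collision odds over the 200 base slots are easy to check numerically (a quick sketch, not part of the PR; assumes uniform, independent seeds):

```python
def collision_probability(n_jobs: int, n_slots: int = 200) -> float:
    """Probability that at least two of n_jobs land on the same
    sys_port_base slot (classic birthday-problem computation)."""
    p_no_collision = 1.0
    for i in range(n_jobs):
        p_no_collision *= (n_slots - i) / n_slots
    return 1.0 - p_no_collision

for n in (5, 14, 25):
    print(n, round(collision_probability(n), 3))
```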

src/srtctl/core/health.py (1)

339-480: Consider extracting common polling infrastructure, but current structure is acceptable.

The wait_for_model function has some structural duplication between the direct/none branch and sglang/dynamo branches (timeout handling, progress reporting, exception handling). However, the core logic differs enough that the current explicit structure is readable and maintainable.

src/srtctl/cli/do_sweep.py (1)

82-91: Consider catching a more specific exception or using introspection.

The TypeError catch is a pragmatic backward-compat approach, but it may inadvertently swallow unrelated TypeErrors from bugs inside endpoints_to_processes. Consider using inspect.signature to check if base_sys_port is accepted, or at minimum, log when the fallback is triggered so issues are visible.

♻️ Optional: Use signature introspection instead of try/except
+import inspect
+
 `@functools.cached_property`
 def backend_processes(self) -> list[Process]:
     """Compute physical process topology from endpoints (cached)."""
-    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
-    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
-    try:
-        return self.backend.endpoints_to_processes(
-            self.endpoints,
-            base_sys_port=self.runtime.sys_port_base,
-        )
-    except TypeError:
-        # Backends that don't accept base_sys_port keep their default behavior.
-        return self.backend.endpoints_to_processes(self.endpoints)
+    # NOTE: On shared clusters, fixed DYN_SYSTEM_PORT ranges can collide across jobs
+    # and crash dynamo.sglang with "Address already in use". Use a job-specific base.
+    sig = inspect.signature(self.backend.endpoints_to_processes)
+    if "base_sys_port" in sig.parameters:
+        return self.backend.endpoints_to_processes(
+            self.endpoints,
+            base_sys_port=self.runtime.sys_port_base,
+        )
+    return self.backend.endpoints_to_processes(self.endpoints)

Comment thread src/srtctl/core/runtime.py
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_1K_1K.yaml`:
- Around line 66-68: The health_check block contains an invalid line ("\g<1>540
...") causing YAML parse failures; update the health_check mapping (the block
containing interval_seconds and the malformed line) by removing the invalid
token and adding a proper key/value pair max_attempts: 540 so the block reads a
valid mapping with max_attempts and interval_seconds.

In `@recipies/h100/aggdsr1sglangfp8/dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml`:
- Line 70: Trailing blank lines exist at the end of the YAML document; open the
YAML file and remove extra blank lines so the file ends with a single newline
only (no multiple empty lines at EOF), ensuring the final character is a single
newline to satisfy the linter for dsr1-fp8-agg-workeronly-tp8-pp2_8K_1K.yaml.
- Around line 64-68: The health_check block contains a corrupted token
"\g<1>540" which breaks YAML; instead remove that token and add a proper
key/value such as timeout_seconds: 540 under the health_check mapping (preserve
indentation and the existing interval_seconds: 10), so the block reads with
valid keys health_check -> timeout_seconds and interval_seconds.

@yeswanthk-26 yeswanthk-26 force-pushed the yeswanthk/dsr1-workeronly-agg branch from 31b7537 to 74947de on February 6, 2026 at 14:29
