
promethus + telemetry #72

Closed
ishandhanani wants to merge 10 commits into ishan/woah from ishan/e2e-telem

Conversation

@ishandhanani
Owner

@ishandhanani ishandhanani commented Dec 31, 2025

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics collection framework with configuration validation and scraping support.
    • Introduced debug recipe configuration for Qwen3-32B model.
    • Added 60-second stabilization delay before benchmark execution for improved result accuracy.
    • Implemented dynamic model parameter support in benchmarking workflow.
  • Chores

    • Enabled ARM64 platform-specific SDK optimization for improved build performance.


@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

Walkthrough

This pull request introduces Prometheus metrics collection support into the SRT orchestration framework. It adds metrics configuration to the SGLang backend, creates a new MetricsStageMixin for managing Prometheus setup, updates the frontend_stage return signature to provide topology information, integrates metrics startup into the sweep orchestrator, adds dynamic model naming to benchmarks, and includes comprehensive test coverage for the new metrics functionality.

Changes

Cohort / File(s): Summary

  • Infrastructure & Configuration (Makefile, pyproject.toml): Introduces Prometheus 3.8.1 download/installation in the Makefile setup target; adds a ty rule to suppress invalid-return-type errors for protocol stubs.
  • Metrics Backend Support (src/srtctl/backends/base.py, src/srtctl/backends/sglang.py): Adds a metrics_enabled: bool parameter to the build_worker_command protocol method; introduces a MetricsConfig dataclass and extends SGLangProtocol with a metrics field; integrates the --enable-metrics CLI flag.
  • CLI Orchestration & Stage Mixins (src/srtctl/cli/mixins/__init__.py, src/srtctl/cli/mixins/frontend_stage.py, src/srtctl/cli/mixins/worker_stage.py, src/srtctl/cli/do_sweep.py): Exports the new MetricsStageMixin; updates the start_frontend return type to tuple[list[ManagedProcess], FrontendTopology] to provide topology; passes metrics configuration to backend command construction; integrates metrics startup into SweepOrchestrator.
  • Metrics Collection Management (src/srtctl/cli/mixins/metrics_stage.py): New mixin providing node selection, Prometheus YAML config generation from backend/frontend scrape targets, and orchestrated Prometheus startup on a non-head node.
  • Core Configuration & Validation (src/srtctl/core/schema.py): Adds a _validate_metrics method enforcing the dynamo frontend requirement when SGLang metrics are enabled; injects an ARM64 RUSTFLAGS optimization for source builds.
  • Benchmark Enhancements (src/srtctl/benchmarks/sa_bench.py, src/srtctl/benchmarks/scripts/sa-bench/bench.sh): Introduces a dynamic MODEL_NAME parameter replacing hardcoded model references; adds a 60-second stabilization delay; updates the endpoint verification payload.
  • Recipe Configuration (recipes/qwen3-32b/debug.yaml): New debug configuration with the qwen3-32b model, GPU/node resource settings, dynamo frontend, sglang backend with metrics, and extended context via rope scaling.
  • Test Coverage (tests/test_configs.py, tests/test_frontend_topology.py, tests/test_e2e.py, tests/test_metrics_stage.py): Adds metrics validation tests; updates topology unpacking in frontend tests; excludes the debug recipe from E2E tests; introduces a comprehensive 347-line metrics stage test suite validating node selection, Prometheus config generation, and metrics collection startup.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant SweepOrch as SweepOrchestrator
    participant FrontendMixin as FrontendStageMixin
    participant WorkerMixin as WorkerStageMixin
    participant MetricsMixin as MetricsStageMixin
    participant Prometheus as Prometheus
    participant Backend as Backend Worker
    
    User->>SweepOrch: start_sweep()
    
    SweepOrch->>FrontendMixin: start_frontend(registry)
    FrontendMixin-->>SweepOrch: (processes, topology)
    
    SweepOrch->>WorkerMixin: start_workers()
    WorkerMixin->>Backend: build_worker_command(metrics_enabled=true)
    Backend-->>WorkerMixin: command + --enable-metrics
    WorkerMixin-->>SweepOrch: worker_processes
    
    SweepOrch->>MetricsMixin: start_metrics_collection(topology)
    MetricsMixin->>MetricsMixin: select_metrics_node()
    MetricsMixin->>MetricsMixin: generate_prometheus_config(topology)
    MetricsMixin->>Prometheus: start_srun_process(prometheus)
    Prometheus-->>MetricsMixin: prometheus_process
    MetricsMixin-->>SweepOrch: metrics_processes
    
    SweepOrch->>SweepOrch: run_benchmark()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A metrics tale unfolds,
Prometheus now holds,
The secrets of performance bright,
With dynamo's dynamite,
Workers shine their telemetry light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check ❓ Inconclusive: The title 'promethus + telemetry' is vague and contains a typo ('promethus' instead of 'Prometheus'), making it unclear without context. Resolution: correct the spelling to 'Prometheus' and make the title more specific. Consider: 'Add Prometheus metrics collection and telemetry infrastructure' to better communicate the scope and intent of the changes.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 97.62%, which is sufficient. The required threshold is 80.00%.



@ishandhanani ishandhanani changed the title from "go" to "promethus + telemetry" on Dec 31, 2025
@ishandhanani
Owner Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @recipies/qwen3-32b/debug.yaml:
- Around line 49-55: The comment above the benchmark block is misleading — it
says "Mooncake benchmark using aiperf" but the benchmark is configured as type:
"sa-bench"; update the comment to accurately describe the configured benchmark
(refer to the benchmark block and the type: "sa-bench" field) so the description
matches the actual configuration.

In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh:
- Around line 62-65: Fix the typo in the echo message ("viisble" → "visible")
and make the fixed 60-second sleep configurable by introducing a variable (e.g.,
SLEEP_BEFORE_BENCH or a positional arg) with a sensible default of 60; update
the echo to print the chosen duration and use that variable for the sleep call
so CI or fast runs can set the env var or pass an argument to skip or shorten
the delay.
🧹 Nitpick comments (2)
pyproject.toml (1)

93-94: Consider scoping the suppression more narrowly.

Globally ignoring invalid-return-type may mask legitimate type errors elsewhere in the codebase. ty supports per-file rule overrides via [[tool.ty.overrides]] in pyproject.toml using glob patterns. Consider limiting this rule to the specific mixin files where protocol stubs with ... bodies are actually used (e.g., src/srtctl/cli/mixins/**/*.py).
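For illustration, a scoped override of the sort suggested might look like this (a hedged sketch; the table and key names follow ty's overrides documentation and should be verified against the installed ty version):

```toml
# Hypothetical narrowing: only the mixin files that use `...` protocol
# stubs skip the invalid-return-type rule; everything else keeps it.
[[tool.ty.overrides]]
include = ["src/srtctl/cli/mixins/**/*.py"]

[tool.ty.overrides.rules]
invalid-return-type = "ignore"
```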

src/srtctl/cli/mixins/metrics_stage.py (1)

73-74: Address or remove the TODO comment.

The TODO comment # TODO: ishan appears to be a development artifact. If there's pending work, consider creating an issue to track it; otherwise, remove the comment.

Do you want me to open an issue to track this TODO or should it be removed?

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e62a97 and cdd46c2.

📒 Files selected for processing (17)
  • Makefile
  • pyproject.toml
  • recipies/qwen3-32b/debug.yaml
  • src/srtctl/backends/base.py
  • src/srtctl/backends/sglang.py
  • src/srtctl/benchmarks/sa_bench.py
  • src/srtctl/benchmarks/scripts/sa-bench/bench.sh
  • src/srtctl/cli/do_sweep.py
  • src/srtctl/cli/mixins/__init__.py
  • src/srtctl/cli/mixins/frontend_stage.py
  • src/srtctl/cli/mixins/metrics_stage.py
  • src/srtctl/cli/mixins/worker_stage.py
  • src/srtctl/core/schema.py
  • tests/test_configs.py
  • tests/test_e2e.py
  • tests/test_frontend_topology.py
  • tests/test_metrics_stage.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Python 3.10+ with modern syntax including | unions and match statements
Use Ruff for linting and formatting with configuration in pyproject.toml, targeting 120 character line length
Use type hints everywhere and use ty for type checking
Use frozen dataclasses for all configuration objects with @dataclass(frozen=True) to ensure immutability
Prefer typing.Protocol over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling
Use marshmallow_dataclass to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization
Use factory classmethods named from_* for object construction (e.g., RuntimeContext.from_config(), RunMetadata.from_json()) to keep __init__ simple
Import type-only dependencies under if TYPE_CHECKING: guard to avoid circular imports, using string annotations for forward references
Use @property decorators for derived/computed values instead of storing computed state
Use decorator-based registry pattern for extensible registration (e.g., @register_benchmark("sa-bench")) to allow new implementations without modifying existing code
Use TypedDict for typing dictionaries from JSON/external sources where the structure cannot be controlled

Files:

  • tests/test_e2e.py
  • src/srtctl/cli/mixins/worker_stage.py
  • tests/test_metrics_stage.py
  • src/srtctl/cli/mixins/frontend_stage.py
  • src/srtctl/core/schema.py
  • src/srtctl/benchmarks/sa_bench.py
  • tests/test_configs.py
  • src/srtctl/backends/base.py
  • src/srtctl/cli/mixins/metrics_stage.py
  • src/srtctl/cli/mixins/__init__.py
  • tests/test_frontend_topology.py
  • src/srtctl/cli/do_sweep.py
  • src/srtctl/backends/sglang.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use check_dynamo_health(response_json, expected_prefill, expected_decode) for Dynamo backend health checks
Use check_sglang_router_health(response_json, expected_prefill, expected_decode) for SGLang router health checks, passing expected_prefill=0, expected_decode=num_agg for aggregated mode

Files:

  • tests/test_e2e.py
  • tests/test_metrics_stage.py
  • tests/test_configs.py
  • tests/test_frontend_topology.py
🧠 Learnings (6)
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use type hints everywhere and use `ty` for type checking

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use Python 3.10+ with modern syntax including `|` unions and `match` statements

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Prefer `typing.Protocol` over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench configuration, do not allow req_rate or concurrencies to be 0. Validate that both values are positive integers and raise a clear error if an attempt is made to set them to 0 (e.g., raise ValueError or validation error). Consider adding unit tests for negative and zero inputs and ensure the benchmark runs fail fast with informative messages.

Applied to files:

  • src/srtctl/benchmarks/sa_bench.py
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use `marshmallow_dataclass` to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization

Applied to files:

  • tests/test_configs.py
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench benchmark configuration, users will not set req_rate or concurrencies to 0, as these values would not make semantic sense in the benchmarking context (0 concurrency or 0 requests per second are not valid use cases).

Applied to files:

  • src/srtctl/benchmarks/scripts/sa-bench/bench.sh
🧬 Code graph analysis (9)
src/srtctl/cli/mixins/worker_stage.py (3)
src/srtctl/cli/do_sweep.py (1)
  • backend (57-59)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • backend (63-65)
src/srtctl/core/schema.py (1)
  • enabled (398-400)
src/srtctl/cli/mixins/frontend_stage.py (1)
src/srtctl/core/processes.py (1)
  • ManagedProcess (26-64)
src/srtctl/core/schema.py (4)
src/srtctl/cli/mixins/worker_stage.py (1)
  • backend (41-43)
src/srtctl/cli/do_sweep.py (1)
  • backend (57-59)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • backend (63-65)
src/srtctl/backends/sglang.py (1)
  • SGLangProtocol (60-306)
src/srtctl/benchmarks/sa_bench.py (1)
src/srtctl/core/schema.py (1)
  • served_model_name (722-735)
tests/test_configs.py (2)
src/srtctl/backends/sglang.py (3)
  • SGLangProtocol (60-306)
  • SGLangServerConfig (34-45)
  • MetricsConfig (49-56)
src/srtctl/core/schema.py (4)
  • FrontendConfig (559-576)
  • ResourceConfig (242-312)
  • SrtConfig (606-740)
  • enabled (398-400)
src/srtctl/cli/mixins/metrics_stage.py (7)
src/srtctl/core/processes.py (1)
  • ManagedProcess (26-64)
src/srtctl/core/slurm.py (1)
  • start_srun_process (138-257)
src/srtctl/cli/mixins/frontend_stage.py (3)
  • FrontendTopology (29-45)
  • backend_processes (68-70)
  • backend (63-65)
src/srtctl/core/runtime.py (1)
  • RuntimeContext (64-234)
src/srtctl/core/schema.py (2)
  • SrtConfig (606-740)
  • enabled (398-400)
src/srtctl/core/topology.py (2)
  • Endpoint (87-126)
  • Process (130-166)
src/srtctl/cli/mixins/worker_stage.py (3)
  • backend_processes (46-48)
  • endpoints (51-53)
  • backend (41-43)
src/srtctl/cli/mixins/__init__.py (1)
src/srtctl/cli/mixins/metrics_stage.py (1)
  • MetricsStageMixin (28-220)
tests/test_frontend_topology.py (1)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • start_frontend (176-202)
src/srtctl/cli/do_sweep.py (1)
src/srtctl/cli/mixins/metrics_stage.py (2)
  • MetricsStageMixin (28-220)
  • start_metrics_collection (154-220)
🔇 Additional comments (36)
src/srtctl/benchmarks/sa_bench.py (2)

65-66: LGTM! Minor note on fallback behavior.

The logic correctly prioritizes served_model_name. Based on the served_model_name property in schema.py, it always returns a value (falls back to Path(self.model.path).name), so the or config.model.path fallback should never trigger. This is acceptable as defensive coding.


73-82: LGTM!

The command arguments correctly align with the updated bench.sh signature: endpoint model isl osl concurrencies [req_rate].

Makefile (2)

80-87: LGTM!

The Prometheus download and installation follows the established pattern for NATS and ETCD, with proper retry logic, extraction, permissions, and cleanup.


5-5: Prometheus 3.8.1 is a valid release.

Version 3.8.1 was released on December 16, 2025, and the download URL format in the Makefile is correct. No changes needed.

tests/test_e2e.py (1)

312-312: LGTM!

The filter correctly excludes debug.yaml from parameterized test recipes while maintaining the existence check fallback.

src/srtctl/benchmarks/scripts/sa-bench/bench.sh (3)

6-6: LGTM!

The comment correctly documents the updated argument order.


10-15: LGTM!

Argument parsing correctly reflects the new positional order with MODEL_NAME as the second parameter.


31-39: LGTM!

The endpoint verification now correctly uses the dynamic ${MODEL_NAME} variable.

src/srtctl/core/schema.py (2)

545-546: LGTM: Platform-specific optimization for arm64 builds.

The conditional RUSTFLAGS injection for arm64 is correctly structured and will improve performance of the maturin build on ARM platforms.


647-659: LGTM: Metrics validation logic is correct.

The validation properly enforces the Dynamo frontend requirement for metrics collection. The early-return pattern keeps the code clean and readable.
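The check described can be sketched like so (a minimal reconstruction; the argument names are assumptions, and the real method reads these values from the config object):

```python
def validate_metrics(frontend_type: str, metrics_enabled: bool) -> None:
    """Enforce the rule above: SGLang metrics require the dynamo frontend.

    Early return keeps the happy path flat, matching the style the
    review calls out.
    """
    if not metrics_enabled:
        return
    if frontend_type != "dynamo":
        raise ValueError(
            "SGLang metrics collection requires the dynamo frontend, "
            f"got {frontend_type!r}"
        )
```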

recipies/qwen3-32b/debug.yaml (1)

8-13: Verify tensor-parallel configuration.

With agg_nodes: 2, gpus_per_node: 4, and agg_workers: 8, each worker gets (2 * 4) / 8 = 1 GPU, which correctly matches tensor-parallel-size: 1 in the sglang_config.

src/srtctl/cli/mixins/__init__.py (1)

11-24: LGTM: MetricsStageMixin properly exported.

The new mixin is correctly added to the module exports, following the existing pattern for stage mixins.

src/srtctl/backends/base.py (1)

70-82: LGTM: Protocol signature updated with backward-compatible default.

The metrics_enabled parameter is correctly added with a default of False, ensuring backward compatibility for any existing backend implementations.
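The backward-compatible shape being described looks roughly like this (a sketch; the real protocol method takes more parameters than shown here):

```python
from typing import Protocol


class WorkerBackend(Protocol):
    """Illustrative slice of the backend protocol (assumption)."""

    def build_worker_command(self, *, metrics_enabled: bool = False) -> list[str]: ...


class FakeBackend:
    def build_worker_command(self, *, metrics_enabled: bool = False) -> list[str]:
        cmd = ["python", "-m", "worker"]
        if metrics_enabled:
            cmd.append("--enable-metrics")
        return cmd
```

Existing call sites that never pass metrics_enabled keep working unchanged, since the keyword defaults to False.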

tests/test_frontend_topology.py (2)

238-242: LGTM: Test correctly unpacks the new tuple return type.

The test properly handles the updated start_frontend signature that now returns (processes, topology).


256-261: LGTM: Consistent handling of the new return type.

All test methods are consistently updated to unpack the tuple return value.

tests/test_configs.py (2)

11-15: LGTM: Imports correctly added for metrics validation tests.

The imports are properly organized and include all necessary types for the new test suite.


391-414: LGTM: Comprehensive test coverage for metrics validation.

The tests properly verify both the error case (sglang frontend with metrics enabled) and the success case (dynamo frontend with metrics enabled). The assertion in test_metrics_with_dynamo_frontend_ok confirms the config was created successfully with metrics enabled.
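Tests covering both sides of such a rule typically take this shape (a hedged sketch using a local stand-in validator, not the project's real SrtConfig construction):

```python
import pytest


def validate(frontend: str, metrics_enabled: bool) -> None:
    # Local stand-in for the schema validation (assumption).
    if metrics_enabled and frontend != "dynamo":
        raise ValueError("metrics require the dynamo frontend")


def test_metrics_with_sglang_frontend_fails() -> None:
    with pytest.raises(ValueError, match="dynamo frontend"):
        validate("sglang", metrics_enabled=True)


def test_metrics_with_dynamo_frontend_ok() -> None:
    validate("dynamo", metrics_enabled=True)  # should not raise
```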

src/srtctl/cli/mixins/frontend_stage.py (1)

176-202: LGTM: Return type change is clean and well-documented.

The method now returns the computed topology alongside the processes, enabling downstream code (like MetricsStageMixin.start_metrics_collection) to use topology information for Prometheus configuration. The change is minimal and the docstring accurately reflects the new return type.

src/srtctl/cli/mixins/worker_stage.py (1)

109-109: LGTM!

The metrics_enabled flag is correctly propagated from the backend configuration to the worker command builder, enabling Prometheus metrics collection when configured.

src/srtctl/cli/do_sweep.py (3)

24-24: LGTM!

Import addition for MetricsStageMixin is correct.


42-43: LGTM!

The MetricsStageMixin is properly integrated into the class hierarchy, enabling the orchestrator to manage Prometheus metrics collection.


203-209: The metrics collection integration follows correct patterns and type signatures.

The start_frontend method properly returns tuple[list[ManagedProcess], FrontendTopology], enabling the unpacking at line 203. The subsequent calls are correct:

  • frontend_procs (list) is iterated with add_process() for individual registration
  • metrics_procs is returned as NamedProcesses (a dict[str, ManagedProcess]) and passed to add_processes(), which expects this exact type

Execution order is logical: frontend infrastructure is initialized before metrics collection, which needs the topology for Prometheus scrape target configuration.

src/srtctl/backends/sglang.py (4)

48-57: LGTM!

The MetricsConfig dataclass follows best practices:

  • Frozen for immutability
  • Sensible defaults (disabled by default, standard Prometheus port 9090)
  • ClassVar Schema for marshmallow integration

95-97: LGTM!

Proper use of field(default_factory=MetricsConfig) for the nested frozen dataclass.


203-216: LGTM!

The metrics_enabled parameter is properly added with a default value of False for backward compatibility, and the docstring is appropriately updated.


302-305: The --enable-metrics flag is correctly implemented and valid for both backends.

The flag is properly supported by both sglang.launch_server and dynamo.sglang. SGLang exposes Prometheus metrics at /metrics on the server port when enabled, and Dynamo similarly supports the flag with metrics prefixed as sglang:... and dynamo_*. The conditional addition is correct.
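Putting the observations above together, the configuration plausibly reduces to a frozen dataclass like this (a sketch; the marshmallow Schema ClassVar and any additional fields are omitted, and names beyond enabled and the port are assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsConfig:
    """Metrics settings: disabled by default, standard Prometheus port."""

    enabled: bool = False
    prometheus_port: int = 9090
```

Being frozen, instances are safe to share across stages; changing a setting means constructing a new instance, e.g. MetricsConfig(enabled=True).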

tests/test_metrics_stage.py (6)

19-45: LGTM!

The make_config helper provides a clean factory for creating test configurations with sensible defaults and supports customizing metrics-related parameters.


48-62: LGTM!

The make_runtime helper creates a minimal RuntimeContext suitable for unit testing the metrics stage logic.


65-72: LGTM!

Simple and effective helper for creating test FrontendTopology instances.


75-117: LGTM!

Comprehensive test coverage for metrics node selection logic:

  • Single node fallback to head
  • Two nodes prefers non-head
  • Many nodes selects last worker
  • Explicit head node exclusion verification

120-239: LGTM!

Thorough test coverage for Prometheus config generation:

  • YAML validity and structure
  • Scrape interval configuration
  • Worker and frontend target generation
  • Port format verification (host:port)
  • Label presence validation

242-347: LGTM!

Well-structured integration tests for metrics collection startup:

  • Verifies disabled mode returns empty dict
  • Verifies enabled mode starts Prometheus on non-head node with correct criticality
  • Validates port configuration propagation
  • Confirms Prometheus config file creation with expected content
src/srtctl/cli/mixins/metrics_stage.py (4)

1-26: LGTM!

Clean module structure with appropriate imports and type-checking guards. The docstring clearly explains the mixin's purpose.


28-54: LGTM!

The mixin clearly documents its dependencies and uses type hints with ellipsis (...) for abstract property stubs.


56-82: LGTM!

The node selection logic is well-designed:

  1. Excludes head node to avoid resource contention with NATS/etcd
  2. Prefers later worker nodes to minimize conflict with frontends
  3. Falls back gracefully to head node with a warning when no alternatives exist
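That policy can be condensed into a few lines (a hypothetical reconstruction of select_metrics_node; the real method pulls the node list and head node from the runtime context):

```python
def select_metrics_node(nodes: list[str], head: str) -> str:
    """Pick a node for Prometheus following the policy above: avoid the
    head node (NATS/etcd live there), prefer the last worker node (least
    likely to host a frontend), and fall back to the head node only when
    it is the sole node available."""
    candidates = [n for n in nodes if n != head]
    if not candidates:
        return head  # single-node fallback; the real code logs a warning
    return candidates[-1]
```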

84-152: LGTM!

Well-structured Prometheus config generation:

  • Separate scrape jobs for workers and frontends enable independent monitoring
  • Rich labels (worker_mode, worker_index, node, node_rank) support powerful PromQL queries
  • Frontend job is conditionally added only when frontends exist
  • Clean YAML output with default_flow_style=False
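A generator with that structure might look like the following (a sketch with assumed job and target names; the real method also attaches the per-target labels listed above and derives targets from the topology objects):

```python
import yaml  # PyYAML, which the real mixin uses per the review


def generate_prometheus_config(
    worker_targets: list[str],
    frontend_targets: list[str],
    scrape_interval: str = "5s",
) -> str:
    """Emit a Prometheus YAML config with separate scrape jobs for
    workers and frontends, adding the frontend job only when frontends
    exist (mirroring the conditional noted in the review)."""
    config: dict = {
        "global": {"scrape_interval": scrape_interval},
        "scrape_configs": [
            {"job_name": "workers", "static_configs": [{"targets": worker_targets}]},
        ],
    }
    if frontend_targets:
        config["scrape_configs"].append(
            {"job_name": "frontends", "static_configs": [{"targets": frontend_targets}]}
        )
    return yaml.dump(config, default_flow_style=False)
```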

Comment on lines +49 to +55
# Mooncake benchmark using aiperf
benchmark:
  type: "sa-bench"
  isl: 100
  osl: 100
  concurrencies: "1x2x4"
  req_rate: "inf"

⚠️ Potential issue | 🟡 Minor

Misleading comment: benchmark type does not match description.

The comment says "Mooncake benchmark using aiperf" but the actual benchmark type is sa-bench. Update the comment to accurately describe the benchmark:

-# Mooncake benchmark using aiperf
+# Performance benchmark using sa-bench
 benchmark:
   type: "sa-bench"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-# Mooncake benchmark using aiperf
+# Performance benchmark using sa-bench
 benchmark:
   type: "sa-bench"
   isl: 100
   osl: 100
   concurrencies: "1x2x4"
   req_rate: "inf"
🤖 Prompt for AI Agents
In @recipies/qwen3-32b/debug.yaml around lines 49 - 55, The comment above the
benchmark block is misleading — it says "Mooncake benchmark using aiperf" but
the benchmark is configured as type: "sa-bench"; update the comment to
accurately describe the configured benchmark (refer to the benchmark block and
the type: "sa-bench" field) so the description matches the actual configuration.

Comment on lines +62 to +65
# Sleep before starting main benchmark
echo "Sleeping for 60 seconds before starting main benchmark..."
echo "This will help show a viisble difference in any graphs..."
sleep 60

⚠️ Potential issue | 🟡 Minor

Fix typo and consider making the sleep duration configurable.

Line 64 has a typo: "viisble" → "visible".

The 60-second sleep adds significant overhead to each benchmark run. Consider making this configurable via an environment variable or argument for scenarios where the delay isn't needed (e.g., CI pipelines).

Proposed fix
 # Sleep before starting main benchmark
-echo "Sleeping for 60 seconds before starting main benchmark..."
-echo "This will help show a viisble difference in any graphs..."
-sleep 60
+SLEEP_DURATION=${7:-60}
+if [ "$SLEEP_DURATION" -gt 0 ]; then
+    echo "Sleeping for ${SLEEP_DURATION} seconds before starting main benchmark..."
+    echo "This will help show a visible difference in any graphs..."
+    sleep "$SLEEP_DURATION"
+fi
🤖 Prompt for AI Agents
In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh around lines 62 - 65, Fix
the typo in the echo message ("viisble" → "visible") and make the fixed
60-second sleep configurable by introducing a variable (e.g., SLEEP_BEFORE_BENCH
or a positional arg) with a sensible default of 60; update the echo to print the
chosen duration and use that variable for the sleep call so CI or fast runs can
set the env var or pass an argument to skip or shorten the delay.

Comment on lines +154 to +219
def start_metrics_collection(self, frontend_topology: "FrontendTopology") -> NamedProcesses:
    """Start Prometheus metrics collection.

    Launches Prometheus on a non-head node to scrape metrics from all
    dynamo worker and frontend endpoints.

    Args:
        frontend_topology: Frontend topology for scraping frontend metrics.

    Returns:
        Dictionary of process names to ManagedProcess objects.
    """
    processes: NamedProcesses = {}

    # Check if metrics collection is enabled
    if not self.config.backend.metrics.enabled:
        logger.info("Metrics collection not enabled")
        return processes

    logger.info("Starting Prometheus metrics collection")

    # Select node for Prometheus
    prometheus_node = self.select_metrics_node()
    logger.info("Prometheus node: %s", prometheus_node)

    # Generate Prometheus config
    prometheus_config_content = self.generate_prometheus_config(frontend_topology)
    prometheus_config_path = self.runtime.log_dir / "prometheus.yml"
    prometheus_config_path.write_text(prometheus_config_content)
    logger.info("Generated Prometheus config: %s", prometheus_config_path)

    # Log file for Prometheus
    prometheus_log = self.runtime.log_dir / f"prometheus_{prometheus_node}.out"
    prometheus_port = self.config.backend.metrics.prometheus_port

    # Run Prometheus binary directly (like nats/etcd pattern)
    # Binary installed via `make setup`, mounted at /configs
    cmd = [
        "/configs/prometheus",
        "--config.file=/logs/prometheus.yml",
        f"--web.listen-address=:{prometheus_port}",
        "--storage.tsdb.path=/logs/prometheus_data",
        "--storage.tsdb.retention.time=1h",
    ]

    logger.info("Prometheus port: %d", prometheus_port)
    logger.info("Prometheus log: %s", prometheus_log)

    proc = start_srun_process(
        command=cmd,
        nodelist=[prometheus_node],
        output=str(prometheus_log),
        container_image=str(self.runtime.container_image),
        container_mounts=self.runtime.container_mounts,
        use_bash_wrapper=False,
    )

    processes["prometheus"] = ManagedProcess(
        name="prometheus",
        popen=proc,
        log_file=prometheus_log,
        node=prometheus_node,
        critical=False,  # Metrics collection is non-critical
    )

    logger.info("Prometheus started on %s:%d", prometheus_node, prometheus_port)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify configs directory mount setup in RuntimeContext
echo "=== Searching for 'configs' in runtime.py ==="
rg -n "configs" src/srtctl/core/runtime.py -C 3

echo ""
echo "=== Searching for 'container_mounts' setup ==="
rg -n "container_mounts" src/srtctl/core/runtime.py -C 2

echo ""
echo "=== Checking for any Prometheus-related mounts ==="
rg -n "prometheus|/configs" src/srtctl/cli/mixins/metrics_stage.py -C 2

Repository: ishandhanani/srt-slurm

Length of output: 5596


🏁 Script executed:

#!/bin/bash
# Check if there's any validation preventing metrics.enabled without /configs mount
echo "=== Searching for metrics.enabled validation ==="
rg -n "metrics\.enabled" src/srtctl -C 3 --type py

echo ""
echo "=== Check if Prometheus binary path is configurable ==="
rg -n "prometheus.*path|prometheus.*binary" src/srtctl -C 2 --type py

Repository: ishandhanani/srt-slurm

Length of output: 2395


🏁 Script executed:

#!/bin/bash
# Check what happens during setup - verify make setup handles Prometheus binary
echo "=== Check for Prometheus setup in Makefile ==="
rg -n "prometheus" Makefile -C 2

echo ""
echo "=== Check if configs directory is documented as required ==="
rg -n "configs|SRTCTL_SOURCE_DIR" README.md -C 2 | head -30

Repository: ishandhanani/srt-slurm

Length of output: 870


🏁 Script executed:

#!/bin/bash
# Check if there's any orchestration/deployment that ensures configs is available when metrics are used
echo "=== Check sbatch script handling of SRTCTL_SOURCE_DIR ==="
rg -n "SRTCTL_SOURCE_DIR" . -C 2 --type sh

echo ""
echo "=== Check how metrics configuration is typically used ==="
rg -n "metrics.*enabled.*true|metrics.*enabled\s*=" src/srtctl -C 1 --type py | head -30

Repository: ishandhanani/srt-slurm

Length of output: 425


🏁 Script executed:

#!/bin/bash
# Final check: verify if there's any error handling for missing Prometheus binary
echo "=== Check start_srun_process for error handling ==="
rg -n "def start_srun_process" src/srtctl -A 15 --type py

echo ""
echo "=== Check if there are tests that verify metrics setup ==="
fd -n "test.*metrics" . --type py

echo ""
echo "=== Verify the actual validation in schema.py ==="
sed -n '650,670p' src/srtctl/core/schema.py

Repository: ishandhanani/srt-slurm

Length of output: 2393


Address missing validation for Prometheus binary availability.

While the implementation is solid, there's a gap between configuration validation and runtime assumptions. The /configs mount is only conditionally added in RuntimeContext.from_config() (lines 149-155 of runtime.py):

if source_dir:
    configs_dir = Path(source_dir) / "configs"
    if configs_dir.exists():
        container_mounts[configs_dir.resolve()] = Path("/configs")

However, start_metrics_collection() hardcodes the Prometheus binary path (/configs/prometheus) without validating the mount exists. The schema validation only checks that metrics require a Dynamo frontend, not that the Prometheus binary is available.

This will fail at runtime if SRTCTL_SOURCE_DIR is unset or the configs directory doesn't exist—even though metrics are enabled in the config. Add a validation check in the config schema to ensure metrics cannot be enabled without a valid SRTCTL_SOURCE_DIR, or add a defensive check in start_metrics_collection() before launching Prometheus.

@ishandhanani
Owner Author

Use aiperf instead
