
promethus + telemetry #72

Closed
ishandhanani wants to merge 10 commits into ishan/woah from ishan/e2e-telem

Conversation

@ishandhanani
Owner

@ishandhanani ishandhanani commented Dec 31, 2025

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics collection framework with configuration validation and scraping support.
    • Introduced debug recipe configuration for Qwen3-32B model.
    • Added 60-second stabilization delay before benchmark execution for improved result accuracy.
    • Implemented dynamic model parameter support in benchmarking workflow.
  • Chores

    • Enabled ARM64 platform-specific SDK optimization for improved build performance.


@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

Walkthrough

This pull request introduces Prometheus metrics collection support into the SRT orchestration framework. It adds metrics configuration to the SGLang backend, creates a new MetricsStageMixin for managing Prometheus setup, updates the frontend_stage return signature to provide topology information, integrates metrics startup into the sweep orchestrator, adds dynamic model naming to benchmarks, and includes comprehensive test coverage for the new metrics functionality.

Changes

Cohort / File(s): Summary

  • Infrastructure & Configuration (Makefile, pyproject.toml): Introduces Prometheus 3.8.1 download/installation in the Makefile setup target; adds a ty rule to suppress invalid-return-type errors for protocol stubs.
  • Metrics Backend Support (src/srtctl/backends/base.py, src/srtctl/backends/sglang.py): Adds a metrics_enabled: bool parameter to the build_worker_command protocol method; introduces a MetricsConfig dataclass and extends SGLangProtocol with a metrics field; integrates the --enable-metrics CLI flag.
  • CLI Orchestration & Stage Mixins (src/srtctl/cli/mixins/__init__.py, src/srtctl/cli/mixins/frontend_stage.py, src/srtctl/cli/mixins/worker_stage.py, src/srtctl/cli/do_sweep.py): Exports the new MetricsStageMixin; updates the start_frontend return type to tuple[list[ManagedProcess], FrontendTopology] to provide topology; passes metrics configuration to backend command construction; integrates metrics startup into SweepOrchestrator.
  • Metrics Collection Management (src/srtctl/cli/mixins/metrics_stage.py): New mixin providing node selection, Prometheus YAML config generation from backend/frontend scrape targets, and orchestrated Prometheus startup on a non-head node.
  • Core Configuration & Validation (src/srtctl/core/schema.py): Adds a _validate_metrics method enforcing the dynamo frontend requirement when SGLang metrics are enabled; injects an ARM64 RUSTFLAGS optimization for source builds.
  • Benchmark Enhancements (src/srtctl/benchmarks/sa_bench.py, src/srtctl/benchmarks/scripts/sa-bench/bench.sh): Introduces a dynamic MODEL_NAME parameter replacing hardcoded model references; adds a 60-second stabilization delay; updates the endpoint verification payload.
  • Recipe Configuration (recipes/qwen3-32b/debug.yaml): New debug configuration with the qwen3-32b model, GPU/node resource settings, dynamo frontend, sglang backend with metrics, and extended context via rope scaling.
  • Test Coverage (tests/test_configs.py, tests/test_frontend_topology.py, tests/test_e2e.py, tests/test_metrics_stage.py): Adds metrics validation tests; updates topology unpacking in frontend tests; excludes the debug recipe from E2E tests; introduces a comprehensive 347-line metrics stage test suite validating node selection, Prometheus config generation, and metrics collection startup.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant SweepOrch as SweepOrchestrator
    participant FrontendMixin as FrontendStageMixin
    participant WorkerMixin as WorkerStageMixin
    participant MetricsMixin as MetricsStageMixin
    participant Prometheus as Prometheus
    participant Backend as Backend Worker
    
    User->>SweepOrch: start_sweep()
    
    SweepOrch->>FrontendMixin: start_frontend(registry)
    FrontendMixin-->>SweepOrch: (processes, topology)
    
    SweepOrch->>WorkerMixin: start_workers()
    WorkerMixin->>Backend: build_worker_command(metrics_enabled=true)
    Backend-->>WorkerMixin: command + --enable-metrics
    WorkerMixin-->>SweepOrch: worker_processes
    
    SweepOrch->>MetricsMixin: start_metrics_collection(topology)
    MetricsMixin->>MetricsMixin: select_metrics_node()
    MetricsMixin->>MetricsMixin: generate_prometheus_config(topology)
    MetricsMixin->>Prometheus: start_srun_process(prometheus)
    Prometheus-->>MetricsMixin: prometheus_process
    MetricsMixin-->>SweepOrch: metrics_processes
    
    SweepOrch->>SweepOrch: run_benchmark()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A metrics tale unfolds,
Prometheus now holds,
The secrets of performance bright,
With dynamo's dynamite,
Workers shine their telemetry light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check ❓ Inconclusive: The title 'promethus + telemetry' is vague and contains a typo ('promethus' instead of 'Prometheus'), making it unclear without context. Resolution: correct the spelling to 'Prometheus' and make the title more specific. Consider: 'Add Prometheus metrics collection and telemetry infrastructure' to better communicate the scope and intent of the changes.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 97.62%, which is sufficient. The required threshold is 80.00%.



@ishandhanani ishandhanani changed the title from "go" to "promethus + telemetry" on Dec 31, 2025
@ishandhanani
Owner Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @recipies/qwen3-32b/debug.yaml:
- Around line 49-55: The comment above the benchmark block is misleading — it
says "Mooncake benchmark using aiperf" but the benchmark is configured as type:
"sa-bench"; update the comment to accurately describe the configured benchmark
(refer to the benchmark block and the type: "sa-bench" field) so the description
matches the actual configuration.

In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh:
- Around line 62-65: Fix the typo in the echo message ("viisble" → "visible")
and make the fixed 60-second sleep configurable by introducing a variable (e.g.,
SLEEP_BEFORE_BENCH or a positional arg) with a sensible default of 60; update
the echo to print the chosen duration and use that variable for the sleep call
so CI or fast runs can set the env var or pass an argument to skip or shorten
the delay.
🧹 Nitpick comments (2)
pyproject.toml (1)

93-94: Consider scoping the suppression more narrowly.

Globally ignoring invalid-return-type may mask legitimate type errors elsewhere in the codebase. ty supports per-file rule overrides via [[tool.ty.overrides]] in pyproject.toml using glob patterns. Consider limiting this rule to the specific mixin files where protocol stubs with ... bodies are actually used (e.g., src/srtctl/cli/mixins/**/*.py).
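For illustration, a scoped override of the sort suggested might look like this (a hedged sketch; the table and key names follow ty's overrides documentation and should be verified against the installed ty version):

```toml
# Hypothetical narrowing: only the mixin files that use `...` protocol
# stubs skip the invalid-return-type rule; everything else keeps it.
[[tool.ty.overrides]]
include = ["src/srtctl/cli/mixins/**/*.py"]

[tool.ty.overrides.rules]
invalid-return-type = "ignore"
```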

src/srtctl/cli/mixins/metrics_stage.py (1)

73-74: Address or remove the TODO comment.

The TODO comment # TODO: ishan appears to be a development artifact. If there's pending work, consider creating an issue to track it; otherwise, remove the comment.

Do you want me to open an issue to track this TODO or should it be removed?

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e62a97 and cdd46c2.

📒 Files selected for processing (17)
  • Makefile
  • pyproject.toml
  • recipies/qwen3-32b/debug.yaml
  • src/srtctl/backends/base.py
  • src/srtctl/backends/sglang.py
  • src/srtctl/benchmarks/sa_bench.py
  • src/srtctl/benchmarks/scripts/sa-bench/bench.sh
  • src/srtctl/cli/do_sweep.py
  • src/srtctl/cli/mixins/__init__.py
  • src/srtctl/cli/mixins/frontend_stage.py
  • src/srtctl/cli/mixins/metrics_stage.py
  • src/srtctl/cli/mixins/worker_stage.py
  • src/srtctl/core/schema.py
  • tests/test_configs.py
  • tests/test_e2e.py
  • tests/test_frontend_topology.py
  • tests/test_metrics_stage.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Python 3.10+ with modern syntax including | unions and match statements
Use Ruff for linting and formatting with configuration in pyproject.toml, targeting 120 character line length
Use type hints everywhere and use ty for type checking
Use frozen dataclasses for all configuration objects with @dataclass(frozen=True) to ensure immutability
Prefer typing.Protocol over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling
Use marshmallow_dataclass to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization
Use factory classmethods named from_* for object construction (e.g., RuntimeContext.from_config(), RunMetadata.from_json()) to keep __init__ simple
Import type-only dependencies under if TYPE_CHECKING: guard to avoid circular imports, using string annotations for forward references
Use @property decorators for derived/computed values instead of storing computed state
Use decorator-based registry pattern for extensible registration (e.g., @register_benchmark("sa-bench")) to allow new implementations without modifying existing code
Use TypedDict for typing dictionaries from JSON/external sources where the structure cannot be controlled

Files:

  • tests/test_e2e.py
  • src/srtctl/cli/mixins/worker_stage.py
  • tests/test_metrics_stage.py
  • src/srtctl/cli/mixins/frontend_stage.py
  • src/srtctl/core/schema.py
  • src/srtctl/benchmarks/sa_bench.py
  • tests/test_configs.py
  • src/srtctl/backends/base.py
  • src/srtctl/cli/mixins/metrics_stage.py
  • src/srtctl/cli/mixins/__init__.py
  • tests/test_frontend_topology.py
  • src/srtctl/cli/do_sweep.py
  • src/srtctl/backends/sglang.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use check_dynamo_health(response_json, expected_prefill, expected_decode) for Dynamo backend health checks
Use check_sglang_router_health(response_json, expected_prefill, expected_decode) for SGLang router health checks, passing expected_prefill=0, expected_decode=num_agg for aggregated mode

Files:

  • tests/test_e2e.py
  • tests/test_metrics_stage.py
  • tests/test_configs.py
  • tests/test_frontend_topology.py
🧠 Learnings (6)
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use type hints everywhere and use `ty` for type checking

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use Python 3.10+ with modern syntax including `|` unions and `match` statements

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Prefer `typing.Protocol` over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling

Applied to files:

  • pyproject.toml
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench configuration, do not allow req_rate or concurrencies to be 0. Validate that both values are positive integers and raise a clear error if an attempt is made to set them to 0 (e.g., raise ValueError or validation error). Consider adding unit tests for negative and zero inputs and ensure the benchmark runs fail fast with informative messages.

Applied to files:

  • src/srtctl/benchmarks/sa_bench.py
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use `marshmallow_dataclass` to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization

Applied to files:

  • tests/test_configs.py
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench benchmark configuration, users will not set req_rate or concurrencies to 0, as these values would not make semantic sense in the benchmarking context (0 concurrency or 0 requests per second are not valid use cases).

Applied to files:

  • src/srtctl/benchmarks/scripts/sa-bench/bench.sh
🧬 Code graph analysis (9)
src/srtctl/cli/mixins/worker_stage.py (3)
src/srtctl/cli/do_sweep.py (1)
  • backend (57-59)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • backend (63-65)
src/srtctl/core/schema.py (1)
  • enabled (398-400)
src/srtctl/cli/mixins/frontend_stage.py (1)
src/srtctl/core/processes.py (1)
  • ManagedProcess (26-64)
src/srtctl/core/schema.py (4)
src/srtctl/cli/mixins/worker_stage.py (1)
  • backend (41-43)
src/srtctl/cli/do_sweep.py (1)
  • backend (57-59)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • backend (63-65)
src/srtctl/backends/sglang.py (1)
  • SGLangProtocol (60-306)
src/srtctl/benchmarks/sa_bench.py (1)
src/srtctl/core/schema.py (1)
  • served_model_name (722-735)
tests/test_configs.py (2)
src/srtctl/backends/sglang.py (3)
  • SGLangProtocol (60-306)
  • SGLangServerConfig (34-45)
  • MetricsConfig (49-56)
src/srtctl/core/schema.py (4)
  • FrontendConfig (559-576)
  • ResourceConfig (242-312)
  • SrtConfig (606-740)
  • enabled (398-400)
src/srtctl/cli/mixins/metrics_stage.py (7)
src/srtctl/core/processes.py (1)
  • ManagedProcess (26-64)
src/srtctl/core/slurm.py (1)
  • start_srun_process (138-257)
src/srtctl/cli/mixins/frontend_stage.py (3)
  • FrontendTopology (29-45)
  • backend_processes (68-70)
  • backend (63-65)
src/srtctl/core/runtime.py (1)
  • RuntimeContext (64-234)
src/srtctl/core/schema.py (2)
  • SrtConfig (606-740)
  • enabled (398-400)
src/srtctl/core/topology.py (2)
  • Endpoint (87-126)
  • Process (130-166)
src/srtctl/cli/mixins/worker_stage.py (3)
  • backend_processes (46-48)
  • endpoints (51-53)
  • backend (41-43)
src/srtctl/cli/mixins/__init__.py (1)
src/srtctl/cli/mixins/metrics_stage.py (1)
  • MetricsStageMixin (28-220)
tests/test_frontend_topology.py (1)
src/srtctl/cli/mixins/frontend_stage.py (1)
  • start_frontend (176-202)
src/srtctl/cli/do_sweep.py (1)
src/srtctl/cli/mixins/metrics_stage.py (2)
  • MetricsStageMixin (28-220)
  • start_metrics_collection (154-220)
🔇 Additional comments (36)
src/srtctl/benchmarks/sa_bench.py (2)

65-66: LGTM! Minor note on fallback behavior.

The logic correctly prioritizes served_model_name. Based on the served_model_name property in schema.py, it always returns a value (falls back to Path(self.model.path).name), so the or config.model.path fallback should never trigger. This is acceptable as defensive coding.


73-82: LGTM!

The command arguments correctly align with the updated bench.sh signature: endpoint model isl osl concurrencies [req_rate].

Makefile (2)

80-87: LGTM!

The Prometheus download and installation follows the established pattern for NATS and ETCD, with proper retry logic, extraction, permissions, and cleanup.


5-5: Prometheus 3.8.1 is a valid release.

Version 3.8.1 was released on December 16, 2025, and the download URL format in the Makefile is correct. No changes needed.

tests/test_e2e.py (1)

312-312: LGTM!

The filter correctly excludes debug.yaml from parameterized test recipes while maintaining the existence check fallback.

src/srtctl/benchmarks/scripts/sa-bench/bench.sh (3)

6-6: LGTM!

The comment correctly documents the updated argument order.


10-15: LGTM!

Argument parsing correctly reflects the new positional order with MODEL_NAME as the second parameter.


31-39: LGTM!

The endpoint verification now correctly uses the dynamic ${MODEL_NAME} variable.

src/srtctl/core/schema.py (2)

545-546: LGTM: Platform-specific optimization for arm64 builds.

The conditional RUSTFLAGS injection for arm64 is correctly structured and will improve performance of the maturin build on ARM platforms.


647-659: LGTM: Metrics validation logic is correct.

The validation properly enforces the Dynamo frontend requirement for metrics collection. The early-return pattern keeps the code clean and readable.
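The check described can be sketched like so (a minimal reconstruction; the argument names are assumptions, and the real method reads these values from the config object):

```python
def validate_metrics(frontend_type: str, metrics_enabled: bool) -> None:
    """Enforce the rule above: SGLang metrics require the dynamo frontend.

    Early return keeps the happy path flat, matching the style the
    review calls out.
    """
    if not metrics_enabled:
        return
    if frontend_type != "dynamo":
        raise ValueError(
            "SGLang metrics collection requires the dynamo frontend, "
            f"got {frontend_type!r}"
        )
```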

recipies/qwen3-32b/debug.yaml (1)

8-13: Verify tensor-parallel configuration.

With agg_nodes: 2, gpus_per_node: 4, and agg_workers: 8, each worker gets (2 * 4) / 8 = 1 GPU, which correctly matches tensor-parallel-size: 1 in the sglang_config.

src/srtctl/cli/mixins/__init__.py (1)

11-24: LGTM: MetricsStageMixin properly exported.

The new mixin is correctly added to the module exports, following the existing pattern for stage mixins.

src/srtctl/backends/base.py (1)

70-82: LGTM: Protocol signature updated with backward-compatible default.

The metrics_enabled parameter is correctly added with a default of False, ensuring backward compatibility for any existing backend implementations.
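The backward-compatible shape being described looks roughly like this (a sketch; the real protocol method takes more parameters than shown here):

```python
from typing import Protocol


class WorkerBackend(Protocol):
    """Illustrative slice of the backend protocol (assumption)."""

    def build_worker_command(self, *, metrics_enabled: bool = False) -> list[str]: ...


class FakeBackend:
    def build_worker_command(self, *, metrics_enabled: bool = False) -> list[str]:
        cmd = ["python", "-m", "worker"]
        if metrics_enabled:
            cmd.append("--enable-metrics")
        return cmd
```

Existing call sites that never pass metrics_enabled keep working unchanged, since the keyword defaults to False.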

tests/test_frontend_topology.py (2)

238-242: LGTM: Test correctly unpacks the new tuple return type.

The test properly handles the updated start_frontend signature that now returns (processes, topology).


256-261: LGTM: Consistent handling of the new return type.

All test methods are consistently updated to unpack the tuple return value.

tests/test_configs.py (2)

11-15: LGTM: Imports correctly added for metrics validation tests.

The imports are properly organized and include all necessary types for the new test suite.


391-414: LGTM: Comprehensive test coverage for metrics validation.

The tests properly verify both the error case (sglang frontend with metrics enabled) and the success case (dynamo frontend with metrics enabled). The assertion in test_metrics_with_dynamo_frontend_ok confirms the config was created successfully with metrics enabled.
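Tests covering both sides of such a rule typically take this shape (a hedged sketch using a local stand-in validator, not the project's real SrtConfig construction):

```python
import pytest


def validate(frontend: str, metrics_enabled: bool) -> None:
    # Local stand-in for the schema validation (assumption).
    if metrics_enabled and frontend != "dynamo":
        raise ValueError("metrics require the dynamo frontend")


def test_metrics_with_sglang_frontend_fails() -> None:
    with pytest.raises(ValueError, match="dynamo frontend"):
        validate("sglang", metrics_enabled=True)


def test_metrics_with_dynamo_frontend_ok() -> None:
    validate("dynamo", metrics_enabled=True)  # should not raise
```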

src/srtctl/cli/mixins/frontend_stage.py (1)

176-202: LGTM: Return type change is clean and well-documented.

The method now returns the computed topology alongside the processes, enabling downstream code (like MetricsStageMixin.start_metrics_collection) to use topology information for Prometheus configuration. The change is minimal and the docstring accurately reflects the new return type.

src/srtctl/cli/mixins/worker_stage.py (1)

109-109: LGTM!

The metrics_enabled flag is correctly propagated from the backend configuration to the worker command builder, enabling Prometheus metrics collection when configured.

src/srtctl/cli/do_sweep.py (3)

24-24: LGTM!

Import addition for MetricsStageMixin is correct.


42-43: LGTM!

The MetricsStageMixin is properly integrated into the class hierarchy, enabling the orchestrator to manage Prometheus metrics collection.


203-209: The metrics collection integration follows correct patterns and type signatures.

The start_frontend method properly returns tuple[list[ManagedProcess], FrontendTopology], enabling the unpacking at line 203. The subsequent calls are correct:

  • frontend_procs (list) is iterated with add_process() for individual registration
  • metrics_procs is returned as NamedProcesses (a dict[str, ManagedProcess]) and passed to add_processes(), which expects this exact type

Execution order is logical: frontend infrastructure is initialized before metrics collection, which needs the topology for Prometheus scrape target configuration.

src/srtctl/backends/sglang.py (4)

48-57: LGTM!

The MetricsConfig dataclass follows best practices:

  • Frozen for immutability
  • Sensible defaults (disabled by default, standard Prometheus port 9090)
  • ClassVar Schema for marshmallow integration

95-97: LGTM!

Proper use of field(default_factory=MetricsConfig) for the nested frozen dataclass.


203-216: LGTM!

The metrics_enabled parameter is properly added with a default value of False for backward compatibility, and the docstring is appropriately updated.


302-305: The --enable-metrics flag is correctly implemented and valid for both backends.

The flag is properly supported by both sglang.launch_server and dynamo.sglang. SGLang exposes Prometheus metrics at /metrics on the server port when enabled, and Dynamo similarly supports the flag with metrics prefixed as sglang:... and dynamo_*. The conditional addition is correct.
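Putting the observations above together, the configuration plausibly reduces to a frozen dataclass like this (a sketch; the marshmallow Schema ClassVar and any additional fields are omitted, and names beyond enabled and the port are assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsConfig:
    """Metrics settings: disabled by default, standard Prometheus port."""

    enabled: bool = False
    prometheus_port: int = 9090
```

Being frozen, instances are safe to share across stages; changing a setting means constructing a new instance, e.g. MetricsConfig(enabled=True).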

tests/test_metrics_stage.py (6)

19-45: LGTM!

The make_config helper provides a clean factory for creating test configurations with sensible defaults and supports customizing metrics-related parameters.


48-62: LGTM!

The make_runtime helper creates a minimal RuntimeContext suitable for unit testing the metrics stage logic.


65-72: LGTM!

Simple and effective helper for creating test FrontendTopology instances.


75-117: LGTM!

Comprehensive test coverage for metrics node selection logic:

  • Single node fallback to head
  • Two nodes prefers non-head
  • Many nodes selects last worker
  • Explicit head node exclusion verification

120-239: LGTM!

Thorough test coverage for Prometheus config generation:

  • YAML validity and structure
  • Scrape interval configuration
  • Worker and frontend target generation
  • Port format verification (host:port)
  • Label presence validation

242-347: LGTM!

Well-structured integration tests for metrics collection startup:

  • Verifies disabled mode returns empty dict
  • Verifies enabled mode starts Prometheus on non-head node with correct criticality
  • Validates port configuration propagation
  • Confirms Prometheus config file creation with expected content
src/srtctl/cli/mixins/metrics_stage.py (4)

1-26: LGTM!

Clean module structure with appropriate imports and type-checking guards. The docstring clearly explains the mixin's purpose.


28-54: LGTM!

The mixin clearly documents its dependencies and uses type hints with ellipsis (...) for abstract property stubs.


56-82: LGTM!

The node selection logic is well-designed:

  1. Excludes head node to avoid resource contention with NATS/etcd
  2. Prefers later worker nodes to minimize conflict with frontends
  3. Falls back gracefully to head node with a warning when no alternatives exist
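That policy can be condensed into a few lines (a hypothetical reconstruction of select_metrics_node; the real method pulls the node list and head node from the runtime context):

```python
def select_metrics_node(nodes: list[str], head: str) -> str:
    """Pick a node for Prometheus following the policy above: avoid the
    head node (NATS/etcd live there), prefer the last worker node (least
    likely to host a frontend), and fall back to the head node only when
    it is the sole node available."""
    candidates = [n for n in nodes if n != head]
    if not candidates:
        return head  # single-node fallback; the real code logs a warning
    return candidates[-1]
```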

84-152: LGTM!

Well-structured Prometheus config generation:

  • Separate scrape jobs for workers and frontends enable independent monitoring
  • Rich labels (worker_mode, worker_index, node, node_rank) support powerful PromQL queries
  • Frontend job is conditionally added only when frontends exist
  • Clean YAML output with default_flow_style=False
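A generator with that structure might look like the following (a sketch with assumed job and target names; the real method also attaches the per-target labels listed above and derives targets from the topology objects):

```python
import yaml  # PyYAML, which the real mixin uses per the review


def generate_prometheus_config(
    worker_targets: list[str],
    frontend_targets: list[str],
    scrape_interval: str = "5s",
) -> str:
    """Emit a Prometheus YAML config with separate scrape jobs for
    workers and frontends, adding the frontend job only when frontends
    exist (mirroring the conditional noted in the review)."""
    config: dict = {
        "global": {"scrape_interval": scrape_interval},
        "scrape_configs": [
            {"job_name": "workers", "static_configs": [{"targets": worker_targets}]},
        ],
    }
    if frontend_targets:
        config["scrape_configs"].append(
            {"job_name": "frontends", "static_configs": [{"targets": frontend_targets}]}
        )
    return yaml.dump(config, default_flow_style=False)
```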

Comment on lines +49 to +55
# Mooncake benchmark using aiperf
benchmark:
  type: "sa-bench"
  isl: 100
  osl: 100
  concurrencies: "1x2x4"
  req_rate: "inf"

⚠️ Potential issue | 🟡 Minor

Misleading comment: benchmark type does not match description.

The comment says "Mooncake benchmark using aiperf" but the actual benchmark type is sa-bench. Update the comment to accurately describe the benchmark:

-# Mooncake benchmark using aiperf
+# Performance benchmark using sa-bench
 benchmark:
   type: "sa-bench"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-# Mooncake benchmark using aiperf
+# Performance benchmark using sa-bench
 benchmark:
   type: "sa-bench"
   isl: 100
   osl: 100
   concurrencies: "1x2x4"
   req_rate: "inf"
🤖 Prompt for AI Agents
In @recipies/qwen3-32b/debug.yaml around lines 49 - 55, The comment above the
benchmark block is misleading — it says "Mooncake benchmark using aiperf" but
the benchmark is configured as type: "sa-bench"; update the comment to
accurately describe the configured benchmark (refer to the benchmark block and
the type: "sa-bench" field) so the description matches the actual configuration.

Comment on lines +62 to +65
# Sleep before starting main benchmark
echo "Sleeping for 60 seconds before starting main benchmark..."
echo "This will help show a viisble difference in any graphs..."
sleep 60

⚠️ Potential issue | 🟡 Minor

Fix typo and consider making the sleep duration configurable.

Line 64 has a typo: "viisble" → "visible".

The 60-second sleep adds significant overhead to each benchmark run. Consider making this configurable via an environment variable or argument for scenarios where the delay isn't needed (e.g., CI pipelines).

Proposed fix
 # Sleep before starting main benchmark
-echo "Sleeping for 60 seconds before starting main benchmark..."
-echo "This will help show a viisble difference in any graphs..."
-sleep 60
+SLEEP_DURATION=${7:-60}
+if [ "$SLEEP_DURATION" -gt 0 ]; then
+    echo "Sleeping for ${SLEEP_DURATION} seconds before starting main benchmark..."
+    echo "This will help show a visible difference in any graphs..."
+    sleep "$SLEEP_DURATION"
+fi
🤖 Prompt for AI Agents
In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh around lines 62 - 65, Fix
the typo in the echo message ("viisble" → "visible") and make the fixed
60-second sleep configurable by introducing a variable (e.g., SLEEP_BEFORE_BENCH
or a positional arg) with a sensible default of 60; update the echo to print the
chosen duration and use that variable for the sleep call so CI or fast runs can
set the env var or pass an argument to skip or shorten the delay.

Comment on lines +154 to +219
def start_metrics_collection(self, frontend_topology: "FrontendTopology") -> NamedProcesses:
    """Start Prometheus metrics collection.

    Launches Prometheus on a non-head node to scrape metrics from all
    dynamo worker and frontend endpoints.

    Args:
        frontend_topology: Frontend topology for scraping frontend metrics.

    Returns:
        Dictionary of process names to ManagedProcess objects.
    """
    processes: NamedProcesses = {}

    # Check if metrics collection is enabled
    if not self.config.backend.metrics.enabled:
        logger.info("Metrics collection not enabled")
        return processes

    logger.info("Starting Prometheus metrics collection")

    # Select node for Prometheus
    prometheus_node = self.select_metrics_node()
    logger.info("Prometheus node: %s", prometheus_node)

    # Generate Prometheus config
    prometheus_config_content = self.generate_prometheus_config(frontend_topology)
    prometheus_config_path = self.runtime.log_dir / "prometheus.yml"
    prometheus_config_path.write_text(prometheus_config_content)
    logger.info("Generated Prometheus config: %s", prometheus_config_path)

    # Log file for Prometheus
    prometheus_log = self.runtime.log_dir / f"prometheus_{prometheus_node}.out"
    prometheus_port = self.config.backend.metrics.prometheus_port

    # Run Prometheus binary directly (like nats/etcd pattern)
    # Binary installed via `make setup`, mounted at /configs
    cmd = [
        "/configs/prometheus",
        "--config.file=/logs/prometheus.yml",
        f"--web.listen-address=:{prometheus_port}",
        "--storage.tsdb.path=/logs/prometheus_data",
        "--storage.tsdb.retention.time=1h",
    ]

    logger.info("Prometheus port: %d", prometheus_port)
    logger.info("Prometheus log: %s", prometheus_log)

    proc = start_srun_process(
        command=cmd,
        nodelist=[prometheus_node],
        output=str(prometheus_log),
        container_image=str(self.runtime.container_image),
        container_mounts=self.runtime.container_mounts,
        use_bash_wrapper=False,
    )

    processes["prometheus"] = ManagedProcess(
        name="prometheus",
        popen=proc,
        log_file=prometheus_log,
        node=prometheus_node,
        critical=False,  # Metrics collection is non-critical
    )

    logger.info("Prometheus started on %s:%d", prometheus_node, prometheus_port)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify configs directory mount setup in RuntimeContext
echo "=== Searching for 'configs' in runtime.py ==="
rg -n "configs" src/srtctl/core/runtime.py -C 3

echo ""
echo "=== Searching for 'container_mounts' setup ==="
rg -n "container_mounts" src/srtctl/core/runtime.py -C 2

echo ""
echo "=== Checking for any Prometheus-related mounts ==="
rg -n "prometheus|/configs" src/srtctl/cli/mixins/metrics_stage.py -C 2

Repository: ishandhanani/srt-slurm

Length of output: 5596


🏁 Script executed:

#!/bin/bash
# Check if there's any validation preventing metrics.enabled without /configs mount
echo "=== Searching for metrics.enabled validation ==="
rg -n "metrics\.enabled" src/srtctl -C 3 --type py

echo ""
echo "=== Check if Prometheus binary path is configurable ==="
rg -n "prometheus.*path|prometheus.*binary" src/srtctl -C 2 --type py

Repository: ishandhanani/srt-slurm

Length of output: 2395


🏁 Script executed:

#!/bin/bash
# Check what happens during setup - verify make setup handles Prometheus binary
echo "=== Check for Prometheus setup in Makefile ==="
rg -n "prometheus" Makefile -C 2

echo ""
echo "=== Check if configs directory is documented as required ==="
rg -n "configs|SRTCTL_SOURCE_DIR" README.md -C 2 | head -30

Repository: ishandhanani/srt-slurm

Length of output: 870


🏁 Script executed:

#!/bin/bash
# Check if there's any orchestration/deployment that ensures configs is available when metrics are used
echo "=== Check sbatch script handling of SRTCTL_SOURCE_DIR ==="
rg -n "SRTCTL_SOURCE_DIR" . -C 2 --type sh

echo ""
echo "=== Check how metrics configuration is typically used ==="
rg -n "metrics.*enabled.*true|metrics.*enabled\s*=" src/srtctl -C 1 --type py | head -30

Repository: ishandhanani/srt-slurm

Length of output: 425


🏁 Script executed:

#!/bin/bash
# Final check: verify if there's any error handling for missing Prometheus binary
echo "=== Check start_srun_process for error handling ==="
rg -n "def start_srun_process" src/srtctl -A 15 --type py

echo ""
echo "=== Check if there are tests that verify metrics setup ==="
fd -n "test.*metrics" . --type py

echo ""
echo "=== Verify the actual validation in schema.py ==="
sed -n '650,670p' src/srtctl/core/schema.py

Repository: ishandhanani/srt-slurm

Length of output: 2393


Address missing validation for Prometheus binary availability.

While the implementation is solid, there's a gap between configuration validation and runtime assumptions. The /configs mount is only conditionally added in RuntimeContext.from_config() (lines 149-155 of runtime.py):

if source_dir:
    configs_dir = Path(source_dir) / "configs"
    if configs_dir.exists():
        container_mounts[configs_dir.resolve()] = Path("/configs")

However, start_metrics_collection() hardcodes the Prometheus binary path (/configs/prometheus) without validating the mount exists. The schema validation only checks that metrics require a Dynamo frontend, not that the Prometheus binary is available.

This will fail at runtime if SRTCTL_SOURCE_DIR is unset or the configs directory doesn't exist—even though metrics are enabled in the config. Add a validation check in the config schema to ensure metrics cannot be enabled without a valid SRTCTL_SOURCE_DIR, or add a defensive check in start_metrics_collection() before launching Prometheus.

@ishandhanani
Owner Author

Use aiperf instead
