Conversation
📝 Walkthrough

This pull request introduces Prometheus metrics collection support into the SRT orchestration framework. It adds metrics configuration to the SGLang backend, creates a new MetricsStageMixin for managing Prometheus setup, updates the frontend_stage return signature to provide topology information, integrates metrics startup into the sweep orchestrator, adds dynamic model naming to benchmarks, and includes comprehensive test coverage for the new metrics functionality.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant SweepOrch as SweepOrchestrator
    participant FrontendMixin as FrontendStageMixin
    participant WorkerMixin as WorkerStageMixin
    participant MetricsMixin as MetricsStageMixin
    participant Prometheus as Prometheus
    participant Backend as Backend Worker

    User->>SweepOrch: start_sweep()
    SweepOrch->>FrontendMixin: start_frontend(registry)
    FrontendMixin-->>SweepOrch: (processes, topology)
    SweepOrch->>WorkerMixin: start_workers()
    WorkerMixin->>Backend: build_worker_command(metrics_enabled=true)
    Backend-->>WorkerMixin: command + --enable-metrics
    WorkerMixin-->>SweepOrch: worker_processes
    SweepOrch->>MetricsMixin: start_metrics_collection(topology)
    MetricsMixin->>MetricsMixin: select_metrics_node()
    MetricsMixin->>MetricsMixin: generate_prometheus_config(topology)
    MetricsMixin->>Prometheus: start_srun_process(prometheus)
    Prometheus-->>MetricsMixin: prometheus_process
    MetricsMixin-->>SweepOrch: metrics_processes
    SweepOrch->>SweepOrch: run_benchmark()
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @recipies/qwen3-32b/debug.yaml:
- Around line 49-55: The comment above the benchmark block is misleading — it
says "Mooncake benchmark using aiperf" but the benchmark is configured as type:
"sa-bench"; update the comment to accurately describe the configured benchmark
(refer to the benchmark block and the type: "sa-bench" field) so the description
matches the actual configuration.
In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh:
- Around line 62-65: Fix the typo in the echo message ("viisble" → "visible")
and make the fixed 60-second sleep configurable by introducing a variable (e.g.,
SLEEP_BEFORE_BENCH or a positional arg) with a sensible default of 60; update
the echo to print the chosen duration and use that variable for the sleep call
so CI or fast runs can set the env var or pass an argument to skip or shorten
the delay.
🧹 Nitpick comments (2)
pyproject.toml (1)

93-94: Consider scoping the suppression more narrowly. Globally ignoring `invalid-return-type` may mask legitimate type errors elsewhere in the codebase. `ty` supports per-file rule overrides via `[[tool.ty.overrides]]` in pyproject.toml using glob patterns. Consider limiting this rule to the specific mixin files where protocol stubs with `...` bodies are actually used (e.g., `src/srtctl/cli/mixins/**/*.py`).

src/srtctl/cli/mixins/metrics_stage.py (1)
73-74: Address or remove the TODO comment. The comment `# TODO: ishan` appears to be a development artifact. If there's pending work, consider creating an issue to track it; otherwise, remove the comment. Do you want me to open an issue to track this TODO or should it be removed?
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
Makefile
pyproject.toml
recipies/qwen3-32b/debug.yaml
src/srtctl/backends/base.py
src/srtctl/backends/sglang.py
src/srtctl/benchmarks/sa_bench.py
src/srtctl/benchmarks/scripts/sa-bench/bench.sh
src/srtctl/cli/do_sweep.py
src/srtctl/cli/mixins/__init__.py
src/srtctl/cli/mixins/frontend_stage.py
src/srtctl/cli/mixins/metrics_stage.py
src/srtctl/cli/mixins/worker_stage.py
src/srtctl/core/schema.py
tests/test_configs.py
tests/test_e2e.py
tests/test_frontend_topology.py
tests/test_metrics_stage.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Python 3.10+ with modern syntax including `|` unions and `match` statements
Use Ruff for linting and formatting with configuration in pyproject.toml, targeting 120 character line length
Use type hints everywhere and use `ty` for type checking
Use frozen dataclasses for all configuration objects with `@dataclass(frozen=True)` to ensure immutability
Prefer `typing.Protocol` over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling
Use `marshmallow_dataclass` to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization
Use factory classmethods named `from_*` for object construction (e.g., `RuntimeContext.from_config()`, `RunMetadata.from_json()`) to keep `__init__` simple
Import type-only dependencies under `if TYPE_CHECKING:` guard to avoid circular imports, using string annotations for forward references
Use `@property` decorators for derived/computed values instead of storing computed state
Use decorator-based registry pattern for extensible registration (e.g., `@register_benchmark("sa-bench")`) to allow new implementations without modifying existing code
Use `TypedDict` for typing dictionaries from JSON/external sources where the structure cannot be controlled
Files:
tests/test_e2e.py
src/srtctl/cli/mixins/worker_stage.py
tests/test_metrics_stage.py
src/srtctl/cli/mixins/frontend_stage.py
src/srtctl/core/schema.py
src/srtctl/benchmarks/sa_bench.py
tests/test_configs.py
src/srtctl/backends/base.py
src/srtctl/cli/mixins/metrics_stage.py
src/srtctl/cli/mixins/__init__.py
tests/test_frontend_topology.py
src/srtctl/cli/do_sweep.py
src/srtctl/backends/sglang.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use `check_dynamo_health(response_json, expected_prefill, expected_decode)` for Dynamo backend health checks
Use `check_sglang_router_health(response_json, expected_prefill, expected_decode)` for SGLang router health checks, passing `expected_prefill=0, expected_decode=num_agg` for aggregated mode
Files:
tests/test_e2e.py
tests/test_metrics_stage.py
tests/test_configs.py
tests/test_frontend_topology.py
🧠 Learnings (6)
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use type hints everywhere and use `ty` for type checking
Applied to files:
pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use Python 3.10+ with modern syntax including `|` unions and `match` statements
Applied to files:
pyproject.toml
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Prefer `typing.Protocol` over Abstract Base Classes (ABC) for interface definitions to enable duck typing without inheritance coupling
Applied to files:
pyproject.toml
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench configuration, do not allow req_rate or concurrencies to be 0. Validate that both values are positive integers and raise a clear error if an attempt is made to set them to 0 (e.g., raise ValueError or validation error). Consider adding unit tests for negative and zero inputs and ensure the benchmark runs fail fast with informative messages.
Applied to files:
src/srtctl/benchmarks/sa_bench.py
📚 Learning: 2026-01-08T09:37:29.361Z
Learnt from: CR
Repo: ishandhanani/srt-slurm PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T09:37:29.361Z
Learning: Applies to **/*.py : Use `marshmallow_dataclass` to combine dataclasses with marshmallow schemas for type-safe config loading with validation, using custom fields for polymorphic deserialization
Applied to files:
tests/test_configs.py
📚 Learning: 2026-01-13T09:40:46.949Z
Learnt from: weireweire
Repo: ishandhanani/srt-slurm PR: 76
File: src/srtctl/benchmarks/sglang_bench.py:45-56
Timestamp: 2026-01-13T09:40:46.949Z
Learning: In sglang-bench benchmark configuration, users will not set req_rate or concurrencies to 0, as these values would not make semantic sense in the benchmarking context (0 concurrency or 0 requests per second are not valid use cases).
Applied to files:
src/srtctl/benchmarks/scripts/sa-bench/bench.sh
🧬 Code graph analysis (9)
src/srtctl/cli/mixins/worker_stage.py (3)
  src/srtctl/cli/do_sweep.py (1): backend (57-59)
  src/srtctl/cli/mixins/frontend_stage.py (1): backend (63-65)
  src/srtctl/core/schema.py (1): enabled (398-400)
src/srtctl/cli/mixins/frontend_stage.py (1)
  src/srtctl/core/processes.py (1): ManagedProcess (26-64)
src/srtctl/core/schema.py (4)
  src/srtctl/cli/mixins/worker_stage.py (1): backend (41-43)
  src/srtctl/cli/do_sweep.py (1): backend (57-59)
  src/srtctl/cli/mixins/frontend_stage.py (1): backend (63-65)
  src/srtctl/backends/sglang.py (1): SGLangProtocol (60-306)
src/srtctl/benchmarks/sa_bench.py (1)
  src/srtctl/core/schema.py (1): served_model_name (722-735)
tests/test_configs.py (2)
  src/srtctl/backends/sglang.py (3): SGLangProtocol (60-306), SGLangServerConfig (34-45), MetricsConfig (49-56)
  src/srtctl/core/schema.py (4): FrontendConfig (559-576), ResourceConfig (242-312), SrtConfig (606-740), enabled (398-400)
src/srtctl/cli/mixins/metrics_stage.py (7)
  src/srtctl/core/processes.py (1): ManagedProcess (26-64)
  src/srtctl/core/slurm.py (1): start_srun_process (138-257)
  src/srtctl/cli/mixins/frontend_stage.py (3): FrontendTopology (29-45), backend_processes (68-70), backend (63-65)
  src/srtctl/core/runtime.py (1): RuntimeContext (64-234)
  src/srtctl/core/schema.py (2): SrtConfig (606-740), enabled (398-400)
  src/srtctl/core/topology.py (2): Endpoint (87-126), Process (130-166)
  src/srtctl/cli/mixins/worker_stage.py (3): backend_processes (46-48), endpoints (51-53), backend (41-43)
src/srtctl/cli/mixins/__init__.py (1)
  src/srtctl/cli/mixins/metrics_stage.py (1): MetricsStageMixin (28-220)
tests/test_frontend_topology.py (1)
  src/srtctl/cli/mixins/frontend_stage.py (1): start_frontend (176-202)
src/srtctl/cli/do_sweep.py (1)
  src/srtctl/cli/mixins/metrics_stage.py (2): MetricsStageMixin (28-220), start_metrics_collection (154-220)
🔇 Additional comments (36)
src/srtctl/benchmarks/sa_bench.py (2)
65-66: LGTM! Minor note on fallback behavior. The logic correctly prioritizes `served_model_name`. Based on the `served_model_name` property in `schema.py`, it always returns a value (falls back to `Path(self.model.path).name`), so the `or config.model.path` fallback should never trigger. This is acceptable as defensive coding.
73-82: LGTM! The command arguments correctly align with the updated `bench.sh` signature: `endpoint model isl osl concurrencies [req_rate]`.

Makefile (2)
80-87: LGTM! The Prometheus download and installation follows the established pattern for NATS and ETCD, with proper retry logic, extraction, permissions, and cleanup.
5-5: Prometheus 3.8.1 is a valid release. Version 3.8.1 was released on December 16, 2025, and the download URL format in the Makefile is correct. No changes needed.
tests/test_e2e.py (1)
312-312: LGTM! The filter correctly excludes `debug.yaml` from parameterized test recipes while maintaining the existence check fallback.

src/srtctl/benchmarks/scripts/sa-bench/bench.sh (3)
6-6: LGTM! The comment correctly documents the updated argument order.
10-15: LGTM! Argument parsing correctly reflects the new positional order with `MODEL_NAME` as the second parameter.
31-39: LGTM! The endpoint verification now correctly uses the dynamic `${MODEL_NAME}` variable.

src/srtctl/core/schema.py (2)
545-546: LGTM: Platform-specific optimization for arm64 builds. The conditional RUSTFLAGS injection for arm64 is correctly structured and will improve performance of the maturin build on ARM platforms.
647-659: LGTM: Metrics validation logic is correct. The validation properly enforces the Dynamo frontend requirement for metrics collection. The early-return pattern keeps the code clean and readable.
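The early-return validation style this comment describes can be sketched standalone. The function name and error message below are illustrative; the real check lives in `SrtConfig` in `schema.py` and goes through marshmallow validation.

```python
def validate_metrics_config(frontend: str, metrics_enabled: bool) -> None:
    """Reject metrics collection unless the Dynamo frontend is configured.

    Minimal sketch of the early-return pattern: bail out immediately when
    there is nothing to validate, then enforce the one real constraint.
    """
    if not metrics_enabled:
        return  # nothing to check when metrics are off
    if frontend != "dynamo":
        raise ValueError("metrics.enabled requires the Dynamo frontend")

validate_metrics_config("sglang", metrics_enabled=False)  # ok: metrics off
validate_metrics_config("dynamo", metrics_enabled=True)   # ok: dynamo frontend
```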
recipies/qwen3-32b/debug.yaml (1)
8-13: Verify tensor-parallel configuration. With `agg_nodes: 2`, `gpus_per_node: 4`, and `agg_workers: 8`, each worker gets `(2 * 4) / 8 = 1` GPU, which correctly matches `tensor-parallel-size: 1` in the sglang_config.

src/srtctl/cli/mixins/__init__.py (1)
11-24: LGTM: MetricsStageMixin properly exported. The new mixin is correctly added to the module exports, following the existing pattern for stage mixins.
src/srtctl/backends/base.py (1)
70-82: LGTM: Protocol signature updated with backward-compatible default. The `metrics_enabled` parameter is correctly added with a default of `False`, ensuring backward compatibility for any existing backend implementations.

tests/test_frontend_topology.py (2)
238-242: LGTM: Test correctly unpacks the new tuple return type. The test properly handles the updated `start_frontend` signature that now returns `(processes, topology)`.
256-261: LGTM: Consistent handling of the new return type. All test methods are consistently updated to unpack the tuple return value.
tests/test_configs.py (2)
11-15: LGTM: Imports correctly added for metrics validation tests. The imports are properly organized and include all necessary types for the new test suite.
391-414: LGTM: Comprehensive test coverage for metrics validation. The tests properly verify both the error case (sglang frontend with metrics enabled) and the success case (dynamo frontend with metrics enabled). The assertion in `test_metrics_with_dynamo_frontend_ok` confirms the config was created successfully with metrics enabled.

src/srtctl/cli/mixins/frontend_stage.py (1)
176-202: LGTM: Return type change is clean and well-documented. The method now returns the computed topology alongside the processes, enabling downstream code (like `MetricsStageMixin.start_metrics_collection`) to use topology information for Prometheus configuration. The change is minimal and the docstring accurately reflects the new return type.

src/srtctl/cli/mixins/worker_stage.py (1)
109-109: LGTM! The metrics_enabled flag is correctly propagated from the backend configuration to the worker command builder, enabling Prometheus metrics collection when configured.
src/srtctl/cli/do_sweep.py (3)
24-24: LGTM! Import addition for MetricsStageMixin is correct.
42-43: LGTM! The MetricsStageMixin is properly integrated into the class hierarchy, enabling the orchestrator to manage Prometheus metrics collection.
203-209: The metrics collection integration follows correct patterns and type signatures. The `start_frontend` method properly returns `tuple[list[ManagedProcess], FrontendTopology]`, enabling the unpacking at line 203. The subsequent calls are correct:

- `frontend_procs` (list) is iterated with `add_process()` for individual registration
- `metrics_procs` is returned as `NamedProcesses` (a `dict[str, ManagedProcess]`) and passed to `add_processes()`, which expects this exact type

Execution order is logical: frontend infrastructure is initialized before metrics collection, which needs the topology for Prometheus scrape target configuration.
src/srtctl/backends/sglang.py (4)
48-57: LGTM! The MetricsConfig dataclass follows best practices:
- Frozen for immutability
- Sensible defaults (disabled by default, standard Prometheus port 9090)
- ClassVar Schema for marshmallow integration
95-97: LGTM! Proper use of `field(default_factory=MetricsConfig)` for the nested frozen dataclass.
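The pattern these two comments describe can be shown in a few lines. The field names `enabled` and `prometheus_port` and their defaults come from the review above; the `ServerConfig` wrapper is a simplified stand-in for the actual `SGLangServerConfig`.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricsConfig:
    """Frozen metrics settings: disabled by default, standard Prometheus port."""
    enabled: bool = False
    prometheus_port: int = 9090

@dataclass(frozen=True)
class ServerConfig:
    # default_factory is required for a dataclass-typed default:
    # a bare `MetricsConfig()` default would be shared across instances.
    metrics: MetricsConfig = field(default_factory=MetricsConfig)

cfg = ServerConfig()
print(cfg.metrics.enabled, cfg.metrics.prometheus_port)  # False 9090
```

Because both dataclasses are frozen, any attempt to mutate `cfg.metrics.enabled` after construction raises `FrozenInstanceError`, which is what makes these configs safe to share across stages.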
203-216: LGTM! The `metrics_enabled` parameter is properly added with a default value of `False` for backward compatibility, and the docstring is appropriately updated.
302-305: The `--enable-metrics` flag is correctly implemented and valid for both backends. The flag is properly supported by both `sglang.launch_server` and `dynamo.sglang`. SGLang exposes Prometheus metrics at `/metrics` on the server port when enabled, and Dynamo similarly supports the flag with metrics prefixed as `sglang:...` and `dynamo_*`. The conditional addition is correct.

tests/test_metrics_stage.py (6)
19-45: LGTM! The `make_config` helper provides a clean factory for creating test configurations with sensible defaults and supports customizing metrics-related parameters.
48-62: LGTM! The `make_runtime` helper creates a minimal RuntimeContext suitable for unit testing the metrics stage logic.
65-72: LGTM! Simple and effective helper for creating test FrontendTopology instances.
75-117: LGTM! Comprehensive test coverage for metrics node selection logic:
- Single node fallback to head
- Two nodes prefers non-head
- Many nodes selects last worker
- Explicit head node exclusion verification
120-239: LGTM! Thorough test coverage for Prometheus config generation:
- YAML validity and structure
- Scrape interval configuration
- Worker and frontend target generation
- Port format verification (`host:port`)
- Label presence validation
242-347: LGTM! Well-structured integration tests for metrics collection startup:
- Verifies disabled mode returns empty dict
- Verifies enabled mode starts Prometheus on non-head node with correct criticality
- Validates port configuration propagation
- Confirms Prometheus config file creation with expected content
src/srtctl/cli/mixins/metrics_stage.py (4)
1-26: LGTM! Clean module structure with appropriate imports and type-checking guards. The docstring clearly explains the mixin's purpose.
28-54: LGTM! The mixin clearly documents its dependencies and uses type hints with ellipsis (`...`) for abstract property stubs.
56-82: LGTM! The node selection logic is well-designed:
- Excludes head node to avoid resource contention with NATS/etcd
- Prefers later worker nodes to minimize conflict with frontends
- Falls back gracefully to head node with a warning when no alternatives exist
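The selection rules listed above can be sketched as a standalone function. The signature is illustrative — the real `select_metrics_node` reads its node list from the runtime context rather than taking arguments:

```python
def select_metrics_node(nodes: list[str], head_node: str) -> str:
    """Pick a node for Prometheus following the rules above: prefer the
    last non-head node (least likely to host NATS/etcd or a frontend),
    and fall back to the head node only when it is the sole node."""
    candidates = [n for n in nodes if n != head_node]
    if candidates:
        return candidates[-1]  # later worker nodes minimize frontend conflict
    return head_node  # single-node allocation: no alternative exists

assert select_metrics_node(["n0"], "n0") == "n0"              # fallback to head
assert select_metrics_node(["n0", "n1"], "n0") == "n1"        # non-head preferred
assert select_metrics_node(["n0", "n1", "n2"], "n0") == "n2"  # last worker wins
```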
84-152: LGTM! Well-structured Prometheus config generation:
- Separate scrape jobs for workers and frontends enable independent monitoring
- Rich labels (worker_mode, worker_index, node, node_rank) support powerful PromQL queries
- Frontend job is conditionally added only when frontends exist
- Clean YAML output with `default_flow_style=False`
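A minimal standalone sketch of this kind of generator — job names, the target format, and the function signature are assumptions for illustration, not the mixin's actual output (which also attaches worker_mode/worker_index/node labels):

```python
import yaml

def build_prometheus_config(worker_targets, frontend_targets, scrape_interval="5s"):
    """Render a prometheus.yml string with separate worker/frontend scrape jobs."""
    config = {
        "global": {"scrape_interval": scrape_interval},
        "scrape_configs": [
            {"job_name": "workers",
             "static_configs": [{"targets": list(worker_targets)}]},
        ],
    }
    if frontend_targets:  # frontend job is added only when frontends exist
        config["scrape_configs"].append(
            {"job_name": "frontends",
             "static_configs": [{"targets": list(frontend_targets)}]}
        )
    # block-style YAML, as the review notes, via default_flow_style=False
    return yaml.dump(config, default_flow_style=False)

print(build_prometheus_config(["node1:30000"], ["node0:8000"]))
```

Keeping workers and frontends in separate jobs means each side can be queried (or silenced) independently in PromQL, which is the design choice the review highlights.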
```yaml
# Mooncake benchmark using aiperf
benchmark:
  type: "sa-bench"
  isl: 100
  osl: 100
  concurrencies: "1x2x4"
  req_rate: "inf"
```
Misleading comment: benchmark type does not match description.
The comment says "Mooncake benchmark using aiperf" but the actual benchmark type is sa-bench. Update the comment to accurately describe the benchmark:
```diff
-# Mooncake benchmark using aiperf
+# Performance benchmark using sa-bench
 benchmark:
   type: "sa-bench"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# Performance benchmark using sa-bench
benchmark:
  type: "sa-bench"
  isl: 100
  osl: 100
  concurrencies: "1x2x4"
  req_rate: "inf"
```
🤖 Prompt for AI Agents
In @recipies/qwen3-32b/debug.yaml around lines 49 - 55, The comment above the
benchmark block is misleading — it says "Mooncake benchmark using aiperf" but
the benchmark is configured as type: "sa-bench"; update the comment to
accurately describe the configured benchmark (refer to the benchmark block and
the type: "sa-bench" field) so the description matches the actual configuration.
```bash
# Sleep before starting main benchmark
echo "Sleeping for 60 seconds before starting main benchmark..."
echo "This will help show a viisble difference in any graphs..."
sleep 60
```
Fix typo and consider making the sleep duration configurable.
Line 64 has a typo: "viisble" → "visible".
The 60-second sleep adds significant overhead to each benchmark run. Consider making this configurable via an environment variable or argument for scenarios where the delay isn't needed (e.g., CI pipelines).
Proposed fix:

```diff
 # Sleep before starting main benchmark
-echo "Sleeping for 60 seconds before starting main benchmark..."
-echo "This will help show a viisble difference in any graphs..."
-sleep 60
+SLEEP_DURATION=${7:-60}
+if [ "$SLEEP_DURATION" -gt 0 ]; then
+  echo "Sleeping for ${SLEEP_DURATION} seconds before starting main benchmark..."
+  echo "This will help show a visible difference in any graphs..."
+  sleep "$SLEEP_DURATION"
+fi
```

🤖 Prompt for AI Agents
In @src/srtctl/benchmarks/scripts/sa-bench/bench.sh around lines 62 - 65, Fix
the typo in the echo message ("viisble" → "visible") and make the fixed
60-second sleep configurable by introducing a variable (e.g., SLEEP_BEFORE_BENCH
or a positional arg) with a sensible default of 60; update the echo to print the
chosen duration and use that variable for the sleep call so CI or fast runs can
set the env var or pass an argument to skip or shorten the delay.
```python
def start_metrics_collection(self, frontend_topology: "FrontendTopology") -> NamedProcesses:
    """Start Prometheus metrics collection.

    Launches Prometheus on a non-head node to scrape metrics from all
    dynamo worker and frontend endpoints.

    Args:
        frontend_topology: Frontend topology for scraping frontend metrics.

    Returns:
        Dictionary of process names to ManagedProcess objects.
    """
    processes: NamedProcesses = {}

    # Check if metrics collection is enabled
    if not self.config.backend.metrics.enabled:
        logger.info("Metrics collection not enabled")
        return processes

    logger.info("Starting Prometheus metrics collection")

    # Select node for Prometheus
    prometheus_node = self.select_metrics_node()
    logger.info("Prometheus node: %s", prometheus_node)

    # Generate Prometheus config
    prometheus_config_content = self.generate_prometheus_config(frontend_topology)
    prometheus_config_path = self.runtime.log_dir / "prometheus.yml"
    prometheus_config_path.write_text(prometheus_config_content)
    logger.info("Generated Prometheus config: %s", prometheus_config_path)

    # Log file for Prometheus
    prometheus_log = self.runtime.log_dir / f"prometheus_{prometheus_node}.out"
    prometheus_port = self.config.backend.metrics.prometheus_port

    # Run Prometheus binary directly (like nats/etcd pattern)
    # Binary installed via `make setup`, mounted at /configs
    cmd = [
        "/configs/prometheus",
        "--config.file=/logs/prometheus.yml",
        f"--web.listen-address=:{prometheus_port}",
        "--storage.tsdb.path=/logs/prometheus_data",
        "--storage.tsdb.retention.time=1h",
    ]

    logger.info("Prometheus port: %d", prometheus_port)
    logger.info("Prometheus log: %s", prometheus_log)

    proc = start_srun_process(
        command=cmd,
        nodelist=[prometheus_node],
        output=str(prometheus_log),
        container_image=str(self.runtime.container_image),
        container_mounts=self.runtime.container_mounts,
        use_bash_wrapper=False,
    )

    processes["prometheus"] = ManagedProcess(
        name="prometheus",
        popen=proc,
        log_file=prometheus_log,
        node=prometheus_node,
        critical=False,  # Metrics collection is non-critical
    )

    logger.info("Prometheus started on %s:%d", prometheus_node, prometheus_port)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify configs directory mount setup in RuntimeContext
echo "=== Searching for 'configs' in runtime.py ==="
rg -n "configs" src/srtctl/core/runtime.py -C 3
echo ""
echo "=== Searching for 'container_mounts' setup ==="
rg -n "container_mounts" src/srtctl/core/runtime.py -C 2
echo ""
echo "=== Checking for any Prometheus-related mounts ==="
rg -n "prometheus|/configs" src/srtctl/cli/mixins/metrics_stage.py -C 2
```

Repository: ishandhanani/srt-slurm
Length of output: 5596
🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any validation preventing metrics.enabled without /configs mount
echo "=== Searching for metrics.enabled validation ==="
rg -n "metrics\.enabled" src/srtctl -C 3 --type py
echo ""
echo "=== Check if Prometheus binary path is configurable ==="
rg -n "prometheus.*path|prometheus.*binary" src/srtctl -C 2 --type py
```

Repository: ishandhanani/srt-slurm
Length of output: 2395
🏁 Script executed:

```shell
#!/bin/bash
# Check what happens during setup - verify make setup handles Prometheus binary
echo "=== Check for Prometheus setup in Makefile ==="
rg -n "prometheus" Makefile -C 2
echo ""
echo "=== Check if configs directory is documented as required ==="
rg -n "configs|SRTCTL_SOURCE_DIR" README.md -C 2 | head -30
```

Repository: ishandhanani/srt-slurm
Length of output: 870
🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any orchestration/deployment that ensures configs is available when metrics are used
echo "=== Check sbatch script handling of SRTCTL_SOURCE_DIR ==="
rg -n "SRTCTL_SOURCE_DIR" . -C 2 --type sh
echo ""
echo "=== Check how metrics configuration is typically used ==="
rg -n "metrics.*enabled.*true|metrics.*enabled\s*=" src/srtctl -C 1 --type py | head -30
```

Repository: ishandhanani/srt-slurm
Length of output: 425
🏁 Script executed:

```shell
#!/bin/bash
# Final check: verify if there's any error handling for missing Prometheus binary
echo "=== Check start_srun_process for error handling ==="
rg -n "def start_srun_process" src/srtctl -A 15 --type py
echo ""
echo "=== Check if there are tests that verify metrics setup ==="
fd -n "test.*metrics" . --type py
echo ""
echo "=== Verify the actual validation in schema.py ==="
sed -n '650,670p' src/srtctl/core/schema.py
```

Repository: ishandhanani/srt-slurm
Length of output: 2393
Address missing validation for Prometheus binary availability.
While the implementation is solid, there's a gap between configuration validation and runtime assumptions. The `/configs` mount is only conditionally added in `RuntimeContext.from_config()` (lines 149-155 of runtime.py):

```python
if source_dir:
    configs_dir = Path(source_dir) / "configs"
    if configs_dir.exists():
        container_mounts[configs_dir.resolve()] = Path("/configs")
```

However, `start_metrics_collection()` hardcodes the Prometheus binary path (`/configs/prometheus`) without validating the mount exists. The schema validation only checks that metrics require a Dynamo frontend, not that the Prometheus binary is available.
This will fail at runtime if SRTCTL_SOURCE_DIR is unset or the configs directory doesn't exist—even though metrics are enabled in the config. Add a validation check in the config schema to ensure metrics cannot be enabled without a valid SRTCTL_SOURCE_DIR, or add a defensive check in start_metrics_collection() before launching Prometheus.
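One possible shape for the defensive check: the helper name and error text are hypothetical, and the mount-table shape mirrors `container_mounts` from the `RuntimeContext` snippet above (host path mapped to container path).

```python
from pathlib import Path

def ensure_prometheus_mount(container_mounts: dict[Path, Path]) -> None:
    """Fail fast when metrics are enabled but nothing is mounted at /configs,
    rather than letting the srun'd Prometheus process die inside the
    container with a cryptic 'file not found'."""
    if not any(dest == Path("/configs") for dest in container_mounts.values()):
        raise RuntimeError(
            "metrics.enabled is set but no /configs mount exists; "
            "set SRTCTL_SOURCE_DIR and run `make setup` to install Prometheus"
        )

ensure_prometheus_mount({Path("/home/user/srt/configs"): Path("/configs")})  # ok
```

Calling such a helper at the top of `start_metrics_collection()` (or from the schema validator, if the environment is visible there) would surface the misconfiguration before any processes are launched.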
Use aiperf instead
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.