
Conversation

@tedzhouhk
Contributor

@tedzhouhk tedzhouhk commented Jun 6, 2025

See sla_planner.md for documentation

Pending:

  • Support common components in dynamo serve
  • Testing in k8s

Summary by CodeRabbit

  • New Features

    • Introduced an SLA-based planner for dynamic autoscaling of prefill and decode workers based on predictive load forecasting and SLA targets.
    • Added Prometheus service integration for real-time metric collection and monitoring.
    • Implemented load prediction models (Constant, ARIMA, Prophet) and performance interpolators for more accurate scaling decisions.
    • Added new configuration options and YAML files for disaggregated planner setups.
    • Added a new Planner service component with asynchronous startup and minimal endpoint.
    • Integrated Planner and Prometheus dependencies into the frontend service.
  • Bug Fixes

    • Improved removal logic for worker processes to support non-blocking operation and better error handling.
  • Documentation

    • Added comprehensive documentation for both load-based and SLA-based planners, including deployment and profiling instructions.
    • Updated architecture docs to reference new planner documentation.
  • Refactor

    • Split and reorganized planner default configurations into base, load-based, and SLA-based classes.
    • Updated code to use new configuration classes and improved argument organization.
  • Chores

    • Updated dependencies to include packages required for load prediction and monitoring.
    • Enhanced Dockerfile to install Prometheus for monitoring support.

@copy-pr-bot

copy-pr-bot bot commented Jun 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tedzhouhk tedzhouhk marked this pull request as ready for review June 12, 2025 23:22
@tedzhouhk tedzhouhk requested a review from hutm as a code owner June 13, 2025 17:13
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
components/planner/src/dynamo/planner/__init__.py (1)

16-27: ⚠️ Potential issue

__all__ exports stale symbol & makes Ruff unhappy

PlannerDefaults no longer exists after the split; meanwhile the two new defaults classes are imported but unused, triggering F401.

-__all__ = [
-    "CircusController",
-    "LocalConnector",
-    "PlannerConnector",
-    "KubernetesConnector",
-    "PlannerDefaults",
-]
+__all__ = [
+    "CircusController",
+    "LocalConnector",
+    "PlannerConnector",
+    "KubernetesConnector",
+    "LoadPlannerDefaults",
+    "SLAPlannerDefaults",
+]

After this change the two imports are “used” via __all__, silencing Ruff and preventing consumers from importing a non-existent symbol.

🧰 Tools
🪛 Ruff (0.11.9)

26-26: dynamo.planner.defaults.LoadPlannerDefaults imported but unused

(F401)


26-26: dynamo.planner.defaults.SLAPlannerDefaults imported but unused

(F401)

🪛 Pylint (3.3.7)

[error] 21-21: Undefined variable name 'PlannerDefaults' in all

(E0603)

♻️ Duplicate comments (2)
components/planner/src/dynamo/planner/planner_sla.py (1)

32-35: Fixed 30-second sleep is still here

The hard-coded INIT_PLANNER_START_DELAY = 30 remains, despite previous feedback. This blocks the event-loop and hard-codes an environment assumption. Please make it configurable (default 0) or poll for readiness of dependent components instead.

Also applies to: 105-110
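
The requested fix can be sketched as a small readiness poll replacing the fixed sleep. This is a minimal illustration only: `wait_for_dependencies` and the `is_ready` callable are hypothetical stand-ins, not the actual symbols in planner_sla.py.

```python
import asyncio

async def wait_for_dependencies(is_ready, timeout=60.0, poll_interval=0.5):
    """Poll a readiness check instead of sleeping for a fixed 30 s.

    `is_ready` is a hypothetical callable standing in for a real health
    check on the dependent prefill/decode components.
    """
    elapsed = 0.0
    while elapsed < timeout:
        if is_ready():
            return True
        await asyncio.sleep(poll_interval)
        elapsed += poll_interval
    return False

# Demo: the dependency reports ready on the third poll.
state = {"calls": 0}

def fake_ready():
    state["calls"] += 1
    return state["calls"] >= 3

ready = asyncio.run(wait_for_dependencies(fake_ready, timeout=5.0, poll_interval=0.01))
print(ready)  # True
```

With this shape, the default delay can be zero and the planner starts as soon as its dependencies actually respond.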

components/planner/src/dynamo/planner/utils/planner_core.py (1)

221-235: Division by zero still possible for corrected ITL

corrected_itl = self.args.itl / self.d_correction_factor is executed without guarding against self.d_correction_factor == 0, an issue already raised earlier.

-            corrected_itl = self.args.itl / self.d_correction_factor
+            if self.d_correction_factor == 0:
+                logger.error("Decode correction factor is zero – skipping scaling step to avoid div-by-zero")
+                return
+            corrected_itl = self.args.itl / self.d_correction_factor
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90b1267 and b6f3161.

📒 Files selected for processing (9)
  • components/planner/src/dynamo/planner/__init__.py (1 hunks)
  • components/planner/src/dynamo/planner/defaults.py (1 hunks)
  • components/planner/src/dynamo/planner/planner_sla.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/load_predictor.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/perf_interpolation.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (1 hunks)
  • deploy/sdk/src/dynamo/sdk/cli/utils.py (3 hunks)
  • examples/llm/components/planner.py (2 hunks)
  • examples/llm/components/planner_service.py (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • examples/llm/components/planner_service.py
  • deploy/sdk/src/dynamo/sdk/cli/utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/llm/components/planner.py
  • components/planner/src/dynamo/planner/utils/load_predictor.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/planner/src/dynamo/planner/planner_sla.py (6)
components/planner/src/dynamo/planner/defaults.py (1)
  • SLAPlannerDefaults (38-46)
components/planner/src/dynamo/planner/utils/planner_core.py (2)
  • start_sla_planner (306-308)
  • Planner (51-303)
lib/bindings/python/src/dynamo/runtime/logging.py (1)
  • configure_dynamo_logging (77-105)
deploy/sdk/src/dynamo/sdk/lib/decorators.py (1)
  • async_on_start (95-99)
deploy/sdk/src/dynamo/sdk/core/protocol/interface.py (1)
  • ComponentType (43-46)
examples/llm/components/planner_service.py (4)
  • RequestType (32-33)
  • Planner (45-114)
  • async_init (103-109)
  • generate (112-114)
🪛 Ruff (0.11.9)
components/planner/src/dynamo/planner/__init__.py

26-26: dynamo.planner.defaults.LoadPlannerDefaults imported but unused

(F401)


26-26: dynamo.planner.defaults.SLAPlannerDefaults imported but unused

(F401)

🪛 Pylint (3.3.7)
components/planner/src/dynamo/planner/utils/planner_core.py

[refactor] 40-40: Too many instance attributes (8/7)

(R0902)


[refactor] 51-51: Too many instance attributes (18/7)

(R0902)


[refactor] 162-162: Too many branches (13/12)

(R0912)

components/planner/src/dynamo/planner/defaults.py

[refactor] 18-18: Too few public methods (0/2)

(R0903)


[refactor] 30-30: Too few public methods (0/2)

(R0903)


[refactor] 38-38: Too few public methods (0/2)

(R0903)

components/planner/src/dynamo/planner/planner_sla.py

[refactor] 37-37: Too few public methods (0/2)

(R0903)

components/planner/src/dynamo/planner/utils/perf_interpolation.py

[refactor] 56-56: Too many instance attributes (12/7)

(R0902)

🔇 Additional comments (3)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (1)

137-145: Axis-swap issue fixed – good catch

The earlier mix-up between x/y axes is now resolved (self.itl_interpolator[iy, ix] and same for throughput). ✅
No further concerns here.

components/planner/src/dynamo/planner/utils/planner_core.py (1)

152-156:

❌ Incorrect review comment

Formatting assumes metrics are non-None

{self.last_metrics.num_req:.2f} etc. will raise TypeError if Prometheus returns None. Either validate each metric or supply a fallback value before formatting.


🏁 Script executed:

#!/bin/bash
set -e

# Find references to last_metrics in planner_core.py
rg "last_metrics" -n components/planner/src/dynamo/planner/utils/planner_core.py

# Show lines around the formatting snippet
sed -n '140,180p' components/planner/src/dynamo/planner/utils/planner_core.py

# Locate any Metrics class or dataclass definitions under the dynamo folder
rg "class .*Metrics" -n components/planner/src/dynamo
rg "@dataclass" -n components/planner/src/dynamo

Length of output: 3623


🏁 Script executed:

#!/bin/bash
set -e

# Display the Metrics dataclass definition to inspect default values and types
sed -n '30,80p' components/planner/src/dynamo/planner/utils/planner_core.py

Length of output: 1896


🏁 Script executed:

#!/bin/bash
set -e

# Locate PrometheusAPIClient definition and related get_avg methods
rg "class PrometheusAPIClient" -n components/planner/src/dynamo/planner/utils/prometheus.py

# Show the file to inspect get_avg_* implementations
sed -n '1,200p' components/planner/src/dynamo/planner/utils/prometheus.py

Length of output: 4176


Ignore None‐check for metrics formatting
PrometheusAPIClient.get_avg_* methods always return a float (0.0 on error) and each metric is assigned immediately before the f-string logs. Therefore, self.last_metrics.num_req, isl, osl, ttft, and itl are guaranteed to be floats, and using .2f/.3f will not raise a TypeError. No additional validation or fallbacks are required here.

Likely an incorrect or invalid review comment.

components/planner/src/dynamo/planner/defaults.py (1)

18-28: All good – clear split between base / load / SLA defaults

The refactor cleanly separates shared and planner-specific parameters.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 18-18: Too few public methods (0/2)

(R0903)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (1)

72-73: Hard-coded token-capacity fixed – looks good

Loading max_kv_tokens from the profiling artifact instead of a constant resolves the skew highlighted earlier.

components/planner/src/dynamo/planner/utils/planner_core.py (4)

170-186: ⚠️ Potential issue

Division by zero still possible in correction-factor maths

expect_ttft or expect_itl can be 0 (or None – see previous comment), causing ZeroDivisionError.
The identical issue was flagged in the previous review but is still present.

if expect_ttft == 0:
    logger.warning("Expected TTFT is zero – skipping correction factor update")
else:
    self.p_correction_factor = self.last_metrics.ttft / expect_ttft

...
if expect_itl == 0:
    logger.warning("Expected ITL is zero – skipping correction factor update")
else:
    self.d_correction_factor = self.last_metrics.itl / expect_itl

220-226: ⚠️ Potential issue

Guard against zero decode correction factor

self.d_correction_factor may be zero after the above fix or on first run, leading to a crash when computing corrected_itl.

if self.d_correction_factor == 0:
    logger.error("Decode correction factor is zero – cannot compute corrected ITL")
    return
corrected_itl = self.args.itl / self.d_correction_factor

269-284: 🛠️ Refactor suggestion

Scaling race TODO still unresolved – risk of concurrent operations

Without a lock/flag, a new adjustment cycle may enqueue scaling while the previous one is ongoing, especially in Kubernetes where scale-up is slow.
Implement a simple asyncio.Lock or an atomic “scaling-in-progress” flag before releasing this feature.
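
The suggested lock can be sketched as follows, assuming an asyncio-based adjustment loop. The class and method names are illustrative, not the planner's actual API.

```python
import asyncio

class ScalingGuard:
    """Skip a new adjustment cycle while a previous scale operation is in flight."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.applied = []

    async def adjust(self, target_workers: int) -> bool:
        if self._lock.locked():
            # A previous scaling operation is still running: skip this cycle
            # instead of enqueuing a concurrent one.
            return False
        async with self._lock:
            # Simulate a slow scale-up (e.g. Kubernetes pod start).
            await asyncio.sleep(0.05)
            self.applied.append(target_workers)
            return True

async def main():
    guard = ScalingGuard()
    # Two overlapping adjustment cycles: the second is skipped.
    results = await asyncio.gather(guard.adjust(2), guard.adjust(3))
    return results, guard.applied

results, applied = asyncio.run(main())
print(results, applied)  # [True, False] [2]
```

The same effect can be had with a plain boolean "scaling-in-progress" flag, since asyncio is single-threaded; the lock just makes the intent explicit.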


125-156: ⚠️ Potential issue

Metric values may be None – logging & math will crash

PrometheusAPIClient can return None on query failure.
Formatting with :.2f / :.3f or later arithmetic (num_req, isl, etc.) will raise TypeError.

self.last_metrics.ttft = ...
...
# validate
for name, val in vars(self.last_metrics).items():
    if val is None:
        logger.warning("%s metric unavailable, skipping adjustment interval", name)
        return  # abort this cycle early
🧹 Nitpick comments (1)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (1)

158-161: Backward scan in find_best_throughput_per_gpu is O(resolution) – use vectorised search

A full Python for loop over resolution (100) is fine now, but a larger grid will regress.
Consider np.where to locate the first valid index in native code:

valid_ix = np.where(self.itl_interpolator[iy, :] <= itl)[0]
if valid_ix.size:
    ix = valid_ix[-1]          # right-most idx
    return self.thpt_interpolator[iy, ix]
return self.thpt_interpolator[iy, 0]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6f3161 and a4acd2b.

📒 Files selected for processing (3)
  • components/planner/src/dynamo/planner/__init__.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/perf_interpolation.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/planner/src/dynamo/planner/__init__.py
🧰 Additional context used
🪛 Pylint (3.3.7)
components/planner/src/dynamo/planner/utils/planner_core.py

[refactor] 40-40: Too many instance attributes (8/7)

(R0902)


[refactor] 51-51: Too many instance attributes (18/7)

(R0902)


[refactor] 162-162: Too many branches (13/12)

(R0912)

components/planner/src/dynamo/planner/utils/perf_interpolation.py

[refactor] 56-56: Too many instance attributes (12/7)

(R0902)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/architecture/planner.md (1)

38-40: Improve link text clarity for planner documentation.

Using generic link text like “docs” can be ambiguous. Consider adopting more descriptive labels, for example:

-1. Load-based planner: [docs](load_planner.md)
-2. SLA-based planner: [docs](sla_planner.md)
+1. Load-based planner documentation: see [load-based planner docs](load_planner.md)
+2. SLA-based planner documentation: see [SLA-based planner docs](sla_planner.md)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4acd2b and e024c59.

📒 Files selected for processing (1)
  • docs/architecture/planner.md (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm

@alec-flowers
Contributor

It still feels like it's in a POC phase. What are the largest trials we have run so far? Can we note down the largest-scale experiments and testing that we have done?

@tedzhouhk
Contributor Author

It still feels like it's in a POC phase. What are the largest trials we have run so far? Can we note down the largest-scale experiments and testing that we have done?

@alec-flowers Yes, the SLA planner in 0.3.1 will be an experimental release. I will run large-scale experiments in k8s to tune it, and the tuned version will land in the next release.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
components/planner/src/dynamo/planner/prometheus.py (3)

38-38: Doc-string still references “Frontend service”
The constructor comment is unchanged from the original copy-paste and is misleading for a Prometheus service.

-        """Initialize Frontend service with HTTP server and model configuration."""
+        """Spin-up an embedded Prometheus server using the service configuration."""

46-55: ⚠️ Potential issue

Temp file never cleaned up & not wrapped in a context manager
NamedTemporaryFile(delete=False) persists on disk but there is no teardown path, so every container run leaves an orphaned YAML file. This was flagged previously.

@@
-        self.temp_file = tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yml", delete=False
-        )
-        yaml.dump(self.config, self.temp_file)
-        self.temp_file.close()
+        with tempfile.NamedTemporaryFile(mode="w",
+                                         suffix=".yml",
+                                         delete=False) as tmp:
+            yaml.dump(self.config, tmp)
+            self._config_path = tmp.name

Add a destructor (or framework-specific shutdown hook) to remove the file:

+    def __del__(self):
+        """Terminate child process and delete temporary config on GC/shutdown."""
+        if getattr(self, "process", None) and self.process.poll() is None:
+            self.process.terminate()
+        if getattr(self, "_config_path", None):
+            try:
+                os.remove(self._config_path)
+            except OSError:
+                logger.warning("Failed to delete Prometheus config %s", self._config_path)

(Remember import os at the top.)

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 49-51: Consider using 'with' for resource-allocating operations

(R1732)


56-67: 🛠️ Refactor suggestion

subprocess.Popen launched without safety nets
No error handling for a missing prometheus binary and stdout/stderr are discarded, making failures silent. This was noted in the last review.

-        self.process = subprocess.Popen(
-            cmd,
-            stdout=None,
-            stderr=None,
-        )
+        try:
+            self.process = subprocess.Popen(
+                cmd,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.PIPE,
+                text=True,
+            )
+        except FileNotFoundError as exc:
+            logger.error("Prometheus binary not found: %s", exc)
+            raise

Consider a watchdog or self.process.wait() in a background task to restart on crash.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 63-67: Consider using 'with' for resource-allocating operations

(R1732)
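
The launch-with-error-handling plus watchdog pattern can be sketched like this. The function names are illustrative, and the demo uses the Python interpreter as a stand-in command rather than the real prometheus binary.

```python
import logging
import shutil
import subprocess
import sys

logger = logging.getLogger(__name__)

def launch(cmd):
    """Launch a child process, failing loudly if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )

def watch(process, restart):
    """Blocking watchdog: wait for exit and restart on a non-zero code.

    In the service this would run in a background thread or asyncio task
    rather than on the main path.
    """
    code = process.wait()
    if code != 0:
        logger.error("child exited with %s; restarting", code)
        return restart()
    return process

# Demo with the interpreter standing in for the prometheus binary.
proc = launch([sys.executable, "-c", "print('up')"])
out, _ = proc.communicate()
print(out.strip())  # up
```

Capturing stdout/stderr also means a crashing Prometheus leaves its last output available for the error log instead of disappearing silently.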

components/planner/src/dynamo/planner/utils/perf_interpolation.py (1)

62-78: ⚠️ Potential issue

Guard against resolution < 2 to avoid div-by-zero

compute_idx divides by (self.xi[1] - self.xi[0]). With resolution==1 this is 0, raising ZeroDivisionError. The same concern was raised earlier but is still unresolved.

-        self.resolution = resolution
+        if resolution < 2:
+            raise ValueError("resolution must be ≥ 2")
+        self.resolution = resolution
🧹 Nitpick comments (2)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (2)

16-18: Prefer module-level import for interpolate

Importing the full scipy package loads a large namespace unnecessarily. Import only the required sub-module to keep import time and memory footprint low:

-import scipy
+from scipy import interpolate

and adjust call sites accordingly (scipy.interpolate.* → interpolate.*).


146-161: Minor readability nit: make the early exit explicit

The reverse scan already returns as soon as a match is found; an inline comment makes the early-exit intent clearer:

-        for ix in range(self.resolution - 1, -1, -1):
-            if self.itl_interpolator[iy, ix] <= itl:
-                return self.thpt_interpolator[iy, ix]
+        for ix in range(self.resolution - 1, -1, -1):
+            if self.itl_interpolator[iy, ix] <= itl:
+                return self.thpt_interpolator[iy, ix]  # early exit
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e024c59 and 6a9437e.

📒 Files selected for processing (2)
  • components/planner/src/dynamo/planner/prometheus.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/perf_interpolation.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
components/planner/src/dynamo/planner/prometheus.py

[refactor] 49-51: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 63-67: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 36-36: Too few public methods (1/2)

(R0903)

components/planner/src/dynamo/planner/utils/perf_interpolation.py

[refactor] 56-56: Too many instance attributes (12/7)

(R0902)

🔇 Additional comments (2)
components/planner/src/dynamo/planner/prometheus.py (1)

61-61: Log level correctly set to INFO – good catch
Previous warning-level log has been adjusted to logger.info, matching reviewer feedback.

components/planner/src/dynamo/planner/utils/perf_interpolation.py (1)

31-45: Double-check TTFT units

DecodeInterpolator converts ITL from ms → s (line 97) but PrefillInterpolator leaves TTFT untouched. If the downstream planner expects both latency metrics in the same unit, this asymmetry will introduce subtle bugs.

Confirm the unit of prefill_ttft in the NPZ; convert to seconds if it is stored in milliseconds.

-        self.prefill_ttft = raw_data["prefill_ttft"]
+        self.prefill_ttft = raw_data["prefill_ttft"] / 1000  # ms → s

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a9437e and b08ddfd.

📒 Files selected for processing (2)
  • container/Dockerfile.vllm (1 hunks)
  • docs/architecture/planner.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • container/Dockerfile.vllm
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: Build and Test - vllm

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Hongkuan Zhou <[email protected]>
@tedzhouhk tedzhouhk merged commit 3f53a78 into main Jun 14, 2025
11 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/sla_planner_v2 branch June 14, 2025 01:40
tedzhouhk added a commit that referenced this pull request Jun 26, 2025
Signed-off-by: Hongkuan Zhou <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Co-authored-by: Alec <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>