@hhzhang16 (Contributor) commented Jul 21, 2025

Overview:

This MR revamps the SLA profiler to use DynamoGraphDeployments (DGDs) and to run in Kubernetes instead of locally. Key improvements are error handling, result caching, and K8s integration.

Details:

  • Profiler launches DGDs from YAML files instead of using dynamo serve
  • Automatically cleans up DynamoGraphDeployments when profiling jobs fail or are interrupted
  • Adds a --skip-existing-results flag that resumes an interrupted profiling sweep by loading previously completed results
  • Adds a complete K8s Job definition that runs profile_sla.py and stores results in a PVC that the planner will be able to access
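
To illustrate the resume behavior described for --skip-existing-results, here is a minimal sketch, not the code in this MR. It assumes each sweep point writes one JSON result file; `run_sweep`, `profile_fn`, and the file layout are hypothetical names chosen for the example.

```python
import json
from pathlib import Path


def run_sweep(configs, output_dir, profile_fn, skip_existing=False):
    """Profile each config, reusing saved results when skip_existing is set.

    All names are hypothetical. `profile_fn` stands in for whatever launches
    a deployment, benchmarks it, and returns a JSON-serializable result.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, config in configs.items():
        result_file = output_dir / f"{name}.json"
        if skip_existing and result_file.exists():
            # Resume: load the previously completed result instead of re-running.
            results[name] = json.loads(result_file.read_text())
            continue
        result = profile_fn(config)
        # Persist right away so an interruption loses at most the current run.
        result_file.write_text(json.dumps(result))
        results[name] = result
    return results
```

Writing each result as soon as it completes is what makes resuming cheap: a rerun with `skip_existing=True` only profiles the configurations that have no result file yet.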

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Introduced Kubernetes deployment and profiling automation tools, including new scripts and Python modules for managing deployments, profiling jobs, and result caching.
    • Added several Kubernetes manifests for SLA-based profiling and RBAC permissions.
    • Added a new CLI entry point for Kubernetes component management.
  • Enhancements

    • Improved profiler benchmarking with asynchronous deployment management and enhanced error handling.
    • Updated deployment configurations with new model versions, resource allocations, and logging options.
    • Expanded documentation with detailed deployment and profiling instructions.
  • Bug Fixes

    • Improved configuration handling and removed deprecated options in profiler utilities.
  • Chores

    • Added new Python and system dependencies to support asynchronous and Kubernetes operations.
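
As an illustration of the asynchronous deployment management and cleanup-on-failure mentioned above, the following sketch shows one common pattern: an async context manager that always deletes what it created. This is an assumption about the general approach, not the implementation in this MR; `temporary_deployment` and the `client` object with async `create` and `delete` methods are hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def temporary_deployment(client, manifest):
    """Create a deployment and delete it again when the block exits.

    `client` is a hypothetical object exposing async `create(manifest)` and
    `delete(name)`; it is not the API introduced by this MR.
    """
    name = await client.create(manifest)
    try:
        yield name
    finally:
        # Runs on normal exit, on exceptions, and on task cancellation.
        await client.delete(name)
```

A profiling run would then wrap its benchmark in `async with temporary_deployment(client, manifest) as name:` so that a failed benchmark does not leave the deployment behind.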

hhzhang16 and others added 30 commits July 11, 2025 09:28
… of github.com:ai-dynamo/dynamo into hannahz/dep-216-create-deploy-crds-for-vllm_v1-example
@grahamking
Contributor

@hhzhang16 examples/vllm is moving to components/backends/vllm in here: #1983

Could you hold off a few minutes and then rebase? The files will be in components/backends/vllm/deploy/.

@tedzhouhk
Contributor

> @hhzhang16 examples/vllm is moving to components/backends/vllm in here: #1983
>
> Could you hold off a few minutes and then rebase? The files will be in components/backends/vllm/deploy/.

sure, thanks for letting us know!

@grahamking
Contributor

> > @hhzhang16 examples/vllm is moving to components/backends/vllm in here: #1983
> > Could you hold off a few minutes and then rebase? The files will be in components/backends/vllm/deploy/.
>
> sure, thanks for letting us know!

It's ready!

@hhzhang16
Contributor Author

@tedzhouhk @grahamking rebased!

@grahamking
Contributor

> @tedzhouhk @grahamking rebased!

I, euh, maybe a tiny bit changed them again. #2055

@hhzhang16
Contributor Author

Rebased again haha

@tedzhouhk mentioned this pull request Jul 22, 2025
@hhzhang16 merged commit fe718fd into main Jul 24, 2025
12 checks passed
@hhzhang16 deleted the hannahz/dep-233-deploy-sla-profiler-to-k8s branch July 24, 2025 16:07
7 participants