Merged
76 commits
065cb2a
feat: update k8s deploy yamls to use binary/python3
hhzhang16 Jul 11, 2025
aee478c
config part working
tedzhouhk Jul 11, 2025
9455ad1
feat: add component type worker and bump image
hhzhang16 Jul 12, 2025
f3dd01a
fix: merge conflicts
mohammedabdulwahhab Jul 14, 2025
7de97ef
fix: using health checks exposed by dynamo-run
mohammedabdulwahhab Jul 14, 2025
16fd7f2
Merge branch 'main' of github.com:ai-dynamo/dynamo into hannahz/dep-2…
hhzhang16 Jul 14, 2025
3a29913
Merge branch 'hannahz/dep-216-create-deploy-crds-for-vllm_v1-example'…
hhzhang16 Jul 14, 2025
51835db
fix: check for message in logs
mohammedabdulwahhab Jul 14, 2025
39b377f
Merge branch 'hannahz/dep-216-create-deploy-crds-for-vllm_v1-example'…
mohammedabdulwahhab Jul 14, 2025
dddb45f
Merge branch 'hannahz/dep-216-create-deploy-crds-for-vllm_v1-example'…
tedzhouhk Jul 14, 2025
34bc79c
define apis
tedzhouhk Jul 14, 2025
8c22d14
update script
tedzhouhk Jul 14, 2025
9856dde
fix: add dynamodeployment lib
mohammedabdulwahhab Jul 14, 2025
61a215b
fix: working client lib
mohammedabdulwahhab Jul 14, 2025
5141334
fix: working client lib
mohammedabdulwahhab Jul 14, 2025
8e25a29
integrate with utils.dynamo_deployment
tedzhouhk Jul 15, 2025
1d87164
fix: port forward works
mohammedabdulwahhab Jul 15, 2025
aaf4544
Merge branch 'hzhou/profile_vllmv1_k8s' of https://github.com/ai-dyna…
mohammedabdulwahhab Jul 15, 2025
65dec07
pc
tedzhouhk Jul 15, 2025
0af209b
add dep; bug fix
tedzhouhk Jul 15, 2025
918733a
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…
tedzhouhk Jul 15, 2025
3f900ef
staging, port forward not working
tedzhouhk Jul 15, 2025
bd12d40
stage
tedzhouhk Jul 15, 2025
7ac43a9
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…
mohammedabdulwahhab Jul 15, 2025
9971acf
fix: running script
mohammedabdulwahhab Jul 16, 2025
a5d8aca
fix: fix
mohammedabdulwahhab Jul 16, 2025
7b1d99a
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…
tedzhouhk Jul 16, 2025
f8f9363
add logic to find a free port
tedzhouhk Jul 16, 2025
8e292f6
feat: add Kubernetes service account configuration for SLA profiling …
hhzhang16 Jul 17, 2025
d62731f
feat: use service DNS for interfacing with deployments when profiling…
hhzhang16 Jul 17, 2025
a1aea5a
Revert "feat: use service DNS for interfacing with deployments when p…
hhzhang16 Jul 17, 2025
06bfe3b
feat: use service DNS instead of port forwarding for K8s-deployed SLA…
hhzhang16 Jul 18, 2025
6a2dcd0
feat: add service account configuration files and deployment changes
hhzhang16 Jul 16, 2025
606b4e3
feat: add profile_sla_rbac instead of the job
hhzhang16 Jul 16, 2025
0980195
feat: wip of profiling vllm_v1
hhzhang16 Jul 15, 2025
babf639
feat: wip of profiling sla job
hhzhang16 Jul 16, 2025
f934160
feat: use in-cluster service accounts if possible
hhzhang16 Jul 16, 2025
e911248
feat: use sa instead of pullsecret in job
hhzhang16 Jul 16, 2025
3d2284a
feat: working serviceaccount
hhzhang16 Jul 16, 2025
6062b8a
feat: wip of using dns instead of portforward if running in k8s
hhzhang16 Jul 17, 2025
fcfa5f4
feat: service dns fixes with k8s client
hhzhang16 Jul 17, 2025
aec7f6f
feat: fully replace port with base_url
hhzhang16 Jul 17, 2025
0155042
feat: resize profiling pvc to make it larger
hhzhang16 Jul 18, 2025
832f570
feat: wip of cleaning up deployments and testing
hhzhang16 Jul 18, 2025
ff96b9e
add try-catch waiting for deployment
tedzhouhk Jul 18, 2025
e95cecf
feat: skipping sweeps if they exist in the output dir
hhzhang16 Jul 18, 2025
cec3a0a
feat: cleaning up sla profiler deployment
hhzhang16 Jul 21, 2025
f16158a
feat: update readme
hhzhang16 Jul 21, 2025
5419885
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…
tedzhouhk Jul 21, 2025
d2b6b00
feat: clean up outlying DGDs upon SLA profiling failure (#2016)
hhzhang16 Jul 21, 2025
df4795f
Merge branch 'hzhou/profile_vllmv1_k8s' of github.com:ai-dynamo/dynam…
hhzhang16 Jul 21, 2025
79c7e58
feat: newest deployment yamls
hhzhang16 Jul 21, 2025
450d371
add debug info
tedzhouhk Jul 22, 2025
d8ffe1a
Merge branch 'hzhou/profile_vllmv1_k8s' of https://github.com/ai-dyna…
tedzhouhk Jul 22, 2025
b66c347
feat: fixes for CI
hhzhang16 Jul 22, 2025
615fdfb
Merge branches 'main' and 'hzhou/profile_vllmv1_k8s' of github.com:ai…
hhzhang16 Jul 22, 2025
e505288
feat: update deploy images
hhzhang16 Jul 22, 2025
5772013
feat: remove k8s.sh
hhzhang16 Jul 22, 2025
ad89cec
feat: remove readme
hhzhang16 Jul 22, 2025
8086166
chore: cleanup, add doc (#2053)
tedzhouhk Jul 22, 2025
65cc1e7
feat: add instructions on how to view images/profiling results
hhzhang16 Jul 22, 2025
f6ef37e
Merge branch 'main' of github.com:ai-dynamo/dynamo into hannahz/dep-2…
hhzhang16 Jul 22, 2025
a26f1bd
feat: shorten copyright headings
hhzhang16 Jul 22, 2025
62155db
Merge branch 'main' of github.com:ai-dynamo/dynamo into hannahz/dep-2…
hhzhang16 Jul 22, 2025
dd41754
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hanna…
tedzhouhk Jul 22, 2025
5f60d6a
Merge branch 'hannahz/dep-233-deploy-sla-profiler-to-k8s' of https://…
tedzhouhk Jul 22, 2025
f84d3c5
mypy
tedzhouhk Jul 23, 2025
0bb3389
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hanna…
tedzhouhk Jul 23, 2025
93d6734
docs: minor path change
hhzhang16 Jul 23, 2025
1716dab
docs: rewrite rbac
hhzhang16 Jul 23, 2025
8ceb061
docs: remove mentions of dynamo serve
hhzhang16 Jul 23, 2025
89d57ff
add note
tedzhouhk Jul 23, 2025
a7d28c6
Merge branch 'hannahz/dep-233-deploy-sla-profiler-to-k8s' of https://…
tedzhouhk Jul 23, 2025
10d6426
typo
tedzhouhk Jul 23, 2025
9413ef1
increase timeout, update yaml
tedzhouhk Jul 23, 2025
0a586b1
pc
tedzhouhk Jul 24, 2025
15 changes: 15 additions & 0 deletions benchmarks/profiler/deploy/profile_sla_binding.yaml
@@ -0,0 +1,15 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: profile-sla-binding
  namespace: ${NAMESPACE}
subjects:
  - kind: ServiceAccount
    name: profile-sla-sa
    namespace: ${NAMESPACE}
roleRef:
  kind: Role
  name: profile-sla-role
  apiGroup: rbac.authorization.k8s.io
48 changes: 48 additions & 0 deletions benchmarks/profiler/deploy/profile_sla_job.yaml
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: profile-sla
  namespace: ${NAMESPACE}
spec:
  template:
    spec:
      serviceAccountName: profile-sla-sa
      containers:
        - name: profile-sla
          image: ${DOCKER_IMAGE}
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
            - name: NATS_SERVER
              value: nats://${NAMESPACE}-nats:4222
            - name: ETCD_ENDPOINTS
              value: ${NAMESPACE}-etcd:2379
          command: ["python", "/workspace/benchmarks/profiler/profile_sla.py"]
          args:
            - --config
            - ${DGD_CONFIG_FILE}
            - --output-dir
            - /workspace/profiling_results
            - --namespace
            - ${NAMESPACE}
          volumeMounts:
            - name: output-volume
              mountPath: /workspace/profiling_results
      restartPolicy: Never
      volumes:
        - name: output-volume
          persistentVolumeClaim:
            claimName: profiling-pvc
  backoffLimit: 0
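The manifests in this PR carry `${NAMESPACE}`, `${DOCKER_IMAGE}`, and `${DGD_CONFIG_FILE}` placeholders, so they have to be rendered before they can be applied. The PR does not show which tool does that; the following is only a minimal sketch of one way to do it, using Python's standard `string.Template`. The function name `render_manifest`, the excerpt, and the example values are hypothetical and not part of the PR.

```python
from string import Template

# Flattened excerpt of lines from the Job manifest, kept only to show the
# placeholders. It is not a complete or valid manifest on its own.
JOB_EXCERPT = """\
namespace: ${NAMESPACE}
image: ${DOCKER_IMAGE}
value: nats://${NAMESPACE}-nats:4222
value: ${NAMESPACE}-etcd:2379
"""


def render_manifest(text: str, values: dict[str, str]) -> str:
    """Fill ${NAME} placeholders in a manifest string.

    Template.substitute raises KeyError when a placeholder has no value,
    so a forgotten variable fails loudly instead of leaving "${...}" behind.
    """
    return Template(text).substitute(values)


# Hypothetical values, chosen only for illustration.
rendered = render_manifest(
    JOB_EXCERPT,
    {"NAMESPACE": "my-namespace", "DOCKER_IMAGE": "example.com/dynamo:latest"},
)
print(rendered)
```

Failing on a missing value is a deliberate choice here: an unrendered `${NAMESPACE}` would otherwise end up in the NATS and etcd addresses that the profiler reads from its environment.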
19 changes: 19 additions & 0 deletions benchmarks/profiler/deploy/profile_sla_rbac.yaml
@@ -0,0 +1,19 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: profile-sla-role
  namespace: ${NAMESPACE}
rules:
  # DynamoGraphDeployment custom resources - needed for create/get/delete operations
  - apiGroups: ["nvidia.com"]
    resources: ["dynamographdeployments"]
    verbs: ["get", "create", "delete"]
  # Pods - needed for listing pods by label selector and getting logs
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
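To make the scope of this Role concrete, here is a small sketch that evaluates a request against the three rules above. It is a simplified model, not the Kubernetes authorizer: it only handles exact matches and the `*` wildcard, and ignores `resourceNames`, non-resource URLs, and aggregated roles. The helper names `is_allowed` and `_matches` are hypothetical.

```python
# Rules copied from the Role manifest above.
RULES = [
    {
        "apiGroups": ["nvidia.com"],
        "resources": ["dynamographdeployments"],
        "verbs": ["get", "create", "delete"],
    },
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["list"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]


def _matches(allowed: list[str], value: str) -> bool:
    """True if the value is listed explicitly or covered by the '*' wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules: list[dict], api_group: str, resource: str, verb: str) -> bool:
    """A request is allowed if any single rule matches group, resource, and verb."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


# Listing pods is granted, but reading a single pod object is not.
print(is_allowed(RULES, "", "pods", "list"))
print(is_allowed(RULES, "", "pods", "get"))
# Logs are a separate subresource with their own rule.
print(is_allowed(RULES, "", "pods/log", "get"))
```

Against a real cluster, `kubectl auth can-i` with `--as=system:serviceaccount:<namespace>:profile-sla-sa` gives the authoritative answer for the same questions.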
9 changes: 9 additions & 0 deletions benchmarks/profiler/deploy/profile_sla_sa.yaml
@@ -0,0 +1,9 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ServiceAccount
metadata:
  name: profile-sla-sa
  namespace: ${NAMESPACE}
imagePullSecrets:
  - name: nvcr-imagepullsecret
13 changes: 13 additions & 0 deletions benchmarks/profiler/deploy/profiling_pvc.yaml
@@ -0,0 +1,13 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: profiling-pvc
  namespace: ${NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi