deploy/inference-gateway/example/README.md (3 additions & 1 deletion)

@@ -138,8 +138,10 @@ export GATEWAY_URL=<Gateway-URL>

To test the gateway in minikube, use the following command:
```diff
-minikube tunnel &
+# start minikube tunnel
+minikube tunnel

 # in a separate terminal
 GATEWAY_URL=$(kubectl get svc inference-gateway -o yaml -o jsonpath='{.spec.clusterIP}')
 echo $GATEWAY_URL
```

**@atchernych** (Contributor) commented on Jul 24, 2025:

line 35 has obsolete data:

```bash
export DEPLOYMENT_NAME=llm-agg1
yq eval '
  .metadata.name = env(DEPLOYMENT_NAME) |
  .spec.services[].extraPodSpec.mainContainer.image = env(VLLM_RUNTIME_IMAGE)
' examples/vllm_v0/deploy/agg.yaml > examples/vllm_v0/deploy/agg1.yaml
```
**Comment on lines 145 to 146** (Contributor):

⚠️ Potential issue

**`kubectl get svc` command is broken and returns an unusable IP**

`kubectl` only accepts one `-o` flag, so `-o yaml -o jsonpath=...` fails. Even if the command ran, `.spec.clusterIP` yields an internal-only address that is not reachable from the host. For a LoadBalancer service exposed via `minikube tunnel`, you need the external IP (`.status.loadBalancer.ingress[0].ip`) or simply `minikube service … --url`.

Suggested fix:

```diff
-# in a separate terminal
-GATEWAY_URL=$(kubectl get svc inference-gateway -o yaml -o jsonpath='{.spec.clusterIP}')
+# in a separate terminal
+# grab the external IP assigned by the tunnel
+GATEWAY_URL=$(kubectl get svc inference-gateway \
+  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
 echo $GATEWAY_URL
```

or, for simplicity:

```bash
GATEWAY_URL=$(minikube service inference-gateway --url | head -n1)
```
🤖 Prompt for AI Agents

```
In deploy/inference-gateway/example/README.md around lines 145 to 146, the
kubectl command uses multiple -o flags which is invalid and retrieves an
internal cluster IP that is not accessible externally. Replace the command with
one that fetches the external IP from .status.loadBalancer.ingress[0].ip or,
more simply, use the minikube service inference-gateway --url command to get a
reachable gateway URL.
```
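As a quick sanity check after applying either fix (a hypothetical snippet, assuming `minikube tunnel` is running and the service is of type `LoadBalancer`):

```bash
# EXTERNAL-IP should show the tunnel-assigned address, not <pending>
kubectl get svc inference-gateway
# the gateway should answer on the models route
curl -s $GATEWAY_URL/v1/models
```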
deploy/inference-gateway/example/two_models/README.md (68 additions & 0 deletions)

@@ -0,0 +1,68 @@
Get Gateway URL
```bash
GATEWAY_URL=$(kubectl get svc inference-gateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $GATEWAY_URL
```

Get models
```bash
curl $GATEWAY_URL/v1/models | jq .

# Sample response
{
  "object": "list",
  "data": [
    {
      "id": "google/gemma-3-1b-it",
      "object": "object",
      "created": 1752695871,
      "owned_by": "nvidia"
    },
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "object",
      "created": 1752695871,
      "owned_by": "nvidia"
    }
  ]
}
```
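To pull out just the model IDs from that response (a small convenience, assuming `jq` is installed):

```bash
curl -s $GATEWAY_URL/v1/models | jq -r '.data[].id'
# google/gemma-3-1b-it
# Qwen/Qwen3-0.6B
```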


Send inference request to Qwen model:

```bash
curl $GATEWAY_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```


Send inference request to Gemma model:

```bash
curl $GATEWAY_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
@@ -15,11 +15,11 @@
```diff
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceModel
 metadata:
-  name: deep-seek-model
+  name: gemma3
   namespace: default
 spec:
   criticality: Critical
-  modelName: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+  modelName: google/gemma-3-1b-it
   poolRef:
     group: inference.networking.x-k8s.io
     kind: InferencePool
```
@@ -0,0 +1,26 @@
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen3
  namespace: default
spec:
  criticality: Critical
  modelName: Qwen/Qwen3-0.6B
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: dynamo-deepseek
```
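Once both InferenceModel resources are applied, a quick way to confirm they registered against the pool (a sketch; assumes the gateway-api inference-extension CRDs, which define the `inferencemodels` resource, are installed):

```bash
# both models should be listed, pointing at the dynamo-deepseek pool
kubectl get inferencemodels -n default
```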
deploy/inference-gateway/example/two_models/two_models.yaml (140 additions & 0 deletions)

@@ -0,0 +1,140 @@
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-v1-agg
spec:
  services:
    Frontend:
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
```
**Contributor comment:**

🛠️ Refactor suggestion

**Readiness probe always succeeds – loses rollout safety.**

`command: ["sh","-c","exit 0"]` marks every pod ready even when the app is down. Expose a real health endpoint, or remove the probe and let the liveness probe handle restarts.

🤖 Prompt for AI Agents

```
In deploy/inference-gateway/example/two_models/two_models.yaml around lines 31
to 39, the readiness probe uses a command that always exits with 0, causing the
pod to be marked ready even if the app is down. To fix this, replace the command
with a real health check that verifies the application's readiness, such as an
HTTP GET to a health endpoint, or remove the readiness probe entirely so that
the liveness probe manages pod restarts.
```
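A minimal sketch of what a real probe could look like, assuming the component exposes an HTTP health endpoint (the path and port below are placeholders, not part of this PR):

```yaml
# hypothetical replacement: mark the pod ready only when /health answers
readinessProbe:
  httpGet:
    path: /health   # assumed endpoint; adjust to what the service actually serves
    port: 8000      # assumed port
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```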

```yaml
      dynamoNamespace: vllm-v1-agg
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        mainContainer:
          image: gitlab-master.nvidia.com:5005/aire/microservices/compoundai/dynamo:1c03404f2624186523529b8d4ca04731b60aa8b9-31776852-vllm_v1-amd64
          workingDir: /workspace/examples/vllm
          args:
            - dynamo
            - run
            - in=http
            - out=dyn
            - --http-port
            - "8000"
            - --router-mode
            - kv
    GemmaDecodeWorker:
      envFromSecret: hf-token-secret
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      dynamoNamespace: vllm-v1-agg
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: gitlab-master.nvidia.com:5005/aire/microservices/compoundai/dynamo:1c03404f2624186523529b8d4ca04731b60aa8b9-31776852-vllm_v1-amd64
          workingDir: /workspace/examples/vllm
          args:
            - "python3 components/main.py --model google/gemma-3-1b-it --enforce-eager --endpoint dyn://dynamo.gemma.generate 2>&1 | tee /tmp/vllm.log"
    QwenDecodeWorker:
      envFromSecret: hf-token-secret
```
**Contributor comment:**

⚠️ Potential issue

**Worker args executed without shell – redirection & pipes break.**

The whole string is one argv element; `2>&1 | tee …` will never be interpreted.

Either wrap with an explicit shell:

```yaml
command:
  - /bin/sh
  - -c
args:
  - |
    python3 components/main.py --model google/gemma-3-1b-it \
      --enforce-eager --endpoint dyn://dynamo.gemma.generate 2>&1 | tee /tmp/vllm.log
```

or drop the redirection/pipe.

🤖 Prompt for AI Agents

```
In deploy/inference-gateway/example/two_models/two_models.yaml around lines 97
to 103, the args for mainContainer include shell redirection and piping as a
single argument, which won't work because the command is executed without a
shell. To fix this, replace the args with a command array that runs /bin/sh with
the -c option, and pass the entire python command with redirection and pipe as a
single string argument to the shell. This ensures the shell interprets the
redirection and piping correctly.
```
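The same fix presumably applies to the QwenDecodeWorker below, whose args use the identical pattern (a sketch mirroring the suggestion above):

```yaml
command:
  - /bin/sh
  - -c
args:
  - |
    python3 components/main.py --model Qwen/Qwen3-0.6B \
      --enforce-eager --endpoint dyn://dynamo.qwen.generate 2>&1 | tee /tmp/vllm.log
```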

```yaml
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      dynamoNamespace: vllm-v1-agg
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: gitlab-master.nvidia.com:5005/aire/microservices/compoundai/dynamo:1c03404f2624186523529b8d4ca04731b60aa8b9-31776852-vllm_v1-amd64
          workingDir: /workspace/examples/vllm
          args:
            - "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --endpoint dyn://dynamo.qwen.generate 2>&1 | tee /tmp/vllm.log"
```
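To try the two-model deployment end to end, a hedged sketch (paths follow this PR's example layout and may differ; assumes `GATEWAY_URL` was set as above):

```bash
# deploy the graph and both InferenceModel manifests, then smoke-test the gateway
kubectl apply -f deploy/inference-gateway/example/two_models/
kubectl get pods -w    # wait for Frontend, GemmaDecodeWorker, QwenDecodeWorker to be Running
curl -s $GATEWAY_URL/v1/models | jq .
```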