Merged
84 commits
ccf7194
Added pre-merge-e2e.yml setup before and after script.
dmitry-tokarev-nv Sep 25, 2025
66ca5e3
cleanup
dmitry-tokarev-nv Sep 25, 2025
35b19b8
Merge branch 'main' of github.com:ai-dynamo/dynamo into dtokarev-e2e
dillon-cullinan Sep 25, 2025
54ba81b
Add docker tag and push
dillon-cullinan Sep 25, 2025
3a695da
vix env vars
dmitry-tokarev-nv Sep 26, 2025
e5a94cd
Added ACR login. Renamed some vars
dmitry-tokarev-nv Sep 26, 2025
1d67927
secret. >> secrets.
dmitry-tokarev-nv Sep 26, 2025
9899736
Add sudo, add container push to sglang and vllm
dillon-cullinan Sep 26, 2025
ab487f2
Install uv, fix action path
dillon-cullinan Sep 26, 2025
1aa0bc8
Update tag and push action
dillon-cullinan Sep 26, 2025
a24246c
Fix var names
dillon-cullinan Sep 26, 2025
ea22c83
Move input vars into steps directly
dillon-cullinan Sep 26, 2025
bf1dd25
Fix vars
dillon-cullinan Sep 26, 2025
7caa1a6
Set run shell
dillon-cullinan Sep 26, 2025
5f7d507
Remove global vars
dillon-cullinan Sep 26, 2025
7808410
trailing space
dmitry-tokarev-nv Sep 26, 2025
3e14f48
remove uv, set -x
dmitry-tokarev-nv Sep 26, 2025
e9c181d
added missed chmod and fail-fast: false
dmitry-tokarev-nv Sep 26, 2025
b528c14
fail-fast: false
dmitry-tokarev-nv Sep 26, 2025
fcbdfe3
Install awscli for docker push
dillon-cullinan Sep 29, 2025
50d9d26
Fix chmod on kubectl
dillon-cullinan Sep 29, 2025
ccdc3ad
Move docker tag/push before tests
dillon-cullinan Sep 29, 2025
29c8d7c
Remove AWS login, it is now provided by pod service account
dillon-cullinan Sep 29, 2025
019bf03
Re-add login...
dillon-cullinan Sep 29, 2025
b8ce72c
Fix if statement for ACR tag and push
dillon-cullinan Sep 29, 2025
a6ac1d5
Comment out helm repo add/update in beforescript
dillon-cullinan Oct 1, 2025
32f9cc5
Comment out env and cd command
dillon-cullinan Oct 1, 2025
376f979
update location
nv-anants Oct 2, 2025
86a93f3
run everyhting in before scr
nv-anants Oct 2, 2025
96b098e
debug
nv-anants Oct 2, 2025
0c36041
add operator build
nv-anants Oct 2, 2025
4bab7e3
add secrets
nv-anants Oct 2, 2025
f4ef624
add everthing
nv-anants Oct 2, 2025
8540646
use docker build
nv-anants Oct 2, 2025
d3f4bec
push image
nv-anants Oct 2, 2025
f0d1f14
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 3, 2025
952f8d7
revert build.sh changes
nv-anants Oct 3, 2025
cb6e511
deploy only once
nv-anants Oct 3, 2025
fc666c3
tests
nv-anants Oct 3, 2025
2c41ff9
deps
nv-anants Oct 3, 2025
7f1a16b
merge
nv-anants Oct 3, 2025
ef60527
test
nv-anants Oct 3, 2025
fa90a05
cleanup
nv-anants Oct 3, 2025
d06a371
add test
nv-anants Oct 3, 2025
878850c
debug
nv-anants Oct 3, 2025
281fa3b
debug2
nv-anants Oct 3, 2025
c0900b4
sleep less
nv-anants Oct 3, 2025
8cd1c83
debug
nv-anants Oct 3, 2025
b5fc5fa
switch image
nv-anants Oct 3, 2025
53343df
test
nv-anants Oct 3, 2025
11430d5
fix: fix
mohammedabdulwahhab Oct 6, 2025
5ec9b0f
fix: use hf secret
mohammedabdulwahhab Oct 6, 2025
67bef0a
Merge branch 'main' into dtokarev-e2e
dillon-cullinan Oct 7, 2025
f5c5fde
Comment out workflow metrics temporarily
dillon-cullinan Oct 7, 2025
6ebc651
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 8, 2025
2605c34
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 8, 2025
e80844c
temp: disable cleanup
nv-anants Oct 8, 2025
a83dc3c
fix: merge commit
mohammedabdulwahhab Oct 15, 2025
0e870b7
refactor
nv-anants Oct 15, 2025
23a4bac
remove unused var
nv-anants Oct 15, 2025
8d95a4c
revert back backend part
nv-anants Oct 15, 2025
c5b51c4
Revert "revert back backend part"
nv-anants Oct 15, 2025
a95c0c3
Revert "remove unused var"
nv-anants Oct 15, 2025
c8c28db
Revert "refactor"
nv-anants Oct 15, 2025
92c4bad
fix: run tests in parallel, re-enable cleanup
mohammedabdulwahhab Oct 15, 2025
72d3980
fix: build and push images for backends
mohammedabdulwahhab Oct 15, 2025
3a71653
fix: build and push images for backends
mohammedabdulwahhab Oct 15, 2025
83c7c82
fix: build and push images for backends
mohammedabdulwahhab Oct 15, 2025
a5eabe1
fix: add more profiles
mohammedabdulwahhab Oct 15, 2025
3b3e27b
fix: add more profiles
mohammedabdulwahhab Oct 15, 2025
1e61ebc
fix: add more profiles
mohammedabdulwahhab Oct 16, 2025
28b1a2d
fix: pull main
mohammedabdulwahhab Oct 16, 2025
e07f5cf
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 16, 2025
92aa1f8
fix namespace assignment
nv-anants Oct 16, 2025
6cbd271
remove sleep in cleanup
nv-anants Oct 16, 2025
9e64dad
revert all commented parts
nv-anants Oct 16, 2025
0661846
skip on doc only changes
nv-anants Oct 16, 2025
5d62ffb
precommit
nv-anants Oct 16, 2025
404056a
missed comments
nv-anants Oct 16, 2025
2715603
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 16, 2025
a36d11a
up the timneoutt
nv-anants Oct 16, 2025
550a0bf
make checks non blocking
nv-anants Oct 16, 2025
8b61754
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 16, 2025
f1ba94b
Merge branch 'main' into dtokarev-e2e
nv-anants Oct 17, 2025
262 changes: 262 additions & 0 deletions .github/workflows/container-validation-backends.yml
@@ -321,3 +321,265 @@ jobs:
        run: |
          # Upload complete workflow metrics including container metrics
          python3 .github/workflows/upload_complete_workflow_metrics.py

  deploy-test-vllm:
    runs-on: cpu-amd-m5-2xlarge
    if: needs.changed-files.outputs.has_code_changes == 'true'
    needs: [changed-files, operator, vllm]
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        profile:
          - agg
          - agg_router
          - disagg
          - disagg_router
    name: deploy-test-vllm (${{ matrix.profile }})
    env:
      FRAMEWORK: vllm
      DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com
      DEPLOYMENT_FILE: "deploy/${{ matrix.profile }}.yaml"
      MODEL_NAME: "Qwen/Qwen3-0.6B"
    steps: &deploy-test-steps
      - uses: actions/checkout@v4
      - name: Set namespace and install dependencies
        run: |
          # Set namespace using FRAMEWORK env var
          PROFILE_SANITIZED="${{ matrix.profile }}"
          PROFILE_SANITIZED="${PROFILE_SANITIZED//_/-}"
          echo "NAMESPACE=gh-job-id-${{ github.run_id }}-${FRAMEWORK}-${PROFILE_SANITIZED}" >> $GITHUB_ENV

          set -x
          # Install dependencies
          sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq

          # Install yq
          echo "Installing yq..."
          curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
          sudo chmod 755 yq
          sudo mv yq /usr/local/bin/
          # Install Helm
          echo "Installing Helm..."
          curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
          sudo chmod 700 get_helm.sh
          sudo ./get_helm.sh
          # Install kubectl
          curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
          sudo chmod 755 kubectl
          sudo mv kubectl /usr/local/bin/

          # Setup kubeconfig
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"
          kubectl config current-context
      - name: Deploy Operator
        run: |
          set -x
          export KUBECONFIG=$(pwd)/.kubeconfig

          # Create a namespace for this job
          echo "Creating an ephemeral namespace..."
          kubectl delete namespace $NAMESPACE || true
          kubectl create namespace $NAMESPACE || true
          echo "Attaching the labels for secrets and cleanup"
          kubectl label namespaces ${NAMESPACE} nscleanup/enabled=true nscleanup/ttl=7200 gitlab-imagepull=enabled ngc-api=enabled nvcr-imagepull=enabled --overwrite=true
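          # Note (assumption): these labels appear to be consumed by cluster-side tooling that is
          # not defined in this workflow, e.g. a namespace-cleanup controller reading
          # nscleanup/ttl (presumably seconds, so roughly 2 hours) and pre-provisioned
          # image-pull/NGC secrets keyed off the remaining labels.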

          # Set the namespace as default
          kubectl config set-context --current --namespace=$NAMESPACE

          # Check if Istio is installed
          kubectl get pods -n istio-system
          # Check if default storage class exists
          kubectl get storageclass

          # Install Helm chart
          export IMAGE_TAG=$(cat build.env)
          echo $IMAGE_TAG
          export VIRTUAL_ENV=/opt/dynamo/venv
          export KUBE_NS=$NAMESPACE
          export ISTIO_ENABLED=true
          export ISTIO_GATEWAY=istio-system/ingress-alb
          export VIRTUAL_SERVICE_SUPPORTS_HTTPS=true
          export DYNAMO_CLOUD=https://${NAMESPACE}.${DYNAMO_INGRESS_SUFFIX}

          # Install dynamo env secrets
          kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true
          # Create docker pull secret for operator image
          kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE}
          # Install helm dependencies
          helm repo add bitnami https://charts.bitnami.com/bitnami
          cd deploy/cloud/helm/platform/
          helm dep build .
          # Install platform with namespace restriction for single profile testing
          helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
            --set dynamo-operator.namespaceRestriction.enabled=true \
            --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
            --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
            --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
            --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
          # Wait for all deployments to be ready
          timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch
          cd -

          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE

          cd components/backends/$FRAMEWORK
          export FRAMEWORK_RUNTIME_IMAGE="${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64"
          export KUBE_NS=$NAMESPACE
          export GRAPH_NAME=$(yq e '.metadata.name' $DEPLOYMENT_FILE)
          # Update the deployment file in-place
          yq -i '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE
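          # Illustrative only: after this in-place edit every service's
          # extraPodSpec.mainContainer.image points at something like
          # <ACR_HOSTNAME>/ai-dynamo/dynamo:<commit-sha>-<framework>-amd64, i.e. the image the
          # upstream build jobs are expected to have pushed.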

          # Debug: Show updated deployment file
          echo "=== UPDATED DEPLOYMENT FILE ==="
          cat $DEPLOYMENT_FILE

          # Apply the updated file
          kubectl apply -n $KUBE_NS -f $DEPLOYMENT_FILE

          # --- Wait for all pods in the dynamo graph deployment to be ready ---
          sleep 20
          # Get the deployment name from the file
          export GRAPH_NAME=$(yq e '.metadata.name' $DEPLOYMENT_FILE)
          echo "Waiting for all pods with label nvidia.com/dynamo-graph-deployment-name: $GRAPH_NAME"
          # Wait for all pods with the deployment label to be ready
          kubectl wait --for=condition=ready pod -l "nvidia.com/dynamo-graph-deployment-name=$GRAPH_NAME" -n ${KUBE_NS} --timeout=1000s

          # Debug: Show final pod statuses for the deployment
          echo "=== FINAL POD STATUSES ==="
          kubectl get pods -l "nvidia.com/dynamo-graph-deployment-name=$GRAPH_NAME" -n $KUBE_NS -o wide
          echo ""

          kubectl get all -n $KUBE_NS
          export FRONTEND_POD=$(kubectl get pods -n ${KUBE_NS} | grep "frontend" | sort -k1 | tail -n1 | awk '{print $1}')
          export CONTAINER_PORT=$(kubectl get pod $FRONTEND_POD -n ${KUBE_NS} -o jsonpath='{.spec.containers[0].ports[?(@.name=="http")].containerPort}')
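          # The jsonpath above reads the containerPort named "http" from the frontend pod's first
          # container; the local port 8000 used for the port-forward below is an arbitrary choice
          # for this job.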
echo "Container port is ${CONTAINER_PORT}"
kubectl port-forward pod/$FRONTEND_POD 8000:${CONTAINER_PORT} -n ${KUBE_NS} &
export LLM_URL="http://localhost:8000"
sleep 10 # Give port-forward time to establish the connection
echo "LLM URL: ${LLM_URL}"
echo "MODEL NAME: ${MODEL_NAME}"
# Wait until the model is available in the /v1/models response
MAX_ATTEMPTS=30
ATTEMPT=1
while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
MODELS_RESPONSE=$(curl -s --retry 5 --retry-delay 2 --retry-connrefused "${LLM_URL}/v1/models")
if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" '.data[]?.id == $MODEL_NAME' >/dev/null 2>&1; then
echo "Model $MODEL_NAME is available in /v1/models"
break
fi
echo "Waiting for model $MODEL_NAME to be available in /v1/models... (attempt $ATTEMPT/$MAX_ATTEMPTS)"
sleep 5
ATTEMPT=$((ATTEMPT + 1))
done
if [ $ATTEMPT -gt $MAX_ATTEMPTS ]; then
echo "Model $MODEL_NAME not found in /v1/models after $MAX_ATTEMPTS attempts"
echo "Last response: $MODELS_RESPONSE"
exit 1
fi
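          # The polling above allows roughly MAX_ATTEMPTS * 5 s = 150 s for the model to register
          # (plus curl's own connection retries) before the job is failed.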
          RESPONSE=$(curl -s -N --no-buffer --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
            -H 'accept: text/event-stream' \
            -H 'Content-Type: application/json' \
            -d '{
              "model": "'"${MODEL_NAME:-Qwen/Qwen3-0.6B}"'",
              "messages": [
                {
                  "role": "user",
                  "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
                }
              ],
              "stream":false,
              "max_tokens": 30,
              "temperature": 0.0
            }' 2>&1)
          echo "Response: $RESPONSE"
          TEST_RESULT=0
          if ! echo "$RESPONSE" | jq -e . >/dev/null 2>&1; then
            echo "Test failed: Response is not valid JSON"
            echo "Got: $RESPONSE"
            TEST_RESULT=1
          elif ! echo "$RESPONSE" | jq -e '.choices[0].message.role == "assistant"' >/dev/null 2>&1; then
            echo "Test failed: Message role is not 'assistant'"
            echo "Got: $(echo "$RESPONSE" | jq '.choices[0].message.role')"
            TEST_RESULT=1
          elif ! echo "$RESPONSE" | jq -e '.model == "'"${MODEL_NAME}"'"' >/dev/null 2>&1; then
            echo "Test failed: Model name is incorrect"
            echo "Got: $(echo "$RESPONSE" | jq '.model')"
            TEST_RESULT=1
          elif ! echo "$RESPONSE" | jq -e '.choices[0].message.content | length > 100' >/dev/null 2>&1; then
            echo "Test failed: Response content length is not greater than 100 characters"
            echo "Got length: $(echo "$RESPONSE" | jq '.choices[0].message.content | length')"
            TEST_RESULT=1
          else
            echo "Test passed: Response matches expected format and content"
          fi
          exit $TEST_RESULT
      - name: Cleanup
        if: always()
        timeout-minutes: 5
        run: |
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"

          # For debugging purposes, list all the resources before we uninstall
          kubectl get all

          echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..."
          kubectl delete dynamographdeployments --all -n $NAMESPACE || true

          # Uninstall the helm chart
          helm ls
          helm uninstall dynamo-platform || true

          echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..."
          kubectl delete namespace $NAMESPACE || true
          echo "Namespace $NAMESPACE completed."

  deploy-test-sglang:
    runs-on: cpu-amd-m5-2xlarge
    if: needs.changed-files.outputs.has_code_changes == 'true'
    needs: [changed-files, operator, sglang]
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        profile:
          - agg
          - agg_router
    name: deploy-test-sglang (${{ matrix.profile }})
    env:
      FRAMEWORK: sglang
      DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com
      DEPLOYMENT_FILE: "deploy/${{ matrix.profile }}.yaml"
      MODEL_NAME: "Qwen/Qwen3-0.6B"
    steps: *deploy-test-steps

  deploy-test-trtllm:
    runs-on: cpu-amd-m5-2xlarge
    if: needs.changed-files.outputs.has_code_changes == 'true'
    needs: [changed-files, operator, trtllm]
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        profile:
          - agg
          - agg_router
          - disagg
          - disagg_router
    name: deploy-test-trtllm (${{ matrix.profile }})
    env:
      FRAMEWORK: trtllm
      DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com
      DEPLOYMENT_FILE: "deploy/${{ matrix.profile }}.yaml"
      MODEL_NAME: "Qwen/Qwen3-0.6B"
    steps: *deploy-test-steps