Fix Image Loading for Podman in E2E Tests #377

Closed
hdefazio wants to merge 10 commits into llm-d:main from hdefazio:dev/fix_podman_e2e

Contributor

@hdefazio hdefazio commented Oct 17, 2025

Depends on #371

This PR fixes a critical issue where the end-to-end test suite was failing in environments using Podman due to how kind loads container images.

What this PR does / why we need it:
Previously, our tests used kind load docker-image to load test images into the Kind cluster. While this works for Docker, it is unreliable with Podman, especially in rootless environments. The command would fail with errors like "image not present locally" or "stat -: no such file or directory" because the test runner could not correctly connect to the user's Podman session or handle piped image data.
This is a known issue with kind (see kubernetes-sigs/kind#2038 and kubernetes-sigs/kind#3105), and this PR implements the suggested workaround: saving each image to an archive and loading it with kind load image-archive.
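
The workaround can be sketched as a small shell helper that saves the image to a temporary archive file and then hands that file to kind load image-archive, instead of relying on kind load docker-image or on piped data. This is a minimal sketch, not the code in this PR: the function name load_image_into_kind and the temporary-directory handling are assumptions, while CONTAINER_TOOL and the e2e-tests cluster name are taken from the test log below.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): load a locally available image
# into a kind cluster through an archive file.
load_image_into_kind() {
  local image="$1"
  local cluster="${2:-e2e-tests}"         # cluster name as seen in the test log
  local tool="${CONTAINER_TOOL:-podman}"  # docker or podman
  local tmpdir archive

  # Write the archive into a fresh temporary directory so that no
  # pre-existing file is reused.
  tmpdir="$(mktemp -d)" || return 1
  archive="${tmpdir}/image.tar"

  # Both `podman save` and `docker save` accept -o <file>.
  if "${tool}" save -o "${archive}" "${image}" &&
     kind load image-archive "${archive}" --name "${cluster}"; then
    rm -rf "${tmpdir}"
    return 0
  fi

  rm -rf "${tmpdir}"
  return 1
}
```

A call such as load_image_into_kind ghcr.io/llm-d/llm-d-inference-sim:latest would then cover one of the three images listed in the log. Writing to a real file avoids the /dev/stdin piping that the description identifies as unreliable.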

Testing:

$ make test-e2e
✅ Container tool 'podman' found.
==== Building Docker image ghcr.io/llm-d/llm-d-inference-scheduler:dev ====
podman build \
....
Successfully tagged ghcr.io/llm-d/llm-d-inference-scheduler:dev
b8a33690807f06e92527fb1d04328cefaea6d225e2fb9b996343e1c5be0cac35

==== Pulling Docker images ====
./scripts/pull_images.sh
Using container tool: podman
--- Using the following images ---
Scheduler Image:     ghcr.io/llm-d/llm-d-inference-scheduler:dev
Simulator Image:     ghcr.io/llm-d/llm-d-inference-sim:latest
Sidecar Image:       ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0
----------------------------------------------------
Pulling dependencies...
...
==== Running End to End Tests ====
./test/scripts/run_e2e.sh
Running end to end tests
  "level"=0 "msg"="Successfully loaded environment variable" "key"="CONTAINER_TOOL" "value"="podman"
  "level"=0 "msg"="Successfully loaded environment variable" "key"="EPP_IMAGE" "value"="ghcr.io/llm-d/llm-d-inference-scheduler:dev"
  "level"=0 "msg"="Successfully loaded environment variable" "key"="VLLM_SIMULATOR_IMAGE" "value"="ghcr.io/llm-d/llm-d-inference-sim:latest"
  "level"=0 "msg"="Successfully loaded environment variable" "key"="ROUTING_SIDECAR_IMAGE" "value"="ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0"
  "level"=0 "msg"="Environment variable not set, using default value" "key"="EXISTS_TIMEOUT" "defaultValue"="30s"
  "level"=0 "msg"="Environment variable not set, using default value" "key"="READY_TIMEOUT" "defaultValue"="3m0s"
  "level"=0 "msg"="Environment variable not set, using default value" "key"="MODEL_READY_TIMEOUT" "defaultValue"="10m0s"
=== RUN   TestEndToEnd
Running Suite: End To End Test Suite - 
==============================================================================================================
Random Seed: 1760660772

Will run 3 of 3 specs
------------------------------
[BeforeSuite] 
  enabling experimental podman provider
  Creating cluster "e2e-tests" ...
  Set kubectl context to "kind-e2e-tests"
  You can now use your cluster with:

  kubectl cluster-info --context kind-e2e-tests

  Thanks for using kind! 😊
  STEP: Loading image into Kind cluster: ghcr.io/llm-d/llm-d-inference-sim:latest @ 10/16/25 20:26:34.463
  "level"=0 "msg"="Podman detected, using image-archive method." "path"="/usr/bin/podman"
  Copying blob sha256:778d8c610941586099cac6c507cad2d1156b71b2bb54c42cebedf8808c68edb9
  Writing manifest to image destination
  enabling experimental podman provider
  STEP: Loading image into Kind cluster: ghcr.io/llm-d/llm-d-inference-scheduler:dev @ 10/16/25 20:26:40.428
  "level"=0 "msg"="Podman detected, using image-archive method." "path"="/usr/bin/podman"
  Copying blob sha256:004d2c90a65694c2830b06fddc1047d40063c6cb36fb31a5a3edfce9435326c6
  Writing manifest to image destination
  enabling experimental podman provider
  STEP: Loading image into Kind cluster: ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0 @ 10/16/25 20:26:47.652
  "level"=0 "msg"="Podman detected, using image-archive method." "path"="/usr/bin/podman"
  Copying blob sha256:ff59c129bdee8355d5b47559167f5f7c893dc99d9779a2b3194fa59152e90110
  Writing manifest to image destination
  enabling experimental podman provider
[BeforeSuite] PASSED [48.537 seconds]
...
Ran 3 of 3 Specs in 116.816 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestEndToEnd (116.82s)
PASS
ok      github.com/llm-d/llm-d-inference-scheduler/test/e2e     116.833s

@hdefazio hdefazio marked this pull request as draft October 17, 2025 00:07
@hdefazio hdefazio changed the title Fix the e2e tests so they work with podman Fix Image Loading for Podman in E2E Tests Oct 17, 2025
containers:
- name: epp
image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
image: ${EPP_IMAGE}
Collaborator

Please undo this change. As it was, the YAML file could be used outside of the kind-based tests.


images:
- name: ghcr.io/llm-d/llm-d-inference-scheduler
newTag: ${EPP_TAG}
Collaborator

Please undo this change. As it was, the YAML file could be used outside of the kind-based tests.

initContainers:
- name: routing-sidecar
image: ghcr.io/llm-d/llm-d-routing-sidecar:latest
image: ${ROUTING_SIDECAR_IMAGE}
Collaborator

Please undo this change. As it was, the YAML file could be used outside of the kind-based tests.

containers:
- name: vllm
image: ghcr.io/llm-d/llm-d-inference-sim:latest
image: ${VLLM_SIMULATOR_IMAGE}
Collaborator

Please undo this change. As it was, the YAML file could be used outside of the kind-based tests.

- name: ghcr.io/llm-d/llm-d-inference-sim
newTag: ${VLLM_SIMULATOR_TAG}
- name: ghcr.io/llm-d/llm-d-routing-sidecar
newTag: ${ROUTING_SIDECAR_TAG}
Collaborator

Please undo this change. As it was, the YAML file could be used outside of the kind-based tests.

Comment thread scripts/kind-dev-env.sh
# Set the default routing side car image tag
export ROUTING_SIDECAR_TAG="${ROUTING_SIDECAR_TAG:-0.0.6}"
# Set the default routing side car image
export ROUTING_SIDECAR_IMAGE="${ROUTING_SIDECAR_IMAGE:-ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0}"
Collaborator

@shmuelk shmuelk Oct 19, 2025

It is very useful to keep the TAG separate from the image name. Please change this to use ROUTING_SIDECAR_TAG to set the tag in ROUTING_SIDECAR_IMAGE.
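
One way to follow this suggestion is to derive the default image from the tag variable, so that either value can still be overridden on its own. This is a minimal sketch and not the final content of scripts/kind-dev-env.sh; the v0.2.0 default tag is an assumption taken from the image default proposed in this PR.

```shell
# Set the default routing sidecar image tag
# (assumed value, taken from the image default proposed in this PR).
export ROUTING_SIDECAR_TAG="${ROUTING_SIDECAR_TAG:-v0.2.0}"

# Build the default image reference from the tag. An explicitly set
# ROUTING_SIDECAR_IMAGE still takes precedence over this default.
export ROUTING_SIDECAR_IMAGE="${ROUTING_SIDECAR_IMAGE:-ghcr.io/llm-d/llm-d-routing-sidecar:${ROUTING_SIDECAR_TAG}}"
```

With this shape, setting only ROUTING_SIDECAR_TAG in the environment changes just the tag portion of the default image reference.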

Comment thread scripts/kind-dev-env.sh
# Load the vllm simulator image into the cluster
if [ "${CONTAINER_RUNTIME}" == "podman" ]; then
podman save ${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG} -o /dev/stdout | kind --name ${CLUSTER_NAME} load image-archive /dev/stdin
podman save ${VLLM_SIMULATOR_IMAGE} -o /dev/stdout | kind --name ${CLUSTER_NAME} load image-archive /dev/stdin
Collaborator

Please undo this change due to other changes that are requested to be undone.

Comment thread scripts/kind-dev-env.sh
else
if docker image inspect "${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG}" > /dev/null 2>&1; then
if docker image inspect ${VLLM_SIMULATOR_IMAGE} > /dev/null 2>&1; then
echo "INFO: Loading image into KIND cluster..."
Collaborator

Please undo this change due to other changes that are requested to be undone.

Comment thread scripts/kind-dev-env.sh
if docker image inspect ${VLLM_SIMULATOR_IMAGE} > /dev/null 2>&1; then
echo "INFO: Loading image into KIND cluster..."
kind --name ${CLUSTER_NAME} load docker-image ${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG}
kind --name ${CLUSTER_NAME} load docker-image ${VLLM_SIMULATOR_IMAGE}
Collaborator

Please undo this change due to other changes that are requested to be undone.


# Default image registry for pulling deployment images
export IMAGE_REGISTRY="${IMAGE_REGISTRY:-ghcr.io/llm-d}"

Collaborator

Please undo this change due to other changes that are requested to be undone.

@github-project-automation github-project-automation Bot moved this to In review in llm-d-router Oct 19, 2025
@elevran elevran moved this from In review to In progress in llm-d-router Oct 20, 2025
Comment thread scripts/pull_images.sh
local image_name="$1"
echo "Checking for image: ${image_name}"

# Attempt to inspect the image manifest on the remote registry.
Collaborator

Please reverse this check. That is, check locally first, and only if the image is not present locally check the remote container registry.

Contributor Author

Is there any risk that the local version could become out of date? If I am using ghcr.io/llm-d/llm-d-inference-scheduler:latest and pulled it a few weeks ago, the local image would not be updated if the check is executed in this order.
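
The local-first order requested in the review can be sketched as follows. This is an illustration rather than the contents of scripts/pull_images.sh: the function name ensure_image is hypothetical, and it assumes that CONTAINER_TOOL names a CLI (docker or podman) that supports image inspect, manifest inspect, and pull; the behaviour of manifest inspect may differ between tools and versions. The sketch does not resolve the staleness concern for mutable tags such as latest, because a locally cached image is used as-is.

```shell
# Hypothetical helper: make sure an image is available locally,
# checking the local image store before the remote registry.
ensure_image() {
  local image_name="$1"
  local tool="${CONTAINER_TOOL:-docker}"
  echo "Checking for image: ${image_name}"

  # 1. Check the local image store first.
  if "${tool}" image inspect "${image_name}" > /dev/null 2>&1; then
    echo "Found ${image_name} locally, skipping pull."
    return 0
  fi

  # 2. Only if the image is missing locally, check the remote registry
  #    and pull it when a manifest is found there.
  if "${tool}" manifest inspect "${image_name}" > /dev/null 2>&1; then
    "${tool}" pull "${image_name}"
    return $?
  fi

  echo "ERROR: ${image_name} not found locally or in the remote registry." >&2
  return 1
}
```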

loadSession, err := gexec.Start(cmdKindLoad, ginkgo.GinkgoWriter, ginkgo.GinkgoWriter)
gomega.Expect(err).ShouldNot(gomega.HaveOccurred())
gomega.Eventually(loadSession).WithTimeout(600 * time.Second).Should(gexec.Exit(0))
case "docker":
Collaborator

The code in main has been changed so that docker also saves the image and then runs kind load image-archive. This solves multi-architecture image issues on kind.

Contributor Author

Yes, thank you for updating that! I will adjust this PR (or open a new one).

Collaborator

shmuelk commented Oct 29, 2025

@hdefazio I have updated many of my comments: after reading your proposal and re-reviewing the code, I better understand what you were trying to do here. Some places need minor tweaks, but overall it's a good PR.

Collaborator

shmuelk commented Oct 29, 2025

@hdefazio Do you plan on carrying this PR forward? Especially in light of your proposal?

@hdefazio
Contributor Author

@shmuelk My current plan is to open a new PR today to handle just the Podman compatibility issue, as that is a much smaller change than #371. I'll close this one once the new PR is open, and I'll address your comments there!

@hdefazio
Contributor Author

Closing in favor of #406

@hdefazio hdefazio closed this Oct 29, 2025
@github-project-automation github-project-automation Bot moved this from In progress to Done in llm-d-router Oct 29, 2025
elevran pushed a commit to elevran/llm-d-router that referenced this pull request Apr 8, 2026
…king protocol (llm-d#377)

* Defining an outer metadata struct as part of the extproc endpoint picking protocol

* Apply suggestions from code review

Update the protocol doc based on the suggested edits

Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>

* Updated the flag names

---------

Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>