Fix Image Loading for Podman in E2E Tests#377
Conversation
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
| containers: | ||
| - name: epp | ||
| image: ghcr.io/llm-d/llm-d-inference-scheduler:latest | ||
| image: ${EPP_IMAGE} |
There was a problem hiding this comment.
Please undo this change. The file as it was allows the YAML file to be used outside of the kind based tests.
|
|
||
| images: | ||
| - name: ghcr.io/llm-d/llm-d-inference-scheduler | ||
| newTag: ${EPP_TAG} |
There was a problem hiding this comment.
Please undo this change. The file as it was allows the YAML file to be used outside of the kind based tests.
| initContainers: | ||
| - name: routing-sidecar | ||
| image: ghcr.io/llm-d/llm-d-routing-sidecar:latest | ||
| image: ${ROUTING_SIDECAR_IMAGE} |
There was a problem hiding this comment.
Please undo this change. The file as it was allows the YAML file to be used outside of the kind based tests.
| containers: | ||
| - name: vllm | ||
| image: ghcr.io/llm-d/llm-d-inference-sim:latest | ||
| image: ${VLLM_SIMULATOR_IMAGE} |
There was a problem hiding this comment.
Please undo this change. The file as it was allows the YAML file to be used outside of the kind based tests.
| - name: ghcr.io/llm-d/llm-d-inference-sim | ||
| newTag: ${VLLM_SIMULATOR_TAG} | ||
| - name: ghcr.io/llm-d/llm-d-routing-sidecar | ||
| newTag: ${ROUTING_SIDECAR_TAG} |
There was a problem hiding this comment.
Please undo this change. The file as it was allows the YAML file to be used outside of the kind based tests.
| # Set the default routing side car image tag | ||
| export ROUTING_SIDECAR_TAG="${ROUTING_SIDECAR_TAG:-0.0.6}" | ||
| # Set the default routing side car image | ||
| export ROUTING_SIDECAR_IMAGE="${ROUTING_SIDECAR_IMAGE:-ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0}" |
There was a problem hiding this comment.
It is very useful to keep the TAG separate from the image name. Please change this to use ROUTING_SIDECAR_TAG to set the tag in ROUTING_SIDECAR_IMAGE
| # Load the vllm simulator image into the cluster | ||
| if [ "${CONTAINER_RUNTIME}" == "podman" ]; then | ||
| podman save ${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG} -o /dev/stdout | kind --name ${CLUSTER_NAME} load image-archive /dev/stdin | ||
| podman save ${VLLM_SIMULATOR_IMAGE} -o /dev/stdout | kind --name ${CLUSTER_NAME} load image-archive /dev/stdin |
There was a problem hiding this comment.
Please undo this change due to other changes that are requested to be undone.
| else | ||
| if docker image inspect "${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG}" > /dev/null 2>&1; then | ||
| if docker image inspect ${VLLM_SIMULATOR_IMAGE} > /dev/null 2>&1; then | ||
| echo "INFO: Loading image into KIND cluster..." |
There was a problem hiding this comment.
Please undo this change due to other changes that are requested to be undone.
| if docker image inspect ${VLLM_SIMULATOR_IMAGE} > /dev/null 2>&1; then | ||
| echo "INFO: Loading image into KIND cluster..." | ||
| kind --name ${CLUSTER_NAME} load docker-image ${IMAGE_REGISTRY}/${VLLM_SIMULATOR_IMAGE}:${VLLM_SIMULATOR_TAG} | ||
| kind --name ${CLUSTER_NAME} load docker-image ${VLLM_SIMULATOR_IMAGE} |
There was a problem hiding this comment.
Please undo this change due to other changes that are requested to be undone.
|
|
||
| # Default image registry for pulling deployment images | ||
| export IMAGE_REGISTRY="${IMAGE_REGISTRY:-ghcr.io/llm-d}" | ||
|
|
There was a problem hiding this comment.
Please undo this change due to other changes that are requested to be undone.
| local image_name="$1" | ||
| echo "Checking for image: ${image_name}" | ||
|
|
||
| # Attempt to inspect the image manifest on the remote registry. |
There was a problem hiding this comment.
Please reverse this check. That is check locally first and only if not present locally check the remote container registry.
There was a problem hiding this comment.
is there any risk that the local version could become out of date? If I am using ghcr.io/llm-d/llm-d-inference-scheduler:latest and I've already pulled it a few weeks ago then the local image would not be updated if the check is executed in this order
| loadSession, err := gexec.Start(cmdKindLoad, ginkgo.GinkgoWriter, ginkgo.GinkgoWriter) | ||
| gomega.Expect(err).ShouldNot(gomega.HaveOccurred()) | ||
| gomega.Eventually(loadSession).WithTimeout(600 * time.Second).Should(gexec.Exit(0)) | ||
| case "docker": |
There was a problem hiding this comment.
The code in main has been changed for docker to also save and then do a kind load image-archive. Solves multi-architecture image issues on kind.
There was a problem hiding this comment.
Yes thank you for updating that! I will adjust this pr (or open a new one)
|
@hdefazio I have updated many of my comments as after I read your proposal and re-reviewed the code I better understand what you were trying to do here. Some places need minor tweaks but overall it's a good PR. |
|
@hdefazio Do you plan on carrying this PR forward? Especially in light of your proposal? |
|
Closing in favor of #406 |
…king protocol (llm-d#377) * Defining an outer metadata struct as part of the extproc endpoint picking protocol * Apply suggestions from code review Update the protocol doc based on the suggested edits Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com> * Updated the flag names --------- Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
Depends on #371
This PR fixes a critical issue where the end-to-end test suite was failing in environments using Podman due to how kind loads container images.
What this PR does / why we need it:
Previously, our tests used
kind load docker-imageto load test images into the Kind cluster. While this works for Docker, it is unreliable with Podman, especially in rootless environments. The command would fail with errors like "image not present locally" or "stat -: no such file or directory" because the test runner could not correctly connect to the user's Podman session or handle piped image data.This is a known issue with kind (see kubernetes-sigs/kind#2038 and kubernetes-sigs/kind#3105) and this pr implements the suggested workaround.
Testing: