[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
Hi @fstetic! |
Hi @gkcalat! Thanks for the quick response. The run doesn't complete; that error happens at the start of the run. I tried a tutorial pipeline with the v1 YAML spec and that one behaves as expected. I inspected the MinIO bucket and found out that v1 pipelines make a dir named I also noticed in the network requests that, when a run is opened in the UI, a POST request to I also raised this issue in Slack, and a person responded that it might be related to a namespace/profile instantiation issue, so I'll look into that next. |
Hello @fstetic, I am also having the same problem, testing on 2.0.0-beta.1 for both the API server and UI, with kfp==2.0.0beta14.
Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines. |
/cc @chensun |
Have you figured it out yet? Or any ideas? |
Is the use of v1 pipelines still viable? I have the same problem reported above, but the proxy-agent pod is in CrashLoopBackOff. I searched the pod logs and the result is below. In the UI, I keep coming across this error without being able to use it: +++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
|
Sorry, I'm still using Pipeline v2. Can't help you.
The use of v1 pipelines is still viable?, I have the same problem reported above
But the proxy-agent pod is on CrashLoopBack, I searched the pod logs and the result is below
In the ui, I keep coming across this error without being able to use it
Error: failed to retrieve list of pipelines. Click Details for more information.
+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd
DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
HOSTNAME=
[[ -n '' ]]
[[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: metadata.google.internal
INSTANCE_ZONE=/
|
Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running into exactly this (they are using 2.0-alpha.7):
Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8? |
I don't think this would be a blocker, as we had tested pipelines like this in a KFP 2.0 standalone deployment. I do recall seeing similar error messages sometimes, but they shouldn't fail the pipeline execution. |
Facing the same issue: I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there any workable solution without downgrading? |
Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more? |
Still trying to figure it out, no resolution yet. |
I found a solution: downgrade to Kubeflow 1.7 with Kubernetes 1.24, using kfp version 2.0.1. I hope it helps you! |
I had this issue and it stopped happening after creating a volume. The runs start working even if I create a volume and then delete it, but the problem keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests. |
@photonbit Do you mind elaborating on what volume is being created? Is this a Docker volume? Is there a specific name I need to use? Thanks. |
I created a volume from the kubeflow dashboard. For the name, I tried both with a random name and creating the same volume I had configured for the pipeline and both resulted in the issue being solved. |
@photonbit Thanks! |
Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the metadata-envoy-deployment.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
name: metadata-envoy-deployment
spec:
template:
spec:
containers:
- name: container
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 5
httpGet:
path: "/"
port: md-envoy
httpHeaders:
- name: Content-Type
value: application/grpc-web-text |
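If it helps, the probe above can also be applied to an existing install as a strategic-merge patch. A minimal sketch follows; the container name "container" matches the Deployment snippet in this thread, and the kubectl line (left commented) assumes a standalone KFP install in the "kubeflow" namespace:

```shell
# Persist the livenessProbe discussed above as a strategic-merge patch file.
cat > envoy-probe-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: container
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
          httpGet:
            path: "/"
            port: md-envoy
            httpHeaders:
            - name: Content-Type
              value: application/grpc-web-text
EOF
# Apply against a live cluster (namespace is an assumption; adjust as needed):
# kubectl -n kubeflow patch deployment metadata-envoy-deployment \
#   --patch-file envoy-probe-patch.yaml
```

A patch file keeps the change reproducible across reinstalls, unlike editing the live Deployment by hand.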
As commented in deployKF/deployKF#191 (comment), note that the above livenessProbe was tested against Pipelines 2.0; we ran into issues sending the same request after upgrading to 2.2.0 (as described in canonical/envoy-operator#106).
@thesuperzapper Thanks for your information. I put this in the "metadata-envoy-deployment" YAML and ran a simple example pipeline, but it still shows the error. |
In my case the issue was related to a hard-coded image name [gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707] somewhere in the Kubeflow code, which is not mentioned anywhere in the Kubeflow manifests and which kept the pod in Init:ImagePullBackOff state. Loading this image under the same name on all worker nodes solved the issue. I am using Kubeflow 1.8 on an on-prem k8s cluster. |
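For anyone hitting the same Init:ImagePullBackOff, a rough sketch of the pre-loading step described above. Only the image digest comes from the comment; the node runtime (containerd), file names, and the docker/ctr commands are illustrative assumptions:

```shell
# The pinned driver image quoted in the comment above.
IMG='gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707'
# On a machine that CAN reach gcr.io:
# docker pull "$IMG"
# docker save "$IMG" -o kfp-driver.tar
# Then on each worker node (containerd keeps kubelet images in the k8s.io namespace):
# ctr -n k8s.io images import kfp-driver.tar
echo "$IMG"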
Environment
Local Canonical Kubeflow using this guide
Bottom of KFP UI left sidenav says build version dev_local, and the guide states 1.6.
kfp 2.0.0b10
kfp-pipeline-spec 0.1.17
kfp-server-api 2.0.0a6
Steps to reproduce
Install Kubeflow using the aforementioned guide. Copy the addition pipeline, compile it, and either run it after uploading through the UI or run it from code. Neither works.
Expected result
Pipeline shouldn't fail.
Materials and Reference
In the run details it says:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.
Addition pipeline from documentation
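The "Unknown Content-type received" part of the error above is typically what the UI's grpc-web client surfaces when metadata-envoy answers with something other than a gRPC response. A hedged manual check is sketched below; the service name metadata-envoy-service and port 9090 are the defaults from the KFP standalone manifests and are assumptions here, and the cluster commands are left commented so nothing runs against a live cluster by accident:

```shell
# Assumed defaults from the KFP standalone manifests; adjust for your install.
NS=kubeflow
SVC=metadata-envoy-service
PORT=9090
# Against a live cluster:
# kubectl -n "$NS" port-forward "svc/$SVC" "$PORT:$PORT" &
# sleep 2
# A healthy envoy answers this grpc-web request with an HTTP status line;
# "connection refused" or an HTML body points at envoy/MLMD rather than the UI.
# curl -is -H 'Content-Type: application/grpc-web-text' "http://localhost:$PORT/" | head -n 1
echo "would probe: http://localhost:$PORT/"
```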