[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
Hi @fstetic! |
Hi @gkcalat! Thanks for the quick response. The run doesn't complete; that error happens at the start of the run. I tried a tutorial pipeline with the v1 YAML spec and that one behaves as expected. I inspected the MinIO bucket and found out that v1 pipelines make a dir named I also noticed in the network requests that, when a run is opened in the UI, a POST request to I also raised this issue in Slack, and a person responded that it might be related to a namespace/profile instantiation issue, so I'll look into that next. |
Hello @fstetic, I am also having the same problem, testing on 2.0.0-beta.1 for both the API server and UI, with kfp==2.0.0beta14.
Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines. |
/cc @chensun |
Have you figured it out yet? Or any ideas? |
Is the use of v1 pipelines still viable? I have the same problem reported above, but the proxy-agent pod is in CrashLoopBackOff. I searched the pod logs and the result is below. In the UI, I keep coming across this error without being able to use it: +++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
|
Sorry, I'm still using Pipeline v2. Can't help you.
The use of v1 pipelines is still viable?, I have the same problem reported above
But the proxy-agent pod is on CrashLoopBack, I searched the pod logs and the result is below
In the ui, I keep coming across this error without being able to use it
Error: failed to retrieve list of pipelines. Click Details for more information.
+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd
DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
HOSTNAME=
[[ -n '' ]]
[[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: metadata.google.internal
INSTANCE_ZONE=/
|
Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running into exactly this (they are using 2.0-alpha.7):
Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8? |
I don't think this would be a blocker, as we had tested pipelines like this in a KFP 2.0 standalone deployment. I do recall seeing similar error messages sometimes, but they shouldn't fail the pipeline execution. |
Facing the same issue: I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there any workable solution without downgrading? |
Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more? |
Still trying to figure it out, no resolution yet. |
I found a solution: downgrade to Kubeflow 1.7 with Kubernetes 1.24, using kfp version 2.0.1. I hope it helps you! |
I had this issue and it stopped happening after creating a volume. The runs start working even if I create a volume and then delete it, but the problem keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests. |
@photonbit Do you mind elaborating on what volume is being created? Is this a Docker volume? Is there a specific name I need to use? Thanks. |
I created a volume from the kubeflow dashboard. For the name, I tried both with a random name and creating the same volume I had configured for the pipeline and both resulted in the issue being solved. |
@photonbit Thanks! |
Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the metadata-envoy-deployment.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
name: metadata-envoy-deployment
spec:
template:
spec:
containers:
- name: container
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 5
httpGet:
path: "/"
port: md-envoy
httpHeaders:
- name: Content-Type
value: application/grpc-web-text |
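If it helps, the probe above can also be applied to an existing install as a strategic-merge patch. A minimal sketch follows; the container name "container" matches the Deployment snippet in this thread, and the kubectl line (left commented) assumes a standalone KFP install in the "kubeflow" namespace:

```shell
# Persist the livenessProbe discussed above as a strategic-merge patch file.
cat > envoy-probe-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: container
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
          httpGet:
            path: "/"
            port: md-envoy
            httpHeaders:
            - name: Content-Type
              value: application/grpc-web-text
EOF
# Apply against a live cluster (namespace is an assumption; adjust as needed):
# kubectl -n kubeflow patch deployment metadata-envoy-deployment \
#   --patch-file envoy-probe-patch.yaml
```

A patch file keeps the change reproducible across reinstalls, unlike editing the live Deployment by hand.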
As commented in deployKF/deployKF#191 (comment), note that the above livenessProbe was tested against Pipelines 2.0; we ran into issues sending the same request after upgrading to 2.2.0 (as described in canonical/envoy-operator#106).
@thesuperzapper Thanks for your information. I put this in the "metadata-envoy-deployment" YAML and ran a simple example pipeline, but it still shows the error. |
In my case the issue was related to a hard-coded image name [gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707] somewhere in the Kubeflow code, which is not mentioned anywhere in the Kubeflow manifests and which kept the pod in Init:ImagePullBackOff state. Loading this image under the same name on all worker nodes solved the issue. I am using Kubeflow 1.8 on an on-prem k8s cluster. |
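For anyone hitting the same Init:ImagePullBackOff, a rough sketch of the pre-loading step described above. Only the image digest comes from the comment; the node runtime (containerd), file names, and the docker/ctr commands are illustrative assumptions:

```shell
# The pinned driver image quoted in the comment above.
IMG='gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707'
# On a machine that CAN reach gcr.io:
# docker pull "$IMG"
# docker save "$IMG" -o kfp-driver.tar
# Then on each worker node (containerd keeps kubelet images in the k8s.io namespace):
# ctr -n k8s.io images import kfp-driver.tar
echo "$IMG"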
Environment
Local Canonical Kubeflow using this guide
Bottom of KFP UI left sidenav says build version dev_local, and the guide states 1.6.
kfp 2.0.0b10
kfp-pipeline-spec 0.1.17
kfp-server-api 2.0.0a6
Steps to reproduce
Install Kubeflow using the aforementioned guide. Copy the addition pipeline, compile it, and either run it after uploading through the UI or run it from code. Neither works.
Expected result
Pipeline shouldn't fail.
Materials and Reference
In the run details it says:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.
Addition pipeline from documentation
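The "Unknown Content-type received" part of the error above is typically what the UI's grpc-web client surfaces when metadata-envoy answers with something other than a gRPC response. A hedged manual check is sketched below; the service name metadata-envoy-service and port 9090 are the defaults from the KFP standalone manifests and are assumptions here, and the cluster commands are left commented so nothing runs against a live cluster by accident:

```shell
# Assumed defaults from the KFP standalone manifests; adjust for your install.
NS=kubeflow
SVC=metadata-envoy-service
PORT=9090
# Against a live cluster:
# kubectl -n "$NS" port-forward "svc/$SVC" "$PORT:$PORT" &
# sleep 2
# A healthy envoy answers this grpc-web request with an HTTP status line;
# "connection refused" or an HTML body points at envoy/MLMD rather than the UI.
# curl -is -H 'Content-Type: application/grpc-web-text' "http://localhost:$PORT/" | head -n 1
echo "would probe: http://localhost:$PORT/"
```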