[backend] Cannot get MLMD objects from Metadata store. Cannot find context (Version 1.9.0-rc.2) #2800

Closed
6 of 7 tasks
Jithsaavvy opened this issue Jul 15, 2024 · 11 comments

Comments

@Jithsaavvy

Jithsaavvy commented Jul 15, 2024

Validation Checklist

Version

master

Describe your issue

Environment

  • Kubeflow: v1.9.0-rc.2
  • Kubernetes: v1.27.4
  • Platform: On-premise Kubernetes cluster
  • Kubeflow Pipelines (KFP) - v2.2.0
  • KFP SDK - v2.8.0
  • OS: Ubuntu 22.04
  • Deployment:
    • Using Kubeflow Manifests (without any specific distribution) from the master branch.
    • Version v1.9.0-rc.2 was deployed (the version of every deployed component was checked manually against the release notes).

Description

Kubeflow was upgraded from v1.8.1 stable to v1.9.0-rc.2 (since no stable release is available yet for v1.9) as a clean redeployment, in order to use the latest KFP v2.2.0. When I attempted to run a pipeline via the UI, it resulted in the following error:

Cannot get MLMD objects from Metadata store. Cannot find context with {"typeName":"system.PipelineRun" "contextName":"cc1bbc51-426f-4192-843a-bf4b94535a5b"}: Cannot find specified context

The same pipeline ran successfully, without any issues or errors, on the previous stable Kubeflow v1.8.1. As a sanity check, I also tried to run a sample pipeline from the documentation and an existing tutorial pipeline available in the KFP UI; both resulted in the above error.

Upon inspection of the embedded MySQL pod, the pipeline record was created in the mlpipeline database as follows:

pipelines table (mlpipeline database)

mysql> USE mlpipeline;
mysql> SELECT uuid, name, status from pipelines;
| uuid                                 | name                                | status |
|--------------------------------------|-------------------------------------|--------|
| 645a4823-8e01-432b-b6b1-75776d14c805 | [Tutorial] DSL - Control structures | READY  |

run_details table

mysql> SELECT uuid, displayname, pipelinecontextid, pipelineid, conditions from run_details;

| uuid                                 | displayname                                        | pipelinecontextid | pipelineid                           | conditions |
|--------------------------------------|----------------------------------------------------|-------------------|--------------------------------------|------------|
| cc1bbc51-426f-4192-843a-bf4b94535a5b | Run of [Tutorial] DSL - Control structures (be550) | 0                 | 645a4823-8e01-432b-b6b1-75776d14c805 | Failed     |

However, the run context for the same pipeline was neither created nor referenced in the metadb database. Inspecting the pods in the kubeflow namespace showed that the ml-pipeline-api-server container in the ml-pipeline pod uses the mlpipeline database as backend storage for the pipeline component, while the metadata-controller pod uses the metadb database as backend storage for the MLMD store. It appears that metadb cannot find or access the pipeline context record from the mlpipeline database, or something similar. The connection to the MySQL pod is healthy, and the corresponding PVC is mounted, available, and accessible.

Note: The above description applies to any pipeline.
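
One way to check directly whether MLMD holds the run context named in the error is to query the metadata gRPC service. The following is a minimal sketch, not the lookup the KFP UI performs: it assumes the `ml-metadata` Python package is installed and that the metadata gRPC service has been port-forwarded to `localhost:8080` (both are assumptions), and `connect` and `diagnose_run_context` are hypothetical helper names.

```python
def connect(host="localhost", port=8080):
    """Open a client to the MLMD gRPC server (needs the ml-metadata package).

    The host and port are assumptions, e.g. a local port-forward of the
    metadata gRPC service.
    """
    # Imported lazily so diagnose_run_context() works with any object that
    # exposes the same two methods.
    from ml_metadata.metadata_store import metadata_store
    from ml_metadata.proto import metadata_store_pb2

    config = metadata_store_pb2.MetadataStoreClientConfig()
    config.host = host
    config.port = port
    return metadata_store.MetadataStore(config)


def diagnose_run_context(store, run_id, type_name="system.PipelineRun"):
    """Report whether MLMD knows the context type and the run's context."""
    try:
        # ml-metadata raises NotFoundError when the type is not registered.
        store.get_context_type(type_name)
    except Exception as exc:
        return f"context type {type_name!r} is not registered: {exc}"

    # Returns None when no context with this type and name exists.
    context = store.get_context_by_type_and_name(type_name, run_id)
    if context is None:
        return f"type {type_name!r} exists, but no context named {run_id!r}"
    return f"found context id={context.id} for run {run_id!r}"


# Example (run ID taken from the error message above):
# print(diagnose_run_context(connect(), "cc1bbc51-426f-4192-843a-bf4b94535a5b"))
```

The result separates "the `system.PipelineRun` type was never registered" from "the type exists but this run has no context", which are two different failure modes behind the same UI error.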

Expected Behavior

Pipeline run should succeed without any issues.

Current Behavior

When a pipeline run is triggered from the UI, a system-dag-driver pod is created in the KF user namespace and runs to completion successfully. After that, the KFP execution pod for the pipeline components is created and fails immediately with the above error.

Steps to reproduce the issue

  1. Install Kubeflow v1.9.0-rc.2 using Kubeflow Manifests.
  2. Copy the pipeline code or use the already existing tutorial pipeline from the UI and create a run from it.

Additional Context

The release notes for v1.9.0-rc.2 state that it supports Kubernetes v1.27 - 1.29. However, the README at this RC's release tag states that it targets Kubernetes v1.29+, which is a little confusing. My questions are:

  1. What are the supported Kubernetes versions for KF v1.9?
  2. Is the above issue a known bug in this RC version that will be patched in the v1.9 stable release?
  3. Is anyone else impacted by this issue or are there any solutions available?

Related Issues

  1. Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
  2. Error running pipelines with pod labels or annotation in pipeline steps added using kfp-kubernetes #10868

Put here any screenshots or videos (optional)

No response

@juliusvonkohout
Member

Hello, this should be fixed in the 1.9 branch and the final release today. Please try again with the current v1.9 branch, not the rc.2 tag, and reopen if it still occurs with "/reopen".

@Jithsaavvy
Author

Thanks @juliusvonkohout. Yes, the issue is fixed in the v1.9 stable release.

@nparkstar
Contributor

nparkstar commented Sep 4, 2024

/reopen

@Jithsaavvy, did you solve this issue?
I still have the same issue, even though I tried with Kubeflow v1.9.
I got the message "Cannot get MLMD objects from Metadata store."

Environment:
OS: Ubuntu 22.04
Kubernetes: v1.27
Kubeflow: upgraded to 1.9 (using "git clone -b v1.9.0 https://github.com/kubeflow/manifests.git")
KFP: kfp 2.7.0 (using "kfp --version")

Message on GUI: (screenshot)

And I see the following messages when I run the command "kubectl logs -n kubeflow metadata-grpc-deployment-c568bd446-zpptp":

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0716 08:00:27.548267 1 metadata_store_server_main.cc:577] Server listening on 0.0.0.0:8080
W0902 12:52:33.509635 10 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: system.Pipeline, version: nullopt
W0902 12:52:33.568145 11 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: system.PipelineRun, version: nullopt
E0903 13:59:45.840103283 10 hpack_parser.cc:1216] Error parsing metadata: error=invalid value key=:method value=HEAD
E0903 14:49:58.423987147 11 hpack_parser.cc:1216] Error parsing metadata: error=invalid value key=content-type value=application/grpc-web-text

Do you know about it?
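
The warnings in the log excerpt above name the context types that the metadata server could not find. As an illustration only, the sketch below pulls those type names out of such log text with regular expressions. The patterns are written against the lines quoted in this thread and are an assumption about the log layout, not a documented format; `missing_context_types` is a hypothetical helper name. Whether these warnings are the cause of the UI error is not established in this thread.

```python
import re

# Layout inferred from the quoted lines: severity letter, MMDD, time,
# thread id, source file:line, then the message.
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<thread>\d+) "
    r"(?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)
MISSING_TYPE = re.compile(
    r"GetContextType failed: No type found for query, name: (?P<name>[\w.]+)"
)


def missing_context_types(log_text):
    """Return the context type names reported as missing, in first-seen order."""
    names = []
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line.strip())
        if not match or match["severity"] != "W":
            continue  # only warning lines carry the GetContextType message
        missing = MISSING_TYPE.search(match["message"])
        if missing and missing["name"] not in names:
            names.append(missing["name"])
    return names
```

Feeding it the output of `kubectl logs` for the metadata-grpc-deployment pod gives a quick list of the types to look for in MLMD.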

@juliusvonkohout
Member

juliusvonkohout commented Sep 4, 2024

@nparkstar is this a pipelines or manifests issue? It looks like pipelines only.

@nparkstar
Contributor

> @nparkstar is this a pipelines or manifests issue? It looks like pipelines only.

Jithsaavvy said "Yes, the issue is fixed in v1.9 stable release."
I tried v1.9, but the issue remains.
So I commented on this issue.
My comments are not appropriate for this issue, so I'll delete them.

@juliusvonkohout
Member

@nparkstar you probably need a new issue. Please test with a fresh kind cluster as described in our readme, to make sure that it is not specific to your Kubernetes cluster first.

@nparkstar
Contributor

> @nparkstar you probably need a new issue. Please test with a fresh kind cluster as described in our readme, to make sure that it is not specific to your Kubernetes cluster first.

Thank you for your response.
I'll test on a new system as you suggest and post the result.

@nparkstar
Contributor


I finally solved my issue.
I did a fresh install on a new machine, and the problem no longer appears.

But I think the following instruction is wrong:
"Install individual components" at https://github.com/kubeflow/manifests/tree/v1.9.0.

I cannot connect to the central dashboard after installing Kubeflow according to the above instruction.

I succeeded after installing with the command below.

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Thanks,
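
The shell loop in the comment above keeps re-applying the manifests until `kubectl apply` succeeds, which handles resources that depend on CRDs not yet established on the first pass. A rough Python equivalent with a bounded number of attempts might look like the sketch below. The function name `apply_manifests`, the attempt limit, and the injectable `run` and `sleep` parameters are all assumptions made for illustration; unlike the shell pipeline, it also checks the exit status of `kustomize build` separately.

```python
import subprocess
import time


def apply_manifests(path="example", attempts=30, delay=10,
                    run=subprocess.run, sleep=time.sleep):
    """Build the kustomization at `path` and apply it, retrying on failure.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails.
    """
    for attempt in range(1, attempts + 1):
        # Equivalent of: kustomize build <path> | kubectl apply -f -
        build = run(["kustomize", "build", path], capture_output=True)
        if build.returncode == 0:
            applied = run(["kubectl", "apply", "-f", "-"], input=build.stdout)
            if applied.returncode == 0:
                return attempt
        if attempt < attempts:
            print("Retrying to apply resources")
            sleep(delay)
    raise RuntimeError(f"manifests not applied after {attempts} attempts")
```

Bounding the attempts means a persistent failure surfaces as an error instead of looping forever, which the original `while !` loop would do.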

@juliusvonkohout
Member

Then please create a PR to fix it.

@nparkstar
Contributor

@juliusvonkohout

I created PRs.
#2873
#2874

This is my first time creating a PR.
If I did something wrong, please tell me.

Thanks,

@vak890

vak890 commented Sep 12, 2024

@juliusvonkohout I also get the same error when trying to deploy all components with:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
