Running the kfp_v2 integration test, an experiment shows "Cannot get MLMD objects from Metadata store."
#966
Comments
Thank you for reporting your feedback! An internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5956.
This should be a transient error shown by MLMD (see kubeflow/pipelines#8733 (comment)). @nishant-dash, could you confirm whether the run completed successfully in the end?
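One way to check whether the run eventually completed, assuming a kfp CLI configured against the deployment:

```bash
# List recent runs and their statuses (requires a configured kfp client).
kfp run list
```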
Just saw that it finished, so I'll go ahead and close this issue.
Re-opening this issue. We managed to reproduce it while bulk testing the Azure one-click deployment. The message in our case was not transient: the red box kept staying there, and the UI never got updated with the pipeline run's progress. After looking at the browser's dev-tools, we saw failing requests. A short-term solution was to delete the envoy pod.
Reproducibility

Trying to reproduce it on MicroK8s 1.29 with Juju 3.4.4, I also see similar logs in the envoy container (one way to fetch them is sketched below).
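Namespace, pod, and container names in this sketch are assumptions based on the charm name:

```bash
# Tail the envoy workload container's logs (names are assumptions).
kubectl logs -n kubeflow envoy-0 -c envoy --tail=100

# If the container restarted, the previous instance's logs are also useful.
kubectl logs -n kubeflow envoy-0 -c envoy --previous
```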
However, there is no error in the KFP UI nor in the browser's dev-tools. Here's the juju status output, which from a revision perspective is identical to the reported one, apart from kserve-controller (mine has revision 573) and kfp-schedwf (revision 1466). I'll try again in Azure one-click deployments.
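For reference, the revision comparison comes from something like the following (the model name is an assumption):

```bash
# Show each application's charm revision in the Rev column;
# "kubeflow" as the model name is an assumption.
juju status -m kubeflow
```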
Logs

The first finding is that this happens due to a race taking place in the envoy workload pod.
Compared to the logs from a healthy envoy pod:
Differences

We see that when envoy works as expected, the following line runs:
and we see the following:
compared to when it's not working:
and the unhealthy pod results in:
EDIT: the `shutting down parent after drain` log line is also observed on healthy envoy pods.

JSON configuration

Their JSON configuration is the following:

```json
{
  "admin": {
    "access_log_path": "/tmp/admin_access.log",
    "address": {
      "socket_address": {"address": "0.0.0.0", "port_value": "9901"}
    }
  },
  "static_resources": {
    "listeners": [
      {
        "name": "listener_0",
        "address": {
          "socket_address": {"address": "0.0.0.0", "port_value": 9090}
        },
        "filter_chains": [
          {
            "filters": [
              {
                "name": "envoy.http_connection_manager",
                "config": {
                  "stat_prefix": "ingress_http",
                  "route_config": {
                    "name": "local_route",
                    "virtual_hosts": [
                      {
                        "name": "local_service",
                        "domains": ["*"],
                        "routes": [
                          {
                            "match": {"prefix": "/"},
                            "route": {
                              "cluster": "metadata-grpc-service",
                              "max_grpc_timeout": "60.000s"
                            }
                          }
                        ],
                        "cors": {
                          "allow_origin": ["*"],
                          "allow_methods": "GET,PUT,DELETE,POST,OPTIONS",
                          "allow_headers": "cache-control,content-transfer-encoding,content-type,grpc-timeout,keep-alive,user-agent,x-accept-content-transfer-encoding,x-accept-response-streaming,x-grpc-web,x-user-agent,custom-header-1",
                          "expose_headers": "grpc-status,grpc-message,custom-header-1",
                          "max_age": "1728000"
                        }
                      }
                    ]
                  },
                  "http_filters": [
                    {"name": "envoy.grpc_web"},
                    {"name": "envoy.cors"},
                    {"name": "envoy.router"}
                  ]
                }
              }
            ]
          }
        ]
      }
    ],
    "clusters": [
      {
        "name": "metadata-grpc-service",
        "type": "LOGICAL_DNS",
        "connect_timeout": "30.000s",
        "hosts": [
          {"socket_address": {"address": "metadata-grpc-service", "port_value": 8080}}
        ],
        "http2_protocol_options": {}
      }
    ]
  }
}
```
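Since the admin interface listens on port 9901 (per the config above), the configuration Envoy actually loaded can be cross-checked through its admin API. A sketch, assuming port-forward access and the pod name used earlier:

```bash
# Forward the envoy admin port locally (namespace and pod name are assumptions).
kubectl -n kubeflow port-forward pod/envoy-0 9901:9901 &

# Dump the currently loaded configuration via Envoy's admin API.
curl -s http://localhost:9901/config_dump | head -n 40
```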
Curl

Trying to curl the envoy's endpoint, I see that it returns connection refused:
which is expected to return a
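A sketch of that check over a port-forward (namespace and pod name are assumptions); a healthy pod at least accepts the TCP connection, while the broken pod refuses it:

```bash
# Forward the gRPC-web listener port (9090, per the config above).
kubectl -n kubeflow port-forward pod/envoy-0 9090:9090 &

# Healthy pod: some HTTP response comes back from the listener.
# Broken pod: the connection itself is refused.
curl -v http://localhost:9090/
```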
Workaround

We didn't figure out what circumstances trigger this. However, restarting the container (or deleting the pod) always resulted in a functional workload container. That's why, since the 2.0 version is a PodSpec charm, we'll be adding a `livenessProbe`.
Adds the condition from integration tests as a `livenessProbe` to the workload container in order to ensure that the workload container restarts if it's not functional. Closes canonical/bundle-kubeflow#966
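A minimal sketch of the idea behind the fix, not necessarily what the PR merged: a TCP liveness probe on the listener port makes kubelet restart the container once it stops accepting connections. Resource kind, names, and timings below are assumptions, and on a Juju-managed charm a manual patch like this would likely be reverted by the charm:

```bash
# Illustrative only: attach a TCP liveness probe to the workload container.
kubectl -n kubeflow patch statefulset envoy --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {"tcpSocket": {"port": 9090},
             "initialDelaySeconds": 10,
             "periodSeconds": 30}}
]'
```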
The fix has been merged in canonical/envoy-operator#124 and promoted to
Bug Description
Running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests at commit [0], the experiment shows:
Cannot get MLMD objects from Metadata store.
[0] fe86b4e255c4c695376f70061a6a645301350d5a
To Reproduce
Run the kfp-v2-integration.ipynb notebook; a sketch of one way to run it headlessly follows.
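This sketch is pinned to the commit from [0]; the execution tooling is an assumption, since the UATs repo may ship its own test driver:

```bash
# Fetch the UATs repo at the reported commit.
git clone https://github.com/canonical/charmed-kubeflow-uats.git
cd charmed-kubeflow-uats
git checkout fe86b4e255c4c695376f70061a6a645301350d5a

# Locate and execute the notebook (jupyter/nbconvert is an assumed tool).
NB=$(find . -name kfp-v2-integration.ipynb)
jupyter nbconvert --to notebook --execute "$NB"
```

After the run, the experiment view shows: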
Cannot get MLMD objects from Metadata store.
which details out as:

Environment
Relevant Log Output
Additional Context
No response