[bug] Nested pipelines fail to run #10039
Comments
/assign @chensun
I can reproduce this; it looks like a backend bug.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
I'm still seeing this issue on kfp 2.7. Anyone else? @JosepSampe @chensun
/reopen
@KyleKaminky: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@JosepSampe: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@droctothorpe: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I plan to investigate / debug this issue this sprint.
I'm keeping a log in case anyone else wants to follow along or has any thoughts / suggestions. I'll keep it updated as I make progress.

**2024-09-03**

I'm going to start by validating that I can recreate the failure scenario:

```python
from kfp import dsl
from kfp.client import Client


@dsl.component
def inner_comp() -> str:
    return "foobar"


@dsl.component
def outer_comp(input: str):
    print("input: ", input)


@dsl.pipeline
def inner_pipeline() -> str:
    inner_comp_task = inner_comp()
    inner_comp_task.set_caching_options(False)
    return inner_comp_task.output


@dsl.pipeline
def outer_pipeline():
    inner_pipeline_task = inner_pipeline()
    outer_comp_task = outer_comp(input=inner_pipeline_task.output)
    outer_comp_task.set_caching_options(False)


if __name__ == "__main__":
    client = Client()
    run = client.create_run_from_pipeline_func(
        pipeline_func=outer_pipeline,
        enable_caching=False,
    )
```

Confirming that this failed. Let's simplify and remove variables. I'm going to only run the inner pipeline:

```python
from kfp import dsl
from kfp.client import Client


@dsl.component
def inner_comp() -> str:
    return "inner"


@dsl.pipeline
def inner_pipeline() -> str:
    inner_comp_task = inner_comp()
    inner_comp_task.set_caching_options(False)
    return inner_comp_task.output


if __name__ == "__main__":
    client = Client()
    run = client.create_run_from_pipeline_func(
        pipeline_func=inner_pipeline,
        enable_caching=False,
    )
```

That ran without any issues. What if we modify it slightly such that we have a sub-DAG but we never reference its output?

```python
@dsl.pipeline
def outer_pipeline():
    inner_pipeline()
    outer_comp_task = outer_comp(input="foo")  # Note, we never reference the output of inner_pipeline.
    outer_comp_task.set_caching_options(False)
```

That worked! Let's copy the successful and failed AWF manifests to a file and diff them:

```diff
 metadata:
   annotations:
-    pipelines.kubeflow.org/components-root: '{"dag":{"tasks":{"inner-pipeline":{"cachingOptions":{},"componentRef":{"name":"comp-inner-pipeline"},"taskInfo":{"name":"inner-pipeline"}},"outer-comp":{"cachingOptions":{},"componentRef":{"name":"comp-outer-comp"},"dependentTasks":["inner-pipeline"],"inputs":{"parameters":{"input":{"taskOutputParameter":{"outputParameterKey":"Output","producerTask":"inner-pipeline"}}}},"taskInfo":{"name":"outer-comp"}}}}}'
+    pipelines.kubeflow.org/components-root: '{"dag":{"tasks":{"inner-pipeline":{"cachingOptions":{},"componentRef":{"name":"comp-inner-pipeline"},"taskInfo":{"name":"inner-pipeline"}},"outer-comp":{"cachingOptions":{},"componentRef":{"name":"comp-outer-comp"},"inputs":{"parameters":{"input":{"runtimeValue":{"constant":"foo"}}}},"taskInfo":{"name":"outer-comp"}}}}}'
 spec:
   templates:
     dag:
       tasks:
         arguments:
           name: task
           value:
-            '{"cachingOptions":{},"componentRef":{"name":"comp-outer-comp"},"dependentTasks":["inner-pipeline"],"inputs":{"parameters":{"input":{"taskOutputParameter":{"outputParameterKey":"Output","producerTask":"inner-pipeline"}}}},"taskInfo":{"name":"outer-comp"}}'
+            '{"cachingOptions":{},"componentRef":{"name":"comp-outer-comp"},"inputs":{"parameters":{"input":{"runtimeValue":{"constant":"foo"}}}},"taskInfo":{"name":"outer-comp"}}'
```

I wonder if the compiler is just misconfiguring the workflow manifest somehow. Let's take a minute to parse and grok the failed manifest.

Here are the corresponding logs:

Here's the key line:

Let's confirm that the write is happening. It looks like it is, judging by this log from the inner executor:

I spent some time combing through the MySQL databases and tables, then asked for a second pair of eyes.

**2024-09-04**

@HumairAK generously ran the failing pipeline and confirmed that the output is in fact written to the database. So it's not a write problem; it's most likely a read problem (the driver failing to resolve the sub-DAG's output when the downstream task requests it).

Next steps to follow.
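As a side note for anyone following along: one way to sanity-check whether the SDK compiler is producing the right nested-output reference is to dump and inspect the compiled IR directly. This is a minimal sketch (not part of the original log), assuming the `outer_pipeline` function from the repro above is in scope and PyYAML is installed; the key names mirror the failing manifest in the diff:

```python
# Sketch: compile the pipeline and inspect how the outer-comp input is encoded.
from kfp import compiler
import yaml

compiler.Compiler().compile(outer_pipeline, package_path="outer_pipeline.yaml")

with open("outer_pipeline.yaml") as f:
    spec = yaml.safe_load(f)

# Expect a taskOutputParameter reference pointing at the inner-pipeline sub-DAG,
# i.e. {"taskOutputParameter": {"outputParameterKey": "Output", "producerTask": "inner-pipeline"}}.
print(spec["root"]["dag"]["tasks"]["outer-comp"]["inputs"]["parameters"]["input"])
```

If the reference looks correct in the compiled spec, that points at the backend driver rather than the SDK compiler.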
Added some debug logging to the driver.

Okay, here's the exact line of code that returns the error:

This case handles inputs that are outputs from previous tasks. When the previous task is a container, the resulting

In particular, note the following in the

That's our previous / producer task output. Now look at what

In particular, take note of the fact that there is no custom property with a key corresponding to the producer task's output. We know that the output is in the database, as @HumairAK demonstrated, but for some reason the driver can't see it here. Why not? This might actually be a write problem after all, i.e. it's possible that the output isn't being recorded where (or in the form) the driver expects to read it.

Will pick up where I left off tomorrow.
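For anyone who wants to poke at the metadata themselves, here is a rough sketch (not from the original comment) of listing MLMD execution custom properties with the ml-metadata Python client. The host and port are assumptions for a port-forwarded metadata gRPC service; adjust to your cluster:

```python
# Sketch: list MLMD executions and their custom-property keys to check whether
# the producer task's output parameter was recorded.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.MetadataStoreClientConfig(
    host="localhost",  # e.g. kubectl port-forward svc/metadata-grpc-service 8080:8080
    port=8080,
)
store = metadata_store.MetadataStore(config)

for execution in store.get_executions():
    keys = sorted(execution.custom_properties.keys())
    print(execution.id, execution.type_id, keys)
```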
Apologies for not updating more consistently. We really got into the weeds with this. The abstractions are complex enough that even just communicating about them is quite difficult.

The good news is that we have a working POC. All of the updates are restricted to the backend. It requires a lot more polish, validation, test file updates, recursive flattening (for when sub-DAGs have sub-DAGs), and we need to test against NamedTuple outputs, but having a functional POC is a great milestone.

Adding this topic to the community call agenda today for a possible informal design review of sorts, even though it is still a WIP. cc @chensun
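For context on the "recursive flattening" point, the doubly-nested shape the fix also needs to handle looks roughly like this. This is a self-contained sketch, not the actual POC code:

```python
from kfp import dsl


@dsl.component
def produce() -> str:
    return "nested"


@dsl.component
def consume(text: str):
    print(text)


@dsl.pipeline
def innermost() -> str:
    return produce().output


@dsl.pipeline
def middle() -> str:
    # A sub-DAG whose output is itself the output of another sub-DAG.
    return innermost().output


@dsl.pipeline
def outermost():
    # The outer DAG consumes an output surfaced through two levels of nesting.
    consume(text=middle().output)
```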
This actually seems related to my discussion started here: #11181. Thanks! Looking forward to this fix.
Thanks for bringing that up @ianbenlolo! Right now we're aiming to support moving PipelineParameters & Artifacts between nested DAGs. The retry piece may be a bit out of scope, but it could definitely be tested against our solution.
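To make the artifact side of that scope concrete, it looks roughly like the following sketch (an illustration, assuming KFP v2 artifact returns from pipeline functions behave like parameter returns):

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Output


@dsl.component
def make_dataset(data: Output[Dataset]):
    with open(data.path, "w") as f:
        f.write("hello from the inner DAG")


@dsl.component
def read_dataset(data: Input[Dataset]):
    with open(data.path) as f:
        print(f.read())


@dsl.pipeline
def inner_pipeline_artifact() -> Dataset:
    # Surface the component's artifact output as the sub-DAG's output.
    return make_dataset().outputs["data"]


@dsl.pipeline
def outer_pipeline_artifact():
    # Consume the nested DAG's artifact output in the outer DAG.
    read_dataset(data=inner_pipeline_artifact().output)
```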
Closed by #11196.
/close
@droctothorpe: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm trying to run the "Pipelines as Components" example from the documentation, but I'm not able to run it successfully. It always compiles, but it produces an error after the subpipeline has executed properly. Looking at the logs of the pod that produces the error, I can see:
Is there a different way to access the output of a "pipeline as component"?
Environment
Using a kind cluster created with:
kind create cluster --name kfp
kfp 2.2.0
kfp-pipeline-spec 0.2.2
kfp-server-api 2.0.1
Steps to reproduce
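The original snippet isn't preserved here; a minimal nested-pipeline example in the spirit of the documentation's "Pipelines as Components" page (and matching the repro in the comments above) looks like this. The component names are illustrative, and `Client()` assumes a port-forwarded or in-cluster KFP endpoint:

```python
# Approximate reproduction: an inner pipeline whose output feeds an outer task.
from kfp import dsl
from kfp.client import Client


@dsl.component
def say_hello() -> str:
    return "hello"


@dsl.component
def print_message(message: str):
    print(message)


@dsl.pipeline
def inner_pipeline() -> str:
    return say_hello().output


@dsl.pipeline
def outer_pipeline():
    inner = inner_pipeline()
    print_message(message=inner.output)


if __name__ == "__main__":
    Client().create_run_from_pipeline_func(outer_pipeline, enable_caching=False)
```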
I also tried to explicitly access the output with
but I get the same error.
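The exact expression isn't shown above; for illustration, explicit access by output key would look something like this (hypothetical names, reusing the sketch under "Steps to reproduce"):

```python
# Hypothetical: explicit access by key; "Output" is the default key for a
# single unnamed str return value in KFP v2.
print_message(message=inner.outputs["Output"])
```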
Expected result
Successful execution
Materials and reference
Labels
Impacted by this bug? Give it a 👍.