[kubeflow 1.8] Kubeflow 1.8 Tracking Issue #2442

Closed
5 tasks done
DnPlas opened this issue Apr 25, 2023 · 72 comments

Comments

@DnPlas
Contributor

DnPlas commented Apr 25, 2023

This issue will provide high-level updates on the Kubeflow 1.8 release.

TODO:

  • Create an issue for the release timeline proposal:
    First release timeline proposal
  • Add a directory for 1.8 release in releases
  • Update the release team md file
  • Create a 1.8 Release Team Board in GitHub
  • Add link of the release timeline on this description

cc: @kubeflow/release-team @jbottum

@helloericsf

After evaluating our engineering roadmap and priorities, the BentoML team has decided to pause the integration with Kubeflow Pipelines in the 1.8 release.
We value our collaboration with Kubeflow and apologize for any disruption this causes. We believe pausing the integration temporarily is the right decision to ensure we deliver quality features. We look forward to resuming our work together in future releases.
I wanted to let you know about this decision openly and transparently. Please reach out if you have any questions or concerns. As the serving work group liaison, I do not anticipate any changes in my role or responsibilities.
cc @DnPlas

@DnPlas
Contributor Author

DnPlas commented Jul 24, 2023

Hi community, as we approach our feature freeze (Aug 2nd), I think it is worth asking about anything that you folks think will require more time to be completed before that date. The release team liaisons have been doing an excellent job at communicating with WG leads, but I extend the question to the rest of the community.

cc: @kubeflow/wg-automl-leads @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-pipeline-leads @jbottum

@juliusvonkohout
Member

@tzstoyanov will commit the Istio 1.18 upgrade in #2455 tomorrow, and I hope that the other additional rootless stuff will not be relevant for the feature freeze, as written down in the PR description.

@DnPlas
Contributor Author

DnPlas commented Jul 28, 2023

@adriangonz FYI, just so you have the dates and all information about the upcoming release.

@kimwnasptd
Member

Hey @DnPlas, I'd like to sincerely ask that we have a 1 week delay. The situation with Notebooks WG is the following:

  1. I'd really like to include Add PVCViewer Frontend Integration kubeflow#7179, since it's the last piece to include @TobiasGoerke's effort
  2. There's an issue with our CI and it's failing to build images, which we are tracking in crud-web-apps can't be build for ppc64le kubeflow#7226

The plan I have in mind for Notebooks WG is to evaluate as soon as possible how we can unblock the CI and do the review on the PVCViewer integration. I'm expecting this to take until this Friday, 4th August.

Then on Monday we cut our v1.8-branch in the kubeflow/kubeflow repo and make some small PRs we need to build all the images and update the manifests. Even though the PRs are small, I'd expect this to take 1 day due to the async communication.

Lastly, these are the PRs that I'd also want to finalise during the feature freeze, but am OK to not delay the release for those and cherry-pick afterwards:

@DnPlas
Contributor Author

DnPlas commented Aug 7, 2023

ACK @kimwnasptd, I'll share this information with the release team.

@kimwnasptd
Member

@DnPlas @NohaIhab from Notebooks WG side we've:

  1. Merged the PR we wanted for the Volumes UI Add PVCViewer Frontend Integration kubeflow#7179
  2. Fixed the CI resolve build exception generated due to latest gevent version. kubeflow#7231

Over the next couple of days we'll proceed with cutting the release branch and updating our manifests for the RC.

@DnPlas
Contributor Author

DnPlas commented Aug 7, 2023

@kimwnasptd thanks for the update, please keep the team posted as we are planning to finish the manifest sync next week (Wednesday).

cc: @NohaIhab

@yhwang
Member

yhwang commented Aug 10, 2023

@DnPlas I just created a PR to update the Kubeflow Tekton Pipelines manifest to 2.0.0: #2500
cc @Tomcli

@DnPlas
Contributor Author

DnPlas commented Sep 12, 2023

Hi folks, I would like to announce that the Kubeflow 1.8 RC.0 is out 🎉 and that we have started with manifest testing. We expect to finish this process by the end of this week (September 15th).

I'd like to encourage community members to start testing the release and provide feedback, as well as file issues if any. I would also like to remind all Distribution owners that once Manifest Testing ends we will begin Distribution Testing on September 15th (as soon as we have the results of manifest testing). Please get your Distributions and infrastructure ready for that stage.

I also want to take the opportunity to thank all the community members who have helped with getting to this stage. Let's keep working toward a successful release!

@DnPlas
Contributor Author

DnPlas commented Sep 21, 2023

The release team is happy to announce that we have released Kubeflow 1.8 RC.1.

This is now the time for Distributions to start with their testing. The release team kindly asks for feedback by the EOW next week. Feel free to submit issues and comment on the various WGs repositories.

I'd like to encourage community members to start testing the release and provide feedback, as well as file issues if any.

I also want to take the opportunity to thank all the community members who have helped with getting to this stage. Let's keep working toward a successful Release!

@pmuilu

pmuilu commented Sep 21, 2023

So will kfp v2 still be broken with KF 1.8? (At least #8733 is still open)

@DnPlas
Contributor Author

DnPlas commented Sep 22, 2023

So will kfp v2 still be broken with KF 1.8? (At least #8733 is still open)

Hey @chensun @zijianjoy, pinging you folks to get more accurate information. Do you think this issue deserves more attention?

@kromanow94
Contributor

Hello, I have a few issues with 1.8 based on my tests on Kubeflow 1.8 RC.1:

  • If the Pipeline Run Steps are Pending, there is an error shown
    Cannot get MLMD objects from Metadata store.
    
    This is resolved after the Pod for any of the Steps is created (so if I get this right, after metadata-writer creates an entry in MLMD DB based on Running Pod).


  • The Pipeline Run Page doesn't show the main-logs output artifact, which makes it impossible to access logs from the Steps after the Pod has been deleted.
  • In the Pipeline Run Page it's not possible to access the input and output artifacts due to an RBAC Access Denied error. Those artifacts are visible through the Artifacts Page though.


In general I feel that the previous version of the KF Pipeline UI was more informative. For example, if a Step's Pod was Pending, it was possible to see the reason for the Pending state in the Pipeline Run Page.

Is there a way to re-enable the main-logs artifact?

@Davidnet
Member

Tagging @Linchin for visibility. I don't know if we should create an issue in kfp, but I think kfp is working as expected? Let me know what you think.

@kromanow94
Contributor

@Davidnet from the perspective of running the Steps in the Pipeline Run, I think kfp is working as expected. It's more about the dashboard, although I'm not sure if the exception Cannot get MLMD objects from Metadata store is expected. Is the Pipeline Run dashboard maintained by the Pipelines WG or another one?

@Linchin

Linchin commented Sep 26, 2023

Hi @kromanow94, thank you for your feedback. I haven't completed testing rc.1 yet, but I think with my rc.0 deployment I could answer some of your questions.

If the Pipeline Run Steps are Pending, there is an error shown Cannot get MLMD objects from Metadata store.

I have reproduced this error and I will investigate further into it.

The Pipeline Run Page doesn't show the main-logs

This is a v1 feature that is not implemented in v2. I have created an issue in the KFP repo about this.

it's not possible to access the input and output artifacts due to RBAC Access Denied error.

I haven't been able to reproduce this on an rc.0 deployment, but I will double check on rc.1.

@thesuperzapper
Member

thesuperzapper commented Sep 29, 2023

@kimwnasptd we need to make sure that ARM support gets merged before the final 1.8 RC:


Also, given how many issues and pending PRs are not going to make it for Kubeflow 1.8, I propose we plan to decouple versions of the kubeflow/kubeflow repo components from the overall Kubeflow 1.X versions.

This will allow us to cut a 1.9.0 release (of the Notebooks WG components) with some of the important fixes/features without waiting literally months. Anyone interested can discuss this proposal here:

@DnPlas
Contributor Author

DnPlas commented Oct 3, 2023

Hi @kromanow94, thank you for your feedback. I haven't completed testing rc.1 yet, but I think with my rc.0 deployment I could answer some of your questions.

If the Pipeline Run Steps are Pending, there is an error shown Cannot get MLMD objects from Metadata store.

I have reproduced this error and I will investigate further into it.

The Pipeline Run Page doesn't show the main-logs

This is a v1 feature that is not implemented in v2. I have created an issue in the KFP repo about this.

it's not possible to access the input and output artifacts due to RBAC Access Denied error.

I haven't been able to reproduce this on an rc.0 deployment, but I will double check on rc.1.

hey @Linchin , thanks for your reply. Should we consider this issue as a blocker for the release? If so, should we expect an RC2 from the pipelines WG to fix this?

@kimwnasptd
Member

Regarding Notebooks, we are very close to merging the following 2 and would like to ask that we wait one day tops to get those in.

@yhwang
Member

yhwang commented Oct 5, 2023

@DnPlas I'd like to update kfp-tekton from 2.0.0 to 2.0.1 and here is the PR: #2545 . Thanks!

@DnPlas
Contributor Author

DnPlas commented Oct 26, 2023

Hi folks,

Just so you know, RC4 is ready!

@DnPlas
Contributor Author

DnPlas commented Oct 26, 2023

I've been experimenting with 1.8-rc2 and kfp=2.3.0 and have run into several issues that are blockers for me and may be for others as well. A list of them is as follows:

1. Unable to use `.after()` when referencing a component within a `ParallelFor`.  [Issue](https://github.com/kubeflow/pipelines/issues/10050) with more details here.

2. Cannot assign variables from kubernetes metadata to be environment variables as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10155) with more details here.

3. Cannot assign dynamic node_selector, cpu/memory requests/limits as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10154) with more details here.

4. Unable to pass data artifacts from parent node outside a `ParallelFor` to a child node within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10149) with more details here.

Happy to provide more info if any of these are not clear!

Hi @TristanGreathouse, thanks for the feedback and for filing those issues. It sounds like an SDK issue, but I'd suggest you try RC4 to make sure you have received the latest version of pipelines (2.0.2).

Soft ping to @chensun as he is the WG lead.

@chensun
Member

chensun commented Oct 26, 2023

I've been experimenting with 1.8-rc2 and kfp=2.3.0 and have run into several issues that are blockers for me and may be for others as well. A list of them is as follows:

  1. Unable to use .after() when referencing a component within a ParallelFor. Issue with more details here.
  2. Cannot assign variables from kubernetes metadata to be environment variables as we could with kfp=1.8.21 and earlier backend versions. Issue with more details here.
  3. Cannot assign dynamic node_selector, cpu/memory requests/limits as we could with kfp=1.8.21 and earlier backend versions. Issue with more details here.
  4. Unable to pass data artifacts from parent node outside a ParallelFor to a child node within a ParallelFor. Issue with more details here.

Happy to provide more info if any of these are not clear!

Thank you @TristanGreathouse for the detailed issues.

  • 1 is by design: we explicitly block after references across DAGs (ParallelFor, Condition, ExitHandler are all DAGs under the hood). I do see we may be able to lift this restriction for ParallelFor, will follow up in the issue. This, if we were to change it, would be an SDK-only change, so I think we don't need to tie it to the Kubeflow deployment release.
  • 2 and 3 are both known limitations. We don't have a timeline for supporting these features at this moment, but if there's interest in contributing, we'd be happy to follow up in the issues.
  • 4 is a bug and I made a fix. Will do a patch release shortly. @DnPlas, I'll get back to you once this release is out. (ETA by end of this week)
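
To make the restriction in point 1 concrete, here is a rough toy model in plain Python. It is not KFP code: the `Task` class, its `dag` field, and the error message are all hypothetical, and the only idea borrowed from the comment above is that each ParallelFor, Condition, or ExitHandler is its own DAG and that an after reference may not cross from one DAG into another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical stand-in for a pipeline task; NOT a KFP class."""
    name: str
    dag: str  # DAG that owns the task: the root, or a ParallelFor/Condition/ExitHandler sub-DAG
    upstream: List[str] = field(default_factory=list)

    def after(self, other: "Task") -> "Task":
        # Toy rule (assumption): ordering edges are only allowed inside one DAG.
        if other.dag != self.dag:
            raise ValueError(
                f"{self.name!r} (in {self.dag!r}) cannot run after "
                f"{other.name!r} (in {other.dag!r}): tasks belong to different DAGs"
            )
        self.upstream.append(other.name)
        return self


prep = Task("prep", dag="root")
train = Task("train", dag="for-loop-1")  # lives inside a ParallelFor sub-DAG
report = Task("report", dag="root")

report.after(prep)  # same DAG: allowed

try:
    report.after(train)  # crosses into the ParallelFor sub-DAG: rejected
    crossed = True
except ValueError as err:
    crossed = False
    message = str(err)
    print(message)
```

In this sketch, lifting the restriction for ParallelFor only would amount to relaxing the check in `after()` for that one kind of sub-DAG, which is consistent with the remark that such a change would live in the SDK.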

@TristanGreathouse

TristanGreathouse commented Oct 26, 2023

Thanks @DnPlas and @chensun for the responses!

  • 1 Thanks, this is good to know!
  • 2 & 3 Yeah I am very interested in contributing and have other members on my team that have expressed interest (CC: @sachdevayash1910 @catapulta). We can start looking into fixes for this, although I'm not yet super familiar with the codebase. If you already have suggestions on where to begin please let us know :)
  • 4 Wow, this is great news! Thanks!

@chensun
Member

chensun commented Oct 27, 2023

@DnPlas I just cut KFP 2.0.3: https://github.com/kubeflow/pipelines/releases/tag/2.0.3
Can you please update and make another RC? Thank you!

@DnPlas
Contributor Author

DnPlas commented Oct 27, 2023

Thanks @chensun for the fast reply! Here's the PR for syncing all manifests.

I'll use this opportunity to check on other WGs as we are approaching the release date. Please let us know if there is a need for another manifest sync in the following hours as next week would be too late to integrate changes.

cc: @chensun @kimwnasptd @thesuperzapper @andreyvelich @johnugeorge @yuzisun

@DnPlas
Contributor Author

DnPlas commented Oct 27, 2023

Hi everyone, RC5 is here!

@yhwang
Member

yhwang commented Oct 28, 2023

Hi @DnPlas
Because of KFP 2.0.3, it would be good to also update kfp-tekton to 2.0.3. I have a PR here: #2563

Having kfp-tekton 2.0.3 in the final RC would be great. Thanks!

@DnPlas
Contributor Author

DnPlas commented Oct 28, 2023

@yhwang ack! I will include it in the final release.

@DnPlas
Contributor Author

DnPlas commented Oct 31, 2023

Hi community,

In preparation for the release, the release team would like to provide a last chance to give feedback and file issues you consider potential blockers. As we approach the release (Nov 1st, 2023), we'd like to ensure we don't leave unattended any issues that may have come up in RC5.
Thanks to all the people that already provided feedback!

@kromanow94
Contributor

Hi @DnPlas , great news, thanks!

Is it possible that the [feature] display main.log as artifact for each step will be done within this release? On platforms running hundreds or thousands of KF Pipelines every day, there is a high chance that Garbage Collection on Argo WF is configured, which means the logs are available only for as long as the Garbage Collection keeps the Pods. Also, being able to see the historical execution for reference for longer than a few hours is something that was already possible in prior versions, so this is a feature degradation.

@TristanGreathouse

@DnPlas I just filed this issue last night in which ParallelFor / Sub-DAG artifacts from component in the first DAG get overwritten by the same component (different inputs) in a duplicate sub-dag template. This basically means sub-dags cannot be repeated (albeit with different inputs/outputs) within the root dag.

@DnPlas
Contributor Author

DnPlas commented Oct 31, 2023

Hi @DnPlas , great news, thanks!

Is it possible that the [feature] display main.log as artifact for each step will be done within this release? On platforms running hundreds or thousands of KF Pipelines every day, there is a high chance that Garbage Collection on Argo WF is configured, which means the logs are available only for as long as the Garbage Collection keeps the Pods. Also, being able to see the historical execution for reference for longer than a few hours is something that was already possible in prior versions, so this is a feature degradation.

Hi @kromanow94 this is a feature that has its own tracking issue, but since we have already closed the call window for adding features, I'm afraid this cannot be included in this release. Pinging @Linchin and @chensun so they are aware of this for the next release cycle.

@DnPlas
Contributor Author

DnPlas commented Oct 31, 2023

@DnPlas I just filed this issue last night in which ParallelFor / Sub-DAG artifacts from component in the first DAG get overwritten by the same component (different inputs) in a duplicate sub-dag template. This basically means sub-dags cannot be repeated (albeit with different inputs/outputs) within the root dag.

hey @chensun, do you mind confirming if this is a blocking issue?

@DnPlas
Contributor Author

DnPlas commented Oct 31, 2023

Just a quick update on the RC5 testing: the release team has executed the e2e test as complementary testing and the results are positive. This, plus the testing performed by distributions and users, has helped to provide a better version of Kubeflow.

Here are the results of the tests:

(Tested on K8s 1.25)

Run ID:  5ef0267f-8b4f-45a2-8dc2-8f7ee5192ab8
Waiting for mnist-e2e experiments.kubeflow.org to get created...
mnist-e2e experiments.kubeflow.org got created.
Waiting for mnist-e2e tfjobs.kubeflow.org to get created...
mnist-e2e tfjobs.kubeflow.org got created.
Waiting for mnist-e2e tfjobs.kubeflow.org to succeed...
mnist-e2e tfjobs.kubeflow.org succeeded.
Waiting for mnist-e2e experiments.kubeflow.org to succeed...
mnist-e2e experiments.kubeflow.org succeeded.
Waiting for mnist-e2e inferenceservices.serving.kserve.io to get created...
mnist-e2e inferenceservices.serving.kserve.io got created.
Waiting for mnist-e2e inferenceservices.serving.kserve.io to succeed...
mnist-e2e inferenceservices.serving.kserve.io succeeded.
Cleaning up opened processes.
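
For readers unfamiliar with this kind of output, the "Waiting for ... / ... got created." pairs in the log above look like a poll-until-ready loop. The following is only a minimal sketch of that pattern in Python; the `wait_for` helper, its parameters, and the fake status check are hypothetical and are not taken from the actual e2e test script.

```python
import time
from typing import Callable, List


def wait_for(resource: str, check: Callable[[], bool], waiting: str, done: str,
             timeout_s: float = 5.0, interval_s: float = 0.01) -> List[str]:
    """Poll check() until it returns True and return the log lines produced."""
    log = [f"Waiting for {resource} to {waiting}..."]
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{resource} did not {waiting} within {timeout_s}s")
        time.sleep(interval_s)
    log.append(f"{resource} {done}.")
    return log


# Fake status source for the demo (assumption): reports "ready" on the third poll.
polls = {"count": 0}


def fake_check() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


lines = wait_for("mnist-e2e tfjobs.kubeflow.org", fake_check,
                 waiting="succeed", done="succeeded")
print("\n".join(lines))
```

A real test would replace `fake_check` with a query against the cluster for the resource's status; the timeout keeps a stuck resource from hanging the run.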

@DnPlas
Contributor Author

DnPlas commented Oct 31, 2023

In preparation for the release, these are the components that must be synced into this repository:

The following documentation PRs have to be merged as well:

@johnugeorge
Member

https://github.com/kubeflow/training-operator/releases/tag/v1.7.0 is created from v1.7 branch

@chensun
Member

chensun commented Nov 1, 2023

@DnPlas I just filed this issue last night in which ParallelFor / Sub-DAG artifacts from component in the first DAG get overwritten by the same component (different inputs) in a duplicate sub-dag template. This basically means sub-dags cannot be repeated (albeit with different inputs/outputs) within the root dag.

hey @chensun, do you mind confirming if this is a blocking issue?

Agree this is a bug, but probably not significant enough to block/delay the release. Will follow up in the issue.

@DnPlas
Contributor Author

DnPlas commented Nov 2, 2023

Hello Kubeflow Community,

The release team is happy to announce that Kubeflow 1.8 is now available. The release highlights can be found in the Kubeflow 1.8 blog post, and the specifics about components and their versions are in the Kubeflow 1.8 website entry.

We'd like to thank everyone in the community for their support during this release cycle: planning, developing, testing, and providing feedback. It would not have been possible without the support of such a great community.

If you'd like to provide feedback on the release cycle, we will have retrospective sessions in the next two Release Team Meetings at 18:00 CEST every Monday (Community Calendar).

Kubeflow 1.8 Release Team

@szymek116

@DnPlas

KFP 2.0.2 tag is out: https://github.com/kubeflow/pipelines/releases/tag/2.0.2 Can we pull its manifests into the next RC? Thanks!

Hi!
I see this was finally included in 1.8, but for me it still does not fix the issue. When running a non-root image I still get:

F1112 19:54:25.274010 34 main.go:49] failed to execute component: unable to create directory "/minio/mlpipeline/v2/artifacts/hello-pipeline/605cef82-4331-4d0b-8f34-888485a3bb77/html-visualization" for output artifact "html_artifact": mkdir /minio: permission denied

Is anything more needed for this to be fixed? I saw an open issue on that, but it is pretty much the same stuff that was merged, at least around file permissions on that folder, which I believe was the root cause:

kubeflow/pipelines#6530

I see that the 1.9 agenda (kubeflow/kubeflow#6662) has a point to try out kfp v2 rootless; I guess that's what I have done :) I can provide more details on my test, but maybe I should create a separate ticket for this?

@DnPlas
Contributor Author

DnPlas commented Nov 13, 2023

hi @szymek116, I will let @chensun confirm this information as it is very specific to pipelines. Chen, it would be great if you could also confirm if this will require a patch release from the release team.

cc: @Davidnet

@rvadim

rvadim commented Nov 15, 2023

Hi! I can confirm permission problems when running [Tutorial] DSL in a rootless env.

ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/.local'
Check the permissions.
I1115 09:26:26.725677      36 launcher_v2.go:151] publish success.
F1115 09:26:26.725728      36 main.go:49] failed to execute component: exit status 1
time="2023-11-15T09:26:27.224Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
time="2023-11-15T09:26:28.153Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
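
One possible reading of the '/.local' path in the log above, offered as an assumption rather than a confirmed diagnosis: on Linux, pip's per-user install location defaults to ~/.local, so if pip fell back to a per-user install and the container ran with HOME set to / (neither of which is stated in this thread), the target would resolve to /.local, which a non-root user normally cannot create. The small sketch below only shows how ~/.local expands for different HOME values; `user_base_for` is a hypothetical helper.

```python
import os
import posixpath


def user_base_for(home: str) -> str:
    """Return where '~/.local' resolves for a given HOME value (POSIX rules)."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return posixpath.expanduser("~/.local")
    finally:
        # Restore the caller's environment so the demo has no side effects.
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved


print(user_base_for("/"))          # /.local
print(user_base_for("/tmp/work"))  # /tmp/work/.local
```

If that reading is right, pointing HOME at a directory the container user can write to would move the install target; whether that is the appropriate fix for the tutorial pipeline is a question for the kubeflow/pipelines issue.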

@chensun
Member

chensun commented Nov 16, 2023

@DnPlas
KFP 2.0.2 tag is out: https://github.com/kubeflow/pipelines/releases/tag/2.0.2 Can we pull its manifests into the next RC? Thanks!

Hi! I see this was finally included in 1.8, but for me it still does not fix the issue. When running a non-root image I still get:

F1112 19:54:25.274010 34 main.go:49] failed to execute component: unable to create directory "/minio/mlpipeline/v2/artifacts/hello-pipeline/605cef82-4331-4d0b-8f34-888485a3bb77/html-visualization" for output artifact "html_artifact": mkdir /minio: permission denied

Is anything more needed for this to be fixed? I saw an open issue on that, but it is pretty much the same stuff that was merged, at least around file permissions on that folder, which I believe was the root cause:

kubeflow/pipelines#6530

I see that the 1.9 agenda (kubeflow/kubeflow#6662) has a point to try out kfp v2 rootless; I guess that's what I have done :) I can provide more details on my test, but maybe I should create a separate ticket for this?

Rootless kfp is a proposal and contribution from the community that the Pipeline WG agreed to and reviewed. Regrettably, it did not reach completion. We are still open to community assistance in finishing this feature, in which case we can consider a patch release. However, in the absence of community contribution and considering our existing resource constraints and priorities, there's no timeline for shipping this feature.

@DnPlas
Contributor Author

DnPlas commented Nov 16, 2023

@rvadim that sounds like a pipelines example specific issue, do you mind filing an issue in kubeflow/pipelines so that it gets more visibility?

cc: @chensun @Davidnet

@DnPlas
Contributor Author

DnPlas commented Nov 16, 2023

Closing this issue as the release cycle for 1.8 has come to an end. Thanks to the community that participated by testing and providing feedback during the various phases of the release.

If you need to reach out, please join us in the Community Meeting every Tuesday (check the Calendar for more details).
