Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): remove ArtifactGC finalizer when no artifacts. Fixes #13499 #13500

Merged
merged 1 commit into from
Aug 26, 2024

Conversation

chengjoey
Copy link
Contributor

@chengjoey chengjoey commented Aug 24, 2024

Fixes #13499

Motivation

if workflow pods length is zero, anyPodSuccess should be true, then the removeFinalizer can be executed without force

Modifications

anyPodSuccess := len(pods) == 0

Verification

run examples/artifact-gc-workflow.yaml on mac m1 os, and then delete the workflow

@chengjoey chengjoey changed the title fix delete gc-artifact workflow stuck due to finalizers can not be removed fix #13499 fix(controller): delete gc-artifact workflow stuck due to finalizers can not be removed fix #13499 Aug 24, 2024
@agilgur5 agilgur5 changed the title fix(controller): delete gc-artifact workflow stuck due to finalizers can not be removed fix #13499 fix(controller): remove ArtifactGC finalizer when Pods completed. Fixes #13499 Aug 24, 2024
Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is quite correct. A length of 0 could also mean that an ArtifactGC Pod has yet to run.

In your issue, the ArtifactGC Pod failed, and this is exactly what the forceFinalizerRemoval field is for

@chengjoey
Copy link
Contributor Author

chengjoey commented Aug 24, 2024

// Do artifact GC if task result reconciliation is complete.
if woc.wf.Status.Fulfilled() {
if err := woc.garbageCollectArtifacts(ctx); err != nil {
woc.log.WithError(err).Error("failed to GC artifacts")
return
}
} else {
woc.log.Debug("Skipping artifact GC")
}

hi @agilgur5 , artifacts gc will only be running when task reconciliation is completed. I think the pod has finished running at this time. and i think this is a normal operation to delete the workflow, no need to force, increase the cost of the deletion operation, because this is not a gc problem that requires forced deletion

@agilgur5
Copy link
Contributor

agilgur5 commented Aug 24, 2024

TaskResult reconciliation should already be completed. In your issue, I'm pretty sure the ArtifactGC Pod itself fails (due to the architecture). As it failed, the Controller cannot be sure that the artifact was deleted or not

@agilgur5
Copy link
Contributor

agilgur5 commented Aug 24, 2024

I'm pretty sure the ArtifactGC Pod itself fails (due to the architecture)

Hmm, no, ArtifactGC can run on arm64; your Pod that fails due to arch is the argosay image's arch -- see #11613 (comment) .

artifacts gc will only be running when task reconciliation is completed

TaskResult reconciliation should already be completed

Or you're experiencing a different race condition from v3.5.3+ / v3.5.5+. This actually sounds like the root cause is #12993, which was recently fixed in #13454

@chengjoey
Copy link
Contributor Author

chengjoey commented Aug 24, 2024

I'm pretty sure the ArtifactGC Pod itself fails (due to the architecture)

Hmm, no, ArtifactGC can run on arm64; your Pod that fails due to arch is the argosay image's arch -- see #11613 (comment) that I was going to fix recently as I may have permissions to the image repositories.

Yes, I know that the reason why I failed to run is because the image argoproj/argosay:v2 in the examples does not support the linux/arm64 architecture, not because artifact-gc cannot run in arm64. I raised this PR because my workflow failed and no artifact was generated, so there is no need to do GC. I hope to be able to delete the workflow directly without adding additional operations force

@agilgur5
Copy link
Contributor

I raised this PR because my workflow failed and no artifact was generated, so there is no need to do GC.

Yes, the second half of my previous comment is relevant with regard to TaskResult reconciliation. This looks like it is caused by a bug in TaskResult reconciliation and not specific to ArtifactGC

@chengjoey
Copy link
Contributor Author

chengjoey commented Aug 24, 2024

thanks @agilgur5 , i got it

@agilgur5 agilgur5 added area/artifacts S3/GCP/OSS/Git/HDFS etc area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more labels Aug 25, 2024
@agilgur5
Copy link
Contributor

agilgur5 commented Aug 25, 2024

Ok this makes more sense now given your last comment on the issue, thanks again for elaborating!

From my first comment above though:

A length of 0 could also mean that an ArtifactGC Pod has yet to run

I think we could potentially remove this variable entirely and just replace with a check that all ArtifactGC Pods were recouped?

Otherwise it looks like anyPodSuccess is only used for the one finalizer removal check you linked to, so perhaps want to invert that or rename the variable if it's still needed.

cc @juliev0 here on the PR as well

@agilgur5 agilgur5 requested a review from juliev0 August 25, 2024 03:14
@juliev0 juliev0 self-assigned this Aug 25, 2024
@juliev0
Copy link
Contributor

juliev0 commented Aug 25, 2024

Thanks for finding this.

So, I commented in the issue here, but will add here as well:

Looking at the anyPodSuccess code, I'm pretty sure it was probably just for the purpose of eliminating unnecessary work, to avoid doing the woc.allArtifactsDeleted() check if none of the Artifact GC Pods had even finished and succeeded yet. We can remove it if we need to. I'm looking at the logic for woc.allArtifactsDeleted() and it actually doesn't look that expensive , so I'd be okay to remove the anyPodSuccess altogether if it makes the code simpler.

I'm surprised we haven't encountered this issue before, of a Workflow that creates no artifacts.

@chengjoey
Copy link
Contributor Author

thanks for your patience @agilgur5 @juliev0 , In this PR I removed anyPodSuccess, allArtifactsDeleted is enough to ensure that artifacts can be GCed

@juliev0
Copy link
Contributor

juliev0 commented Aug 26, 2024

thanks for the modification! The e2e tests are sometimes flakey, so if the e2e test failures are unrelated to your changes, feel free to push empty commits until it passes

@chengjoey chengjoey force-pushed the fix/workflow-gc-stuck branch 2 times, most recently from 5ecda25 to 0f672b2 Compare August 26, 2024 02:30
@agilgur5
Copy link
Contributor

agilgur5 commented Aug 26, 2024

so I'd be okay to remove the anyPodSuccess altogether if it makes the code simpler.

@juliev0 did you see my comment above? I thought the check might be to make sure that there were no leftover ArtifactGC Pods that were not yet recouped. But if it's even simpler than that, that works

feel free to push empty commits until it passes

Given that we have 2 approvers here, it'd be more efficient for one of us to just rerun the unsuccessful jobs (and any member can do /retest now)

I'm surprised we haven't encountered this issue before, of a Workflow that creates no artifacts.

It's very possible people have been using forceFinalizerRemoval to workaround this. Although I also just think there aren't that many ArtifactGC users yet.

Speaking of, it'd be nice to add an E2E test for this

Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengjoey do you think you could add a regression test for this scenario? If I'm not mistaken, you should be able to use any main container that intentionally fails while having output.artifacts specified

@chengjoey
Copy link
Contributor Author

Ok, I will add an e2e test

@chengjoey chengjoey force-pushed the fix/workflow-gc-stuck branch 2 times, most recently from cc1c58c to cff95ce Compare August 26, 2024 03:52
@juliev0
Copy link
Contributor

juliev0 commented Aug 26, 2024

I thought the check might be to make sure that there were no leftover ArtifactGC Pods that were not yet recouped.

Hmm, does it accomplish that? I'm having trouble following if so. I think it was just to eliminate work.

Thanks for adding the test @chengjoey. If I'm not mistaken, I think you should be able to add it as another case to this test. This test verifies that the finalizer is removed by checking that the workflow actually gets deleted.

@agilgur5 agilgur5 changed the title fix(controller): remove ArtifactGC finalizer when Pods completed. Fixes #13499 fix(controller): remove ArtifactGC finalizer when no artifacts. Fixes #13499 Aug 26, 2024
test/e2e/artifacts_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly consistency comments on the test below

@chengjoey
Copy link
Contributor Author

mostly consistency comments on the test below

thanks @agilgur5 , all has done

Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the iteration!

Will let Julie take another look at the tests before merging

@juliev0
Copy link
Contributor

juliev0 commented Aug 26, 2024

Yes, thank you @chengjoey !

@juliev0 juliev0 merged commit 735739f into argoproj:main Aug 26, 2024
28 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Sep 5, 2024
Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot delete failed Workflow due to ArtifactGC finalizer
3 participants