fix: ArtifactGC Fails for Stopped Workflows. Fixes #11879 #11947
Conversation
argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
…ards argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
…ction to ensure that stopped workflow artifacts are garbage collected. Remove bug in e2e testing WaitForWorkflow() timeout option in when.go. Improve robustness of TestStoppedWorkflow e2e test. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
@juliev0 Could you help review when you get a chance?
Yes. It may be more than a couple of weeks from now, though, as I have a vacation followed by a Hackathon week at work. (It looks like this is still in "Draft" state as well.) BTW, thanks for doing this @Garett-MacGowan!
test/e2e/fixtures/when.go
Outdated
```diff
@@ -258,7 +258,7 @@ func (w *When) WaitForWorkflow(options ...interface{}) *When {
 	for _, opt := range options {
 		switch v := opt.(type) {
 		case time.Duration:
-			timeout = v - 30*time.Second + defaultTimeout
+			timeout = v
```
curious why you ended up changing this? It does seem less kludgey and more intuitive than the old way, but I also see it was changed on purpose here. defaultTimeout was set to the E2E_TIMEOUT environment variable, or otherwise 30 seconds. Looking at that PR, it seems like in the CI the E2E_TIMEOUT was specifically made longer than the default 30 seconds back then. So, there was some notion that all tests should be delayed by x amount if that environment variable is set. (Your change won't make use of the environment variable if some overall offset is needed.) Of course, now I see the CI doesn't use that environment variable but instead uses something called E2E_SUITE_TIMEOUT. cc @agilgur5
In any case, I do see the time.Duration comment above isn't correct as far as "30s by default".
I changed this because I was using different timeouts to debug my test case and I noticed it wasn't using my manually specified timeout. Subtracting 30 seconds and adding the default timeout to a manually specified value makes no sense to me.
Here's what I'm thinking: this environment variable, which used to be called E2E_TIMEOUT and is now called E2E_WAIT_TIMEOUT, appears to be something that will allow things to run more slowly in certain environments. Is it possible to both 1. preserve the ability to run everything more slowly across the board in certain environments, and 2. have the parameters/comments be named in an intuitive way so that it's not a surprise to the caller of the function what the actual behavior is?
(Sorry, I know this wasn't the original intention of this particular issue, but I'm thinking we shouldn't remove the intention of the E2E_WAIT_TIMEOUT environment variable.)
This line, timeout = v - 30*time.Second + defaultTimeout, only matters when a manual timeout value is specified. So it's basically saying "add the default timeout to the manually specified value" (ignoring the -30, which makes no sense). In this case, renaming the variable to include bias makes sense. It makes less sense when it's used in the true default case, timeout := defaultTimeout.
What about this small modification?

```go
var timeoutBias = env.LookupEnvDurationOr("E2E_WAIT_TIMEOUT_BIAS", 0*time.Second)
var defaultTimeout = env.LookupEnvDurationOr("E2E_WAIT_TIMEOUT", 60*time.Second) + timeoutBias
```

In the method:

```go
timeout := defaultTimeout
```

And then in the switch statement, you could add the bias:

```go
timeout = v + timeoutBias
```
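Putting the suggestion together, the option handling in when.go might end up looking roughly like the sketch below. This is only an illustration that assumes the E2E_WAIT_TIMEOUT / E2E_WAIT_TIMEOUT_BIAS variables proposed above and the existing When/env helpers in the fixtures package; it is not the exact code.

```go
var (
	timeoutBias    = env.LookupEnvDurationOr("E2E_WAIT_TIMEOUT_BIAS", 0*time.Second)
	defaultTimeout = env.LookupEnvDurationOr("E2E_WAIT_TIMEOUT", 60*time.Second) + timeoutBias
)

func (w *When) WaitForWorkflow(options ...interface{}) *When {
	// The env-configurable default already includes the bias.
	timeout := defaultTimeout
	for _, opt := range options {
		switch v := opt.(type) {
		case time.Duration:
			// A caller-supplied timeout is respected as-is, with only the global
			// bias added on top for slow environments.
			timeout = v + timeoutBias
		}
	}
	// ... wait for the workflow condition using `timeout` ...
	return w
}
```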
Yes, this.
Thank you! Just thinking: if you see any of the tests which call this and pass in a duration start to fail intermittently, it could possibly be that before, they were getting an additional 30 seconds added on (E2E_WAIT_TIMEOUT=60 - 30). If so, could you bump up the duration passed in?
Yeah, no problem.
workflow/controller/operator.go
Outdated
```diff
@@ -276,6 +266,19 @@ func (woc *wfOperationCtx) operate(ctx context.Context) {
 		woc.computeMetrics(woc.execWf.Spec.Metrics.Prometheus, localScope, realTimeScope, true)
 	}
 
+	// Reconciliation of Outputs (Artifacts). See reportOutputs() of executor.go
+	woc.taskResultReconciliation()
```
I do see this will now get executed in the case of woc.wf.Status.Phase == wfv1.WorkflowUnknown, which before it wouldn't. This may be right, but I want to make sure. I also like not having the if statement if it's unnecessary. I believe wfv1.WorkflowUnknown would happen for a just-submitted workflow or, it appears, a resubmitted workflow. It does seem like this probably won't ever be harmful - if there's a WorkflowTaskResult of the same name then we probably ought to use it, right?
(unless there's some weird case that a previous Workflow of the same name still had a WorkflowTaskResult lying around from before...but I believe a WorkflowTaskResult should get deleted at the same time as the Workflow it relates to)
@juliev0 if I'm not mistaken, this change resolved some intermittent test failures I was receiving. I believe the issue was that artifacts could be created and garbage collected while wfv1.WorkflowUnknown was set. This would cause taskReconciliation to not run and garbage collection to fail, since the node status wouldn't get updated with the artifact info.
huh - that's weird that artifacts were created and garbage collected while wfv1.WorkflowUnknown was set. I wonder if there could be a case of wfv1.WorkflowUnknown that I'm unaware of, if so. Anyway, it seems to be harmless.
Agreed. I'm not 100% on this, but as you say the change should be harmless.
Quick update here: irrespective of whether artifacts can be written when wfv1.WorkflowUnknown is set, the more important thing with this change is that the reconciliation runs even when the workflow is marked as complete as per woc.wf.Labels[common.LabelKeyCompleted] == "true".
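To visualize the ordering being described, here is a rough sketch of operate() after the change. Only taskResultReconciliation() and the completed label come from the discussion above; the rest of the body is elided and the comments are illustrative assumptions, not the real code.

```go
// Sketch only — not the real operate() body.
func (woc *wfOperationCtx) operate(ctx context.Context) {
	// ...

	// Reconciliation of Outputs (Artifacts). See reportOutputs() of executor.go
	woc.taskResultReconciliation()

	// Only after the reconciliation do we hit the path for workflows that are
	// already marked complete (and Artifact GC):
	if woc.wf.Labels[common.LabelKeyCompleted] == "true" {
		// ... previously this early return (and ArtifactGC before it) ran first,
		// so the node status never saw the reported artifacts ...
		return
	}

	// ...
}
```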
Failing test is
…ias for test timeouts in slow environments. Adjust WaitForWorkflow timeout parameter in retry_test.go (added 30 seconds) to be in line with the previously defined timeout before fixing the WaitForWorkflow custom timeout logic in commit 4ffeb5f. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
This was fixed by adding the 30s. Good catch @juliev0
Sadly my test (TestStoppedWorkflow) is failing intermittently.
Sorry to hear. Thank you!
I'm having a hard time debugging this. The failure rate is like 1/30+ test runs on my local machine. The artifacts are being saved, but the task reconciliation never receives the WorkflowTaskResults. Is there an easy way to write the wait container logs to a file in the same way that the controller logs are written to a file in the logs directory? Without this, it's basically impossible for me to debug it.
Interesting. So, is the difference between the state of the "latest" branch and this one that basically it fails for stopped workflows 1/30 times (at least in this unit test) vs 100%? Are you seeing it fail more often than that in the CI, though? As far as the wait container logs, no, I don't know of a wait container log artifact unfortunately. But maybe you could run something like:
Yeah, it should fail 100% of the time on the latest branch. I'll double-check tomorrow by undoing my changes locally and running the test a few times.
I will resubmit the CI job tomorrow to see. If it fails again, then yes.
Thanks for this, I will test it out.
I've confirmed the test fails 100% of the time on latest.
I ended up using the following command to watch the logs between multiple test runs so that I can rapid fire the tests until I hit a failure. Note that I added the
As a result of the above, I was able to confirm that the WorkflowTaskResults are being properly created. Maybe a fresh set of eyes can help. Here is an excerpt of the controller logs of a failure (this is from near the end of the test):
Here is an excerpt of the controller logs of a success (this is from near the end of the test):
For example, here is a result that was logged:
At this point, I'm peppering logs in taskResultReconciliation to see if there may be an issue there. Perhaps there is a bug in the reconciliation logic? Will follow up shortly.
Here are the logs; the ones I added are the debug statements mentioned above.
Is there something between
…rgoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
…orkflow deletions during testing (in case tests are running in parallel). Fixes argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
@juliev0 All tests passed this time, but I think it might be that ~29/30 chance.
Hmm, any chance that the finalizer-removal check is running before the reconciliation has updated the node statuses?
I am okay to merge something that improves things from 100% to “rarely” for now, if you’d like that.
Good point. I think it is possible, and it would explain the issue. The check for the finalizer removal clearly depends on the state of wf.Status.Nodes, which taskResultReconciliation is responsible for updating. Thoughts on a solution? Maybe we could set up a mutex so that the finalizer-removal check and the task result reconciliation can't interleave.
Edit: to prevent unnecessary performance degradation, we could only enforce the mutex once the workflow status is completed.
It's definitely in a better state than it was regardless of the above. If you're ok to merge it, so am I. It would be nice to get it working 100% though.
Hmm, I'd prefer it if we could avoid the complexity of a mutex.
Yes, it would be best if we could avoid the complexity of the mutex. Thanks, I appreciate the help.
Then looking at your log, the next time this is logged, the TaskResults have all been deleted? (I guess each one is owned by the corresponding Pod, right? So it gets deleted with the Pod.) Do you know what ensures that Pods/TaskResults only get deleted after they've been incorporated into the Workflow status?
@juliev0 Ahh, I see. I missed the logic regarding task result deletion. Currently, task results are deleted as soon as the workflow completes; this happens in persistUpdates() in operator.go (see the diff further down).
I think it may be sufficient to only delete the task results once the workflow has completed and the reconciliation has incorporated them into the node status.
I don't think we need to protect the pod from being deleted since, as we've discussed, the wait container should finish.
Thanks for finding that code. It looks like there may be 2 ways for TaskResults to get deleted: 1. the code you found, which would occur when the Workflow completes; 2. if the Pod that owns a TaskResult gets deleted, I believe the TaskResult will automatically get deleted through this OwnerReference. See here for information on that.
And a Pod could be deleted as soon as it completes if the PodGCStrategy is configured as such.
Thanks for the context. If that's the case, what are your thoughts on removing the OwnerReference? I think deletion is already handled by the woc.deleteTaskResults() method, no? I do like the idea of having it as a safeguard though. Alternatively, maybe we could set the OwnerReference to the Workflow? If so, any guidance on best practices for passing that reference? FYI, with the change described above, the tests have passed well over 180 times with no failure. This isn't testing with PodGC enabled, though.
Great. I think setting the OwnerReference to the Workflow seems good just in case. Note that of course there will be more TaskResults around with this change, but it seems okay to me, and also seems good for debugging. It looks like the workflow name is already being passed in as an environment variable here and passed to the Executor for you to use. Can you pass the Workflow UID as well? BTW, I just sent a message to Alex to see if there was ever some other intention for how to handle this.
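To make that concrete, here is a minimal sketch of what pointing the OwnerReference at the Workflow could look like on the executor side. The package name and the env var names below are assumptions for illustration (the comment above only says the workflow name is already passed to the executor and asks for the UID as well); only the OwnerReference-to-Workflow idea comes from the discussion.

```go
package executor // assumed package for illustration

import (
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// workflowOwnerReference builds an OwnerReference to the owning Workflow so that
// a WorkflowTaskResult survives Pod GC but is still cleaned up with the Workflow.
func workflowOwnerReference() metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion: "argoproj.io/v1alpha1",
		Kind:       "Workflow",
		Name:       os.Getenv("ARGO_WORKFLOW_NAME"),           // assumed env var name
		UID:        types.UID(os.Getenv("ARGO_WORKFLOW_UID")), // assumed env var name
	}
}

// ...and when the executor creates/upserts the task result:
// taskResult.OwnerReferences = []metav1.OwnerReference{workflowOwnerReference()}
```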
…deletion of task results. Change owner reference for task result to Workflow. Add artifactgc test for stopped workflow with pod gc on pod completion. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <[email protected]>
Ok, tests running well locally. Added a test case for PodGC OnPodCompletion. Here's the script I've been using in case anyone comes here in the future and needs to run them:
```go
	outputs := we.Template.Outputs.DeepCopy()
	outputs.Artifacts = append(outputs.Artifacts, logArtifacts...)
	return we.reportResult(ctx, wfv1.NodeResult{Outputs: outputs})
}

func (we *WorkflowExecutor) reportResult(ctx context.Context, result wfv1.NodeResult) error {
	if !result.Outputs.HasOutputs() && !result.Progress.IsValid() {
		return nil
```
hmm, I wonder if removing this check is okay or not. I know you wanted to be able to add the annotation, which is fine. But it does seem like if this were to resolve as true, we will do some of the things below which we weren't doing before. But maybe the check was just here to preemptively return in case there seemed to be nothing to do? (I'm not sure what causes Progress to be invalid.)
My assumption was that this was here to return early if there was nothing to do.
I guess progress can be invalid if there are no tasks. Not sure if N can ever be greater than M. I'm pretty sure the progress should always be set at this point (it happens early on in the wait container with monitorProgress).
The IsValid logic is in != "" && in.N() >= 0 && in.N() <= in.M() && in.M() > 0, where Progress is in N/M format: N is the number of tasks completed and M is the total number of tasks.
If there are no tasks and no outputs, labels/annotations for completion won't be set because no task result will be created. The update for finalization would then fail.
At this point, we're requiring a task result for all wait containers to track completion.
Does this make sense?
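For reference, here is a small self-contained sketch of the Progress type as described: only the quoted IsValid expression comes from the discussion above; the parts parsing helper and the example values are assumptions for illustration.

```go
package main

import "fmt"

// Progress is an "N/M" string: N tasks completed out of M total.
type Progress string

// parts is an assumed helper for this sketch; real parsing may differ.
func (in Progress) parts() (n, m int64) {
	fmt.Sscanf(string(in), "%d/%d", &n, &m)
	return n, m
}

func (in Progress) N() int64 { n, _ := in.parts(); return n }
func (in Progress) M() int64 { _, m := in.parts(); return m }

// IsValid mirrors the expression quoted above.
func (in Progress) IsValid() bool {
	return in != "" && in.N() >= 0 && in.N() <= in.M() && in.M() > 0
}

func main() {
	fmt.Println(Progress("0/0").IsValid()) // false: no tasks, M == 0
	fmt.Println(Progress("1/2").IsValid()) // true
}
```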
Thanks for going through the logic. It does seem safe to me to remove this line. If you don't have Outputs and you don't have Tasks, then it seems like it would probably be fine to still call upsertTaskResult() below, right?
Yeah, it would just create a shell of a WorkflowTaskResult
workflow/controller/taskresult.go
Outdated
```go
		woc.log.Debugf("task results completed:\n%+v", woc.wf.Status.GetTaskResultsCompleted())
		woc.log.Debugf("task result completed len: %d", len(woc.wf.Status.GetTaskResultsCompleted()))
```
are these 2 debugging statements supposed to be within the for loop or outside?
Inside the loop gives a more fine-grained view into what's going on. It's probably sufficient to move this outside the loop though.
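Something like the following, assuming the surrounding loop in taskResultReconciliation (a sketch, not the exact code):

```go
	for _, result := range taskResults {
		// ... merge each WorkflowTaskResult into woc.wf.Status.Nodes ...
	}
	// Log the summary once, after the loop, instead of on every iteration.
	woc.log.Debugf("task results completed:\n%+v", woc.wf.Status.GetTaskResultsCompleted())
	woc.log.Debugf("task result completed len: %d", len(woc.wf.Status.GetTaskResultsCompleted()))
```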
workflow/controller/operator.go
Outdated
```diff
@@ -779,7 +787,8 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
 		woc.log.WithError(err).Warn("error updating taskset")
 	}
 
+	if woc.wf.Status.Phase.Completed() {
+		// Note that the woc.checkTaskResultsCompleted() resolves a ~1/30 garbage collection failure race condition.
```
can we say in the comment "make sure the TaskResults are incorporated into WorkflowStatus before we delete them"?
Thanks for adding all these tests!
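For illustration, the guarded deletion in persistUpdates() could then read roughly as below; the deleteTaskResults(ctx) call and the error handling are assumptions pieced together from the discussion above, not the exact code.

```go
	// Only delete TaskResults once the Workflow has completed AND every TaskResult
	// has been incorporated into the WorkflowStatus (this closes the ~1/30 ArtifactGC race).
	if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted() {
		// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
		if err := woc.deleteTaskResults(ctx); err != nil {
			woc.log.WithError(err).Warn("failed to delete task results")
		}
	}
```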
Signed-off-by: Garett MacGowan <[email protected]>
This is looking good! I think I can probably merge it on Monday! (I won't have time to work between now and Sunday and don't want to merge it in a hurry)
Signed-off-by: Garett MacGowan <[email protected]> Signed-off-by: Garett MacGowan <[email protected]>
…1879 (argoproj#11947)" This reverts commit c296cf2.
Fixes #11879
Motivation
Currently, stopped workflows will happily save artifacts to a storage bucket but the artifacts won't get garbage collected after workflow deletion. The changes make it so that a stopped workflow's artifacts are properly garbage collected upon workflow deletion.
Modifications
The operator.go call to woc.taskResultReconciliation() previously occurred after Artifact Garbage Collection (ArtifactGC) and an early return due to if woc.wf.Labels[common.LabelKeyCompleted] == "true", leading to a scenario in which garbage collection would fail upon workflow deletion. This in turn would cause workflow deletion to get blocked by the ArtifactGC finalizer. woc.taskResultReconciliation() is responsible for updating the node status in the workflow based on the artifact outputs saved during workflow execution with reportOutputs() in executor.go.
Verification
I tested the changes by adding an e2e test called TestStoppedWorkflow, which uses a custom workflow with the following logic. I made some additional minor tweaks to other files, most notably timeout = v - 30*time.Second + defaultTimeout in when.go, which would cause custom timeout settings to be incorrect when using the WaitForWorkflow() e2e testing utility.